Skip to content

BUG: groupby agg fails silently with mixed dtypes #43213

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 44 commits into from
Sep 29, 2021

Conversation

debnathshoham
Copy link
Member

@simonjayhawkins simonjayhawkins added Dtype Conversions Unexpected or buggy dtype conversions Groupby Regression Functionality that used to work in a prior pandas version labels Aug 25, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3.3 milestone Aug 25, 2021
).astype({("a", "j"): dtype, ("b", "j"): dtype})
result = df.groupby(level=1, axis=1).sum()
expected = DataFrame(
[[5, 7, 9], [5, 7, 9], [5, 7, 9]], columns=["i", "j", "k"], dtype=dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the expected also a mixed dtype dataframe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here the expected would be either entirely Int64 or int64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this test representative of the reported issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's the exact same example (mixed int one)

@jreback
Copy link
Contributor

jreback commented Aug 31, 2021

cc @rhshadrach @jbrockmendel for a look

@jreback
Copy link
Contributor

jreback commented Aug 31, 2021

ok 2 comments @debnathshoham pls ping when reasolved & green

@debnathshoham
Copy link
Member Author

@jreback added comments to both places + Green

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @debnathshoham I'm not sure that this should be backported as I don't understand the reasoning for the change in behavior. see #43213 (comment)

simplified code sample

import pandas as pd

print(pd.__version__)
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]] * 3)
df.columns = pd.MultiIndex.from_product([["a", "b"], ["i", "j", "k"]])
for col in df.columns:
    if col[-1] == "j":
        df[col] = df[col].astype("Int64")
print("mixed int:")
result = df.groupby(level=1, axis=1).sum()
print(result)
print(result.dtypes)

on 1.2.5

1.2.5
mixed int:
   i    j  k
0  5  7.0  9
1  5  7.0  9
2  5  7.0  9
i      int64
j    float64
k      int64
dtype: object

this PR

1.4.0.dev0+537.g8d1bfb1cb9
mixed int:
   i  j  k
0  5  7  9
1  5  7  9
2  5  7  9
i    Int64
j    Int64
k    Int64
dtype: object

Can you explain why the dtypes of columns i and k are now Int64

@debnathshoham
Copy link
Member Author

sorry, I didn't follow it in the first comment.

Actually in master, I don't think dtype of the output is being maintained in any of these funcs
'count', 'sum', 'std', 'var', 'sem', 'mean', 'median', 'prod', 'min', 'max'

Is the expectation that the output will have the original dtypes, i.e. int64, Int64, int64 ?

@simonjayhawkins
Copy link
Member

Is the expectation that the output will have the original dtypes, i.e. int64, Int64, int64 ?

for the code sample, that would be reasonable, although returning float64 for the Int64 columns would probably also be acceptable as that was the return type before the regression.

How does the changes in PR relate to the changes in the PR that caused the regression #41706? (or to put it a different way, which change in #41706 caused the regression and how does this PR rectify that?)

@debnathshoham
Copy link
Member Author

Before #41706, the actual calculation was happening in agg_general in result = self.aggregate(lambda x: npfunc(x, axis=self.axis)).
Honestly, I didn't try to undo what that patch did, I went on to fix the inconsistency in the result coming post that patch.

@rhshadrach
Copy link
Member

Agreed @debnathshoham - on master, the dtypes in @simonjayhawkins' example come out at all float. What you have here is certainly an improvement over that, but it doesn't fully fix the regression. It's unclear to me if this patch is a good step in fully resolving the regression, or if fully solving it would make this patch obsolete. Will need to investigate more.

@jbrockmendel
Copy link
Member

Doesn't the PR that changes DataFrame.transpose fix this? That seems like a much better option than reverting #41706, which fixed a bunch of bugs.

Again shout it from the rooftops: this is caused by the lack of 2D EAs.

@debnathshoham
Copy link
Member Author

yes, but that was casting the the result df to the common_dtype (which wasn't matching 1.2.5 behaviour)

@jreback
Copy link
Contributor

jreback commented Sep 16, 2021

can you show what these test functions return in 1.2.5 vs what this PR gives. and show any difference in a comment.

@debnathshoham
Copy link
Member Author

added comments in the tests.
To summarise, for df with mixed dtypes like below.
For sum - Columns with Int64 was cast to float64 in 1.2.5, whereas in this PR it would cast to int64.
For std - ValueError : Length mismatch in 1.2.5. In this PR they come as float64.
Fot var - DataError : No numeric types to aggregate in 1.2.5. In this PR they come as float64

In [32]:     df = DataFrame(
    ...:         np.arange(12).reshape(3, 4),
    ...:         index=Index([0, 1, 0], name="y"),
    ...:         columns=Index([10, 20, 10, 20], name="x"),
    ...:         dtype="int64",
    ...:     ).astype({10: "Int64"})

In [33]: df.dtypes
Out[33]: 
x
10    Int64
20    int64
10    Int64
20    int64
dtype: object

@debnathshoham
Copy link
Member Author

@jreback @simonjayhawkins
wondering if you got a chance to look into this?
xref - #43501 (I think this is related)

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor request on the tests; otherwise this looks great to me. The code change is small and makes sense, and the dtypes are improved over 1.2.5.

Comment on lines 2519 to 2529
# 1.2.5: ValueError: Length mismatch: Expected axis
# has 0 elements, new values have 3 elements
("var", [4.5] * 3, int, {"i": float, "j": float, "k": float}),
# 1.2.5: DataError: No numeric types to aggregate
("sum", [5, 7, 9], "Int64", {"j": "int64"}),
# 1.2.5: j:float64
("std", [4.5 ** 0.5] * 3, "Int64", {"i": float, "j": float, "k": float}),
# 1.2.5: ValueError: Length mismatch: Expected axis
# has 0 elements, new values have 3 elements
("var", [4.5] * 3, "Int64", {"i": "float64", "j": "float64", "k": "float64"}),
# 1.2.5: DataError: No numeric types to aggregate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the comments here - appreciate the extra detail, but they shouldn't be part of a test; surfacing this in a comment on github directly would be great (and you already have another comment that does something similar - thanks for that as well).

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request & pls merge master ping on green.

@@ -2509,3 +2509,53 @@ def test_rolling_wrong_param_min_period():
result_error_msg = r"__init__\(\) got an unexpected keyword argument 'min_period'"
with pytest.raises(TypeError, match=result_error_msg):
test_df.groupby("name")["val"].rolling(window=2, min_period=1).sum()


@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move these to pandas/tests/groupby/aggregate/test_aggreagte.py (or maybe test_cython) see where similar are.

@debnathshoham
Copy link
Member Author

@jreback Green

@jreback jreback merged commit 2e5ed86 into pandas-dev:master Sep 29, 2021
@jreback
Copy link
Contributor

jreback commented Sep 29, 2021

thanks @debnathshoham

@jreback
Copy link
Contributor

jreback commented Sep 29, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app
Copy link

lumberbot-app bot commented Sep 29, 2021

Something went wrong ... Please have a look at my logs.

@debnathshoham debnathshoham deleted the gh43209 branch September 30, 2021 04:30
phofl pushed a commit that referenced this pull request Sep 30, 2021
gasparitiago pushed a commit to gasparitiago/pandas that referenced this pull request Oct 9, 2021
if self.axis:
obj = self._obj_with_exclusions.T
else:
obj = self._obj_with_exclusions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im a little late here, but could potentially use _get_data_to_aggregate here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby aggregation methods produce empty DataFrame with mixed DataTypes
5 participants