-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: groupby agg fails silently with mixed dtypes #43213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
debnathshoham
commented
Aug 25, 2021
- closes BUG: groupby aggregation methods produce empty DataFrame with mixed DataTypes #43209
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
pandas/tests/groupby/test_groupby.py
Outdated
).astype({("a", "j"): dtype, ("b", "j"): dtype}) | ||
result = df.groupby(level=1, axis=1).sum() | ||
expected = DataFrame( | ||
[[5, 7, 9], [5, 7, 9], [5, 7, 9]], columns=["i", "j", "k"], dtype=dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the expected also a mixed dtype dataframe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here the expected would be either entirely Int64 or int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this test representative of the reported issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's the exact same example (mixed int one)
cc @rhshadrach @jbrockmendel for a look |
ok 2 comments @debnathshoham pls ping when reasolved & green |
@jreback added comments to both places + Green |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @debnathshoham I'm not sure that this should be backported as I don't understand the reasoning for the change in behavior. see #43213 (comment)
simplified code sample
import pandas as pd
print(pd.__version__)
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]] * 3)
df.columns = pd.MultiIndex.from_product([["a", "b"], ["i", "j", "k"]])
for col in df.columns:
if col[-1] == "j":
df[col] = df[col].astype("Int64")
print("mixed int:")
result = df.groupby(level=1, axis=1).sum()
print(result)
print(result.dtypes)
on 1.2.5
1.2.5
mixed int:
i j k
0 5 7.0 9
1 5 7.0 9
2 5 7.0 9
i int64
j float64
k int64
dtype: object
this PR
1.4.0.dev0+537.g8d1bfb1cb9
mixed int:
i j k
0 5 7 9
1 5 7 9
2 5 7 9
i Int64
j Int64
k Int64
dtype: object
Can you explain why the dtypes of columns i and k are now Int64
sorry, I didn't follow it in the first comment. Actually in master, I don't think dtype of the output is being maintained in any of these funcs Is the expectation that the output will have the original dtypes, i.e. |
for the code sample, that would be reasonable, although returning How does the changes in PR relate to the changes in the PR that caused the regression #41706? (or to put it a different way, which change in #41706 caused the regression and how does this PR rectify that?) |
Before #41706, the actual calculation was happening in |
Agreed @debnathshoham - on master, the dtypes in @simonjayhawkins' example come out at all float. What you have here is certainly an improvement over that, but it doesn't fully fix the regression. It's unclear to me if this patch is a good step in fully resolving the regression, or if fully solving it would make this patch obsolete. Will need to investigate more. |
Doesn't the PR that changes DataFrame.transpose fix this? That seems like a much better option than reverting #41706, which fixed a bunch of bugs. Again shout it from the rooftops: this is caused by the lack of 2D EAs. |
yes, but that was casting the the result df to the |
can you show what these test functions return in 1.2.5 vs what this PR gives. and show any difference in a comment. |
added comments in the tests.
|
@jreback @simonjayhawkins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor request on the tests; otherwise this looks great to me. The code change is small and makes sense, and the dtypes are improved over 1.2.5.
pandas/tests/groupby/test_groupby.py
Outdated
# 1.2.5: ValueError: Length mismatch: Expected axis | ||
# has 0 elements, new values have 3 elements | ||
("var", [4.5] * 3, int, {"i": float, "j": float, "k": float}), | ||
# 1.2.5: DataError: No numeric types to aggregate | ||
("sum", [5, 7, 9], "Int64", {"j": "int64"}), | ||
# 1.2.5: j:float64 | ||
("std", [4.5 ** 0.5] * 3, "Int64", {"i": float, "j": float, "k": float}), | ||
# 1.2.5: ValueError: Length mismatch: Expected axis | ||
# has 0 elements, new values have 3 elements | ||
("var", [4.5] * 3, "Int64", {"i": "float64", "j": "float64", "k": "float64"}), | ||
# 1.2.5: DataError: No numeric types to aggregate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove the comments here - appreciate the extra detail, but they shouldn't be part of a test; surfacing this in a comment on github directly would be great (and you already have another comment that does something similar - thanks for that as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
request & pls merge master ping on green.
pandas/tests/groupby/test_groupby.py
Outdated
@@ -2509,3 +2509,53 @@ def test_rolling_wrong_param_min_period(): | |||
result_error_msg = r"__init__\(\) got an unexpected keyword argument 'min_period'" | |||
with pytest.raises(TypeError, match=result_error_msg): | |||
test_df.groupby("name")["val"].rolling(window=2, min_period=1).sum() | |||
|
|||
|
|||
@pytest.mark.parametrize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move these to pandas/tests/groupby/aggregate/test_aggreagte.py (or maybe test_cython) see where similar are.
@jreback Green |
thanks @debnathshoham |
@meeseeksdev backport 1.3.x |
Something went wrong ... Please have a look at my logs. |
…#43808) Co-authored-by: Shoham Debnath <[email protected]>
if self.axis: | ||
obj = self._obj_with_exclusions.T | ||
else: | ||
obj = self._obj_with_exclusions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im a little late here, but could potentially use _get_data_to_aggregate here