-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix groupby observed=True when aggregating a column #24412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 7 commits
debb02a
f450dc2
3b50900
e6e9894
5181639
4c68d41
77cf3c6
572d9c3
8793995
b75bc7b
ef48de6
d87e159
b01d809
7a6898f
d584327
3e9178a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -860,3 +860,15 @@ def test_groupby_multiindex_categorical_datetime(): | |
expected = pd.DataFrame( | ||
{'values': [0, 4, 8, 3, 4, 5, 6, np.nan, 2]}, index=idx) | ||
assert_frame_equal(result, expected) | ||
|
||
|
||
def test_groupby_agg_observed_true_single_column(): | ||
# GH-23970 | ||
expected = pd.DataFrame({ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can create a test case where I originally had There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you parameterize over as_index=True/False There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was parameterising
I understand that the issue is with not using CategoricalIndex at MultiIndex[0] but I have no idea how to go about doing it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks a lot! |
||
'a': pd.Series([1, 1, 2], dtype='category'), | ||
'b': [1, 2, 2], 'x': [1, 2, 3]}) | ||
|
||
result = expected.groupby(['a', 'b'], | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Might be reading this wrong but do we need the non-categorical column in the grouper to emulate this behavior? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. This is because the bug generates a group that doesn't exist due to using some weird Cartesian product of the two columns used in the grouper. Therefore, while the only valid groupings are There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok thanks for the responses. In that case I think you can improve the whatsnew to say that the observed keyword was essentially being ignored when selecting one column as part of the GroupBy. It’s currently too vague IMO There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also used to be the case. But I changed it to be simpler on @mroeschke 's advice (in a previous resolved convo). I can change it back if needed but would just like a confirmation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Clarity is always welcome. But generally whatsnew entries should refer to externally facing methods and implications. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea the previous entry was arguably too specific mentioning the exact internal method where the problem arose. The current entry is too vague to be useful. If you can meet somewhere in the middle on that I think it again would be mentioning that the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Pushed it. Hopefully this is a better middle ground between the 2 previous approaches! |
||
as_index=False, observed=True)['x'].sum() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To further clarify, no |
||
|
||
tm.assert_frame_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is when its a categorical column; also this is only for
as_index=False
, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any operation over a selected column was effectively ignoring the
observed
parameter that was being passed to it. Withas_index=True
theobserved
value isn't used so the code behaves as expected.