-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix groupby sorting on ordered Categoricals (GH25871) #25908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* BUG: Fix groupby on ordered Categoricals (GH25871) As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order. The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet. * BUG: Fix groupby on ordered Categoricals (GH25871) * new test * Fix groupby on ordered Categoricals (GH25871) Testing all combinations of: - ordered vs. unordered grouping column - 'observed' True vs. False - 'sort' True vs. False In all cases, result group ordering must be correct. The test is built such that the result index labels are equal to aggregation results if all goes well (except for the one unobserved category)
Hello @kpflugshaupt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-04-03 08:05:15 UTC |
Fixed new PEP 8 issues, too |
Failing test reproduced here. Will investigate |
This test had an adjustment for column order when 'observed=True' is set. This hid the fact that, with that parameter set, the data columns were not actually reordered -- it was just the column group labels (analogous to index labels in pandas-dev#25871), leaving the data columns in place and out of sync. (This was not visible as the data consisted only of ones). I've made the test more sensitive (unsyncing of data columns will be caught now) and removed the special case for 'observed=True'. As there are no unobserved categories in this case, the result should not be influenced by this parameter.
Adapted the failing test, reopening |
@kpflugshaupt you don't need to close and reopen a PR for every commit (causes unnecessary messaging spam) - just push to the PR as you have new commits. Is this a WIP or ready for review? |
@WillAyd: Sorry for the spamming, I did not know. Will leave open in the future, thanks! |
Codecov Report
@@ Coverage Diff @@
## master #25908 +/- ##
==========================================
- Coverage 91.84% 91.84% -0.01%
==========================================
Files 175 175
Lines 52550 52552 +2
==========================================
- Hits 48266 48265 -1
- Misses 4284 4287 +3
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -347,7 +347,7 @@ Groupby/Resample/Rolling | |||
- Bug in :func:`pandas.core.groupby.GroupBy.agg` when applying a aggregation function to timezone aware data (:issue:`23683`) | |||
- Bug in :func:`pandas.core.groupby.GroupBy.first` and :func:`pandas.core.groupby.GroupBy.last` where timezone information would be dropped (:issue:`21603`) | |||
- Ensured that ordering of outputs in ``groupby`` aggregation functions is consistent across all versions of Python (:issue:`25692`) | |||
|
|||
- Ensured that result group order is correct when grouping on an ordered Categorical and specifying ``observed=True`` (:issue:`25871`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use double back-ticks around Categorical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
The added test from #25167 fails. Will investigate. |
Got the expected values wrong, don't know what I was thinking. Fixed now. |
…ead of 'int64'), also PEP 8 fixes
Windows checks were failing because 'int' type apparently means 'int32' on that planet. I learned something today. |
This is a 32-bit issue, not Windows as such. This is a bit annoying, but pandas uses In general, you should in pandas always use explicit types, so either |
That’s good to know. As I‘m working on Mac and Linux only, I assumed everyone defaulted to int64. |
That should be it -- all checks passing now. I've added #25167 to the solved issues in the whatsnew file. |
Added a generic assert with a custom message to make problem more obvious.
Changed test as per @jreback 's comments. |
thanks @kpflugshaupt nice patch! |
You're welcome, nice working with you. |
As documented in #25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified.
Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order.
The fix is a modified subset of #25173, which fixes a related case, but has not been merged yet.
git diff upstream/master -u -- "*.py" | flake8 --diff