Skip to content

BUG: Fix groupby sorting on ordered Categoricals (GH25871) #25908

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Apr 5, 2019

Conversation

kpflugshaupt
Copy link
Contributor

@kpflugshaupt kpflugshaupt commented Mar 28, 2019

As documented in #25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified.
Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order.
The fix is a modified subset of #25173, which fixes a related case, but has not been merged yet.

* BUG: Fix groupby on ordered Categoricals (GH25871)

As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. 
Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order.
The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet.

* BUG: Fix groupby on ordered Categoricals (GH25871)

* new test

* Fix groupby on ordered Categoricals (GH25871)

Testing all combinations of:
- ordered vs. unordered grouping column
- 'observed' True vs. False
- 'sort' True vs. False
In all cases, result group ordering must be correct. 
The test is built such that the result index labels are equal to aggregation results if all goes well (except for the one unobserved category)
@pep8speaks
Copy link

pep8speaks commented Mar 28, 2019

Hello @kpflugshaupt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-03 08:05:15 UTC

@kpflugshaupt
Copy link
Contributor Author

Fixed new PEP 8 issues, too

@kpflugshaupt kpflugshaupt reopened this Mar 28, 2019
@kpflugshaupt
Copy link
Contributor Author

Failing test reproduced here. Will investigate

This test had an adjustment for column order when 'observed=True' is set. This hid the fact that, with that parameter set, the data columns were not actually reordered -- it was just the column group labels (analogous to index labels in pandas-dev#25871), leaving the data columns in place and out of sync. (This was not visible as the data consisted only of ones).

I've made the test more sensitive (unsyncing of data columns will be caught now) and removed the special case for 'observed=True'. As there are no unobserved categories in this case, the result should not be influenced by this parameter.
@kpflugshaupt
Copy link
Contributor Author

Adapted the failing test, reopening

@kpflugshaupt kpflugshaupt reopened this Mar 28, 2019
@WillAyd
Copy link
Member

WillAyd commented Mar 28, 2019

@kpflugshaupt you don't need to close and reopen a PR for every commit (causes unnecessary messaging spam) - just push to the PR as you have new commits.

Is this a WIP or ready for review?

@kpflugshaupt
Copy link
Contributor Author

@WillAyd: Sorry for the spamming, I did not know. Will leave open in the future, thanks!
As soon as automatic checks come back OK, this is good for review.

@codecov
Copy link

codecov bot commented Mar 28, 2019

Codecov Report

Merging #25908 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25908      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         175      175              
  Lines       52550    52552       +2     
==========================================
- Hits        48266    48265       -1     
- Misses       4284     4287       +3
Flag Coverage Δ
#multiple 90.39% <100%> (ø) ⬆️
#single 41.89% <0%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/grouper.py 98.18% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.79% <0%> (-0.12%) ⬇️
pandas/core/groupby/categorical.py 100% <0%> (+4.54%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4814a28...c9e3883. Read the comment docs.

@gfyoung gfyoung added Bug Groupby Categorical Categorical Data Type labels Mar 28, 2019
@@ -347,7 +347,7 @@ Groupby/Resample/Rolling
- Bug in :func:`pandas.core.groupby.GroupBy.agg` when applying a aggregation function to timezone aware data (:issue:`23683`)
- Bug in :func:`pandas.core.groupby.GroupBy.first` and :func:`pandas.core.groupby.GroupBy.last` where timezone information would be dropped (:issue:`21603`)
- Ensured that ordering of outputs in ``groupby`` aggregation functions is consistent across all versions of Python (:issue:`25692`)

- Ensured that result group order is correct when grouping on an ordered Categorical and specifying ``observed=True`` (:issue:`25871`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use double back-ticks around Categorical

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@kpflugshaupt
Copy link
Contributor Author

The added test from #25167 fails. Will investigate.

@kpflugshaupt
Copy link
Contributor Author

Got the expected values wrong, don't know what I was thinking. Fixed now.

@kpflugshaupt
Copy link
Contributor Author

Windows checks were failing because 'int' type apparently means 'int32' on that planet. I learned something today.

@topper-123
Copy link
Contributor

Windows checks were failing because 'int' type apparently means 'int32' on that planet. I learned something today.

This is a 32-bit issue, not Windows as such. This is a bit annoying, but pandas uses np.int64 always in some cases and np.intp in other cases. In this case you should likely use np.intp.

In general, you should in pandas always use explicit types, so either np.int64 or np.intp, depening on needs. Using just int causes various issues unless you actually want a python int.

@kpflugshaupt
Copy link
Contributor Author

That’s good to know. As I‘m working on Mac and Linux only, I assumed everyone defaulted to int64.

@kpflugshaupt
Copy link
Contributor Author

That should be it -- all checks passing now. I've added #25167 to the solved issues in the whatsnew file.

@kpflugshaupt
Copy link
Contributor Author

Changed test as per @jreback 's comments.

@jreback jreback added this to the 0.25.0 milestone Apr 5, 2019
@jreback jreback merged commit 1a30601 into pandas-dev:master Apr 5, 2019
@jreback
Copy link
Contributor

jreback commented Apr 5, 2019

thanks @kpflugshaupt nice patch!

@kpflugshaupt
Copy link
Contributor Author

You're welcome, nice working with you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby aggregation on ordered Categorial with 'observed=True' breaks order
6 participants