BUG: Fix groupby sorting on ordered Categoricals (GH25871) #25908

kpflugshaupt · 2019-03-28T11:29:16Z

As documented in #25871, groupby() on an ordered Categorical breaks category order when 'observed=True' is specified.
Specifically, group labels are ordered by first occurrence (as for an unordered Categorical), while the grouped aggregation results keep the Categorical's order, so labels and values no longer line up.
The fix is a modified subset of #25173, which fixes a related case but has not been merged yet.
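
For readers who want to see the behaviour in question, here is a minimal sketch of the scenario described above. The column names and values are made up for illustration; only `groupby`, `observed` and `sort` come from the report. On a pandas release that contains this fix, the labels should follow the category order and each label should sit next to its own group's result.

```python
import pandas as pd

# Ordered Categorical with one unobserved category ("c").
# First occurrence order in the data is d, a, b.
label = pd.Categorical(
    ["d", "a", "b", "a", "d", "b"],
    categories=["a", "b", "c", "d"],
    ordered=True,
)

# Illustrative values: group "a" holds 10/11, "b" holds 20/21, "d" holds 40/41.
df = pd.DataFrame({"label": label, "val": [40, 10, 20, 11, 41, 21]})

# observed=True drops the unobserved category "c"; sort=True is the default.
result = df.groupby("label", observed=True)["val"].min()
print(result)
```

With the bug present, the index labels and the aggregated values could disagree; with the fix, the index reads a, b, d and the minima are 10, 20, 40.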

- closes #25871 (groupby aggregation on ordered Categorial with 'observed=True' breaks order), also #25167 (Groupby with observed=True doesn't sort)
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry

* BUG: Fix groupby on ordered Categoricals (GH25871)

  As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order. The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet.

* BUG: Fix groupby on ordered Categoricals (GH25871)

* new test

* Fix groupby on ordered Categoricals (GH25871)

  Testing all combinations of:
  - ordered vs. unordered grouping column
  - 'observed' True vs. False
  - 'sort' True vs. False

  In all cases, result group ordering must be correct. The test is built such that the result index labels are equal to aggregation results if all goes well (except for the one unobserved category).
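
The test idea in the last commit message can be sketched roughly as follows. This is an illustration written from the description above, not the test that was merged: the helper name `labels_match_values`, the category name `"missing"` and the sample data are assumptions. The trick is that the aggregated column carries the same strings as the grouping column, so any mismatch between labels and results shows up directly.

```python
from itertools import product

import pandas as pd


def labels_match_values(ordered, observed, sort):
    """Return True if every group label sits next to its own aggregation result."""
    # "missing" is a category that never occurs in the data (hypothetical name).
    label = pd.Categorical(
        ["d", "a", "b", "a", "d", "b"],
        categories=["a", "b", "missing", "d"],
        ordered=ordered,
    )
    # The value column repeats the labels, so first() per group should equal the label.
    df = pd.DataFrame({"label": label, "val": ["d", "a", "b", "a", "d", "b"]})

    result = df.groupby("label", observed=observed, sort=sort)["val"].first()

    labels = list(result.index)
    # With observed=False the unobserved category has no value (None/NaN);
    # map that placeholder back to the category name for the comparison.
    values = [v if isinstance(v, str) else "missing" for v in result.tolist()]
    return labels == values


# Check all eight combinations of (ordered, observed, sort).
checks = {combo: labels_match_values(*combo) for combo in product([True, False], repeat=3)}
for combo, ok in checks.items():
    print(combo, ok)
```

The sketch only checks that labels and values stay in sync; it does not pin down which order each combination should produce.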

pep8speaks · 2019-03-28T11:29:20Z

Hello @kpflugshaupt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-03 08:05:15 UTC

2nd shot...

kpflugshaupt · 2019-03-28T11:35:58Z

Fixed new PEP 8 issues, too

kpflugshaupt · 2019-03-28T12:49:28Z

Failing test reproduced here. Will investigate

This test had an adjustment for column order when 'observed=True' is set. This hid the fact that, with that parameter set, the data columns were not actually reordered: only the column group labels were (analogous to the index labels in pandas-dev#25871), leaving the data columns in place and out of sync. (This was not visible because the data consisted only of ones.) I've made the test more sensitive (unsyncing of data columns will now be caught) and removed the special case for 'observed=True'. As there are no unobserved categories in this case, the result should not be influenced by this parameter.

kpflugshaupt · 2019-03-28T15:29:54Z

Adapted the failing test, reopening

WillAyd · 2019-03-28T15:35:04Z

@kpflugshaupt you don't need to close and reopen a PR for every commit (causes unnecessary messaging spam) - just push to the PR as you have new commits.

Is this a WIP or ready for review?

kpflugshaupt · 2019-03-28T15:37:24Z

@WillAyd: Sorry for the spamming, I did not know. Will leave open in the future, thanks!
As soon as automatic checks come back OK, this is good for review.

codecov · 2019-03-28T17:39:43Z

Codecov Report

Merging #25908 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25908      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         175      175              
  Lines       52550    52552       +2     
==========================================
- Hits        48266    48265       -1     
- Misses       4284     4287       +3

Flag	Coverage Δ
#multiple	`90.39% <100%> (ø)`	⬆️
#single	`41.89% <0%> (-0.08%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/grouper.py	`98.18% <100%> (+0.01%)`	⬆️
pandas/io/gbq.py	`75% <0%> (-12.5%)`	⬇️
pandas/core/frame.py	`96.79% <0%> (-0.12%)`	⬇️
pandas/core/groupby/categorical.py	`100% <0%> (+4.54%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4814a28...c9e3883.

pandas/tests/groupby/test_categorical.py

jreback · 2019-03-28T20:47:43Z

doc/source/whatsnew/v0.25.0.rst

@@ -347,7 +347,7 @@ Groupby/Resample/Rolling
 - Bug in :func:`pandas.core.groupby.GroupBy.agg` when applying a aggregation function to timezone aware data (:issue:`23683`)
 - Bug in :func:`pandas.core.groupby.GroupBy.first` and :func:`pandas.core.groupby.GroupBy.last` where timezone information would be dropped (:issue:`21603`)
 - Ensured that ordering of outputs in ``groupby`` aggregation functions is consistent across all versions of Python (:issue:`25692`)
-
+- Ensured that result group order is correct when grouping on an ordered Categorical and specifying ``observed=True`` (:issue:`25871`)


Can you use double back-ticks around Categorical?

kpflugshaupt · 2019-03-29T12:36:04Z

The added test from #25167 fails. Will investigate.

kpflugshaupt · 2019-03-29T13:41:34Z

Got the expected values wrong, don't know what I was thinking. Fixed now.

kpflugshaupt · 2019-03-29T14:12:48Z

Windows checks were failing because 'int' type apparently means 'int32' on that planet. I learned something today.

topper-123 · 2019-03-29T17:37:16Z

> Windows checks were failing because 'int' type apparently means 'int32' on that planet. I learned something today.

This is a 32-bit issue, not Windows as such. It is a bit annoying, but pandas always uses np.int64 in some cases and np.intp in others. In this case you should likely use np.intp.

In general, in pandas you should always use explicit types, so either np.int64 or np.intp, depending on needs. Using plain int causes various issues unless you actually want a Python int.
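
As a small illustration of the advice above (not code from this PR, and the array contents are arbitrary), spelling out the dtype removes the dependence on what a bare `int` resolves to on a given platform:

```python
import struct

import numpy as np

# Explicit 64-bit integers: the same dtype on every platform.
expected = np.array([1, 2, 3], dtype=np.int64)

# np.intp is the integer type NumPy uses for indexing; its width follows the platform.
indexer = np.array([0, 2, 1], dtype=np.intp)

# Pointer width of the running interpreter, in bits (32 or 64).
pointer_bits = struct.calcsize("P") * 8

print(expected.dtype, indexer.dtype, pointer_bits)
```

Which of the two fits a given expected value depends on what the code under test produces, so treat this as a sketch of the distinction rather than a rule for every test.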

kpflugshaupt · 2019-03-29T18:48:05Z

That’s good to know. As I’m working on Mac and Linux only, I assumed everyone defaulted to int64.

pandas/tests/groupby/test_categorical.py

kpflugshaupt · 2019-03-30T18:53:42Z

That should be it -- all checks passing now. I've added #25167 to the solved issues in the whatsnew file.

Added a generic assert with a custom message to make problem more obvious.

kpflugshaupt · 2019-04-03T08:33:26Z

Changed test as per @jreback's comments.

jreback · 2019-04-05T00:27:17Z

thanks @kpflugshaupt nice patch!

kpflugshaupt · 2019-04-05T09:33:48Z

You're welcome, nice working with you.

kpflugshaupt added 2 commits March 28, 2019 11:55

BUG: GH25871 -- fix PEP 8 issues on test source

50e7d64

kpflugshaupt closed this Mar 28, 2019

BUG: GH25871 -- fix PEP 8 issues on test source

0aaa347

2nd shot...

kpflugshaupt reopened this Mar 28, 2019

kpflugshaupt closed this Mar 28, 2019

kpflugshaupt reopened this Mar 28, 2019

gfyoung added the Bug, Groupby, and Categorical (Categorical Data Type) labels Mar 28, 2019

gfyoung reviewed Mar 28, 2019

pandas/tests/groupby/test_categorical.py

jreback requested changes Mar 28, 2019

kpflugshaupt added 2 commits March 29, 2019 09:24

Fix whatsnew file formatting

2abf281

Extend unit test with code sample from pandas-dev#25167

b8b2011

Fix test: expected values

5200544

Fix test: expected values dtype (Windows takes 'int' as 'int32', instead of 'int64'), also PEP 8 fixes

916f3eb

jreback requested changes Mar 30, 2019

pandas/tests/groupby/test_categorical.py (outdated)

pandas/tests/groupby/test_categorical.py (outdated)

Added pandas-dev#25167 to whatsnew file as resolved

df8a995

kpflugshaupt added 3 commits April 3, 2019 09:57

Merge branch 'master' into master

3f85534

Improve test reporting

21ff12c

Added a generic assert with a custom message to make problem more obvious.

Merge branch 'master' of https://github.com/kpflugshaupt/pandas

c9e3883

jreback added this to the 0.25.0 milestone Apr 5, 2019

jreback approved these changes Apr 5, 2019

jreback merged commit 1a30601 into pandas-dev:master Apr 5, 2019

jreback mentioned this pull request Apr 5, 2019

Groupby with observed=True doesn't sort #25167

Closed
