BUG: groupby reorders categorical categories #49131

Merged
merged 12 commits into from
Oct 24, 2022

rhshadrach
Member

@rhshadrach rhshadrach commented Oct 16, 2022

Ref: #48749

Still needs more tests and a whatsnew

@rhshadrach rhshadrach added Groupby Categorical Categorical Data Type Bug labels Oct 16, 2022
@rhshadrach rhshadrach marked this pull request as ready for review October 18, 2022 04:14
@rhshadrach
Member Author

@mroeschke - this doesn't tackle the performance aspect of #48749, only the behavior. It's not clear to me, but it seems possible that something can still be done about performance, so I think we should leave #48749 open for now.

@mroeschke
Member

> I think we should leave #48749 open for now.

Agreed

Also looks like an ASV benchmark needs addressing: https://github.com/pandas-dev/pandas/actions/runs/3261927947/jobs/5357896059

@@ -782,7 +782,7 @@ def test_preserve_categories():
     # ordered=False
     df = DataFrame({"A": Categorical(list("ba"), categories=categories, ordered=False)})
     sort_index = CategoricalIndex(categories, categories, ordered=False, name="A")
-    nosort_index = CategoricalIndex(list("bac"), list("bac"), ordered=False, name="A")
+    nosort_index = CategoricalIndex(list("bac"), list("abc"), ordered=False, name="A")
Member

Could you add a GH reference by these changed tests?
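
For readers skimming the thread, the behavior the changed test pins down can be sketched roughly as follows. This is a minimal illustration, not the test itself: it assumes pandas 2.0 or later (the milestone this PR targets), takes `categories` to be `list("abc")`, and adds a hypothetical `value` column so there is something to aggregate. With `sort=False` the groups come out in order of appearance, followed by the unobserved category, while the categories of the resulting index keep their original order.

```python
import pandas as pd

# Assumed setup mirroring the test: categories declared as a, b, c,
# but the data only contains "b" then "a".
categories = list("abc")
df = pd.DataFrame(
    {
        "A": pd.Categorical(list("ba"), categories=categories, ordered=False),
        "value": [1, 2],  # hypothetical column, only here to have something to sum
    }
)

# sort=False keeps groups in order of first appearance;
# observed=False also keeps the unobserved category "c" in the result.
result = df.groupby("A", sort=False, observed=False)["value"].sum()

index_values = list(result.index)
index_categories = list(result.index.categories)

print(index_values)      # order of the groups in the result
print(index_categories)  # order of the categories on the index
```

Before this change, the categories on the result index were themselves reordered to match the group order, which is what the old `list("bac")` expectation in the test encoded.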

@rhshadrach
Member Author

rhshadrach commented Oct 20, 2022

Regarding the failing ASV, I'm not sure how to approach this. I can't reproduce it locally with `asv continuous -f 1.1 upstream/main HEAD --quick`, and the log doesn't give any indication of what's failing.

Edit: I can reproduce locally, I just missed it in the output:

       before           after         ratio
     [aa051b8f]       [1bde9f0f]
     <main>           <groupby_cat_order>
!     1.08±0.02ms           failed      n/a  groupby.Categories.time_groupby_extra_cat_nosort

@rhshadrach
Member Author

rhshadrach commented Oct 21, 2022

ASVs in groupby:

       before           after         ratio
     [890d0975]       [8166c13b]
     <groupby_cat_order~2^2>       <groupby_cat_order>
+         258±4μs      1.31±0.01ms     5.08  groupby.Categories.time_groupby_extra_cat_sort
+       271±0.5μs      1.35±0.01ms     4.98  groupby.Categories.time_groupby_ordered_sort
+       270±0.8μs      1.34±0.01ms     4.97  groupby.Categories.time_groupby_sort
+     1.08±0.02ms      2.34±0.01ms     2.18  groupby.Categories.time_groupby_extra_cat_nosort
+     1.27±0.01ms      2.50±0.01ms     1.97  groupby.Categories.time_groupby_nosort
+     1.44±0.01ms      2.50±0.07ms     1.73  groupby.Categories.time_groupby_ordered_nosort
+         105±5μs          118±2μs     1.12  groupby.GroupByMethods.time_dtype_as_field('int16', 'quantile', 'direct', 1)
-      12.0±0.1ms       10.8±0.3ms     0.90  groupby.AggEngine.time_dataframe_cython(True)

These benchmarks use data with length 100000 and 10000 categories.
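
As a quick sanity check on the `ratio` column, the slowdown for the `groupby.Categories` benchmarks can be recomputed from the rounded timings in the table. This is only a sketch: the numbers are copied by hand from the output above and converted to microseconds, so the recomputed ratios can differ from ASV's in the last digit.

```python
# (before, after) timings in microseconds, copied from the ASV table above.
timings_us = {
    "time_groupby_extra_cat_sort": (258.0, 1310.0),
    "time_groupby_ordered_sort": (271.0, 1350.0),
    "time_groupby_sort": (270.0, 1340.0),
    "time_groupby_extra_cat_nosort": (1080.0, 2340.0),
    "time_groupby_nosort": (1270.0, 2500.0),
    "time_groupby_ordered_nosort": (1440.0, 2500.0),
}


def slowdown(before_us: float, after_us: float) -> float:
    """Return after/before, i.e. how many times slower the new code is."""
    return after_us / before_us


ratios = {name: slowdown(b, a) for name, (b, a) in timings_us.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The sorted cases slow down by roughly 5x and the unsorted cases by roughly 2x.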

@mroeschke mroeschke added this to the 2.0 milestone Oct 24, 2022
@mroeschke mroeschke merged commit 1294b19 into pandas-dev:main Oct 24, 2022
@mroeschke
Member

Awesome, thanks @rhshadrach

@rhshadrach rhshadrach deleted the groupby_cat_order branch October 24, 2022 18:19
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* BUG: groupby reorders categorical categories

* Tests and whatsnew

* type-ignore

* GH#

* Add test

* Add TODO

* GH#

* fixups

* Revert test change; catch warnings
Labels
Bug Categorical Categorical Data Type Groupby
2 participants