PERF: df.groupby(categorical, sort=False) #48976
@@ -75,21 +75,22 @@ def recode_for_groupby(
         return c, None

     # sort=False should order groups in as-encountered order (GH-8868)
-    cat = c.unique()
-
-    # See GH-38140 for block below
-    # exclude nan from indexer for categories
-    take_codes = cat.codes[cat.codes != -1]
-    if cat.ordered:
-        take_codes = np.sort(take_codes)
-    cat = cat.set_categories(cat.categories.take(take_codes))
-
-    # But for groupby to work, all categories should be present,
-    # including those missing from the data (GH-13179), which .unique()
-    # above dropped
-    cat = cat.add_categories(c.categories[~c.categories.isin(cat.categories)])
-
-    return c.reorder_categories(cat.categories), None
+    # xref GH:46909: Re-ordering codes faster than using (set|add|reorder)_categories
+    all_codes = np.arange(c.categories.nunique(), dtype=np.int8)
+    # GH 38140: exclude nan from indexer for categories
+    # TODO: Probably why dropa=False doesn't work GH 36327
+    unique_notnan_codes = unique1d(c.codes[c.codes != -1])
I'm working on a branch that fixes

Ah okay, thanks for the insights. I'll remove my comment referencing dropna=False then. I just found it odd that this generates
+    if c.ordered:
+        unique_notnan_codes = np.sort(unique_notnan_codes)
+    if len(all_codes) > len(unique_notnan_codes):
+        # GH 13179: All categories need to be present, even if missing from the data
+        missing_codes = np.setdiff1d(all_codes, unique_notnan_codes, assume_unique=True)
+        take_codes = np.concatenate((unique_notnan_codes, missing_codes))
+    else:
+        take_codes = unique_notnan_codes
+
+    return Categorical(c, c.unique().categories.take(take_codes)), None


 def recode_from_groupby(
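For readers who want to try the code-reordering idea outside of pandas internals, here is a minimal standalone sketch. It is an approximation, not the PR's implementation: `recode_unsorted` is a hypothetical name, `pd.unique` stands in for the internal `unique1d` helper, `c.categories` is used directly instead of `c.unique().categories`, and `all_codes` is left at NumPy's default integer dtype.

```python
import numpy as np
import pandas as pd


def recode_unsorted(c: pd.Categorical) -> pd.Categorical:
    """Hypothetical helper: reorder categories into as-encountered order."""
    # One code per category, using NumPy's default integer dtype
    all_codes = np.arange(len(c.categories))
    # Codes in order of first appearance, with the NaN sentinel (-1) excluded
    unique_notnan_codes = pd.unique(c.codes[c.codes != -1])
    if c.ordered:
        # Ordered categoricals keep their category order
        unique_notnan_codes = np.sort(unique_notnan_codes)
    if len(all_codes) > len(unique_notnan_codes):
        # Categories missing from the data are appended at the end
        missing_codes = np.setdiff1d(
            all_codes, unique_notnan_codes, assume_unique=True
        )
        take_codes = np.concatenate((unique_notnan_codes, missing_codes))
    else:
        take_codes = unique_notnan_codes
    new_categories = c.categories.take(take_codes)
    return pd.Categorical(c, categories=new_categories, ordered=c.ordered)


example = pd.Categorical(["b", "a", None, "b", "c"], categories=["a", "b", "c", "d"])
recoded = recode_unsorted(example)
ordered_example = pd.Categorical(["b", "a"], categories=["a", "b", "c"], ordered=True)
recoded_ordered = recode_unsorted(ordered_example)
```

In this sketch the observed categories come first in first-seen order, and the unobserved category `"d"` is kept at the end so that every category is still present for groupby.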
@mroeschke - why the specification of int8 here? This causes overflow when there are more than 256 categories. Found this via the failing asv in #49131 - if it's okay, going to remove the specification of dtype there.
I think I was trying to match the output of `Categorical.codes` here, which was int8, but here specifically I think ensuring the `dtype` is `int` should be sufficient.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to change it in #49131
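The dtype concern raised in this thread can be illustrated with a small sketch. The size `n_categories = 300` is an arbitrary value picked for the example, and the wrap-around is shown with an explicit `astype` cast rather than the `np.arange(..., dtype=np.int8)` call from the diff. Note that int8 is signed, so its largest representable value is 127.

```python
import numpy as np
import pandas as pd

n_categories = 300  # hypothetical size, chosen to exceed the int8 range

# Largest value a signed 8-bit integer can hold
int8_max = np.iinfo(np.int8).max

# Casting 0..299 down to int8 wraps around silently instead of raising
wrapped = np.arange(n_categories).astype(np.int8)

# Leaving the dtype unspecified keeps every code intact
safe = np.arange(n_categories)

# pandas widens the codes dtype as the number of categories grows
small = pd.Categorical(list("abc"))
large = pd.Categorical(np.arange(n_categories))

print(int8_max, wrapped[128], safe[-1])
print(small.codes.dtype, large.codes.dtype)
```

Since `Categorical.codes` is only int8 for small category counts, dropping the fixed dtype, as suggested above, avoids tying `all_codes` to that case.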