-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: faster grouping #14294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: faster grouping #14294
Conversation
Does the performance regression pointed out in #14293 (comment) still hold here? |
so that was a bug, now fixed. this is size=2**21
These are faster for |
Current coverage is 85.26% (diff: 100%)@@ master #14294 diff @@
==========================================
Files 140 140
Lines 50593 50599 +6
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43137 43142 +5
- Misses 7456 7457 +1
Partials 0 0
|
5f82e6f
to
1ba34fb
Compare
these are 2*22 with 100, 10000 for small/large
|
any comments |
return a dictionary of groups -> indices | ||
""" | ||
if not is_categorical_dtype(values): | ||
values = Categorical(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something, but wouldn't it possible to re-use the possibly existing factorization here?
def _groupby_indices(codes : Grouping.labels, cats : Grouping.group_index)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are re-using the existing factorization (if its categorical already its unchanged).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant in the non Categorical case, if Grouping.labels
is already populated, could skip factorizing again in the Categorical?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I see you proposed something else. .group_index
is not defined except for a single Grouping
. If we had a MultiGrouping
, then yes I think that would work (right now that basically a list of Groupings
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see it now, I went one class too deep. I do think you could speed up the single grouping case here - https://github.com/jreback/pandas/blob/580237924022eb74575420ad4433952c8de318dd/pandas/core/groupby.py#L2345 - make it:
return self.index.groupby(Categorical.from_codes(self.labels, self.group_index))
seen[k] = loc + 1 | ||
loc = seen[k] | ||
vecs[k][loc] = i | ||
seen[k] = loc + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have groupsort_indexer
, does this do anything different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me see...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
result = _groupby_indices(to_groupby) | ||
|
||
# map to the label | ||
result = {k: self.take(v) for k, v in compat.iteritems(result)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there benefits to doing this lazily?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not lazy, so not sure what you mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just wondering whether producing a fully-boxed Index for each group up front has memory use or performance implications. This is something I guess we'll address in much more detail in pandas 2.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT, this is only used on grouped.groups
, so by-definition we need to box. (and that's the ONLY benchmark that changed with this PR). I though this would have broader implications (on the good side), but no-go.
roughly the same perf benefits here. with 5802379
|
remove pandas.core.groupby._groupby_indices to use algos.groupsort_indexer add Categorical._reverse_indexer to facilitate closes pandas-dev#14293
lgtm |
closes #14293