Performance of pandas.algos.groupby_int64 #14293
Comments
Can you show a particular example that you have been timing (so we're all on the same page)? |
Those The right way to do this that avoids the GIL (assuming that
For example, if the labels are:
Then you factorize (which will cause GIL contention if object dtype) to get
Stable argsort (either by counting sort or mergesort) yields
Now, you iterate through this array to delimit each contiguous group.
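The three steps above (factorize, stable argsort, delimit contiguous groups) can be sketched roughly as follows. This is only an illustration in NumPy/pandas, not the code pandas uses internally: the function name group_positions and the sample labels are made up, and the final Python loop over group boundaries would still hold the GIL.

```python
import numpy as np
import pandas as pd


def group_positions(labels):
    """Map each distinct label to the array of positions where it occurs.

    Hypothetical helper for illustration only.
    """
    # Step 1: factorize -> integer codes plus the array of unique labels.
    # Missing values receive the code -1.
    codes, uniques = pd.factorize(labels)

    # Step 2: a stable argsort keeps positions in order within each group.
    order = np.argsort(codes, kind="stable")
    sorted_codes = codes[order]
    if len(sorted_codes) == 0:
        return {}

    # Step 3: a group boundary is wherever the sorted code changes.
    boundaries = np.flatnonzero(sorted_codes[1:] != sorted_codes[:-1]) + 1
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [len(sorted_codes)]))

    result = {}
    for start, end in zip(starts, ends):
        code = sorted_codes[start]
        if code == -1:
            # Skip missing labels, as groupby does by default.
            continue
        result[uniques[code]] = order[start:end]
    return result


# Made-up labels, object dtype as in the string case discussed above.
example_labels = np.array(["b", "a", "b", "c", "a"], dtype=object)
example_groups = group_positions(example_labels)
```

Each value in example_groups is a slice of the single order array, so no per-group Python list has to be grown element by element.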
|
Note that in pandas 2.0, factorizing strings (assuming we implement https://pydata.github.io/pandas-design/strings.html) will not require the GIL -> multicore happiness. |
import numpy as np
import pandas as pd
s = pd.Series(np.random.randint(0, 100, size=2**24))
s.groupby(s).groups
|
@wesm FWIW I only care about |
master
patch
|
I had added a special case for categorical grouping because a couple of minor tests fail with this (edge case with all nan categories), but no big deal to fix. This still hits a dictionary encoding path (in cython), but could have the GIL released for part of it. |
you should get gil releasing in the factorize and now the groupby_indices (these are about 2/3 of the time), rest is python-ish |
That performance gain would definitely resolve my immediate needs and likely move Pandas well away from being a bottleneck. |
I pushed it up: https://github.com/jreback/pandas/tree/groupby (as I said, running some perf numbers and a couple of edge cases), but give it a go
|
remove pandas.core.groupby._groupby_indices to use algos.groupsort_indexer
add Categorical._reverse_indexer to facilitate
closes pandas-dev#14293
For dask.dataframe shuffle operations (groupby.apply, merge), when running with multiple threads per process, I sometimes find my computations dominated by pandas.algos.groupby_int64. Looking at the source code for this, it looks like it's using dynamic pure Python objects from Cython. I'm curious if there are ways to accelerate this function, particularly in multi-threaded situations (releasing the GIL).

One solution that comes to mind would be to do a single pass over labels, pre-compute the length of each members list in results, and then pre-allocate these as arrays. This might allow better GIL-releasing behavior.

Thoughts?
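A rough sketch of the count-then-preallocate idea, assuming the labels are already non-negative integer codes in the range [0, ngroups). The function name preallocated_group_positions is hypothetical, and the explicit loop is written in Python only for readability; the point is that it touches nothing but fixed-size integer arrays, which is the kind of loop that could run without the GIL in Cython.

```python
import numpy as np


def preallocated_group_positions(codes, ngroups):
    """Counting-sort style grouping into one preallocated array.

    Hypothetical helper; assumes 0 <= codes[i] < ngroups for every i.
    Returns (positions, offsets): the members of group g are
    positions[offsets[g]:offsets[g + 1]].
    """
    codes = np.asarray(codes, dtype=np.intp)

    # Pass 1: count the members of each group.
    counts = np.bincount(codes, minlength=ngroups)

    # Turn counts into start offsets into a single output array.
    offsets = np.zeros(ngroups + 1, dtype=np.intp)
    offsets[1:] = np.cumsum(counts)

    # Pass 2: write each position into its group's preallocated slot.
    positions = np.empty(len(codes), dtype=np.intp)
    next_slot = offsets[:-1].copy()
    for i in range(len(codes)):
        g = codes[i]
        positions[next_slot[g]] = i
        next_slot[g] += 1
    return positions, offsets


# Made-up codes for three groups.
sample_positions, sample_offsets = preallocated_group_positions([0, 1, 0, 2, 1], 3)
```

Because every group's size is known before the second pass, no list ever needs to grow, and the per-group arrays are just slices of positions.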