-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[WIP] ENH: Groupby mode #39867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] ENH: Groupby mode #39867
Conversation
Thanks @lithomas1! Can you include the frame used for the time test? If the values don't have high cardinality (> 500 distinct values?), can you also test with one that does? My expectation is that the cython implementation performs much better the larger the distinct values are. |
@@ -1578,6 +1578,35 @@ def median(self, numeric_only=True): | |||
numeric_only=numeric_only, | |||
) | |||
|
|||
@final |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this not just value counts(sort=True, dropna=True)[0]? why are we adding a whole bunch of cython code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well value_counts is about twice as slow as my current implementation(takes same time for 1 column as my implementation does for 2), and all the other methods have corresponding cython code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the code that can do groupby.mode
using value_counts
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - whoops, I was trying to use DataFrameGroupBy instead of SeriesGroupBy. Ignore the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well value_counts is about twice as slow as my current implementation(takes same time for 1 column as my implementation does for 2), and all the other methods have corresponding cython code.
can you show some timings. I am really hesistant to add this much code for a niche case.
if cython_dtype is None: | ||
cython_dtype = values.dtype | ||
# We also need to get the specific function for that dtype | ||
how += f"_{cython_dtype}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using cython fused types should make cython figure this out for us (also avoids need for tempita)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to use tempita unfortunately because of the hashtables have different functions names based on dtype.
"C": [2, 4, 3, 1, 2, 3, 4, 5, 2, 7, 9, 9], | ||
} | ||
df = DataFrame(data) | ||
df.drop(columns="B", inplace=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the expected result for column B
when doing this?
@@ -0,0 +1,177 @@ | |||
{{py: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To @jbrockmendel point a lot of this can be simplified if using fused types instead of a template; we've started to move to that in most other places in Cython
|
||
table = kh_init_{{ttype}}() | ||
{{if dtype != 'object'}} | ||
#TODO: Fix NOGIL later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The non-numeric code paths prevent the Gil from being released. At least until Cython 3 gets released we usually do something like this:
pandas/pandas/_libs/groupby.pyx
Line 928 in 859a787
if rank_t is object: |
Dont have time for this. |
Proof of concept for #19254
Is relatively fast compared to @rhshadrach method
API for what to do when multimodal needs discussion though.