Skip to content

[WIP] ENH: Groupby mode #39867

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from
Closed

Conversation

lithomas1
Copy link
Member

Proof of concept for #19254

Is relatively fast compared to @rhshadrach method

In [4]: %timeit df.groupby("a").mode()
2.52 ms ± 415 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit df.groupby("a").mode()
1.99 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit gb_mode(df, ['a'], 'b')
3.9 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit gb_mode(df, ['a'], 'b')
3.8 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit gb_mode(df, ['a'], 'b')
3.89 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

API for what to do when multimodal needs discussion though.

@lithomas1 lithomas1 added Enhancement Groupby Reduction Operations sum, mean, min, max, etc. Needs Discussion Requires discussion from core team before further action labels Feb 17, 2021
@rhshadrach
Copy link
Member

Thanks @lithomas1! Can you include the frame used for the time test? If the values don't have high cardinality (> 500 distinct values?), can you also test with one that does? My expectation is that the cython implementation performs much better the larger the distinct values are.

@@ -1578,6 +1578,35 @@ def median(self, numeric_only=True):
numeric_only=numeric_only,
)

@final
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this not just value counts(sort=True, dropna=True)[0]? why are we adding a whole bunch of cython code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well value_counts is about twice as slow as my current implementation(takes same time for 1 column as my implementation does for 2), and all the other methods have corresponding cython code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the code that can do groupby.mode using value_counts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - whoops, I was trying to use DataFrameGroupBy instead of SeriesGroupBy. Ignore the above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well value_counts is about twice as slow as my current implementation(takes same time for 1 column as my implementation does for 2), and all the other methods have corresponding cython code.

can you show some timings. I am really hesistant to add this much code for a niche case.

if cython_dtype is None:
cython_dtype = values.dtype
# We also need to get the specific function for that dtype
how += f"_{cython_dtype}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using cython fused types should make cython figure this out for us (also avoids need for tempita)

Copy link
Member Author

@lithomas1 lithomas1 Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to use tempita unfortunately because of the hashtables have different functions names based on dtype.

"C": [2, 4, 3, 1, 2, 3, 4, 5, 2, 7, 9, 9],
}
df = DataFrame(data)
df.drop(columns="B", inplace=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the expected result for column B when doing this?

@@ -0,0 +1,177 @@
{{py:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To @jbrockmendel point a lot of this can be simplified if using fused types instead of a template; we've started to move to that in most other places in Cython


table = kh_init_{{ttype}}()
{{if dtype != 'object'}}
#TODO: Fix NOGIL later
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-numeric code paths prevent the Gil from being released. At least until Cython 3 gets released we usually do something like this:

if rank_t is object:

@lithomas1 lithomas1 removed the Needs Discussion Requires discussion from core team before further action label Apr 3, 2021
@lithomas1
Copy link
Member Author

Dont have time for this.

@lithomas1 lithomas1 closed this Apr 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Groupby.mode() - feature request
5 participants