[WIP] ENH: Groupby mode #39867

lithomas1 · 2021-02-17T17:57:51Z

closes Groupby.mode() - feature request #19254
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Proof of concept for #19254

This is relatively fast compared to @rhshadrach's method:

In [4]: %timeit df.groupby("a").mode()
2.52 ms ± 415 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit df.groupby("a").mode()
1.99 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit gb_mode(df, ['a'], 'b')
3.9 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit gb_mode(df, ['a'], 'b')
3.8 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit gb_mode(df, ['a'], 'b')
3.89 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The API for handling multimodal groups still needs discussion, though.
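To make the multimodal question concrete, here is a minimal sketch using only existing pandas calls (not the method proposed in this PR). The frame and the two candidate behaviours are illustrative assumptions, not decisions taken in this thread.

```python
import pandas as pd

# Illustrative frame (an assumption): both groups contain ties.
df = pd.DataFrame(
    {"a": ["x", "x", "x", "x", "y", "y"], "b": [1, 1, 2, 2, 3, 4]}
)

# Option 1: keep every tied value. Series.mode returns all modes, sorted.
modes_per_group = {key: grp.mode().tolist() for key, grp in df.groupby("a")["b"]}

# Option 2: reduce to one value per group by taking the first (smallest) mode.
first_mode = df.groupby("a")["b"].agg(lambda s: s.mode().iloc[0])

print(modes_per_group)
print(first_mode)
```

Option 1 yields a variable number of values per group, so it does not fit a plain reduction; option 2 fits but discards the ties. That trade-off is presumably what the API discussion would need to settle.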

rhshadrach · 2021-02-17T22:15:00Z

Thanks @lithomas1! Can you include the frame used for the timings? If the values don't have high cardinality (> 500 distinct values?), can you also test with one that does? My expectation is that the Cython implementation performs much better as the number of distinct values grows.
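The frame used for the timings is not shown in this thread. As a sketch of the kind of low- and high-cardinality inputs being asked for, something like the following could be used; the helper name `make_frame`, the row count, the group count, and the distinct-value counts are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def make_frame(n_rows, n_groups, n_distinct, seed=0):
    """Hypothetical helper: random group keys in "a", random values in "b"."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            "a": rng.integers(0, n_groups, size=n_rows),
            "b": rng.integers(0, n_distinct, size=n_rows),
        }
    )


# Low cardinality: at most 10 distinct values in "b".
low = make_frame(100_000, n_groups=100, n_distinct=10)

# High cardinality: well above the 500 distinct values mentioned above.
high = make_frame(100_000, n_groups=100, n_distinct=5_000)

print(low["b"].nunique(), high["b"].nunique())
```

Each frame can then be passed to `%timeit` with whichever implementation is being compared, for example `df.groupby("a").mode()` on this PR's branch.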

jreback · 2021-02-19T01:15:56Z

pandas/core/groupby/groupby.py

@@ -1578,6 +1578,35 @@ def median(self, numeric_only=True):
            numeric_only=numeric_only,
        )

+    @final


Is this not just value_counts(sort=True, dropna=True)[0]? Why are we adding a whole bunch of Cython code?

Well, value_counts is about twice as slow as my current implementation (it takes the same time for 1 column as my implementation does for 2), and all the other methods have corresponding Cython code.

What is the code that can do groupby.mode using value_counts?

Ah - whoops, I was trying to use DataFrameGroupBy instead of SeriesGroupBy. Ignore the above.

Well, value_counts is about twice as slow as my current implementation (it takes the same time for 1 column as my implementation does for 2), and all the other methods have corresponding Cython code.

Can you show some timings? I am really hesitant to add this much code for a niche case.
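For reference, one way the value_counts approach discussed in this thread could be written with existing pandas calls is sketched below. It assumes the intent is the most frequent value per group, so it reads the first index label of the sorted counts rather than indexing the counts with `[0]`. The frame is illustrative, and this is not necessarily the code either participant timed.

```python
import pandas as pd

# Illustrative frame (an assumption): each group has a single clear mode.
df = pd.DataFrame({"a": ["x", "x", "x", "y", "y"], "b": [1, 2, 2, 3, 3]})

# Per group: count the values (most frequent first, NaN dropped) and take
# the label of the first row. A group with no non-NaN values would raise
# IndexError here, so a real implementation would need to handle that case.
mode_via_counts = df.groupby("a")["b"].agg(
    lambda s: s.value_counts(sort=True, dropna=True).index[0]
)

print(mode_via_counts)
```

Because the lambda runs once per group in Python, this form is expected to be slower than a dedicated kernel on frames with many groups; how much slower is exactly what the requested timings would show.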

jbrockmendel · 2021-02-19T16:11:17Z

pandas/core/groupby/groupby.py

+            if cython_dtype is None:
+                cython_dtype = values.dtype
+                # We also need to get the specific function for that dtype
+                how += f"_{cython_dtype}"


Using Cython fused types should let Cython figure this out for us (and also avoids the need for tempita).

Unfortunately, I have to use tempita because the hashtables have different function names depending on the dtype.
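The diff above appends the dtype to `how` so that a dtype-specific function can be looked up by name. A pure-Python sketch of that name-based dispatch is shown below. The kernel names, the registry, and `get_kernel` are hypothetical stand-ins for illustration, not pandas internals; with fused types, Cython would select the specialization itself and this lookup would not be needed.

```python
from collections import Counter

import numpy as np


def group_mode_int64(values):
    """Hypothetical per-dtype kernel: most common value of an int64 array."""
    return Counter(values.tolist()).most_common(1)[0][0]


def group_mode_float64(values):
    """Hypothetical per-dtype kernel: most common value of a float64 array."""
    return Counter(values.tolist()).most_common(1)[0][0]


# Hypothetical registry standing in for the templated, per-dtype functions.
_KERNELS = {
    "group_mode_int64": group_mode_int64,
    "group_mode_float64": group_mode_float64,
}


def get_kernel(how, values):
    # Mirrors the diff: the dtype name is appended to `how` to pick a function.
    name = f"{how}_{values.dtype}"
    try:
        return _KERNELS[name]
    except KeyError:
        raise NotImplementedError(f"no kernel named {name}") from None


values = np.array([1, 2, 2], dtype="int64")
kernel = get_kernel("group_mode", values)
result = kernel(values)
print(kernel.__name__, result)
```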

WillAyd · 2021-02-26T17:46:25Z

pandas/tests/groupby/aggregate/test_cython.py

+        "C": [2, 4, 3, 1, 2, 3, 4, 5, 2, 7, 9, 9],
+    }
+    df = DataFrame(data)
+    df.drop(columns="B", inplace=True)


What is the expected result for column B when doing this?

WillAyd · 2021-02-26T17:46:52Z

pandas/_libs/groupby_mode_helper.pxi.in

@@ -0,0 +1,177 @@
+{{py:


To @jbrockmendel's point, a lot of this can be simplified by using fused types instead of a template; we've started moving to that in most other places in Cython.

WillAyd · 2021-02-26T17:48:51Z

pandas/_libs/groupby_mode_helper.pxi.in

+
+    table = kh_init_{{ttype}}()
+    {{if dtype != 'object'}}
+    #TODO: Fix NOGIL later


The non-numeric code paths prevent the GIL from being released. At least until Cython 3 is released, we usually do something like this:

pandas/pandas/_libs/groupby.pyx

Line 928 in 859a787

if rank_t is object:

lithomas1 · 2021-04-14T18:59:47Z

Don't have time for this.

lithomas1 added 4 commits February 17, 2021 09:37

Proof of concept

10d72e8

Fixed annoying mistake

f318f24

Merge branch 'master' of https://github.com/pandas-dev/pandas into gr…

290bea9

…oupby-mode

Maybe silence warning?

6418db7

lithomas1 added the Enhancement, Groupby, Reduction Operations (sum, mean, min, max, etc.), and Needs Discussion (requires discussion from core team before further action) labels Feb 17, 2021

lithomas1 added 2 commits February 17, 2021 12:02

Update allowlist

284e024

Fix tests

4ca6748

lithomas1 requested review from jreback and jbrockmendel February 18, 2021 03:48

jreback requested changes Feb 19, 2021

jbrockmendel reviewed Feb 19, 2021

WillAyd reviewed Feb 26, 2021

lithomas1 removed the Needs Discussion (requires discussion from core team before further action) label Apr 3, 2021

lithomas1 closed this Apr 14, 2021

lithomas1 commented Feb 17, 2021

rhshadrach commented Feb 17, 2021

jreback Feb 19, 2021

lithomas1 Feb 26, 2021

rhshadrach Feb 26, 2021

rhshadrach Feb 26, 2021

jreback Feb 27, 2021

jbrockmendel Feb 19, 2021

lithomas1 Feb 26, 2021 (edited)

WillAyd Feb 26, 2021

WillAyd Feb 26, 2021

WillAyd Feb 26, 2021

lithomas1 commented Apr 14, 2021
