Groupby.mode() - feature request #19254

Open
j-musial opened this issue Jan 15, 2018 · 17 comments
@j-musial

Feature Request

Hi,

I use pandas a lot in my projects and I got stuck on the problem of running the "mode" function (most common element) on a huge groupby object. There are a few solutions available using aggregate and scipy.stats.mode, but they are unbearably slow in comparison to e.g. groupby.mean(). Thus, I would like to make a feature request to add a cythonized version of a groupby.mode() operator.

Thanks in advance!

Jan Musial
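For reference, a minimal sketch of the kind of workaround being described, on hypothetical 'key'/'value' columns: the Python-level lambda aggregation is the slow path, while a cythonized reduction like mean() never drops back into Python per group.

```python
import numpy as np
import pandas as pd

# Hypothetical data: many groups, integer values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 100, size=10_000),
    "value": rng.integers(0, 50, size=10_000),
})

# The slow workaround: a Python-level lambda invoked once per group.
modes = df.groupby("key")["value"].agg(lambda s: s.mode().iat[0])

# A cythonized reduction such as mean() is far faster on the same data.
means = df.groupby("key")["value"].mean()
```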

@chris-b1
Contributor

xref #19165 (master tracker)

A PR would be welcome! As you note, the key thing would be implementing a single-pass group mode function in Cython; you can look at the other group functions for examples.

def group_mean_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,

@chris-b1 chris-b1 added Groupby Performance Memory or execution speed performance Difficulty Intermediate labels Jan 15, 2018
@chris-b1 chris-b1 added this to the Next Major Release milestone Jan 15, 2018
@jreback
Contributor

jreback commented Jan 15, 2018

How is this different than .value_counts()?

@j-musial
Author

j-musial commented Jan 16, 2018 via email

@jreback
Contributor

jreback commented Jan 16, 2018

.value_counts() is a Series method. I guess mode would simply give back the max per group.

In [32]: pd.options.display.max_rows=12

In [33]: ngroups = 100; N = 100000; np.random.seed(1234)

In [34]: df = pd.DataFrame({'key': np.random.randint(0, ngroups, size=N), 'value': np.random.randint(0, 10000, size=N)})

In [35]: df.groupby('key').value.value_counts()
Out[35]: 
key  value
0    5799     3
     7358     3
     8860     3
     185      2
     583      2
     872      2
             ..
99   9904     1
     9916     1
     9922     1
     9932     1
     9935     1
     9936     1
Name: value, Length: 95112, dtype: int64
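Following that logic, the per-group mode can be read off the value_counts output, since counts are sorted descending within each group; a sketch on toy data (tie-breaking among equally frequent values is whatever order value_counts happens to emit), not an existing pandas API:

```python
import pandas as pd

df = pd.DataFrame({
    "key": [0, 0, 0, 1, 1, 1],
    "value": [5, 5, 7, 8, 9, 9],
})

# Counts are sorted descending within each group by default.
counts = df.groupby("key")["value"].value_counts()

# Take the first (largest-count) row of each group: that value is the mode.
top = counts.groupby(level="key").head(1)
mode_per_group = pd.Series(
    top.index.get_level_values("value"),
    index=top.index.get_level_values("key"),
    name="mode",
)
```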

@j-musial
Author

j-musial commented Jan 16, 2018 via email

@bfarrer

bfarrer commented Feb 22, 2019

I have a DataFrame that includes a column of cities, and I want to know the most frequent city on the list. I'm pretty new to Python and pandas, so maybe there's an easy alternative, but I'm not aware of one.

@BrittonWinterrose

I am interested in this feature as well. +1

@rogerioluizsi

Me too.

@Arregator

I can't find a good solution for the problem of filling missing values in my dataset with the most frequent value in a group (NOTE: not the mean, but the most frequent value).

I would have thought this is a very common intention for those handling missing values.

I have a dataset:

In [425]: df
Out[425]:
      brand     model fuelType  engineDisplacement
0      audi      a100   petrol              2000.0
1       bmw  3-series   diesel              2000.0
2  mercedes   e-class   petrol              3000.0
3  mercedes   e-class   petrol                 NaN
4    nissan      leaf  electro                 NaN
5     tesla   model x  electro                 NaN
6  mercedes   e-class   petrol              2000.0
7  mercedes   e-class   petrol              2000.0
8  mercedes   e-class   petrol              2000.0

and I want the NaN for my mercedes e-class petrol row to be filled with 2000, the most frequent value in the brand-model-fuelType group. I tried something like:

df.groupby(['brand', 'model', 'fuelType'])['engineDisplacement'].transform(
    lambda x: x.fillna(x.mode())
)

In [427]: df.groupby(['brand', 'model', 'fuelType'])['engineDisplacement'].transform(
     ...:     lambda x: x.fillna(x.mode())
     ...:     )
Out[427]:
0    2000.0
1    2000.0
2    3000.0
3       NaN
4       NaN
5       NaN
6    2000.0
7    2000.0
8    2000.0
Name: engineDisplacement, dtype: float64

As you can see, it does not fill the NaN with 2000 as I would expect. :(

Again, it is important to have the most frequent value, because in many, many cases we have to deal with categorical values, not numeric ones, so we need this feature badly.
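For what it's worth, the transform above fails because x.mode() returns a Series indexed 0..n-1 and fillna aligns it by label, so the fill never reaches the NaN positions. Filling with the scalar mode instead works; a workaround sketch (guarding against all-NaN groups, where mode() is empty), not a pandas feature:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "brand": ["audi", "bmw", "mercedes", "mercedes", "nissan",
              "tesla", "mercedes", "mercedes", "mercedes"],
    "model": ["a100", "3-series", "e-class", "e-class", "leaf",
              "model x", "e-class", "e-class", "e-class"],
    "fuelType": ["petrol", "diesel", "petrol", "petrol", "electro",
                 "electro", "petrol", "petrol", "petrol"],
    "engineDisplacement": [2000.0, 2000.0, 3000.0, np.nan, np.nan,
                           np.nan, 2000.0, 2000.0, 2000.0],
})

# Fill with the scalar per-group mode; leave all-NaN groups untouched.
filled = df.groupby(["brand", "model", "fuelType"])["engineDisplacement"].transform(
    lambda x: x.fillna(x.mode().iat[0]) if x.notna().any() else x
)
```

Row 3 (mercedes e-class petrol) now gets 2000.0, while the all-NaN electro groups stay NaN.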

@kashumi-m

+1 for this

@lithomas1
Member

take

@arw2019
Member

arw2019 commented Feb 10, 2021

+1 - had someone ask about this

@lithomas1 lithomas1 removed their assignment Feb 13, 2021
@lithomas1
Member

lithomas1 commented Feb 13, 2021

Unassigning myself as I don't have time for this.

@rhshadrach
Member

rhshadrach commented Feb 13, 2021

Edit: Replaced implementation with one that is more efficient on both few categorical values (3 values, ~20% faster) and many categorical values (20k values, ~5x faster).

def gb_mode(df, keys, column):
    return (
        df.groupby(keys + [column], sort=False)
        .size()
        .sort_values(ascending=False, kind='mergesort')
        .reset_index(column)
        .groupby(keys)
        [column]
        .head(1)
        .sort_index()
    )

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [3, 4, 3, 4, 3, 4]})
print(gb_mode(df, ['a'], 'b'))

produces

a
1    3
2    4
dtype: int64

Seems to have decent performance, at least when the categorical ('b', here) has few values, but still +1 on adding a cythonized mode.

@lithomas1
Member

I think there needs to be a discussion on the API for mode before we proceed with anything. The question being: what happens if the data is multimodal? (Should we raise a warning, return the last mode, return the smallest mode?) Also, I think there may be complexity around extension types when implementing in Cython.

I think I was able to hack up something by reintroducing the tempita template for groupby and modifying _get_cythonized_result (which iterates over columns in Python?), but I'm not sure if this is the right result.

@lithomas1 lithomas1 added API Design Needs Discussion Requires discussion from core team before further action labels Feb 13, 2021
@lithomas1 lithomas1 self-assigned this Feb 17, 2021
@trevor-pope

"What happens if multimodal?"

Why not all of them? It could just be an argument to the function.

keep='raise' could raise a warning, keep='smallest' or keep='largest' returns the smallest/largest, etc.

Something like df.groupby('col').mode(keep='all') would give all modes as a list (if a category is multimodal, thus making the resulting dtype object). This might run into efficiency concerns, however. I've personally had this use case (getting all modes), but I am not sure how necessary it is to support when you could get by using .value_counts(), albeit with a bit more work and computation.

I could try to implement this, but I am not sure where to do it. Is the tempita template still the way to go?
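As a sketch of what keep='all' semantics might look like (the keyword is only a proposal, not an existing pandas API), the list-of-modes result can be emulated today with apply, since Series.mode() already returns every tied value sorted ascending:

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 1, 2, 2, 2], "val": [3, 4, 5, 5, 6]})

# Collect the full set of modes per group, mimicking a hypothetical keep='all';
# multimodal groups yield a list of all tied values.
all_modes = df.groupby("col")["val"].apply(lambda s: s.mode().tolist())
```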

@mroeschke mroeschke added Enhancement and removed API Design Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance labels Jun 12, 2021
@wcheek

wcheek commented Jan 21, 2022

Hoping for this feature.
