Groupby.mode() - feature request #19254

Open
j-musial opened this issue Jan 15, 2018 · 17 comments
@j-musial

Feature Request

Hi,

I use pandas a lot in my projects and I got stuck on the problem of running the "mode" function (most common element) on a huge groupby object. There are a few solutions available using aggregate and scipy.stats.mode, but they are unbearably slow in comparison to e.g. groupby.mean(). Thus, I would like to make a feature request to add a cythonized version of a groupby.mode() operator.

Thanks in advance!

Jan Musial
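For reference, a minimal sketch of the kind of workaround being described, on hypothetical 'key'/'value' columns: the Python-level lambda aggregation is the slow path, while a cythonized reduction like mean() never drops back into Python per group.

```python
import numpy as np
import pandas as pd

# Hypothetical data: many groups, integer values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 100, size=10_000),
    "value": rng.integers(0, 50, size=10_000),
})

# The slow workaround: a Python-level lambda invoked once per group.
modes = df.groupby("key")["value"].agg(lambda s: s.mode().iat[0])

# A cythonized reduction such as mean() is far faster on the same data.
means = df.groupby("key")["value"].mean()
```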

@chris-b1
Contributor

xref #19165 (master tracker)

A PR would be welcome! As you note, the key thing would be implementing a single-pass group mode function in Cython; you can look at the other group functions for examples.

def group_mean_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,

@chris-b1 chris-b1 added Groupby Performance Memory or execution speed performance Difficulty Intermediate labels Jan 15, 2018
@chris-b1 chris-b1 added this to the Next Major Release milestone Jan 15, 2018
@jreback
Contributor

jreback commented Jan 15, 2018

How is this different than .value_counts()?

@j-musial
Author

j-musial commented Jan 16, 2018 via email

@jreback
Contributor

jreback commented Jan 16, 2018

.value_counts() is a Series method. I guess mode would simply give back the max per group.

In [32]: pd.options.display.max_rows=12

In [33]: ngroups = 100; N = 100000; np.random.seed(1234)

In [34]: df = pd.DataFrame({'key': np.random.randint(0, ngroups, size=N), 'value': np.random.randint(0, 10000, size=N)})

In [35]: df.groupby('key').value.value_counts()
Out[35]: 
key  value
0    5799     3
     7358     3
     8860     3
     185      2
     583      2
     872      2
             ..
99   9904     1
     9916     1
     9922     1
     9932     1
     9935     1
     9936     1
Name: value, Length: 95112, dtype: int64
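Following that logic, the per-group mode can be read off the value_counts output, since counts are sorted descending within each group; a sketch on toy data (tie-breaking among equally frequent values is whatever order value_counts happens to emit), not an existing pandas API:

```python
import pandas as pd

df = pd.DataFrame({
    "key": [0, 0, 0, 1, 1, 1],
    "value": [5, 5, 7, 8, 9, 9],
})

# Counts are sorted descending within each group by default.
counts = df.groupby("key")["value"].value_counts()

# Take the first (largest-count) row of each group: that value is the mode.
top = counts.groupby(level="key").head(1)
mode_per_group = pd.Series(
    top.index.get_level_values("value"),
    index=top.index.get_level_values("key"),
    name="mode",
)
```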

@j-musial
Author

j-musial commented Jan 16, 2018 via email

@bfarrer

bfarrer commented Feb 22, 2019

I have a DataFrame that includes a column of cities, and I want to know the most frequent city on the list. I'm pretty new to Python and pandas, so maybe there's an easy alternative, but I'm not aware of one.

@BrittonWinterrose

I am interested in this feature as well. +1

@rogerioluizsi

Me too.

@Arregator

I can't find a good solution for the problem of filling missing values in my dataset with the most frequent value in a group (NOTE: not the mean, but the most frequent value).

I would have thought this is a very common intention for those handling missing values.

I have a dataset:

In [425]: df
Out[425]:
      brand     model fuelType  engineDisplacement
0      audi      a100   petrol              2000.0
1       bmw  3-series   diesel              2000.0
2  mercedes   e-class   petrol              3000.0
3  mercedes   e-class   petrol                 NaN
4    nissan      leaf  electro                 NaN
5     tesla   model x  electro                 NaN
6  mercedes   e-class   petrol              2000.0
7  mercedes   e-class   petrol              2000.0
8  mercedes   e-class   petrol              2000.0

and I want the NaN for my mercedes e-class petrol row to be filled with 2000, the most frequent value in the brand-model-fuelType group. I tried something like:

df.groupby(['brand', 'model', 'fuelType'])['engineDisplacement'].transform(
    lambda x: x.fillna(x.mode())
)

In [427]: df.groupby(['brand', 'model', 'fuelType'])['engineDisplacement'].transform(
     ...:     lambda x: x.fillna(x.mode())
     ...:     )
Out[427]:
0    2000.0
1    2000.0
2    3000.0
3       NaN
4       NaN
5       NaN
6    2000.0
7    2000.0
8    2000.0
Name: engineDisplacement, dtype: float64

As you can see, it does not fill the NaN with 2000 as I would expect. :(

Again, it is important to have the most frequent value, because in many, many cases we have to deal with categorical values, not numeric ones, so we need this feature badly.
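For what it's worth, the transform above fails because x.mode() returns a Series indexed 0..n-1 and fillna aligns it by label, so the fill never reaches the NaN positions. Filling with the scalar mode instead works; a workaround sketch (guarding against all-NaN groups, where mode() is empty), not a pandas feature:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "brand": ["audi", "bmw", "mercedes", "mercedes", "nissan",
              "tesla", "mercedes", "mercedes", "mercedes"],
    "model": ["a100", "3-series", "e-class", "e-class", "leaf",
              "model x", "e-class", "e-class", "e-class"],
    "fuelType": ["petrol", "diesel", "petrol", "petrol", "electro",
                 "electro", "petrol", "petrol", "petrol"],
    "engineDisplacement": [2000.0, 2000.0, 3000.0, np.nan, np.nan,
                           np.nan, 2000.0, 2000.0, 2000.0],
})

# Fill with the scalar per-group mode; leave all-NaN groups untouched.
filled = df.groupby(["brand", "model", "fuelType"])["engineDisplacement"].transform(
    lambda x: x.fillna(x.mode().iat[0]) if x.notna().any() else x
)
```

Row 3 (mercedes e-class petrol) now gets 2000.0, while the all-NaN electro groups stay NaN.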

@kashumi-m

+1 for this

@lithomas1
Member

take

@arw2019
Member

arw2019 commented Feb 10, 2021

+1 - had someone ask about this

@lithomas1 lithomas1 removed their assignment Feb 13, 2021
@lithomas1
Member

lithomas1 commented Feb 13, 2021

Unassigning myself as I don't have time for this.

@rhshadrach
Member

rhshadrach commented Feb 13, 2021

Edit: Replaced implementation with one that is more efficient on both few categorical values (3 values, ~20% faster) and many categorical values (20k values, ~5x faster).

def gb_mode(df, keys, column):
    return (
        df.groupby(keys + [column], sort=False)
        .size()
        .sort_values(ascending=False, kind='mergesort')
        .reset_index(column)
        .groupby(keys)
        [column]
        .head(1)
        .sort_index()
    )

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [3, 4, 3, 4, 3, 4]})
print(gb_mode(df, ['a'], 'b'))

produces

a
1    3
2    4
dtype: int64

Seems to have decent performance, at least when the categorical ('b', here) has few values, but still +1 on adding a cythonized mode.

@lithomas1
Member

I think there needs to be a discussion on the API for mode before we proceed with anything. The question being: what happens if the data is multimodal? (Should we raise a warning, return the last mode, return the smallest mode?) Also, I think there may be complexity around extension types when implementing in Cython.

I think I was able to hack up something by reintroducing the tempita template for groupby and modifying _get_cythonized_result (which iterates over columns in Python?), but I'm not sure if this is the right result.

@lithomas1 lithomas1 added API Design Needs Discussion Requires discussion from core team before further action labels Feb 13, 2021
@lithomas1 lithomas1 self-assigned this Feb 17, 2021
@trevor-pope

"What happens if multimodal?"

Why not all of them? It could just be an argument to the function.

keep='raise' could raise a warning, keep='smallest' or keep='largest' returns the smallest/largest, etc.

Something like df.groupby('col').mode(keep='all') would give all modes as a list (if a category is multimodal, thus making the resulting dtype object). This might run into efficiency concerns, however. I've personally had this use case (getting all modes), but I am not sure how necessary it is to support when you could get by using .value_counts(), albeit with a bit more work and computation.

I could try to implement this, but I am not sure where to do it. Is the tempita template still the way to go?
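As a sketch of what keep='all' semantics might look like (the keyword is only a proposal, not an existing pandas API), the list-of-modes result can be emulated today with apply, since Series.mode() already returns every tied value sorted ascending:

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 1, 2, 2, 2], "val": [3, 4, 5, 5, 6]})

# Collect the full set of modes per group, mimicking a hypothetical keep='all';
# multimodal groups yield a list of all tied values.
all_modes = df.groupby("col")["val"].apply(lambda s: s.mode().tolist())
```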

@mroeschke mroeschke added Enhancement and removed API Design Needs Discussion Requires discussion from core team before further action Performance Memory or execution speed performance labels Jun 12, 2021
@wcheek

wcheek commented Jan 21, 2022

Hoping for this feature.
