Groupby.mode() - feature request #19254
xref #19165 (master tracker). A PR would be welcome! As you note, the key thing would be implementing a single-pass group mode function in Cython; you can look at the other group functions for examples: pandas/_libs/groupby_helper.pxi.in, line 213 at aa9e002.
How is this different than .value_counts()?
Hi Jeff,
As you can see below, .value_counts() is not available on a groupby object.
Furthermore, returning sorted values and counts within thousands or millions
of groups gives huge overhead, whereas all you want is the most frequent
value.
Best,
Jan
>>> pd.DataFrame({'a': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], 'b': [1, 1, 2, 3, 1, 2, 2, 3]}).groupby('a').value_counts()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pandas/core/groupby.py", line 529,
in __getattr__
(type(self).__name__, attr))
AttributeError: 'DataFrameGroupBy' object has no attribute 'value_counts'
>>> pd.__version__
'0.19.2'
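For completeness, a per-group mode can already be computed with agg and Series.mode — a sketch only, since it runs a Python-level lambda per group and so does not address the performance concern raised here; .iloc[0] picks the smallest mode when a group is multimodal:

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'b': [1, 1, 2, 3, 1, 2, 2, 3]})

# Series.mode() returns all modes sorted ascending; take the first
# to reduce each group to a single value.
result = df.groupby('a')['b'].agg(lambda s: s.mode().iloc[0])
print(result)  # a -> 1, b -> 2
```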
…On Tue, Jan 16, 2018 at 12:35 AM, Jeff Reback ***@***.***> wrote:
how is this different than .value_counts() ?
Thanks Jeff!
Unlike .value_counts(), .mode() would act like a reduction function,
returning just one value per group. Returning the entire histogram just to
get the most frequent value is a huge waste of computing resources.
Best,
Jan
…On Tue, Jan 16, 2018 at 12:14 PM, Jeff Reback ***@***.***> wrote:
.value_counts() is a series method. I guess mode would simply give back
the max per group.
In [32]: pd.options.display.max_rows=12
In [33]: ngroups = 100; N = 100000; np.random.seed(1234)
In [34]: df = pd.DataFrame({'key': np.random.randint(0, ngroups, size=N), 'value': np.random.randint(0, 10000, size=N)})
In [35]: df.groupby('key').value.value_counts()
Out[35]:
key value
0 5799 3
7358 3
8860 3
185 2
583 2
872 2
..
99 9904 1
9916 1
9922 1
9932 1
9935 1
9936 1
Name: value, Length: 95112, dtype: int64
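Jeff's per-group histogram above can be reduced to one row per group without a per-group Python loop. A sketch (ties are broken by whatever order value_counts happens to emit):

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
ngroups, N = 100, 100000
df = pd.DataFrame({'key': np.random.randint(0, ngroups, size=N),
                   'value': np.random.randint(0, 10000, size=N)})

# value_counts sorts counts descending within each key, so the first
# row of each key's block is that group's mode.
counts = df.groupby('key')['value'].value_counts()
modes = counts.groupby(level='key').head(1)
```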
Suppose I have a DataFrame that includes a column of cities and I want to know the most frequent city in the list. I'm pretty new to Python and pandas, so maybe there's an easy alternative, but I'm not aware of one.
I am interested in this feature as well. +1
Me too.
I can't find a good solution to the problem of filling missing values in my dataset with the most frequent value in a group (NOTE: not the mean, but the most frequent value). I would think this is a very common need when handling missing values. I have a dataset:

In [425]: df
Out[425]:
      brand     model fuelType  engineDisplacement
0      audi      a100   petrol              2000.0
1       bmw  3-series   diesel              2000.0
2  mercedes   e-class   petrol              3000.0
3  mercedes   e-class   petrol                 NaN
4    nissan      leaf  electro                 NaN
5     tesla   model x  electro                 NaN
6  mercedes   e-class   petrol              2000.0
7  mercedes   e-class   petrol              2000.0
8  mercedes   e-class   petrol              2000.0

and I want to fill the NaNs with the most frequent value of each group:

In [427]: df.groupby(['brand', 'model', 'fuelType'])['engineDisplacement'].transform(
    ...:     lambda x: x.fillna(x.mode())
    ...: )
Out[427]:
0    2000.0
1    2000.0
2    3000.0
3       NaN
4       NaN
5       NaN
6    2000.0
7    2000.0
8    2000.0
Name: engineDisplacement, dtype: float64

You see? It does not fill row 3, even though its group has an obvious mode. Again, it is important to get the most frequent value, because in many, many cases we have to deal with categorical values, not numeric ones, so we need this feature badly.
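The reason the transform above leaves row 3 as NaN: fillna with a Series aligns on the index, and x.mode() is indexed 0..k-1, which rarely matches the row labels inside a group. A working variant, as a sketch: extract a scalar, guard against all-NaN groups (where mode() is empty), and arbitrarily take the smallest mode:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'brand': ['audi', 'bmw', 'mercedes', 'mercedes', 'nissan',
              'tesla', 'mercedes', 'mercedes', 'mercedes'],
    'model': ['a100', '3-series', 'e-class', 'e-class', 'leaf',
              'model x', 'e-class', 'e-class', 'e-class'],
    'fuelType': ['petrol', 'diesel', 'petrol', 'petrol', 'electro',
                 'electro', 'petrol', 'petrol', 'petrol'],
    'engineDisplacement': [2000.0, 2000.0, 3000.0, np.nan, np.nan,
                           np.nan, 2000.0, 2000.0, 2000.0],
})

def fill_with_group_mode(x):
    # mode() is empty for all-NaN groups; leave those untouched.
    m = x.mode()
    return x.fillna(m.iloc[0]) if not m.empty else x

filled = (df.groupby(['brand', 'model', 'fuelType'])['engineDisplacement']
            .transform(fill_with_group_mode))
```

Row 3 gets 2000.0 from its group's mode; rows 4 and 5 stay NaN because their groups have no non-missing values at all.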
+1 for this
take
+1 - had someone ask about this
Unassigning myself as I don't have time for this.
Edit: Replaced the implementation with one that is more efficient on both few categorical values (3 values, ~20% faster) and many categorical values (20k values, ~5x faster).
It seems to have decent performance, at least when the categorical ('b', here) has few values, but still +1 on adding a cythonized mode.
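The faster implementation mentioned in the comment above was not captured in this thread; a vectorized approach in the same spirit — a sketch only, counting each (key, value) pair once instead of calling a Python function per group, with ties broken arbitrarily — looks like:

```python
import pandas as pd

df = pd.DataFrame({'key': [0, 0, 0, 1, 1, 1],
                   'value': [5, 5, 7, 9, 9, 7]})

# Count every (key, value) pair in one pass, then keep the
# highest-count value per key.
counts = df.groupby(['key', 'value']).size().reset_index(name='n')
counts = counts.sort_values(['key', 'n'], ascending=[True, False])
modes = counts.drop_duplicates('key').set_index('key')['value']
print(modes)  # key 0 -> 5, key 1 -> 9
```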
I think there needs to be a discussion about the API for mode before we proceed with anything. The question is: what happens if the data is multimodal? (Should we raise a warning, return the last mode, return the smallest mode?) Also, I think there may be complexity around extension types when implementing in Cython. I was able to hack up something by reintroducing the tempita template for groupby and modifying _get_cythonized_result (which iterates over columns in Python?), but I'm not sure if this is the right result.
"What happens if multimodal?" Why not all of them? It could just be an argument to the function.
Something like I could try to implement this, but I am not sure where to do it at. Is the tempita template still the way to go? |
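To illustrate the "return all of them" option: it can be emulated today with agg, which suggests one possible shape for the behavior being discussed (any parameter name for it, e.g. keep='all', would be hypothetical, not an existing pandas API):

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y'],
                   'b': [1, 1, 2, 2, 3, 3]})

# Multimodal group 'x' keeps both modes; unimodal 'y' keeps one.
all_modes = df.groupby('a')['b'].agg(lambda s: s.mode().tolist())
print(all_modes)  # x -> [1, 2], y -> [3]
```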
Hoping for this feature.
Feature Request
Hi,
I use pandas a lot in my projects, and I got stuck on the problem of running a "mode" function (most common element) on a huge groupby object. There are a few solutions available using aggregate and scipy.stats.mode, but they are unbearably slow compared to e.g. groupby.mean(). Thus, I would like to make a feature request to add a cythonized version of a groupby.mode() operator.
Thanks in advance!
Jan Musial