Add mode() function to pandas.Series #5367
Comments
It could actually have a very fast implementation in Cython as well. |
It would also be nice to have this on DataFrames without having to go to each column (same with value_counts). |
What would it mean to do value_counts over the entire frame, given that the method returns a Series with the unique labels as Index? |
that said, I'll see if I can put something together for this - seems interesting to do. |
An index with the union over all series indices and NaNs or 0s in the columns if that value is not in the column. |
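A minimal sketch of that idea, assuming plain numeric columns with comparable values; the frame `df` and its contents are made up for illustration and this is not a proposed implementation:

```python
import pandas as pd

# Hypothetical example frame.
df = pd.DataFrame({"a": [1, 1, 2], "b": [2, 3, 3]})

# Count values column by column. The resulting index is the union of
# the values seen in any column, with NaN where a value is absent.
counts = df.apply(pd.Series.value_counts)

# Optionally report absent values as 0 instead of NaN.
counts_filled = counts.fillna(0).astype(int)
```

Here `counts_filled` has one row per distinct value (1, 2, 3) and one column per original column.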
@rockg I'm -1 on doing that for now - it's too easy to end up with an Index with non-comparable objects, and right now there are some issues with making that cross-platform anyways. Feel free to open a separate issue or submit a PR if you want this (we just might not accept until we have all the sort issues nailed down). |
@jtratner This is all I was expecting. I see your point on non-comparable objects, but I think that would be an odd use. I can put something together that checks for similar dtypes. I'll open a separate issue.
|
the issue isn't implementation, it's just non-similar dtypes. No need to raise if they aren't similar. |
@rockg - I just put this together. The problem is that if you have multiple equal counts, you get multiple modes. What's the result supposed to be if you do this at the DataFrame level? E.g., if you have the list |
You should return a Series or a DataFrame; these are like isin in that regard. |
but what about: |
@jreback is there some function that converts ragged arrays to same-size arrays? I.e., this:
|
Maybe tile the shape you want and assign; not sure. |
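I guess I could also do something like:
arrays = arrays_with_diff_lengths
# one column per array, as many rows as the longest array
arr = np.empty((max(len(a) for a in arrays), len(arrays)), dtype=some_object)
for i, s in enumerate(arrays):
    # fill the top of column i; the rest of the column stays unset
    arr[0:len(s), i] = s
return DataFrame(arr, columns=df.columns) |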
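One alternative way to pad ragged results, sketched here under the assumption that NaN is an acceptable filler; the dictionary `ragged` and its values are hypothetical stand-ins for per-column results of different lengths:

```python
import pandas as pd

# Hypothetical per-column results of different lengths.
ragged = {"a": [1, 2], "b": [3], "c": [4, 5, 6]}

# Wrapping each list in a Series lets the DataFrame constructor align
# them on a shared 0..n-1 index and pad the shorter ones with NaN.
padded = pd.DataFrame({name: pd.Series(vals) for name, vals in ragged.items()})
```

This sidesteps the manual preallocation, at the cost of upcasting integer columns to float wherever padding is needed.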
Worked on this last night and put together something that takes around the same time as value_counts() [because it reuses the Cython component of value_counts]. In the case where you have many fewer unique levels than the length of the array, it should be good. I can still wring out a little more performance and space efficiency for the case of a huge array where there are nearly as many unique values as the length of the array, but many fewer modes. Probably will have a PR in the next few days. |
Fantastic @jtratner |
That was quick. :-) |
@rockg I also have a simple implementation for frame-level value_counts. If |
So I have a finished mode implementation - but I just noticed that scipy just returns first along axis - http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mode.html , but it returns the bin counts too and just picks the first sorted value. Do we need to match this? Could probably implement it as a stat func if it only returned one value. Might also mean that you could get a more performant implementation |
@jtratner I don't think we want to follow the convention of just returning one value as that is not right. I'm surprised that scipy doesn't have a more complete implementation. |
The mode function returns the most common value (http://en.wikipedia.org/wiki/Mode_%28statistics%29) and it would be nice to have it as a shortcut for Series.value_counts().idxmax().
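For illustration, a small sketch of the shortcut and of the tie case raised in the comments; the sample data is made up, and the second step is just one possible way to keep every tied value, not the implementation discussed in the thread:

```python
import pandas as pd

# Hypothetical data in which 2 and 3 are tied for most common.
s = pd.Series([1, 2, 2, 3, 3])

counts = s.value_counts()

# The shortcut from the issue text: a single most-common value.
# With a tie, only one of the tied values is returned.
single = counts.idxmax()

# Keeping every value whose count equals the maximum returns all modes.
all_modes = sorted(counts[counts == counts.max()].index)
```

With this data `all_modes` holds both 2 and 3, while `single` is only one of them, which is why a mode function that returns a Series rather than a scalar was argued for above.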