Add mode() function to pandas.Series #5367

Closed
dov opened this issue Oct 28, 2013 · 20 comments · Fixed by #5380
Labels
Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) · Enhancement · Numeric Operations (Arithmetic, Comparison, and Logical operations)
Comments

@dov

dov commented Oct 28, 2013

The mode function returns the most common value (http://en.wikipedia.org/wiki/Mode_%28statistics%29) and it would be nice to have it as a shortcut for Series.value_counts().idxmax().
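The requested shortcut can be sketched directly with the existing API (this picks a single winner, which is only well-defined when there are no ties):

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2, 3, 1])

# mode() would be roughly a shortcut for this:
# value_counts() sorts by count descending, so idxmax() is the top value
most_common = s.value_counts().idxmax()
print(most_common)  # -> 3
```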

@jreback
Contributor

jreback commented Oct 28, 2013

it could actually have a very fast implementation in cython as well

@rockg
Contributor

rockg commented Oct 29, 2013

It would also be nice to have this on DataFrames without having to go to each column (same with value_counts).

@jtratner
Contributor

What would it mean to do value_counts over the entire frame, given that the method returns a Series with the unique labels as Index?

@jtratner
Contributor

that said, I'll see if I can put something together for this - seems interesting to do.

@rockg
Contributor

rockg commented Oct 29, 2013

An index formed from the union of all the series' values, with NaN or 0 in a column wherever that value does not appear in it.

@jtratner
Contributor

@rockg I'm -1 on doing that for now - too easy to end up with an Index with non-comparable objects, and right now there are some issues with making that cross-platform anyway. Feel free to open a separate issue or submit a PR if you want this (we just might not accept until we have all the sort issues nailed down).

@rockg
Contributor

rockg commented Oct 29, 2013

@jtratner This is all I was expecting. I see your point on non-comparable objects, but I think that would be an odd use. I can put something together that checks for similar dtypes. I'll open a separate issue.

import random
from pandas import DataFrame

a = []
b = []
for i in range(20):
    a.append(random.randint(0, 5))
    b.append(random.randint(4, 8))

df = DataFrame({'A': a, 'B': b})

# per-column value_counts, then align on the union of observed values
vc = dict()
for k, v in df.items():  # iteritems() was removed in pandas 2.0
    vc[k] = v.value_counts()

DataFrame(vc)
Out[1]: 
    A   B
0   4 NaN
1   5 NaN
2   3 NaN
3   4 NaN
4   1   5
5   3   6
6 NaN   6
7 NaN   2
8 NaN   1

@jtratner
Contributor

the issue isn't implementation, it's just non-similar dtypes. No need to raise if they aren't similar.

@jtratner
Contributor

@rockg - I just put this together. The problem is that if you have multiple equal counts, you get multiple modes. What's the result supposed to be if you do this at the DataFrame level?

E.g., if you have the list [12, 12, 11, 10, 19, 11], the modes are [12, 11]. I'm thinking Series returns ndarray and DataFrame returns dict of col: ndarray.
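The tie case can be sketched with value_counts alone, keeping every value whose count hits the maximum:

```python
import pandas as pd

s = pd.Series([12, 12, 11, 10, 19, 11])
counts = s.value_counts()

# keep all values tied for the highest count, not just the first
modes = sorted(counts[counts == counts.max()].index)
print(modes)  # -> [11, 12]
```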

@jreback
Contributor

jreback commented Oct 29, 2013

u should return series or frame

these are like isin in that regard

@jtratner
Contributor

pd.Series([12, 12, 11, 10]).mode() --> pd.Series([12])?

but what about:
pd.DataFrame({"A": [12, 12, 11, 11], "B": [1, 1, 3, 5], "C": [0, 1, 2, 3]}).mode()
should that go to (equiv of): pd.DataFrame({"A": [12, 11], "B": [1, np.nan], "C": [np.nan, np.nan]})?
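The NaN padding itself is cheap to sketch: start from hypothetical per-column mode lists (hard-coded here to match the example above, not computed by any mode() implementation) and let dict-of-Series alignment do the padding:

```python
import pandas as pd

# hypothetical per-column mode lists, matching the example above
modes = {"A": [12, 11], "B": [1], "C": []}

# DataFrame aligns the Series on a common index, filling gaps with NaN
padded = pd.DataFrame({col: pd.Series(vals, dtype="float64")
                       for col, vals in modes.items()})
# A: [12, 11], B: [1, NaN], C: [NaN, NaN]
```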

@jtratner
Contributor

@jreback is there some function that converts ragged arrays to same-size arrays? I.e., this:

[[1], [1, 3, 5], [2, 1]] --> [[1, nan, nan], [1, 3, 5], [2, 1, nan]]
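One hedged sketch of that padding, leaning on pandas' own alignment rather than any dedicated function:

```python
import numpy as np
import pandas as pd

ragged = [[1], [1, 3, 5], [2, 1]]

# each list becomes a Series; DataFrame alignment pads the short ones with NaN
padded = pd.DataFrame({i: pd.Series(a, dtype="float64")
                       for i, a in enumerate(ragged)})
rows = padded.T.values  # -> [[1, nan, nan], [1, 3, 5], [2, 1, nan]]
```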

@jreback
Contributor

jreback commented Oct 29, 2013

maybe tile the shape you want and assign

not sure

@jtratner
Contributor

I guess I could also do something like:

arrays = arrays_with_diff_lengths
# pre-fill with NaN so the shorter columns come out padded
arr = np.full((max(len(a) for a in arrays), len(arrays)), np.nan, dtype=object)
for i, s in enumerate(arrays):
    arr[0:len(s), i] = s
return DataFrame(arr, columns=df.columns)

@jtratner
Contributor

Worked on this last night and put together something that takes around the same time as value_counts() [because it reuses the Cython component of value_counts]. In the case where you have many fewer unique levels than the length of the array, it should be good. I can still wring out a little more performance and space efficiency for the case of a huge array where there are nearly as many unique values as the length of the array, but many fewer modes.

Probably will have a PR in the next few days.

@rockg
Contributor

rockg commented Oct 29, 2013

Fantastic @jtratner

@dov
Author

dov commented Oct 29, 2013

That was quick. :-)

@jtratner
Contributor

@rockg I also have a simple implementation for frame-level value_counts. If you put together a few test cases I'll put up a PR.
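A minimal test case along those lines, using a per-column apply as a stand-in for the proposed frame-level value_counts:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [2, 2, 2]})

# stand-in: value_counts per column, aligned on the union of values
result = df.apply(pd.Series.value_counts)
# A: {1: 2, 2: 1}; B: {2: 3}, with NaN where a value never occurs
```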

@jtratner
Contributor

So I have a finished mode implementation - but I just noticed that scipy returns only the first value along the axis (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mode.html): it returns the bin counts too, but picks just the first sorted value. Do we need to match this?

Could probably implement it as a stat func if it only returned one value. Might also mean that you could get a more performant implementation.
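The two conventions can be contrasted with the standard library alone; scipy's single-value convention amounts to picking one representative (the smallest, i.e. first in sorted order) among the tied values:

```python
from collections import Counter

data = [12, 12, 11, 10, 19, 11]
counts = Counter(data)
top = max(counts.values())

# scipy-style: one representative, the smallest tied value
single = min(v for v, c in counts.items() if c == top)

# proposed pandas behaviour: every tied value
all_modes = sorted(v for v, c in counts.items() if c == top)

print(single, all_modes)  # -> 11 [11, 12]
```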

@rockg
Contributor

rockg commented Oct 31, 2013

@jtratner I don't think we want to follow the convention of just returning one value as that is not right. I'm surprised that scipy doesn't have a more complete implementation.
