Add mode() function to pandas.Series #5367

Closed
dov opened this issue Oct 28, 2013 · 20 comments · Fixed by #5380
Labels
Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) · Enhancement · Numeric Operations (Arithmetic, Comparison, and Logical operations)
Comments

@dov

dov commented Oct 28, 2013

The mode function returns the most common value (http://en.wikipedia.org/wiki/Mode_%28statistics%29) and it would be nice to have it as a shortcut for Series.value_counts().idxmax().
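The requested shortcut can be sketched directly with the existing API (this picks a single winner, which is only well-defined when there are no ties):

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2, 3, 1])

# mode() would be roughly a shortcut for this:
# value_counts() sorts by count descending, so idxmax() is the top value
most_common = s.value_counts().idxmax()
print(most_common)  # -> 3
```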

@jreback
Contributor

jreback commented Oct 28, 2013

it could actually have a very fast implementation in cython as well

@rockg
Contributor

rockg commented Oct 29, 2013

It would also be nice to have this on DataFrames without having to go to each column (same with value_counts).

@jtratner
Contributor

What would it mean to do value_counts over the entire frame, given that the method returns a Series with the unique labels as Index?

@jtratner
Contributor

that said, I'll see if I can put something together for this - seems interesting to do.

@rockg
Contributor

rockg commented Oct 29, 2013

An index formed from the union of all the series' values, with NaN or 0 in a column wherever that value does not appear in it.

@jtratner
Contributor

@rockg I'm -1 on doing that for now - too easy to end up with an Index with non-comparable objects, and right now there are some issues with making that cross-platform anyway. Feel free to open a separate issue or submit a PR if you want this (we just might not accept until we have all the sort issues nailed down).

@rockg
Contributor

rockg commented Oct 29, 2013

@jtratner This is all I was expecting. I see your point on non-comparable objects, but I think that would be an odd use. I can put something together that checks for similar dtypes. I'll open a separate issue.

import random
from pandas import DataFrame

a = []
b = []
for i in range(20):
    a.append(random.randint(0, 5))
    b.append(random.randint(4, 8))

df = DataFrame({'A': a, 'B': b})

# per-column value_counts, then align on the union of observed values
vc = dict()
for k, v in df.items():  # iteritems() was removed in pandas 2.0
    vc[k] = v.value_counts()

DataFrame(vc)
Out[1]: 
    A   B
0   4 NaN
1   5 NaN
2   3 NaN
3   4 NaN
4   1   5
5   3   6
6 NaN   6
7 NaN   2
8 NaN   1

@jtratner
Contributor

the issue isn't implementation, it's just non-similar dtypes. No need to raise if they aren't similar.

@jtratner
Contributor

@rockg - I just put this together. The problem is that if you have multiple equal counts, you get multiple modes. What's the result supposed to be if you do this at the DataFrame level?

E.g., if you have the list [12, 12, 11, 10, 19, 11], the modes are [12, 11]. I'm thinking Series returns ndarray and DataFrame returns dict of col: ndarray.
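The tie case can be sketched with value_counts alone, keeping every value whose count hits the maximum:

```python
import pandas as pd

s = pd.Series([12, 12, 11, 10, 19, 11])
counts = s.value_counts()

# keep all values tied for the highest count, not just the first
modes = sorted(counts[counts == counts.max()].index)
print(modes)  # -> [11, 12]
```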

@jreback
Contributor

jreback commented Oct 29, 2013

u should return series or frame

these are like isin in that regard

@jtratner
Contributor

pd.Series([12, 12, 11, 10]).mode() --> pd.Series([12])?

but what about:
pd.DataFrame({"A": [12, 12, 11, 11], "B": [1, 1, 3, 5], "C": [0, 1, 2, 3]}).mode()
should that go to (equiv of): pd.DataFrame({"A": [12, 11], "B": [1, np.nan], "C": [np.nan, np.nan]})?
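The NaN padding itself is cheap to sketch: start from hypothetical per-column mode lists (hard-coded here to match the example above, not computed by any mode() implementation) and let dict-of-Series alignment do the padding:

```python
import pandas as pd

# hypothetical per-column mode lists, matching the example above
modes = {"A": [12, 11], "B": [1], "C": []}

# DataFrame aligns the Series on a common index, filling gaps with NaN
padded = pd.DataFrame({col: pd.Series(vals, dtype="float64")
                       for col, vals in modes.items()})
# A: [12, 11], B: [1, NaN], C: [NaN, NaN]
```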

@jtratner
Contributor

@jreback is there some function that converts ragged arrays to same-size arrays? I.e., this:

[[1], [1, 3, 5], [2, 1]] --> [[1, nan, nan], [1, 3, 5], [2, 1, nan]]
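One hedged sketch of that padding, leaning on pandas' own alignment rather than any dedicated function:

```python
import numpy as np
import pandas as pd

ragged = [[1], [1, 3, 5], [2, 1]]

# each list becomes a Series; DataFrame alignment pads the short ones with NaN
padded = pd.DataFrame({i: pd.Series(a, dtype="float64")
                       for i, a in enumerate(ragged)})
rows = padded.T.values  # -> [[1, nan, nan], [1, 3, 5], [2, 1, nan]]
```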

@jreback
Contributor

jreback commented Oct 29, 2013

maybe tile the shape you want and assign

not sure

@jtratner
Contributor

I guess I could also do something like:

arrays = arrays_with_diff_lengths
# pre-fill with NaN so the shorter columns come out padded
arr = np.full((max(len(a) for a in arrays), len(arrays)), np.nan, dtype=object)
for i, s in enumerate(arrays):
    arr[0:len(s), i] = s
return DataFrame(arr, columns=df.columns)

@jtratner
Contributor

Worked on this last night and put together something that takes around the same time as value_counts() [because it reuses the Cython component of value_counts]. In the case where you have many fewer unique levels than the length of the array, it should be good. I can still wring out a little more performance and space efficiency for the case of a huge array where there are nearly as many unique values as the length of the array, but many fewer modes.

Probably will have a PR in the next few days.

@rockg
Contributor

rockg commented Oct 29, 2013

Fantastic @jtratner

@dov
Author

dov commented Oct 29, 2013

That was quick. :-)

@jtratner
Contributor

@rockg I also have a simple implementation for frame-level value_counts. If you put together a few test cases I'll put up a PR.
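A minimal test case along those lines, using a per-column apply as a stand-in for the proposed frame-level value_counts:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [2, 2, 2]})

# stand-in: value_counts per column, aligned on the union of values
result = df.apply(pd.Series.value_counts)
# A: {1: 2, 2: 1}; B: {2: 3}, with NaN where a value never occurs
```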

@jtratner
Contributor

So I have a finished mode implementation - but I just noticed that scipy returns only the first value along the axis (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mode.html): it returns the bin counts too, but picks just the first sorted value. Do we need to match this?

Could probably implement it as a stat func if it only returned one value. Might also mean that you could get a more performant implementation.
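The two conventions can be contrasted with the standard library alone; scipy's single-value convention amounts to picking one representative (the smallest, i.e. first in sorted order) among the tied values:

```python
from collections import Counter

data = [12, 12, 11, 10, 19, 11]
counts = Counter(data)
top = max(counts.values())

# scipy-style: one representative, the smallest tied value
single = min(v for v, c in counts.items() if c == top)

# proposed pandas behaviour: every tied value
all_modes = sorted(v for v, c in counts.items() if c == top)

print(single, all_modes)  # -> 11 [11, 12]
```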

@rockg
Contributor

rockg commented Oct 31, 2013

@jtratner I don't think we want to follow the convention of just returning one value as that is not right. I'm surprised that scipy doesn't have a more complete implementation.
