ENH add cumcount groupby method #5510

hayd · 2013-11-14T00:13:44Z

closes #4646

Happy to hear ways to make this faster, but I think this is a good convenience function as is.

Update: made significantly faster...

hayd · 2013-11-14T00:15:15Z

As an aside, I was sure you used to be able to do g.value_counts()... guess I was mistaken.

jtratner · 2013-11-14T00:17:23Z

@hayd that's an open issue and it's being re-enabled. You could reincorporate it here if necessary to make this PR work.

jtratner · 2013-11-14T00:18:26Z

pandas/core/groupby.py

+        """
+        try to cast the result to our obj original type,
+        we may have roundtripped thru object in the mean-time
+


mind removing this blank line if you're editing this?

I added it in specifically, pep8 says it should be there, right?

okay, no idea.

hayd · 2013-11-14T03:39:19Z

I really think there ought to be tests for a dupe index too, in which case this implementation fails (!)

@jreback Is this a wider bug in apply alignment / or is there a cheeky way to do this without a sort:

In [52]: df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']], columns=['A'], index=[0] * 6)

In [53]: g = df.groupby('A')

In [54]: pd.concat(pd.DataFrame(np.arange(len(v)), v) for v in g.indices.values()).sort_index().set_index(df.index)
Out[54]: 
   0
0  0
0  1
0  2
0  0
0  1
0  3

In [55]: g.apply(lambda x: pd.Series(np.arange(len(x)), x.index))  # not correct 
Out[55]: 
0    0
0    1
0    2
0    3
0    0
0    1
dtype: int64

hayd · 2013-11-14T05:51:59Z

Refactored to work with dupes, also faster:

In [1]: import pandas as pd; import numpy as np

In [2]: df = pd.DataFrame(np.random.randint(0, 2, 1000), columns=['A'], index=[0] * 1000); g = df.groupby('A')

In [3]: %timeit g.apply(lambda x: pd.Series(np.arange(len(x)), x.index))
1000 loops, best of 3: 786 µs per loop

In [4]: %timeit pd.concat(pd.DataFrame(np.arange(len(v)), v) for v in g.indices.values()).sort_index().set_index(df.index)
1000 loops, best of 3: 568 µs per loop

In [5]: %timeit g.cumcount()
10000 loops, best of 3: 41.6 µs per loop


In [8]: df = pd.DataFrame(np.random.randint(0, 100, 1000), columns=['A'], index=[0] * 1000); g = df.groupby('A')

In [9]: %timeit g.apply(lambda x: pd.Series(np.arange(len(x)), x.index))
100 loops, best of 3: 7.35 ms per loop

In [10]: %timeit pd.concat(pd.DataFrame(np.arange(len(v)), v) for v in g.indices.values()).sort_index().set_index(df.index)
100 loops, best of 3: 10.7 ms per loop

In [11]: %timeit g.cumcount()
1000 loops, best of 3: 215 µs per loop

hayd · 2013-11-14T05:56:16Z

@jreback / @cpcloud too late for 0.13 ?

jorisvandenbossche · 2013-11-14T08:01:51Z

pandas/core/groupby.py

+        Number each item in each group from 0 to the length of that group.
+
+        Essentially this is equivalent to
+        >>> self.apply(lambda x: Series(np.arange(len(x)), x.index)).


There has to be a blank line between "Essentially" and the code to render it as code I think (in html docstring).

Thanks, fixed :)
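For reference, a sketch of the layout being discussed, assuming the usual reStructuredText behaviour that a doctest block is its own paragraph and so needs a blank line before it. The function name `cumcount_stub` is hypothetical and only carries the docstring; the docstring text is copied from the diff above.

```python
def cumcount_stub():
    """
    Number each item in each group from 0 to the length of that group.

    Essentially this is equivalent to

    >>> self.apply(lambda x: Series(np.arange(len(x)), x.index))
    """


# The blank line before the ">>>" line is what separates the code
# from the sentence that introduces it.
doc = cumcount_stub.__doc__
print("\n\n    >>> " in doc)
```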

jreback · 2013-11-14T13:55:35Z

@hayd I think this is fine for 0.13. Maybe add a doc mention in groupby.rst (could be a separate PR).

jtratner · 2013-11-14T13:57:53Z

pandas/core/groupby.py

+        index = self.obj.index
+        cumcounts = np.zeros(len(index), dtype='int')
+        for v in self.indices.values():
+            cumcounts[v] = np.arange(len(v))


clever way to do it :)

I think you need to specify dtype here (to avoid issues on 32-bit plat), e.g. np.arange(len(v),dtype='int64')
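A minimal standalone sketch of the approach in the diff above, with the explicit dtype suggested here. It assumes the group positions are available as a mapping from label to integer positions (the shape of `self.indices` in the diff); the helper name `cumcount_from_indices` is made up for illustration and is not part of pandas.

```python
import numpy as np


def cumcount_from_indices(n_rows, indices):
    """Number rows within each group, given {label: positions}."""
    # int64 is requested explicitly so the result dtype is the same
    # on 32-bit and 64-bit platforms.
    cumcounts = np.zeros(n_rows, dtype="int64")
    for positions in indices.values():
        positions = np.asarray(positions)
        # Fancy-index assignment writes 0..len-1 at this group's positions.
        cumcounts[positions] = np.arange(len(positions), dtype="int64")
    return cumcounts


# Same grouping as the duplicate-index example earlier in the thread:
# rows 0, 1, 2 and 5 belong to 'a'; rows 3 and 4 belong to 'b'.
indices = {"a": [0, 1, 2, 5], "b": [3, 4]}
result = cumcount_from_indices(6, indices)
print(result.tolist())  # [0, 1, 2, 0, 1, 3]
```

Because the counts are written by position rather than aligned by label, no sort is needed and repeated index labels do not matter.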

hayd · 2013-11-14T18:06:06Z

Updated those to int64, good spot.

I'm not sure what/where to add in the groupby.rst (there is nothing about cumsum and friends either)... "Other useful features" ?

jreback · 2013-11-14T18:07:49Z

maybe create another section after Flexible apply, with apply examples?

hayd · 2013-11-14T18:29:03Z

Just appended a small mention of it at the end of groupby.rst. ...cumsum etc. are covered in dispatched (though could do more in the future). :S

hayd · 2013-11-14T18:35:24Z

doc/source/groupby.rst

+~~~~~~~~~~~~~~~~~~~~~
+
+Sometimes you want to keep track of the order in which each row appears within
+its group. You can see this with the ``cumcount`` method:


This is just horrible

maybe "To see the order in which each row appears within its group, use the
cumcount method"
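A short usage sketch of what that sentence describes, reusing the frame from the duplicate-index example earlier in the thread. It assumes a pandas version that includes `cumcount`; the variable names are illustrative.

```python
import pandas as pd

# Six rows sharing the index label 0, grouped by column 'A'.
df = pd.DataFrame(
    [["a"], ["a"], ["a"], ["b"], ["b"], ["a"]],
    columns=["A"],
    index=[0] * 6,
)

# Each row gets its position within its own group, in row order.
counts = df.groupby("A").cumcount()
print(counts.tolist())  # [0, 1, 2, 0, 1, 3]
```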

hayd · 2013-11-14T21:23:48Z

going to go ahead and merge this :)

ENH add cumcount groupby method

83386d8

DOC add cumcount example to groupby.rst

b564798

hayd added a commit that referenced this pull request Nov 14, 2013

Merge pull request #5510 from hayd/groupby_cumcount

c70882a

hayd merged commit c70882a into pandas-dev:master Nov 14, 2013

jorisvandenbossche mentioned this pull request Dec 2, 2014

Updating generic.py error message #8618 - New branch #8950

Closed
