ENH add cumcount groupby method #5510

Merged
merged 2 commits into from
Nov 14, 2013

hayd
Contributor

@hayd hayd commented Nov 14, 2013

closes #4646

Happy to hear ways to make this faster, but I think this is a good convenience function as is.

Update: made significantly faster...

@hayd
Contributor Author

hayd commented Nov 14, 2013

As an aside, I was sure you used to be able to do g.value_counts()... guess I was mistaken.

@jtratner
Contributor

@hayd that's an open issue and is being re-enabled. You could reincorporate it here if necessary to make this PR work.

"""
try to cast the result to our obj original type,
we may have roundtripped thru object in the mean-time

Contributor

mind removing this blank line if you're editing this?

Contributor Author

I added it in specifically, pep8 says it should be there, right?

Contributor

okay, no idea.

@hayd
Contributor Author

hayd commented Nov 14, 2013

I really think there ought to be tests for a dupe index too, in which case this implementation fails (!)

@jreback Is this a wider bug in apply alignment / or is there a cheeky way to do this without a sort:

In [52]: df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']], columns=['A'], index=[0] * 6)

In [53]: g = df.groupby('A')

In [54]: pd.concat(pd.DataFrame(np.arange(len(v)), v) for v in g.indices.values()).sort_index().set_index(df.index)
Out[54]: 
   0
0  0
0  1
0  2
0  0
0  1
0  3

In [55]: g.apply(lambda x: pd.Series(np.arange(len(x)), x.index))  # not correct 
Out[55]: 
0    0
0    1
0    2
0    3
0    0
0    1
dtype: int64
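For reference, the desired result on this duplicate-index frame can be checked against the `cumcount` method as it exists in current pandas. This is a minimal sketch, assuming a recent pandas release rather than the 0.13-era code under review; the variable name `counts` is only illustrative.

```python
import pandas as pd

# Same frame as in the thread: every row shares the index label 0.
df = pd.DataFrame({"A": ["a", "a", "a", "b", "b", "a"]}, index=[0] * 6)

# Number each row by its position within its group, in order of appearance.
counts = df.groupby("A").cumcount()

print(counts.tolist())        # [0, 1, 2, 0, 1, 3]
print(counts.index.tolist())  # [0, 0, 0, 0, 0, 0]
```

The result keeps the original (duplicated) index and the original row order, which is what Out[54] produces and what the `apply` version in Out[55] does not.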

@hayd
Contributor Author

hayd commented Nov 14, 2013

Refactored to work with dupes, also faster:

In [1]: import pandas as pd; import numpy as np

In [2]: df = pd.DataFrame(np.random.randint(0, 2, 1000), columns=['A'], index=[0] * 1000); g = df.groupby('A')

In [3]: %timeit g.apply(lambda x: pd.Series(np.arange(len(x)), x.index))
1000 loops, best of 3: 786 µs per loop

In [4]: %timeit pd.concat(pd.DataFrame(np.arange(len(v)), v) for v in g.indices.values()).sort_index().set_index(df.index)
1000 loops, best of 3: 568 µs per loop

In [5]: %timeit g.cumcount()
10000 loops, best of 3: 41.6 µs per loop


In [8]: df = pd.DataFrame(np.random.randint(0, 100, 1000), columns=['A'], index=[0] * 1000); g = df.groupby('A')

In [9]: %timeit g.apply(lambda x: pd.Series(np.arange(len(x)), x.index))
100 loops, best of 3: 7.35 ms per loop

In [10]: %timeit pd.concat(pd.DataFrame(np.arange(len(v)), v) for v in g.indices.values()).sort_index().set_index(df.index)
100 loops, best of 3: 10.7 ms per loop

In [11]: %timeit g.cumcount()
1000 loops, best of 3: 215 µs per loop

@hayd
Contributor Author

hayd commented Nov 14, 2013

@jreback / @cpcloud too late for 0.13 ?

Number each item in each group from 0 to the length of that group.

Essentially this is equivalent to
>>> self.apply(lambda x: Series(np.arange(len(x)), x.index)).
Member

There has to be a blank line between "Essentially" and the code to render it as code I think (in html docstring).

Contributor Author

Thanks, fixed :)
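The equivalence stated in the docstring can be checked when the index is unique. The sketch below is an assumption-laden illustration, not the PR's test code: it groups a Series by its own values (instead of the DataFrame used in the thread), passes `group_keys=False` so the group labels are not added to the result index, and sorts by index before comparing.

```python
import numpy as np
import pandas as pd

# Unique default index 0..5, so the apply-based version can align correctly.
s = pd.Series(list("aaabba"), name="A")
g = s.groupby(s, group_keys=False)

# The expression from the docstring, applied group by group.
via_apply = g.apply(lambda x: pd.Series(np.arange(len(x)), index=x.index)).sort_index()
via_cumcount = g.cumcount()

same = via_apply.tolist() == via_cumcount.tolist()
print(same)
```

Note that the numbering runs from 0 to the group length minus 1, since it comes from `np.arange(len(x))`.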

@jreback
Contributor

jreback commented Nov 14, 2013

@hayd I think this is fine for 0.13. Maybe add a doc mention in groupby.rst (could be a separate PR)

index = self.obj.index
cumcounts = np.zeros(len(index), dtype='int')
for v in self.indices.values():
    cumcounts[v] = np.arange(len(v))
Contributor

clever way to do it :)

Contributor

I think you need to specify dtype here (to avoid issues on 32-bit plat), e.g. np.arange(len(v),dtype='int64')
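The loop under review, with the explicit `int64` dtype suggested here, can be sketched as a standalone function. This is an illustration only: `cumcount_sketch` and its parameters are hypothetical names, and the code relies on the public `GroupBy.indices` mapping (group label to integer row positions) rather than on the method's internals.

```python
import numpy as np
import pandas as pd


def cumcount_sketch(df, key):
    """Number rows within each group of df[key], in order of appearance."""
    g = df.groupby(key)
    # Explicit int64 avoids a platform-dependent default integer width.
    out = np.zeros(len(df), dtype="int64")
    for positions in g.indices.values():
        # positions holds the integer row locations belonging to one group.
        out[positions] = np.arange(len(positions), dtype="int64")
    return pd.Series(out, index=df.index)


df = pd.DataFrame({"A": ["a", "a", "a", "b", "b", "a"]}, index=[0] * 6)
result = cumcount_sketch(df, "A")
print(result.tolist())  # [0, 1, 2, 0, 1, 3]
```

Because the assignment is by integer position rather than by index label, duplicate index labels do not affect the result.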

@hayd
Contributor Author

hayd commented Nov 14, 2013

Updated those to int64, good spot.

I'm not sure what/where to add in the groupby.rst (there is nothing about cumsum and friends either)... "Other useful features" ?

@jreback
Contributor

jreback commented Nov 14, 2013

maybe create another section after Flexible apply, with apply examples?

@hayd
Contributor Author

hayd commented Nov 14, 2013

Just appended a small mention of it at the end of groupby.rst. ...cumsum etc. are covered in dispatched (though could do more in the future). :S

~~~~~~~~~~~~~~~~~~~~~

Sometimes you want to keep track of the order in which each row appears within
its group. You can see this with the ``cumcount`` method:
Contributor Author

This is just horrible

Contributor Author

maybe "To see the order in which each row appears within its group, use the
cumcount method"

@hayd
Contributor Author

hayd commented Nov 14, 2013

going to go ahead and merge this :)

hayd added a commit that referenced this pull request Nov 14, 2013
ENH add cumcount groupby method
@hayd hayd merged commit c70882a into pandas-dev:master Nov 14, 2013
Successfully merging this pull request may close these issues.

groupby enumerate method
4 participants