-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: cumcount() for DataFrames and Series #12648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
an Index already does the pls show a usecase |
@johne13 you aren't confusing it with |
It is not a super-common event, to be sure, but there are cases like this: that I bump into somewhat regularly: Simpler example:
So yeah, lots of easy ways to do this on ENTIRE dataset but as soon as you want to do a selection, it gets a little messier. Or again, maybe I am missing something obvious. Also, I would think whatever justification there is for having this as a groupby method applies just as much to having it work for a dataframe or series, no? |
the common idiom is .reset_index() |
no for a groupby this is expensive if u don't have an array impl (which is why it exists) |
I guess I am missing something. I couldn't do this, for example, I should have been more clear about that aspect. I am generally trying to create a new column when I mistakenly attempt to use the nonexistent cumcount method. |
ok not averse to adding this method for consistency (among other cum. methods) but should return a Series pretty trivial impl if u would like to submit a PR |
signature would be identical to
|
Potentially not as fast (?), and handles NAs differently from the case above, but very idiomatic:
|
Taken a step further: we could offer a argument for changing the NaN behavior and offer all the |
@MaximilianR actually that's a good point. would you make a PR to:
|
maybe give a tiny example
|
I am jammed at the moment (although free enough to cruise the feed, it seems), but I can add that if no one else picks it up over the next few weeks |
ok, let me make an issue so we don't forget. |
@MaximilianR I think you can get the same behavior as in my earlier example by using a boolean mask instead of
|
Here's a quick speed test,
|
try assigning to a range ITS much faster still. this is the impl we were not suggesting using expanding.count here it's doing a lot more work and very careful about min periods and such |
@jreback So do you still think Sorry, don't understand what you mean by "assigning to a range"? If you still like the cumcount idea I'll give it a shot though I have not done a PR before. Of course if someone else wants to do it I'd gladly defer to them. |
I think it rounds out the cum functions a bit and would be ok |
OK, I think I have an initial attempt and will post the code and example in a moment, but first want to list other similar functions as a point of comparison:
Starting from there, it made the most sense to me to aim for something roughly consistent with cumsum() and expanding().count and not (groupby)cumcount(). Example to follow, see what you think. |
Here's the code, added to generic, just after cummax (and also cumsum for comparison).
It's a little less natural than the other cum functions, but that was the best way I could come up with that fit the existing In some quickie timings (1000 rows and 2 columns) it is just a little slower than cumsum() and much faster than expanding().count(). Exactly as expected, I think.
And here's groupby/cumcount for comparison:
|
The problem with making it consistent with So if we want to make |
@jorisvandenbossche I really hadn't thought about how exactly a Series/DataFrame.cumcount would work until I started comparing it to all the existing functions (count/cumsum/groupby.cumcount). Then I realized it couldn't be consistent with all of them, only some. Out of all the comparison functions, it seems to me that it ought to work approximately the same as expanding().count(), right? And conversely, the way groupby.cumcount works makes the least sense to me, but of course I'll gladly defer to you and others on how to best make this consistent with the rest of pandas. |
This is almost a trivial function, should be something like:
this is really just a range; if the index has nans then its slightly different. |
I guess it technically should have an This should mirror the |
@jreback But I would also find it very unfortunate that |
@jorisvandenbossche actually that's a good point. it should work on the values, similarl to |
Sorry, some of these details are over my head -- have never really delved into pandas internals before. I tried to make it work as similarly to cum* as possible, that's why it is done via
|
And also regarding |
@johne13 no |
And that's the problem, as for So it that sense, |
Right, I'm just trying to note what seems consistent and what doesn't. Along those lines, note that the following holds true for the relationship between
The same relationship holds for |
Yeah, as Joris notes, |
@jreback @jorisvandenbossche FYI, I will abandon this for now, pending further instructions. I'm not quite sure what if anything you guys decided about how best to proceed. And this might not be a good first project for someone who has never done a PR and is not super familiar with all the underlying code, so if someone else want to do something with it I'm happy to step aside. |
One other point of comparison for a hypothetical |
@johne13 almost everything is faster than sorting :) sorting |
Right, except in this case it's "sorting" a column of ones, so it's just a matter of how long it takes the algorithm to confirm that it's already sorted, there is no actual sorting. But I mainly mentioned |
So we have very fast routines to determine whether something is monotonic which determine if its 'sorted' very quickly.
|
Unless I am missing something obvious, we can do this:
df.groupby('x').cumcount()
but not:
df.cumcount()
Obviously, there are workarounds, but seems like we should be able to do on a dataframe/series if we can do on a groupby object?
The text was updated successfully, but these errors were encountered: