ENH: cumcount() for DataFrames and Series #12648

Open

johne13 opened this issue Mar 16, 2016 · 39 comments
Labels
Enhancement · Numeric Operations (Arithmetic, Comparison, and Logical operations)

Comments

@johne13 commented Mar 16, 2016

Unless I am missing something obvious, we can do this:
df.groupby('x').cumcount()
but not:
df.cumcount()

Obviously there are workarounds, but it seems like we should be able to do this on a DataFrame/Series if we can do it on a groupby object?
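For reference, a minimal sketch of the asymmetry being described (the values here are made up):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 1, 2]})

# cumcount() exists on a groupby object: it enumerates rows within each group
print(df.groupby('x').cumcount().tolist())  # [0, 0, 1, 1]

# ...but there is no equivalent method on the frame itself
print(hasattr(df, 'cumcount'))  # False
```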

@johne13 johne13 changed the title Enhancement request: cumcount() for DataFrames and Series ENH: cumcount() for DataFrames and Series Mar 16, 2016
@jreback (Contributor) commented Mar 16, 2016

An Index already does this.

Please show a use case.

@jorisvandenbossche (Member)

@johne13 you aren't confusing it with cumsum, are you? Besides obtaining it from an index, you can also just do range(len(df)).
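The range(len(df)) suggestion, sketched out (keeping the original index so the result aligns with the frame):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 1, 2]})

# enumerate all rows; reuse the frame's index for alignment
counter = pd.Series(range(len(df)), index=df.index)
print(counter.tolist())  # [0, 1, 2, 3]
```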

@johne13 (Author) commented Mar 17, 2016

It is not a super-common need, to be sure, but there are cases like this that I bump into somewhat regularly:
http://stackoverflow.com/questions/36044890/how-to-speed-up-a-very-slow-pandas-apply-function/36047836#36047836

Simpler example:

df = pd.DataFrame({ 'x':[1,2,1,2] })
#df[ df.x == 2 ].cumcount()    # doesn't work of course

df.loc[ df.x == 2, 'y'] = 1    # so I do a two step with cumsum as 2nd step
df.y.cumsum()

Out[428]: 
0    NaN
1    1.0
2    NaN
3    2.0

So yeah, lots of easy ways to do this on ENTIRE dataset but as soon as you want to do a selection, it gets a little messier. Or again, maybe I am missing something obvious.

Also, I would think whatever justification there is for having this as a groupby method applies just as much to having it work for a dataframe or series, no?

@jreback (Contributor) commented Mar 17, 2016

the common idiom is

.reset_index()
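One reading of the .reset_index() idiom, sketched: the fresh RangeIndex it installs is itself a positional enumeration (the frame here is made up):

```python
import pandas as pd

# a frame with a non-default index
df = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])

# reset_index() moves the old index into a column and installs a fresh
# RangeIndex 0..n-1, which is itself a row enumeration
out = df.reset_index()
print(out.index.tolist())     # [0, 1, 2]
print(out['index'].tolist())  # ['a', 'b', 'c']
```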

@jreback (Contributor) commented Mar 17, 2016

No; for a groupby this is expensive if you don't have an array implementation (which is why it exists).

@johne13 (Author) commented Mar 17, 2016

I guess I am missing something. I couldn't do this, for example:
df['new'] = df[ df.x == 2 ].reset_index()
because obviously I have reset the index, so I can't correctly align it with the original dataset.

I should have been more clear about that aspect. I am generally trying to create a new column when I mistakenly attempt to use the nonexistent cumcount method.

@jreback (Contributor) commented Mar 17, 2016

OK, not averse to adding this method for consistency (among the other cum* methods), but it should return a Series.

Pretty trivial implementation if you would like to submit a PR.

@jreback (Contributor) commented Mar 17, 2016

The signature would be identical to the .cum* methods (at the end of core/generic.py):

In [1]: pd.DataFrame.cumsum?
Signature: pd.DataFrame.cumsum(self, axis=None, dtype=None, out=None, skipna=True, **kwargs)
Docstring:
Return cumulative sum over requested axis.

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

Returns
-------
sum : Series
File:      ~/pandas/pandas/core/generic.py
Type:      instancemethod

@max-sixty (Contributor)

Potentially not as fast (?), and handles NAs differently from the case above, but very idiomatic:

In [12]: df.where(df==2).expanding().count()
Out[12]: 
     x
0  0.0
1  1.0
2  1.0
3  2.0

@max-sixty (Contributor)

Taken a step further: we could offer an argument for changing the NaN behavior and offer all the cum* behavior with .expanding().

@jreback (Contributor) commented Mar 17, 2016

@MaximilianR actually that's a good point. Would you make a PR to:

  • cross-link the doc-strings for .cum* to .expanding().* (for sum, prod, max, min, which are the only ones supported)
  • mention these similarities in computation.rst if you find a nice place (I see you linked there); just review this then

@jreback (Contributor) commented Mar 17, 2016

maybe give a tiny example

In [7]: s = Series([1,2,np.nan,3,np.nan,4])

In [8]: s.cumsum()
Out[8]: 
0     1.0
1     3.0
2     NaN
3     6.0
4     NaN
5    10.0
dtype: float64

In [9]: s.expanding().sum()
Out[9]: 
0     1.0
1     3.0
2     3.0
3     6.0
4     6.0
5    10.0
dtype: float64

In [10]: s.cumsum().ffill()
Out[10]: 
0     1.0
1     3.0
2     3.0
3     6.0
4     6.0
5    10.0
dtype: float64

@max-sixty (Contributor)

@MaximilianR actually that's a good point. would you make a PR to:

I am jammed at the moment (although free enough to cruise the feed, it seems), but I can add that if no one else picks it up over the next few weeks

@jreback (Contributor) commented Mar 17, 2016

ok, let me make an issue so we don't forget.

@johne13 (Author) commented Mar 17, 2016

@MaximilianR I think you can get the same behavior as in my earlier example by using a boolean mask instead of where:

In [504]: df['new'] = df.x[ df.x == 2 ].expanding().count()

In [505]: df
Out[505]: 
   x  new
0  1  NaN
1  2  1.0
2  1  NaN
3  2  2.0

@johne13 (Author) commented Mar 17, 2016

Here's a quick speed test: cumsum is about 4x faster than expanding().count().

In [515]: df = pd.DataFrame({ 'x':np.random.randn(10000) })

In [516]: %timeit df['new1'] = df.x[ df.x > 0 ].expanding().count()
100 loops, best of 3: 4.57 ms per loop

In [517]: %%timeit
     ...: df.loc[ df.x > 0, 'new2' ] = 1
     ...: df['new2'] = df['new2'].cumsum()
1000 loops, best of 3: 1.12 ms per loop

@jreback (Contributor) commented Mar 17, 2016

Try assigning to a range; it's much faster still. This is the implementation.

We were not suggesting using expanding().count() here; it does a lot more work and is very careful about min_periods and such.
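jreback doesn't spell out the range assignment; one possible reading, for the filtered-column case from the timings above (the column name 'new3' is made up here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 1, 2]})
mask = df.x == 2

# assign 1..k directly to the selected rows; pandas aligns on the index,
# so the unselected rows are left as NaN
df.loc[mask, 'new3'] = np.arange(1, mask.sum() + 1)
print(df['new3'].tolist())  # [nan, 1.0, nan, 2.0]
```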

@johne13 (Author) commented Mar 17, 2016

@jreback So do you still think cumcount is worthwhile or is expanding().count() the way to go?

Sorry, don't understand what you mean by "assigning to a range"?

If you still like the cumcount idea I'll give it a shot though I have not done a PR before. Of course if someone else wants to do it I'd gladly defer to them.

@jreback (Contributor) commented Mar 17, 2016

I think it rounds out the cum* functions a bit and would be OK.

@johne13 (Author) commented Mar 19, 2016

OK, I think I have an initial attempt and will post the code and example in a moment, but first want to list other similar functions as a point of comparison:

  • cumsum() -- outputs series or dataframe, dtype=float
  • (groupby)cumcount() -- outputs a series, numbered 0,1,2, ... n-1, dtype=int
  • expanding().count() -- outputs series or dataframe, numbered 1,2,3, ... n, dtype=float

Starting from there, it made the most sense to me to aim for something roughly consistent with cumsum() and expanding().count(), not (groupby) cumcount(). Example to follow; see what you think.

@johne13 (Author) commented Mar 19, 2016

Here's the code, added to generic.py just after cummax (cumsum is also shown for comparison).

        cls.cumcount = _make_cum_function(
            'count', name, name2, axis_descr, "cumulative count",
            lambda y, axis: np.cumsum(~np.isnan(y), axis).astype(float), np.nan, np.nan)

        cls.cumsum = _make_cum_function(
            'sum', name, name2, axis_descr, "cumulative sum",
            lambda y, axis: y.cumsum(axis), 0., np.nan)

It's a little less natural than the other cum functions, but that was the best fit I could find for the existing _make_cum_function() infrastructure. It's basically cumsum, with extra code to convert non-NaN values to ones while keeping the same treatment of NaNs as cumsum.

In some quickie timings (1000 rows and 2 columns) it is just a little slower than cumsum() and much faster than expanding().count(). Exactly as expected, I think.

df = pd.DataFrame({ 'x':[1,1,np.nan,1], 'y':range(4)  })

df.cumsum()
Out[18]: 
     x    y
0  1.0  0.0
1  2.0  1.0
2  NaN  3.0
3  3.0  6.0

df.cumcount()
Out[19]: 
     x    y
0  1.0  1.0
1  2.0  2.0
2  NaN  3.0
3  3.0  4.0

df.expanding().count()
Out[20]: 
     x    y
0  1.0  1.0
1  2.0  2.0
2  2.0  3.0
3  3.0  4.0

And here's groupby/cumcount for comparison:

df = pd.DataFrame({ 'x':[1,1,np.nan,1], 'y':range(4), 'grp':[1,1,1,1] })

df.groupby('grp').cumcount()
Out[40]: 
0    0
1    1
2    2
3    3
dtype: int64

@jorisvandenbossche (Member)

The problem with making it consistent with groupby().cumcount() is that that function does not take the values into account (whether they are NaN or not); it just numbers the entries in the group regardless of the values. That is also the reason it returns a Series.

So if we want to make Series/DataFrame.cumcount consistent with the existing method, then it should also not disregard NaN values (as you do in your example implementation).
But I suppose this defeats part of the reason you want a Series/DataFrame.cumcount?
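A small sketch of that discrepancy (the frame is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, np.nan, 1], 'grp': [1, 1, 1]})

# groupby().cumcount() enumerates rows regardless of NaN values...
print(df.groupby('grp').cumcount().tolist())  # [0, 1, 2]

# ...while a value-based cumulative count skips the NaN,
# as expanding().count() does
print(df['x'].expanding().count().tolist())  # [1.0, 1.0, 2.0]
```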

@johne13 (Author) commented Mar 19, 2016

@jorisvandenbossche I really hadn't thought about how exactly a Series/DataFrame.cumcount would work until I started comparing it to all the existing functions (count/cumsum/groupby.cumcount). Then I realized it couldn't be consistent with all of them, only some.

Out of all the comparison functions, it seems to me that it ought to work approximately the same as expanding().count(), right? And conversely, the way groupby.cumcount works makes the least sense to me, but of course I'll gladly defer to you and others on how to best make this consistent with the rest of pandas.

@jreback (Contributor) commented Mar 19, 2016

This is almost a trivial function, should be something like:

def cumcount(self, axis=0, skipna=True):
    ax = self._get_axis(axis)
    if skipna and ax.hasnans:
        # enumerate only the non-null labels; null positions
        # repeat the previous count
        result = ax.notnull().cumsum() - 1
    else:
        result = range(len(ax))
    return Series(result, index=ax)

this is really just a range; if the index has NaNs then it's slightly different.

@jreback (Contributor) commented Mar 19, 2016

I guess it technically should have an ascending arg to be consistent. Further, .groupby().cumcount() should have skipna.

This should mirror the .cum* signatures.

@jorisvandenbossche (Member)

@jreback But I would also find it very unfortunate that groupby().cumcount() and groupby().apply(lambda x: x.cumcount()) would not be the same...
And it is, as far as I can see, not possible to have both that and consistency with Series.cumsum and the others.

@jorisvandenbossche (Member)

Actually, to phrase the question in another way: should cumcount work on the index or on the values?

The example implementation of @johne13 works on the values (ignoring NaNs there), while the example implementation of @jreback works on the index.

@jreback (Contributor) commented Mar 19, 2016

@jorisvandenbossche actually that's a good point. It should work on the values, similar to .cum*.

@johne13 (Author) commented Mar 19, 2016

Sorry, some of these details are over my head; I have never really delved into pandas internals before. I tried to make it work as similarly to .cum* as possible; that's why it is done via _make_cum_function() like the other cum functions rather than as a standalone. The signature is comparable to cumsum.

In [28]: pd.DataFrame.cumcount?
Signature: pd.DataFrame.cumcount(self, axis=None, dtype=None, out=None, skipna=True, **kwargs)
Docstring:
Return cumulative count over requested axis.

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

Returns
-------
count : Series
File:      ~/pandas-johne13/pandas/core/generic.py
Type:      function

@johne13 (Author) commented Mar 19, 2016

And also regarding groupby.cumcount(): it actually doesn't agree with groupby.count(), in that its ending value is n-1 whereas count returns n.

@jreback (Contributor) commented Mar 19, 2016

@johne13 No: .count() is the number of non-NaNs, whereas .cumcount() is an enumeration.

@jorisvandenbossche (Member)

@jorisvandenbossche actually that's a good point. it should work on the values, similarl to cum*.

And that's the problem: for groupby().cumcount, it works on the index (actually, it is not really defined there, but since it gives back a Series regardless of how many columns you have, it is as if it worked on the index).

So in that sense, groupby().cumcount is not really a cumulative count, but rather a cumulative size... (like you have both groupby().count and groupby().size)

@johne13 (Author) commented Mar 19, 2016

Right, I'm just trying to note what seems consistent and what doesn't. Along those lines, note that the following relationship holds between sum and cumsum:

df.cumsum().iloc[-1] == df.sum()

The same relationship holds for prod/cumprod, max/cummax, and min/cummin but not for the groupby versions of count/cumcount. So out of all these relationships, groupby().cumcount seems like the one that is somewhat inconsistent with other pandas functions.
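That relationship can be checked directly; a minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# the last row of the cumulative sum equals the full reduction
last = df.cumsum().iloc[-1]
print(last.tolist())      # [6, 15]
print(df.sum().tolist())  # [6, 15]
```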

@johne13 (Author) commented Mar 19, 2016

Yeah, as Joris notes, groupby.cumcount seems more analogous to groupby.size than to groupby.count except for the n vs n-1 aspect.

@johne13 (Author) commented Mar 21, 2016

@jreback @jorisvandenbossche FYI, I will abandon this for now, pending further instructions. I'm not quite sure what, if anything, you decided about how best to proceed. And this might not be a good first project for someone who has never done a PR and is not super familiar with the underlying code, so if someone else wants to do something with it I'm happy to step aside.

@johne13 (Author) commented Mar 21, 2016

One other point of comparison for a hypothetical cumcount() is rank(method='first') on an array of ones. Playing around with this a little, it seems to do approximately the same thing as expanding().count() but is much faster -- actually faster than cumsum() in my test example, though I imagine that doesn't hold in all cases, since adding must generally be faster than sorting (?).
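The rank-based variant mentioned here can be sketched like this: with equal values, method='first' breaks ties by position, so ranking yields an in-order enumeration of the non-NaN entries (the series is made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 1.0, 1.0])

# equal values ranked with method='first' are numbered 1..k in order
# of appearance; NaNs stay NaN
r = s.rank(method='first')
print(r.tolist())  # [1.0, nan, 2.0, 3.0]
```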

@jreback (Contributor) commented Mar 21, 2016

@johne13 almost everything is faster than sorting :) Sorting is O(n log n); people go to great lengths to avoid a sort.

@johne13 (Author) commented Mar 21, 2016

Right, except in this case it's "sorting" a column of ones, so it's just a matter of how long it takes the algorithm to confirm that it's already sorted, there is no actual sorting.

But I mainly mentioned rank() just as another function to consider in terms of consistency with how other existing functions behave.

@jreback (Contributor) commented Mar 21, 2016

We have very fast routines to determine whether something is monotonic, which can tell very quickly whether it is already 'sorted'.

In [2]: Index([1,2,3]).is_monotonic
Out[2]: True

In [3]: Index([3,2,1]).is_monotonic
Out[3]: False

@jreback jreback modified the milestones: 0.18.2, 0.18.1 Apr 26, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Aug 13, 2016
@jreback jreback modified the milestones: Next Major Release, 0.20.0 Apr 12, 2017
@mroeschke mroeschke added the "Numeric Operations" (Arithmetic, Comparison, and Logical operations) label and removed the "API Design" and "good first issue" labels Apr 23, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022