ENH: cumcount() for DataFrames and Series #12648

Open

johne13 opened this issue Mar 16, 2016 · 39 comments
Labels
Enhancement · Numeric Operations (Arithmetic, Comparison, and Logical operations)

Comments

@johne13 commented Mar 16, 2016

Unless I am missing something obvious, we can do this:
df.groupby('x').cumcount()
but not:
df.cumcount()

Obviously there are workarounds, but it seems like we should be able to do this on a DataFrame/Series if we can do it on a groupby object?
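For reference, a minimal sketch of the asymmetry being described (the values here are made up):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 1, 2]})

# cumcount() exists on a groupby object: it enumerates rows within each group
print(df.groupby('x').cumcount().tolist())  # [0, 0, 1, 1]

# ...but there is no equivalent method on the frame itself
print(hasattr(df, 'cumcount'))  # False
```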

@johne13 johne13 changed the title Enhancement request: cumcount() for DataFrames and Series ENH: cumcount() for DataFrames and Series Mar 16, 2016
@jreback (Contributor) commented Mar 16, 2016

An Index already does this.

Please show a use case.

@jorisvandenbossche (Member)

@johne13 you aren't confusing it with cumsum, are you? Besides obtaining it from an index, you can also just do range(len(df)).
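The range(len(df)) suggestion, sketched out (keeping the original index so the result aligns with the frame):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 1, 2]})

# enumerate all rows; reuse the frame's index for alignment
counter = pd.Series(range(len(df)), index=df.index)
print(counter.tolist())  # [0, 1, 2, 3]
```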

@johne13 (Author) commented Mar 17, 2016

It is not a super-common need, to be sure, but there are cases like this that I bump into somewhat regularly:
http://stackoverflow.com/questions/36044890/how-to-speed-up-a-very-slow-pandas-apply-function/36047836#36047836

Simpler example:

df = pd.DataFrame({ 'x':[1,2,1,2] })
#df[ df.x == 2 ].cumcount()    # doesn't work of course

df.loc[ df.x == 2, 'y'] = 1    # so I do a two step with cumsum as 2nd step
df.y.cumsum()

Out[428]: 
0    NaN
1    1.0
2    NaN
3    2.0

So yeah, lots of easy ways to do this on ENTIRE dataset but as soon as you want to do a selection, it gets a little messier. Or again, maybe I am missing something obvious.

Also, I would think whatever justification there is for having this as a groupby method applies just as much to having it work for a dataframe or series, no?

@jreback (Contributor) commented Mar 17, 2016

the common idiom is

.reset_index()
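One reading of the .reset_index() idiom, sketched: the fresh RangeIndex it installs is itself a positional enumeration (the frame here is made up):

```python
import pandas as pd

# a frame with a non-default index
df = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])

# reset_index() moves the old index into a column and installs a fresh
# RangeIndex 0..n-1, which is itself a row enumeration
out = df.reset_index()
print(out.index.tolist())     # [0, 1, 2]
print(out['index'].tolist())  # ['a', 'b', 'c']
```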

@jreback (Contributor) commented Mar 17, 2016

No; for a groupby this is expensive if you don't have an array implementation (which is why it exists).

@johne13 (Author) commented Mar 17, 2016

I guess I am missing something. I couldn't do this, for example:
df['new'] = df[ df.x == 2 ].reset_index()
because obviously I have reset the index, so I can't correctly align it with the original dataset.

I should have been more clear about that aspect. I am generally trying to create a new column when I mistakenly attempt to use the nonexistent cumcount method.

@jreback (Contributor) commented Mar 17, 2016

OK, not averse to adding this method for consistency (among the other cum* methods), but it should return a Series.

Pretty trivial implementation if you would like to submit a PR.

@jreback (Contributor) commented Mar 17, 2016

The signature would be identical to the .cum* methods (at the end of core/generic.py):

In [1]: pd.DataFrame.cumsum?
Signature: pd.DataFrame.cumsum(self, axis=None, dtype=None, out=None, skipna=True, **kwargs)
Docstring:
Return cumulative sum over requested axis.

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

Returns
-------
sum : Series
File:      ~/pandas/pandas/core/generic.py
Type:      instancemethod

@max-sixty (Contributor)

Potentially not as fast (?), and handles NAs differently from the case above, but very idiomatic:

In [12]: df.where(df==2).expanding().count()
Out[12]: 
     x
0  0.0
1  1.0
2  1.0
3  2.0

@max-sixty (Contributor)

Taken a step further: we could offer an argument for changing the NaN behavior and offer all the cum* behavior with .expanding().

@jreback (Contributor) commented Mar 17, 2016

@MaximilianR actually that's a good point. Would you make a PR to:

  • cross-link the doc-strings for .cum* to .expanding().* (for sum, prod, max, min, which are the only ones supported)
  • mention these similarities in computation.rst if you find a nice place (I see you linked there); just review this then

@jreback (Contributor) commented Mar 17, 2016

maybe give a tiny example

In [7]: s = Series([1,2,np.nan,3,np.nan,4])

In [8]: s.cumsum()
Out[8]: 
0     1.0
1     3.0
2     NaN
3     6.0
4     NaN
5    10.0
dtype: float64

In [9]: s.expanding().sum()
Out[9]: 
0     1.0
1     3.0
2     3.0
3     6.0
4     6.0
5    10.0
dtype: float64

In [10]: s.cumsum().ffill()
Out[10]: 
0     1.0
1     3.0
2     3.0
3     6.0
4     6.0
5    10.0
dtype: float64

@max-sixty (Contributor)

@MaximilianR actually that's a good point. would you make a PR to:

I am jammed at the moment (although free enough to cruise the feed, it seems), but I can add that if no one else picks it up over the next few weeks

@jreback (Contributor) commented Mar 17, 2016

ok, let me make an issue so we don't forget.

@johne13 (Author) commented Mar 17, 2016

@MaximilianR I think you can get the same behavior as in my earlier example by using a boolean mask instead of where:

In [504]: df['new'] = df.x[ df.x == 2 ].expanding().count()

In [505]: df
Out[505]: 
   x  new
0  1  NaN
1  2  1.0
2  1  NaN
3  2  2.0

@johne13 (Author) commented Mar 17, 2016

Here's a quick speed test: cumsum is about 4x faster than expanding().count().

In [515]: df = pd.DataFrame({ 'x':np.random.randn(10000) })

In [516]: %timeit df['new1'] = df.x[ df.x > 0 ].expanding().count()
100 loops, best of 3: 4.57 ms per loop

In [517]: %%timeit
     ...: df.loc[ df.x > 0, 'new2' ] = 1
     ...: df['new2'] = df['new2'].cumsum()
1000 loops, best of 3: 1.12 ms per loop

@jreback (Contributor) commented Mar 17, 2016

Try assigning to a range; it's much faster still. This is the implementation.

We were not suggesting using expanding().count() here; it does a lot more work and is very careful about min_periods and such.
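jreback doesn't spell out the range assignment; one possible reading, for the filtered-column case from the timings above (the column name 'new3' is made up here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 1, 2]})
mask = df.x == 2

# assign 1..k directly to the selected rows; pandas aligns on the index,
# so the unselected rows are left as NaN
df.loc[mask, 'new3'] = np.arange(1, mask.sum() + 1)
print(df['new3'].tolist())  # [nan, 1.0, nan, 2.0]
```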

@johne13 (Author) commented Mar 17, 2016

@jreback So do you still think cumcount is worthwhile or is expanding().count() the way to go?

Sorry, don't understand what you mean by "assigning to a range"?

If you still like the cumcount idea I'll give it a shot though I have not done a PR before. Of course if someone else wants to do it I'd gladly defer to them.

@jreback (Contributor) commented Mar 17, 2016

I think it rounds out the cum* functions a bit and would be OK.

@johne13 (Author) commented Mar 19, 2016

OK, I think I have an initial attempt and will post the code and example in a moment, but first want to list other similar functions as a point of comparison:

  • cumsum() -- outputs series or dataframe, dtype=float
  • (groupby)cumcount() -- outputs a series, numbered 0,1,2, ... n-1, dtype=int
  • expanding().count() -- outputs series or dataframe, numbered 1,2,3, ... n, dtype=float

Starting from there, it made the most sense to me to aim for something roughly consistent with cumsum() and expanding().count(), not (groupby) cumcount(). Example to follow; see what you think.

@johne13 (Author) commented Mar 19, 2016

Here's the code, added to generic.py just after cummax (cumsum is also shown for comparison).

        cls.cumcount = _make_cum_function(
            'count', name, name2, axis_descr, "cumulative count",
            lambda y, axis: np.cumsum(~np.isnan(y), axis).astype(float), np.nan, np.nan)

        cls.cumsum = _make_cum_function(
            'sum', name, name2, axis_descr, "cumulative sum",
            lambda y, axis: y.cumsum(axis), 0., np.nan)

It's a little less natural than the other cum functions, but that was the best fit I could find for the existing _make_cum_function() infrastructure. It's basically cumsum, with extra code to convert non-NaN values to ones while keeping the same treatment of NaNs as cumsum.

In some quickie timings (1000 rows and 2 columns) it is just a little slower than cumsum() and much faster than expanding().count(). Exactly as expected, I think.

df = pd.DataFrame({ 'x':[1,1,np.nan,1], 'y':range(4)  })

df.cumsum()
Out[18]: 
     x    y
0  1.0  0.0
1  2.0  1.0
2  NaN  3.0
3  3.0  6.0

df.cumcount()
Out[19]: 
     x    y
0  1.0  1.0
1  2.0  2.0
2  NaN  3.0
3  3.0  4.0

df.expanding().count()
Out[20]: 
     x    y
0  1.0  1.0
1  2.0  2.0
2  2.0  3.0
3  3.0  4.0

And here's groupby/cumcount for comparison:

df = pd.DataFrame({ 'x':[1,1,np.nan,1], 'y':range(4), 'grp':[1,1,1,1] })

df.groupby('grp').cumcount()
Out[40]: 
0    0
1    1
2    2
3    3
dtype: int64

@jorisvandenbossche (Member)

The problem with making it consistent with groupby().cumcount() is that that function does not take the values into account (whether they are NaN or not); it just numbers the entries in the group regardless of the values. That is also the reason it returns a Series.

So if we want to make Series/DataFrame.cumcount consistent with the existing method, then it should also not disregard NaN values (as you do in your example implementation).
But I suppose this defeats part of the reason you want a Series/DataFrame.cumcount?
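A small sketch of that discrepancy (the frame is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, np.nan, 1], 'grp': [1, 1, 1]})

# groupby().cumcount() enumerates rows regardless of NaN values...
print(df.groupby('grp').cumcount().tolist())  # [0, 1, 2]

# ...while a value-based cumulative count skips the NaN,
# as expanding().count() does
print(df['x'].expanding().count().tolist())  # [1.0, 1.0, 2.0]
```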

@johne13 (Author) commented Mar 19, 2016

@jorisvandenbossche I really hadn't thought about how exactly a Series/DataFrame.cumcount would work until I started comparing it to all the existing functions (count/cumsum/groupby.cumcount). Then I realized it couldn't be consistent with all of them, only some.

Out of all the comparison functions, it seems to me that it ought to work approximately the same as expanding().count(), right? And conversely, the way groupby.cumcount works makes the least sense to me, but of course I'll gladly defer to you and others on how to best make this consistent with the rest of pandas.

@jreback (Contributor) commented Mar 19, 2016

This is almost a trivial function, should be something like:

def cumcount(self, axis=0, skipna=True):
    ax = self._get_axis(axis)
    if skipna and ax.hasnans:
        # enumerate only the non-null labels; null positions
        # repeat the previous count
        result = ax.notnull().cumsum() - 1
    else:
        result = range(len(ax))
    return Series(result, index=ax)

this is really just a range; if the index has NaNs then it's slightly different.

@jreback (Contributor) commented Mar 19, 2016

I guess it technically should have an ascending arg to be consistent. Further, .groupby().cumcount() should have skipna.

This should mirror the .cum* signatures.

@jorisvandenbossche (Member)

@jreback But I would also find it very unfortunate that groupby().cumcount() and groupby().apply(lambda x: x.cumcount()) would not be the same...
And it is, as far as I can see, not possible to have both that and consistency with Series.cumsum and the others.

@jorisvandenbossche (Member)

Actually, to phrase the question in another way: should cumcount work on the index or on the values?

The example implementation of @johne13 works on the values (ignoring NaNs there), while the example implementation of @jreback works on the index.

@jreback (Contributor) commented Mar 19, 2016

@jorisvandenbossche actually that's a good point. It should work on the values, similar to .cum*.

@johne13 (Author) commented Mar 19, 2016

Sorry, some of these details are over my head; I have never really delved into pandas internals before. I tried to make it work as similarly to .cum* as possible; that's why it is done via _make_cum_function() like the other cum functions rather than as a standalone. The signature is comparable to cumsum.

In [28]: pd.DataFrame.cumcount?
Signature: pd.DataFrame.cumcount(self, axis=None, dtype=None, out=None, skipna=True, **kwargs)
Docstring:
Return cumulative count over requested axis.

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

Returns
-------
count : Series
File:      ~/pandas-johne13/pandas/core/generic.py
Type:      function

@johne13 (Author) commented Mar 19, 2016

And also regarding groupby.cumcount(): it actually doesn't agree with groupby.count(), in that its ending value is n-1 whereas count returns n.

@jreback (Contributor) commented Mar 19, 2016

@johne13 No: .count() is the number of non-NaNs, whereas .cumcount() is an enumeration.

@jorisvandenbossche (Member)

@jorisvandenbossche actually that's a good point. it should work on the values, similarl to cum*.

And that's the problem: for groupby().cumcount, it works on the index (actually, it is not really defined there, but since it gives back a Series regardless of how many columns you have, it is as if it worked on the index).

So in that sense, groupby().cumcount is not really a cumulative count, but rather a cumulative size... (like you have both groupby().count and groupby().size)

@johne13 (Author) commented Mar 19, 2016

Right, I'm just trying to note what seems consistent and what doesn't. Along those lines, note that the following relationship holds between sum and cumsum:

df.cumsum().iloc[-1] == df.sum()

The same relationship holds for prod/cumprod, max/cummax, and min/cummin but not for the groupby versions of count/cumcount. So out of all these relationships, groupby().cumcount seems like the one that is somewhat inconsistent with other pandas functions.
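That relationship can be checked directly; a minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# the last row of the cumulative sum equals the full reduction
last = df.cumsum().iloc[-1]
print(last.tolist())      # [6, 15]
print(df.sum().tolist())  # [6, 15]
```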

@johne13 (Author) commented Mar 19, 2016

Yeah, as Joris notes, groupby.cumcount seems more analogous to groupby.size than to groupby.count except for the n vs n-1 aspect.

@johne13 (Author) commented Mar 21, 2016

@jreback @jorisvandenbossche FYI, I will abandon this for now, pending further instructions. I'm not quite sure what, if anything, you decided about how best to proceed. And this might not be a good first project for someone who has never done a PR and is not super familiar with the underlying code, so if someone else wants to do something with it I'm happy to step aside.

@johne13 (Author) commented Mar 21, 2016

One other point of comparison for a hypothetical cumcount() is rank(method='first') on an array of ones. Playing around with this a little, it seems to do approximately the same thing as expanding().count() but is much faster -- actually faster than cumsum() in my test example, though I imagine that doesn't hold in all cases, since adding must generally be faster than sorting (?).
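The rank-based variant mentioned here can be sketched like this: with equal values, method='first' breaks ties by position, so ranking yields an in-order enumeration of the non-NaN entries (the series is made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 1.0, 1.0])

# equal values ranked with method='first' are numbered 1..k in order
# of appearance; NaNs stay NaN
r = s.rank(method='first')
print(r.tolist())  # [1.0, nan, 2.0, 3.0]
```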

@jreback (Contributor) commented Mar 21, 2016

@johne13 almost everything is faster than sorting :) Sorting is O(n log n); people go to great lengths to avoid a sort.

@johne13 (Author) commented Mar 21, 2016

Right, except in this case it's "sorting" a column of ones, so it's just a matter of how long it takes the algorithm to confirm that it's already sorted, there is no actual sorting.

But I mainly mentioned rank() just as another function to consider in terms of consistency with how other existing functions behave.

@jreback (Contributor) commented Mar 21, 2016

We have very fast routines to determine whether something is monotonic, which can tell very quickly whether it is already 'sorted'.

In [2]: Index([1,2,3]).is_monotonic
Out[2]: True

In [3]: Index([3,2,1]).is_monotonic
Out[3]: False

@jreback jreback modified the milestones: 0.18.2, 0.18.1 Apr 26, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Aug 13, 2016
@jreback jreback modified the milestones: Next Major Release, 0.20.0 Apr 12, 2017
@mroeschke mroeschke added the "Numeric Operations" (Arithmetic, Comparison, and Logical operations) label and removed the "API Design" and "good first issue" labels Apr 23, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022