ENH: allow 'size' in groupby aggregation #6312


Closed
jreback opened this issue Feb 9, 2014 · 9 comments

@jreback
Contributor

jreback commented Feb 9, 2014

Allow using 'size' in groupby's aggregate, so you can do:

df.groupby(..).agg('size')
df.groupby(..).agg(['mean', 'size'])

http://stackoverflow.com/questions/21660686/pandas-groupby-straight-forward-counting-the-number-of-elements-in-each-group-i

  • count should directly implement size (enh)
  • count/size should be allowed in an aggregation list (the bug)
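A minimal sketch of the requested usage (the column names and data here are made up for illustration; this works in current pandas):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"],
                   "x": [1.0, 2.0, 3.0]})

# 'size' on its own, and alongside other reducers, as requested above
sizes = df.groupby("key")["x"].agg("size")
print(sizes.tolist())  # [2, 1]

out = df.groupby("key")["x"].agg(["mean", "size"])
print(out)
#       mean  size
# key
# a      1.5     2
# b      3.0     3
```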
@eldad-a

eldad-a commented Feb 10, 2014

Please note that the current (slow) count is already allowed in the aggregation list.

@hayd
Contributor

hayd commented Mar 4, 2014

So count counts non-null values, which goes some way to explaining why it is slower.

@jreback
Contributor Author

jreback commented Mar 4, 2014

Yep... this is a very easy fix (just alias count to size), as it's already computed by the group indexer.

@hayd
Contributor

hayd commented Mar 4, 2014

What I mean is, count is a different operation from size: size just cares about the result_index, whilst count cares about whether values are non-null in columns... (same thing with value_counts; sometimes a user may want to count values in another column).
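The distinction described here can be seen directly with a small made-up frame containing a null: size depends only on the group labels, while count depends on non-null values in the chosen column:

```python
import numpy as np
import pandas as pd

# Made-up data: group "a" has one NaN in column 'v'
df = pd.DataFrame({"g": ["a", "a", "b"],
                   "v": [1.0, np.nan, 2.0]})

grp = df.groupby("g")["v"]
print(grp.size().tolist())   # [2, 1] -- size counts every row in the group
print(grp.count().tolist())  # [1, 1] -- count counts only non-null values
```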

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 21, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jorisvandenbossche
Member

@jreback This issue is not really clear to me, as size already exists for groupby?
Or do you mean that you want to be able to do df.groupby(..).agg('size') instead of df.groupby(..).size() (and therefore be able to do agg(['mean', 'size']))?

And I think we don't want to "alias count to size", as size does something different than count.

@jreback
Contributor Author

jreback commented Nov 11, 2016

No, I think we just need to alias size (like we do mean); IOW, add it to the cython table I think (this might work now).

@jorisvandenbossche jorisvandenbossche changed the title BUG: size should be allowed in groupby aggregation ENH: allow 'size' in groupby aggregation Nov 11, 2016
@jorisvandenbossche
Member

@jreback updated top post to clarify the issue

@jreback
Contributor Author

jreback commented Nov 11, 2016

I'll note that we should look at count perf as well (maybe create another issue); it may have been fixed since this issue was opened.

@jreback
Contributor Author

jreback commented Nov 11, 2016

In [7]: df = pd.DataFrame({'x': np.random.randn(50000),  # produce the demo DataFrame
   ...:                    'y': np.random.randn(50000),
   ...:                    'z': np.random.randn(50000)})

In [4]: buckets = {col: np.arange(int(df[col].min()), int(df[col].max()) + 2)
   ...:            for col in df.columns}  # produce the unit bins

In [5]: cats = [pd.cut(df[col], bucket) for col, bucket in buckets.items()]

In [6]: grouped = df.groupby(cats)  # group by the binned x, y, z

In [20]: %timeit -n1 -r1 grouped.x.size()
1 loop, best of 1: 642 µs per loop

In [9]: %timeit -n1 -r1 grouped.x.mean()
1 loop, best of 1: 2.98 ms per loop

In [10]: %timeit -n1 -r1 grouped.x.count()
1 loop, best of 1: 696 µs per loop

In [19]: %timeit -n1 -r1 grouped.x.agg(['mean','size'])
1 loop, best of 1: 1.62 ms per loop

This is actually implemented.

@jreback jreback closed this as completed Nov 11, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Next Major Release Nov 11, 2016