-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC: OrderedDict example in groupby aggregation #12879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@randomgambit there is this In [60]: df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'a', 'b'], 'C': [3, 4, 5]})
In [61]: df
Out[61]:
A B C
0 1 a 3
1 2 a 4
2 3 b 5
In [62]: df.groupby('B').agg({'A': {'mean1': 'mean', 'med1': 'median'}, 'C': {'mean2': 'mean', 'med2': 'median'}})
Out[62]:
C A
mean2 med2 mean1 med1
B
a 3.5 3.5 1.5 1.5
b 5.0 5.0 3.0 3.0 So you pass a dict like # from collections import OrderedDict
In [99]: agg_funcs = OrderedDict([('A', OrderedDict([('mean1', 'mean'), ('med1', 'median')])), ('C', OrderedDict([('mean2', 'mean'), ('med2', 'median')]))])
In [100]: df.groupby('B').agg(agg_funcs)
Out[100]:
A C
mean1 med1 mean2 med2
B
a 1.5 1.5 3.5 3.5
b 3.0 3.0 5.0 5.0 |
excellent! I suggest you add this to the cookbook or to the tutorial. I know many people that are confused by this simple renaming thing. Thanks |
just a follow up: if I remember well this only works with |
you could put a section at the end of http://pandas-docs.github.io/pandas-docs-travis/groupby.html#aggregation |
Hello,
It's me, again. ;-) I would like to submit to your attention another possible improvement your could implement.
Let me start by saying that 90% of the time, data processing in my field involves some form of groupby at the time frequency, panel frequency, etc.
So aggregating data efficiently is definitely something important and very useful.
I clearly prefer pandas'
groupby
over statacollapse
(or others) because it is so much faster.However, a key functionality seems to be missing in Pandas. Usually, people want to apply functions to several columns, and be able to rename the output. In stata, you would write something like
collapse (firstnm) jreback=pandas, by(time)
to create a variable named
jreback
, that contains the first non missing value of the columnpandas
for every group intime
.In Pandas, a similar process seem unnecessarily complex.
I can only use the syntax
group=df.groupby('group').agg({'A':'mean', 'B':['mean','sum']})
which has a major disadvantage
group.column.tolist()
manually to understand which column corresponds to what. That is clearly not efficient (or maybe I am missing something here)It would be therefore useful to add an option
column_names
that allows the user to chose columns names at theagg
level, with the guarantee that the first name correspond to the first column and so on. For instance, in the example above I could specifycol_names=['mean_A','mean_B','this is a sum']
What do you think?
The text was updated successfully, but these errors were encountered: