DOC: OrderedDict example in groupby aggregation #12879

randomgambit · 2016-04-12T12:12:41Z

Hello,

It's me, again. ;-) I would like to submit to your attention another possible improvement your could implement.

Let me start by saying that 90% of the time, data processing in my field involves some form of groupby at the time frequency, panel frequency, etc.

So aggregating data efficiently is definitely something important and very useful.

I clearly prefer pandas' groupby over stata collapse (or others) because it is so much faster.
However, a key functionality seems to be missing in Pandas. Usually, people want to apply functions to several columns, and be able to rename the output. In stata, you would write something like

collapse (firstnm) jreback=pandas, by(time)

to create a variable named jreback, that contains the first non missing value of the column pandas for every group in time.

In Pandas, a similar process seem unnecessarily complex.

I can only use the syntax group=df.groupby('group').agg({'A':'mean', 'B':['mean','sum']})
which has a major disadvantage

no predictability over the sorting order of the columns. That is, there is no guarantee that in group, the first column will be A and the second one will be B. I need to group.column.tolist() manually to understand which column corresponds to what. That is clearly not efficient (or maybe I am missing something here)

It would be therefore useful to add an option column_names that allows the user to chose columns names at the agg level, with the guarantee that the first name correspond to the first column and so on. For instance, in the example above I could specify col_names=['mean_A','mean_B','this is a sum']

What do you think?

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2016-04-12T12:25:52Z

@randomgambit there is this

In [60]: df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'a', 'b'], 'C': [3, 4, 5]})

In [61]: df
Out[61]:
   A  B  C
0  1  a  3
1  2  a  4
2  3  b  5

In [62]: df.groupby('B').agg({'A': {'mean1': 'mean', 'med1': 'median'}, 'C': {'mean2': 'mean', 'med2': 'median'}})
Out[62]:
      C          A
  mean2 med2 mean1 med1
B
a   3.5  3.5   1.5  1.5
b   5.0  5.0   3.0  3.0

So you pass a dict like {original_column: {'output_name': aggfunc}}. As you see this is unordered since python's dicts are undordered. So I'd suggest using an OrderedDict if order is important

# from collections import OrderedDict
In [99]: agg_funcs = OrderedDict([('A', OrderedDict([('mean1', 'mean'), ('med1', 'median')])), ('C', OrderedDict([('mean2', 'mean'), ('med2', 'median')]))])

In [100]: df.groupby('B').agg(agg_funcs)
Out[100]:
      A          C
  mean1 med1 mean2 med2
B
a   1.5  1.5   3.5  3.5
b   3.0  3.0   5.0  5.0

randomgambit · 2016-04-12T12:28:21Z

excellent! I suggest you add this to the cookbook or to the tutorial. I know many people that are confused by this simple renaming thing. Thanks

randomgambit · 2016-04-12T12:29:06Z

just a follow up: if I remember well this only works with agg, right? what about transform?

jreback · 2016-04-12T12:39:00Z

you could put a section at the end of http://pandas-docs.github.io/pandas-docs-travis/groupby.html#aggregation

TomAugspurger added Usage Question Groupby labels Apr 12, 2016

jreback added the Docs label Apr 12, 2016

jreback changed the title ~~additional options for groupby needed~~ DOC: OrderedDict example in groupby aggregation Apr 12, 2016

jreback added this to the Next Major Release milestone Apr 12, 2016

jreback added Difficulty Novice labels Apr 12, 2016

bkandel mentioned this issue Aug 8, 2016

DOC: Added example of using OrderedDict for agg. #13938

Closed

2 tasks

jreback modified the milestones: 0.19.0, Next Major Release Aug 8, 2016

jreback closed this as completed in b7abef4 Aug 8, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOC: OrderedDict example in groupby aggregation #12879

DOC: OrderedDict example in groupby aggregation #12879

randomgambit commented Apr 12, 2016

TomAugspurger commented Apr 12, 2016

randomgambit commented Apr 12, 2016

randomgambit commented Apr 12, 2016

jreback commented Apr 12, 2016

DOC: OrderedDict example in groupby aggregation #12879

DOC: OrderedDict example in groupby aggregation #12879

Comments

randomgambit commented Apr 12, 2016

TomAugspurger commented Apr 12, 2016

randomgambit commented Apr 12, 2016

randomgambit commented Apr 12, 2016

jreback commented Apr 12, 2016