as_index =True for groupby transform inconsistent with agg #15290

Kevin-McIsaac · 2017-02-02T05:17:46Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
index = pd.date_range('10/1/1999', periods=1100, name='Date')

ts = pd.DataFrame({'Entry': np.random.normal(300, 50, 1100), 
                                'Exit':  np.random.normal(300, 50, 1100)}, index)

ts['Year'] = ts.index.year
ts['Day'] = ts.index.weekday_name  # removed in pandas 1.0; use ts.index.day_name() there

zscore = lambda x: (x - x.mean()) / x.std()
ts.groupby(['Year', 'Day'],  as_index=True).transform(lambda g:g).head(5)

Problem description

Unlike agg, when using transform:

as_index=True does not set the index to the groupby keys (i.e., ['Year', 'Day']).
as_index=False does not prepend the groupby columns to the output of the transform, as is done with agg.

I found this inconsistency very confusing and hard to work with.

This is also noted in #5755

Expected Output

If as_index=True, the index of the result should be the groupby keys (i.e., ['Year', 'Day']).

If as_index=False, the output should have the groupby columns (i.e., ['Year', 'Day']) prepended to the transformed columns, as is done with agg.

I've really struggled using transform, partly because its semantics were unclear to me. Specifically:

Which columns is the transform applied to? Is it all columns of the df or only those not in the groupby keys? It looks like it's the latter. This seems to be true with agg too. It would have been helpful to me if this were clearer in the documentation.
Is the transform passed a df of columns or is it applied column by column? My guess is it's the former if the transform takes a df; otherwise it's the latter. This seems to be hinted at in some parts of the documentation, but it could be made more explicit.
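The first question can be checked directly. The following is a minimal sketch, assuming a recent pandas: `day_name()` stands in for `weekday_name`, and the 60-row frame and the seed are arbitrary choices made for illustration. It compares the columns and index that `transform` and `agg` return for the same grouping.

```python
import numpy as np
import pandas as pd

# Small stand-in for the frame in the report (sizes and seed are illustrative).
rng = np.random.default_rng(0)
index = pd.date_range("1999-10-01", periods=60, name="Date")
ts = pd.DataFrame(
    {"Entry": rng.normal(300, 50, 60), "Exit": rng.normal(300, 50, 60)},
    index=index,
)
ts["Year"] = ts.index.year
ts["Day"] = ts.index.day_name()

zscore = lambda x: (x - x.mean()) / x.std()
grouped = ts.groupby(["Year", "Day"])

transformed = grouped.transform(zscore)  # same rows as ts, groupers left out
aggregated = grouped.agg("mean")         # one row per group, groupers in the index

print(list(transformed.columns))         # ['Entry', 'Exit']
print(transformed.index.equals(ts.index))
print(list(aggregated.index.names))      # ['Year', 'Day']
```

Both calls operate only on the non-grouper columns; they differ in what happens to the groupers afterwards.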

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.41-36.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None


jreback · 2017-02-02T13:18:22Z

I think you are confused as to the purpose of .transform. This is a reductive operation on a group to a scalar, which is then broadcast back to the original shape.

So I would say that what you are doing should actually raise a ValueError; we don't really check this fully.

.apply is a more general operator that will infer what you are doing / want.

Further, I think we completely ignore as_index with transform, as it makes no sense. Since it defaults to True, I think we could/should raise if it's not True.

I will mark this as better error reporting.

Kevin-McIsaac · 2017-02-02T21:24:48Z

I concur that I'm confused, however why should transform drop the groupby columns? Why is this more useful than returning the transformed columns with the groupby columns prepended as is done by agg?

jreback · 2017-02-02T21:34:14Z

I think the docs are pretty clear: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation

Kevin-McIsaac · 2017-02-02T22:47:37Z

The docs state "The transform method returns an object that is indexed the same", so it is clear that the index labels are the same. However, they are silent on:

Which columns are transformed. Excluding the "by" columns makes sense and is the same as agg.
What happens to the "by" columns, and why dropping them is the right choice whereas with agg they are retained.

jreback · 2017-02-02T23:34:27Z

Well, I think that should be clear: the semantics are exactly the same as .agg. IOW, on a DataFrameGroupBy you get all columns but the groupers, and on a SeriesGroupBy you get that column.

.transform is essentially a reductive operation on each group which is then broadcast back to the original size.
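A toy illustration of "reduce, then broadcast back" may help here. This is only a sketch: the frame and the column names `key` and `val` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 3.0, 10.0]})

# agg reduces each group to a single row, indexed by the group key.
reduced = df.groupby("key")["val"].agg("mean")

# transform computes the same per-group value but repeats it for every
# original row, so the result keeps the index of df.
broadcast = df.groupby("key")["val"].transform("mean")

print(reduced.tolist())    # [2.0, 10.0]
print(broadcast.tolist())  # [2.0, 2.0, 10.0]
```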

Kevin-McIsaac · 2017-02-03T00:02:13Z

Yes, the semantics are the same; making point 1 explicit (both for agg and transform) in the docs would help new users like myself.

What about point 2: why is it the right thing to drop the grouper columns, which is not how .agg works?

jreback · 2017-02-03T00:05:54Z

Well, if you want to, a pull request would be helpful.

It's a transform; by definition you are only affecting the non-groupers. You are returning the original, just with replaced data for the non-groupers.

It rarely makes sense to perform an operation on grouped columns. For example, normalizing doesn't make any sense on datetimes or strings. Sure, it makes sense on integers (talking about groupers here), but that would be just weird.

Kevin-McIsaac · 2017-02-03T00:13:22Z

I'm not suggesting transforming the groupers, just prepending them to the transformed columns, which is the same as the .agg semantics.

The transformed DataFrame then has the same shape as the original, with the same indexes and same columns. I think that makes more sense.

It also enables as_index=True to have a sensible interpretation.

jreback · 2017-02-03T01:05:50Z

That's not very useful. The point is to align the index on assigning back. IOW, a very common thing to do is:

df = df.assign(transformed=df.groupby(...).foo.transform(...))

which only makes sense if the returned column has the same index as the original.
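A runnable version of that pattern, as a sketch only: the frame, its index labels, and the names `key`, `val` and `centered` are invented for illustration.

```python
import pandas as pd

# Non-default index labels, to show that the new column lines up by label.
df = pd.DataFrame(
    {"key": ["a", "a", "b"], "val": [1.0, 3.0, 10.0]},
    index=[10, 20, 30],
)

# Subtract each group's mean; the result carries the same index as df,
# so assign can attach it as a new column without any reshaping.
df = df.assign(centered=df.groupby("key")["val"].transform(lambda s: s - s.mean()))

print(df["centered"].tolist())  # [-1.0, 1.0, 0.0]
```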

Kevin-McIsaac · 2017-02-03T03:45:17Z

BTW, thank you for persisting with me on this.

Agreed this is common, however if the groupers were retained (i.e., prepended as with .agg) in what I suspect is the common use case there isn't a need to do an assign.

Using my example you propose writing:

ts[['Entry', 'Exit']] = ts.groupby(['Year', 'Day']).transform(zscore)

using assignment to update ts as a side-effect. This works if the index is unique, and is a common approach, but fails if the index is not unique. In my case, the index is only unique within each group.

I'm proposing that the following give the same result by prepending the groupers, as is done with .agg:

ts.groupby(['Year', 'Day']).transform(zscore)

This has the advantages of:

being shorter,
having the same behaviour as .agg,
allowing an interpretation of as_index,
being side-effect free.

Of course, if I only want to apply the transform to Exit, then there is a problem and I'd need to write:

ts[['Exit']] = ts.groupby(['Year', 'Day']).transform(zscore).Exit

jreback · 2017-02-03T03:50:31Z

you can just use .apply to get what you want. .transform has a specific function. Please really learn why things are before trying to change them.

Kevin-McIsaac · 2017-02-03T04:20:37Z

Fair enough. I did spend about 6 hrs on this before raising the issue. Taking your advice I tried apply in the obvious way:

zscore = lambda x: (x - x.mean()) / x.std()
ts.groupby(['Year', 'Day'],  as_index=True).apply(zscore)

but this gives

ValueError: can only convert an array of size 1 to a Python scalar

so I exclude the groupers

ts.groupby(['Year', 'Day'],  as_index=True).apply(lambda g: zscore(g[['Exit', 'Entry']]))

and then I get the same result as transform. So is this the semantics of transform, i.e., apply excluding groupers?

To get the result I've proposed, I have to write:

def my_transform(group, by, fn):
    transformed = fn(group.drop(by, axis='columns'))
    return pd.concat([group[by], transformed], axis='columns')

ts.groupby(['Year', 'Day']).apply(lambda g:my_transform(g, ['Year', 'Day'], zscore))
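If the goal is only to carry the untouched grouper columns alongside the transformed ones, one possible alternative is to concatenate them with the output of transform instead of going through apply. This is a sketch under assumptions: it uses a recent pandas (`day_name()` in place of `weekday_name`), a smaller illustrative frame, and the unique Date index from the example; it has not been checked against a non-unique index.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the frame in the report.
rng = np.random.default_rng(0)
index = pd.date_range("1999-10-01", periods=60, name="Date")
ts = pd.DataFrame(
    {"Entry": rng.normal(300, 50, 60), "Exit": rng.normal(300, 50, 60)},
    index=index,
)
ts["Year"] = ts.index.year
ts["Day"] = ts.index.day_name()

by = ["Year", "Day"]
zscore = lambda x: (x - x.mean()) / x.std()

# transform keeps the original index, so the grouper columns can be
# placed in front of the transformed columns with a column-wise concat.
result = pd.concat([ts[by], ts.groupby(by).transform(zscore)], axis="columns")

print(list(result.columns))  # ['Year', 'Day', 'Entry', 'Exit']
```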

rhshadrach · 2020-07-12T12:26:56Z

Seems to me that too much is being attempted within one line.

g = ts.groupby(['Year', 'Day'])
mean, std = g.transform('mean'), g.transform('std')
(ts[['Entry', 'Exit']] - mean)/std

This also has the benefit of not using apply; it should also be more performant, but I haven't tested this.
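A quick way to confirm that the split version matches the single-lambda transform, sketched on an illustrative frame (recent pandas assumed; the sizes, the seed, and the explicit `cols` selection are choices made for this example). It checks equality only and says nothing about speed.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the frame in the report.
rng = np.random.default_rng(0)
index = pd.date_range("1999-10-01", periods=60, name="Date")
ts = pd.DataFrame(
    {"Entry": rng.normal(300, 50, 60), "Exit": rng.normal(300, 50, 60)},
    index=index,
)
ts["Year"] = ts.index.year
ts["Day"] = ts.index.day_name()

cols = ["Entry", "Exit"]
g = ts.groupby(["Year", "Day"])[cols]

# Two built-in reductions broadcast back, then plain arithmetic.
mean, std = g.transform("mean"), g.transform("std")
vectorized = (ts[cols] - mean) / std

# The same z-score computed with a Python lambda per group.
via_lambda = g.transform(lambda x: (x - x.mean()) / x.std())

print(np.allclose(vectorized, via_lambda))
```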

I think this issue can be closed.

jreback added the Difficulty Intermediate, Error Reporting (Incorrect or improved errors from pandas), and Groupby labels Feb 2, 2017

jreback added this to the 0.20.0 milestone Feb 2, 2017

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

h-vetinari mentioned this issue Aug 30, 2018

API: groupby aggregation with apply does not drop groupby-column #22542

Closed

jbrockmendel removed Difficulty Intermediate labels Oct 21, 2019

rhshadrach added the Apply (Apply, Aggregate, Transform, Map) label Aug 14, 2020

mroeschke added the Enhancement label May 8, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

rhshadrach closed this as completed Apr 23, 2023
