Skip to content

as_index =True for groupby transform inconsistent with agg #15290

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Kevin-McIsaac opened this issue Feb 2, 2017 · 13 comments
Closed

as_index =True for groupby transform inconsistent with agg #15290

Kevin-McIsaac opened this issue Feb 2, 2017 · 13 comments
Labels
Apply Apply, Aggregate, Transform, Map Enhancement Error Reporting Incorrect or improved errors from pandas Groupby

Comments

@Kevin-McIsaac
Copy link

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
index = pd.date_range('10/1/1999', periods=1100, name='Date')

ts = pd.DataFrame({'Entry': np.random.normal(300, 50, 1100), 
                                'Exit':  np.random.normal(300, 50, 1100)}, index)

ts['Year'] = ts.index.year
ts['Day'] = ts.index.weekday_name

zscore = lambda x: (x - x.mean()) / x.std()
ts.groupby(['Year', 'Day'],  as_index=True).transform(lambda g:g).head(5)

Problem description

Unlike agg, when using transform

  1. as_index =True does not set the index to the groupby by (i.e ['Year', 'Day']).
  2. as_index = False does not prepend the groupby by column to the output of the transform as is done with agg.

I found this inconsistency very confusing and hard to work with.

This is also noted in ##5755

Expected Output

If as_index =True, the index of the result should the groupby by (i.e ['Year', 'Day']).

if as_index =False, the output should have the the groupby by (i.e ['Year', 'Day']) prepended to the as is done with agg.

I've really struggled using transform, partly as the semantics of transform were unclear to me. Specifically :

  1. Which columns is the transform applied to. Is it all columns of the df or only those not in the groupby by? It looks like its the latter. This seems to be true with agg too. It would have been helpful to me if this was clearer in the documentation

  2. Is the transform passed a df of columns or is it applied column by column? My guess is it's the former if the transform takes a df otherwise it's the latter. This seems to be hinted at in some parts of the documentation but it could be made more explicit.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Linux OS-release: 4.4.41-36.55.amzn1.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Feb 2, 2017

I think you are confused as to the purpose of .transform. This is a reductive operation on a group to a scalar, which is then broadcast back to the original shape.

So I would say that what you are doing should actually raise a ValueError; we don't really check this fully.

.apply is a more general operator that will infer what you are doing / want.

Further I think we completely ignore as_index with transform as it makes no sense. Since it defaults to True, I think we could/should raise if its not True.

I will mark this as a better error reporting

@jreback jreback added Difficulty Intermediate Error Reporting Incorrect or improved errors from pandas Groupby labels Feb 2, 2017
@jreback jreback added this to the 0.20.0 milestone Feb 2, 2017
@Kevin-McIsaac
Copy link
Author

I concur that I'm confused, however why should transform drop the groupby columns? Why is this more useful than returning the transformed columns with the groupby columns prepended as is done by agg?

@jreback
Copy link
Contributor

jreback commented Feb 2, 2017

I think the docs are pretty clear: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation

@Kevin-McIsaac
Copy link
Author

The docs states "The transform method returns an object that is indexed the same", so it is clear that the index labels are the same, however it is silent on

  1. Which columns are transformed. Excluding the "by" columns makes sense and is the same as agg.
  2. What happens to the "by" columns and why dropping them is the right choice where as with agg they are retained.

@jreback
Copy link
Contributor

jreback commented Feb 2, 2017

well I think that should be clear, the semantics are exactly the same as .agg, IOW on a DataFrameGroupby you get all columns but the groupers, and on a SeriesGrouppBy you get that column.

.transform is essentially a reductive operation on each group which is then broadcast back to the original size.

@Kevin-McIsaac
Copy link
Author

Yes the semantics are the same, making point 1 explicit (both for agg and transform) in the docs would help new users like myself.

What about point 2, why is it the right thing to drop the groupers columns, which is not how .agg works?

@jreback
Copy link
Contributor

jreback commented Feb 3, 2017

well, if you want to a pull-request would be helpful.

its a transform, by-definition you are only affecting the non-groupers. you are returning the original just with replaced data for the non-groupers.

its rarely makes sense to perform an operation on grouped columns. for example normalizing doesn't make any sense on datetimes or strings, sure it makes sense on intergers (talking about groupers here), but that would be just weird.

@Kevin-McIsaac
Copy link
Author

I'm not suggesting transforming the groupers just prepending them to the transformed columns, which is the same as the .agg semantics.

The transformed DataFrame then has the same shape as the original, with the same indexes and same columns. I think that makes more sense.

It also enable as_index = True to have a sensible interpretation.

@jreback
Copy link
Contributor

jreback commented Feb 3, 2017

that's not very useful. The point is to align the index on assign back. IOW a very common thing to do is.

df = df.assign(transformed=df.groupby(...).foo.transform(...))

which only makes sense if the returned column has the same index as the original.

@Kevin-McIsaac
Copy link
Author

BTW, thank you for persisting with me on this.

Agreed this is common, however if the groupers were retained (i.e., prepended as with .agg) in what I suspect is the common use case there isn't a need to do an assign.

Using my example you propose writing:

ts[['Entry', 'Exit']] = ts.groupby(['Year', 'Day']).transform(zscore)

using assignment to update ts as a side-effect. This works if the index is unique, and is a common approach, but fails if the index is not unique. In my case, the index is only unique within each group.

I'm proposing the following give the same result by pre-pended the groupers, as is done with .agg

ts.groupby(['Year', 'Day']).transform(zscore)

This has the advantages of:

  1. being shorter,
  2. the same behaviour as .agg
  3. allows an interpretation of as_index
  4. being side effect free.

Of course if I only want to apply the transform just Exit, then there is a problem and I'd need to write

ts[['Exit']] = ts.groupby(['Year', 'Day']).transform(zscore).Exit

@jreback
Copy link
Contributor

jreback commented Feb 3, 2017

you can just use .apply to get what you want. .transform has a specific function. Please really learn why things are before trying to change them.

@Kevin-McIsaac
Copy link
Author

Fair enough. I did spend about 6 hrs on this before raising the issue. Taking your advice I tried apply in the obvious way:

zscore = lambda x: (x - x.mean()) / x.std()
ts.groupby(['Year', 'Day'],  as_index=True).apply(zscore)

but this gives

ValueError: can only convert an array of size 1 to a Python scalar

so I exclude the groupers

ts.groupby(['Year', 'Day'],  as_index=True).apply(lambda g: zscore(g[['Exit', 'Entry']]))

and then I get the same result as transform. So is this the semantics of transform, i.e., apply excluding groupers.

To get the result I've proposed I have to write

def my_transform(group, by, fn):
    transformed = fn(group.drop(by, axis='columns'))
    return pd.concat([group[by], transformed], axis='columns')

ts.groupby(['Year', 'Day']).apply(lambda g:my_transform(g, ['Year', 'Day'], zscore))

@rhshadrach
Copy link
Member

rhshadrach commented Jul 12, 2020

Seems to me too much is attempted being done within one line.

g = ts.groupby(['Year', 'Day'])
mean, std = g.transform('mean'), g.transform('std')
(ts[['Entry', 'Exit']] - mean)/std

This also has the benefit of not using apply, it should also be more performant but I haven't tested this.

I think this issue can be closed.

@rhshadrach rhshadrach added the Apply Apply, Aggregate, Transform, Map label Aug 14, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Enhancement Error Reporting Incorrect or improved errors from pandas Groupby
Projects
None yet
Development

No branches or pull requests

5 participants