as_index =True for groupby transform inconsistent with agg #15290
Comments
I think you are confused as to the purpose of So I would say that what you are doing should actually raise a
Further I think we completely ignore I will mark this as a better error reporting |
I concur that I'm confused, however why should transform drop the groupby columns? Why is this more useful than returning the transformed columns with the groupby columns prepended as is done by agg? |
I think the docs are pretty clear: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation |
The docs state "The transform method returns an object that is indexed the same", so it is clear that the index labels are the same; however, it is silent on
|
well I think that should be clear, the semantics are exactly the same as
|
Yes, the semantics are the same; making point 1 explicit (both for agg and transform) in the docs would help new users like myself. What about point 2: why is it the right thing to drop the grouper columns, which is not how .agg works? |
Well, if you want to, a pull request would be helpful. It's a transform; by definition you are only affecting the non-groupers. You are returning the original, just with replaced data for the non-groupers. It rarely makes sense to perform an operation on grouped columns. For example, normalizing doesn't make any sense on datetimes or strings. Sure, it makes sense on integers (talking about groupers here), but that would be just weird. |
I'm not suggesting transforming the groupers, just prepending them to the transformed columns, which is the same as the .agg semantics. The transformed DataFrame then has the same shape as the original, with the same index and the same columns. I think that makes more sense. It also enables as_index=True to have a sensible interpretation. |
that's not very useful. The point is to align the index on assign back. IOW a very common thing to do is.
which only makes sense if the returned column has the same index as the original. |
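A minimal sketch of that assign-back pattern, for illustration only. The frame `ts`, its values, and the `zscore` helper are made up here (the column names are borrowed from the examples elsewhere in this thread). It shows that the transform result carries the same index as the input, so it can be assigned straight back as a new column.

```python
import pandas as pd

# Hypothetical data; column names follow the examples in this thread.
ts = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017, 2017],
    "Day": [1, 1, 1, 1, 1, 1],
    "Entry": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "Exit": [2.0, 4.0, 6.0, 5.0, 5.5, 6.0],
})


def zscore(x):
    # Standardize within a group: subtract the mean, divide by the sample std.
    return (x - x.mean()) / x.std()


# Selecting a single column yields a Series indexed like ts,
# so the assignment aligns row for row with the original frame.
ts["Exit_z"] = ts.groupby(["Year", "Day"])["Exit"].transform(zscore)
print(ts)
```

The grouper columns stay in `ts` untouched; only the new `Exit_z` column is added.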
BTW, thank you for persisting with me on this. Agreed this is common; however, if the groupers were retained (i.e., prepended as with .agg), then in what I suspect is the common use case there isn't a need to do an assign. Using my example, you propose writing:
    ts[['Entry', 'Exit']] = ts.groupby(['Year', 'Day']).transform(zscore)
using assignment to update ts as a side effect. This works if the index is unique, and is a common approach, but fails if the index is not unique. In my case, the index is only unique within each group. I'm proposing that the following give the same result by prepending the groupers, as is done with .agg:
    ts.groupby(['Year', 'Day']).transform(zscore)
This has the advantages of:
Of course, if I only want to apply the transform to just Exit, then there is a problem and I'd need to write ts[['Exit']] = ts.groupby(['Year', 'Day']).transform(zscore).Exit |
you can just use |
Fair enough. I did spend about 6 hrs on this before raising the issue. Taking your advice I tried apply in the obvious way:
    zscore = lambda x: (x - x.mean()) / x.std()
    ts.groupby(['Year', 'Day'], as_index=True).apply(zscore)
but this gives
    ValueError: can only convert an array of size 1 to a Python scalar
so I exclude the groupers
    ts.groupby(['Year', 'Day'], as_index=True).apply(lambda g: zscore(g[['Exit', 'Entry']]))
and then I get the same result as transform. So is this the semantics of transform, i.e., apply excluding groupers. To get the result I've proposed I have to write
    def my_transform(group, by, fn):
        transformed = fn(group.drop(by, axis='columns'))
        return pd.concat([group[by], transformed], axis='columns')
    ts.groupby(['Year', 'Day']).apply(lambda g: my_transform(g, ['Year', 'Day'], zscore)) |
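For readers who want the groupers kept next to the transformed columns, one alternative to an apply-based helper is to concatenate the grouper columns with the output of transform. This is only a sketch under stated assumptions, not something proposed in this thread: the data, the `by` and `value_cols` names, and the `zscore` helper are made up for illustration, and the example uses a unique default index.

```python
import pandas as pd

# Hypothetical data; column names follow the examples in this thread.
ts = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017, 2017],
    "Day": [1, 1, 1, 1, 1, 1],
    "Entry": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "Exit": [2.0, 4.0, 6.0, 5.0, 5.5, 6.0],
})


def zscore(x):
    # Standardize within a group: subtract the mean, divide by the sample std.
    return (x - x.mean()) / x.std()


by = ["Year", "Day"]             # grouper columns (illustrative name)
value_cols = ["Entry", "Exit"]   # columns to transform (illustrative name)

# transform returns a frame indexed like ts, holding only value_cols.
transformed = ts.groupby(by)[value_cols].transform(zscore)

# Put the untouched grouper columns back in front of the transformed ones.
result = pd.concat([ts[by], transformed], axis="columns")
print(result)
```

The result has the same shape, index, and column order as `ts`, without assigning into `ts` as a side effect.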
Seems to me too much is being attempted in one line.
This also has the benefit of not using I think this issue can be closed. |
Code Sample, a copy-pastable example if possible
Problem description
Unlike agg, when using transform
I found this inconsistency very confusing and hard to work with.
This is also noted in #5755
Expected Output
If as_index=True, the index of the result should be the groupby keys (i.e. ['Year', 'Day']).
If as_index=False, the output should have the groupby keys (i.e. ['Year', 'Day']) prepended, as is done with agg.
I've really struggled using transform, partly because the semantics of transform were unclear to me. Specifically:
Which columns is the transform applied to? Is it all columns of the df or only those not in the groupby keys? It looks like it's the latter. This seems to be true with agg too. It would have been helpful to me if this were clearer in the documentation.
Is the transform passed a df of columns or is it applied column by column? My guess is it's the former if the transform takes a df otherwise it's the latter. This seems to be hinted at in some parts of the documentation but it could be made more explicit.
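A small experiment can make the first question concrete. The sketch below uses made-up data and compares which columns come back from agg (with as_index=True and as_index=False) and from transform when grouping by column names. It only inspects the shape of the results and does not settle how the function is called internally.

```python
import pandas as pd

# Hypothetical data; column names follow the examples in this issue.
ts = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017, 2017],
    "Day": [1, 1, 1, 1, 1, 1],
    "Entry": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "Exit": [2.0, 4.0, 6.0, 5.0, 5.5, 6.0],
})
by = ["Year", "Day"]

# agg reduces each group to one row; as_index decides where the keys go.
agg_indexed = ts.groupby(by, as_index=True).agg("mean")
agg_flat = ts.groupby(by, as_index=False).agg("mean")

# transform keeps one row per input row and returns only the non-grouper columns.
transformed = ts.groupby(by).transform(lambda x: x - x.mean())

print(list(agg_indexed.index.names))  # group keys live in the index
print(list(agg_flat.columns))         # group keys come first as ordinary columns
print(list(transformed.columns))      # group keys are absent
print(transformed.index.equals(ts.index))
```

In this sketch the function is applied only to Entry and Exit in both cases; the difference is that agg reattaches the keys (as index or as columns) while transform returns the original index and leaves the keys out.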
Output of
pd.show_versions()
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None