GroupBy.transform bug/API change #8046

Closed
wesm opened this issue Aug 17, 2014 · 4 comments · Fixed by #8049

@wesm
Member

wesm commented Aug 17, 2014

hi folks,

A Python for Data Analysis reader noted the following issue with recent versions of pandas (as of 1 year ago):

import pandas as pd
from pandas import DataFrame
import numpy as np

def demean(arr):
    return arr - arr.mean()

people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

on pandas 0.14.1:

In [14]: people.groupby(key).transform(demean).groupby(key).mean()
Out[14]: 
            a         b         c         d         e
one -0.228006  0.246737  0.201117  0.250544  0.273858
two  0.342009 -0.370106 -0.301676 -0.375816 -0.410788

on the other hand:

In [15]: people.groupby(key).apply(demean).groupby(key).mean()
Out[15]: 
                a             b             c             d             e
one -3.700743e-17  7.401487e-17 -7.401487e-17  7.401487e-17  0.000000e+00
two  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  5.551115e-17

Looks like transform has undergone some work in recent times; any ideas? I need to look at the book text and see if I can triage by replacing transform with apply. At this point transform feels a little bit anachronistic.

@wesm wesm added the Bug label Aug 17, 2014
@dsm054
Contributor

dsm054 commented Aug 17, 2014

We're not preserving the index order when we return from transform, so key doesn't mean the same thing in each case:

In [119]: df = pd.DataFrame(np.arange(6).reshape(3,2), columns=["a","b"], index=[0,2,1])

In [120]: key = [0,0,1]

In [121]: df
Out[121]: 
   a  b
0  0  1
2  2  3
1  4  5

In [122]: df.groupby(key).transform(lambda x: x-x.mean())
Out[122]: 
   a  b
0 -1 -1
1  0  0
2  1  1

In [123]: df.groupby(key).transform(lambda x: x-x.mean()).groupby(key).mean()
Out[123]: 
     a    b
0 -0.5 -0.5
1  1.0  1.0

In [124]: df.sort_index().groupby(key).transform(lambda x: x-x.mean()).groupby(key).mean()
Out[124]: 
   a  b
0  0  0
1  0  0

Should be straightforward enough to fix.
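
For anyone wanting to check their own pandas version, here is a minimal sketch of the invariant described above, reusing the same small frame. It assumes that a fixed version returns the transform result in the original row order; on an affected version such as 0.14.1 both checks would be expected to fail. The names `index_preserved` and `means_are_zero` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as above: the index is deliberately not sorted.
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"], index=[0, 2, 1])
key = [0, 0, 1]  # positional key: rows 0 and 1 form one group, row 2 the other

demeaned = df.groupby(key).transform(lambda x: x - x.mean())

# The result should line up with the input row for row.
index_preserved = demeaned.index.equals(df.index)

# If the order is preserved, regrouping by the same positional key
# gives group means of (numerically) zero.
group_means = demeaned.groupby(key).mean()
means_are_zero = np.allclose(group_means.to_numpy(), 0.0)

print(index_preserved, means_are_zero)
```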

@wesm
Member Author

wesm commented Aug 17, 2014

A reader noted that this works:

mapc = {'joe':'one', 'steve':'two', 'wes':'one', 'jim':'two', 'travis':'one'}

and now the following produces zero mean:

>>> demeaned = p.groupby(mapc).transform(demean)
>>> demeaned.groupby(mapc).mean()
                a  b             c             d             e
one  7.401487e-17  0  3.700743e-17  3.700743e-17 -4.625929e-17
two  0.000000e+00  0 -1.387779e-17  5.551115e-17  0.000000e+00

@dsm054
Contributor

dsm054 commented Aug 17, 2014

If I'm right about the problem, then it makes sense that using a dictionary should work, because the dictionary doesn't depend upon the order of the indices to establish the group mapping.
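
A small sketch of why a label-based mapping is insensitive to row order. The dictionary here uses capitalized keys so that it matches the `people` index from the first comment; that is an assumption, since the reader's snippet uses lowercase keys and a frame `p` that is not shown. The seed and the name `label_based_ok` are also just for illustration.

```python
import numpy as np
import pandas as pd

def demean(arr):
    return arr - arr.mean()

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
people = pd.DataFrame(rng.standard_normal((5, 5)),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# Label -> group mapping: each row is looked up by its index label,
# so the order in which rows appear does not matter.
mapping = {'Joe': 'one', 'Steve': 'two', 'Wes': 'one',
           'Jim': 'two', 'Travis': 'one'}

demeaned = people.groupby(mapping).transform(demean)

# Deliberately scramble the rows; the dict still assigns the right groups.
shuffled = demeaned.loc[['Travis', 'Jim', 'Wes', 'Steve', 'Joe']]
means = shuffled.groupby(mapping).mean()
label_based_ok = np.allclose(means.to_numpy(), 0.0)

print(label_based_ok)
```

A positional list such as `key` has no labels to look up, so it only gives the right groups when the rows are in the order the list was written for.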

@jreback
Contributor

jreback commented Aug 17, 2014

@wesm @dsm054 ok, #8049 fixes this (I already had a routine to put the transform result back in the original order; it was needed for the change that stops Index from subclassing ndarray). But IIRC this sorting logic has been in place for quite a while (so quite possibly since 0.11 or so).
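
To illustrate the general idea of putting a group-wise result back in the original order, here is a rough sketch. It is not the routine from #8049 or pandas's actual implementation, just one way to do it by hand: remember each row's original position while processing group by group, then undo the permutation with `np.argsort`. The interleaved `key` and all variable names are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"], index=[0, 2, 1])
key = np.array([0, 1, 0])  # interleaved groups, so group order != row order

# Process rows group by group, remembering each row's original position.
pieces, positions = [], []
for k in np.unique(key):
    pos = np.flatnonzero(key == k)
    group = df.iloc[pos]
    pieces.append(group - group.mean())  # demean each column within the group
    positions.append(pos)

stacked = pd.concat(pieces)           # rows are in group order here
order = np.concatenate(positions)     # original position of each stacked row

# argsort of the positions is the permutation that restores the input order.
restored = stacked.iloc[np.argsort(order)]

print(restored.index.equals(df.index))
```

Skipping the last step and using `stacked` directly reproduces the kind of mismatch discussed above: the rows no longer line up with a positional key.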
