GroupBy.transform bug/API change #8046

wesm · 2014-08-17T02:39:32Z

hi folks,

A Python for Data Analysis reader noted the following issue with recent versions of pandas (as of 1 year ago):

import pandas as pd
from pandas import DataFrame
import numpy as np

def demean(arr):
    return arr - arr.mean()

people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

on pandas 0.14.1:

In [14]: people.groupby(key).transform(demean).groupby(key).mean()
Out[14]: 
            a         b         c         d         e
one -0.228006  0.246737  0.201117  0.250544  0.273858
two  0.342009 -0.370106 -0.301676 -0.375816 -0.410788

on the other hand:

In [15]: people.groupby(key).apply(demean).groupby(key).mean()
Out[15]: 
                a             b             c             d             e
one -3.700743e-17  7.401487e-17 -7.401487e-17  7.401487e-17  0.000000e+00
two  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  5.551115e-17

Looks like transform has undergone some work in recent times; any ideas? I need to look at the book text and see if I can triage by replacing transform with apply. At this point transform feels a little bit anachronistic.


dsm054 · 2014-08-17T03:26:52Z

We're not preserving the index order when we return from transform, so key doesn't mean the same thing in each case:

In [119]: df = pd.DataFrame(np.arange(6).reshape(3,2), columns=["a","b"], index=[0,2,1])

In [120]: key = [0,0,1]

In [121]: df
Out[121]: 
   a  b
0  0  1
2  2  3
1  4  5

In [122]: df.groupby(key).transform(lambda x: x-x.mean())
Out[122]: 
   a  b
0 -1 -1
1  0  0
2  1  1

In [123]: df.groupby(key).transform(lambda x: x-x.mean()).groupby(key).mean()
Out[123]: 
     a    b
0 -0.5 -0.5
1  1.0  1.0

In [124]: df.sort_index().groupby(key).transform(lambda x: x-x.mean()).groupby(key).mean()
Out[124]: 
   a  b
0  0  0
1  0  0

Should be straightforward enough to fix.
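As a rough illustration of the point above (not the fix that was eventually merged), here is a minimal sketch of a workaround: reindex the transform result back to the input's index before reusing a positional key. It assumes the index labels are unique; the names `demeaned` and `group_means` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as above: the index is deliberately not sorted.
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"], index=[0, 2, 1])

# Positional key: row i of df belongs to group key[i].
key = np.asarray([0, 0, 1])

demeaned = df.groupby(key).transform(lambda x: x - x.mean())

# Defensive step (assumes unique index labels): put the rows back in the
# input's order so the positional key still lines up. This is a no-op when
# transform already preserves the input order.
demeaned = demeaned.reindex(df.index)

group_means = demeaned.groupby(key).mean()
print(group_means)
```

With the rows back in their original order, each group's mean of the demeaned values should be zero up to floating-point error.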

wesm · 2014-08-17T03:44:53Z

A reader noted that this works:

mapc = {'joe':'one', 'steve':'two', 'wes':'one', 'jim':'two', 'travis':'one'}

and now the following produces zero mean:

>>> demeaned = p.groupby(mapc).transform(demean)
>>> demeaned.groupby(mapc).mean()
                a  b             c             d             e
one  7.401487e-17  0  3.700743e-17  3.700743e-17 -4.625929e-17
two  0.000000e+00  0 -1.387779e-17  5.551115e-17  0.000000e+00

dsm054 · 2014-08-17T03:51:15Z

If I'm right about the problem, then it makes sense that using a dictionary should work, because the dictionary doesn't depend upon the order of the indices to establish the group mapping.
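To make the distinction concrete, here is a small sketch of grouping by a label-based mapping instead of a positional list. The `mapping` dict and the seeded random data are assumptions for illustration; the dict keys match the index labels exactly, including capitalization.

```python
import numpy as np
import pandas as pd


def demean(arr):
    return arr - arr.mean()


rng = np.random.default_rng(0)
people = pd.DataFrame(rng.standard_normal((5, 5)),
                      columns=["a", "b", "c", "d", "e"],
                      index=["Joe", "Steve", "Wes", "Jim", "Travis"])

# Label-based mapping: each index label is looked up in the dict, so the
# grouping does not depend on the order of the rows.
mapping = {"Joe": "one", "Steve": "two", "Wes": "one", "Jim": "two", "Travis": "one"}

demeaned = people.groupby(mapping).transform(demean)
group_means = demeaned.groupby(mapping).mean()
print(group_means)
```

Because the groups are resolved by label, the second groupby assigns each row to the same group as the first, whatever order the rows come back in.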

jreback · 2014-08-17T22:04:48Z

@wesm @dsm054 ok, #8049 fixes this (it uses the routine I already had for putting the transform result back in the original order, which was needed for the change that stops Index from subclassing ndarray). But IIRC this sorting logic has been in place for quite a while (so the bug has possibly been there since 0.11 or so).

wesm added the Bug label Aug 17, 2014

jreback added the Groupby label Aug 17, 2014

jreback added this to the 0.15.0 milestone Aug 17, 2014

jreback mentioned this issue Aug 17, 2014

BUG: Bug in DataFrameGroupby.transform when transforming with a passed non-sorted key (GH8046) #8049

Merged

jreback closed this as completed in #8049 Aug 18, 2014

jreback mentioned this issue Oct 1, 2014

BUG: Groupby.transform related to BinGrouper and GH8046 (GH8430) #8434

Merged
