Apply vs Transform on a group object #9235

amelio-vazquez-reina · 2015-01-12T21:47:19Z

Hi,

I understand this ticket is borderline a usage question, but I think it may point to an inconsistency either in the documentation or in the way these methods work.

The original question on SO is here.

Consider the following dataframe:

     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922

The following commands work:

> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

but none of the following work:

> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)

> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
 TypeError: cannot concatenate a non-NDFrame object

Why? The example in the documentation seems to suggest that calling transform on a group allows row-wise operations:

# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8), 'D': np.random.randn(8)})


shoyer · 2015-01-13T03:29:28Z

This is definitely a documentation/usage question (pull requests to improve the docs are very welcome!)

Here are a couple things we say about transform:

It returns a "like-indexed" result, which for a dataframe means an object with the same row labels (the index) and column labels (which technically also make use of a pandas index).
It returns an object that is indexed the same (same size) as the one being grouped.

In this case, the first group (corresponding to "foo") is an array with shape (5, 3). So the returned result also needs to have shape (5, 3). Instead, it has shape (5,) in your first example (corresponding to a single column) or shape () in your second example (corresponding to a scalar).
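One way around the shape mismatch described above is to compute the cross-column expression before grouping, so that the grouped function only ever sees a single column. The following is a minimal sketch under that assumption, not something proposed in this thread; the names `diff` and `group_mean_diff` and the seeded random generator are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild a frame with the same layout as in the question (values are random).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": rng.standard_normal(8),
    "D": rng.standard_normal(8),
})

# Compute the row-wise difference up front, so the grouped function
# does not need to reach across columns.
diff = df["C"] - df["D"]

# Group the resulting Series by column A. The per-group mean is a scalar,
# which transform broadcasts back to every row of its group.
group_mean_diff = diff.groupby(df["A"]).transform("mean")

# The result is indexed like the original frame.
assert group_mean_diff.index.equals(df.index)
```

Because the input to `transform` is a single Series here, a like-indexed result is simply a Series with the same index, which avoids the (5,) versus (5, 3) conflict.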

In my experience, it's never worth bothering with transform or aggregate -- the more flexible apply is much easier to use. I suppose there may be some performance benefits to using the more specific methods.

jreback · 2015-01-13T03:33:45Z

here is the rule of thumb

a transform function should return either a like-indexed result or a scalar that can be broadcast over the group; iow the final result has a value for each value in the frame
apply can handle a transform or an arbitrary return for each group
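The contrast in the rule of thumb can be sketched as follows. This is an illustration under stated assumptions rather than an official recipe: the columns C and D are selected explicitly so the string columns are left out, and the names `zscored` and `per_group` are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": rng.standard_normal(8),
    "D": rng.standard_normal(8),
})

# transform: the function returns something shaped like its input,
# so the result has one row per original row and the selected columns.
zscored = df.groupby("A")[["C", "D"]].transform(
    lambda x: (x - x.mean()) / x.std()
)

# apply: the function may return anything per group; here it is one
# scalar per group, so the result has one entry per group key.
per_group = df.groupby("A")[["C", "D"]].apply(
    lambda g: (g["C"] - g["D"]).mean()
)
```

Here `zscored` keeps the index of `df`, while `per_group` is indexed by the group keys.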

TomAugspurger · 2015-01-13T03:46:52Z

The DataFrameGroupBy docstring does say

Parameters
----------
f : function
    Function to apply to each subframe

"subframe" does make it sound like you'll be working on DataFrames like:

In [17]: gr.get_group("foo")
Out[17]: 
     A      B         C         D
0  foo    one  0.733783  0.043300
2  foo    two  1.046065 -1.899229
4  foo    two  0.899932 -0.510403
6  foo    one -0.102515 -1.300482
7  foo  three  0.867948 -0.570468

This should be clarified.
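Until the docstring is clarified, selecting a single column before calling transform removes the ambiguity about what the function receives: each group then arrives as a Series. The sketch below records what the function is handed; the helper `demean` and the list `seen` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": rng.standard_normal(8),
    "D": rng.standard_normal(8),
})

seen = []  # (type name, length) of every object handed to the function


def demean(s):
    # Record what transform passes in, then return a like-indexed result.
    seen.append((type(s).__name__, len(s)))
    return s - s.mean()


# Selecting column C first yields a SeriesGroupBy, so each group is a Series.
centered = df.groupby("A")["C"].transform(demean)
```

Inspecting `seen` after the call shows the type and length of each group passed to `demean`.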

jreback · 2015-01-13T11:08:47Z

going to close, but if you'd like to improve the docs, that would be great!

amelio-vazquez-reina changed the title ~~Pandas apply vs transform on a group object~~ Apply vs Transform on a group object Jan 12, 2015

shoyer added the Usage Question label Jan 13, 2015

shoyer mentioned this issue Jan 13, 2015

API/ENH: Add mutate like method to DataFrames #9229

Closed

shoyer added the Docs label Jan 13, 2015

jreback closed this as completed Jan 13, 2015

jreback added Groupby API Design and removed API Design labels Jan 13, 2015

jreback mentioned this issue Sep 21, 2016

bug when filling missing values with transform? #14274

Closed
