Apply vs Transform on a group object #9235

Closed
amelio-vazquez-reina opened this issue Jan 12, 2015 · 4 comments

@amelio-vazquez-reina
Contributor

Hi,

I understand this ticket is a borderline usage question, but I think it may point to an inconsistency, either in the documentation or in the way these methods work.

The original question on SO is here.

Consider the following dataframe:

     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922

The following commands work:

> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

but none of the following work:

> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)

> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
 TypeError: cannot concatenate a non-NDFrame object

Why? The example in the documentation seems to suggest that calling transform on a group allows row-wise processing:

# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': randn(8), 'D': randn(8)})

amelio-vazquez-reina changed the title from "Pandas apply vs transform on a group object" to "Apply vs Transform on a group object" on Jan 12, 2015

@shoyer
Member

shoyer commented Jan 13, 2015

This is definitely a documentation/usage question (pull requests to improve the docs are very welcome!)

Here are a couple things we say about transform:

  • It returns a "like-indexed" result, which for a dataframe means an object with the same row labels (the index) and column labels (which technically also make use of a pandas index).
  • It returns an object that is indexed the same (same size) as the one being grouped.

In this case, the first group (corresponding to "foo") is an array with shape (5, 3). So the returned result also needs to have shape (5, 3). Instead, it has shape (5,) in your first example (corresponding to a single column) or shape () in your second example (corresponding to a scalar).
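
As a rough illustration of those three shapes, here is a minimal sketch. It rebuilds a frame with the same layout using a seeded generator (an assumption, so the values differ from the ones in the question) and selects the "foo" rows with a boolean mask rather than through the groupby machinery:

```python
import numpy as np
import pandas as pd

# Same layout as the frame in the question; the seed is an assumption.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": rng.standard_normal(8),
    "D": rng.standard_normal(8),
})

# The five "foo" rows, without the grouping column A (leaves B, C, D).
foo = df[df["A"] == "foo"].drop(columns="A")

group_shape = foo.shape                               # shape of the group itself
diff_shape = (foo["C"] - foo["D"]).shape              # first example: one column
mean_shape = np.shape((foo["C"] - foo["D"]).mean())   # second example: a scalar

print(group_shape, diff_shape, mean_shape)
```

The group is two-dimensional, while the two lambdas return a one-dimensional result and a scalar respectively, which is the mismatch described above.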

In my experience, it's never worth bothering with transform or aggregate: the more flexible apply is much easier to use. I suppose there may be some performance benefits to using the more specific methods.

@jreback
Contributor

jreback commented Jan 13, 2015

Here is the rule of thumb:

  • the function passed to transform should return either a scalar, which is broadcast across the group, or an object the same size as the group; in other words, the result has a value for each value in the frame
  • apply can handle a transform or an arbitrary return for each group
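
A small sketch of that rule of thumb, assuming the goal from the question (work with `C - D` per group of `A`). The toy values and the variable names are made up for illustration; the row-wise subtraction is done before grouping, so the function passed to transform only ever sees a single Series:

```python
import pandas as pd

# Toy data (hypothetical values chosen so the results are easy to check).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 6.0],
    "D": [0.0, 1.0, 0.0, 1.0],
})

diff = df["C"] - df["D"]        # row-wise step, done before grouping
by_a = diff.groupby(df["A"])    # group the difference by column A

# transform with a scalar return: the scalar is broadcast over each group.
group_mean = by_a.transform(lambda s: s.mean())

# transform with a same-size return: one output value per input value.
demeaned = by_a.transform(lambda s: s - s.mean())

# apply with a scalar return: one value per group (a reduction).
reduced = by_a.apply(lambda s: s.mean())

print(group_mean.tolist())
print(demeaned.tolist())
print(reduced.to_dict())
```

Both transform results keep the index of the original frame, so they can be assigned straight back as new columns; the apply result is indexed by group label instead.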

@TomAugspurger
Contributor

The DataFrameGroupBy docstring does say

Parameters
----------
f : function
    Function to apply to each subframe

"Subframe" does make it sound like you'll be working on DataFrames like this:

In [17]: gr.get_group("foo")
Out[17]: 
     A      B         C         D
0  foo    one  0.733783  0.043300
2  foo    two  1.046065 -1.899229
4  foo    two  0.899932 -0.510403
6  foo    one -0.102515 -1.300482
7  foo  three  0.867948 -0.570468

This should be clarified.
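
One way to check what the function actually receives is to log the type of its argument. This is only a sketch: `make_probe` and the toy frame are hypothetical, and what a DataFrame groupby passes in may differ between pandas versions, so the code records the types rather than assuming them. Selecting a single column first makes the input unambiguous, since the function then gets one Series per group.

```python
import pandas as pd

# Toy data (hypothetical values).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 6.0],
    "D": [0.0, 1.0, 0.0, 1.0],
})

def make_probe(log):
    """Return a de-meaning function that also logs the type of its input."""
    def probe(x):
        log.append(type(x).__name__)
        return x - x.mean()   # same size as the input, so valid for transform
    return probe

frame_log, series_log = [], []

# Grouped DataFrame: the input type is whatever pandas chooses to pass.
frame_result = df.groupby("A")[["C", "D"]].transform(make_probe(frame_log))

# Grouped single column: the function receives one Series per group.
series_result = df.groupby("A")["C"].transform(make_probe(series_log))

print(sorted(set(frame_log)))
print(sorted(set(series_log)))
```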

@jreback
Contributor

jreback commented Jan 13, 2015

Going to close, but if you'd like to improve the docs, that would be great!
