Skip to content

DOC: Add details to DataFrame groupby transform #14388

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

bashtage
Copy link
Contributor

Add requirements for user function in groupby transform

closes #13543
[skip ci]

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bashtage It is a really good idea to better document this, thanks for starting this PR!

However, see my comments, I think we first have to flesh out a bit more the exact accepted semantics of f.

The current implementation imposes two requirements on f:

* f must return a new DataFrame - in-place mutation of subframes is
not supported and may produce unexpected results.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not entirely correct. It can certainly return other things as well (like scalars), as long as it broadcasts to a Series or DataFrame of the same shape as the subset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense - was a bit too focused on the common case.

* f must return a new DataFrame - in-place mutation of subframes is
not supported and may produce unexpected results.
* f must support application both column-by-column and to the entire
subframe.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this is not entirely true I think. In any case it does work (testing with a simple example) with functions that only work on a Series (so column by column)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right - series by series will always work. The only thing that doesn't work is one that expects to always be passed a DataFrame. Will clarify this.

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Oct 10, 2016
@bashtage bashtage force-pushed the groupby-transform-doc-string branch from 4507f3d to 311a52e Compare October 24, 2016 14:56
@bashtage
Copy link
Contributor Author

@jorisvandenbossche I've taken another stab at explaining the requirements. Also updated the requirements in the online docs.

The current implementation imposes three requirements on f:

* f must return a value that either has the same shape as the input
subframe or can be broadcast to the shape of the input subframe.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give an example here (and in the docs), for each of these? and example function that is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh I just added that point above!

@jreback
Copy link
Contributor

jreback commented Dec 6, 2016

also would like to update the doc-string with the same

as the one being grouped. The transform function must:

* Return a result that is either the same size as the group chunk or
broadcastable to the size of the group chunk.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add that a scalar is one example of such a broadcastable value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you added an example below, but I would explicitly add here (eg a scalar) at the end of this sentence to make it clearer to the user what the typical use case is for the broadcastable result.

* Return a result that is either the same size as the group chunk or
broadcastable to the size of the group chunk.
* Operate column-by-column on the group chunk. A fast path is used if the
transform function also operates on the entire group chunk as a DataFrame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will not be very clear to readers. Do you have a specific example where you noticed this? Or is this explanation based on looking at the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what started this issue/PR. I wanted to use a grand average subframe-by-subframe but this isn't actually possible since the first subframe is always operated on column-by-column and so it isn't possible to use information across all columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an example of the function that you describe for each of these cases (no need to actually execute), more like (for the 3rd case).

grouped.transform.lambda x: x.fillna(inplace=False)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the first subframe is always operated on column-by-column and so it isn't possible to use information across all columns.

What you say here seems to contradict with the sentence you added in the docs: "A fast path is used if the transform function also operates on the entire group chunk as a DataFrame". So that is not really clear to me.

@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

@bashtage can you rebase / update

@bashtage bashtage force-pushed the groupby-transform-doc-string branch 2 times, most recently from f6935c6 to 9b33584 Compare December 31, 2016 12:29
.. ipython:: python

extremum = lambda x: x.max() - x.min()
transformed = ts.groupby(key).transform(extremum)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show here the output of transformed? (or just remove the transformed = if this variable is not further used).

It would be nice to also show that you can use transform(np.min) or tranform('min').

as the one being grouped. The transform function must:

* Return a result that is either the same size as the group chunk or
broadcastable to the size of the group chunk.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you added an example below, but I would explicitly add here (eg a scalar) at the end of this sentence to make it clearer to the user what the typical use case is for the broadcastable result.

Add requirements for user function in groupby transform

closes pandas-dev#13543
[skip ci]
@bashtage bashtage force-pushed the groupby-transform-doc-string branch from 9b33584 to ef1ff13 Compare February 17, 2017 20:40
@bashtage
Copy link
Contributor Author

I've tried again to explain. Maybe it would be safer to just say "Don't use f that mutate the data or make use of data in multiple columns to compute the transformation"?

same shape as the input subframe.
* f must support application column-by-column in the subframe. If f
also supports application to the entire subframe, then a fast path
is used starting from the second chunk.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the same to SeriesGroupBy.transform? or even better is to rip out the doc-strings to have a common one :> and use Appender.

@jreback
Copy link
Contributor

jreback commented Mar 23, 2017

@bashtage can you update

@jreback jreback closed this in 80f30b4 Mar 25, 2017
@jreback
Copy link
Contributor

jreback commented Mar 25, 2017

thanks @bashtage I added a templated version to handle SeriesGroupBy as well

mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
closes pandas-dev#13543

Author: Kevin Sheppard <[email protected]>

Closes pandas-dev#14388 from bashtage/groupby-transform-doc-string and squashes the following commits:

ef1ff13 [Kevin Sheppard] DOC: Add details to DataFrame groupby transform
@bashtage bashtage deleted the groupby-transform-doc-string branch April 22, 2018 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby.transform passing Series to transformation
3 participants