-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
DOC: Add details to DataFrame groupby transform #14388
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bashtage It is a really good idea to better document this, thanks for starting this PR!
However, see my comments, I think we first have to flesh out a bit more the exact accepted semantics of f
.
pandas/core/groupby.py
Outdated
The current implementation imposes two requirements on f: | ||
|
||
* f must return a new DataFrame - in-place mutation of subframes is | ||
not supported and may produce unexpected results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not entirely correct. It can certainly return other things as well (like scalars), as long as it broadcasts to a Series or DataFrame of the same shape as the subset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense - was a bit too focused on the common case.
pandas/core/groupby.py
Outdated
* f must return a new DataFrame - in-place mutation of subframes is | ||
not supported and may produce unexpected results. | ||
* f must support application both column-by-column and to the entire | ||
subframe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this is not entirely true I think. In any case it does work (testing with a simple example) with functions that only work on a Series (so column by column)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right - series by series will always work. The only thing that doesn't work is one that expects to always be passed a DataFrame
. Will clarify this.
4507f3d
to
311a52e
Compare
@jorisvandenbossche I've taken another stab at explaining the requirements. Also updated the requirements in the online docs. |
The current implementation imposes three requirements on f: | ||
|
||
* f must return a value that either has the same shape as the input | ||
subframe or can be broadcast to the shape of the input subframe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you give an example here (and in the docs), for each of these? and example function that is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh I just added that point above!
also would like to update the doc-string with the same |
doc/source/groupby.rst
Outdated
as the one being grouped. The transform function must: | ||
|
||
* Return a result that is either the same size as the group chunk or | ||
broadcastable to the size of the group chunk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add that a scalar is one example of such a broadcastable value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you added an example below, but I would explicitly add here (eg a scalar)
at the end of this sentence to make it clearer to the user what the typical use case is for the broadcastable result.
doc/source/groupby.rst
Outdated
* Return a result that is either the same size as the group chunk or | ||
broadcastable to the size of the group chunk. | ||
* Operate column-by-column on the group chunk. A fast path is used if the | ||
transform function also operates on the entire group chunk as a DataFrame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will not be very clear to readers. Do you have a specific example where you noticed this? Or is this explanation based on looking at the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what started this issue/PR. I wanted to use a grand average subframe-by-subframe but this isn't actually possible since the first subframe is always operated on column-by-column and so it isn't possible to use information across all columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add an example of the function that you describe for each of these cases (no need to actually execute), more like (for the 3rd case).
grouped.transform.lambda x: x.fillna(inplace=False)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the first subframe is always operated on column-by-column and so it isn't possible to use information across all columns.
What you say here seems to contradict with the sentence you added in the docs: "A fast path is used if the transform function also operates on the entire group chunk as a DataFrame". So that is not really clear to me.
@bashtage can you rebase / update |
f6935c6
to
9b33584
Compare
doc/source/groupby.rst
Outdated
.. ipython:: python | ||
|
||
extremum = lambda x: x.max() - x.min() | ||
transformed = ts.groupby(key).transform(extremum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show here the output of transformed
? (or just remove the transformed =
if this variable is not further used).
It would be nice to also show that you can use transform(np.min)
or tranform('min')
.
doc/source/groupby.rst
Outdated
as the one being grouped. The transform function must: | ||
|
||
* Return a result that is either the same size as the group chunk or | ||
broadcastable to the size of the group chunk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you added an example below, but I would explicitly add here (eg a scalar)
at the end of this sentence to make it clearer to the user what the typical use case is for the broadcastable result.
Add requirements for user function in groupby transform closes pandas-dev#13543 [skip ci]
9b33584
to
ef1ff13
Compare
I've tried again to explain. Maybe it would be safer to just say "Don't use f that mutate the data or make use of data in multiple columns to compute the transformation"? |
same shape as the input subframe. | ||
* f must support application column-by-column in the subframe. If f | ||
also supports application to the entire subframe, then a fast path | ||
is used starting from the second chunk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the same to SeriesGroupBy.transform? or even better is to rip out the doc-strings to have a common one :> and use Appender.
@bashtage can you update |
thanks @bashtage I added a templated version to handle SeriesGroupBy as well |
closes pandas-dev#13543 Author: Kevin Sheppard <[email protected]> Closes pandas-dev#14388 from bashtage/groupby-transform-doc-string and squashes the following commits: ef1ff13 [Kevin Sheppard] DOC: Add details to DataFrame groupby transform
Series
to transformation #13543git diff upstream/master | flake8 --diff
Add requirements for user function in groupby transform
closes #13543
[skip ci]