-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: groupby.agg/transform casts UDF results #40790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…nt_cast_udfs � Conflicts: � doc/source/whatsnew/v1.3.0.rst
…nt_cast_udfs � Conflicts: � doc/source/whatsnew/v1.3.0.rst � pandas/core/groupby/generic.py
…into dont_cast_udfs � Conflicts: � pandas/_libs/lib.pyx � pandas/core/groupby/generic.py � pandas/tests/groupby/test_groupby.py � pandas/tests/resample/test_datetime_index.py
right i think we should try to start documenting return types for the udfs in the main docs with some examples (and ideally add to the doc strings too) |
@jreback - docs added and green. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. can you rebase and some very minor doc-comments. ping on green.
doc/source/whatsnew/v1.3.0.rst
Outdated
|
||
.. ipython:: python | ||
|
||
In [5]: df.groupby('key').agg(lambda x: x.sum()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need the prompt
pandas/core/groupby/generic.py
Outdated
2 3 4""" | ||
2 3 4 | ||
|
||
The resulting dtype will reflect the return value of the aggregating function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you do a versionchanged tag here
pandas/core/groupby/generic.py
Outdated
See :ref:`groupby.aggregate.named` for more.""" | ||
See :ref:`groupby.aggregate.named` for more. | ||
|
||
The resulting dtype will reflect the return value of the aggregating function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionchanged
|
||
>>> g[['B', 'C']].apply(lambda x: x.max() - x.min()) | ||
B C | ||
>>> g[['B', 'C']].apply(lambda x: x.astype(float).max() - x.min()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionchanged
The resulting dtype will reflect the return value of the transformation function, | ||
for example: | ||
|
||
>>> grouped[['C', 'D']].transform(lambda x: x.astype(int).max()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versioncahnged
@jreback - changes made and green. I added versionchanged to the Notes section as well as the Examples sections. |
thanks @rhshadrach |
@@ -1240,9 +1276,6 @@ def _python_agg_general(self, func, *args, **kwargs): | |||
assert result is not None | |||
key = base.OutputKey(label=name, position=idx) | |||
|
|||
if is_numeric_dtype(obj.dtype): | |||
result = maybe_downcast_numeric(result, obj.dtype) | |||
|
|||
if self.grouper._filter_empty_groups: | |||
mask = counts.ravel() > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach can i get your help in this nearby piece of code? in all existing tests, when we get here, we have mask.all()
. trying to come up with a case where this doesnt hold (or prove that it must always hold). any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is now removed - guessing mask.all() always held.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep.
next thing to ask your help with : IIRC you've done a lot of work in core.apply, which DataFrameGroupBy.aggregate uses. id like to make SeriesGroupBy.aggregate and DataFrameGroupBy.aggregate share more code (or at least be more obviously-similar). can i get your thoughts on how to achieve this (and whether its the effort)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we make aggregate always aggregate (PoC implemented in #40275), we can greatly simplify these methods. However, in order to do that we need to separate the apply/agg paths: currently apply uses agg for list/dicts and agg also uses apply for UDFs. I make a couple of attempts to do this but kept running into issues with changing behaviors without having a clear way to deprecate. This was the motivation for #41112. I plan to start working on that, assuming that's a good approach, in 1.3.
groupby.agg
with function returningint
onfloat
input #17035This PR stops casting when the func is a callable since we can't tell when casting will be more harmful than helpful. I think this results in a more expected/consistent behavior for users.