-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
agg with list of non-aggregating functions #35723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
thanks @rhshadrach can you open an issue to 'fix' this properly in 1.2 |
The issue is here: #35725 |
@meeseeksdev backport 1.1.x |
Co-authored-by: Richard Shadrach <[email protected]>
@@ -322,11 +322,14 @@ def _aggregate_multiple_funcs(self, arg): | |||
# let higher level handle | |||
return results | |||
|
|||
output = self._wrap_aggregated_output(results) | |||
output = self._wrap_aggregated_output(results, index=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach is it the case that if we get here then the user passed something sketchy? i.e. should we do the deprecation here?
What should the "do XYZ instead" be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're pointing at the return results
line immediately above this and referring to #35725. Yes, the user passed something I would consider to be sketchy.
My understanding here is that we're a SeriesgGroupBy with results being a mapping from the groups to the result of a UDF. My opinion is that the result should be a Series whose index consists of the groups and values are DataFrames themselves. In other words, the user passed something to agg
, so no matter what we get back we are going to treat it as a scalar. It might not be useful, but it is consistent, predictable, and easy to maintain. Optionally, we can emit a performance warning in the case of getting back a complex object.
I've been trying to do the deprecation for #35725 for a bit now, but ended up giving because of the way agg
and apply
cross paths. I am currently writing an issue for what I see being the way forward, hoping it will be ready at some point tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're pointing at the return results line immediately above this
I was actually pointing the 324, so I take it my original guess is wrong? We get to L325 if the user passed something "normal" and return on L323 if they passed something wonky?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L323 the user definitely passed something wonky. L325 gets hit in both wonky and non-wonky cases, e.g.
gb = DataFrame({'key': [1, 1, 2, 2], 'a': [1, 2, 3, 4]}).groupby('key')['a']
gb.agg(['sum', 'mean'])) # non-wonky
gb.agg(['sum', 'cumsum'])) # wonky
both hit L325.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Reverts to 1.0.5 behavior while maintaining the bugfix that introduced the "regression". Ideally
agg
would instead raise in these scenarios (ref), but that would be an API change. Much of this PR should be reverted once this is done, I've marked such places with TODOs.