ENH: Should we support aggregating by-frame in DataFrameGroupBy.agg #58225
Nice write-up. If we were to choose Option A, and someone had a func that operated on a DataFrame, what would their migration path be? Can you illustrate via an example?
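One illustrative migration path (a sketch, not an official recommendation) would be to move such a function over to DataFrameGroupBy.apply, which always receives the group's sub-DataFrame. Depending on the pandas version, apply may warn about operating on the grouping columns:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})

# A UDF that genuinely needs the whole sub-frame (uses two columns at once):
def frame_udf(g):
    return (g["b"] * g["c"]).sum()

# If agg only operated by-column (Option A), one possible migration path
# is apply, which always passes the group's sub-DataFrame:
out = df.groupby("a").apply(frame_udf)
print(out)
# a=1: 3*6 + 4*7 = 46, a=2: 5*8 = 40
```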
During the dev meeting, I expressed doubt that this could be reliably answered. In the OP, I detailed the poor behavior in the scalar-return case.
One thing I missed: If the index of the result is always the same, DataFrameGroupBy.apply stacks the results horizontally (the first example). If the index of the result is not always the same, DataFrameGroupBy.apply stacks the results vertically.
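A small sketch of that distinction, assuming current apply semantics (version-dependent warnings about grouping columns aside):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [10, 20, 30, 40]})

# Every group returns a Series with the SAME index -> stacked horizontally:
# one row per group, the Series index becomes the columns.
wide = df.groupby("a").apply(
    lambda g: pd.Series({"lo": g["b"].min(), "hi": g["b"].max()})
)
print(wide.shape)  # (2, 2)

# Each group returns a Series with a DIFFERENT index (its own row labels)
# -> results are stacked vertically instead.
tall = df.groupby("a").apply(lambda g: g["b"] * 2)
print(tall.shape)  # (4,)
```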
As mentioned in the call, these checks can get expensive. I've been liking the decorator idea more the more I think about it. I think it would apply to UDFs in general, not just agg (maybe not even just groupby). We would get a lot of mileage just by having it specify its argument/return types, and more if it specifies "all returned Series have the same Index" or "here are the return dtypes". In the latter case we could get a non-trivial perf improvement by avoiding an object-dtype intermediate (with non-trivial effort on our end; though now that I think of it, is that just reinventing numba?). Short-term, the only strong opinion I have is that "by" should not be re-used, as it already has a meaning in other groupby methods.
My only opposition to the decorator is that it would make inline lambdas more awkward. But it's not strong opposition.
Short-term, I would like to resolve the bug #39169 by removing any by-frame operation in DataFrameGroupBy.agg and only operating by-column for now. We can then add by-frame as a clean enhancement in the future. As mentioned in the OP, the only cases where you currently get by-frame behavior are (a) one grouping with args/kwargs supplied or (b) an empty input. I think those cases are small enough, and the bug sufficiently bad, that it should be fixed sooner rather than later. For implementation, I would deprecate those cases and point the user to an alternative. Is there any support or opposition to this path forward?
This topic was discussed in the dev meeting on 2024-04-10. I believe I captured all ideas expressed there.
Summary: DataFrameGroupBy.agg will sometimes pass the UDF each column of the original DataFrame as a Series ("by-column"), and sometimes pass the UDF the entire DataFrame ("by-frame"). This causes issues for users. Either we should only operate by-column, or allow users to control whether the UDF they provide is passed a Series or DataFrame. If we do enable users to control this, we need to decide on what the right behavior is in the by-frame case.
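A minimal sketch of the inconsistency. The exact trigger is version-dependent; on versions affected by #39169, passing args/kwargs with a single grouping switches agg onto the by-frame path:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})

def udf(x, offset=0):
    # By-column, x is a single column as a Series; by-frame, x is the
    # group's entire sub-DataFrame.
    kind = "frame" if isinstance(x, pd.DataFrame) else "series"
    print(kind)
    return x.shape[0] + offset

# By-column: udf is called once per column per group.
res1 = df.groupby("a").agg(udf)

# On affected versions, supplying a kwarg with a single grouping makes
# pandas call udf with the whole sub-DataFrame instead (by-frame).
res2 = df.groupby("a").agg(udf, offset=0)
```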
Relevant code: pandas/pandas/core/groupby/generic.py, lines 1546 to 1569 in b4493b6
Logic: In the case where the user passes a callable (UDF), pandas will operate by-frame when:
(1) there is a single grouping (e.g. .groupby("a") and not .groupby(["a", "b"])) and the user is passing args or kwargs through to the UDF; or
(2) .agg([func]) raises a ValueError with "No objects to concatenate" in the message.
In all other cases, pandas operates by-column. The only place (2) is currently hit in the test-suite is aggregating an empty DataFrame. I do not see how else it might be possible to hit (2).
Impact: When a user provides a UDF that can work either by-column or by-frame, but does not necessarily produce the same result in both cases, whether they have a single grouping and/or pass an arg/kwarg through can produce different results. This is #39169.
This seems to me like a particularly bad behavior in DataFrameGroupBy.agg, and I'd like to resolve this bug.
Option A: Only support by-column in groupby(...).agg
Pros: Most straightforward.
Cons: Removes the ability for users to use agg when they want the UDF to use multiple columns (which isn't very accessible anyway).
Option B: Add an argument
Users would communicate what they want the UDF to accept by passing an argument.
Pros: Enables users to have their UDF accept a frame when they desire.
Cons: We need to decide on behavior in the case of by="frame" (see below).
Option C-a: Required Decorator
Users would communicate what they want the UDF to accept by adding a decorator.
Pros: Enables users to have their UDF accept a frame when the desire.
Cons: We need to decide on behavior in the case of by="frame" (see below).
Option C-b: Optional Decorator
This is somewhat of a combination of Option A and Option C-a. Without a decorator, we would always operate by-column. If the decorator is provided, we could operate by-frame.
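One hypothetical shape for such a decorator. Nothing like this exists in pandas today; the decorator name and marker attribute are invented here purely for illustration:

```python
import pandas as pd

def by_frame(func):
    # Hypothetical marker decorator: pandas could check this attribute
    # and pass the whole sub-DataFrame instead of one column at a time.
    func._agg_by_frame = True
    return func

@by_frame
def spread(g):
    # Would receive the group's sub-DataFrame under the decorator options.
    return (g["b"] - g["c"]).abs().max()

# Undecorated UDFs would keep today's by-column behavior; decorated ones
# carry the marker for pandas to inspect.
print(getattr(spread, "_agg_by_frame", False))
```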
If we do decide to support by-frame
Then we need to decide on its behavior. Currently _aggregate_frame will create a dictionary of per-group UDF results, pass this dictionary to the DataFrame constructor, and then take a transpose.
pandas/pandas/core/groupby/generic.py, lines 1614 to 1621 in b4493b6
This behavior works well when scalars are returned by the UDF, but not when iterable objects (e.g. a Series) are returned.
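A rough sketch of that dict-then-transpose logic (not pandas' actual code), showing why scalar returns broadcast cleanly while a row-indexed Series return misaligns:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})
cols = ["b", "c"]  # the non-grouping columns

def aggregate_frame_sketch(func):
    # One UDF result per group label, then DataFrame constructor plus a
    # transpose, roughly mirroring _aggregate_frame.
    result = {name: func(g) for name, g in df.groupby("a")[cols]}
    return pd.DataFrame(result, index=cols).T

# Scalar return: the scalar is broadcast down the column index, so after
# the transpose every column of the group's row holds the same value.
out = aggregate_frame_sketch(lambda g: g.to_numpy().sum())
print(out)
#     b   c
# 1  20  20
# 2  13  13

# Series return indexed by row labels rather than columns: alignment
# against index=cols produces all-NaN output.
bad = aggregate_frame_sketch(lambda g: g["b"].cumsum())
print(bad)
```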
Option I: Treat all returns as if they are scalars.
Pros: Simple for users to understand and us to implement.
Cons: Does not allow the performance enhancement of operating on a DataFrame and returning a Series (e.g. lambda x: x.sum() where x is a DataFrame).
Option II: Add an argument.
Users would communicate how they want the UDF's return value to be interpreted via an argument.
Pros: Explicit user control.
Cons: Has an additional argument.
Option IIIa: Series is interpreted as stack vertically; all others are scalars.
Pros: No additional argument / features needed.
Cons: Somewhat magical, users cannot have pandas treat Series as if it were a scalar.
Option IIIb: Series is interpreted as stack vertically unless boxed with pd.Scalar; all others are scalars.
Boxing the UDF's result with pd.Scalar tells pandas to treat it like a scalar. Note that pd.Scalar does not currently exist.
Pros: No additional argument needed.
Cons: Somewhat magical, need to add pd.Scalar.
cc @jorisvandenbossche @jbrockmendel @Dr-Irv
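A hypothetical sketch of what such boxing could look like. pd.Scalar is invented here (it does not exist in pandas); the class below is a stand-in to illustrate the proposed semantics:

```python
import pandas as pd

class Scalar:
    # Hypothetical wrapper (a stand-in for the proposed pd.Scalar): boxing
    # a Series would tell agg to store it as one cell rather than stacking
    # it vertically into rows.
    def __init__(self, value):
        self.value = value

def udf(g):
    # Unboxed: the returned Series would stack vertically (IIIb default).
    # Boxed: pandas would treat the whole Series as a single scalar cell.
    return Scalar(g["b"].rank())

group = pd.DataFrame({"b": [3, 4]})
boxed = udf(group)
print(type(boxed).__name__)  # Scalar
```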