-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Try fast/slow paths only once in DataFrameGroupby.transform #42195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/groupby/generic.py
Outdated
except ValueError as err: | ||
msg = "transform must return a scalar value for each group" | ||
raise ValueError(msg) from err | ||
if path is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of doing this at each step in the loop, can we just do this outside the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - updated PR to show what this would look like. Minor cleanups can be done to avoid the nested if statements in the newly added process_result
, but it's a bit ugly. We're trading if path is None
for a function call as I don't see another way to avoid code duplication.
pandas/core/groupby/generic.py
Outdated
columns=group.columns, | ||
index=group.index, | ||
) | ||
applied.append(r) | ||
else: | ||
applied.append(res) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can process_result return res and do this append outside the func
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One path uses continue
, so the most immediate ways I see are to (a) use a sentinel value to indicate "do not add" or (b) move the non-adding branch outside of process_result
(duplicating code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, the path using continue can almost be moved ahead of any computation - it doesn't really depend on the result except for whether the result is a Series or not. I'm going to see if I can do a precursor cleanup of this method. Marking this as a draft for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. if the answer to my question is "no" thats OK too.
pandas/core/groupby/generic.py
Outdated
msg = "transform must return a scalar value for each group" | ||
raise ValueError(msg) from err | ||
|
||
def process_result(applied, group, res): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be changed to return the object to be appended to applied? rather than mutating
@rhshadrach can you rebase |
…oupby_transform_path � Conflicts: � pandas/core/groupby/generic.py
@jreback @jbrockmendel Updated and ready for review. I also tried only using the slow path (instead of choosing between fast/slow); this resulted in a 0.7 ratio for the benchmark |
IIUC you mean do some introspection on the function (or require users to specify somehow) to determine this instead of a try/except? That would be nice.
What about for the groupby benchmarks more generally? The slow path has been optimized a bunch this year (xref #40263) |
I think its fine to determine how to call it based on the first group. we need some metric. |
pandas/core/groupby/generic.py
Outdated
msg = "transform must return a scalar value for each group" | ||
raise ValueError(msg) from err | ||
|
||
def process_result( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be a module level function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - can either make it a module-level by passing self.obj or make this into a class method. Do you see a reason to prefer one or the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think prefer a module level
Introspection seems fragile to me. I didn't consider user specification but I like the idea as long as it isn't required. Another option would be to restrict what UDFs are acceptable (e.g. must return Series, DataFrame, or NumPy array, and maybe some other things). For this latter option, we could then handle everything acceptable via the fastpath and remove the slowpath entirely. But perhaps the flexibility of current handling makes this an unwelcome trade-off.
|
pandas/core/groupby/generic.py
Outdated
index=group.index, | ||
) | ||
assert isinstance(r, DataFrame) | ||
return r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you call this e.g. res_frame? i.e. not 1 letter
comment by @jbrockmendel ping on green |
@jbrockmendel - This is slowpath only (L1385, first line of _choose_path is
|
@jreback - green |
thanks @rhshadrach very nice! |
Still need to run ASVs to measure performance, making sure CI passes first.
Edit: ASV results for 7b86fdb