PERF: Calling slowpath for every group in transform #41598

Closed
rhshadrach opened this issue May 21, 2021 · 1 comment · Fixed by #42195
Labels
Groupby Performance Memory or execution speed performance

@rhshadrach
Member

Code in groupby.generic.DataFrameGroupBy._transform_general:

for name, group in gen:
    object.__setattr__(group, "name", name)
    # Try slow path and fast path.
    try:
        path, res = self._choose_path(fast_path, slow_path, group)
    except TypeError:
        return self._transform_item_by_item(obj, fast_path)
    except ValueError as err:
        msg = "transform must return a scalar value for each group"
        raise ValueError(msg) from err

This is calling _choose_path for every group, which in turn calls both the slow_path and the fast_path to determine if the fast path can be used. Indeed, running the code (from #41584):

import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd'],
    'y': [5, 6, 7, 8],
    'g': [1, 2, 3, 3]
})
def myfirst(c):
    return c.iloc[0]
print(df.groupby('g').transform(myfirst))

shows that myfirst gets called 9 times: 3 times with column x, 3 times with column y, and 3 times with the DataFrame consisting of x and y.
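One way to check a count like this is to have the function record each call. The sketch below is only an illustration: `counting_first` and `calls` are names made up here, and the number of recorded calls depends on the installed pandas version, so no expected count is shown.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd'],
    'y': [5, 6, 7, 8],
    'g': [1, 2, 3, 3],
})

calls = []  # one entry per invocation of the user function


def counting_first(c):
    # Record whether pandas passed a whole group (DataFrame) or a
    # single column (Series), together with its name if it has one.
    calls.append((type(c).__name__, getattr(c, 'name', None)))
    return c.iloc[0]


result = df.groupby('g').transform(counting_first)

print(len(calls))
for kind, name in calls:
    print(kind, name)
```

Entries of kind `Series` come from the per-column slow path, and entries of kind `DataFrame` come from the fast path.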

Should we just be calling choose_path on the first group to determine which can be used?
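A minimal sketch of that idea, using plain Python lists instead of pandas objects. Everything here (`choose_path`, `transform_groups`, the toy `fast_path` and `slow_path`) is a hypothetical stand-in for illustration and is not the pandas implementation; it assumes the first group is representative of the others.

```python
def choose_path(fast_path, slow_path, group):
    """Run both paths on one group and return (chosen path, its result)."""
    res = slow_path(group)
    try:
        res_fast = fast_path(group)
    except Exception:
        # The fast path is unusable for this input; keep the slow path.
        return slow_path, res
    if res_fast == res:
        return fast_path, res_fast
    return slow_path, res


def transform_groups(groups, fast_path, slow_path):
    """Choose the path on the first group only, then reuse it."""
    results = []
    path = None
    for group in groups:
        if path is None:
            # Only the first group pays for running both paths.
            path, res = choose_path(fast_path, slow_path, group)
        else:
            res = path(group)
        results.append(res)
    return results


calls = {"fast": 0, "slow": 0}


def fast_path(group):
    # Toy "whole group at once" transform.
    calls["fast"] += 1
    return [value * 2 for value in group]


def slow_path(group):
    # Toy "item by item" transform.
    calls["slow"] += 1
    out = []
    for value in group:
        out.append(value * 2)
    return out


out = transform_groups([[1], [2], [3, 4]], fast_path, slow_path)
print(out)    # [[2], [4], [6, 8]]
print(calls)  # {'fast': 3, 'slow': 1}
```

With three groups, the slow path runs once instead of three times. The trade-off in this toy model is that a later group for which the fast path would fail or disagree is no longer detected.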

cc @phofl @jbrockmendel

@rhshadrach rhshadrach added Bug Groupby Performance Memory or execution speed performance labels May 21, 2021
@jbrockmendel
Member

Should we just be calling choose_path on the first group to determine which can be used?

Worth a shot. It'd be a pretty nice simplification of spaghetti code on top of the perf benefit.
