PERF: Try fast/slow paths only once in DataFrameGroupby.transform #42195

rhshadrach · 2021-06-23T01:55:14Z

closes PERF: Calling slowpath for every group in transform #41598
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Still need to run ASVs to measure performance, making sure CI passes first.

Edit: ASV results for 7b86fdb

       before           after         ratio
     [9f6a91ae]       [7b86fdb9]
     <groupby_transform_path~1>       <groupby_transform_path>
-     23.8±0.07ms         21.4±2ms     0.90  groupby.Nth.time_series_nth_all('float32')
-      11.8±0.1ms       10.4±0.6ms     0.88  groupby.Nth.time_frame_nth('object')
-      27.1±0.1ms       23.3±0.1ms     0.86  groupby.Nth.time_series_nth_all('object')
-      19.4±0.2ms       16.5±0.2ms     0.85  groupby.Nth.time_groupby_nth_all('float64')
-      21.7±0.2ms       18.4±0.3ms     0.85  groupby.Nth.time_frame_nth_any('float32')
-      24.1±0.1ms       20.2±0.2ms     0.84  groupby.Nth.time_series_nth_any('float64')
-      24.0±0.2ms      20.1±0.05ms     0.84  groupby.Nth.time_series_nth_all('float64')
-      56.4±0.4ms       20.1±0.4ms     0.36  groupby.TransformEngine.time_dataframe_cython(False)
-      56.5±0.4ms       20.1±0.4ms     0.36  groupby.TransformEngine.time_dataframe_cython(True)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

jbrockmendel · 2021-06-23T03:55:50Z

pandas/core/groupby/generic.py

-            except ValueError as err:
-                msg = "transform must return a scalar value for each group"
-                raise ValueError(msg) from err
+            if path is None:


instead of doing this at each step in the loop, can we just do this outside the loop?

Yes - updated PR to show what this would look like. Minor cleanups can be done to avoid the nested if statements in the newly added process_result, but it's a bit ugly. We're trading if path is None for a function call as I don't see another way to avoid code duplication.

jbrockmendel · 2021-06-23T15:12:02Z

pandas/core/groupby/generic.py

+                            columns=group.columns,
+                            index=group.index,
+                        )
+                    applied.append(r)
            else:
                applied.append(res)


can process_result return res and do this append outside the func

One path uses continue, so the most immediate ways I see are to (a) use a sentinel value to indicate "do not add" or (b) move the non-adding branch outside of process_result (duplicating code).

Never mind, the path using continue can almost be moved ahead of any computation - it doesn't really depend on the result except for whether the result is a Series or not. I'm going to see if I can do a precursor cleanup of this method. Marking this as a draft for now.

OK. if the answer to my question is "no" thats OK too.

jreback · 2021-06-25T01:33:41Z

pandas/core/groupby/generic.py

-                msg = "transform must return a scalar value for each group"
-                raise ValueError(msg) from err
-
+        def process_result(applied, group, res):


can this be changed to return the object to be appended to applied? rather than mutating

jreback · 2021-07-04T21:38:22Z

@rhshadrach can you rebase

…oupby_transform_path � Conflicts: � pandas/core/groupby/generic.py

rhshadrach · 2021-07-07T13:59:44Z

@jreback @jbrockmendel Updated and ready for review.

I also tried only using the slow path (instead of choosing between fast/slow); this resulted in a 0.7 ratio for the benchmark groupby.TransformEngine.time_dataframe_cython(False) as opposed to a 0.3 in this PR. While I'd prefer to find a way to not have to choose between different paths based on the first group, the performance seems worth it in my opinion.

jbrockmendel · 2021-07-07T15:03:16Z

While I'd prefer to find a way to not have to choose between different paths based on the first group

IIUC you mean do some introspection on the function (or require users to specify somehow) to determine this instead of a try/except? That would be nice.

this resulted in a 0.7 ratio for the benchmark

What about for the groupby benchmarks more generally? The slow path has been optimized a bunch this year (xref #40263)

jreback · 2021-07-07T15:10:49Z

I think its fine to determine how to call it based on the first group. we need some metric.

pandas/core/groupby/generic.py

jreback · 2021-07-07T15:11:34Z

pandas/core/groupby/generic.py

-                msg = "transform must return a scalar value for each group"
-                raise ValueError(msg) from err
-
+        def process_result(


can this be a module level function?

Yes - can either make it a module-level by passing self.obj or make this into a class method. Do you see a reason to prefer one or the other?

i think prefer a module level

rhshadrach · 2021-07-07T18:15:34Z

@jbrockmendel

IIUC you mean do some introspection on the function (or require users to specify somehow) to determine this instead of a try/except? That would be nice.

Introspection seems fragile to me. I didn't consider user specification but I like the idea as long as it isn't required. Another option would be to restrict what UDFs are acceptable (e.g. must return Series, DataFrame, or NumPy array, and maybe some other things). For this latter option, we could then handle everything acceptable via the fastpath and remove the slowpath entirely. But perhaps the flexibility of current handling makes this an unwelcome trade-off.

What about for the groupby benchmarks more generally? The slow path has been optimized a bunch this year (xref #40263)

~~The OP has all the groupby ASVs. I don't expect any difference in performance with the updates to this PR but we re-run tonight to check.~~ I misread. Will run groupby ASVs using only slowpath tonight.

jbrockmendel · 2021-07-07T22:22:46Z

pandas/core/groupby/generic.py

+                index=group.index,
+            )
+        assert isinstance(r, DataFrame)
+        return r


can you call this e.g. res_frame? i.e. not 1 letter

pandas/core/groupby/generic.py

jreback · 2021-07-08T12:44:14Z

comment by @jbrockmendel ping on green

rhshadrach · 2021-07-08T14:49:31Z

@jbrockmendel - This is slowpath only (L1385, first line of _choose_path is return slow_path, slow_path(group)):

       before           after         ratio
     [9ffc924e]       [462500f7]
     <whatsnew_contributing~2>       <groupby_transform_path>
+      17.9±0.6ms         25.4±1ms     1.42  groupby.Cumulative.time_frame_transform('int64', 'cummin')
+     21.3±0.07ms       26.5±0.5ms     1.24  groupby.Nth.time_frame_nth_any('datetime')
+      25.9±0.7ms       31.4±0.4ms     1.21  groupby.Nth.time_frame_nth_any('object')
+     11.9±0.09ms       14.0±0.3ms     1.18  groupby.Nth.time_frame_nth('object')
+      23.5±0.1ms         26.7±3ms     1.14  groupby.Nth.time_series_nth_any('datetime')
-      71.1±0.5ms       57.4±0.5ms     0.81  groupby.TransformEngine.time_dataframe_cython(True)
-      71.0±0.5ms       56.9±0.2ms     0.80  groupby.TransformEngine.time_dataframe_cython(False)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

rhshadrach · 2021-07-08T15:14:15Z

@jreback - green

jreback · 2021-07-08T16:08:18Z

thanks @rhshadrach very nice!

PERF: Try fast/slow paths only once in DataFrameGroupby.transform

7b86fdb

rhshadrach added Groupby Performance Memory or execution speed performance Apply Apply, Aggregate, Transform, Map labels Jun 23, 2021

jbrockmendel reviewed Jun 23, 2021

View reviewed changes

rhshadrach added 2 commits June 23, 2021 07:08

Avoid checking path each iteration

b70e420

fix

15d238a

jbrockmendel reviewed Jun 23, 2021

View reviewed changes

rhshadrach marked this pull request as draft June 23, 2021 23:14

jreback reviewed Jun 25, 2021

View reviewed changes

rhshadrach added 3 commits July 7, 2021 09:50

Make process_result return instead of mutate

36de30e

Merge branch 'master' of https://github.com/pandas-dev/pandas into gr…

2c1d337

…oupby_transform_path � Conflicts: � pandas/core/groupby/generic.py

whatsnew

2d40eaa

rhshadrach marked this pull request as ready for review July 7, 2021 13:56

jreback added this to the 1.4 milestone Jul 7, 2021

jreback requested changes Jul 7, 2021

View reviewed changes

Comment, refactor, and narrow typing

ec9515b

jbrockmendel reviewed Jul 7, 2021

View reviewed changes

jreback approved these changes Jul 7, 2021

View reviewed changes

jbrockmendel reviewed Jul 7, 2021

View reviewed changes

pandas/core/groupby/generic.py Show resolved Hide resolved

r -> res_frame

2bea964

jreback merged commit 57bb165 into pandas-dev:master Jul 8, 2021

rhshadrach deleted the groupby_transform_path branch July 8, 2021 16:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: Try fast/slow paths only once in DataFrameGroupby.transform #42195

PERF: Try fast/slow paths only once in DataFrameGroupby.transform #42195

rhshadrach commented Jun 23, 2021 •

edited

Loading

jbrockmendel Jun 23, 2021

rhshadrach Jun 23, 2021

jbrockmendel Jun 23, 2021

rhshadrach Jun 23, 2021

rhshadrach Jun 23, 2021

jbrockmendel Jun 24, 2021

jreback Jun 25, 2021

jreback commented Jul 4, 2021

rhshadrach commented Jul 7, 2021

jbrockmendel commented Jul 7, 2021

jreback commented Jul 7, 2021

jreback Jul 7, 2021

rhshadrach Jul 7, 2021 •

edited

Loading

jreback Jul 7, 2021

rhshadrach commented Jul 7, 2021 •

edited

Loading

jbrockmendel Jul 7, 2021

jreback commented Jul 8, 2021

rhshadrach commented Jul 8, 2021

rhshadrach commented Jul 8, 2021

jreback commented Jul 8, 2021

PERF: Try fast/slow paths only once in DataFrameGroupby.transform #42195

PERF: Try fast/slow paths only once in DataFrameGroupby.transform #42195

Conversation

rhshadrach commented Jun 23, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jul 4, 2021

rhshadrach commented Jul 7, 2021

jbrockmendel commented Jul 7, 2021

jreback commented Jul 7, 2021

Choose a reason for hiding this comment

rhshadrach Jul 7, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rhshadrach commented Jul 7, 2021 • edited Loading

Choose a reason for hiding this comment

jreback commented Jul 8, 2021

rhshadrach commented Jul 8, 2021

rhshadrach commented Jul 8, 2021

jreback commented Jul 8, 2021

rhshadrach commented Jun 23, 2021 •

edited

Loading

rhshadrach Jul 7, 2021 •

edited

Loading

rhshadrach commented Jul 7, 2021 •

edited

Loading