PERF: Don't copy if already sorted in DataSplitter #52035
Conversation
would #51109 address some or all of the finalize/name slowdown you mentioned?
```python
@cache_readonly
def _sorted_data(self) -> NDFrameT:
    if lib.is_range_indexer(self._sort_idx, len(self.data)):
        # Check if DF already in correct order
        return self.data  # Need to take a copy later!
```
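For context, `lib.is_range_indexer` checks whether an indexer is exactly `0, 1, ..., n-1`, i.e. a no-op `take`. A pure-NumPy sketch of the idea (not the actual Cython implementation):

```python
import numpy as np

def is_range_indexer(indexer: np.ndarray, n: int) -> bool:
    """True iff `indexer` is exactly [0, 1, ..., n-1], i.e. a no-op take."""
    if len(indexer) != n:
        return False
    return bool((indexer == np.arange(n)).all())

already_sorted = is_range_indexer(np.array([0, 1, 2]), 3)  # take would be a no-op
needs_take = is_range_indexer(np.array([2, 0, 1]), 3)      # reorder is needed
```

When the check passes, the sort/take can be skipped entirely and the original frame handed out directly, which is what makes the copy question below relevant.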
Can you flesh out this comment a little bit? This is the sort of thing where I would preface a comment with "NB:".
The reason not to make a copy here (once) is memory: to avoid holding a full copy in memory at once?
Because in the end the full data will be copied anyway when copying each slice; the same amount of data just gets copied across multiple copy() calls, which introduces more overhead.
Yeah, that's basically the idea behind it.
For CoW, this is already optimized, since take will return a shallow copy.
I'm trying to see if this is viable for non-CoW because I'm trying to apply this optimization to unsorted arrays as well (the idea is to only take a chunk of the unsorted array at a time, using the argsort indices, instead of the whole thing).
So far, I've gotten the copy() call overhead down to ~1.5 from 2.3.
Most of the overhead that's left is from pandas itself (e.g. creating a view of the Index, BlockManager stuff), and the actual copy of the group values is relatively cheap.
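The chunked-take idea described above can be sketched roughly as follows (hypothetical helper name; the real DataSplitter operates on DataFrames and BlockManagers, not raw NumPy arrays):

```python
import numpy as np

def iter_sorted_chunks(values, sort_idx, group_bounds):
    # Instead of materializing the fully sorted array up front
    # (values.take(sort_idx)), take only the slice of argsort indices
    # belonging to each group as that group is yielded. Peak memory is
    # one group's worth of data rather than a full sorted copy.
    for start, end in group_bounds:
        yield values.take(sort_idx[start:end])

values = np.array([30, 10, 20, 40])
sort_idx = np.argsort(values)  # [1, 2, 0, 3]
chunks = list(iter_sorted_chunks(values, sort_idx, [(0, 2), (2, 4)]))
```

Each chunk is a fresh array produced by `take`, so no separate `.copy()` per slice is needed in this scheme.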
```diff
-            yield self._chop(sdata, slice(start, end))
+            sliced = self._chop(sdata, slice(start, end))
+            if self._sorted_data is self.data:
+                yield sliced.copy()
```
can we avoid this? ideas:
- with CoW it should be unnecessary right?
- i think we document that UDFs with side-effects are not supported (or at least caveat emptor)
- could pre-check if this is a UDF and not copy in non-UDF cases?
- doesn't avoid the copy, but defining chop=self._chop, data=self.data, sorted_data=self._sorted_data inside the method should make this iteration marginally faster in large-ngroups cases
- could do the self._sorted_data is self.data check outside the loop.
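The hoisting suggestions in the last two bullets amount to binding attribute lookups to locals before the loop. A toy stand-in class (hypothetical names, not the actual pandas DataSplitter) to illustrate:

```python
class Splitter:
    """Toy stand-in for DataSplitter to illustrate hoisting lookups."""

    def __init__(self, data, starts, ends):
        self.data = data
        self._starts = starts
        self._ends = ends

    def _chop(self, sdata, slc):
        return sdata[slc]

    @property
    def _sorted_data(self):
        return self.data  # pretend the data is already sorted

    def __iter__(self):
        # Hoist attribute lookups into locals once, outside the loop;
        # with large ngroups the repeated self._chop / self._sorted_data
        # lookups are measurable.
        chop = self._chop
        sdata = self._sorted_data
        needs_copy = sdata is self.data  # identity check done once, not per group
        for start, end in zip(self._starts, self._ends):
            sliced = chop(sdata, slice(start, end))
            yield list(sliced) if needs_copy else sliced

groups = list(Splitter([1, 2, 3, 4], [0, 2], [2, 4]))
```

Each local-variable load in the loop replaces an attribute lookup, and the `is` comparison moves out of the hot path entirely.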
For the CoW case, deep=None inside the copy call would make this a no-op.
https://pandas.pydata.org/docs/user_guide/gotchas.html#mutating-with-user-defined-function-udf-methods says that mutation is not allowed, but the fact that mutation doesn't mess up the DF due to a copy being made is explicitly tested for in tests, right?
e.g. maybe `test_mutate_groups`.
I don't think 4, 5 make a difference (at least looking at a snakeviz profile).
Somehow, the groupby benchmarks are really flaky for me comparing this PR to main, since sometimes I get 10-50% regression, but not when doing a comparison with itself (as a sanity test). The improvements are not flaky, though.
The culprit is probably duplicate work between is_range_indexer and array_equal_fast (in take). Curiously, this doesn't show up in the snakeviz profile...
```diff
@@ -1320,7 +1320,7 @@ def convert_fast(x):

 def convert_force_pure(x):
     # base will be length 0
-    assert len(x.values.base) > 0
+    # assert len(x.values.base) > 0
```
delete?
I wanted to ask whether this test is accurate; I don't know what it does right now.
With current behavior, x.values is a view of the sorted DF (which is a copy of the original DF), so it passes.
It fails with this PR, since we're not materializing a sorted DF copy when sorted.
Do you know what this is supposed to check for?
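For reference, the assertion under discussion hinges on NumPy's `.base` attribute: a slice that is a view of a larger buffer has a non-None `.base`, while a fresh copy owns its data. So `len(x.values.base) > 0` passing meant each group's values were a view of the materialized sorted copy, which this PR no longer creates in the already-sorted case:

```python
import numpy as np

arr = np.arange(6)

view = arr[2:5]          # a slice is a view: .base points back at arr
copy = arr[2:5].copy()   # a copy owns its memory: .base is None

view_is_view = view.base is arr
copy_is_view = copy.base is not None
```

This is why the test behaves differently depending on whether a sorted intermediate frame is materialized.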
No idea, I'd have to dig into the git history.
Profiling this against the main branch: it looks like some of the slowdown is caused by extra calls to [...]. This makes me want to revisit optimizing get_slice and external_values, though that won't do much good here.
```diff
@@ -71,7 +71,7 @@ def time_groupby_apply_dict_return(self):

 class Apply:
     param_names = ["factor"]
-    params = [4, 5]
+    params = [5, 6]
```
What's the reason for changing this?
And if this needs to be changed, the comments and code below also have to be updated (they assume values of 4 and 5).
Closes #43011?
For the non-CoW case, yes. This is already optimized for CoW (since a no-op take there will return a shallow copy).
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Closing to clear the queue for now. Feel free to reopen when you have time to revisit.
Benchmarks: the regression in apply_dict_return is real, and I think the .copy() method has a lot of overhead (both the finalize and the name validation, I think). No memory improvement shows up there since the size isn't big enough.