PERF: Don't copy if already sorted in DataSplitter #52035

Closed

Conversation

@lithomas1 (Member) commented Mar 16, 2023

       before           after         ratio
     [2a983bb1]       [2a980997]
     <datasplitter-ordered-nocopy~1>       <datasplitter-ordered-nocopy>
+      14.1±0.4ms       32.8±0.3ms     2.32  groupby.ApplyDictReturn.time_groupby_apply_dict_return
-     3.56±0.03ms      3.01±0.07ms     0.85  groupby.Apply.time_sorted_scalar_function_single_col(5)
-            236M             197M     0.83  groupby.Apply.peakmem_sorted_scalar_function_multi_col(6)
-            220M             182M     0.83  groupby.Apply.peakmem_sorted_scalar_function_single_col(6)
-        51.4±4ms       34.1±0.5ms     0.66  groupby.Apply.time_sorted_scalar_function_single_col(6)

Benchmarks.
The regression in apply_dict_return is real; I think the .copy() method has a lot of overhead (both the __finalize__ call and name validation, I think).
No memory improvement shows up there since the data isn't big enough.
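
A minimal sketch (not from this PR; public pandas API only) of the per-call overhead in question: copying a small pandas object is dominated by metadata work such as __finalize__ and name handling, not by the byte copy itself.

    import timeit

    import numpy as np
    import pandas as pd

    s = pd.Series(np.arange(10), name="x")
    # Full Series.copy(): allocates, then runs __finalize__ / name plumbing.
    print(timeit.timeit(s.copy, number=100_000))
    # Raw ndarray copy of the same data, for comparison: far cheaper per call.
    print(timeit.timeit(s.to_numpy().copy, number=100_000))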

@jbrockmendel (Member)

Would #51109 address some or all of the finalize/name slowdown you mentioned?


@cache_readonly
def _sorted_data(self) -> NDFrameT:
    if lib.is_range_indexer(self._sort_idx, len(self.data)):
        # Check if DF already in correct order
        return self.data  # Need to take a copy later!
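
For context, a small sketch of what this fast path checks (lib.is_range_indexer is the real internal helper; the example values are illustrative): it returns True only when the indexer is exactly 0..n-1, i.e. the data is already in group order and a take() would be a no-op.

    import numpy as np
    from pandas._libs import lib  # internal API, subject to change

    sort_idx = np.arange(5, dtype=np.intp)
    assert lib.is_range_indexer(sort_idx, 5)        # identity order: skip the take
    shuffled = np.array([1, 0, 2, 3, 4], dtype=np.intp)
    assert not lib.is_range_indexer(shuffled, 5)    # must still reorder via take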
Member

Can you flesh out this comment a little bit? This is the sort of thing where I would preface the comment with "NB:".

Member

Is the reason not to make a copy here (all at once) about memory, i.e. avoiding holding a full copy in memory at one time? Because in the end the full data gets copied anyway when each slice is copied; it's just that the same amount of data is copied across many copy() calls, which introduces more overhead.

Member Author

Yeah, that's basically the idea behind it.

For CoW, this is already optimized, since take will return a shallow copy.
I'm trying to see if this is viable for non-CoW, because I want to apply this optimization to unsorted arrays as well (the idea is to take only one chunk of the unsorted array at a time, using the argsort indices, instead of the whole thing).

So far, I've gotten the copy() call overhead down to ~1.5x from 2.3x.
Most of the overhead that's left is from pandas itself (e.g. creating a view of the Index, BlockManager stuff), and the actual copy of the group values is relatively cheap.
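
A hypothetical illustration of that chunked idea (the group boundaries are hard-coded here; the real splitter derives them from the grouper):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"key": [2, 0, 1, 0, 2], "val": range(5)})
    sort_idx = df["key"].argsort(kind="stable").to_numpy()

    # Take only one group's rows at a time instead of materializing
    # the whole sorted frame up front.
    for start, end in [(0, 2), (2, 3), (3, 5)]:   # per-group slices of sort_idx
        chunk = df.take(sort_idx[start:end])      # copies just this group's rows
        print(chunk)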

-            yield self._chop(sdata, slice(start, end))
+            sliced = self._chop(sdata, slice(start, end))
+            if self._sorted_data is self.data:
+                yield sliced.copy()
@jbrockmendel (Member) commented Mar 16, 2023

Can we avoid this copy? Some ideas:

  1. With CoW it should be unnecessary, right?
  2. I think we document that UDFs with side effects are not supported (or at least caveat emptor).
  3. Could we pre-check whether this is a UDF and skip the copy in non-UDF cases?
  4. Doesn't avoid the copy, but binding chop=self._chop, data=self.data, sorted_data=self._sorted_data to locals inside the method should make this iteration marginally faster in large-ngroups cases (see the sketch after this list).
  5. The self._sorted_data is self.data check could be done outside the loop.
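
A hypothetical sketch of ideas 4 and 5 combined (names loosely mirror the DataSplitter internals; _get_bounds is an invented stand-in for however the start/end positions are computed):

    def __iter__(self):
        chop = self._chop                  # idea 4: locals avoid repeated
        sdata = self._sorted_data          # attribute lookups in the hot loop
        needs_copy = sdata is self.data    # idea 5: hoist the check out of the loop
        starts, ends = self._get_bounds()  # hypothetical helper, not the real API
        for start, end in zip(starts, ends):
            sliced = chop(sdata, slice(start, end))
            yield sliced.copy() if needs_copy else sliced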

Member Author

For the CoW case, deep=None inside the copy call would make this a no-op.

https://pandas.pydata.org/docs/user_guide/gotchas.html#mutating-with-user-defined-function-udf-methods says that mutation is not allowed, but the fact that mutation doesn't mess up the DataFrame (because a copy is made) is explicitly tested for in the test suite, right?

e.g. maybe

def test_mutate_groups():
?
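
For reference, a minimal sketch (assumed; not the actual pandas test) of the behavior such a test would pin down:

    import pandas as pd

    df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

    def mutator(g):
        g["val"] = 0          # UDF mutates the group it is handed
        return g["val"].sum()

    df.groupby("key").apply(mutator)
    # The original frame must be untouched, i.e. each group was a copy.
    assert df["val"].tolist() == [1, 2, 3]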

I don't think 4 or 5 make a difference (at least looking at a snakeviz profile).

Somehow, the groupby benchmarks are really flaky for me when comparing this PR to main: sometimes I get a 10-50% regression, but not when comparing a commit against itself (as a sanity test). The improvements are not flaky, though.

The culprit is probably duplicate work between is_range_indexer and array_equal_fast (in take). Curiously, this doesn't show up in the snakeviz profile...
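
To illustrate the suspected duplication (np.array_equal stands in here for the internal array_equal_fast): both checks make a full O(n) pass over the indexer, so running one after the other scans it twice.

    import numpy as np
    from pandas._libs import lib  # internal API, subject to change

    idx = np.arange(1_000_000, dtype=np.intp)
    lib.is_range_indexer(idx, len(idx))        # pass 1: is the indexer 0..n-1?
    np.array_equal(idx, np.arange(len(idx)))   # pass 2: effectively the same scan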

@@ -1320,7 +1320,7 @@ def convert_fast(x):

 def convert_force_pure(x):
     # base will be length 0
-    assert len(x.values.base) > 0
+    # assert len(x.values.base) > 0
Member

delete?

Member Author

I wanted to ask whether this test is accurate; I don't know what it does right now.

With the current behavior, x.values is a view of the sorted DataFrame (which is a copy of the original DataFrame), so the assertion passes.

It fails with this PR, since we no longer materialize a sorted copy when the data is already sorted.

Do you know what this is supposed to check for?

Member

No idea, I'd have to dig into the git history.

@jbrockmendel (Member)

Profiling this with

from asv_bench.benchmarks.groupby import *

cls = ApplyDictReturn
self = cls()
self.setup()

%prun -s cumtime for n in range(100): self.time_groupby_apply_dict_return()

main

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    [...]
      100    0.291    0.003    4.381    0.044 ops.py:1000(apply_groupwise)
   100100    0.101    0.000    2.755    0.000 ops.py:1252(__iter__)
   100000    0.255    0.000    2.633    0.000 ops.py:1274(_chop)
   100200    0.399    0.000    1.005    0.000 generic.py:5985(__finalize__)
   100000    0.403    0.000    0.945    0.000 managers.py:1990(get_slice)
   100000    0.121    0.000    0.945    0.000 groupby.py:68(<lambda>)
   200000    0.083    0.000    0.823    0.000 series.py:672(values)
   200000    0.228    0.000    0.740    0.000 managers.py:2016(external_values)
   200500    0.200    0.000    0.713    0.000 series.py:667(name)

branch

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    [...]
      100    0.368    0.004    9.000    0.090 ops.py:1000(apply_groupwise)
   100100    0.267    0.000    7.213    0.000 ops.py:1252(__iter__)
   100000    0.267    0.000    4.328    0.000 generic.py:6394(copy)
   100000    0.273    0.000    2.614    0.000 ops.py:1281(_chop)
   200100    0.820    0.000    1.934    0.000 generic.py:5985(__finalize__)
   100000    0.260    0.000    1.836    0.000 managers.py:620(copy)
   200200    0.263    0.000    1.628    0.000 series.py:373(__init__)
   400300    0.397    0.000    1.358    0.000 series.py:667(name)

Looks like some of the slowdown is caused by extra calls to __finalize__, which add up. (#51109 or #51784 might help; #51280 definitely would once enforced.)

This makes me want to revisit optimizing get_slice and external_values, though that won't do much good here.

@@ -71,7 +71,7 @@ def time_groupby_apply_dict_return(self):

 class Apply:
     param_names = ["factor"]
-    params = [4, 5]
+    params = [5, 6]
Member

What's the reason for changing this?
And if it does need to change, the comments and code below also have to be updated (they assume values of 4 and 5).
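
For context, a hedged sketch of an asv-style parameterized benchmark like the one being edited (the real groupby.Apply in asv_bench/benchmarks/groupby.py differs in detail; assume setup scales the data as 10**factor):

    import numpy as np
    import pandas as pd

    class Apply:
        param_names = ["factor"]
        params = [4, 5]

        def setup(self, factor):
            N = 10**factor
            self.df = pd.DataFrame({
                "key": np.random.randint(0, N // 5, size=N),
                "value": np.random.randn(N),
            })

        def time_scalar_function_single_col(self, factor):
            self.df.groupby("key")["value"].apply(lambda x: 1)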

@jbrockmendel (Member)

Closes #43011?

@lithomas1 (Member, Author)

> Closes #43011?

For the non-CoW case, yes. This is already optimized for CoW (since a no-op take there will return a shallow copy).
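
A small sketch of the CoW behavior referenced here (the option name is real in pandas 2.x; the example is illustrative): with copy-on-write enabled, a take with an identity indexer returns a new object that still shares its buffers, so nothing is actually copied until a write.

    import numpy as np
    import pandas as pd

    pd.set_option("mode.copy_on_write", True)
    df = pd.DataFrame({"a": np.arange(5)})
    out = df.take(np.arange(5))    # no-op indexer: shallow copy under CoW
    assert out is not df
    assert np.shares_memory(out["a"].to_numpy(), df["a"].to_numpy())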

@github-actions bot (Contributor) commented May 1, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 1, 2023
@mroeschke (Member)

Closing to clear the queue for now. Feel free to reopen when you have time to revisit

@mroeschke mroeschke closed this Aug 1, 2023
@mroeschke mroeschke added the Mothballed Temporarily-closed PR the author plans to return to label Aug 1, 2023
Development

Successfully merging this pull request may close these issues:

PERF: cache sorted data in GroupBy?