PERF: Don't copy if already sorted in DataSplitter #52035
Conversation
would #51109 address some or all of the finalize/name slowdown you mentioned?
```python
@cache_readonly
def _sorted_data(self) -> NDFrameT:
    if lib.is_range_indexer(self._sort_idx, len(self.data)):
        # Check if DF already in correct order
        return self.data  # Need to take a copy later!
```
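For context, `lib.is_range_indexer` checks whether an indexer is exactly `0, 1, ..., n-1`, i.e. a no-op `take`. A pure-NumPy sketch of the idea (not the actual Cython implementation):

```python
import numpy as np

def is_range_indexer(indexer: np.ndarray, n: int) -> bool:
    """True iff `indexer` is exactly [0, 1, ..., n-1], i.e. a no-op take."""
    if len(indexer) != n:
        return False
    return bool((indexer == np.arange(n)).all())

already_sorted = is_range_indexer(np.array([0, 1, 2]), 3)  # take would be a no-op
needs_take = is_range_indexer(np.array([2, 0, 1]), 3)      # reorder is needed
```

When the check passes, the sort/take can be skipped entirely and the original frame handed out directly, which is what makes the copy question below relevant.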
Can you flesh out this comment a little bit? This is the sort of thing where I would preface a comment with "NB:".
The reason not to make a copy here (once) is memory: to avoid holding a full copy in memory at once?
Because in the end the full data will be copied anyway when copying each slice; the same amount of data just gets copied across multiple copy() calls, which introduces more overhead.
Yeah, that's basically the idea behind it.
For CoW, this is already optimized, since take will return a shallow copy.
I'm trying to see if this is viable for non-CoW because I'm trying to apply this optimization to unsorted arrays as well (the idea is to only take a chunk of the unsorted array at a time, using the argsort indices, instead of the whole thing).
So far, I've gotten the copy() call overhead down to ~1.5 from 2.3.
Most of the overhead that's left is from pandas itself (e.g. creating a view of the Index, BlockManager stuff), and the actual copy of the group values is relatively cheap.
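The chunked-take idea described above can be sketched roughly as follows (hypothetical helper name; the real DataSplitter operates on DataFrames and BlockManagers, not raw NumPy arrays):

```python
import numpy as np

def iter_sorted_chunks(values, sort_idx, group_bounds):
    # Instead of materializing the fully sorted array up front
    # (values.take(sort_idx)), take only the slice of argsort indices
    # belonging to each group as that group is yielded. Peak memory is
    # one group's worth of data rather than a full sorted copy.
    for start, end in group_bounds:
        yield values.take(sort_idx[start:end])

values = np.array([30, 10, 20, 40])
sort_idx = np.argsort(values)  # [1, 2, 0, 3]
chunks = list(iter_sorted_chunks(values, sort_idx, [(0, 2), (2, 4)]))
```

Each chunk is a fresh array produced by `take`, so no separate `.copy()` per slice is needed in this scheme.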
```diff
-            yield self._chop(sdata, slice(start, end))
+            sliced = self._chop(sdata, slice(start, end))
+            if self._sorted_data is self.data:
+                yield sliced.copy()
```
can we avoid this? ideas:
- with CoW it should be unnecessary right?
- i think we document that UDFs with side-effects are not supported (or at least caveat emptor)
- could pre-check if this is a UDF and not copy in non-UDF cases?
- doesn't avoid the copy, but defining chop=self._chop, data=self.data, sorted_data=self._sorted_data inside the method should make this iteration marginally faster in large-ngroups cases
- could do the self._sorted_data is self.data check outside the loop.
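The hoisting suggestions in the last two bullets amount to binding attribute lookups to locals before the loop. A toy stand-in class (hypothetical names, not the actual pandas DataSplitter) to illustrate:

```python
class Splitter:
    """Toy stand-in for DataSplitter to illustrate hoisting lookups."""

    def __init__(self, data, starts, ends):
        self.data = data
        self._starts = starts
        self._ends = ends

    def _chop(self, sdata, slc):
        return sdata[slc]

    @property
    def _sorted_data(self):
        return self.data  # pretend the data is already sorted

    def __iter__(self):
        # Hoist attribute lookups into locals once, outside the loop;
        # with large ngroups the repeated self._chop / self._sorted_data
        # lookups are measurable.
        chop = self._chop
        sdata = self._sorted_data
        needs_copy = sdata is self.data  # identity check done once, not per group
        for start, end in zip(self._starts, self._ends):
            sliced = chop(sdata, slice(start, end))
            yield list(sliced) if needs_copy else sliced

groups = list(Splitter([1, 2, 3, 4], [0, 2], [2, 4]))
```

Each local-variable load in the loop replaces an attribute lookup, and the `is` comparison moves out of the hot path entirely.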
For the CoW case, deep=None inside the copy call would make this a no-op.
https://pandas.pydata.org/docs/user_guide/gotchas.html#mutating-with-user-defined-function-udf-methods says that mutation is not allowed, but the fact that mutation doesn't mess up the DF due to a copy being made is explicitly tested for in tests, right?
e.g. maybe `test_mutate_groups`.
I don't think 4, 5 make a difference (at least looking at a snakeviz profile).
Somehow, the groupby benchmarks are really flaky for me comparing this PR to main, since sometimes I get 10-50% regression, but not when doing a comparison with itself (as a sanity test). The improvements are not flaky, though.
The culprit is probably duplicate work between is_range_indexer and array_equal_fast (in take). Curiously, this doesn't show up in the snakeviz profile...
```diff
@@ -1320,7 +1320,7 @@ def convert_fast(x):

 def convert_force_pure(x):
     # base will be length 0
-    assert len(x.values.base) > 0
+    # assert len(x.values.base) > 0
```
delete?
I wanted to ask whether this test is accurate; I don't know what it does right now.
With current behavior, x.values is a view of the sorted DF (which is a copy of the original DF), so it passes.
It fails with this PR, since we're not materializing a sorted DF copy when sorted.
Do you know what this is supposed to check for?
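For reference, the assertion under discussion hinges on NumPy's `.base` attribute: a slice that is a view of a larger buffer has a non-None `.base`, while a fresh copy owns its data. So `len(x.values.base) > 0` passing meant each group's values were a view of the materialized sorted copy, which this PR no longer creates in the already-sorted case:

```python
import numpy as np

arr = np.arange(6)

view = arr[2:5]          # a slice is a view: .base points back at arr
copy = arr[2:5].copy()   # a copy owns its memory: .base is None

view_is_view = view.base is arr
copy_is_view = copy.base is not None
```

This is why the test behaves differently depending on whether a sorted intermediate frame is materialized.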
No idea, I'd have to dig into the git history.
Profiling this against the main branch: it looks like some of the slowdown is caused by extra calls to [...]. This makes me want to revisit optimizing get_slice and external_values, though that won't do much good here.
```diff
@@ -71,7 +71,7 @@ def time_groupby_apply_dict_return(self):

 class Apply:
     param_names = ["factor"]
-    params = [4, 5]
+    params = [5, 6]
```
What's the reason for changing this?
And if this needs to be changed, the comments and code below also have to be updated (they assume values of 4 and 5).
Closes #43011?
For the non-CoW case, yes. This is already optimized for CoW (since a no-op take there will return a shallow copy).
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Closing to clear the queue for now. Feel free to reopen when you have time to revisit.
Benchmarks: the regression in apply_dict_return is real, and I think the .copy() method has a lot of overhead (both the finalize and the name validation, I think). No memory improvement shows up there since the size isn't big enough.