-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Don't copy if already sorted in DataSplitter #52035
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -1260,10 +1260,17 @@ def __iter__(self) -> Iterator: | |||
starts, ends = lib.generate_slices(self._slabels, self.ngroups) | ||||
|
||||
for start, end in zip(starts, ends): | ||||
yield self._chop(sdata, slice(start, end)) | ||||
sliced = self._chop(sdata, slice(start, end)) | ||||
if self._sorted_data is self.data: | ||||
yield sliced.copy() | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can we avoid this? ideas:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For the CoW case, https://pandas.pydata.org/docs/user_guide/gotchas.html#mutating-with-user-defined-function-udf-methods says that mutation is not allowed, but the fact that mutation doesn't mess up the DF due to a copy being made is explicitly tested for in tests, right? e.g. maybe
I don't think 4, 5 make a difference (at least looking at a snakeviz profile). Somehow, the groupby benchmarks are really flaky for me comparing this PR to main, since sometimes I get 10-50% regression, but not when doing a comparison with itself (as a sanity test). The improvements are not flaky, though. The culprit is probably is duplicate work between |
||||
else: | ||||
yield sliced | ||||
|
||||
@cache_readonly | ||||
def _sorted_data(self) -> NDFrameT: | ||||
if lib.is_range_indexer(self._sort_idx, len(self.data)): | ||||
# Check if DF already in correct order | ||||
return self.data # Need to take a copy later! | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you flesh out this comment a little bit. This is the sort of thing where I would preface a comment with "NB:" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The reason to not make a copy here (once) is for memory reasons (to not have a full copy at once in memory)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, that's basically the idea behind it. For CoW, this is already optimized, since take will return a shallow copy. So far, I've gotten the copy() call overhead down to ~1.5 from 2.3. |
||||
return self.data.take(self._sort_idx, axis=self.axis) | ||||
|
||||
def _chop(self, sdata, slice_obj: slice) -> NDFrame: | ||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1320,7 +1320,7 @@ def convert_fast(x): | |
|
||
def convert_force_pure(x): | ||
# base will be length 0 | ||
assert len(x.values.base) > 0 | ||
# assert len(x.values.base) > 0 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. delete? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wanted to ask if this test is accurate. IDK what it does right now. With current behavior, x.values is a view of the sorted DF (which is a copy of the original DF), so it passes. It fails with this PR, since we're not materializing a sorted DF copy when sorted. Do you know what this is supposed to check for? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no idea, id have to dig into the git history |
||
return Decimal(str(x.mean())) | ||
|
||
grouped = s.groupby(labels) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for changing this?
And if this needs to be changed, also the comments and code below have to be updated (they assume values of 4 and 5)