PERF: Don't copy if already sorted in DataSplitter #52035

Closed
21 changes: 20 additions & 1 deletion asv_bench/benchmarks/groupby.py
@@ -71,7 +71,7 @@ def time_groupby_apply_dict_return(self):

class Apply:
    param_names = ["factor"]
-    params = [4, 5]
+    params = [5, 6]
Member commented:
What's the reason for changing this?
If it does need to change, the comments and code below also have to be updated (they assume values of 4 and 5).
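For context, `setup()` below derives the row count as `N = 10**factor`; a quick arithmetic check of what the params change means for benchmark sizes (my own sketch, not part of the diff):

```python
# Effect of changing params from [4, 5] to [5, 6], given N = 10**factor.
old_sizes = [10**f for f in (4, 5)]
new_sizes = [10**f for f in (5, 6)]
```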


    def setup(self, factor):
        N = 10**factor
@@ -91,13 +91,32 @@ def setup(self, factor):
            }
        )
        self.df = df
+        self.df_sorted = df.sort_values(by="key")

    def time_scalar_function_multi_col(self, factor):
        self.df.groupby(["key", "key2"]).apply(lambda x: 1)

    def time_scalar_function_single_col(self, factor):
        self.df.groupby("key").apply(lambda x: 1)

+    def peakmem_scalar_function_multi_col(self, factor):
+        self.df.groupby(["key", "key2"]).apply(lambda x: 1)
+
+    def peakmem_scalar_function_single_col(self, factor):
+        self.df.groupby("key").apply(lambda x: 1)
+
+    def time_sorted_scalar_function_multi_col(self, factor):
+        self.df_sorted.groupby(["key", "key2"]).apply(lambda x: 1)
+
+    def time_sorted_scalar_function_single_col(self, factor):
+        self.df_sorted.groupby("key").apply(lambda x: 1)
+
+    def peakmem_sorted_scalar_function_multi_col(self, factor):
+        self.df_sorted.groupby(["key", "key2"]).apply(lambda x: 1)
+
+    def peakmem_sorted_scalar_function_single_col(self, factor):
+        self.df_sorted.groupby("key").apply(lambda x: 1)

    @staticmethod
    def df_copy_function(g):
        # ensure that the group name is available (see GH #15062)
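The added benchmarks run the same `apply` on a pre-sorted frame, which is the case the PR optimizes. A minimal standalone sketch of that scenario (the data shape and seed are my own assumptions, not the actual asv setup):

```python
import numpy as np
import pandas as pd

# Build an unsorted frame and a copy pre-sorted by the grouping key,
# mirroring the df / df_sorted pair in the benchmark setup.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "key": rng.integers(0, 100, size=10_000),
        "value": rng.random(10_000),
    }
)
df_sorted = df.sort_values(by="key")

# The results are identical; only the splitter's copying behavior differs.
res_unsorted = df.groupby("key").apply(lambda x: 1)
res_sorted = df_sorted.groupby("key").apply(lambda x: 1)
```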
9 changes: 8 additions & 1 deletion pandas/core/groupby/ops.py
@@ -1260,10 +1260,17 @@ def __iter__(self) -> Iterator:
        starts, ends = lib.generate_slices(self._slabels, self.ngroups)

        for start, end in zip(starts, ends):
-            yield self._chop(sdata, slice(start, end))
+            sliced = self._chop(sdata, slice(start, end))
+            if self._sorted_data is self.data:
+                yield sliced.copy()
jbrockmendel (Member) commented on Mar 16, 2023:
can we avoid this? ideas:

  1. with CoW it should be unnecessary right?
  2. i think we document that UDFs with side-effects are not supported (or at least caveat emptor)
  3. could pre-check if this is a UDF and not copy in non-UDF cases?
  4. doesn't avoid the copy, but defining chop=self._chop, data=self.data, sorted_data=self._sorted_data inside the method should make this iteration marginally faster in large-ngroups cases
  5. could do the self._sorted_data is self.data check outside the loop.
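A rough sketch of what ideas 4 and 5 would look like (hypothetical standalone helper, not the actual ops.py code; plain lists stand in for DataFrames):

```python
def iter_chunks(data, sorted_data, chop, starts, ends):
    # Idea 5: hoist the identity check out of the per-group loop.
    # Idea 4: bind attributes to locals once before iterating.
    needs_copy = sorted_data is data
    for start, end in zip(starts, ends):
        sliced = chop(sorted_data, slice(start, end))
        yield sliced.copy() if needs_copy else sliced

# Usage with lists as stand-ins: two groups of two rows each.
data = [1, 2, 3, 4]
chunks = list(iter_chunks(data, data, lambda d, s: d[s], [0, 2], [2, 4]))
```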

Member Author replied:

For the CoW case, deep=None inside the copy call would make this a no-op.

https://pandas.pydata.org/docs/user_guide/gotchas.html#mutating-with-user-defined-function-udf-methods says that mutation is not allowed, but the fact that mutation doesn't mess up the DataFrame (because a copy is made) is explicitly tested for, right?

e.g. maybe def test_mutate_groups()?

I don't think 4, 5 make a difference (at least looking at a snakeviz profile).

Somehow, the groupby benchmarks are really flaky for me comparing this PR to main, since sometimes I get 10-50% regression, but not when doing a comparison with itself (as a sanity test). The improvements are not flaky, though.

The culprit is probably duplicate work between is_range_indexer and array_equal_fast (in take). Curiously, this doesn't show up in the snakeviz profile...

+            else:
+                yield sliced
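The copy matters because, without it, a mutating UDF would write through a view into the caller's frame. A small illustration of the view/copy distinction (ordinary pandas usage, not the splitter itself):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
view = df.iloc[0:2]    # a row slice can share data with df
chunk = view.copy()    # copying before handing it out, as the splitter does
chunk.iloc[0, 0] = 99  # a "UDF" mutates the object it received
# df is unaffected because the mutation hit the copy, not df's data.
```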

    @cache_readonly
    def _sorted_data(self) -> NDFrameT:
+        if lib.is_range_indexer(self._sort_idx, len(self.data)):
+            # Check if DF already in correct order
+            return self.data  # Need to take a copy later!
Member commented:
can you flesh out this comment a little bit. This is the sort of thing where I would preface a comment with "NB:"

Member commented:
The reason to not make a copy here (once) is for memory reasons, i.e. to avoid holding a full copy in memory at once?
Because in the end the full data gets copied anyway when copying each slice; the same amount of data is just copied across multiple copy() calls, which introduces more overhead.

Member Author replied:
Yeah, that's basically the idea behind it.

For CoW, this is already optimized, since take will return a shallow copy.
I'm trying to see if this is viable for non-CoW because I'm trying to apply this optimization to unsorted arrays as well (the idea is to only take a chunk of the unsorted array at a time instead of the whole thing with the argsort indices).

So far, I've gotten the copy() call overhead down to ~1.5 from 2.3.
Most of the overhead that's left is from pandas itself (e.g. creating a view of the Index, BlockManager stuff), and the actual copy of the group values is relatively cheap.

        return self.data.take(self._sort_idx, axis=self.axis)

    def _chop(self, sdata, slice_obj: slice) -> NDFrame:
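The branch above short-circuits the take when the argsort indexer is the identity permutation. A hedged pure-NumPy re-implementation of the check lib.is_range_indexer performs (the real one is Cython in pandas._libs.lib; the helper name here is my own):

```python
import numpy as np

def is_identity_indexer(indexer: np.ndarray, n: int) -> bool:
    # True when taking with this indexer would reorder nothing,
    # i.e. the data is already in sorted order.
    return len(indexer) == n and bool((indexer == np.arange(n)).all())

# Unsorted input: argsort yields a non-trivial permutation, take is needed.
sort_idx = np.argsort(np.array([3, 1, 2]), kind="stable")
# Already-sorted input: argsort yields the identity, take can be skipped.
identity_idx = np.argsort(np.array([1, 2, 3]), kind="stable")
```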
2 changes: 1 addition & 1 deletion pandas/tests/groupby/test_groupby.py
@@ -1320,7 +1320,7 @@ def convert_fast(x):

    def convert_force_pure(x):
        # base will be length 0
-        assert len(x.values.base) > 0
+        # assert len(x.values.base) > 0
Member commented:
delete?

Member Author replied:
I wanted to ask whether this test is still accurate; I don't know what it checks right now.

With current behavior, x.values is a view of the sorted DF (which is a copy of the original DF), so it passes.

It fails with this PR, since we're not materializing a sorted DF copy when sorted.

Do you know what this is supposed to check for?

Member commented:
No idea, I'd have to dig into the git history.

        return Decimal(str(x.mean()))

    grouped = s.groupby(labels)
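For reference on what the disputed assertion probes: ndarray.base is the owning array when an array is a view, and None when it is a fresh copy. This is plain NumPy behavior, not pandas-specific:

```python
import numpy as np

arr = np.arange(6)
view = arr[1:4]          # slicing produces a view; .base points at arr
owned = arr[1:4].copy()  # an explicit copy owns its data; .base is None
```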