PERF: repeated slicing along index in groupby #40353
Conversation
pandas/core/internals/managers.py
Outdated
@@ -817,6 +818,15 @@ def get_slice(self, slobj: slice, axis: int = 0) -> BlockManager:

        return type(self)._simple_new(tuple(new_blocks), new_axes)

    def get_slice_index(self, slobj: slice) -> BlockManager:
Can you add a get_slice_columns and change get_slice to just dispatch (to share code)?
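A minimal sketch of the suggested dispatch, assuming axis 0 is columns and axis 1 is the index as in BlockManager; get_slice_columns is the hypothetical helper named in the suggestion, not code that exists in the PR:

    def get_slice(self, slobj: slice, axis: int = 0) -> BlockManager:
        # dispatch to axis-specific helpers so the slicing logic lives in one place
        if axis == 0:
            return self.get_slice_columns(slobj)  # hypothetical helper
        elif axis == 1:
            return self.get_slice_index(slobj)
        raise IndexError(f"Requested axis not found in manager: {axis}")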
The only code-sharing I see available would be to change L819 to return self.get_slice_index(slobj), which only saves one line.
Flip side: one of the things I'm looking at is making another attempt at getting the axes out of BlockManager (and ArrayManager), which would marginally widen the scope for sharing.
What does this specialized version avoid? Isn't it only the if axis == 0: / elif axis == 1 checks? (Which I assume are quite fast checks, although if done repeatedly it might indeed still be worth avoiding them.)
The other difference is that get_slice does
    slicer = (slice(None), slobj)
    new_blocks = [blk.getitem_block(slicer) for blk in self.blocks]
while this new function does
    new_blocks = [blk.getitem_block_index(slobj) for blk in self.blocks]
but get_slice could do that as well?
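For reference, a sketch of roughly what the specialized method amounts to, pieced together from the diff above and the profile below; _getitem_slice and the axes handling are assumptions and may not match the merged code exactly:

    def get_slice_index(self, slobj: slice) -> BlockManager:
        # slice every block along the index, skipping the per-call axis checks
        new_blocks = [blk.getitem_block_index(slobj) for blk in self.blocks]
        new_axes = list(self.axes)
        # the last axis of a BlockManager is the index
        new_axes[-1] = new_axes[-1]._getitem_slice(slobj)
        return type(self)._simple_new(tuple(new_blocks), new_axes)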
Can you answer this?
Isn't it only the if axis == 0: / elif axis == 1 checks? (which I assume are quite fast checks, although if done repeatedly might indeed still be worth avoiding)
Yes. The code I'm profiling to optimize is (#40171 (comment)):
import numpy as np
import pandas as pd
N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = pd.DataFrame(
{
"key": labels,
"key2": labels2,
"value1": np.random.randn(N),
"value2": ["foo", "bar", "baz", "qux"] * (N // 4),
}
)
%prun -s cumtime for i in range(100): df.groupby("key").apply(lambda x: 1) # on master as of this writing
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 4.170 4.170 {built-in method builtins.exec}
1 0.001 0.001 4.170 4.170 <string>:1(<module>)
100 0.001 0.000 4.156 0.042 groupby.py:911(apply)
100 0.004 0.000 4.150 0.042 groupby.py:960(_python_apply_general)
100 0.497 0.005 4.084 0.041 ops.py:263(apply)
199000 0.176 0.000 3.223 0.000 ops.py:1005(__iter__)
198900 0.248 0.000 2.986 0.000 ops.py:1046(_chop)
198900 0.339 0.000 2.522 0.000 managers.py:796(get_slice)
198900 0.188 0.000 1.631 0.000 managers.py:803(<listcomp>)
596700 0.823 0.000 1.443 0.000 blocks.py:324(getitem_block)
198900 0.135 0.000 0.409 0.000 base.py:4638(_getitem_slice)
We're still spending 8% of the runtime in get_slice itself, which is why I think avoiding re-doing these checks is worthwhile.
but get_slice could do that as well?
Sure, could update get_slice to also use getitem_block_index. (I think the first draft of this PR started before get_slice reliably actually got a slice object.)
we're still spending 8% of the runtime in get_slice itself, which is why I think avoiding re-doing these checks is worthwhile.
And did that number reduce by using get_slice_index instead?
sure, could update get_slice to also use getitem_block_index
Yes, let's at least make the code consistent there
Using this branch, timing mgr.get_slice_index(slobj) vs mgr.get_slice(slobj, axis=1) (but making them both use getitem_block_index, and removing the assert), I only get a tiny difference (less than 5%, and that's on something that only takes 8% of the total runtime).
I'm seeing 17-22% gains with just getitem_block_index and not the extra BM/AM methods, so I will revert those for this PR.
pandas/core/arrays/string_arrow.py
Outdated
@@ -349,6 +349,15 @@ def __getitem__(self, item: Any) -> Any:
                "Only integers, slices and integer or "
                "boolean arrays are valid indices."
            )
        elif isinstance(item, tuple):
            # possibly unpack arr[:, n] to arr[n]
Why is this needed for the Arrow string array? That's invalid indexing for a 1D array, and should raise an IndexError (as numpy does, or as we do for e.g. StringArray).
Typo, should be arr[..., n]; will update.
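Roughly, the unpacking under discussion amounts to something like this standalone sketch; _unpack_ellipsis is a hypothetical name for illustration, not the pandas implementation:

    def _unpack_ellipsis(item):
        # reduce arr[..., n] (or arr[n, ...]) to the plain key n for a 1D array
        if isinstance(item, tuple) and len(item) == 2:
            if item[0] is Ellipsis:
                return item[1]
            if item[1] is Ellipsis:
                return item[0]
        return item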
Even for arr[..., n], why is it needed specifically here for the Arrow string array? That also doesn't work for StringArray.
Got a test failure without this; will add a base extension test to require it for all EAs.
Do you remember which test?
(BTW, I am not sure we should require this behaviour from EAs, as we can't be consistent with numpy in returning a 0-dim array)
Do you remember which test?
test_groupby_extension_apply * 8
pandas/core/groupby/ops.py
Outdated
@@ -1012,19 +1012,27 @@ def __iter__(self):

        starts, ends = lib.generate_slices(self.slabels, self.ngroups)

        for i, (start, end) in enumerate(zip(starts, ends)):
            yield i, self._chop(sdata, slice(start, end))
        if self.axis == 0:
Can make this more readable with
    method = self._chop_index if self.axis == 0 else self._chop_column
    for i ...
        yield i, method(...)
(see the filled-in sketch below).
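Filled in against the loop in the diff above, the suggestion would read roughly as follows, assuming _chop_index and _chop_column take the same arguments as _chop; the merged code may differ:

    method = self._chop_index if self.axis == 0 else self._chop_column
    for i, (start, end) in enumerate(zip(starts, ends)):
        yield i, method(sdata, slice(start, end))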
@@ -245,6 +245,17 @@ def test_getitem_slice(self, data):
        result = data[slice(1)]  # scalar
        assert isinstance(result, type(data))

    def test_getitem_ellipsis_and_slice(self, data):
        # GH#40353 this is called from getitem_block_index
        result = data[..., :]
Also test data[:, ...]?
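Something like the following would cover both orderings (a sketch: assert_extension_array_equal is assumed to come from the base extension test class, and the merged test may differ):

    def test_getitem_ellipsis_and_slice(self, data):
        # GH#40353 this is called from getitem_block_index
        result = data[..., :]
        self.assert_extension_array_equal(result, data)

        result = data[:, ...]
        self.assert_extension_array_equal(result, data)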
Good idea, will update.
I think following today's call we're all set here?
Guess I merged in the wrong order. Can you merge master and ping on green?
@jorisvandenbossche any guesses why the array manager sql tests are breaking on a bunch of PRs? Update: looks like they break for non-ArrayManager too; that's just the only place we have it running with up-to-date sqlalchemy.
Yes, it seems to be some new deprecation warnings with a new SQLAlchemy.
Green, except for the unrelated sqlalchemy issue.
Merging this; the sqlalchemy patch can be a follow-up.
This cuts another 25% off of the benchmark discussed in #40171 (comment), getting us to within 3x of the cython-path performance.