PERF: repeated slicing along index in groupby #40353
Conversation
pandas/core/internals/managers.py
Outdated
@@ -817,6 +818,15 @@ def get_slice(self, slobj: slice, axis: int = 0) -> BlockManager:

        return type(self)._simple_new(tuple(new_blocks), new_axes)

    def get_slice_index(self, slobj: slice) -> BlockManager:
Can you add a get_slice_columns and change get_slice to just dispatch (to share code)?
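A minimal sketch of the suggested dispatch, assuming axis 0 is columns and axis 1 is the index as in BlockManager; get_slice_columns is the hypothetical helper named in the suggestion, not code that exists in the PR:

    def get_slice(self, slobj: slice, axis: int = 0) -> BlockManager:
        # dispatch to axis-specific helpers so the slicing logic lives in one place
        if axis == 0:
            return self.get_slice_columns(slobj)  # hypothetical helper
        elif axis == 1:
            return self.get_slice_index(slobj)
        raise IndexError(f"Requested axis not found in manager: {axis}")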
The only code-sharing I see available would be to change L819 to return self.get_slice_index(slobj), which only saves one line.
Flip side: one of the things I'm looking at is making another attempt at getting the axes out of BlockManager (and ArrayManager), which would marginally widen the scope for sharing.
What does this specialized version avoid? Isn't it only the if axis == 0: / elif axis == 1 checks? (Which I assume are quite fast checks, although if done repeatedly it might indeed still be worth avoiding them.)
The other difference is that get_slice does
    slicer = (slice(None), slobj)
    new_blocks = [blk.getitem_block(slicer) for blk in self.blocks]
while this new function does
    new_blocks = [blk.getitem_block_index(slobj) for blk in self.blocks]
but get_slice could do that as well?
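For reference, a sketch of roughly what the specialized method amounts to, pieced together from the diff above and the profile below; _getitem_slice and the axes handling are assumptions and may not match the merged code exactly:

    def get_slice_index(self, slobj: slice) -> BlockManager:
        # slice every block along the index, skipping the per-call axis checks
        new_blocks = [blk.getitem_block_index(slobj) for blk in self.blocks]
        new_axes = list(self.axes)
        # the last axis of a BlockManager is the index
        new_axes[-1] = new_axes[-1]._getitem_slice(slobj)
        return type(self)._simple_new(tuple(new_blocks), new_axes)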
Can you answer this?
Isn't it only the if axis == 0: / elif axis == 1 checks? (which I assume are quite fast checks, although if done repeatedly might indeed still be worth avoiding)
Yes. The code I'm profiling to optimize is (#40171 (comment)):
import numpy as np
import pandas as pd
N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = pd.DataFrame(
{
"key": labels,
"key2": labels2,
"value1": np.random.randn(N),
"value2": ["foo", "bar", "baz", "qux"] * (N // 4),
}
)
%prun -s cumtime for i in range(100): df.groupby("key").apply(lambda x: 1) # on master as of this writing
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 4.170 4.170 {built-in method builtins.exec}
1 0.001 0.001 4.170 4.170 <string>:1(<module>)
100 0.001 0.000 4.156 0.042 groupby.py:911(apply)
100 0.004 0.000 4.150 0.042 groupby.py:960(_python_apply_general)
100 0.497 0.005 4.084 0.041 ops.py:263(apply)
199000 0.176 0.000 3.223 0.000 ops.py:1005(__iter__)
198900 0.248 0.000 2.986 0.000 ops.py:1046(_chop)
198900 0.339 0.000 2.522 0.000 managers.py:796(get_slice)
198900 0.188 0.000 1.631 0.000 managers.py:803(<listcomp>)
596700 0.823 0.000 1.443 0.000 blocks.py:324(getitem_block)
198900 0.135 0.000 0.409 0.000 base.py:4638(_getitem_slice)
We're still spending 8% of the runtime in get_slice itself, which is why I think avoiding re-doing these checks is worthwhile.
but get_slice could do that as well?
Sure, could update get_slice to also use getitem_block_index. (I think the first draft of this PR started before get_slice reliably actually got a slice object.)
we're still spending 8% of the runtime in get_slice itself, which is why I think avoiding re-doing these checks is worthwhile.
And did that number reduce by using get_slice_index instead?
sure, could update get_slice to also use getitem_block_index
Yes, let's at least make the code consistent there
Using this branch, timing mgr.get_slice_index(slobj) vs mgr.get_slice(slobj, axis=1) (but making them both use getitem_block_index, and removing the assert), I only get a tiny difference (less than 5%, and that's on something that only takes 8% of the total runtime).
I'm seeing 17-22% gains with just getitem_block_index and not the extra BM/AM methods, so I will revert those for this PR.
pandas/core/arrays/string_arrow.py
Outdated
@@ -349,6 +349,15 @@ def __getitem__(self, item: Any) -> Any:
                "Only integers, slices and integer or "
                "boolean arrays are valid indices."
            )
        elif isinstance(item, tuple):
            # possibly unpack arr[:, n] to arr[n]
Why is this needed for the Arrow string array? That's invalid indexing for a 1D array, and should raise an IndexError (as numpy does, or as we do for e.g. StringArray).
Typo, should be arr[..., n]; will update.
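Roughly, the unpacking under discussion amounts to something like this standalone sketch; _unpack_ellipsis is a hypothetical name for illustration, not the pandas implementation:

    def _unpack_ellipsis(item):
        # reduce arr[..., n] (or arr[n, ...]) to the plain key n for a 1D array
        if isinstance(item, tuple) and len(item) == 2:
            if item[0] is Ellipsis:
                return item[1]
            if item[1] is Ellipsis:
                return item[0]
        return item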
Even for arr[..., n], why is it needed specifically here for the Arrow string array? That also doesn't work for StringArray.
Got a test failure without this; will add a base extension test to require it for all EAs.
Do you remember which test?
(BTW, I am not sure we should require this behaviour from EAs, as we can't be consistent with numpy in returning a 0-dim array)
Do you remember which test?
test_groupby_extension_apply * 8
pandas/core/groupby/ops.py
Outdated
@@ -1012,19 +1012,27 @@ def __iter__(self):

        starts, ends = lib.generate_slices(self.slabels, self.ngroups)

        for i, (start, end) in enumerate(zip(starts, ends)):
            yield i, self._chop(sdata, slice(start, end))
        if self.axis == 0:
Can make this more readable with
    method = self._chop_index if self.axis == 0 else self._chop_column
    for i ...
        yield i, method(...)
(see the filled-in sketch below).
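Filled in against the loop in the diff above, the suggestion would read roughly as follows, assuming _chop_index and _chop_column take the same arguments as _chop; the merged code may differ:

    method = self._chop_index if self.axis == 0 else self._chop_column
    for i, (start, end) in enumerate(zip(starts, ends)):
        yield i, method(sdata, slice(start, end))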
@@ -245,6 +245,17 @@ def test_getitem_slice(self, data):
        result = data[slice(1)]  # scalar
        assert isinstance(result, type(data))

    def test_getitem_ellipsis_and_slice(self, data):
        # GH#40353 this is called from getitem_block_index
        result = data[..., :]
Also test data[:, ...]?
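Something like the following would cover both orderings (a sketch: assert_extension_array_equal is assumed to come from the base extension test class, and the merged test may differ):

    def test_getitem_ellipsis_and_slice(self, data):
        # GH#40353 this is called from getitem_block_index
        result = data[..., :]
        self.assert_extension_array_equal(result, data)

        result = data[:, ...]
        self.assert_extension_array_equal(result, data)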
Good idea, will update.
I think following today's call we're all set here?
Guess I merged in the wrong order. Can you merge master and ping on green?
@jorisvandenbossche any guesses why the array manager sql tests are breaking on a bunch of PRs? Update: looks like they break for non-ArrayManager too; that's just the only place we have it running with up-to-date sqlalchemy.
Yes, it seems to be some new deprecation warnings with a new SQLAlchemy.
Green, except for the unrelated sqlalchemy issue.
Merging this; the sqlalchemy patch can be a follow-up.
This cuts another 25% off of the benchmark discussed in #40171 (comment), getting us to within 3x of the cython-path performance.