PERF: Use Indexers to implement groupby rolling #34052

mroeschke · 2020-05-07T18:26:48Z

tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff

Currently, groupby.rolling is implemented essentially as groupby.apply(lambda x: x.rolling()), which can be slow.
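As a minimal sketch of the equivalence described above (the toy frame and the variable names are made up for illustration, not taken from the PR), rolling each group through apply and calling the direct API give the same values once the indexes are aligned:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a", "b", "a"], "B": [1, 2, 3, 4, 5]})

# Old strategy described above: a Python-level rolling call per group.
# group_keys=False keeps the original row index on the result.
via_apply = df.groupby("A", group_keys=False)["B"].apply(
    lambda x: x.rolling(window=2).sum()
)

# Direct API; the result is indexed by (group key, original row label).
direct = df.groupby("A")["B"].rolling(window=2).sum()

# Drop the group level and restore row order to compare the two.
aligned = direct.droplevel(0).sort_index()
pd.testing.assert_series_equal(via_apply.sort_index(), aligned, check_names=False)
```

The apply path pays Python call overhead once per group, which is why a benchmark with many small groups is the interesting case.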

This PR implements groupby.rolling by calculating bounds with a GroupbyRollingIndexer and using the rolling aggregations in Cython to compute the results.
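To illustrate the idea of computing bounds per group, here is a rough NumPy sketch. The helper `grouped_fixed_bounds` is hypothetical and is not the PR's indexer; it assumes a fixed window and rows already reordered so that each group is contiguous:

```python
import numpy as np


def grouped_fixed_bounds(group_sizes, window):
    """Hypothetical helper: fixed-window bounds that never cross a group boundary.

    Assumes rows were already reordered so that each group is contiguous.
    """
    starts, ends = [], []
    offset = 0  # global position of the current group's first row
    for size in group_sizes:
        # Exclusive window end for each row of this group, in global positions.
        end = np.arange(1, size + 1) + offset
        # Reach back `window` rows, but clip at the group's first row.
        start = np.maximum(end - window, offset)
        starts.append(start)
        ends.append(end)
        offset += size
    return np.concatenate(starts), np.concatenate(ends)


# Two groups of sizes 3 and 2, stored contiguously.
values = np.array([1, 3, 5, 2, 4])
start, end = grouped_fixed_bounds([3, 2], window=2)

# A single pass over the bounds replaces one rolling call per group.
sums = [int(values[s:e].sum()) for s, e in zip(start, end)]
```

This sketch ignores min_periods; with the default for a fixed window of 2, the first row of each group would be NaN rather than a one-element sum.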


mroeschke · 2020-05-13T06:25:54Z

Here are preliminary benchmarks. The performance so far is fairly similar. I suspect that having to reconstruct the resulting index is killing the performance.

With this benchmark:

In [1]: n = 1000
   ...: df = pd.DataFrame({"A": [str(i) for i in range(n)] * 10, "B": list(range(n)) * 10})
   ...: g = df.groupby("A").rolling(window=2)
   ...: %timeit g.sum()
1.5 s ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- PR
1.52 s ± 49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- master

jreback · 2020-05-13T08:34:14Z

try with a longer window as well
also profile; this should be way faster

mroeschke · 2020-05-13T19:36:43Z

I was accidentally still dispatching to the old implementation in this PR. Here are the performance results with that removed:

-- PR
In [2]: %timeit g.sum()
61.2 ms ± 4.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit g.sum()
71.9 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit g.sum()
57.9 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


-- master
In [2]: %timeit g.sum()
1.42 s ± 9.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit g.sum()
1.44 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit g.sum()
1.48 s ± 73.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

jreback

can you enhance the benchmarks for this (if we don't already have them)?

also if you can add a whatsnew note.


pep8speaks · 2020-05-14T06:13:13Z

Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-21 04:47:41 UTC


mroeschke · 2020-05-18T05:51:31Z

@jreback ready for another look


jreback · 2020-05-18T13:06:38Z

asv_bench/benchmarks/rolling.py

+        df = pd.DataFrame(
+            {"A": [str(i) for i in range(N)] * 10, "B": list(range(N)) * 10}
+        )
+        self.groupby_roll = df.groupby("A").rolling(window=2)


can you add a timebased one as well
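A time-based variant could look roughly like the sketch below. The sizes, column names, and window are assumptions chosen so the output can be checked by hand; this is not the benchmark that was added to asv_bench:

```python
import pandas as pd

N = 3  # tiny on purpose; a real benchmark would use a much larger N
df = pd.DataFrame(
    {"A": ["x", "y", "z"] * N, "C": range(3 * N)},
    index=pd.date_range("2020-01-01", periods=3 * N, freq="min"),
)

# Rows of one group are 3 minutes apart, so a 4-minute window
# covers the current row and the previous row of the same group.
result = df.groupby("A")["C"].rolling(window="4min").sum()
```

Time-based windows have variable length in rows, so they exercise a different bounds calculation than `window=2`.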

jreback · 2020-05-18T13:07:06Z

doc/source/whatsnew/v1.1.0.rst

@@ -611,7 +611,7 @@ Performance improvements
  and :meth:`~pandas.core.groupby.groupby.Groupby.last` (:issue:`34178`)
 - Performance improvement in :func:`factorize` for nullable (integer and boolean) dtypes (:issue:`33064`).
 - Performance improvement in reductions (sum, prod, min, max) for nullable (integer and boolean) dtypes (:issue:`30982`, :issue:`33261`, :issue:`33442`).
-
+- Performance improvement in ``groupby(..).rolling(..)`` (:issue:`34052`)


do we have a way of hitting the api here?

pandas/core/window/indexers.py

jreback · 2020-05-18T13:14:19Z

pandas/core/window/rolling.py

+            np.concatenate(list(self._groupby.grouper.indices.values()))
+        )
+
+        # filter out the on from the object


maybe better to call super()._create_blocks(obj) (e.g. add obj as an optional arg that defaults to self._selected_obj)
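The suggestion above could be sketched as follows. The class names and the list-based "blocks" are placeholders for illustration only and do not reflect the actual pandas classes:

```python
class BaseWindowSketch:
    """Hypothetical stand-in for the base window class."""

    def __init__(self, selected_obj):
        self._selected_obj = selected_obj

    def _create_blocks(self, obj=None):
        # Default to the selected object when no object is passed.
        if obj is None:
            obj = self._selected_obj
        return [obj]


class GroupedWindowSketch(BaseWindowSketch):
    """Hypothetical stand-in for the groupby-aware subclass."""

    def __init__(self, selected_obj, order):
        super().__init__(selected_obj)
        self._order = order  # positions that make each group contiguous

    def _create_blocks(self, obj=None):
        if obj is None:
            obj = self._selected_obj
        # Reorder first, then reuse the parent's logic on the reordered object.
        reordered = [obj[i] for i in self._order]
        return super()._create_blocks(reordered)


blocks = GroupedWindowSketch([10, 20, 30, 40], order=[0, 2, 1, 3])._create_blocks()
```

The subclass only decides which object to pass along, so the block-creation logic lives in one place.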

pandas/core/window/rolling.py


mroeschke · 2020-05-20T03:52:39Z

Final benchmarks:

# rolling over 1000 groups
In [1]: n = 1000
   ...: df = pd.DataFrame({"A": [str(i) for i in range(n)] * 10, "B": list(range(n)) * 10})
   ...: g = df.groupby("A").rolling(window=2)

-- master
In [2]: %timeit g.sum()
1.38 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit g.sum()
1.4 s ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit g.sum()
1.37 s ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

-- PR
In [2]: %timeit g.sum()
63.1 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit g.sum()
70.1 ms ± 8.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit g.sum()
63.7 ms ± 471 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

jreback

looks good, some doc comment requests. ping on green.

jreback · 2020-05-20T20:08:21Z

pandas/core/window/indexers.py

+        center: Optional[bool] = None,
+        closed: Optional[str] = None,
+    ) -> Tuple[np.ndarray, np.ndarray]:
+        start_arrays = []


can you add some comments here on what you are doing

jreback · 2020-05-20T20:09:07Z

pandas/core/window/rolling.py

@@ -147,12 +148,10 @@ def _validate_get_window_bounds_signature(window: BaseIndexer) -> None:
                f"get_window_bounds"
            )

-    def _create_blocks(self):
+    def _create_blocks(self, obj):


if you can type: Union[Series, DataFrame] (I think we have an annotation for that).

jreback · 2020-05-20T20:09:26Z

pandas/core/window/rolling.py

+        groupby_keys = [grouping.name for grouping in self._groupby.grouper._groupings]
+        result_index_names = groupby_keys + grouped_index_name
+
+        result_index_data = []


if you can add some comments here on what you are doing
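For readers following along, the snippet under review appears to build the (group key, original label) index of the result. A rough sketch of that idea using only public API follows; the toy frame is made up, and the loop is an assumption about the intent rather than the PR's code:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a", "b", "a"], "B": [1, 2, 3, 4, 5]})
grouped = df.groupby("A")

# Level names: the groupby keys first, then the original index name.
groupby_keys = ["A"]
grouped_index_name = [df.index.name]
result_index_names = groupby_keys + grouped_index_name

# One (key, original label) tuple per row, with each group kept contiguous.
result_index_data = []
for key in sorted(grouped.indices):
    positions = grouped.indices[key]  # integer positions of this group's rows
    for label in df.index[positions]:
        result_index_data.append((key, label))

result_index = pd.MultiIndex.from_tuples(result_index_data, names=result_index_names)
```

Building the index separately is needed because the Cython aggregation only returns values, not labels.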

jreback · 2020-05-20T20:09:43Z

pandas/core/window/rolling.py

    @property
    def _constructor(self):
        return Rolling

-    def _gotitem(self, key, ndim, subset=None):
+    def _create_blocks(self, obj):


if you can type


mroeschke · 2020-05-21T16:28:34Z

@jreback Ping all green

jreback · 2020-05-25T17:26:25Z

thanks @mroeschke very nice!

ddofer · 2020-08-04T15:07:26Z

Does this affect also "Expanding"?

mroeschke · 2020-08-04T16:40:50Z

No, this currently doesn't apply to expanding. PRs to make it apply to expanding are welcome!

Matt Roeschke added 14 commits May 6, 2020 19:40

Prep rolling groupby

723d7e2

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

47a53d6

…ndexer

figure out how to get a similar groupby result

efc11a7

Wrap result in list

db9e134

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

e106b69

…ndexer

add custom code to produce resulting index

c5a2ab0

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

3d6468a

…ndexer

Use _get_window_indexer instead of changing window in __init__

7856eac

Remove unused import

bc7a8c8

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

462c7e5

…ndexer

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

67f938e

…ndexer

Create groupby_parent for groupbyrolling

4f698ec

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

741d79e

…ndexer

remove grouper parent and use the grouper object to get the key names

d77a96b

remove comma

6b86936

Don't dispatch to WindowGroupbyMixin._apply

bfd485e

jreback added Performance Memory or execution speed performance Window rolling, ewma, expanding labels May 13, 2020

jreback added this to the 1.1 milestone May 13, 2020

jreback requested changes May 13, 2020


Matt Roeschke added 3 commits May 13, 2020 21:15

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

e1eee66

…ndexer

Use variable algorithms only for groupby rolling

8a680b1

Add todo comments

88b0b25

Matt Roeschke added 3 commits May 13, 2020 23:14

Flake8

0f45cd1

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

a8c8d8d

…ndexer

rename some variables and handle timeseries case

32d4c49

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

d924f70

…ndexer

jreback requested changes May 18, 2020


Matt Roeschke added 7 commits May 18, 2020 20:34

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

08abad3

…ndexer

Add timeseries benchmark

7390a62

Improve whatsnew sphinx link

41f8569

reuse _create_blocks

5517cc4

expand docstring

a62dba4

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

b64a321

…ndexer

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

b7b992b

…ndexer

jreback requested changes May 20, 2020


Matt Roeschke added 2 commits May 20, 2020 21:46

Add more commentary, typing

dad3d0e

Merge remote-tracking branch 'upstream/master' into rolling_groupby_i…

93772c0

…ndexer

jreback approved these changes May 25, 2020


jreback merged commit bad52a9 into pandas-dev:master May 25, 2020

mroeschke deleted the rolling_groupby_indexer branch May 25, 2020 17:35

mroeschke mentioned this pull request Jun 9, 2020

BUG: Positional Arguments Passed as a Keyword Argument in Custom Rolling Aggregation #34605

Closed

3 tasks

simonjayhawkins mentioned this pull request Jul 31, 2020

QST: is the new behavior of GroupByRolling in v1.1.0 intended? #35486

Closed

WhistleWhileYouWork mentioned this pull request Aug 4, 2020

BUG: 1.1.0 breaks custom indexer support in groupby().rolling() #35557

Closed

3 tasks

simonjayhawkins mentioned this pull request Sep 6, 2020

QST: is the new behavior of GroupByRolling for MultiIndex in v1.1.1 intended? #36018

Closed

mroeschke mentioned this pull request Oct 6, 2020

BUG: RollingGroupby no longer respects sort being disabled #36889

Closed

3 tasks

simonjayhawkins mentioned this pull request Oct 6, 2020

BUG: Segmentation fault when doing pandas.core.window.rolling.RollingGroupBy.apply #36727

Closed

3 tasks

simonjayhawkins mentioned this pull request Nov 9, 2020

BUG: RollingGroupby duplicates columns in index even with group_keys=False #37641

Closed

3 tasks

jorisvandenbossche mentioned this pull request Nov 24, 2020

REGR: Performance regression on RollingGroupby #38038

Closed

simonjayhawkins mentioned this pull request Dec 29, 2020

BUG:Wrong sum in groupby rolling due to precision issues #38752

Closed

3 tasks
