BUG: Rolling groupby should not maintain the by column in the resulting DataFrame #32332

MarcoGorelli · 2020-02-28T13:55:34Z

closes Rolling groupby should not maintain the by column in the resulting DataFrame #32262
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Does this seem like the correct idea? If so, I'll expand on tests / write whatsnew / think about a cleaner patch

EDIT

lots of unwanted changes here, just seeing what tests fail

EDIT

need to check the types make sense and see what's the proper way to write kwargs["exclusions"] = self.exclusions

WillAyd

Thanks for taking a look! @mroeschke might also have thoughts

pandas/core/window/common.py

pandas/core/groupby/groupby.py

MarcoGorelli · 2020-03-01T21:28:57Z

@WillAyd thanks for your review - I've written it a different way (so no side-effects this time)

WillAyd

Thanks - this looks like a better approach. Just wondering if there's something even cleaner out there (a lot of the groupby stuff is a mess as is...)

pandas/core/groupby/groupby.py

jreback

see my comments. The OP is only partially correct. If you do .groupby.rolling (or any of the variants) and .apply, then you will get back the grouping column, just like groupby.apply does; so this should not change.

e.g.

In [1]: df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 3]})

In [2]: df.groupby('A').sum()
Out[2]:
   B
A
1  3
2  3

In [3]: df.groupby('A').apply(lambda x: x.sum())
Out[3]:
   A  B
A
1  2  3
2  2  3

The issue here is that .groupby.rolling happens to use .apply as an implementation detail, so that's the behavior you get. I am open to patching this only there.
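A quick way to see which behaviour an installed pandas exhibits is to run the same toy frame through groupby.rolling and check whether the grouping column comes back as a value column. This is a minimal sketch, not a test from this PR. The outcome depends on the pandas version (the change was eventually made in a later PR, #40341), so no expected output is shown.

```python
import pandas as pd

# Same toy frame as in the session above.
df = pd.DataFrame({"A": [1, 1, 2], "B": [1, 2, 3]})

# Rolling sum over a window of 2 rows within each group of 'A'.
rolled = df.groupby("A").rolling(2).sum()

# True if the grouping column also appears among the value columns.
by_column_kept = "A" in rolled.columns

print("pandas", pd.__version__)
print("'A' kept as a value column:", by_column_kept)
print(rolled)
```

In every case the group key appears as the first index level; the only question is whether it is duplicated in the values.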

MarcoGorelli · 2020-03-03T10:34:40Z

Thanks @jreback - that probably simplifies the patch then.

What about if we use .resample? e.g.

>>> df
           id  score
date                
2010-01-01  A      2
2010-01-02  A      3
2010-01-05  A      8
2010-01-10  A      7
2010-01-13  A      3
2010-01-01  B      5
2010-01-03  B      2
2010-01-04  B      1
2010-01-11  B      7
2010-01-14  B      3

>>> df.groupby("id").resample("D").asfreq()
                id  score
id date                  
A  2010-01-01    A    2.0
   2010-01-02    A    3.0
   2010-01-03  NaN    NaN
   2010-01-04  NaN    NaN
   2010-01-05    A    8.0
   2010-01-06  NaN    NaN
   2010-01-07  NaN    NaN
   2010-01-08  NaN    NaN
   2010-01-09  NaN    NaN
   2010-01-10    A    7.0
   2010-01-11  NaN    NaN
   2010-01-12  NaN    NaN
   2010-01-13    A    3.0
B  2010-01-01    B    5.0
   2010-01-02  NaN    NaN
   2010-01-03    B    2.0
   2010-01-04    B    1.0
   2010-01-05  NaN    NaN
   2010-01-06  NaN    NaN
   2010-01-07  NaN    NaN
   2010-01-08  NaN    NaN
   2010-01-09  NaN    NaN
   2010-01-10  NaN    NaN
   2010-01-11    B    7.0
   2010-01-12  NaN    NaN
   2010-01-13  NaN    NaN
   2010-01-14    B    3.0

This is the output that's currently being tested for in pandas/tests/resample/test_resampler_grouper.py - this one's correct as it is, then?
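To reproduce the frame above, the sketch below rebuilds it from the printed values and runs the same call. It only reports whether id survives as a value column, since that may differ between pandas versions; nothing here is taken from the test file itself.

```python
import pandas as pd

# Rebuild the example frame from the values printed above.
dates = pd.DatetimeIndex(
    [
        "2010-01-01", "2010-01-02", "2010-01-05", "2010-01-10", "2010-01-13",
        "2010-01-01", "2010-01-03", "2010-01-04", "2010-01-11", "2010-01-14",
    ],
    name="date",
)
df = pd.DataFrame(
    {"id": list("AAAAABBBBB"), "score": [2, 3, 8, 7, 3, 5, 2, 1, 7, 3]},
    index=dates,
)

# Upsample each group to daily frequency; missing days become NaN rows.
resampled = df.groupby("id").resample("D").asfreq()

print("'id' kept as a value column:", "id" in resampled.columns)
print(resampled)
```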

pandas/core/window/rolling.py

MarcoGorelli · 2020-03-03T21:56:07Z

@jreback have pushed a new fix. Have had to update some tests which check output against groupby.apply

jreback

this will affect groupby.rolling and groupby.resample but NOT groupby.apply

doc/source/whatsnew/v1.1.0.rst

jreback · 2020-03-04T14:24:47Z

pandas/core/window/rolling.py

+    def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin:
+        if isinstance(obj, ABCDataFrame):
+            try:
+                new_obj = super()._shallow_copy(


pandas/core/window/rolling.py

pandas/tests/window/test_grouper.py


pandas/core/window/ewm.py

jreback · 2020-03-14T23:12:57Z

pandas/core/window/rolling.py

+        self.exclusions = kwargs.get("exclusions", set())
+
+    def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin:
+        if isinstance(obj, ABCDataFrame):


WAT? i would much rather properly copy in the super class. really avoid if/then cases unless absolutely necessary.

IOW we can always just copy exclusions

@jreback sure, I can remove the if-then cases here, but then I need to potentially remove elements from self.exclusions in def _obj_with_exclusions(self), otherwise this could throw an error if there are elements of self.exclusions that are no longer part of self.obj.columns:

return self.obj.drop(self.exclusions, axis=1)

Anyway, that's what I've done in the latest commit
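The failure mode described here can be reproduced with public API alone. In this minimal sketch the frame, the exclusions set and the filtering step are illustrative assumptions, not the code from the commit: DataFrame.drop raises KeyError for a label that is no longer a column, and restricting the exclusions to labels still present avoids that.

```python
import pandas as pd

# Hypothetical state: 'A' is still listed as an exclusion,
# but it has already been removed from the frame's columns.
df = pd.DataFrame({"B": [1, 2, 3], "C": [4, 5, 6]})
exclusions = {"A"}

raised = False
try:
    df.drop(list(exclusions), axis=1)
except KeyError as err:
    # drop() refuses labels that are not present on the axis.
    raised = True
    print("drop failed:", err)

# Keep only the exclusions that are still columns, then drop those.
present = [label for label in exclusions if label in df.columns]
trimmed = df.drop(present, axis=1)
print(list(trimmed.columns))
```

Passing errors="ignore" to drop would be another way to get the same result; whether either belongs in _obj_with_exclusions is exactly what the review is questioning.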

doc/source/whatsnew/v1.1.0.rst

pandas/core/base.py

jreback · 2020-03-16T02:53:32Z

pandas/core/window/rolling.py

+    def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin:
+        exclusions = self.exclusions
+        new_obj = super()._shallow_copy(obj, exclusions=exclusions, **kwargs)
+        new_obj.obj = new_obj._obj_with_exclusions


are you sure this line is actually needed? this is very very odd to do

copying the exclusions should be enough. you may be able to move this shallow copy to pandas/core/groupby/groupby.py which is better than here.

The reason for doing this was that RollingGroupby._groupby._python_apply_general operates on RollingGroupby._groupby._selected_obj.
And RollingGroupby._groupby._selected_obj returns RollingGroupby._groupby.obj, possibly sliced by RollingGroupby._groupby._selection
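As a rough illustration of why the by column leaks through, here is a toy model of the two properties mentioned in this thread. ToySelection is hypothetical and only mirrors the behaviour described here (_selected_obj returns obj, optionally sliced by _selection; _obj_with_exclusions drops the exclusions); it is not the pandas implementation.

```python
import pandas as pd


class ToySelection:
    """Hypothetical stand-in for the selection logic discussed above."""

    def __init__(self, obj, exclusions, selection=None):
        self.obj = obj
        self.exclusions = set(exclusions)
        self._selection = selection

    @property
    def _selected_obj(self):
        # Returns the full object unless a column selection was made,
        # so the grouping column is still included.
        if self._selection is None:
            return self.obj
        return self.obj[self._selection]

    @property
    def _obj_with_exclusions(self):
        # Drops the excluded (grouping) columns.
        return self.obj.drop(list(self.exclusions), axis=1)


df = pd.DataFrame({"A": [1, 1, 2], "B": [1, 2, 3]})
toy = ToySelection(df, exclusions={"A"})

selected_cols = list(toy._selected_obj.columns)
excluded_cols = list(toy._obj_with_exclusions.columns)
print(selected_cols)  # ['A', 'B']
print(excluded_cols)  # ['B']
```

Any code path that applies a function to _selected_obj therefore sees the grouping column, while one that uses _obj_with_exclusions does not.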

you may be able to move this shallow copy to pandas/core/groupby/groupby.py

As in, define _GroupBy._shallow_copy? I don't understand how that would work: when we call Rolling.sum, we go to _Rolling_and_Expanding.sum and then WindowGroupByMixin._apply. Inside there, we have

    def f(x, name=name, *args):
        x = self._shallow_copy(x)
        if isinstance(name, str):
            return getattr(x, name)(*args, **kwargs)
        return x.apply(name, *args, **kwargs)

Here, self is WindowGroupByMixin. In the case when we get here from Rolling.sum, then self._shallow_copy will take us directly to ShallowMixin._shallow_copy. Therefore, even if we define _GroupBy._shallow_copy, the execution won't get there during this operation.
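The method-resolution argument can be checked with a small stand-alone model. The Toy classes below are hypothetical and far simpler than the real pandas hierarchy; the sketch only assumes, as the snippet above suggests, that the window object holds the groupby as an attribute rather than inheriting from it.

```python
class ShallowMixin:
    def _shallow_copy(self, obj):
        return ("ShallowMixin._shallow_copy", obj)


class ToyGroupBy:
    # Stand-in for _GroupBy: defining _shallow_copy here does not
    # change what the window object's own lookup resolves to.
    def _shallow_copy(self, obj):
        return ("ToyGroupBy._shallow_copy", obj)


class ToyWindowGroupByMixin(ShallowMixin):
    def __init__(self, groupby):
        # The groupby is held by composition, not inherited from.
        self._groupby = groupby

    def _apply(self, x):
        # Lookup follows type(self).__mro__, which never visits ToyGroupBy.
        return self._shallow_copy(x)


window = ToyWindowGroupByMixin(ToyGroupBy())
which, _ = window._apply("chunk")
mro_names = [cls.__name__ for cls in type(window).__mro__]

print(which)      # ShallowMixin._shallow_copy
print(mro_names)  # ['ToyWindowGroupByMixin', 'ShallowMixin', 'object']
```

For a method on the groupby class to be used, the call would have to go through self._groupby explicitly.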

EDIT

Anyway, am trying a slightly different approach, will ping if/when green and once I've thought over why corr and cov currently use ._selected_obj instead of ._obj_with_exclusions

@jreback thanks for your review. I've pushed a new attempt at a solution.

I patch obj before we get to _GroupBy.apply, and only for Rolling, so that the previous behaviour of DataFrame.groupby.apply is preserved.

pandas/tests/groupby/test_groupby.py


jreback

this is not going well. there is mutation all over the place. This needs a good clean solution to get merged.

jreback · 2020-03-22T20:51:58Z

doc/source/whatsnew/v1.1.0.rst

@@ -167,6 +167,35 @@ key and type of :class:`Index`.  These now consistently raise ``KeyError`` (:iss
    ...
    KeyError: Timestamp('1970-01-01 00:00:00')

+GroupBy.rolling no longer returns grouped-by column in values
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+


Add an explanation here (a couple of sentences), plus the issue number.

jreback · 2020-03-22T20:52:31Z

pandas/core/base.py

            return self.obj

+        # there may be elements in self.exclusions that are no longer
+        # in self.obj, see GH 32468
+        nlevels = self.obj.columns.nlevels


huh? why are you doing all of this

jreback · 2020-03-22T20:52:48Z

pandas/core/window/common.py

@@ -35,6 +35,8 @@ def _dispatch(name: str, *args, **kwargs):
    def outer(self, *args, **kwargs):
        def f(x):
            x = self._shallow_copy(x, groupby=self._groupby)
+            # patch for GH 32332
+            x.obj = x._obj_with_exclusions


I really don't like this

MarcoGorelli · 2020-03-22T21:08:34Z

Yes, I see what you mean - to be honest I can't think of a clean solution, and have spent a while on it already. I'll close this for now so someone else can take it up

WillAyd requested changes Feb 29, 2020

pandas/core/window/common.py (outdated)

WillAyd added the Window rolling, ewma, expanding label Feb 29, 2020

MarcoGorelli changed the title ~~(wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame~~ BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Feb 29, 2020

MarcoGorelli changed the title ~~BUG: Rolling groupby should not maintain the by column in the resulting DataFrame~~ (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 1, 2020

MarcoGorelli force-pushed the rolling-groupby branch from 3be9d0e to 1ca73fd Compare March 1, 2020 17:44

mroeschke reviewed Mar 1, 2020

pandas/core/groupby/groupby.py (outdated)

MarcoGorelli changed the title ~~(wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame~~ BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 1, 2020

WillAyd requested changes Mar 2, 2020

pandas/core/groupby/groupby.py (outdated)

jreback added the Groupby label Mar 3, 2020

jreback requested changes Mar 3, 2020


MarcoGorelli changed the title ~~BUG: Rolling groupby should not maintain the by column in the resulting DataFrame~~ (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 3, 2020

MarcoGorelli commented Mar 3, 2020

pandas/core/window/rolling.py (outdated)

MarcoGorelli changed the title ~~(wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame~~ BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 3, 2020

jreback requested changes Mar 4, 2020


MarcoGorelli mentioned this pull request Mar 5, 2020

Should Groupby.sum modify _selected_obj? #32468

Closed

Marco Gorelli and others added 12 commits March 5, 2020 19:02

make sure exclusions are applied before the groupby object reaches rolling

a6ad6c5

pass object with exclusions earlier on

54903a7

revert accident

d81c572

check for side effects in obj

e049205

wip

5f43f3a

Wip

3b1c3ff

change _selected_obj according to self.mutated

207a507

sanitize

c9b34b2

update old tests

b2ce758

fix old docstring

c307fdc

finish correcting old docstring

35f2b2d

use getattr

2bb7932

document change as api-breaking

63e3d85

lint wip clean re-write add comment

MarcoGorelli commented Mar 5, 2020

pandas/core/window/ewm.py

jreback requested changes Mar 14, 2020


MarcoGorelli added 5 commits March 15, 2020 11:18

Merge remote-tracking branch 'upstream/master' into rolling-groupby

17373ad

remove elements from exclusions within _obj_with_exclusions

a7ca8eb

remove no-longer-necessary performance warning

35e7340

simplify _obj_with_exclusions

17090da

correct comment

7c7f79c

jreback requested changes Mar 16, 2020


Merge remote-tracking branch 'upstream/master' into rolling-groupby

9b98dad

MarcoGorelli changed the title ~~BUG: Rolling groupby should not maintain the by column in the resulting DataFrame~~ (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 21, 2020

Marco Gorelli/DI /SRUK/Engineer/Samsung Electronics added 5 commits March 21, 2020 12:01

deal with multiindexed columns case

4d1c5a6

use elif in _obj_with_exclusions, add test for column multiindex, display dataframe in whatsnew entry separately from before vs after

0dde4a8

simplify unique_column_names

1e4ba56

rewrite using get_level_values

6312725

revert 'or' which was accidentally changed to 'and'

d9748d1

MarcoGorelli changed the title ~~(wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame~~ BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 21, 2020

Marco Gorelli/DI /SRUK/Engineer/Samsung Electronics added 3 commits March 21, 2020 22:09

only patch _apply, cov and corr

1623902

reinstate change to _obj_with_exclusions

5d7a477

exclude 'by' column in Rolling.count

d117624

MarcoGorelli changed the title ~~BUG: Rolling groupby should not maintain the by column in the resulting DataFrame~~ (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 22, 2020

Marco Gorelli/DI /SRUK/Engineer/Samsung Electronics added 3 commits March 22, 2020 11:08

don't modify _selected_obj - instead, patch obj before we reach apply

367e671

slight simplification to _shallow_copy

d7cf9f1

comment on patch in corr and cov

d251715

MarcoGorelli changed the title ~~(wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame~~ BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 22, 2020

jreback requested changes Mar 22, 2020


MarcoGorelli closed this Mar 22, 2020

MarcoGorelli deleted the rolling-groupby branch October 10, 2020 14:14

mroeschke mentioned this pull request Mar 10, 2021

BUG: RollingGroupby no longer keeps the groupby column in the result #40341

Merged

4 tasks
