Skip to content

BUG: Rolling groupby should not maintain the by column in the resulting DataFrame #32332

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 41 commits into from

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Feb 28, 2020

Does this seem like the correct idea? If so, I'll expand on tests / write whatsnew / think about a cleaner patch

EDIT

lots of unwanted changes here, just seeing what tests fail

EDIT

need to check the types make sense and see what's the proper way to write

kwargs["exclusions"] = self.exclusions

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look! @mroeschke might also have thoughts

@WillAyd WillAyd added the Window rolling, ewma, expanding label Feb 29, 2020
@MarcoGorelli MarcoGorelli changed the title (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Feb 29, 2020
@MarcoGorelli MarcoGorelli changed the title BUG: Rolling groupby should not maintain the by column in the resulting DataFrame (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 1, 2020
@MarcoGorelli MarcoGorelli changed the title (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 1, 2020
@MarcoGorelli
Copy link
Member Author

@WillAyd thanks for your review - I've written it a different way (so no side-effects this time)

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - this looks like a better approach. Just wondering if there's something even cleaner out there (a lot of the groupby stuff is a mess as is...)

@jreback jreback added the Groupby label Mar 3, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments. The OP is only partially correct. If you do .rolling.groupby (or any of the variants) and .apply then you will get back the grouping column, just like groupby.apply does; so this should not change.

e.g.

In [1]: df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 3]})                                                                                                                                                                                                     

In [2]: df.groupby('A').sum()                                                                                                                                                                                                                                   
Out[2]: 
   B
A   
1  3
2  3

In [3]: df.groupby('A').apply(lambda x: x.sum())                                                                                                                                                                                                                
Out[3]: 
   A  B
A      
1  2  3
2  2  3

The issue here is that .rolling.groupby happens to use .apply as an implementation detail, so that's the behavior you get. I am open to patching this only there.

@MarcoGorelli
Copy link
Member Author

Thanks @jreback - that probably simplifies the patch then.

What about if we use .resample? e.g.

>>> df
           id  score
date                
2010-01-01  A      2
2010-01-02  A      3
2010-01-05  A      8
2010-01-10  A      7
2010-01-13  A      3
2010-01-01  B      5
2010-01-03  B      2
2010-01-04  B      1
2010-01-11  B      7
2010-01-14  B      3

>>> df.groupby("id").resample("D").asfreq()
                id  score
id date                  
A  2010-01-01    A    2.0
   2010-01-02    A    3.0
   2010-01-03  NaN    NaN
   2010-01-04  NaN    NaN
   2010-01-05    A    8.0
   2010-01-06  NaN    NaN
   2010-01-07  NaN    NaN
   2010-01-08  NaN    NaN
   2010-01-09  NaN    NaN
   2010-01-10    A    7.0
   2010-01-11  NaN    NaN
   2010-01-12  NaN    NaN
   2010-01-13    A    3.0
B  2010-01-01    B    5.0
   2010-01-02  NaN    NaN
   2010-01-03    B    2.0
   2010-01-04    B    1.0
   2010-01-05  NaN    NaN
   2010-01-06  NaN    NaN
   2010-01-07  NaN    NaN
   2010-01-08  NaN    NaN
   2010-01-09  NaN    NaN
   2010-01-10  NaN    NaN
   2010-01-11    B    7.0
   2010-01-12  NaN    NaN
   2010-01-13  NaN    NaN
   2010-01-14    B    3.0

This is the output that's currently being tested for in pandas/tests/resample/test_resampler_grouper.py - this one's correct as it is, then?

@MarcoGorelli MarcoGorelli changed the title BUG: Rolling groupby should not maintain the by column in the resulting DataFrame (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 3, 2020
@MarcoGorelli MarcoGorelli changed the title (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 3, 2020
@MarcoGorelli
Copy link
Member Author

@jreback have pushed a new fix. Have had to update some tests which check output against groupby.apply

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will affect groupby.rolling and groupby.resample but NOT groupby.apply

def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin:
if isinstance(obj, ABCDataFrame):
try:
new_obj = super()._shallow_copy(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAT?

lint

wip

clean re-write

add comment
self.exclusions = kwargs.get("exclusions", set())

def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin:
if isinstance(obj, ABCDataFrame):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAT? i would much rather properly copy in the super class. really avoid if/then cases unless absolutely necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOW we can always just copy exclusions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback sure, I can remove the if-then cases here, but then I need to potentially remove elements from self.exclusions in

def _obj_with_exclusions(self)

, otherwise this could throw an error if there are elements of self.exclusions that are no longer part of self.obj.columns:

return self.obj.drop(self.exclusions, axis=1)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, that's what I've done in the latest commit

def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin:
exclusions = self.exclusions
new_obj = super()._shallow_copy(obj, exclusions=exclusions, **kwargs)
new_obj.obj = new_obj._obj_with_exclusions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure this line is actually needed? this is very very odd to do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copying the exclusions should be enough. you maybe be able to move this shallow copy to pandas/core/groupby/groupby.py which is better than here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for doing this was that RollingGroupby._groupby._python_apply_general operates on RollingGroupby._groupby._selected_obj.
And RollingGroupby._groupby._selected_obj returns RollingGroupby._groupby.obj, possibly sliced by RollingGroupby._groupby._selection

Copy link
Member Author

@MarcoGorelli MarcoGorelli Mar 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you maybe be able to move this shallow copy to pandas/core/groupby/groupby.py

As in, define _GroupBy._shallow_copy? I don't understand how that would work: when we call Rolling.sum, we go to _Rolling_and_Expanding.sum and then WindowGroupByMixin._apply. Inside there, we have

        def f(x, name=name, *args):
            x = self._shallow_copy(x)

            if isinstance(name, str):
                return getattr(x, name)(*args, **kwargs)

            return x.apply(name, *args, **kwargs)

Here, self is WindowGroupByMixin. In the case when we get here from Rolling.sum, then self._shallow_copy will take us directly to ShallowMixin._shallow_copy. Therefore, even if we define _GroupBy._shallow_copy, the execution won't get there during this operation.

EDIT

Anyway, am trying a slightly different approach, will ping if/when green and once I've thought over why corr and cov currently use ._selected_obj instead of ._obj_with_exclusions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback thanks for your review. I've pushed a new attempt at a solution.

I patches obj before we get to _GroupBy.apply, and only for Rolling so that previous behaviour in DataFrame.groupby.apply.

@MarcoGorelli MarcoGorelli changed the title BUG: Rolling groupby should not maintain the by column in the resulting DataFrame (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 21, 2020
@MarcoGorelli MarcoGorelli changed the title (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 21, 2020
@MarcoGorelli MarcoGorelli changed the title BUG: Rolling groupby should not maintain the by column in the resulting DataFrame (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 22, 2020
@MarcoGorelli MarcoGorelli changed the title (wip) BUG: Rolling groupby should not maintain the by column in the resulting DataFrame BUG: Rolling groupby should not maintain the by column in the resulting DataFrame Mar 22, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not going well. there is mutation all over the place. This needs a good clean solution to get merged.

@@ -167,6 +167,35 @@ key and type of :class:`Index`. These now consistently raise ``KeyError`` (:iss
...
KeyError: Timestamp('1970-01-01 00:00:00')

GroupBy.rolling no longer returns grouped-by column in values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an expl here, a couple of sentences; this the issue number.

return self.obj

# there may be elements in self.exclusions that are no longer
# in self.obj, see GH 32468
nlevels = self.obj.columns.nlevels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? why are you doing all of this

@@ -35,6 +35,8 @@ def _dispatch(name: str, *args, **kwargs):
def outer(self, *args, **kwargs):
def f(x):
x = self._shallow_copy(x, groupby=self._groupby)
# patch for GH 32332
x.obj = x._obj_with_exclusions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like this

@MarcoGorelli
Copy link
Member Author

Yes, I see what you mean - to be honest I can't think of a clean solution, and have spent a while on it already. I'll close this for now so someone else can take it up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rolling groupby should not maintain the by column in the resulting DataFrame
4 participants