-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Rolling groupby should not maintain the by column in the resulting DataFrame #32332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look! @mroeschke might also have thoughts
3be9d0e
to
1ca73fd
Compare
@WillAyd thanks for your review - I've written it a different way (so no side-effects this time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - this looks like a better approach. Just wondering if there's something even cleaner out there (a lot of the groupby stuff is a mess as is...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comments. The OP is only partially correct. If you do .rolling.groupby (or any of the variants) and .apply then you will get back the grouping column, just like groupby.apply does; so this should not change.
e.g.
In [1]: df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 3]})
In [2]: df.groupby('A').sum()
Out[2]:
B
A
1 3
2 3
In [3]: df.groupby('A').apply(lambda x: x.sum())
Out[3]:
A B
A
1 2 3
2 2 3
The issue here is that .rolling.groupby happens to use .apply as an implementation detail, so that's the behavior you get. I am open to patching this only there.
Thanks @jreback - that probably simplifies the patch then. What about if we use
This is the output that's currently being tested for in |
@jreback have pushed a new fix. Have had to update some tests which check output against |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will affect groupby.rolling and groupby.resample but NOT groupby.apply
pandas/core/window/rolling.py
Outdated
def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin: | ||
if isinstance(obj, ABCDataFrame): | ||
try: | ||
new_obj = super()._shallow_copy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WAT?
lint wip clean re-write add comment
pandas/core/window/rolling.py
Outdated
self.exclusions = kwargs.get("exclusions", set()) | ||
|
||
def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin: | ||
if isinstance(obj, ABCDataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WAT? i would much rather properly copy in the super class. really avoid if/then cases unless absolutely necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IOW we can always just copy exclusions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback sure, I can remove the if-then cases here, but then I need to potentially remove elements from self.exclusions
in
def _obj_with_exclusions(self)
, otherwise this could throw an error if there are elements of self.exclusions
that are no longer part of self.obj.columns
:
return self.obj.drop(self.exclusions, axis=1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, that's what I've done in the latest commit
pandas/core/window/rolling.py
Outdated
def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin: | ||
exclusions = self.exclusions | ||
new_obj = super()._shallow_copy(obj, exclusions=exclusions, **kwargs) | ||
new_obj.obj = new_obj._obj_with_exclusions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure this line is actually needed? this is very very odd to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copying the exclusions should be enough. you maybe be able to move this shallow copy to pandas/core/groupby/groupby.py which is better than here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for doing this was that RollingGroupby._groupby._python_apply_general
operates on RollingGroupby._groupby._selected_obj
.
And RollingGroupby._groupby._selected_obj
returns RollingGroupby._groupby.obj
, possibly sliced by RollingGroupby._groupby._selection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you maybe be able to move this shallow copy to pandas/core/groupby/groupby.py
As in, define _GroupBy._shallow_copy
? I don't understand how that would work: when we call Rolling.sum
, we go to _Rolling_and_Expanding.sum
and then WindowGroupByMixin._apply
. Inside there, we have
def f(x, name=name, *args):
x = self._shallow_copy(x)
if isinstance(name, str):
return getattr(x, name)(*args, **kwargs)
return x.apply(name, *args, **kwargs)
Here, self
is WindowGroupByMixin
. In the case when we get here from Rolling.sum
, then self._shallow_copy
will take us directly to ShallowMixin._shallow_copy
. Therefore, even if we define _GroupBy._shallow_copy
, the execution won't get there during this operation.
EDIT
Anyway, am trying a slightly different approach, will ping if/when green and once I've thought over why corr
and cov
currently use ._selected_obj
instead of ._obj_with_exclusions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback thanks for your review. I've pushed a new attempt at a solution.
I patches obj
before we get to _GroupBy.apply
, and only for Rolling
so that previous behaviour in DataFrame.groupby.apply
.
…play dataframe in whatsnew entry separately from before vs after
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not going well. there is mutation all over the place. This needs a good clean solution to get merged.
@@ -167,6 +167,35 @@ key and type of :class:`Index`. These now consistently raise ``KeyError`` (:iss | |||
... | |||
KeyError: Timestamp('1970-01-01 00:00:00') | |||
|
|||
GroupBy.rolling no longer returns grouped-by column in values | |||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an expl here, a couple of sentences; this the issue number.
return self.obj | ||
|
||
# there may be elements in self.exclusions that are no longer | ||
# in self.obj, see GH 32468 | ||
nlevels = self.obj.columns.nlevels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh? why are you doing all of this
@@ -35,6 +35,8 @@ def _dispatch(name: str, *args, **kwargs): | |||
def outer(self, *args, **kwargs): | |||
def f(x): | |||
x = self._shallow_copy(x, groupby=self._groupby) | |||
# patch for GH 32332 | |||
x.obj = x._obj_with_exclusions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't like this
Yes, I see what you mean - to be honest I can't think of a clean solution, and have spent a while on it already. I'll close this for now so someone else can take it up |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Does this seem like the correct idea? If so, I'll expand on tests / write whatsnew / think about a cleaner patch
EDIT
lots of unwanted changes here, just seeing what tests fail
EDIT
need to check the types make sense and see what's the proper way to write