
BUG: Rolling groupby should not maintain the by column in the resulting DataFrame #32332


Closed

wants to merge 41 commits into from

Changes from 29 commits

41 commits
a6ad6c5
make sure exclusions are applied before the groupby object reaches ro…
Feb 28, 2020
54903a7
pass object with exclusions earlier on
MarcoGorelli Feb 29, 2020
d81c572
revert accident
MarcoGorelli Feb 29, 2020
e049205
check for side effects in obj
MarcoGorelli Feb 29, 2020
5f43f3a
wip
MarcoGorelli Mar 1, 2020
3b1c3ff
Wip
MarcoGorelli Mar 1, 2020
207a507
change _selected_obj according to self.mutated
MarcoGorelli Mar 1, 2020
c9b34b2
sanitize
MarcoGorelli Mar 1, 2020
b2ce758
update old tests
MarcoGorelli Mar 1, 2020
c307fdc
fix old docstring
MarcoGorelli Mar 1, 2020
35f2b2d
finish correcting old docstring
MarcoGorelli Mar 1, 2020
2bb7932
use getattr
MarcoGorelli Mar 1, 2020
eed4122
old fix
Mar 3, 2020
26d9dec
simpler fix
Mar 3, 2020
68b4299
try making Window's own _shallow_copy
Mar 3, 2020
159a3d6
more wip
Mar 3, 2020
0b393b6
typeerror!
Mar 3, 2020
9fc50ee
add some types, comment tests
Mar 3, 2020
7fe2fcf
fix typing
Mar 3, 2020
8a34ff4
better fix
Mar 3, 2020
8953bda
correct return annotation
Mar 3, 2020
b2b8a42
simplify code
Mar 3, 2020
d476191
fix upstream, remove try except
Mar 4, 2020
63e3d85
document change as api-breaking
Mar 4, 2020
17373ad
Merge remote-tracking branch 'upstream/master' into rolling-groupby
MarcoGorelli Mar 15, 2020
a7ca8eb
remove elements from exclusions within _obj_with_exclusions
MarcoGorelli Mar 15, 2020
35e7340
remove no-longer-necessary performance warning
MarcoGorelli Mar 15, 2020
17090da
simplify _obj_with_exclusions
MarcoGorelli Mar 15, 2020
7c7f79c
correct comment
MarcoGorelli Mar 15, 2020
9b98dad
Merge remote-tracking branch 'upstream/master' into rolling-groupby
Mar 19, 2020
4d1c5a6
deal with multiindexed columns case
Mar 21, 2020
0dde4a8
use elif in _obj_with_exclusions, add test for column multiindex, dis…
Mar 21, 2020
1e4ba56
simplify unique_column_names
Mar 21, 2020
6312725
rewrite using get_level_values
Mar 21, 2020
d9748d1
revert 'or' which was accidentally changed to 'and'
Mar 21, 2020
1623902
only patch _apply, cov and corr
Mar 21, 2020
5d7a477
reinstate change to _obj_with_exclusions
Mar 21, 2020
d117624
exclude 'by' column in Rolling.count
Mar 22, 2020
367e671
don't modify _selected_obj - instead, patch obj before we reach apply
Mar 22, 2020
d7cf9f1
slight simplification to _shallow_copy
Mar 22, 2020
d251715
comment on patch in corr and cov
Mar 22, 2020
25 changes: 25 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -166,6 +166,31 @@ key and type of :class:`Index`. These now consistently raise ``KeyError`` (:iss
...
KeyError: Timestamp('1970-01-01 00:00:00')

GroupBy.rolling no longer returns grouped-by column in values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Contributor:

Add an explanation here, a couple of sentences, and the issue number.

*Previous behavior*:

.. code-block:: ipython

    In [1]: df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})

    In [2]: df.groupby("A").rolling(2).sum()
    Out[2]:
           A    B
    A
    1 0  NaN  NaN
      1  2.0  1.0
    2 2  NaN  NaN
    3 3  NaN  NaN

*New behavior*:

.. ipython:: python

    df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
    df.groupby("A").rolling(2).sum()

.. ---------------------------------------------------------------------------

.. _whatsnew_110.api_breaking.assignment_to_multiple_columns:
9 changes: 6 additions & 3 deletions pandas/core/base.py
@@ -207,11 +207,14 @@ def _obj_with_exclusions(self):
        if self._selection is not None and isinstance(self.obj, ABCDataFrame):
            return self.obj.reindex(columns=self._selection_list)

        if len(self.exclusions) > 0:
            return self.obj.drop(self.exclusions, axis=1)
        else:
        if not isinstance(self.obj, ABCDataFrame) or not self.exclusions:
            return self.obj

        # there may be elements in self.exclusions that are no longer
        # in self.obj, see GH 32468
        exclusions = self.exclusions.intersection(self.obj.columns)
        return self.obj.drop(exclusions, axis=1)

    def __getitem__(self, key):
        if self._selection is not None:
            raise IndexError(f"Column(s) {self._selection} already selected")
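
A standalone illustration of the guard added above, using plain DataFrame operations rather than the pandas internals: if the exclusions still name a column that is no longer present in the object, a bare drop raises, while intersecting with the current columns first is safe.

    import pandas as pd

    df = pd.DataFrame({"A": [1, 1, 2], "B": [0, 1, 2]})
    exclusions = {"A", "C"}  # "C" is no longer among df's columns (the GH 32468 situation)

    # df.drop(list(exclusions), axis=1) would raise KeyError because of "C".
    # Intersecting with the current columns first makes the drop safe:
    safe_exclusions = list(exclusions.intersection(df.columns))
    print(df.drop(safe_exclusions, axis=1).columns.tolist())  # ['B']
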
1 change: 1 addition & 0 deletions pandas/core/groupby/groupby.py
@@ -1577,6 +1577,7 @@ def rolling(self, *args, **kwargs):
"""
from pandas.core.window import RollingGroupby

kwargs["exclusions"] = self.exclusions
return RollingGroupby(self, *args, **kwargs)

@Substitution(name="groupby")
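
A schematic sketch of the hand-off above, using hypothetical stand-in classes rather than the real pandas constructors: the groupby object forwards its exclusions through the keyword arguments, and the window object (see the matching pickup in rolling.py further below) stores them for later use.

    class ToyGroupBy:
        def __init__(self, obj, by):
            self.obj = obj
            self.exclusions = {by}  # the grouped-by column is recorded as an exclusion

        def rolling(self, *args, **kwargs):
            # mirrors the groupby.py change: thread exclusions into the window object
            kwargs["exclusions"] = self.exclusions
            return ToyRollingGroupby(self, *args, **kwargs)

    class ToyRollingGroupby:
        def __init__(self, groupby, window, **kwargs):
            self._groupby = groupby
            self.window = window
            # mirrors the rolling.py change: pick the exclusions up from kwargs
            self.exclusions = kwargs.get("exclusions", set())

    r = ToyGroupBy(obj=None, by="A").rolling(2)
    print(r.exclusions)  # {'A'}
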
1 change: 1 addition & 0 deletions pandas/core/window/ewm.py
@@ -142,6 +142,7 @@ def __init__(
        adjust=True,
        ignore_na=False,
        axis=0,
        **kwargs,
    ):
        self.obj = obj
        self.com = _get_center_of_mass(com, span, halflife, alpha)
7 changes: 7 additions & 0 deletions pandas/core/window/rolling.py
@@ -93,6 +93,13 @@ def __init__(
        self.axis = obj._get_axis_number(axis) if axis is not None else None
        self.validate()
        self._numba_func_cache: Dict[Optional[str], Callable] = dict()
        self.exclusions = kwargs.get("exclusions", set())

    def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin:
        exclusions = self.exclusions
        new_obj = super()._shallow_copy(obj, exclusions=exclusions, **kwargs)
        new_obj.obj = new_obj._obj_with_exclusions
Contributor:

Are you sure this line is actually needed? This is very, very odd to do.

Contributor:

Copying the exclusions should be enough. You may be able to move this shallow copy to pandas/core/groupby/groupby.py, which is better than here.

Member Author:

The reason for doing this was that RollingGroupby._groupby._python_apply_general operates on RollingGroupby._groupby._selected_obj, and RollingGroupby._groupby._selected_obj returns RollingGroupby._groupby.obj, possibly sliced by RollingGroupby._groupby._selection.

Member Author (@MarcoGorelli, Mar 21, 2020):

> you may be able to move this shallow copy to pandas/core/groupby/groupby.py

As in, define _GroupBy._shallow_copy? I don't understand how that would work: when we call Rolling.sum, we go to _Rolling_and_Expanding.sum and then WindowGroupByMixin._apply. Inside there, we have

        def f(x, name=name, *args):
            x = self._shallow_copy(x)

            if isinstance(name, str):
                return getattr(x, name)(*args, **kwargs)

            return x.apply(name, *args, **kwargs)

Here, self is WindowGroupByMixin. In the case when we get here from Rolling.sum, then self._shallow_copy will take us directly to ShallowMixin._shallow_copy. Therefore, even if we define _GroupBy._shallow_copy, the execution won't get there during this operation.
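
A minimal toy example of the method-resolution point above (hypothetical class names, not the pandas hierarchy): a _shallow_copy defined on a class that is not one of the caller's bases is simply never consulted.

    class ToyShallowMixin:
        def _shallow_copy(self, obj):
            return f"ToyShallowMixin copy of {obj!r}"

    class ToyGroupBy:
        # Defining the method here has no effect on ToyWindowGroupBy below,
        # because ToyGroupBy is not part of its MRO.
        def _shallow_copy(self, obj):
            return f"ToyGroupBy copy of {obj!r}"

    class ToyWindowGroupBy(ToyShallowMixin):
        def _apply(self, obj):
            # resolves through ToyWindowGroupBy's MRO -> ToyShallowMixin
            return self._shallow_copy(obj)

    print(ToyWindowGroupBy()._apply("x"))  # ToyShallowMixin copy of 'x'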

EDIT

Anyway, am trying a slightly different approach, will ping if/when green and once I've thought over why corr and cov currently use ._selected_obj instead of ._obj_with_exclusions

Member Author:

@jreback thanks for your review. I've pushed a new attempt at a solution.

I patch obj before we get to _GroupBy.apply, and only for Rolling, so that the previous behaviour of DataFrame.groupby.apply is preserved.

        return new_obj

    @property
    def _constructor(self):
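
A user-facing sketch of the behaviour described in the last comment above, assuming a pandas build that includes this change: only the Rolling path drops the grouped-by column, while DataFrame.groupby(...).apply is left untouched and still keeps it (which is why the updated tests below drop "A" from the apply-based expectations).

    import pandas as pd

    df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
    g = df.groupby("A")

    # Rolling no longer carries the grouped-by column in the values
    rolled = g.rolling(2).sum()
    print(list(rolled.columns))  # ['B']

    # groupby.apply keeps "A" in the result, as before
    applied = g.apply(lambda x: x.rolling(2).sum())
    print(list(applied.columns))  # ['A', 'B']
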
5 changes: 1 addition & 4 deletions pandas/tests/groupby/test_groupby.py
@@ -5,8 +5,6 @@
import numpy as np
import pytest

from pandas.errors import PerformanceWarning

import pandas as pd
from pandas import DataFrame, Index, MultiIndex, Series, Timestamp, date_range, read_csv
import pandas._testing as tm
@@ -1565,8 +1563,7 @@ def test_groupby_multiindex_not_lexsorted():
    tm.assert_frame_equal(lexsorted_df, not_lexsorted_df)

    expected = lexsorted_df.groupby("a").mean()
    with tm.assert_produces_warning(PerformanceWarning):
        result = not_lexsorted_df.groupby("a").mean()
    result = not_lexsorted_df.groupby("a").mean()
    tm.assert_frame_equal(expected, result)

    # a transforming function should work regardless of sort
10 changes: 10 additions & 0 deletions pandas/tests/window/test_grouper.py
@@ -62,11 +62,15 @@ def test_rolling(self):
        for f in ["sum", "mean", "min", "max", "count", "kurt", "skew"]:
            result = getattr(r, f)()
            expected = g.apply(lambda x: getattr(x.rolling(4), f)())
            # groupby.apply doesn't drop the grouped-by column
            expected = expected.drop("A", axis=1)
            tm.assert_frame_equal(result, expected)

        for f in ["std", "var"]:
            result = getattr(r, f)(ddof=1)
            expected = g.apply(lambda x: getattr(x.rolling(4), f)(ddof=1))
            # groupby.apply doesn't drop the grouped-by column
            expected = expected.drop("A", axis=1)
            tm.assert_frame_equal(result, expected)

    @pytest.mark.parametrize(
@@ -79,6 +83,8 @@ def test_rolling_quantile(self, interpolation):
        expected = g.apply(
            lambda x: x.rolling(4).quantile(0.4, interpolation=interpolation)
        )
        # groupby.apply doesn't drop the grouped-by column
        expected = expected.drop("A", axis=1)
        tm.assert_frame_equal(result, expected)

    def test_rolling_corr_cov(self):
@@ -92,6 +98,8 @@ def func(x):
                return getattr(x.rolling(4), f)(self.frame)

            expected = g.apply(func)
            # groupby.apply doesn't drop the grouped-by column
            expected = expected.drop("A", axis=1)
            tm.assert_frame_equal(result, expected)

            result = getattr(r.B, f)(pairwise=True)
@@ -109,6 +117,8 @@ def test_rolling_apply(self, raw):
        # reduction
        result = r.apply(lambda x: x.sum(), raw=raw)
        expected = g.apply(lambda x: x.rolling(4).apply(lambda y: y.sum(), raw=raw))
        # groupby.apply doesn't drop the grouped-by column
        expected = expected.drop("A", axis=1)
        tm.assert_frame_equal(result, expected)

    def test_rolling_apply_mutability(self):
12 changes: 12 additions & 0 deletions pandas/tests/window/test_rolling.py
@@ -465,3 +465,15 @@ def test_rolling_count_default_min_periods_with_null_values(constructor):
    result = constructor(values).rolling(3).count()
    expected = constructor(expected_counts)
    tm.assert_equal(result, expected)


def test_by_column_not_in_values():
    # GH 32262
    df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8, "B": np.arange(40)})

    g = df.groupby("A")
    original_obj = g.obj.copy(deep=True)
    r = g.rolling(4)
    result = r.sum()
    assert "A" not in result.columns
    tm.assert_frame_equal(g.obj, original_obj)  # check for side-effects