
BUG: Rolling groupby should not maintain the by column in the resulting DataFrame #32332


Closed
wants to merge 41 commits

41 commits
a6ad6c5
make sure exclusions are applied before the groupby object reaches ro…
Feb 28, 2020
54903a7
pass object with exclusions earlier on
MarcoGorelli Feb 29, 2020
d81c572
revert accident
MarcoGorelli Feb 29, 2020
e049205
check for side effects in obj
MarcoGorelli Feb 29, 2020
5f43f3a
wip
MarcoGorelli Mar 1, 2020
3b1c3ff
Wip
MarcoGorelli Mar 1, 2020
207a507
change _selected_obj according to self.mutated
MarcoGorelli Mar 1, 2020
c9b34b2
sanitize
MarcoGorelli Mar 1, 2020
b2ce758
update old tests
MarcoGorelli Mar 1, 2020
c307fdc
fix old docstring
MarcoGorelli Mar 1, 2020
35f2b2d
finish correcting old docstring
MarcoGorelli Mar 1, 2020
2bb7932
use getattr
MarcoGorelli Mar 1, 2020
eed4122
old fix
Mar 3, 2020
26d9dec
simpler fix
Mar 3, 2020
68b4299
try making Window's own _shallow_copy
Mar 3, 2020
159a3d6
more wip
Mar 3, 2020
0b393b6
typeerror!
Mar 3, 2020
9fc50ee
add some types, comment tests
Mar 3, 2020
7fe2fcf
fix typing
Mar 3, 2020
8a34ff4
better fix
Mar 3, 2020
8953bda
correct return annotation
Mar 3, 2020
b2b8a42
simplify code
Mar 3, 2020
d476191
fix upstream, remove try except
Mar 4, 2020
63e3d85
document change as api-breaking
Mar 4, 2020
17373ad
Merge remote-tracking branch 'upstream/master' into rolling-groupby
MarcoGorelli Mar 15, 2020
a7ca8eb
remove elements from exclusions within _obj_with_exclusions
MarcoGorelli Mar 15, 2020
35e7340
remove no-longer-necessary performance warning
MarcoGorelli Mar 15, 2020
17090da
simplify _obj_with_exclusions
MarcoGorelli Mar 15, 2020
7c7f79c
correct comment
MarcoGorelli Mar 15, 2020
9b98dad
Merge remote-tracking branch 'upstream/master' into rolling-groupby
Mar 19, 2020
4d1c5a6
deal with multiindexed columns case
Mar 21, 2020
0dde4a8
use elif in _obj_with_exclusions, add test for column multiindex, dis…
Mar 21, 2020
1e4ba56
simplify unique_column_names
Mar 21, 2020
6312725
rewrite using get_level_values
Mar 21, 2020
d9748d1
revert 'or' which was accidentally changed to 'and'
Mar 21, 2020
1623902
only patch _apply, cov and corr
Mar 21, 2020
5d7a477
reinstate change to _obj_with_exclusions
Mar 21, 2020
d117624
exclude 'by' column in Rolling.count
Mar 22, 2020
367e671
don't modify _selected_obj - instead, patch obj before we reach apply
Mar 22, 2020
d7cf9f1
slight simplification to _shallow_copy
Mar 22, 2020
d251715
comment on patch in corr and cov
Mar 22, 2020
29 changes: 29 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -167,6 +167,35 @@ key and type of :class:`Index`. These now consistently raise ``KeyError`` (:iss
...
KeyError: Timestamp('1970-01-01 00:00:00')

GroupBy.rolling no longer returns grouped-by column in values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Contributor: Add an explanation here, a couple of sentences, and the issue number.

The column on which the rolling operation is grouped is no longer returned in
the values of the result of ``groupby(...).rolling(...)``; it now appears only
in the index. Suppose we start with:

.. ipython:: python

    df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
    df

*Previous behavior*:

.. code-block:: ipython

    In [1]: df.groupby("A").rolling(2).sum()
    Out[1]:
             A    B
    A
    1 0    NaN  NaN
      1    2.0  1.0
    2 2    NaN  NaN
    3 3    NaN  NaN

*New behavior*:

.. ipython:: python

    df.groupby("A").rolling(2).sum()

.. ---------------------------------------------------------------------------

.. _whatsnew_110.api_breaking.assignment_to_multiple_columns:
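For reference, a self-contained version of the example in the whatsnew entry above; this is a sketch assuming pandas >= 1.1 (i.e. with this change applied), showing that the grouped-by column ends up in the index only.

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})

# "A" is used for grouping and indexing only;
# the values of the result contain just "B"
result = df.groupby("A").rolling(2).sum()
print(result.columns.tolist())  # ['B']
print(result)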
13 changes: 10 additions & 3 deletions pandas/core/base.py
@@ -209,11 +209,18 @@ def _obj_with_exclusions(self):
if self._selection is not None and isinstance(self.obj, ABCDataFrame):
return self.obj.reindex(columns=self._selection_list)

if len(self.exclusions) > 0:
return self.obj.drop(self.exclusions, axis=1)
else:
elif not self.exclusions or not isinstance(self.obj, ABCDataFrame):
return self.obj

# there may be elements in self.exclusions that are no longer
# in self.obj, see GH 32468
nlevels = self.obj.columns.nlevels
Contributor: huh? why are you doing all of this

unique_column_names = {
j for i in range(nlevels) for j in self.obj.columns.get_level_values(i)
}
exclusions = self.exclusions.intersection(unique_column_names)
return self.obj.drop(exclusions, axis=1)

def __getitem__(self, key):
if self._selection is not None:
raise IndexError(f"Column(s) {self._selection} already selected")
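A minimal standalone sketch of the idea behind the _obj_with_exclusions change above, under the assumption that exclusions may name labels from any column level and may contain labels that are no longer present in the object (cf. GH 32468): gather the labels from every level, intersect with the exclusions, and drop only those. The name "stale_label" is purely illustrative.

import pandas as pd

columns = pd.MultiIndex.from_tuples([("A", ""), ("B", "C")])
df = pd.DataFrame([[1, 0], [2, 3]], columns=columns)
# "stale_label" stands in for an exclusion that is no longer among the columns
exclusions = {"A", "stale_label"}

# gather every label that appears on any column level
unique_column_names = {
    label
    for level in range(df.columns.nlevels)
    for label in df.columns.get_level_values(level)
}
# only drop the exclusions that actually exist among the column labels
to_drop = sorted(exclusions & unique_column_names)
print(df.drop(to_drop, axis=1))  # only the ("B", "C") column remains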
1 change: 1 addition & 0 deletions pandas/core/groupby/groupby.py
@@ -1577,6 +1577,7 @@ def rolling(self, *args, **kwargs):
"""
from pandas.core.window import RollingGroupby

kwargs["exclusions"] = self.exclusions
return RollingGroupby(self, *args, **kwargs)

@Substitution(name="groupby")
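The one-line change above forwards the groupby object's exclusions (the grouped-by column names) to the RollingGroupby constructor via kwargs. A rough standalone sketch of that forwarding pattern, using hypothetical stand-in classes (FakeGroupBy, FakeRollingGroupby) rather than pandas internals:

class FakeGroupBy:
    """Stand-in for a groupby object that remembers its grouped-by columns."""

    def __init__(self, exclusions):
        self.exclusions = set(exclusions)

    def rolling(self, *args, **kwargs):
        # hand the exclusions to the window object at construction time,
        # so it can drop those columns from its results later on
        kwargs["exclusions"] = self.exclusions
        return FakeRollingGroupby(self, *args, **kwargs)


class FakeRollingGroupby:
    """Stand-in for a grouped rolling window object."""

    def __init__(self, groupby, window=None, **kwargs):
        self.window = window
        self.exclusions = kwargs.get("exclusions", set())


r = FakeGroupBy(exclusions={"A"}).rolling(4)
print(r.exclusions)  # {'A'}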
4 changes: 4 additions & 0 deletions pandas/core/window/common.py
@@ -35,6 +35,8 @@ def _dispatch(name: str, *args, **kwargs):
def outer(self, *args, **kwargs):
def f(x):
x = self._shallow_copy(x, groupby=self._groupby)
# patch for GH 32332
x.obj = x._obj_with_exclusions
Contributor: I really don't like this

return getattr(x, name)(*args, **kwargs)

return self._groupby.apply(f)
@@ -82,6 +84,8 @@ def _apply(
# TODO: can we de-duplicate with _dispatch?
def f(x, name=name, *args):
x = self._shallow_copy(x)
# patch for GH 32332
x.obj = x._obj_with_exclusions

if isinstance(name, str):
return getattr(x, name)(*args, **kwargs)
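A simplified, standalone illustration of what the x.obj = x._obj_with_exclusions patch achieves, using only the public pandas API rather than the internal _dispatch/_apply machinery: for each group, the grouped-by column is removed before the rolling function runs, so it never shows up in the values. The helper name rolling_sum_without_by is hypothetical.

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})


def rolling_sum_without_by(group):
    # mimic _obj_with_exclusions: drop the grouped-by column before rolling
    return group.drop("A", axis=1).rolling(2).sum()


result = df.groupby("A").apply(rolling_sum_without_by)
print(result)  # only column "B" appears in the values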
1 change: 1 addition & 0 deletions pandas/core/window/ewm.py
@@ -142,6 +142,7 @@ def __init__(
adjust=True,
ignore_na=False,
axis=0,
**kwargs,
):
self.obj = obj
self.com = _get_center_of_mass(com, span, halflife, alpha)
11 changes: 9 additions & 2 deletions pandas/core/window/rolling.py
@@ -93,6 +93,10 @@ def __init__(
self.axis = obj._get_axis_number(axis) if axis is not None else None
self.validate()
self._numba_func_cache: Dict[Optional[str], Callable] = dict()
self.exclusions = kwargs.get("exclusions", set())

def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin:
return super()._shallow_copy(obj, exclusions=self.exclusions, **kwargs)

@property
def _constructor(self):
@@ -1187,8 +1191,7 @@ def count(self):
closed=self.closed,
).sum()
results.append(result)

return self._wrap_results(results, blocks, obj)
return self._wrap_results(results, blocks, obj, exclude=self.exclusions)

_shared_docs["apply"] = dedent(
r"""
@@ -1632,6 +1635,8 @@ def cov(self, other=None, pairwise=None, ddof=1, **kwargs):
# only default unset
pairwise = True if pairwise is None else pairwise
other = self._shallow_copy(other)
# patch for GH 32332
other.obj = other._obj_with_exclusions

# GH 16058: offset window
if self.is_freq_type:
@@ -1775,6 +1780,8 @@ def corr(self, other=None, pairwise=None, **kwargs):
# only default unset
pairwise = True if pairwise is None else pairwise
other = self._shallow_copy(other)
# patch for GH 32332
other.obj = other._obj_with_exclusions
window = self._get_window(other) if not self.is_freq_type else self.win_freq

def _get_corr(a, b):
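A rough sketch (hypothetical Window stand-in class, not pandas internals) of why _shallow_copy is overridden above to forward exclusions: the per-group dispatch and the cov/corr paths all shallow-copy the window object, and each copy needs to remember which column was grouped by in order to drop it.

class Window:
    """Stand-in for a window object that carries a set of excluded columns."""

    def __init__(self, obj, exclusions=frozenset()):
        self.obj = obj
        self.exclusions = set(exclusions)

    def _shallow_copy(self, obj, **kwargs):
        # forward the exclusions so they survive every copy
        kwargs.setdefault("exclusions", self.exclusions)
        return type(self)(obj, **kwargs)


w = Window(obj="original frame", exclusions={"A"})
copy = w._shallow_copy(obj="per-group frame")
assert copy.exclusions == {"A"}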
10 changes: 10 additions & 0 deletions pandas/tests/window/test_grouper.py
@@ -62,11 +62,15 @@ def test_rolling(self):
for f in ["sum", "mean", "min", "max", "count", "kurt", "skew"]:
result = getattr(r, f)()
expected = g.apply(lambda x: getattr(x.rolling(4), f)())
# groupby.apply doesn't drop the grouped-by column
expected = expected.drop("A", axis=1)
tm.assert_frame_equal(result, expected)

for f in ["std", "var"]:
result = getattr(r, f)(ddof=1)
expected = g.apply(lambda x: getattr(x.rolling(4), f)(ddof=1))
# groupby.apply doesn't drop the grouped-by column
expected = expected.drop("A", axis=1)
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize(
@@ -79,6 +83,8 @@ def test_rolling_quantile(self, interpolation):
expected = g.apply(
lambda x: x.rolling(4).quantile(0.4, interpolation=interpolation)
)
# groupby.apply doesn't drop the grouped-by column
expected = expected.drop("A", axis=1)
tm.assert_frame_equal(result, expected)

def test_rolling_corr_cov(self):
@@ -92,6 +98,8 @@ def func(x):
return getattr(x.rolling(4), f)(self.frame)

expected = g.apply(func)
# groupby.apply doesn't drop the grouped-by column
expected = expected.drop("A", axis=1)
tm.assert_frame_equal(result, expected)

result = getattr(r.B, f)(pairwise=True)
@@ -109,6 +117,8 @@ def test_rolling_apply(self, raw):
# reduction
result = r.apply(lambda x: x.sum(), raw=raw)
expected = g.apply(lambda x: x.rolling(4).apply(lambda y: y.sum(), raw=raw))
# groupby.apply doesn't drop the grouped-by column
expected = expected.drop("A", axis=1)
tm.assert_frame_equal(result, expected)

def test_rolling_apply_mutability(self):
15 changes: 15 additions & 0 deletions pandas/tests/window/test_rolling.py
@@ -465,3 +465,18 @@ def test_rolling_count_default_min_periods_with_null_values(constructor):
result = constructor(values).rolling(3).count()
expected = constructor(expected_counts)
tm.assert_equal(result, expected)


@pytest.mark.parametrize(
"columns", [pd.MultiIndex.from_tuples([("A", ""), ("B", "C")]), ["A", "B"]]
)
def test_by_column_not_in_values(columns):
# GH 32262
df = pd.DataFrame([[1, 0]] * 20 + [[2, 0]] * 12 + [[3, 0]] * 8, columns=columns)

g = df.groupby("A")
original_obj = g.obj.copy(deep=True)
r = g.rolling(4)
result = r.sum()
assert "A" not in result.columns
tm.assert_frame_equal(g.obj, original_obj) # check for side-effects