DEPR: dropping nuisance columns in rolling aggregations #42834

Merged 6 commits on Aug 10, 2021. Diff shown below: changes from 2 commits.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.0.rst
@@ -159,6 +159,7 @@ Deprecations
- Deprecated treating ``numpy.datetime64`` objects as UTC times when passed to the :class:`Timestamp` constructor along with a timezone. In a future version, these will be treated as wall-times. To retain the old behavior, use ``Timestamp(dt64).tz_localize("UTC").tz_convert(tz)`` (:issue:`24559`)
- Deprecated ignoring missing labels when indexing with a sequence of labels on a level of a MultiIndex (:issue:`42351`)
- Creating an empty Series without a dtype will now raise a more visible ``FutureWarning`` instead of a ``DeprecationWarning`` (:issue:`30017`)
- Deprecated dropping of nuisance columns in :class:`Rolling` aggregations (:issue:`42738`)
Member: Good to include Expanding and EWM as well.
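For context, a minimal sketch of the behavior being deprecated, assuming pandas 1.4.x where the FutureWarning added here is active (the frame mirrors the one used in the tests below):

import pandas as pd

df = pd.DataFrame({"A": range(5), "B": range(5, 10), "C": "foo"})

# "C" is an object-dtype (nuisance) column: it is silently dropped today,
# triggers the new FutureWarning with this change, and is expected to raise
# TypeError in a future version.
result = df.rolling(window=3).sum()

# Recommended replacement: select only the valid columns first.
result = df[["A", "B"]].rolling(window=3).sum()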


.. ---------------------------------------------------------------------------

11 changes: 11 additions & 0 deletions pandas/core/window/rolling.py
@@ -32,6 +32,7 @@
from pandas.compat._optional import import_optional_dependency
from pandas.compat.numpy import function as nv
from pandas.util._decorators import doc
from pandas.util._exceptions import find_stack_level

from pandas.core.dtypes.common import (
ensure_float64,
@@ -436,6 +437,16 @@ def hfunc2d(values: ArrayLike) -> ArrayLike:
new_mgr = mgr.apply_2d(hfunc2d, ignore_failures=True)
else:
new_mgr = mgr.apply(hfunc, ignore_failures=True)

if 0 != len(new_mgr.items) != len(mgr.items):
# GH#42738 ignore_failures dropped nuisance columns
warnings.warn(
"Dropping of nuisance columns in rolling operations "
"is deprecated; in a future version this will raise TypeError. "
"Select only valid columns before calling the operation.",
FutureWarning,
stacklevel=find_stack_level(),
)
out = obj._constructor(new_mgr)

return self._resolve_output(out, obj)

Contributor: Can you list the columns that are problematic in the error message?

Member Author: updated + green
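As an aside on the 0 != len(new_mgr.items) != len(mgr.items) check above, a small illustrative restatement of when the warning fires (the helper name is hypothetical, not part of pandas):

def should_warn(n_result_cols: int, n_input_cols: int) -> bool:
    # Chained comparison: warn only when some, but not all, columns survived.
    return 0 != n_result_cols != n_input_cols

assert should_warn(2, 3)      # a nuisance column was dropped -> warn
assert not should_warn(3, 3)  # nothing dropped -> no warning
assert not should_warn(0, 3)  # everything dropped -> this branch stays silent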
4 changes: 3 additions & 1 deletion pandas/tests/window/test_api.py
@@ -68,7 +68,9 @@ def tests_skip_nuisance():
def test_skip_sum_object_raises():
df = DataFrame({"A": range(5), "B": range(5, 10), "C": "foo"})
r = df.rolling(window=3)
result = r.sum()
with tm.assert_produces_warning(FutureWarning, match="nuisance columns"):
# GH#42738
result = r.sum()
expected = DataFrame(
{"A": [np.nan, np.nan, 3, 6, 9], "B": [np.nan, np.nan, 18, 21, 24]},
columns=list("AB"),
6 changes: 4 additions & 2 deletions pandas/tests/window/test_ewm.py
@@ -116,8 +116,10 @@ def test_ewma_with_times_equal_spacing(halflife_with_times, times, min_periods):
data = np.arange(10.0)
data[::2] = np.nan
df = DataFrame({"A": data, "time_col": date_range("2000", freq="D", periods=10)})
result = df.ewm(halflife=halflife, min_periods=min_periods, times=times).mean()
expected = df.ewm(halflife=1.0, min_periods=min_periods).mean()
with tm.assert_produces_warning(FutureWarning, match="nuisance columns"):
# GH#42738
result = df.ewm(halflife=halflife, min_periods=min_periods, times=times).mean()
expected = df.ewm(halflife=1.0, min_periods=min_periods).mean()
tm.assert_frame_equal(result, expected)

Member: We should probably also deprecate allowing times to be a string column label here (which is already kinda flawed, because a column label isn't always a string). A column of times will always be a nuisance column, I think.

if isinstance(self.times, str):

Member Author: can this be done separately? this is less obvious to me

Member: I think so, yes.
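A sketch of the pattern the discussion above points toward, passing times as an array/Series so the time column never sits in the aggregated frame (illustrative values; assumes pandas 1.4.x ewm semantics):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": np.arange(10.0),
        "time_col": pd.date_range("2000", freq="D", periods=10),
    }
)

# Select only the value columns and pass the times separately; no
# nuisance-column FutureWarning is expected for this form.
result = df[["A"]].ewm(halflife="1 day", times=df["time_col"]).mean()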


38 changes: 23 additions & 15 deletions pandas/tests/window/test_groupby.py
@@ -923,7 +923,12 @@ def test_methods(self, method, expected_data):
)
tm.assert_frame_equal(result, expected)

expected = df.groupby("A").apply(lambda x: getattr(x.ewm(com=1.0), method)())
with tm.assert_produces_warning(FutureWarning, match="nuisance"):
# GH#42738
expected = df.groupby("A").apply(
lambda x: getattr(x.ewm(com=1.0), method)()
)

# There may be a bug in the above statement; not returning the correct index
tm.assert_frame_equal(result.reset_index(drop=True), expected)

@@ -955,7 +960,9 @@ def test_pairwise_methods(self, method, expected_data):
def test_times(self, times_frame):
# GH 40951
halflife = "23 days"
result = times_frame.groupby("A").ewm(halflife=halflife, times="C").mean()
with tm.assert_produces_warning(FutureWarning, match="nuisance"):
# GH#42738
result = times_frame.groupby("A").ewm(halflife=halflife, times="C").mean()
expected = DataFrame(
{
"B": [
@@ -992,22 +999,23 @@ def test_times(self, times_frame):
def test_times_vs_apply(self, times_frame):
# GH 40951
halflife = "23 days"
result = times_frame.groupby("A").ewm(halflife=halflife, times="C").mean()
expected = (
times_frame.groupby("A")
.apply(lambda x: x.ewm(halflife=halflife, times="C").mean())
.iloc[[0, 3, 6, 9, 1, 4, 7, 2, 5, 8]]
.reset_index(drop=True)
)
with tm.assert_produces_warning(FutureWarning, match="nuisance"):
# GH#42738
result = times_frame.groupby("A").ewm(halflife=halflife, times="C").mean()
expected = (
times_frame.groupby("A")
.apply(lambda x: x.ewm(halflife=halflife, times="C").mean())
.iloc[[0, 3, 6, 9, 1, 4, 7, 2, 5, 8]]
.reset_index(drop=True)
)
tm.assert_frame_equal(result.reset_index(drop=True), expected)

def test_times_array(self, times_frame):
# GH 40951
halflife = "23 days"
result = times_frame.groupby("A").ewm(halflife=halflife, times="C").mean()
expected = (
times_frame.groupby("A")
.ewm(halflife=halflife, times=times_frame["C"].values)
.mean()
)
gb = times_frame.groupby("A")
with tm.assert_produces_warning(FutureWarning, match="nuisance"):
# GH#42738
result = gb.ewm(halflife=halflife, times="C").mean()
expected = gb.ewm(halflife=halflife, times=times_frame["C"].values).mean()
tm.assert_frame_equal(result, expected)
37 changes: 27 additions & 10 deletions pandas/tests/window/test_numba.py
@@ -170,26 +170,39 @@ def test_invalid_engine_kwargs(self, grouper):
engine="cython", engine_kwargs={"nopython": True}
)

@pytest.mark.parametrize(
"grouper", [lambda x: x, lambda x: x.groupby("A")], ids=["None", "groupby"]
)
@pytest.mark.parametrize("grouper", ["None", "groupby"])
def test_cython_vs_numba(
self, grouper, nogil, parallel, nopython, ignore_na, adjust
):
if grouper == "None":
grouper = lambda x: x
warn = FutureWarning
else:
grouper = lambda x: x.groupby("A")
warn = None

df = DataFrame({"A": ["a", "b", "a", "b"], "B": range(4)})
ewm = grouper(df).ewm(com=1.0, adjust=adjust, ignore_na=ignore_na)

engine_kwargs = {"nogil": nogil, "parallel": parallel, "nopython": nopython}
result = ewm.mean(engine="numba", engine_kwargs=engine_kwargs)
expected = ewm.mean(engine="cython")
with tm.assert_produces_warning(warn, match="nuisance"):
# GH#42738
result = ewm.mean(engine="numba", engine_kwargs=engine_kwargs)
expected = ewm.mean(engine="cython")

tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize(
"grouper", [lambda x: x, lambda x: x.groupby("A")], ids=["None", "groupby"]
)
@pytest.mark.parametrize("grouper", ["None", "groupby"])
def test_cython_vs_numba_times(self, grouper, nogil, parallel, nopython, ignore_na):
# GH 40951

if grouper == "None":
grouper = lambda x: x
warn = FutureWarning
else:
grouper = lambda x: x.groupby("A")
warn = None

halflife = "23 days"
times = to_datetime(
[
@@ -207,8 +220,12 @@ def test_cython_vs_numba_times(self, grouper, nogil, parallel, nopython, ignore_
)

engine_kwargs = {"nogil": nogil, "parallel": parallel, "nopython": nopython}
result = ewm.mean(engine="numba", engine_kwargs=engine_kwargs)
expected = ewm.mean(engine="cython")

# TODO: why only in these cases?
with tm.assert_produces_warning(warn, match="nuisance"):
# GH#42738
result = ewm.mean(engine="numba", engine_kwargs=engine_kwargs)
expected = ewm.mean(engine="cython")

tm.assert_frame_equal(result, expected)

Member (on the TODO above): Because column "A" is string dtype, in the non-grouped case there is an attempt to aggregate over it, while in the groupby case the grouping column should be dropped internally beforehand.
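A small sketch of the asymmetry that comment describes, again assuming pandas 1.4.x behavior:

import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a", "b"], "B": range(4)})

# Non-grouped: "A" is object dtype, so it is a nuisance column for the mean
# and the nuisance-column FutureWarning is expected here.
df.ewm(com=1.0).mean()

# Grouped: "A" is the grouping key and is excluded from the aggregated
# columns internally, so no nuisance-column warning is expected.
df.groupby("A").ewm(com=1.0).mean()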
