Skip to content

API: reimplement FixedWindowIndexer.get_window_bounds to fix groupby bug #36132

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 25 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
69f084f
updated fixed indexer to work with rolling df and groupby
justinessert Sep 4, 2020
71830c8
updated is_weighted case
justinessert Sep 4, 2020
a449d9b
added comment
justinessert Sep 4, 2020
476fe83
corrected offset for even window sizes
justinessert Sep 5, 2020
9dfd9f3
reverted changes for weighted windows
justinessert Sep 5, 2020
f025600
reverted back to fixed func type; added func_type variable
justinessert Sep 6, 2020
5d902fd
reformatted
justinessert Sep 6, 2020
59fcd3e
merged master and resolved conflict
justinessert Sep 6, 2020
cdecf34
corrected return typing
justinessert Sep 6, 2020
4e8f844
added consistency tests
justinessert Sep 6, 2020
6e66a49
corrected typing change
justinessert Sep 6, 2020
e7fb384
reformatted test to pass blac
justinessert Sep 6, 2020
3649ca2
added typing and docstring
justinessert Sep 6, 2020
00cc1dc
fixing center param in median's _apply
justinessert Sep 6, 2020
3de7fcc
added center_min_periods test to test_grouper
justinessert Sep 7, 2020
f779321
replaced func_type with skip_offset
justinessert Sep 9, 2020
a817f87
removed unneeded class attribute
justinessert Sep 9, 2020
d72812d
moved logic into calculate_center_offset
justinessert Sep 9, 2020
96c6959
removed whitespace
justinessert Sep 9, 2020
daacae7
added whatsnew entry
justinessert Sep 12, 2020
52a8a6b
Merge remote-tracking branch 'upstream/master' into groupby-rolling
justinessert Sep 12, 2020
950018c
formatting fixes
justinessert Sep 13, 2020
0798c70
removed pytest.slow
justinessert Sep 13, 2020
f413ec8
removed typing of window
justinessert Sep 13, 2020
70679be
Merge branch 'master' into groupby-rolling
justinessert Sep 25, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -347,6 +347,7 @@ Groupby/resample/rolling
- Bug when subsetting columns on a :class:`~pandas.core.groupby.DataFrameGroupBy` (e.g. ``df.groupby('a')[['b']])``) would reset the attributes ``axis``, ``dropna``, ``group_keys``, ``level``, ``mutated``, ``sort``, and ``squeeze`` to their default values. (:issue:`9959`)
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
- Bug in :meth:`DataFrame.groupby.rolling` output incorrect when using a partial window (:issue:`36040`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returning wrong values with partial window

- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`)
- Bug in :meth:`Rolling.sum()` returned wrong values when dtypes where mixed between float and integer and axis was equal to one (:issue:`20649`, :issue:`35596`)

Expand Down
1 change: 1 addition & 0 deletions pandas/core/window/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,7 @@ def _apply(
is_weighted: bool = False,
name: Optional[str] = None,
use_numba_cache: bool = False,
skip_offset: bool = False,
**kwargs,
):
"""
Expand Down
20 changes: 10 additions & 10 deletions pandas/core/window/indexers.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,17 +78,17 @@ def get_window_bounds(
closed: Optional[str] = None,
) -> Tuple[np.ndarray, np.ndarray]:

start_s = np.zeros(self.window_size, dtype="int64")
start_e = (
np.arange(self.window_size, num_values, dtype="int64")
- self.window_size
+ 1
)
start = np.concatenate([start_s, start_e])[:num_values]
if center:
offset = (self.window_size - 1) // 2
else:
offset = 0

end = np.arange(1 + offset, num_values + 1 + offset).astype("int64")
start = end - self.window_size

end = np.clip(end, 0, num_values)
start = np.clip(start, 0, num_values)

end_s = np.arange(self.window_size, dtype="int64") + 1
end_e = start_e + self.window_size
end = np.concatenate([end_s, end_e])[:num_values]
return start, end


Expand Down
125 changes: 89 additions & 36 deletions pandas/core/window/rolling.py
Original file line number Diff line number Diff line change
Expand Up @@ -72,24 +72,6 @@
from pandas.core.internals import Block # noqa:F401


def calculate_center_offset(window) -> int:
"""
Calculate an offset necessary to have the window label to be centered.

Parameters
----------
window: ndarray or int
window weights or window

Returns
-------
int
"""
if not is_integer(window):
window = len(window)
return int((window - 1) / 2.0)


def calculate_min_periods(
window: int,
min_periods: Optional[int],
Expand Down Expand Up @@ -417,18 +399,44 @@ def _insert_on_column(self, result: "DataFrame", obj: "DataFrame"):
# insert at the end
result[name] = extra_col

def _center_window(self, result: np.ndarray, window) -> np.ndarray:
def calculate_center_offset(self, window, center: bool) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls type window: Union[np.ndarray, int]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing that causes an error in typing validation:

pandas/core/window/rolling.py:417: error: Argument 1 to "len" has incompatible type "Union[Any, int]"; expected "Sized"  [arg-type]
pandas/core/window/rolling.py:2300: error: Argument 1 to "len" has incompatible type "Union[Any, int]"; expected "Sized"  [arg-type]

"""
Calculate an offset necessary to have the window label to be centered.

Parameters
----------
window : ndarray or int
window weights or window
center : bool
Set the labels at the center of the window.

Returns
-------
int
"""
if not center:
return 0

if self.is_freq_type or isinstance(self.window, BaseIndexer):
return 0

if not is_integer(window):
window = len(window)
return int((window - 1) / 2.0)

def _center_window(self, result: np.ndarray, window, center) -> np.ndarray:
"""
Center the result in the window.
"""
if self.axis > result.ndim - 1:
raise ValueError("Requested axis is larger then no. of argument dimensions")

offset = calculate_center_offset(window)
offset = self.calculate_center_offset(window, center)
if offset > 0:
lead_indexer = [slice(None)] * result.ndim
lead_indexer[self.axis] = slice(offset, None)
result = np.copy(result[tuple(lead_indexer)])

return result

def _get_roll_func(self, func_name: str) -> Callable:
Expand Down Expand Up @@ -524,6 +532,7 @@ def _apply(
is_weighted: bool = False,
name: Optional[str] = None,
use_numba_cache: bool = False,
skip_offset: bool = False,
**kwargs,
):
"""
Expand All @@ -543,6 +552,8 @@ def _apply(
use_numba_cache : bool
whether to cache a numba compiled function. Only available for numba
enabled methods (so far only apply)
skip_offset : bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the point of an addtional parameter here? this makes it really hard to understand

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinessert can you address

whether to skip offsetting x
**kwargs
additional arguments for rolling function and window function

Expand All @@ -560,7 +571,11 @@ def homogeneous_func(values: np.ndarray):
if values.size == 0:
return values.copy()

offset = calculate_center_offset(window) if center else 0
if skip_offset:
offset = 0
else:
offset = self.calculate_center_offset(window, center)

additional_nans = np.array([np.nan] * offset)

if not is_weighted:
Expand Down Expand Up @@ -603,8 +618,8 @@ def calc(x):
if use_numba_cache:
NUMBA_FUNC_CACHE[(kwargs["original_func"], "rolling_apply")] = func

if center:
result = self._center_window(result, window)
if not skip_offset:
result = self._center_window(result, window, center)

return result

Expand Down Expand Up @@ -1189,7 +1204,7 @@ def sum(self, *args, **kwargs):
window_func = self._get_roll_func("roll_weighted_sum")
window_func = get_weighted_roll_func(window_func)
return self._apply(
window_func, center=self.center, is_weighted=True, name="sum", **kwargs
window_func, center=self.center, is_weighted=True, name="sum", **kwargs,
)

@Substitution(name="window")
Expand All @@ -1210,7 +1225,7 @@ def var(self, ddof=1, *args, **kwargs):
window_func = get_weighted_roll_func(window_func)
kwargs.pop("name", None)
return self._apply(
window_func, center=self.center, is_weighted=True, name="var", **kwargs
window_func, center=self.center, is_weighted=True, name="var", **kwargs,
)

@Substitution(name="window", versionadded="\n.. versionadded:: 1.0.0\n")
Expand Down Expand Up @@ -1388,7 +1403,8 @@ def apply(
# Cython apply functions handle center, so don't need to use
# _apply's center handling
window = self._get_window()
offset = calculate_center_offset(window) if self.center else 0

offset = self.calculate_center_offset(window, self.center)
apply_func = self._generate_cython_apply_func(
args, kwargs, raw, offset, func
)
Expand All @@ -1406,19 +1422,17 @@ def apply(
raw=raw,
original_func=func,
args=args,
skip_offset=True,
kwargs=kwargs,
)

def _generate_cython_apply_func(self, args, kwargs, raw, offset, func):
from pandas import Series

cython_func = self._get_cython_func_type("roll_generic")

window_func = partial(
self._get_cython_func_type("roll_generic"),
args=args,
kwargs=kwargs,
raw=raw,
offset=offset,
func=func,
cython_func, args=args, kwargs=kwargs, raw=raw, offset=offset, func=func,
)

def apply_func(values, begin, end, min_periods, raw=raw):
Expand All @@ -1433,7 +1447,7 @@ def sum(self, *args, **kwargs):
window_func = self._get_cython_func_type("roll_sum")
kwargs.pop("floor", None)
return self._apply(
window_func, center=self.center, floor=0, name="sum", **kwargs
window_func, center=self.center, floor=0, name="sum", **kwargs,
)

_shared_docs["max"] = dedent(
Expand Down Expand Up @@ -1540,7 +1554,9 @@ def median(self, **kwargs):
window_func = self._get_roll_func("roll_median_c")
# GH 32865. Move max window size calculation to
# the median function implementation
return self._apply(window_func, center=self.center, name="median", **kwargs)
return self._apply(
window_func, center=self.center, name="median", skip_offset=True, **kwargs
)

def std(self, ddof=1, *args, **kwargs):
nv.validate_window_func("std", args, kwargs)
Expand All @@ -1563,7 +1579,8 @@ def zsqrt_func(values, begin, end, min_periods):
def var(self, ddof=1, *args, **kwargs):
nv.validate_window_func("var", args, kwargs)
kwargs.pop("require_min_periods", None)
window_func = partial(self._get_cython_func_type("roll_var"), ddof=ddof)
cython_func = self._get_cython_func_type("roll_var")
window_func = partial(cython_func, ddof=ddof)
# ddof passed again for compat with groupby.rolling
return self._apply(
window_func,
Expand Down Expand Up @@ -1696,20 +1713,29 @@ def kurt(self, **kwargs):
def quantile(self, quantile, interpolation="linear", **kwargs):
if quantile == 1.0:
window_func = self._get_cython_func_type("roll_max")
skip_offset = False
elif quantile == 0.0:
window_func = self._get_cython_func_type("roll_min")
skip_offset = False
else:
window_func = partial(
self._get_roll_func("roll_quantile"),
win=self._get_window(),
quantile=quantile,
interpolation=interpolation,
)
skip_offset = True

# Pass through for groupby.rolling
kwargs["quantile"] = quantile
kwargs["interpolation"] = interpolation
return self._apply(window_func, center=self.center, name="quantile", **kwargs)
return self._apply(
window_func,
center=self.center,
name="quantile",
skip_offset=skip_offset,
**kwargs,
)

_shared_docs[
"cov"
Expand Down Expand Up @@ -2189,6 +2215,7 @@ def _apply(
is_weighted: bool = False,
name: Optional[str] = None,
use_numba_cache: bool = False,
skip_offset: bool = True,
**kwargs,
):
result = Rolling._apply(
Expand All @@ -2200,6 +2227,7 @@ def _apply(
is_weighted,
name,
use_numba_cache,
skip_offset,
**kwargs,
)
# Cannot use _wrap_outputs because we calculate the result all at once
Expand Down Expand Up @@ -2243,6 +2271,31 @@ def _create_data(self, obj: FrameOrSeries) -> FrameOrSeries:
obj = obj.take(groupby_order)
return super()._create_data(obj)

def calculate_center_offset(self, window, center: bool) -> int:
"""
Calculate an offset necessary to have the window label to be centered.

Parameters
----------
window : ndarray or int
window weights or window
center : bool
Set the labels at the center of the window.

Returns
-------
int
"""
if not center or not self.win_type:
return 0

if self.is_freq_type or isinstance(self.window, BaseIndexer):
return 0

if not is_integer(window):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not just make this a free function? i get that you are passing win_type here, but that could easily be passed in as an arg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I again tried making this a free function and confirmed that the two classes require different functionality (whether or not to include or not self.win_type in the if statement).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just add an optional 3rd kwarg (window, center: bool, win_type: Optional[str] = None)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try that and it did not work. The issue is that in one class win_type should be completely ignored whereas the other class needs to use it. But both classes share the _apply function, which actually calls calculate_center_offset here.

So I can't just do calculate_center_offset(..., None) for the first class and calculate_center_offset(..., self.win_type) for the second (because there's only one shared call to calculate_center_offset(...), not two separate ones)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok see my comment above; if this is now a method, then you can simply use is_freq_type and this PR is a lot simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback thanks for the suggestion, I changed the code to use is_freq_type instead. That wasn't sufficient by itself (e.g. a groupby().rolling() function will always use a variable function, not just if self.is_freq_type, so I also added a skip_offset param to the _apply function.

Nonetheless, I do think this is a simpler approach than I had previously, so thanks for the recommendation!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Are you cool with this implementation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the skip_offset parameter. why is it not sufficient to just make a property on the class itself (e.g. you can make another preoprty / method if needed, similr to is_ffreq_type). passing parameters around like this is really hard to understand.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinessert I think same as above, can you explain why skip_offset arg is needed

window = len(window)
return int((window - 1) / 2.0)

def _get_cython_func_type(self, func: str) -> Callable:
"""
Return the cython function type.
Expand Down
47 changes: 47 additions & 0 deletions pandas/tests/window/moments/test_moments_consistency_rolling.py
Original file line number Diff line number Diff line change
Expand Up @@ -136,6 +136,53 @@ def test_rolling_apply_consistency(
tm.assert_equal(rolling_f_result, rolling_apply_f_result)


@pytest.mark.parametrize(
"window,min_periods,center", list(_rolling_consistency_cases())
)
def test_rolling_groupby(base_functions, window, min_periods, center):
base_df = DataFrame({"group": "A", "data": randn(20)})

b_df = base_df.copy()
b_df["group"] = "B"

grp_df = pd.concat([base_df, b_df]).groupby("group")

for (f, require_min_periods, name) in base_functions:
if (
require_min_periods
and (min_periods is not None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these skipped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if min_periods is less than the required_min_periods then there will be an error thrown so we wouldn't be able to test the equivalency of the two dfs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's an error thrown test for that using pytest.raises

and (min_periods < require_min_periods)
):
continue

base_rolling_f = getattr(
base_df[["data"]].rolling(
window=window, center=center, min_periods=min_periods
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this actually testing center? sure you pass it but unless we can see the expected results i don't have any idea whether this is correct. a small set of fixed cases that really show the input and output is much more useful here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, see response on your previous comment, but here I'm more testing the consistency of df.rolling vs df.groupby.rolling. If you assume that df.rolling works correctly (which is extensively tested elsewhere), then this will test df.groupby.rolling also works correctly.

That being said, I'm open to creating another test to explicitly tests the correctness of df.groupby.rolling if you think that's necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't digested the details here but it sounds like you should hard code expected results and test against that

name,
)

grp_rolling_f = getattr(
grp_df[["data"]].rolling(
window=window, center=center, min_periods=min_periods
),
name,
)

base_result = base_rolling_f().reset_index(drop=True)
grp_result = grp_rolling_f().reset_index()

a_result = grp_result[grp_result["group"] == "A"][["data"]].reset_index(
drop=True
)
b_result = grp_result[grp_result["group"] == "B"][["data"]].reset_index(
drop=True
)

tm.assert_frame_equal(base_result, a_result)
tm.assert_frame_equal(base_result, b_result)


@pytest.mark.parametrize("window", range(7))
def test_rolling_corr_with_zero_variance(window):
# GH 18430
Expand Down
Loading