API: reimplement FixedWindowIndexer.get_window_bounds to fix groupby bug #36132

justinessert · 2020-09-04T23:37:20Z

closes BUG: Rolling min_periods not working on groupby object #36040
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Creating PR to solve Issue 36040.

The old FixedWindowIndexer.get_window_bounds function provided unintuitive bounds that went beyond the length of the original array. Additionally, to do a centered rolling operation, it required NaN values to be appended to the end of the original array as a roundabout way of achieving the centering.

I replaced it with one that seems much simpler and actually creates "fixed" size windows (at least prior to clipping the ends), which the previous function did not.
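
For illustration, a minimal sketch of what "fixed size windows, clipped at the ends" can look like. The function name `fixed_window_bounds` and its signature are hypothetical; this is not the code from this PR, only the idea described above.

```python
from typing import List, Tuple


def fixed_window_bounds(
    num_values: int, window_size: int, center: bool = False
) -> Tuple[List[int], List[int]]:
    """Return (start, end) positions of each window, clipped to the array."""
    # When centering, shift every window forward by half the window size.
    offset = (window_size - 1) // 2 if center else 0
    start: List[int] = []
    end: List[int] = []
    for i in range(num_values):
        raw_end = i + 1 + offset           # exclusive end of the window
        raw_start = raw_end - window_size  # every window is window_size wide
        # Clip to [0, num_values] so no bound points past the data.
        start.append(min(max(raw_start, 0), num_values))
        end.append(min(max(raw_end, 0), num_values))
    return start, end
```

With `num_values=5` and `window_size=3`, the centered call returns start `[0, 0, 1, 2, 3]` and end `[2, 3, 4, 5, 5]`, so centering is expressed purely through the bounds and no NaN padding is involved.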

That being said, I know this PR fails some tests, and I would appreciate some advice on how best to proceed!

pep8speaks · 2020-09-04T23:37:25Z

Hello @justinessert! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-25 02:51:47 UTC

arw2019

thanks @justinessert for the PR!

From your description and a quick look at the code this looks like it's an API breaking change - is that right?

In any case, to move forward we'll need tests for the behavior you're targeting. I think you would need to alter at least some of the tests in pandas/tests/window/test_base_indexer.py and maybe some others. Seeing those will make it easier to figure out if this is a desired change

I would also retitle the PR to something like API: reimplement FixedWindowIndexer.get_window_bounds

jreback

pls add a test from the OP
it's ok to refactor as well but should pass the new test

pandas/core/window/rolling.py

pandas/core/window/indexers.py

justinessert · 2020-09-06T16:40:28Z

@arw2019 @jreback @mroeschke Thanks for your feedback, I believe that I addressed all of it, please let me know if there's anything else you want to see in this PR.

I know that I still need to add a whatsnew entry, could you please advise what/where I should add?

pandas/core/window/rolling.py

jreback · 2020-09-06T17:35:58Z

pandas/core/window/rolling.py

+        if not center or not self.win_type:
+            return 0
+
+        if not is_integer(window):


can we not just make this a free function? i get that you are passing win_type here, but that could easily be passed in as an arg

I again tried making this a free function and confirmed that the two classes require different functionality (whether or not to include "or not self.win_type" in the if statement).

You can just add an optional 3rd kwarg (window, center: bool, win_type: Optional[str] = None)

I did try that and it did not work. The issue is that in one class win_type should be completely ignored whereas the other class needs to use it. But both classes share the _apply function, which actually calls calculate_center_offset here.

So I can't just do calculate_center_offset(..., None) for the first class and calculate_center_offset(..., self.win_type) for the second (because there's only one shared call to calculate_center_offset(...), not two separate ones)

ok see my comment above; if this is now a method, then you can simply use is_freq_type and this PR is a lot simpler.

@jreback thanks for the suggestion, I changed the code to use is_freq_type instead. That wasn't sufficient by itself (e.g. a groupby().rolling() function will always use a variable function, not just if self.is_freq_type), so I also added a skip_offset param to the _apply function.

Nonetheless, I do think this is a simpler approach than I had previously, so thanks for the recommendation!

@jreback Are you cool with this implementation?

I don't understand the skip_offset parameter. why is it not sufficient to just make a property on the class itself (e.g. you can make another property / method if needed, similar to is_freq_type). passing parameters around like this is really hard to understand.

@justinessert I think same as above, can you explain why skip_offset arg is needed
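
A minimal sketch of the "property on the class" idea raised in this thread, as opposed to threading a flag through _apply. The class names `BaseWindow` and `GroupbyWindow` and the property `needs_center_offset` are hypothetical and do not come from pandas; only the offset formula is taken from the diff above.

```python
class BaseWindow:
    """Hypothetical stand-in for a rolling window class."""

    def __init__(self, window: int, center: bool = False) -> None:
        self.window = window
        self.center = center

    @property
    def needs_center_offset(self) -> bool:
        # Default: an offset is needed whenever the window is centered.
        return self.center

    def calculate_center_offset(self) -> int:
        if not self.needs_center_offset:
            return 0
        return int((self.window - 1) / 2.0)


class GroupbyWindow(BaseWindow):
    """Hypothetical subclass whose algorithm centers via start/end bounds."""

    @property
    def needs_center_offset(self) -> bool:
        # The bounds already encode the centering, so no offset is applied.
        return False
```

The shared caller only ever calls `calculate_center_offset()`; each class decides for itself whether the result is zero, so no extra argument has to be passed around.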

jreback · 2020-09-06T17:36:12Z

pandas/core/window/rolling.py

+            window = len(window)
+        return int((window - 1) / 2.0)
+
+    def _get_cython_func_type(self, func: str) -> Tuple[Callable, str]:


what is the point of expanding the signature?

@jreback The point is to know whether the cython function is fixed or variable. Fixed functions require extra logic to handle a centered rolling window whereas (with my changes to the FixedWindowIndexer) variable functions do not.

Namely, fixed functions require appending additional_nans in the homogeneous_func (here) and then removing the same number of nans from the beginning of the result (here).

If you have a different suggestion for how to determine whether a function is fixed or variable I'd be open to an alternative route.
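
The pad-and-trim handling described in this reply can be sketched in plain Python. The helpers `trailing_sum` and `centered_sum_by_padding` are hypothetical illustrations, not the cython routines, and the sketch assumes min_periods equals the window size.

```python
import math
from typing import List


def trailing_sum(values: List[float], window: int) -> List[float]:
    """Right-aligned rolling sum; NaN until a full, NaN-free window exists."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1) : i + 1]
        if len(chunk) < window or any(math.isnan(v) for v in chunk):
            out.append(math.nan)
        else:
            out.append(sum(chunk))
    return out


def centered_sum_by_padding(values: List[float], window: int) -> List[float]:
    """Center a right-aligned rolling sum by padding and trimming."""
    offset = (window - 1) // 2
    # Step 1: append `offset` NaNs to the end of the input.
    padded = values + [math.nan] * offset
    result = trailing_sum(padded, window)
    # Step 2: drop the same number of entries from the front of the result.
    return result[offset:]
```

For `[1, 2, 3, 4, 5]` with a window of 3 this yields `[nan, 6, 9, 12, nan]`: the output keeps the input length and each value lines up with the middle of its window.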

same comment

jreback · 2020-09-06T17:37:14Z

pandas/tests/window/moments/test_moments_consistency_rolling.py

@@ -136,6 +136,54 @@ def test_rolling_apply_consistency(
                tm.assert_equal(rolling_f_result, rolling_apply_f_result)


+@pytest.mark.slow


why do you need the _rolling_consistency_cases here? rather than simply constructing a set of cases.

That's fair, I'm open to creating an additional test if you'd like me to. If we have an array of numbers arr and dfs:

base_df = DataFrame({'data': arr})
group_df = DataFrame({'group': ['A']*len(arr) + ['B']*len(arr), 'data': arr + arr})

The idea here was just to test that the result of base_df.rolling... and the result of each group in group_df.groupby('group').rolling... are all identical. Because in the current master branch, the base result is different than the A result, which is also different from the B result.

Essentially I'm testing the consistency of using df.rolling vs df.groupby.rolling (and also testing the consistency of different groups within df.groupby.rolling) since that was the issue before.
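
A compact version of that consistency check, as a sketch. The data, window size, and min_periods are illustrative choices rather than the cases from the test file, and the check is expected to pass only on a pandas version where the groupby bug is fixed.

```python
import pandas as pd

arr = [1.0, 2.0, 3.0, 4.0, 5.0]
base_df = pd.DataFrame({"data": arr})
group_df = pd.DataFrame(
    {"group": ["A"] * len(arr) + ["B"] * len(arr), "data": arr + arr}
)

kwargs = {"window": 3, "center": True, "min_periods": 1}
base = base_df["data"].rolling(**kwargs).sum()
grouped = group_df.groupby("group")["data"].rolling(**kwargs).sum()

for name in ["A", "B"]:
    # Select one group and realign its index with the ungrouped frame.
    per_group = grouped.loc[name].reset_index(drop=True)
    pd.testing.assert_series_equal(per_group, base)
```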

Yeah I think it'd be better to add test cases to window/test_grouper.py in with more pytest idioms

@jreback @mroeschke I added that test_grouper test, would you like me to remove this one or keep it?

jreback · 2020-09-06T17:38:28Z

pandas/tests/window/moments/test_moments_consistency_rolling.py

+        base_rolling_f = getattr(
+            base_df[["data"]].rolling(
+                window=window, center=center, min_periods=min_periods
+            ),


how is this actually testing center? sure you pass it but unless we can see the expected results i don't have any idea whether this is correct. a small set of fixed cases that really show the input and output is much more useful here.

Fair, see response on your previous comment, but here I'm more testing the consistency of df.rolling vs df.groupby.rolling. If you assume that df.rolling works correctly (which is extensively tested elsewhere), then this will test df.groupby.rolling also works correctly.

That being said, I'm open to creating another test to explicitly test the correctness of df.groupby.rolling if you think that's necessary.

Haven't digested the details here but it sounds like you should hard code expected results and test against that
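
A sketch of a test with hard-coded expected values, in the spirit of this suggestion. The input and the test name are illustrative and not taken from the PR; the expected numbers were worked out by hand for a centered window of 3 with min_periods=1.

```python
import pandas as pd


def test_groupby_rolling_center_min_periods():
    df = pd.DataFrame(
        {"group": ["A"] * 5 + ["B"] * 5, "data": [1.0, 2.0, 3.0, 4.0, 5.0] * 2}
    )
    result = (
        df.groupby("group")["data"]
        .rolling(window=3, center=True, min_periods=1)
        .sum()
    )
    # Per group: 1+2, 1+2+3, 2+3+4, 3+4+5, 4+5
    expected = [3.0, 6.0, 9.0, 12.0, 9.0] * 2
    assert result.tolist() == expected
```

Spelling out the input and the output makes it visible that the partial windows at both ends of each group are handled, which a consistency-only test cannot show.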

pandas/core/window/rolling.py

jreback · 2020-09-08T00:25:39Z

pandas/core/window/rolling.py

@@ -554,7 +563,11 @@ def homogeneous_func(values: np.ndarray):
            if values.size == 0:
                return values.copy()

-            offset = calculate_center_offset(window) if center else 0
+            if func_type in ["fixed", "weighted"]:


see comment above.

since calculate_center_offset is now a method, you can just have this return the correct value.

jreback · 2020-09-08T00:25:48Z

pandas/core/window/rolling.py

@@ -597,8 +610,8 @@ def calc(x):
            if use_numba_cache:
                NUMBA_FUNC_CACHE[(kwargs["original_func"], "rolling_apply")] = func

-            if center:
-                result = self._center_window(result, window)
+            if func_type in ["fixed", "weighted"]:


same comment

jreback · 2020-09-08T00:26:37Z

pandas/core/window/rolling.py

+        if not center or not self.win_type:
+            return 0
+
+        if not is_integer(window):


ok see my comment above; if this is now a method, then you can simply use is_freq_type and this PR is a lot simpler.

jreback · 2020-09-08T00:26:47Z

pandas/core/window/rolling.py

+            window = len(window)
+        return int((window - 1) / 2.0)
+
+    def _get_cython_func_type(self, func: str) -> Tuple[Callable, str]:


same comment

justinessert · 2020-09-10T14:16:12Z

Can someone help me understand the test failure? It seems to be with the to_parquet function, but only on Windows py38_np18

arw2019 · 2020-09-11T15:47:45Z

> Can someone help me understand the test failure? It seems to be with the to_parquet function, but only on Windows py38_np18

I think that's unrelated to this patch

jreback · 2020-09-12T21:44:12Z

pandas/core/window/rolling.py

@@ -410,18 +392,44 @@ def _insert_on_column(self, result: "DataFrame", obj: "DataFrame"):
                # insert at the end
                result[name] = extra_col

-    def _center_window(self, result: np.ndarray, window) -> np.ndarray:
+    def calculate_center_offset(self, window, center: bool) -> int:


pls type window: Union[np.ndarray, int]

Doing that causes an error in typing validation:

pandas/core/window/rolling.py:417: error: Argument 1 to "len" has incompatible type "Union[Any, int]"; expected "Sized" [arg-type]
pandas/core/window/rolling.py:2300: error: Argument 1 to "len" has incompatible type "Union[Any, int]"; expected "Sized" [arg-type]
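
One possible way around this kind of error, as a sketch: type checkers such as mypy narrow a Union inside an isinstance check, so len() is only ever applied to the array branch. The free function `center_offset` is hypothetical, and whether this would satisfy the type check in the pandas codebase itself is an assumption.

```python
from typing import Union

import numpy as np


def center_offset(window: Union[np.ndarray, int]) -> int:
    # isinstance narrows the Union, so len() is only called on the array.
    if isinstance(window, np.ndarray):
        size = len(window)
    else:
        # Note: this branch assumes a plain Python int was passed.
        size = window
    return int((size - 1) / 2.0)
```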

jreback · 2020-09-12T21:44:52Z

pandas/core/window/rolling.py

@@ -536,6 +545,8 @@ def _apply(
        use_numba_cache : bool
            whether to cache a numba compiled function. Only available for numba
            enabled methods (so far only apply)
+        skip_offset : bool


what's the point of an additional parameter here? this makes it really hard to understand

@justinessert can you address

pandas/core/window/rolling.py

jreback · 2020-09-12T21:47:01Z

pandas/core/window/rolling.py

+        if not center or not self.win_type:
+            return 0
+
+        if not is_integer(window):


I don't understand the skip_offset parameter. why is it not sufficient to just make a property on the class itself (e.g. you can make another property / method if needed, similar to is_freq_type). passing parameters around like this is really hard to understand.

pandas/tests/window/moments/test_moments_consistency_rolling.py

jreback · 2020-09-12T21:47:32Z

pandas/tests/window/moments/test_moments_consistency_rolling.py

+    for (f, require_min_periods, name) in base_functions:
+        if (
+            require_min_periods
+            and (min_periods is not None)


why are these skipped?

If min_periods is less than require_min_periods, an error is thrown, so we wouldn't be able to test the equivalency of the two dfs.

If there's an error thrown test for that using pytest.raises

justinessert · 2020-09-13T14:40:25Z

@jreback I agree that the skip_offset param is not ideal, but I currently don't see a way to do this without a similar level of complexity. I'll do my best to explain the reasoning here.

The core of the issue is that there are 4 different types of cython functions being used, which all handle creating a centered window differently:

1. Fixed Cython Funcs: Fixed cython funcs require appending NaNs to the original array to create a centered window (this is when offset should be non-zero).
2. Weighted Cython Funcs: Similar to Fixed, these also require appending NaNs.
3. Variable Cython Funcs: Rather than appending NaNs, start/end arrays handle creating a centered window.
4. .apply(lambda...): These lambda functions take an offset parameter and handle the centering themselves (rather than using start/end or appending NaNs).

Essentially, the skip_offset parameter is just trying to track when we need an offset (i.e. fixed/weighted) and when we don't need an offset (i.e. variable/.apply(lambda...)). You're right that in most cases we use the is_freq_type variable to determine whether to use Fixed/Variable. But there are also many cases with alternate logic:

- As suggested above, there are also cases when we use weighted cython funcs or .apply(lambda...), which each use separate logic.
- Functions applied to groupby.rolling() will never be a fixed cython func, regardless of is_freq_type.
- The Median function always needs skip_offset=True (I believe because it falls into the variable cython func category).
- The Quantile function is the worst: if quantile is 0.0 or 1.0 then it uses the min/max functions, which could be either fixed or variable (based on is_freq_type), but if quantile is something else, then we use the roll_quantile function, which is always a variable function.
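
The cases listed in this comment can be restated as a single decision helper. This is only a sketch of the rules as described here: the function `skip_offset` and all of its parameters are hypothetical and are not part of pandas.

```python
from typing import Optional


def skip_offset(
    func_name: str,
    *,
    is_freq_type: bool,
    is_groupby: bool = False,
    is_weighted: bool = False,
    quantile: Optional[float] = None,
) -> bool:
    """Return True when no center offset (NaN padding) should be applied."""
    if is_weighted:
        return False  # weighted funcs pad with NaNs, so they need the offset
    if func_name == "apply":
        return True   # the applied function handles centering itself
    if is_groupby or func_name == "median":
        return True   # described above as always using a variable algorithm
    if func_name == "quantile" and quantile not in (0.0, 1.0):
        return True   # roll_quantile is described as always variable
    # Everything else (including quantile 0.0/1.0, which defers to min/max)
    # follows is_freq_type: variable when True, fixed when False.
    return is_freq_type
```

Written out this way, the number of special cases is easy to see, which is the complexity the skip_offset discussion is about.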

justinessert · 2020-09-15T18:57:30Z

@jreback does the above comment sufficiently describe why the extra param is needed? If not, what further questions can I answer?

justinessert · 2020-09-25T02:50:42Z

@jreback @arw2019 @mroeschke what questions can I answer? What changes are you looking for? Is this PR good to go?

arw2019 · 2020-09-25T02:54:57Z

doc/source/whatsnew/v1.2.0.rst

@@ -347,6 +347,7 @@ Groupby/resample/rolling
 - Bug when subsetting columns on a :class:`~pandas.core.groupby.DataFrameGroupBy` (e.g. ``df.groupby('a')[['b']])``) would reset the attributes ``axis``, ``dropna``, ``group_keys``, ``level``, ``mutated``, ``sort``, and ``squeeze`` to their default values. (:issue:`9959`)
 - Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
 - Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
+- Bug in :meth:`DataFrame.groupby.rolling` output incorrect when using a partial window (:issue:`36040`)


returning wrong values with partial window

mroeschke · 2020-09-25T03:49:35Z

Sorry for the delay @justinessert.

This PR may become a lot simpler if we can push #36567 through as we will no longer need to pass around a flag that indicates whether we're using a variable or fixed algorithm

justinessert · 2020-10-10T15:55:40Z

Replacing this PR with #37035, following @mroeschke's #36567

justinessert added 3 commits September 4, 2020 18:39

updated fixed indexer to work with rolling df and groupby

69f084f

updated is_weighted case

71830c8

added comment

a449d9b

arw2019 suggested changes Sep 5, 2020

jreback requested changes Sep 5, 2020

jreback added the Window rolling, ewma, expanding label Sep 5, 2020

justinessert added 2 commits September 5, 2020 12:59

corrected offset for even window sizes

476fe83

reverted changes for weighted windows

9dfd9f3

mroeschke reviewed Sep 5, 2020

pandas/core/window/rolling.py (outdated)

mroeschke reviewed Sep 5, 2020

pandas/core/window/indexers.py (outdated)

justinessert added 5 commits September 6, 2020 11:39

reverted back to fixed func type; added func_type variable

f025600

reformatted

5d902fd

merged master and resolved conflict

59fcd3e

corrected return typing

cdecf34

added consistency tests

4e8f844

justinessert changed the title ~~updated fixed indexer to work with rolling df and groupby~~ API: reimplement FixedWindowIndexer.get_window_bounds to fix groupby bug Sep 6, 2020

justinessert added 2 commits September 6, 2020 12:36

corrected typing change

6e66a49

reformatted test to pass blac

e7fb384

jreback requested changes Sep 6, 2020

added typing and docstring

3649ca2

mroeschke reviewed Sep 6, 2020

pandas/core/window/rolling.py (outdated)

justinessert added 2 commits September 6, 2020 17:45

fixing center param in median's _apply

00cc1dc

added center_min_periods test to test_grouper

3de7fcc

jreback requested changes Sep 8, 2020

justinessert added 4 commits September 9, 2020 18:27

replaced func_type with skip_offset

f779321

removed unneeded class attribute

a817f87

moved logic into calculate_center_offset

d72812d

removed whitespace

96c6959

justinessert requested a review from jreback September 11, 2020 19:21

justinessert added 2 commits September 12, 2020 11:46

added whatsnew entry

daacae7

Merge remote-tracking branch 'upstream/master' into groupby-rolling

52a8a6b

jreback requested changes Sep 12, 2020

justinessert added 2 commits September 13, 2020 09:54

formatting fixes

950018c

removed pytest.slow

0798c70

removed typing of window

f413ec8

justinessert requested a review from jreback September 18, 2020 20:48

Merge branch 'master' into groupby-rolling

70679be

arw2019 reviewed Sep 25, 2020

jreback mentioned this pull request Oct 2, 2020

REF: Remove rolling window fixed algorithms #36567

Merged

3 tasks

justinessert mentioned this pull request Oct 10, 2020

API: reimplement FixedWindowIndexer.get_window_bounds #37035

Merged

5 tasks

justinessert closed this Oct 10, 2020
