Skip to content

API: reimplement FixedWindowIndexer.get_window_bounds to fix groupby bug #36132

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 25 commits into from

Conversation

justinessert
Copy link
Contributor

@justinessert justinessert commented Sep 4, 2020

Creating PR to solve Issue 36040.

The old FixedWindowIndexer.get_window_bounds function provided unintuitive bounds that went beyond the length of the original array. Additionally, to do a centered rolling operation, it required NaN values to be appended to the end of original array to enable some roundabout way of achieving the centering.

I replaced it with one that seems much simpler and actually creates "fixed" size windows (at least prior to clipping the ends), which the previous function did not.

That being said, I know this PR fails some tests, I'm would appreciate some advice on how best to proceed!

@pep8speaks
Copy link

pep8speaks commented Sep 4, 2020

Hello @justinessert! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-25 02:51:47 UTC

Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @justinessert for the PR!

From your description and a quick look at the code this looks like it's an API breaking change - is that right?

In any case, to move forward we'll need tests for the behavior you're targeting. I think you would need to alter at least some of the tests in pandas/tests/window/test_base_indexer.py and maybe some others. Seeing those will make it easier to figure out if this is a desired change

I would also retitle the PR to something like API: reimplement FixedWindowIndexer.get_window_bounds

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a test from the OP
it's ok to refactor as well but should pass the new test

@jreback jreback added the Window rolling, ewma, expanding label Sep 5, 2020
@justinessert justinessert changed the title updated fixed indexer to work with rolling df and groupby API: reimplement FixedWindowIndexer.get_window_bounds to fix groupby bug Sep 6, 2020
@justinessert
Copy link
Contributor Author

justinessert commented Sep 6, 2020

@arw2019 @jreback @mroeschke Thanks for your feedback, I believe that I addressed all of it, please let me know if there's anything else you want to see in this PR.

I know that I still need to add a whatsnew entry, could you please advise what/where I should add?

if not center or not self.win_type:
return 0

if not is_integer(window):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not just make this a free function? i get that you are passing win_type here, but that could easily be passed in as an arg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I again tried making this a free function and confirmed that the two classes require different functionality (whether or not to include or not self.win_type in the if statement).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just add an optional 3rd kwarg (window, center: bool, win_type: Optional[str] = None)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try that and it did not work. The issue is that in one class win_type should be completely ignored whereas the other class needs to use it. But both classes share the _apply function, which actually calls calculate_center_offset here.

So I can't just do calculate_center_offset(..., None) for the first class and calculate_center_offset(..., self.win_type) for the second (because there's only one shared call to calculate_center_offset(...), not two separate ones)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok see my comment above; if this is now a method, then you can simply use is_freq_type and this PR is a lot simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback thanks for the suggestion, I changed the code to use is_freq_type instead. That wasn't sufficient by itself (e.g. a groupby().rolling() function will always use a variable function, not just if self.is_freq_type, so I also added a skip_offset param to the _apply function.

Nonetheless, I do think this is a simpler approach than I had previously, so thanks for the recommendation!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Are you cool with this implementation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the skip_offset parameter. why is it not sufficient to just make a property on the class itself (e.g. you can make another preoprty / method if needed, similr to is_ffreq_type). passing parameters around like this is really hard to understand.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinessert I think same as above, can you explain why skip_offset arg is needed

window = len(window)
return int((window - 1) / 2.0)

def _get_cython_func_type(self, func: str) -> Tuple[Callable, str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the point of expanding the signature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback The point is to know whether the cython function is fixed or variable. Fixed functions require extra logic to handle a centered rolling window whereas (with my changes to the FixedWindowIndexer) variable functions do not.

Namely, fixed functions require appending additional_nans in the homogeneous_func (here) and then removing the same number of nans from the beginning of the result (here).

If you have a different suggestion for how to determine whether a function is fixed or variable I'd be open to an alternative route.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment

@@ -136,6 +136,54 @@ def test_rolling_apply_consistency(
tm.assert_equal(rolling_f_result, rolling_apply_f_result)


@pytest.mark.slow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need the _rolling_consistency_cases here? rather than simply constructing a set of cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, I'm open to creating an additional test if you'd like me to. If we have an array of numbers arr and dfs:

base_df = DataFrame({'data': arr})
group_df = DataFrame({'group': ['A']*len(arr) + ['B']*len(arr), 'data': arr + arr})

The idea here was just to test that the result of base_df.rolling... and the result of each group in group_df.groupby('group').rolling... are all identical. Because in the current master branch, the base result is different then the A result which is also different from the B result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially I'm testing the consistency of using df.rolling vs df.groupby.rolling (and also testing the consistency of different groups within df.groupby.rolling) since that was the issue before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it'd be better to add test cases to window/test_grouper.py in with more pytest idioms

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback @mroeschke I added that test_grouper test, would you like me to remove this one or keep it?

base_rolling_f = getattr(
base_df[["data"]].rolling(
window=window, center=center, min_periods=min_periods
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this actually testing center? sure you pass it but unless we can see the expected results i don't have any idea whether this is correct. a small set of fixed cases that really show the input and output is much more useful here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, see response on your previous comment, but here I'm more testing the consistency of df.rolling vs df.groupby.rolling. If you assume that df.rolling works correctly (which is extensively tested elsewhere), then this will test df.groupby.rolling also works correctly.

That being said, I'm open to creating another test to explicitly tests the correctness of df.groupby.rolling if you think that's necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't digested the details here but it sounds like you should hard code expected results and test against that

@@ -554,7 +563,11 @@ def homogeneous_func(values: np.ndarray):
if values.size == 0:
return values.copy()

offset = calculate_center_offset(window) if center else 0
if func_type in ["fixed", "weighted"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above.

since calculate_center_offset is now a method, you can just have this return the correct value.

@@ -597,8 +610,8 @@ def calc(x):
if use_numba_cache:
NUMBA_FUNC_CACHE[(kwargs["original_func"], "rolling_apply")] = func

if center:
result = self._center_window(result, window)
if func_type in ["fixed", "weighted"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment

if not center or not self.win_type:
return 0

if not is_integer(window):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok see my comment above; if this is now a method, then you can simply use is_freq_type and this PR is a lot simpler.

window = len(window)
return int((window - 1) / 2.0)

def _get_cython_func_type(self, func: str) -> Tuple[Callable, str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment

@justinessert
Copy link
Contributor Author

Can someone help me understand the test failure? It seems to be with the to_parquet function, but only on Windows py38_np18

@arw2019
Copy link
Member

arw2019 commented Sep 11, 2020

Can someone help me understand the test failure? It seems to be with the to_parquet function, but only on Windows py38_np18

I think that's unrelated to this patch

@@ -410,18 +392,44 @@ def _insert_on_column(self, result: "DataFrame", obj: "DataFrame"):
# insert at the end
result[name] = extra_col

def _center_window(self, result: np.ndarray, window) -> np.ndarray:
def calculate_center_offset(self, window, center: bool) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls type window: Union[np.ndarray, int]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing that causes an error in typing validation:

pandas/core/window/rolling.py:417: error: Argument 1 to "len" has incompatible type "Union[Any, int]"; expected "Sized"  [arg-type]
pandas/core/window/rolling.py:2300: error: Argument 1 to "len" has incompatible type "Union[Any, int]"; expected "Sized"  [arg-type]

@@ -536,6 +545,8 @@ def _apply(
use_numba_cache : bool
whether to cache a numba compiled function. Only available for numba
enabled methods (so far only apply)
skip_offset : bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the point of an addtional parameter here? this makes it really hard to understand

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinessert can you address

if not center or not self.win_type:
return 0

if not is_integer(window):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the skip_offset parameter. why is it not sufficient to just make a property on the class itself (e.g. you can make another preoprty / method if needed, similr to is_ffreq_type). passing parameters around like this is really hard to understand.

for (f, require_min_periods, name) in base_functions:
if (
require_min_periods
and (min_periods is not None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these skipped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if min_periods is less than the required_min_periods then there will be an error thrown so we wouldn't be able to test the equivalency of the two dfs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's an error thrown test for that using pytest.raises

@justinessert
Copy link
Contributor Author

@jreback I agree that the skip_offset param is not ideal, but I currently don't see a way to do this without a similar level of complexity. I'll do my best to explain the reasoning here.

The core of the issue is that there are 4 different types of cython functions being used, which all handle creating a centered window differently:

  1. Fixed Cython Funcs: Fixed cython funcs require appending NaNs to the original array to create a centered window (this is when offset should be non-zero).
  2. Weighted Cython Funcs: Similar to Fixed, these also require appending NaNs.
  3. Variable Cython Funcs: Rather than appending NaNs, start/end arrays handle creating a centered window
  4. .apply(lambda...): These lambda functions take an offset parameter and handle the centering themselves (rather than using start/end or appending NaNs)

Essentially, the skip_offset parameter is just trying to track when we need an offset (ie. fixed/weighted) and when we don't need an offset (ie. variable/.apply(lambda...)). You're right that in most cases we use is_freq_type variable to determine whether to use Fixed/Variable. But there are also many cases with alternate logic:

  • As suggested above, their are also cases when we use weighted cython funcs or .apply(lambda...), which each use separate logic
  • Functions applied to groupby.rolling() will never be a fixed cython func, regardless of is_freq_type
  • The Median function always needs skip_offset=True (I believe because it falls into the variable cython func category
  • The Quantile function is the worst, if quantile is 0.0 or 1.0 then it uses the min/max functions which could be either fixed or variable (based on is_freq_type) but if quantile is something else, then we use the roll_quantile function, which is always a variable function.

@justinessert
Copy link
Contributor Author

@jreback does the above comment sufficiently describe why the extra param is needed? If not, what further questions can I answer?

@justinessert
Copy link
Contributor Author

@jreback @arw2019 @mroeschke what questions can I answer? What changes are you looking for? Is this PR good to go?

@@ -347,6 +347,7 @@ Groupby/resample/rolling
- Bug when subsetting columns on a :class:`~pandas.core.groupby.DataFrameGroupBy` (e.g. ``df.groupby('a')[['b']])``) would reset the attributes ``axis``, ``dropna``, ``group_keys``, ``level``, ``mutated``, ``sort``, and ``squeeze`` to their default values. (:issue:`9959`)
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
- Bug in :meth:`DataFrame.groupby.rolling` output incorrect when using a partial window (:issue:`36040`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returning wrong values with partial window

@mroeschke
Copy link
Member

Sorry for the delay @justinessert.

This PR may become a lot simpler if we can push #36567 through as we will no longer need to pass around a flag that indicates whether we're using a variable or fixed algorithm

@justinessert
Copy link
Contributor Author

Replacing this PR with 37035 following @mroeschke's 36567

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Rolling min_periods not working on groupby object
5 participants