-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: reimplement FixedWindowIndexer.get_window_bounds to fix groupby bug #36132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @justinessert! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-09-25 02:51:47 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @justinessert for the PR!
From your description and a quick look at the code this looks like it's an API breaking change - is that right?
In any case, to move forward we'll need tests for the behavior you're targeting. I think you would need to alter at least some of the tests in pandas/tests/window/test_base_indexer.py
and maybe some others. Seeing those will make it easier to figure out if this is a desired change
I would also retitle the PR to something like API: reimplement FixedWindowIndexer.get_window_bounds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a test from the OP
it's ok to refactor as well but should pass the new test
@arw2019 @jreback @mroeschke Thanks for your feedback, I believe that I addressed all of it, please let me know if there's anything else you want to see in this PR. I know that I still need to add a whatsnew entry, could you please advise what/where I should add? |
if not center or not self.win_type: | ||
return 0 | ||
|
||
if not is_integer(window): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we not just make this a free function? i get that you are passing win_type here, but that could easily be passed in as an arg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I again tried making this a free function and confirmed that the two classes require different functionality (whether or not to include or not self.win_type
in the if statement).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just add an optional 3rd kwarg (window, center: bool, win_type: Optional[str] = None)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did try that and it did not work. The issue is that in one class win_type should be completely ignored whereas the other class needs to use it. But both classes share the _apply
function, which actually calls calculate_center_offset
here.
So I can't just do calculate_center_offset(..., None)
for the first class and calculate_center_offset(..., self.win_type)
for the second (because there's only one shared call to calculate_center_offset(...)
, not two separate ones)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok see my comment above; if this is now a method, then you can simply use is_freq_type and this PR is a lot simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback thanks for the suggestion, I changed the code to use is_freq_type
instead. That wasn't sufficient by itself (e.g. a groupby().rolling()
function will always use a variable function, not just if self.is_freq_type
, so I also added a skip_offset
param to the _apply function.
Nonetheless, I do think this is a simpler approach than I had previously, so thanks for the recommendation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Are you cool with this implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the skip_offset parameter. why is it not sufficient to just make a property on the class itself (e.g. you can make another preoprty / method if needed, similr to is_ffreq_type). passing parameters around like this is really hard to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinessert I think same as above, can you explain why skip_offset arg is needed
pandas/core/window/rolling.py
Outdated
window = len(window) | ||
return int((window - 1) / 2.0) | ||
|
||
def _get_cython_func_type(self, func: str) -> Tuple[Callable, str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the point of expanding the signature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback The point is to know whether the cython function is fixed or variable. Fixed functions require extra logic to handle a centered rolling window whereas (with my changes to the FixedWindowIndexer) variable functions do not.
Namely, fixed functions require appending additional_nans
in the homogeneous_func
(here) and then removing the same number of nans from the beginning of the result (here).
If you have a different suggestion for how to determine whether a function is fixed or variable I'd be open to an alternative route.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment
@@ -136,6 +136,54 @@ def test_rolling_apply_consistency( | |||
tm.assert_equal(rolling_f_result, rolling_apply_f_result) | |||
|
|||
|
|||
@pytest.mark.slow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need the _rolling_consistency_cases here? rather than simply constructing a set of cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, I'm open to creating an additional test if you'd like me to. If we have an array of numbers arr
and dfs:
base_df = DataFrame({'data': arr})
group_df = DataFrame({'group': ['A']*len(arr) + ['B']*len(arr), 'data': arr + arr})
The idea here was just to test that the result of base_df.rolling...
and the result of each group in group_df.groupby('group').rolling...
are all identical. Because in the current master branch, the base
result is different then the A
result which is also different from the B
result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially I'm testing the consistency of using df.rolling
vs df.groupby.rolling
(and also testing the consistency of different groups within df.groupby.rolling
) since that was the issue before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it'd be better to add test cases to window/test_grouper.py
in with more pytest
idioms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback @mroeschke I added that test_grouper test, would you like me to remove this one or keep it?
base_rolling_f = getattr( | ||
base_df[["data"]].rolling( | ||
window=window, center=center, min_periods=min_periods | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this actually testing center? sure you pass it but unless we can see the expected results i don't have any idea whether this is correct. a small set of fixed cases that really show the input and output is much more useful here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair, see response on your previous comment, but here I'm more testing the consistency of df.rolling
vs df.groupby.rolling
. If you assume that df.rolling
works correctly (which is extensively tested elsewhere), then this will test df.groupby.rolling
also works correctly.
That being said, I'm open to creating another test to explicitly tests the correctness of df.groupby.rolling
if you think that's necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't digested the details here but it sounds like you should hard code expected results and test against that
pandas/core/window/rolling.py
Outdated
@@ -554,7 +563,11 @@ def homogeneous_func(values: np.ndarray): | |||
if values.size == 0: | |||
return values.copy() | |||
|
|||
offset = calculate_center_offset(window) if center else 0 | |||
if func_type in ["fixed", "weighted"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment above.
since calculate_center_offset is now a method, you can just have this return the correct value.
pandas/core/window/rolling.py
Outdated
@@ -597,8 +610,8 @@ def calc(x): | |||
if use_numba_cache: | |||
NUMBA_FUNC_CACHE[(kwargs["original_func"], "rolling_apply")] = func | |||
|
|||
if center: | |||
result = self._center_window(result, window) | |||
if func_type in ["fixed", "weighted"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment
if not center or not self.win_type: | ||
return 0 | ||
|
||
if not is_integer(window): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok see my comment above; if this is now a method, then you can simply use is_freq_type and this PR is a lot simpler.
pandas/core/window/rolling.py
Outdated
window = len(window) | ||
return int((window - 1) / 2.0) | ||
|
||
def _get_cython_func_type(self, func: str) -> Tuple[Callable, str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment
Can someone help me understand the test failure? It seems to be with the |
I think that's unrelated to this patch |
@@ -410,18 +392,44 @@ def _insert_on_column(self, result: "DataFrame", obj: "DataFrame"): | |||
# insert at the end | |||
result[name] = extra_col | |||
|
|||
def _center_window(self, result: np.ndarray, window) -> np.ndarray: | |||
def calculate_center_offset(self, window, center: bool) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls type window: Union[np.ndarray, int]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing that causes an error in typing validation:
pandas/core/window/rolling.py:417: error: Argument 1 to "len" has incompatible type "Union[Any, int]"; expected "Sized" [arg-type]
pandas/core/window/rolling.py:2300: error: Argument 1 to "len" has incompatible type "Union[Any, int]"; expected "Sized" [arg-type]
@@ -536,6 +545,8 @@ def _apply( | |||
use_numba_cache : bool | |||
whether to cache a numba compiled function. Only available for numba | |||
enabled methods (so far only apply) | |||
skip_offset : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the point of an addtional parameter here? this makes it really hard to understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinessert can you address
if not center or not self.win_type: | ||
return 0 | ||
|
||
if not is_integer(window): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the skip_offset parameter. why is it not sufficient to just make a property on the class itself (e.g. you can make another preoprty / method if needed, similr to is_ffreq_type). passing parameters around like this is really hard to understand.
for (f, require_min_periods, name) in base_functions: | ||
if ( | ||
require_min_periods | ||
and (min_periods is not None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are these skipped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if min_periods is less than the required_min_periods then there will be an error thrown so we wouldn't be able to test the equivalency of the two dfs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's an error thrown test for that using pytest.raises
@jreback I agree that the skip_offset param is not ideal, but I currently don't see a way to do this without a similar level of complexity. I'll do my best to explain the reasoning here. The core of the issue is that there are 4 different types of cython functions being used, which all handle creating a centered window differently:
Essentially, the
|
@jreback does the above comment sufficiently describe why the extra param is needed? If not, what further questions can I answer? |
@jreback @arw2019 @mroeschke what questions can I answer? What changes are you looking for? Is this PR good to go? |
@@ -347,6 +347,7 @@ Groupby/resample/rolling | |||
- Bug when subsetting columns on a :class:`~pandas.core.groupby.DataFrameGroupBy` (e.g. ``df.groupby('a')[['b']])``) would reset the attributes ``axis``, ``dropna``, ``group_keys``, ``level``, ``mutated``, ``sort``, and ``squeeze`` to their default values. (:issue:`9959`) | |||
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`) | |||
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`) | |||
- Bug in :meth:`DataFrame.groupby.rolling` output incorrect when using a partial window (:issue:`36040`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returning wrong values with partial window
Sorry for the delay @justinessert. This PR may become a lot simpler if we can push #36567 through as we will no longer need to pass around a flag that indicates whether we're using a variable or fixed algorithm |
Replacing this PR with 37035 following @mroeschke's 36567 |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Creating PR to solve Issue 36040.
The old
FixedWindowIndexer.get_window_bounds
function provided unintuitive bounds that went beyond the length of the original array. Additionally, to do a centered rolling operation, it required NaN values to be appended to the end of original array to enable some roundabout way of achieving the centering.I replaced it with one that seems much simpler and actually creates "fixed" size windows (at least prior to clipping the ends), which the previous function did not.
That being said, I know this PR fails some tests, I'm would appreciate some advice on how best to proceed!