-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (#43267) #43291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (#43267) #43291
Conversation
@github-actions pre-commit |
2755500
to
5900e00
Compare
pandas/core/indexers/objects.py
Outdated
@@ -93,7 +93,6 @@ def get_window_bounds( | |||
|
|||
end = np.clip(end, 0, num_values) | |||
start = np.clip(start, 0, num_values) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind reverting these whitespace changes? It's somewhat distracting from the other changes in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
pandas/core/window/rolling.py
Outdated
start, end = window_indexer.get_window_bounds( | ||
num_values=len(x), | ||
min_periods=min_periods, | ||
center=self.center, | ||
closed=self.closed, | ||
) | ||
|
||
# From get_window_bounds, those two should be equal in length of array | ||
assert len(start) == len(end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this assertion be moved to GroupbyIndexer.get_window_bounds
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with putting one there too -- I was just mirroring existing practice -- but I think an independent check here has merit.
If start/end are broken here, result = calc(values)
is probably going to crash a few lines below.
Since there's logic in _apply which calculates min_periods which is passed as an arg to get_window_bounds, we can dream up scenarios where everything would be fine at that level but broken here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Putting this assert here is fine as well then
pandas/core/window/rolling.py
Outdated
@@ -1429,6 +1435,10 @@ def cov_func(x, y): | |||
center=self.center, | |||
closed=self.closed, | |||
) | |||
|
|||
# From get_window_bounds, those two should be equal in length of array | |||
assert len(start) == len(end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same (as well as below)
@pytest.mark.parametrize( | ||
"df", | ||
[ | ||
DataFrame({"a": [1, 1], "b": [0, 1]}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make the DataFrame
call in the test method instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
start, end = indexer.get_window_bounds(num_values=num_values) | ||
|
||
tm.assert_equal(start, np.array(expected_start, dtype="int64")) | ||
tm.assert_equal(end, np.array(expected_end, dtype="int64")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tm.assert_numpy_array_equal
I believe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/tests/window/test_rolling.py
Outdated
), | ||
], | ||
) | ||
def test_rolling_groupby_with_fixed_forward_specific(df, window_size, expected): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can move these tests in pandas/tests/window/test_base_indexer.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
doc/source/whatsnew/v1.4.0.rst
Outdated
@@ -355,7 +355,7 @@ Groupby/resample/rolling | |||
- Bug in :meth:`pandas.DataFrame.ewm`, where non-float64 dtypes were silently failing (:issue:`42452`) | |||
- Bug in :meth:`pandas.DataFrame.rolling` operation along rows (``axis=1``) incorrectly omits columns containing ``float16`` and ``float32`` (:issue:`41779`) | |||
- Bug in :meth:`Resampler.aggregate` did not allow the use of Named Aggregation (:issue:`32803`) | |||
- | |||
- Bug in :class:`GroupbyIndexer` and :class:`FixedForwardWindowIndexer` leading to segfaults and incorrect windows (:issue:`43267`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
:class:`pandas.api.indexers.FixedForwardWindowIndexer`
-
instead of mentioning
GroupbyIndexer
, mention thegroupby.rolling
operation -
Could you be more specific with "incorrect windows"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mroeschke would these changes be okay for 1.3.3? #43267 (comment) or would we want to split this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure this would be okay for 1.3.3. @dsm054 could you moved this note to 1.3.3?
5900e00
to
9fc7317
Compare
@github-actions pre-commit |
doc/source/whatsnew/v1.4.0.rst
Outdated
@@ -355,7 +355,7 @@ Groupby/resample/rolling | |||
- Bug in :meth:`pandas.DataFrame.ewm`, where non-float64 dtypes were silently failing (:issue:`42452`) | |||
- Bug in :meth:`pandas.DataFrame.rolling` operation along rows (``axis=1``) incorrectly omits columns containing ``float16`` and ``float32`` (:issue:`41779`) | |||
- Bug in :meth:`Resampler.aggregate` did not allow the use of Named Aggregation (:issue:`32803`) | |||
- | |||
- Bug in :meth:`pandas.DataFrame.groupby.rolling` and :class:`pandas.api.indexers.FixedForwardWindowIndexer` leading to segfaults and window endpoints being mixed across groups (:issue:`43267`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this to the bug fix section in 1.3.3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
planning the 1.3.3 release tomorrow. changing milestone to 1.3.4. (1.3.4 release notes won't be added to master until after 1.3.3 is released)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to 1.3.4 per request
pandas/core/indexers/objects.py
Outdated
@@ -341,18 +342,20 @@ def get_window_bounds( | |||
) | |||
start = start.astype(np.int64) | |||
end = end.astype(np.int64) | |||
# Cannot use groupby_indicies as they might not be monotonic with the object | |||
# From get_window_bounds, those two should be equal in length of array | |||
assert len(start) == len(end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this assert to before the return
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but that would hide a category of bugs: say that there were two keys, and the start,end pairs were (3,4) and (4,3). After concatenation we'd see 7==7 and be happy.
I'd be fine with adding another check before the return.
@dsm054 a few comments if you can update (and merge master) |
doc/source/whatsnew/v1.4.0.rst
Outdated
@@ -355,7 +355,7 @@ Groupby/resample/rolling | |||
- Bug in :meth:`pandas.DataFrame.ewm`, where non-float64 dtypes were silently failing (:issue:`42452`) | |||
- Bug in :meth:`pandas.DataFrame.rolling` operation along rows (``axis=1``) incorrectly omits columns containing ``float16`` and ``float32`` (:issue:`41779`) | |||
- Bug in :meth:`Resampler.aggregate` did not allow the use of Named Aggregation (:issue:`32803`) | |||
- | |||
- Bug in :meth:`pandas.DataFrame.groupby.rolling` and :class:`pandas.api.indexers.FixedForwardWindowIndexer` leading to segfaults and window endpoints being mixed across groups (:issue:`43267`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
planning the 1.3.3 release tomorrow. changing milestone to 1.3.4. (1.3.4 release notes won't be added to master until after 1.3.3 is released)
I've booked time tomorrow to update to master and address outstanding comments. |
53f4ac1
to
4b9a480
Compare
@github-actions pre-commit |
@jreback: regarding the type problems, I'm talking about the mypy errors the build is preventing at the moment. (So far, it's taken far longer to deal with the overhead of the process than it did to debug and fix the problem in the first place. :-) |
pandas/core/window/rolling.py
Outdated
elif self._win_freq_i8 is not None: | ||
rolling_indexer = VariableWindowIndexer | ||
window = self._win_freq_i8 | ||
else: | ||
rolling_indexer = FixedWindowIndexer | ||
window = self.window | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you undo these whitespace changes? (Seems to be in other places as well)
absolutely. code fixes are easy, tests hard, typing hard-est |
@dsm054 can you fixup / merge master when you can |
@mroeschke 1.3.4 is scheduled for the end of the week. will you have time to push this over the line? if not, would prefer not to push to 1.3.5, as it will hopefully be the last in 1.3.x and would require another 1.3.x release if any issues, so could either push to 1.4 or close as stale. |
@simonjayhawkins I may be able to finish this on the weekend, so let's assume this will target 1.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @simonjayhawkins should be ready to merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mroeschke minor comment otherwise lgtm and merge when ready
pandas/core/indexers/objects.py
Outdated
@@ -279,8 +278,8 @@ class GroupbyIndexer(BaseIndexer): | |||
def __init__( | |||
self, | |||
index_array: np.ndarray | None = None, | |||
window_size: int = 0, | |||
groupby_indicies: dict | None = None, | |||
window_size: int | type[BaseIndexer] = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm u can only set the default if this is an int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
window_size: int | type[BaseIndexer] = 0, | |
window_size: int | BaseIndexer = 0, |
and can remove an ignore.
doc/source/whatsnew/v1.3.4.rst
Outdated
@@ -33,6 +33,7 @@ Fixed regressions | |||
|
|||
Bug fixes | |||
~~~~~~~~~ | |||
- Bug in :meth:`pandas.DataFrame.groupby.rolling` and :class:`pandas.api.indexers.FixedForwardWindowIndexer` leading to segfaults and window endpoints being mixed across groups (:issue:`43267`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit. for consistency
- Bug in :meth:`pandas.DataFrame.groupby.rolling` and :class:`pandas.api.indexers.FixedForwardWindowIndexer` leading to segfaults and window endpoints being mixed across groups (:issue:`43267`) | |
- Fixed bug in :meth:`pandas.DataFrame.groupby.rolling` and :class:`pandas.api.indexers.FixedForwardWindowIndexer` leading to segfaults and window endpoints being mixed across groups (:issue:`43267`) |
pandas/core/indexers/objects.py
Outdated
@@ -279,8 +278,8 @@ class GroupbyIndexer(BaseIndexer): | |||
def __init__( | |||
self, | |||
index_array: np.ndarray | None = None, | |||
window_size: int = 0, | |||
groupby_indicies: dict | None = None, | |||
window_size: int | type[BaseIndexer] = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
window_size: int | type[BaseIndexer] = 0, | |
window_size: int | BaseIndexer = 0, |
and can remove an ignore.
@mroeschke, @simonjayhawkins: thanks for picking this up! (long story) |
This comment has been minimized.
This comment has been minimized.
Thanks @dsm054 and @mroeschke |
… and GroupbyIndexer (pandas-dev#43267)
…byIndexer (#43267) (#44061) Co-authored-by: DSM <[email protected]>
This PR addresses three issues, two major and one minor:
(1) If you repeated the use of a FixedForwardWindowIndexer, the indexer would be mutated and a window_size of 0 would be used. This would lead to an empty end array which unsurprisingly led to segfaults on the second run. Similar issues could be produced with other BaseIndexer subclasses, this one was simply the loudest.
(2) The end array generated by FixedForwardWindowIndexer.get_window_bounds was the wrong length, leading to longer arrays than necessary. When there was only one involved, it didn't matter much: the extra ones were simply ignored. But in the case of a groupby, these end arrays were concatenated, meaning that end values for the first group could leak into the second, and so on.
(3) There were some typos for "indices".