Skip to content

BUG: Fixes to FixedForwardWindowIndexer and GroupbyIndexer (#43267) #43291

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

dsm054
Copy link
Contributor

@dsm054 dsm054 commented Aug 30, 2021

This PR addresses three issues, two major and one minor:

(1) If you repeated the use of a FixedForwardWindowIndexer, the indexer would be mutated and a window_size of 0 would be used. This would lead to an empty end array which unsurprisingly led to segfaults on the second run. Similar issues could be produced with other BaseIndexer subclasses, this one was simply the loudest.

(2) The end array generated by FixedForwardWindowIndexer.get_window_bounds was the wrong length, leading to longer arrays than necessary. When there was only one involved, it didn't matter much: the extra ones were simply ignored. But in the case of a groupby, these end arrays were concatenated, meaning that end values for the first group could leak into the second, and so on.

(3) There were some typos for "indices".

@dsm054
Copy link
Contributor Author

dsm054 commented Aug 30, 2021

@github-actions pre-commit

@dsm054 dsm054 force-pushed the repair_fixedforwardwindowindexer branch 2 times, most recently from 2755500 to 5900e00 Compare August 30, 2021 01:49
@pep8speaks
Copy link

pep8speaks commented Aug 30, 2021

Hello @dsm054! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-16 18:44:59 UTC

@@ -93,7 +93,6 @@ def get_window_bounds(

end = np.clip(end, 0, num_values)
start = np.clip(start, 0, num_values)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind reverting these whitespace changes? It's somewhat distracting from the other changes in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

start, end = window_indexer.get_window_bounds(
num_values=len(x),
min_periods=min_periods,
center=self.center,
closed=self.closed,
)

# From get_window_bounds, those two should be equal in length of array
assert len(start) == len(end)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this assertion be moved to GroupbyIndexer.get_window_bounds instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with putting one there too -- I was just mirroring existing practice -- but I think an independent check here has merit.

If start/end are broken here, result = calc(values) is probably going to crash a few lines below.

Since there's logic in _apply which calculates min_periods which is passed as an arg to get_window_bounds, we can dream up scenarios where everything would be fine at that level but broken here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Putting this assert here is fine as well then

@@ -1429,6 +1435,10 @@ def cov_func(x, y):
center=self.center,
closed=self.closed,
)

# From get_window_bounds, those two should be equal in length of array
assert len(start) == len(end)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same (as well as below)

@pytest.mark.parametrize(
"df",
[
DataFrame({"a": [1, 1], "b": [0, 1]}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make the DataFrame call in the test method instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

start, end = indexer.get_window_bounds(num_values=num_values)

tm.assert_equal(start, np.array(expected_start, dtype="int64"))
tm.assert_equal(end, np.array(expected_end, dtype="int64"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tm.assert_numpy_array_equal I believe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

),
],
)
def test_rolling_groupby_with_fixed_forward_specific(df, window_size, expected):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can move these tests in pandas/tests/window/test_base_indexer.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -355,7 +355,7 @@ Groupby/resample/rolling
- Bug in :meth:`pandas.DataFrame.ewm`, where non-float64 dtypes were silently failing (:issue:`42452`)
- Bug in :meth:`pandas.DataFrame.rolling` operation along rows (``axis=1``) incorrectly omits columns containing ``float16`` and ``float32`` (:issue:`41779`)
- Bug in :meth:`Resampler.aggregate` did not allow the use of Named Aggregation (:issue:`32803`)
-
- Bug in :class:`GroupbyIndexer` and :class:`FixedForwardWindowIndexer` leading to segfaults and incorrect windows (:issue:`43267`)
Copy link
Member

@mroeschke mroeschke Aug 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • :class:`pandas.api.indexers.FixedForwardWindowIndexer`

  • instead of mentioning GroupbyIndexer, mention the groupby.rolling operation

  • Could you be more specific with "incorrect windows"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke would these changes be okay for 1.3.3? #43267 (comment) or would we want to split this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure this would be okay for 1.3.3. @dsm054 could you moved this note to 1.3.3?

@simonjayhawkins simonjayhawkins linked an issue Aug 30, 2021 that may be closed by this pull request
3 tasks
@dsm054 dsm054 force-pushed the repair_fixedforwardwindowindexer branch from 5900e00 to 9fc7317 Compare August 30, 2021 12:32
@dsm054
Copy link
Contributor Author

dsm054 commented Aug 30, 2021

@github-actions pre-commit

@@ -355,7 +355,7 @@ Groupby/resample/rolling
- Bug in :meth:`pandas.DataFrame.ewm`, where non-float64 dtypes were silently failing (:issue:`42452`)
- Bug in :meth:`pandas.DataFrame.rolling` operation along rows (``axis=1``) incorrectly omits columns containing ``float16`` and ``float32`` (:issue:`41779`)
- Bug in :meth:`Resampler.aggregate` did not allow the use of Named Aggregation (:issue:`32803`)
-
- Bug in :meth:`pandas.DataFrame.groupby.rolling` and :class:`pandas.api.indexers.FixedForwardWindowIndexer` leading to segfaults and window endpoints being mixed across groups (:issue:`43267`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this to the bug fix section in 1.3.3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

planning the 1.3.3 release tomorrow. changing milestone to 1.3.4. (1.3.4 release notes won't be added to master until after 1.3.3 is released)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to 1.3.4 per request

@@ -341,18 +342,20 @@ def get_window_bounds(
)
start = start.astype(np.int64)
end = end.astype(np.int64)
# Cannot use groupby_indicies as they might not be monotonic with the object
# From get_window_bounds, those two should be equal in length of array
assert len(start) == len(end)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this assert to before the return?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but that would hide a category of bugs: say that there were two keys, and the start,end pairs were (3,4) and (4,3). After concatenation we'd see 7==7 and be happy.

I'd be fine with adding another check before the return.

@simonjayhawkins simonjayhawkins added this to the 1.3.3 milestone Aug 30, 2021
@simonjayhawkins simonjayhawkins added Groupby Segfault Non-Recoverable Error Window rolling, ewma, expanding labels Aug 30, 2021
@jreback
Copy link
Contributor

jreback commented Sep 9, 2021

@dsm054 a few comments if you can update (and merge master)

@@ -355,7 +355,7 @@ Groupby/resample/rolling
- Bug in :meth:`pandas.DataFrame.ewm`, where non-float64 dtypes were silently failing (:issue:`42452`)
- Bug in :meth:`pandas.DataFrame.rolling` operation along rows (``axis=1``) incorrectly omits columns containing ``float16`` and ``float32`` (:issue:`41779`)
- Bug in :meth:`Resampler.aggregate` did not allow the use of Named Aggregation (:issue:`32803`)
-
- Bug in :meth:`pandas.DataFrame.groupby.rolling` and :class:`pandas.api.indexers.FixedForwardWindowIndexer` leading to segfaults and window endpoints being mixed across groups (:issue:`43267`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

planning the 1.3.3 release tomorrow. changing milestone to 1.3.4. (1.3.4 release notes won't be added to master until after 1.3.3 is released)

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@dsm054
Copy link
Contributor Author

dsm054 commented Sep 12, 2021

I've booked time tomorrow to update to master and address outstanding comments.

@dsm054 dsm054 force-pushed the repair_fixedforwardwindowindexer branch from 53f4ac1 to 4b9a480 Compare September 13, 2021 01:40
@dsm054
Copy link
Contributor Author

dsm054 commented Sep 13, 2021

@github-actions pre-commit

@dsm054
Copy link
Contributor Author

dsm054 commented Sep 22, 2021

@jreback: regarding the type problems, I'm talking about the mypy errors the build is preventing at the moment. (So far, it's taken far longer to deal with the overhead of the process than it did to debug and fix the problem in the first place. :-)

elif self._win_freq_i8 is not None:
rolling_indexer = VariableWindowIndexer
window = self._win_freq_i8
else:
rolling_indexer = FixedWindowIndexer
window = self.window

Copy link
Member

@mroeschke mroeschke Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you undo these whitespace changes? (Seems to be in other places as well)

@jreback
Copy link
Contributor

jreback commented Sep 22, 2021

@dsm054

(So far, it's taken far longer to deal with the overhead of the process than it did to debug and fix the problem in the first place. :-)

absolutely. code fixes are easy, tests hard, typing hard-est

@jreback
Copy link
Contributor

jreback commented Sep 29, 2021

@dsm054 can you fixup / merge master when you can

@simonjayhawkins
Copy link
Member

@mroeschke 1.3.4 is scheduled for the end of the week. will you have time to push this over the line?

if not, would prefer not to push to 1.3.5, as it will hopefully be the last in 1.3.x and would require another 1.3.x release if any issues, so could either push to 1.4 or close as stale.

@mroeschke
Copy link
Member

@simonjayhawkins I may be able to finish this on the weekend, so let's assume this will target 1.4

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @simonjayhawkins should be ready to merge

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke minor comment otherwise lgtm and merge when ready

@@ -279,8 +278,8 @@ class GroupbyIndexer(BaseIndexer):
def __init__(
self,
index_array: np.ndarray | None = None,
window_size: int = 0,
groupby_indicies: dict | None = None,
window_size: int | type[BaseIndexer] = 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm u can only set the default if this is an int

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
window_size: int | type[BaseIndexer] = 0,
window_size: int | BaseIndexer = 0,

and can remove an ignore.

@@ -33,6 +33,7 @@ Fixed regressions

Bug fixes
~~~~~~~~~
- Bug in :meth:`pandas.DataFrame.groupby.rolling` and :class:`pandas.api.indexers.FixedForwardWindowIndexer` leading to segfaults and window endpoints being mixed across groups (:issue:`43267`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. for consistency

Suggested change
- Bug in :meth:`pandas.DataFrame.groupby.rolling` and :class:`pandas.api.indexers.FixedForwardWindowIndexer` leading to segfaults and window endpoints being mixed across groups (:issue:`43267`)
- Fixed bug in :meth:`pandas.DataFrame.groupby.rolling` and :class:`pandas.api.indexers.FixedForwardWindowIndexer` leading to segfaults and window endpoints being mixed across groups (:issue:`43267`)

@@ -279,8 +278,8 @@ class GroupbyIndexer(BaseIndexer):
def __init__(
self,
index_array: np.ndarray | None = None,
window_size: int = 0,
groupby_indicies: dict | None = None,
window_size: int | type[BaseIndexer] = 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
window_size: int | type[BaseIndexer] = 0,
window_size: int | BaseIndexer = 0,

and can remove an ignore.

@simonjayhawkins simonjayhawkins mentioned this pull request Oct 16, 2021
@dsm054
Copy link
Contributor Author

dsm054 commented Oct 16, 2021

@mroeschke, @simonjayhawkins: thanks for picking this up! (long story)

@simonjayhawkins simonjayhawkins merged commit 776329f into pandas-dev:master Oct 16, 2021
@lumberbot-app

This comment has been minimized.

@simonjayhawkins
Copy link
Member

Thanks @dsm054 and @mroeschke

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Segfault Non-Recoverable Error Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: crash in df.groupby.rolling.mean with forward window indexer
5 participants