Skip to content

BUG: support min/max functions for rolling windows with custom BaseIndexer #33180

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Apr 3, 2020

Conversation

AlexKirko
Copy link
Member

@AlexKirko AlexKirko commented Mar 31, 2020

The problem

We currently don't support several rolling window functions when building a rolling window object using a custom class descended from pandas.api.indexers.Baseindexer. The implementations were written with backward-looking windows in mind, and this led to these functions breaking.
Currently, using these functions returns a NotImplemented error thanks to #33057, but ideally we want to update the implementations, so that they will work without a performance hit. This is what I aim to do over a series of PRs.

Scope of PR

This PR proposes an alternative implementation for the min and max rolling window functions implemented in the _roll_min_max_variable function in the aggregations.pyx file.

Perf note

This implementation is slightly faster than the one on master branch (tested with a Series with 10e6 randomly generated values and timeit). If necessary, I can add a benchmark to the benchmark suite, although it must be noted that we need to benchmark with large arrays of data to minimize the effect of overhead.

@AlexKirko
Copy link
Member Author

AlexKirko commented Mar 31, 2020

The core idea is that we ancor the output index we write at to the index in the input value array it corresponds to. This generalizes the function and also removes the need to fiddle with close_offset.

Aside from that, I reuse the original algorithm with some minor adjustments that allow the functions to work with any monotonic windows (window starts and ends stay in place or move to the right).

@AlexKirko
Copy link
Member Author

On another note, with these changes, it is no longer necessary to have separate code for initializing the first value and for writing the final value. Everything can be done in the main loop.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Mar 31, 2020

values = np.arange(10)

indexer = ForwardIndexer(window_size=3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm are you anticpating exposing this to users? should we be using this internally?

Copy link
Member Author

@AlexKirko AlexKirko Mar 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean allowing the creation of a forward-looking rolling window without creating a custom BaseIndexer subclass?

My goal was to make sure that our implementations are robust against custom indexers with arbitrary offsets relative to the position in the input array they are calculated for. Additionally, I made this implementation robust against weird stuff, like smoothly expanding or contracting windows, for example.
The ForwardIndexer class is used in testing for convenience, because it is, by far, the most common customization people expect us to support, in my limited experience.

We could make forward-looking a Boolean argument in the rolling window constructor, but I don't know how useful of an enhancement it would be. In my opinion, it's enough to make sure that we support reasonable customizations based on BaseIndexer and let the users take care of the rest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be interesting to provide a standard set of BaseIndexers in the library. ForwardIndexer may be one that we provide once we standardize the definition and API.

@AlexKirko feel free to open a new issue about including ForwardIndexer and BackwardIndexer in the pandas.api.indexers

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, created a new issue. Will be happy to submit a PR in a day or two.

("max", [2.0, 3.0, 4.0, 100.0, 100.0, 100.0, 8.0, 9.0, 9.0, np.nan]),
],
)
def test_rolling_forward_window(constructor, func, expected):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you put this test in pandas/tests/window/test_base_indexer.py?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


indexer = ForwardIndexer(window_size=3)
rolling = constructor(values).rolling(window=indexer, min_periods=2)
result = getattr(rolling, func)()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also test this against rolling.apply(lambda x: max(x)) (or the equivalent)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member Author

@AlexKirko AlexKirko Mar 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke
Should we drop testing against an explicitly provided array then? Doing both seems redundant to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay to keep the explicit array as the source of truth on the off chance that rolling(Indexer).max() and rolling.apply(lambda x: max(x) both incorrectly change in the same direction

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let's keep it then. I was afraid of something like this, which is why I started by testing against an explicit array and not against apply.

@AlexKirko AlexKirko requested a review from jreback April 1, 2020 09:14
@mroeschke mroeschke added Window rolling, ewma, expanding Bug and removed Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 1, 2020
@mroeschke mroeschke added this to the 1.1 milestone Apr 1, 2020
@AlexKirko
Copy link
Member Author

@jreback
Between the new issue and the changes made, I think all the comments are covered. Is there anything else we should do before merging?

@jreback
Copy link
Contributor

jreback commented Apr 2, 2020

will looks again soon

@AlexKirko
Copy link
Member Author

Changed the ForwardIndexer implementation to mirroring the one I will propose exposing to users in a separate PR.

@AlexKirko
Copy link
Member Author

AlexKirko commented Apr 2, 2020

Some pipelines are canceling on timeout. Should have nothing to do with this PR, but I restarted the tests just in case.

@AlexKirko
Copy link
Member Author

AlexKirko commented Apr 2, 2020

Hm. py37_np_dev seems to be hanging for everyone. If it doesn't recover after an hour, I'll post an issue (unless someone beats me to it).

@jreback jreback merged commit 4521308 into pandas-dev:master Apr 3, 2020
@jreback
Copy link
Contributor

jreback commented Apr 3, 2020

thanks @AlexKirko

@AlexKirko
Copy link
Member Author

Thanks!

@AlexKirko AlexKirko deleted the rolling-min-max branch April 9, 2020 06:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants