-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: support median function for custom BaseIndexer rolling windows #33626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Place of implementationI think touching if (end - start).max() == 0:
output[:] = NaN
return output
win = (end - start).max()
sl = skiplist_init(<int>win) This would set UPDATE: After discussing with @mroeschke , we decided to implement the bugfix directly in roll_median_c. |
@mroeschke @jreback I'm interested in what you think on where we should implement the bugfix. For now, I added the code to I checked both options, and they both work. Because rolling median calculation itself is O(N * log(window_size)) , the performance hit from the extra All the details are above. UPDATE: After discussing with @mroeschke , we decided to implement the bugfix directly in roll_median_c. |
I think I would prefer the solution described here: #33626 (comment) And we could make due with a comment in I'd prefer to call |
@mroeschke Because we are now ignoring the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. minor comment, ping on green.
@@ -166,3 +172,8 @@ def test_rolling_forward_window(constructor, func, np_func, expected, np_kwargs) | |||
tm.assert_equal(result, expected) | |||
expected2 = constructor(rolling.apply(lambda x: np_func(x, **np_kwargs))) | |||
tm.assert_equal(result, expected2) | |||
# Check that the function works without min_periods supplied |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put a blank line before each case
and add comments for lines 169, 173.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
Done, all green.
thanks @AlexKirko |
This PR caused a big performance regression for rolling median (see eg https://pandas.pydata.org/speed/pandas/#rolling.ForwardWindowMethods.time_rolling?p-constructor='DataFrame'&p-window_size=1000&p-dtype='float'&p-method='median'&commits=3a801661). |
@jorisvandenbossche pls read the perf section above |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Scope of PR
This PR makes sure that when we calculate the median in
roll_median_c
, we override the maximum window width with the correct value, and the function doesn't shortcut by returning all NaNs.Details
The
median
function eventually callsroll_median_c
which accepts awin
parameter (maximum window width) to correctly allocate memory and initialize the skiplist data structure which is the backbone of the rolling median algorithm. Currently,win
is determined by the_get_window
function which returnsmin_periods or 0
for customBaseIndexer
subclasses:Thus,
roll_median_c
either shortcuts or initializes to incorrect depth:I propose we determine max window length directly in themedian
function. This means thatstart
andend
arrays get calculated twice: here and in_apply
. However, I belive this is better than injecting a median-specific crutch into_apply
or messing with the shortcut inroll_median_c
(we could attempt to overridewin
if(end - start).max() > 0
. This other option is explored below.After discussing with @mroeschke , we decided to implement the bugfix directly in
roll_median_c
. Details below.Please say if you think another approach would be preferable.
Background on the wider issue
We currently don't support several rolling window functions when building a rolling window object using a custom class descended from
pandas.api.indexers.Baseindexer
. The implementations were written with backward-looking windows in mind, and this led to these functions breaking.Currently, using these functions returns a
NotImplemented
error thanks to #33057, but ideally we want to update the implementations, so that they will work without a performance hit. This is what I aim to do over a series of PRs.Perf notes
The function currently shortcuts because of the bug or initializes the main datastructure incorrectly. For this reason, benchmarks are meaningless.
Ran benchmarks vs. the shortcut anyway: