Skip to content

BUG: support median function for custom BaseIndexer rolling windows #33626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 21, 2020

Conversation

AlexKirko
Copy link
Member

@AlexKirko AlexKirko commented Apr 18, 2020

Scope of PR

This PR makes sure that when we calculate the median in roll_median_c, we override the maximum window width with the correct value, and the function doesn't shortcut by returning all NaNs.

Details

The median function eventually calls roll_median_c which accepts a win parameter (maximum window width) to correctly allocate memory and initialize the skiplist data structure which is the backbone of the rolling median algorithm. Currently, win is determined by the _get_window function which returns min_periods or 0 for custom BaseIndexer subclasses:

    def _get_window(self, other=None, win_type: Optional[str] = None) -> int:
        """
        Return window length.

        Parameters
        ----------
        other :
            ignored, exists for compatibility
        win_type :
            ignored, exists for compatibility

        Returns
        -------
        window : int
        """
        if isinstance(self.window, BaseIndexer):
            return self.min_periods or 0
        return self.window

Thus, roll_median_c either shortcuts or initializes to incorrect depth:

...
    if win == 0 or (end - start).max() == 0:
        output[:] = NaN
        return output
    win = (end - start).max()
    sl = skiplist_init(<int>win)
...

I propose we determine max window length directly in the median function. This means that start and end arrays get calculated twice: here and in _apply. However, I belive this is better than injecting a median-specific crutch into _apply or messing with the shortcut in roll_median_c (we could attempt to override win if (end - start).max() > 0. This other option is explored below.
After discussing with @mroeschke , we decided to implement the bugfix directly in roll_median_c. Details below.

Please say if you think another approach would be preferable.

Background on the wider issue

We currently don't support several rolling window functions when building a rolling window object using a custom class descended from pandas.api.indexers.Baseindexer. The implementations were written with backward-looking windows in mind, and this led to these functions breaking.
Currently, using these functions returns a NotImplemented error thanks to #33057, but ideally we want to update the implementations, so that they will work without a performance hit. This is what I aim to do over a series of PRs.

Perf notes

The function currently shortcuts because of the bug or initializes the main datastructure incorrectly. For this reason, benchmarks are meaningless.
Ran benchmarks vs. the shortcut anyway:

 asv continuous -f 1.1 master HEAD -b ^rolling.ForwardWindowMethods
...
       before           after         ratio
     [b630cdbc]       [ce82372f]
     <master>         <rolling-median>
+      3.55▒0.2ms       75.2▒0.9ms    21.19  rolling.ForwardWindowMethods.time_rolling('DataFrame', 1000, 'float', 'median')
+     4.25▒0.08ms         76.7▒2ms    18.06  rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'float', 'median')
+      4.02▒0.4ms       60.3▒0.5ms    14.99  rolling.ForwardWindowMethods.time_rolling('DataFrame', 1000, 'int', 'median')
+      3.42▒0.2ms         47.4▒1ms    13.86  rolling.ForwardWindowMethods.time_rolling('DataFrame', 10, 'float', 'median')
+      4.80▒0.1ms       61.2▒0.5ms    12.74  rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'int', 'median')
+     4.58▒0.06ms       49.7▒0.9ms    10.85  rolling.ForwardWindowMethods.time_rolling('Series', 10, 'float', 'median')
+      4.45▒0.5ms       45.0▒0.8ms    10.10  rolling.ForwardWindowMethods.time_rolling('DataFrame', 10, 'int', 'median')
+     5.02▒0.07ms       45.9▒0.6ms     9.14  rolling.ForwardWindowMethods.time_rolling('Series', 10, 'int', 'median')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@AlexKirko
Copy link
Member Author

AlexKirko commented Apr 18, 2020

Place of implementation

I think touching _apply is a bad idea, but we could make changes to roll_median_c in aggregations.pyx instead of median in rolling.py.
Something along the lines of:

    if (end - start).max() == 0:
        output[:] = NaN
        return output
    win = (end - start).max()
    sl = skiplist_init(<int>win)

This would set win correctly if it arrived as 0. We would basically be silently ignoring win and setting it ourselves. I'm not too exicted about such an approach, but it would avoid an extra call get_window_bounds

UPDATE: After discussing with @mroeschke , we decided to implement the bugfix directly in roll_median_c.

@AlexKirko
Copy link
Member Author

AlexKirko commented Apr 18, 2020

@mroeschke @jreback
Turned out that median needs some work too. All green, please take a look.

I'm interested in what you think on where we should implement the bugfix. For now, I added the code to _Rolling_and_Expanding.median, but we could instead alter roll_median_c in aggregations.pyx (link to relevant code). The latter option would avoid an extra call to get_window_bounds, but we'd be silently fixing erroneous input in roll_median_c, and I'm not too excited about that. On the other hand, we already do that when non-zero window size is passed into the function.

I checked both options, and they both work. Because rolling median calculation itself is O(N * log(window_size)) , the performance hit from the extra get_window_bounds call should be a problem only if the user implements it sub-optimally.

All the details are above.

UPDATE: After discussing with @mroeschke , we decided to implement the bugfix directly in roll_median_c.

@mroeschke
Copy link
Member

mroeschke commented Apr 18, 2020

I think I would prefer the solution described here: #33626 (comment)

And we could make due with a comment in def median saying we override the value for BaseIndexer.

I'd prefer to call get_window_bounds in once place in possible. I understand the indirection in def median is not the greatest, but I think it'd prefer it as a lot of this rolling code is due to a conceptual refactor in the future.

@AlexKirko
Copy link
Member Author

@mroeschke
Done, please take a look.

Because we are now ignoring the win argument, there is no reason to calculate it in median and call partial. I set win to default 0 in roll_median_c for backward compatibility reasons (even if it's an inner function). The comments hopefully make it clear how things work now.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. minor comment, ping on green.

@@ -166,3 +172,8 @@ def test_rolling_forward_window(constructor, func, np_func, expected, np_kwargs)
tm.assert_equal(result, expected)
expected2 = constructor(rolling.apply(lambda x: np_func(x, **np_kwargs)))
tm.assert_equal(result, expected2)
# Check that the function works without min_periods supplied
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put a blank line before each case

and add comments for lines 169, 173.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Done, all green.

@jreback jreback added the Window rolling, ewma, expanding label Apr 19, 2020
@jreback jreback added this to the 1.1 milestone Apr 19, 2020
@AlexKirko AlexKirko requested a review from jreback April 20, 2020 07:23
@jreback jreback merged commit 3a80166 into pandas-dev:master Apr 21, 2020
@jreback
Copy link
Contributor

jreback commented Apr 21, 2020

thanks @AlexKirko

@AlexKirko AlexKirko deleted the rolling-median branch April 23, 2020 06:16
@jorisvandenbossche
Copy link
Member

This PR caused a big performance regression for rolling median (see eg https://pandas.pydata.org/speed/pandas/#rolling.ForwardWindowMethods.time_rolling?p-constructor='DataFrame'&p-window_size=1000&p-dtype='float'&p-method='median'&commits=3a801661).
I didn't check anything yet, also not whether this is a valid regression report (maybe this was considered, or the benchmark is not checking the correct thing, ..), but just noting.

@jreback
Copy link
Contributor

jreback commented Apr 26, 2020

@jorisvandenbossche pls read the perf section above

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants