QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers #46716

epigramx · 2022-04-09T06:41:29Z

Research

I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/a/71803558/277716

Question about pandas

I linked a specific complete answer at stackoverflow which tackles the problem of deriving the equivalent of pandas.core.window.rolling.Rolling.max but the window is an arbitrary column of integers in the same dataframe; however: even if that solution strives to be vectorized: it's extremely slow to the point of becoming unusable for large dataframes compared to the basic case of a constant window size; I suspect it may be impossible to be fast because SIMD hardware may prefer a constant nature of window size.

However: I wonder if the devs of the pandas software itself may have ideas of how to do that since they are the ones who have coded the extremely fast (vectorized) pandas.core.window.rolling.Rolling.max.

It would normally be a feature request for pandas.DataFrame.rolling to accept arbitrary integers from a column in the dataframe as a window but I don't know if it's even performant to do that.

Bug related to later comments below

import pandas as pd
from pandas import api
import numpy as np

class MultiWindowIndexer(api.indexers.BaseIndexer):
    def __init__(self, window):
        self.window = np.array(window)
        super().__init__()

    def get_window_bounds(self, num_values, min_periods, center, closed):
        end = np.arange(num_values, dtype='int64') + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end

np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a))+1
)

df = pd.DataFrame({'Data': a, 'Window': w})

df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')

print(df)

Expected outcome: index 18 max1 should be -1.487828 instead of -1.932612

source of code and further discussion on the bug at stackoverflow

The text was updated successfully, but these errors were encountered:

epigramx · 2022-04-09T12:26:50Z

This new answer appears to add an obscenely faster method compared to others I've tried, though the correctness of its results is uncertain right now.

Though the simplicity of its usage after the custom indexer is set up appears to imply it would be a good candidate for a default feature.

epigramx · 2022-04-09T12:57:08Z

The author of the new performant method, appears to believe ~~the correctness issue they encountered was fixed, at 1.4.x~~ (relevant comment in pandas code).

epigramx · 2022-04-09T13:47:20Z

There appear to be still incorrectness issues (albeit limited), as mentioned here , so I guess this instead of not just a question and a feature request it's also now a bug report.

jreback · 2022-04-09T13:53:14Z

@epigramx pls edit the issue to give a specific issue - if it's a bug then show reproducible code where that is

epigramx · 2022-04-09T14:22:50Z

@jreback I added code directly replicating the issue (and expected outcome being incorrect) and edited the title slightly. I thought of starting a new issue since this started as a question and a possible feature request, but I guess that could have made it messier.

rhshadrach · 2022-04-09T20:47:24Z

Thanks for the report @epigramx. While somewhat related, it would be best to separate out a performance question/issue from a bug report. It's unclear to me what the performance question you're asking here is, but I think I have a guess: if the code you posted in the OP here with MultiWindowIndexer gave the correct result, that would be performant for your situation. Is this correct?

Assuming that is the case, I recommend making this issue be a bug report. I've verified that the code you posted does give the incorrect result on main; the result you expected is the correct one.

epigramx · 2022-04-10T03:36:01Z

@rhshadrach Yes: main focus is now on fixing a bug related to that method since originally the post was about asking for such a method or a default feature of pandas.DataFrame.rolling that would use such a method.

I'm not sure about what I should do to make this into a bug report. Should I close this and make a new one since I can not change the labels of this one?

rhshadrach · 2022-04-10T14:01:09Z

I'm not sure about what I should do to make this into a bug report. Should I close this and make a new one since I can not change the labels of this one?

It would have been to edit the OP to make it clear that this is a bug report. Labels can be modified by those with triage permissions. With #46726 now opened, these modifications are no longer necessary.

I'm still not sure if you're also asking about performance. If #46726 is resolved (i.e. the bug is fixed), does that also give you a sufficiently performat method to do your computation?

epigramx · 2022-04-10T19:13:54Z

@rhshadrach performance appears to not be an issue with the method posted at the bug report (assuming the bug is fixed). It appears comparable in performance to a regular rolling operation while a slow apply() operation was practically unusable for large dataframes.

In fact it appears so snappy that I wouldn't be surprised if it was a default pandas.DataFrame.rolling feature; but I guess it's not necessary if a custom indexer can do the same; by the way: an only ~40% slower method without cython was posted here.

mroeschke · 2022-07-06T23:55:45Z

Seems like the core issue is being discussed in #46726 so closing in favor of further discussion there

epigramx added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Apr 9, 2022

epigramx changed the title ~~QST: On the performance of a rolling window operation when the window is a column of arbitrary integers~~ QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers Apr 9, 2022

epigramx mentioned this issue Apr 10, 2022

BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min #46726

Open

3 tasks

mroeschke closed this as completed Jul 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers #46716

QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers #46716

epigramx commented Apr 9, 2022 •

edited

Loading

epigramx commented Apr 9, 2022

epigramx commented Apr 9, 2022 •

edited

Loading

epigramx commented Apr 9, 2022 •

edited

Loading

jreback commented Apr 9, 2022

epigramx commented Apr 9, 2022 •

edited

Loading

rhshadrach commented Apr 9, 2022 •

edited

Loading

epigramx commented Apr 10, 2022 •

edited

Loading

rhshadrach commented Apr 10, 2022

epigramx commented Apr 10, 2022 •

edited

Loading

mroeschke commented Jul 6, 2022

QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers #46716

QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers #46716

Comments

epigramx commented Apr 9, 2022 • edited Loading

Research

Link to question on StackOverflow

Question about pandas

Bug related to later comments below

epigramx commented Apr 9, 2022

epigramx commented Apr 9, 2022 • edited Loading

epigramx commented Apr 9, 2022 • edited Loading

jreback commented Apr 9, 2022

epigramx commented Apr 9, 2022 • edited Loading

rhshadrach commented Apr 9, 2022 • edited Loading

epigramx commented Apr 10, 2022 • edited Loading

rhshadrach commented Apr 10, 2022

epigramx commented Apr 10, 2022 • edited Loading

mroeschke commented Jul 6, 2022

epigramx commented Apr 9, 2022 •

edited

Loading

epigramx commented Apr 9, 2022 •

edited

Loading

epigramx commented Apr 9, 2022 •

edited

Loading

epigramx commented Apr 9, 2022 •

edited

Loading

rhshadrach commented Apr 9, 2022 •

edited

Loading

epigramx commented Apr 10, 2022 •

edited

Loading

epigramx commented Apr 10, 2022 •

edited

Loading