Skip to content

QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers #46716

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 tasks done
epigramx opened this issue Apr 9, 2022 · 10 comments
Labels
Needs Triage Issue that has not been reviewed by a pandas team member Usage Question

Comments

@epigramx
Copy link

epigramx commented Apr 9, 2022

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/a/71803558/277716

Question about pandas

I linked a specific complete answer at stackoverflow which tackles the problem of deriving the equivalent of pandas.core.window.rolling.Rolling.max but the window is an arbitrary column of integers in the same dataframe; however: even if that solution strives to be vectorized: it's extremely slow to the point of becoming unusable for large dataframes compared to the basic case of a constant window size; I suspect it may be impossible to be fast because SIMD hardware may prefer a constant nature of window size.

However: I wonder if the devs of the pandas software itself may have ideas of how to do that since they are the ones who have coded the extremely fast (vectorized) pandas.core.window.rolling.Rolling.max.

It would normally be a feature request for pandas.DataFrame.rolling to accept arbitrary integers from a column in the dataframe as a window but I don't know if it's even performant to do that.

Bug related to later comments below

import pandas as pd
from pandas import api
import numpy as np

class MultiWindowIndexer(api.indexers.BaseIndexer):
    def __init__(self, window):
        self.window = np.array(window)
        super().__init__()

    def get_window_bounds(self, num_values, min_periods, center, closed):
        end = np.arange(num_values, dtype='int64') + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end

np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a))+1
)

df = pd.DataFrame({'Data': a, 'Window': w})

df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')

print(df)

Expected outcome: index 18 max1 should be -1.487828 instead of -1.932612

source of code and further discussion on the bug at stackoverflow

@epigramx epigramx added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Apr 9, 2022
@epigramx
Copy link
Author

epigramx commented Apr 9, 2022

This new answer appears to add an obscenely faster method compared to others I've tried, though the correctness of its results is uncertain right now.

Though the simplicity of its usage after the custom indexer is set up appears to imply it would be a good candidate for a default feature.

@epigramx
Copy link
Author

epigramx commented Apr 9, 2022

The author of the new performant method, appears to believe the correctness issue they encountered was fixed, at 1.4.x (relevant comment in pandas code).

@epigramx
Copy link
Author

epigramx commented Apr 9, 2022

There appear to be still incorrectness issues (albeit limited), as mentioned here , so I guess this instead of not just a question and a feature request it's also now a bug report.

@jreback
Copy link
Contributor

jreback commented Apr 9, 2022

@epigramx pls edit the issue to give a specific issue - if it's a bug then show reproducible code where that is

@epigramx epigramx changed the title QST: On the performance of a rolling window operation when the window is a column of arbitrary integers QST/Feature/Bug: On the performance of a rolling window operation when the window is a column of arbitrary integers Apr 9, 2022
@epigramx
Copy link
Author

epigramx commented Apr 9, 2022

@jreback I added code directly replicating the issue (and expected outcome being incorrect) and edited the title slightly. I thought of starting a new issue since this started as a question and a possible feature request, but I guess that could have made it messier.

@rhshadrach
Copy link
Member

rhshadrach commented Apr 9, 2022

Thanks for the report @epigramx. While somewhat related, it would be best to separate out a performance question/issue from a bug report. It's unclear to me what the performance question you're asking here is, but I think I have a guess: if the code you posted in the OP here with MultiWindowIndexer gave the correct result, that would be performant for your situation. Is this correct?

Assuming that is the case, I recommend making this issue be a bug report. I've verified that the code you posted does give the incorrect result on main; the result you expected is the correct one.

@epigramx
Copy link
Author

epigramx commented Apr 10, 2022

@rhshadrach Yes: main focus is now on fixing a bug related to that method since originally the post was about asking for such a method or a default feature of pandas.DataFrame.rolling that would use such a method.

I'm not sure about what I should do to make this into a bug report. Should I close this and make a new one since I can not change the labels of this one?

@rhshadrach
Copy link
Member

I'm not sure about what I should do to make this into a bug report. Should I close this and make a new one since I can not change the labels of this one?

It would have been to edit the OP to make it clear that this is a bug report. Labels can be modified by those with triage permissions. With #46726 now opened, these modifications are no longer necessary.

I'm still not sure if you're also asking about performance. If #46726 is resolved (i.e. the bug is fixed), does that also give you a sufficiently performat method to do your computation?

@epigramx
Copy link
Author

epigramx commented Apr 10, 2022

@rhshadrach performance appears to not be an issue with the method posted at the bug report (assuming the bug is fixed). It appears comparable in performance to a regular rolling operation while a slow apply() operation was practically unusable for large dataframes.

In fact it appears so snappy that I wouldn't be surprised if it was a default pandas.DataFrame.rolling feature; but I guess it's not necessary if a custom indexer can do the same; by the way: an only ~40% slower method without cython was posted here.

@mroeschke
Copy link
Member

Seems like the core issue is being discussed in #46726 so closing in favor of further discussion there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Triage Issue that has not been reviewed by a pandas team member Usage Question
Projects
None yet
Development

No branches or pull requests

4 participants