Skip to content

BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min #46726

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks done
epigramx opened this issue Apr 10, 2022 · 6 comments · May be fixed by #61288
Open
3 tasks done
Labels
Bug Window rolling, ewma, expanding

Comments

@epigramx
Copy link

epigramx commented Apr 10, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas import api
import numpy as np

class MultiWindowIndexer(api.indexers.BaseIndexer):
    def __init__(self, window):
        self.window = np.array(window)
        super().__init__()

    def get_window_bounds(self, num_values, min_periods, center, closed):
        end = np.arange(num_values, dtype='int64') + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end

np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a))+1
)

df = pd.DataFrame({'Data': a, 'Window': w})

df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')

print(df)

Issue Description

This method basically tries to use a rolling operation where the window is an arbitrary series of integers instead of an integer or an offset. It is related to question/feature request #46716 and it was originally authored as an answer for a StackOverflow question here. There the author of the method notes on the bug: "The cython implementation seems to remember the largest starting index encountered so far and 'clips' smaller starting indices to the stored value. More technically correct: only stores the range of the largest start and largest end indices encountered so far in a queue, discarding smaller start indices and making them unavailable."

Expected Behavior

The result printed for index 18, should be -1.487828 instead of -1.932612, because at that point the window is 3 and it looks for the max between -1.932612 and -2.539703 and -1.487828,

Installed Versions

commit : 4bfe3d0
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.3
pytz : 2021.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 61.1.1
Cython : None
pytest : 7.1.1
hypothesis : None
sphinx : None
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : None
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.0.1
matplotlib : None
numba : None
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.4.34
tables : 3.7.0
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

@epigramx epigramx added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 10, 2022
@simonjayhawkins
Copy link
Member

Thanks @epigramx for the report.

Expected Behavior

The result printed for index 18, should be -1.487828 instead of -1.932612, because at that point the window is 3 and it looks for the max between -1.932612 and -2.539703 and -1.487828,

Note that on main this now raises ValueError: MultiWindowIndexer does not implement the correct signature for get_window_bounds

@rhshadrach
Copy link
Member

@simonjayhawkins - on main, adding the (unused) argument step at the end of get_window_bounds gives the same behavior as reported in OP.

@rhshadrach rhshadrach added this to the Contributions Welcome milestone Apr 10, 2022
@rhshadrach rhshadrach removed the Needs Triage Issue that has not been reviewed by a pandas team member label Apr 10, 2022
@mroeschke
Copy link
Member

I think this bug will be specific to max & min since it doesn't use the traditional sliding window algorithm that most all the other aggregation functions use: https://stackoverflow.com/a/12239580

@mroeschke mroeschke changed the title BUG: Partially incorrect results when using a custom indexer for a rolling window BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min Apr 10, 2022
@rhshadrach
Copy link
Member

@mroeschke - I haven't taken a look if the used algorithm can be adapted for arbitrary windows; if not, does it make sense to have two different algorithms (fastpath/slowpath)?

@mroeschke
Copy link
Member

I am a little doubtful it can be sharable for other aggregations because IIUC the min/min window algorithm uses value comparisons since it's just looking for min/max

does it make sense to have two different algorithms (fastpath/slowpath)

I suppose so, but not too thrilled about maintaining heuristics when to use fast vs slow in addition to maintaining both algorithms.

We've had precedent for collapsing two different algorithms before trading off performance for the sake of correctness & maintainability, so if going back to the more correct algorithm doesn't incur that much of a performance hit I think that would be worthwhile

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
viable-alternative added a commit to viable-alternative/pandas that referenced this issue Apr 14, 2025
viable-alternative added a commit to viable-alternative/pandas that referenced this issue Apr 14, 2025
viable-alternative added a commit to viable-alternative/pandas that referenced this issue Apr 14, 2025
viable-alternative added a commit to viable-alternative/pandas that referenced this issue Apr 14, 2025
viable-alternative added a commit to viable-alternative/pandas that referenced this issue Apr 14, 2025
…max rolling calc. (Reduced scope of the original check-in to avoid docstring errors.)
@simonjayhawkins
Copy link
Member

@viable-alternative has opened PR #61288

viable-alternative added a commit to viable-alternative/pandas that referenced this issue Apr 17, 2025
viable-alternative added a commit to viable-alternative/pandas that referenced this issue Apr 21, 2025
viable-alternative added a commit to viable-alternative/pandas that referenced this issue Apr 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Window rolling, ewma, expanding
Projects
None yet
4 participants