BUG: Rolling operations on groupby with apply behaves differently from looping the window. #48056

jtho123 · 2022-08-12T12:35:24Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from datetime import date
df = pd.DataFrame({'A': [i for i in range(10)], 
                   'date': [date(1970,1,i) for _ in range(2) for i in range(1,6)], 
                   'id': [i for i in range(2) for _ in range(5)], 
                   'exp_date': [date(1970,1,3) for _ in range(2) for _ in range(1,6)]})
df = df.set_index(['date', 'exp_date']).sort_index()
for window in df.groupby('id').A.rolling(2):
    print(window.loc[pd.IndexSlice[:, window.index[-1][0]:]].sum(min_count=2))
print(df.groupby('id').A.rolling(2).apply(lambda x: x.loc[pd.IndexSlice[:, x.index[-1][0]:]].sum(min_count=2)))

Issue Description

I have a dataframe with a multiindex on date and an expiration date as provided in the minimal code example.

import pandas as pd
from datetime import date
df = pd.DataFrame({'A': [i for i in range(10)], 
                   'date': [date(1970,1,i) for _ in range(2) for i in range(1,6)], 
                   'id': [i for i in range(2) for _ in range(5)], 
                   'exp_date': [date(1970,1,3) for _ in range(2) for _ in range(1,6)]})
df = df.set_index(['date', 'exp_date']).sort_index()
for window in df.groupby('id').A.rolling(2):
    print(window.loc[pd.IndexSlice[:, window.index[-1][0]:]].sum(min_count=2))
print(df.groupby('id').A.rolling(2).apply(lambda x: x.loc[pd.IndexSlice[:, x.index[-1][0]:]].sum(min_count=2)))

I want to roll sum of a column - here A - grouped by an ID such we only use values where expiration date is higher than or equal to actual date.

When I loop over the window I get fine results compared to my expected results:

nan
1
3
nan
nan
nan
11
13
nan
nan

But when I use the apply method, I get an unexpected result:

id  date        exp_date  
0   1970-01-01  1970-01-03    NaN
    1970-01-02  1970-01-03    1.0
    1970-01-03  1970-01-03    3.0
    1970-01-04  1970-01-03    5.0
    1970-01-05  1970-01-03    7.0
1   1970-01-01  1970-01-03    NaN
    1970-01-02  1970-01-03    NaN
    1970-01-03  1970-01-03    NaN
    1970-01-04  1970-01-03    NaN
    1970-01-05  1970-01-03    NaN
Name: A, dtype: float64

where I would expect the similar results as my loop:

id  date        exp_date  
0   1970-01-01  1970-01-03    NaN
    1970-01-02  1970-01-03    1.0
    1970-01-03  1970-01-03    3.0
    1970-01-04  1970-01-03    NaN
    1970-01-05  1970-01-03    NaN
1   1970-01-01  1970-01-03    NaN
    1970-01-02  1970-01-03    11.0
    1970-01-03  1970-01-03    13.0
    1970-01-04  1970-01-03    NaN
    1970-01-05  1970-01-03    NaN
Name: A, dtype: float64

Expected Behavior

first print:

nan
1
3
nan
nan
nan
11
13
nan
nan

second print:

id  date        exp_date  
0   1970-01-01  1970-01-03    NaN
    1970-01-02  1970-01-03    1.0
    1970-01-03  1970-01-03    3.0
    1970-01-04  1970-01-03    NaN
    1970-01-05  1970-01-03    NaN
1   1970-01-01  1970-01-03    NaN
    1970-01-02  1970-01-03    11.0
    1970-01-03  1970-01-03    13.0
    1970-01-04  1970-01-03    NaN
    1970-01-05  1970-01-03    NaN
Name: A, dtype: float64

Installed Versions

INSTALLED VERSIONS

commit : e8093ba
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.104-linuxkit
Version : #1 SMP Thu Mar 17 17:08:06 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.3
numpy : 1.20.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 58.0.4
pip : 20.2.4
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli :
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : 3.5.1
numba : None
numexpr : 2.8.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
snappy : None
sqlalchemy : 1.4.27
tables : None
tabulate : 0.8.10
xarray : None
xlrd : None
xlwt : None
zstandard : None

The text was updated successfully, but these errors were encountered:

phofl · 2022-08-15T15:56:11Z

Hi, thanks for your report. Can you trim this down a bit? This is really complicated.

Udneowi · 2022-10-13T19:41:58Z

@phofl Having had similar issues I believe this has something to do with the indexing of the windows passed. Sorting the data frame beforehand by id and then time fixes the issue for me.

print(df.sort_values(['id', 'date']).groupby('id').A.rolling(2).apply(lambda x: x.loc[pd.IndexSlice[:, x.index[-1][0]:]].sum(min_count=2)))

I was personally using a temporal rolling window where for each user each day was unique. I wanted to weigh days further away lower but ran into trouble running a custom function in apply.

df.groupby('user').rolling('7D', on = 'datetime', min_periods = 5).apply(custom_func)

Basically the data passed to the custom function has duplicated time indexes, eg:

datetime
2016-10-17    0.0
2016-10-17    0.0
2016-10-17    0.0
2016-10-18    0.0
2016-10-20    0.0
dtype: float64

Even though that should not be possible (as they are unique for each group/user). Without knowing too much of the pandas structure it might have something to do with how the indexes are initialized. Calling RollingGroupby._get_window_indexer().get_window_bounds() gives:

(array([    0,     0,     0, ..., 50455, 50455, 50456]),
 array([    1,     2,     3, ..., 50458, 50459, 50460]))

which are structured according to all of the data, and not the individual groups. I might be totally off though. Similar to OPs problem I fixed mine by sorting the data first.

df.sort_values(['user', 'datetime']).groupby('user').rolling('7D', on = 'datetime', min_periods = 5).apply(custom_func)

phofl · 2022-12-09T16:42:58Z

Closing for now. We need a simple reproducer so that we can see that this is a bug inside of pandas and not something in your code

jtho123 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 12, 2022

mroeschke added Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 15, 2022

fakio mentioned this issue Nov 24, 2022

BUG: groupby rolling apply giving strange index #49878

Open

3 tasks

phofl closed this as completed Dec 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Rolling operations on groupby with apply behaves differently from looping the window. #48056

BUG: Rolling operations on groupby with apply behaves differently from looping the window. #48056

jtho123 commented Aug 12, 2022 •

edited

Loading

INSTALLED VERSIONS

phofl commented Aug 15, 2022

Udneowi commented Oct 13, 2022 •

edited

Loading

phofl commented Dec 9, 2022

BUG: Rolling operations on groupby with apply behaves differently from looping the window. #48056

BUG: Rolling operations on groupby with apply behaves differently from looping the window. #48056

Comments

jtho123 commented Aug 12, 2022 • edited Loading

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

phofl commented Aug 15, 2022

Udneowi commented Oct 13, 2022 • edited Loading

phofl commented Dec 9, 2022

jtho123 commented Aug 12, 2022 •

edited

Loading

Udneowi commented Oct 13, 2022 •

edited

Loading