BUG: groupby sort issue when using rolling #50296

domjanbaric · 2022-12-16T12:56:36Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Simpler reproducible example

index = pd.MultiIndex.from_product(
    [[580, 546], [520, 655, 529], pd.date_range(dt.date(2020, 11, 1), dt.date(2020, 11, 3))],
    names=["machine_id",'product_id', "sample_date"],
)
df = pd.DataFrame(
    data={'sales_qty': np.random.random(index.shape[0])},
    index=index,
).reset_index()
df = df.sort_values(by=['machine_id', 'product_id', 'sample_date'])
df = df[df['machine_id'].ne(546) | df['product_id'].ne(529)]
result = df.groupby(['machine_id','product_id'], sort=False)['sales_qty'].rolling(2, center=True).sum()
print(result)

Reproducible Example

import pandas as pd

df = pd.MultiIndex.from_product(
    [[580,546], [519,520,527,526,528,655,526,529,532],pd.date_range(dt.date(2020, 11, 1), dt.date(2022, 10, 31))],
    names=["machine_id",'product_id', "sample_date"],
)
df = pd.DataFrame(index=df).reset_index()
df["sample_date"] = pd.to_datetime(df["sample_date"]).dt.date
df['sales_qty']=np.random.random(df.shape[0])

df=df.sort_values(by=['machine_id','product_id','sample_date'])
df=df.reset_index(drop=True)
df=df[~((df['machine_id']==546) & (df['product_id'].isin([526,529,532])))]
df = df.groupby(['machine_id','product_id'],sort=False)['sales_qty'].rolling(7, center=True).sum().reset_index()[['machine_id','product_id']].drop_duplicates()
df

Issue Description

When using pandas groupby with rolling, flag sort in groupby, sort order of group keys is changed. This occurs when we have more than 1 group key and values in key 2 are not equal for all values of key 1.

In example above we create dataframe with 4 columns, machine_id, product_id, sample_date and sales_qty. Dataframe consist of 2 different machine and 9 different products. Dataframe is sorted by: machine_id, product_id, sample_date. We want to calculate rolling mean for each machine_id-product_id group. However, if we remove some products for machine with lower machine_id (machine_id which is prior in sort, in this example 546) and then calculate rolling mean with flag sort= False in groupby order of group keys in result is actually changed. product_ids which were removed from machine with machine_id = 546 (product_id = 526,529,532) are at the end of dataframe.

    machine_id 	product_id
0 	546 	519
730 	546 	520
1460 	546 	527
2190 	546 	528
2920 	546 	655
3650 	580 	519
4380 	580 	520
5110 	580 	527
5840 	580 	528
6570 	580 	655
7300 	580 	526    <---
8760 	580 	529    <---
9490 	580 	532    <---

Expected Behavior

Because flag sort = False in groupby expected behaviour is that result will have same sort as the original dataframe, i.e. expected result is:

    machine_id 	product_id
0 	546 	519
730 	546 	520
1460 	546 	527
2190 	546 	528
2920 	546 	655
3650 	580 	519
4380 	580 	520
5110 	580 	526    <---
6570 	580 	527
7300 	580 	528
8030 	580 	529    <---
8760 	580 	532    <---
9490 	580 	655

Installed Versions

INSTALLED VERSIONS

commit : 91111fd
python : 3.10.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-56-generic
Version : #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 63.2.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.3
numba : 0.56.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.42
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

The text was updated successfully, but these errors were encountered:

rhshadrach · 2022-12-18T13:42:03Z

Thanks for the report @domjanbaric - can you add an informative title to this issue

sqali · 2023-07-03T11:57:09Z

Hi @rhshadrach ,

I would love to work on this issue. Let me go through the issue thoroughly and provide a solution

sqali · 2023-07-03T11:57:13Z

take

rhshadrach · 2024-01-14T14:56:37Z

Confirmed on main. I've simplified the reproducer in the OP. Further investigations and PRs to fix are welcome.

domjanbaric added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 16, 2022

rhshadrach added Groupby Window rolling, ewma, expanding labels Dec 18, 2022

domjanbaric changed the title ~~BUG:~~ BUG: groupby sort issue when using rolling Dec 19, 2022

github-actions bot assigned sqali Jul 3, 2023

rhshadrach added Sorting e.g. sort_index, sort_values and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 14, 2024

hsorsky mentioned this issue Apr 2, 2024

BUG: Inconsistent RollingGroupby Behaviour #58124

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: groupby sort issue when using rolling #50296

BUG: groupby sort issue when using rolling #50296

domjanbaric commented Dec 16, 2022 •

edited by rhshadrach

Loading

INSTALLED VERSIONS

rhshadrach commented Dec 18, 2022

sqali commented Jul 3, 2023

sqali commented Jul 3, 2023

rhshadrach commented Jan 14, 2024

BUG: groupby sort issue when using rolling #50296

BUG: groupby sort issue when using rolling #50296

Comments

domjanbaric commented Dec 16, 2022 • edited by rhshadrach Loading

Pandas version checks

Simpler reproducible example

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

rhshadrach commented Dec 18, 2022

sqali commented Jul 3, 2023

sqali commented Jul 3, 2023

rhshadrach commented Jan 14, 2024

domjanbaric commented Dec 16, 2022 •

edited by rhshadrach

Loading