Skip to content

BUG: groupby sort issue when using rolling #50296

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks done
domjanbaric opened this issue Dec 16, 2022 · 4 comments
Open
3 tasks done

BUG: groupby sort issue when using rolling #50296

domjanbaric opened this issue Dec 16, 2022 · 4 comments
Assignees
Labels
Bug Groupby Sorting e.g. sort_index, sort_values Window rolling, ewma, expanding

Comments

@domjanbaric
Copy link

domjanbaric commented Dec 16, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Simpler reproducible example

index = pd.MultiIndex.from_product(
    [[580, 546], [520, 655, 529], pd.date_range(dt.date(2020, 11, 1), dt.date(2020, 11, 3))],
    names=["machine_id",'product_id', "sample_date"],
)
df = pd.DataFrame(
    data={'sales_qty': np.random.random(index.shape[0])},
    index=index,
).reset_index()
df = df.sort_values(by=['machine_id', 'product_id', 'sample_date'])
df = df[df['machine_id'].ne(546) | df['product_id'].ne(529)]
result = df.groupby(['machine_id','product_id'], sort=False)['sales_qty'].rolling(2, center=True).sum()
print(result)

Reproducible Example

import pandas as pd

df = pd.MultiIndex.from_product(
    [[580,546], [519,520,527,526,528,655,526,529,532],pd.date_range(dt.date(2020, 11, 1), dt.date(2022, 10, 31))],
    names=["machine_id",'product_id', "sample_date"],
)
df = pd.DataFrame(index=df).reset_index()
df["sample_date"] = pd.to_datetime(df["sample_date"]).dt.date
df['sales_qty']=np.random.random(df.shape[0])

df=df.sort_values(by=['machine_id','product_id','sample_date'])
df=df.reset_index(drop=True)
df=df[~((df['machine_id']==546) & (df['product_id'].isin([526,529,532])))]
df = df.groupby(['machine_id','product_id'],sort=False)['sales_qty'].rolling(7, center=True).sum().reset_index()[['machine_id','product_id']].drop_duplicates()
df

Issue Description

When using pandas groupby with rolling, flag sort in groupby, sort order of group keys is changed. This occurs when we have more than 1 group key and values in key 2 are not equal for all values of key 1.

In example above we create dataframe with 4 columns, machine_id, product_id, sample_date and sales_qty. Dataframe consist of 2 different machine and 9 different products. Dataframe is sorted by: machine_id, product_id, sample_date. We want to calculate rolling mean for each machine_id-product_id group. However, if we remove some products for machine with lower machine_id (machine_id which is prior in sort, in this example 546) and then calculate rolling mean with flag sort= False in groupby order of group keys in result is actually changed. product_ids which were removed from machine with machine_id = 546 (product_id = 526,529,532) are at the end of dataframe.

    machine_id 	product_id
0 	546 	519
730 	546 	520
1460 	546 	527
2190 	546 	528
2920 	546 	655
3650 	580 	519
4380 	580 	520
5110 	580 	527
5840 	580 	528
6570 	580 	655
7300 	580 	526    <---
8760 	580 	529    <---
9490 	580 	532    <---

Expected Behavior

Because flag sort = False in groupby expected behaviour is that result will have same sort as the original dataframe, i.e. expected result is:

    machine_id 	product_id
0 	546 	519
730 	546 	520
1460 	546 	527
2190 	546 	528
2920 	546 	655
3650 	580 	519
4380 	580 	520
5110 	580 	526    <---
6570 	580 	527
7300 	580 	528
8030 	580 	529    <---
8760 	580 	532    <---
9490 	580 	655

Installed Versions

INSTALLED VERSIONS

commit : 91111fd
python : 3.10.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-56-generic
Version : #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 63.2.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.3
numba : 0.56.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.42
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@domjanbaric domjanbaric added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 16, 2022
@rhshadrach
Copy link
Member

Thanks for the report @domjanbaric - can you add an informative title to this issue

@rhshadrach rhshadrach added Groupby Window rolling, ewma, expanding labels Dec 18, 2022
@domjanbaric domjanbaric changed the title BUG: BUG: groupby sort issue when using rolling Dec 19, 2022
@sqali
Copy link
Contributor

sqali commented Jul 3, 2023

Hi @rhshadrach ,

I would love to work on this issue. Let me go through the issue thoroughly and provide a solution

@sqali
Copy link
Contributor

sqali commented Jul 3, 2023

take

@rhshadrach
Copy link
Member

Confirmed on main. I've simplified the reproducer in the OP. Further investigations and PRs to fix are welcome.

@rhshadrach rhshadrach added Sorting e.g. sort_index, sort_values and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Sorting e.g. sort_index, sort_values Window rolling, ewma, expanding
Projects
None yet
Development

No branches or pull requests

3 participants