
PERF: Using Series or DataFrame in grouped iterator leads to significant increase in run time #55541


Closed
3 tasks done
koizumihiroo opened this issue Oct 16, 2023 · 1 comment
Labels
Duplicate Report · Groupby · Performance · Regression
Milestone
2.1.2
Comments

@koizumihiroo

koizumihiroo commented Oct 16, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this issue exists on the latest version of pandas.
  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import random
import time
import pandas as pd

n, m = 20_000, 4  # 20_000, 10
ids = random.sample(list(range(n)) * m, n * m)
df = pd.DataFrame({"id": ids, "value": range(n * m)})

pd.options.mode.copy_on_write = True  # toggling this does not affect the slowdown

ITER_CHECKPOINT = 5_000

def measure(callback):
    out = []  # keep a reference to each result alive across iterations
    timer = time.time()
    for i, (_, g) in enumerate(df.groupby("id")):
        out.append(callback(g))
        if i % ITER_CHECKPOINT == ITER_CHECKPOINT - 1:
            print(i, time.time() - timer)
            timer = time.time()
    return out

if __name__ == '__main__':
    print(f"pandas {pd.__version__}")

    print("pure loop")  # OK
    measure(lambda g: ...)

    print("method access")  # OK
    measure(lambda g: g["value"].sum())

    print("Series")
    measure(lambda g: g["value"])  # increases runtime

    print("DataFrame")
    measure(lambda g: g)  # increases runtime

Starting with pandas 2.1.1, storing the Series or DataFrame yielded by a groupby iterator makes each iteration progressively slower, so total run time grows superlinearly with the number of groups.

$ python main.py
pandas 2.1.1
pure loop
4999 0.06582403182983398
9999 0.05802297592163086
14999 0.056886911392211914
19999 0.05962705612182617
method access
4999 0.30263376235961914
9999 0.3000309467315674
14999 0.30356383323669434
19999 0.30495715141296387
Series
4999 1.2764630317687988
9999 5.174383163452148
14999 11.724796772003174
19999 21.37700319290161
DataFrame
4999 0.7938072681427002
9999 2.8290090560913086
14999 5.989538908004761
19999 8.968477725982666

The regression is still present on the main branch (pandas 2.2.0.dev0+383.g746e5eee860):

$ python main.py 
pandas 2.2.0.dev0+383.g746e5eee860
pure loop
4999 0.07438325881958008
9999 0.06659722328186035
14999 0.07559394836425781
19999 0.07405471801757812
method access
4999 0.32605791091918945
9999 0.3061239719390869
14999 0.39515018463134766
19999 0.4200468063354492
Series
4999 1.4171099662780762
9999 4.541456937789917
14999 10.480732202529907
19999 21.792606115341187
DataFrame
4999 0.8137969970703125
9999 2.920412063598633
14999 5.8703453540802
19999 9.294758081436157
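A possible mitigation until the regression is fixed (my sketch, not from the thread, and assuming the slowdown comes from the stored groups keeping references back to the parent DataFrame): detach each group with `.copy()` before storing it.

```python
# Hypothetical workaround sketch: copy each group instead of storing the
# view returned by the groupby iterator, so no reference to the parent
# DataFrame's internals is retained across iterations.
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2], "value": [10, 20, 30, 40]})

groups = []
for _, g in df.groupby("id"):
    # storing the raw `g` (or `g["value"]`) is what triggers the slowdown
    # reported above; .copy() breaks the link to the parent.
    groups.append(g["value"].copy())

print(sum(s.sum() for s in groups))  # 100
```

Whether this restores the 2.1.0 timings depends on the actual root cause, so treat it as something to try rather than a confirmed fix.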

Installed Versions

>>> pd.show_versions()

INSTALLED VERSIONS

commit : e86ed37
python : 3.11.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.49-linuxkit-pr
Version : #1 SMP Thu May 25 07:17:40 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.1
numpy : 1.26.1
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 23.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

# pip install pandas==2.1.0

pandas 2.1.0
pure loop
4999 0.0665280818939209
9999 0.0554051399230957
14999 0.05495023727416992
19999 0.0611119270324707
method access
4999 0.31636691093444824
9999 0.306002140045166
14999 0.32291364669799805
19999 0.294874906539917
Series
4999 0.21496009826660156
9999 0.16458582878112793
14999 0.19824504852294922
19999 0.21349406242370605
DataFrame
4999 0.08517193794250488
9999 0.1010444164276123
14999 0.06780314445495605
19999 0.11310791969299316
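An informal way to read the checkpoint timings above: in the healthy cases each 5,000-group checkpoint takes roughly constant time, while in the regressed cases each checkpoint is slower than the last. Ratios between consecutive checkpoints make this visible at a glance (numbers below are the v2.1.1 timings reported earlier):

```python
# Ratios between consecutive checkpoint timings: values near 1.0 suggest
# linear total cost; steadily elevated ratios suggest superlinear growth.
series_checkpoints = [1.276, 5.174, 11.725, 21.377]  # "Series" case, v2.1.1
pure_loop = [0.066, 0.058, 0.057, 0.060]             # "pure loop" case, v2.1.1

def ratios(ts):
    return [round(b / a, 2) for a, b in zip(ts, ts[1:])]

print(ratios(series_checkpoints))  # each checkpoint several times the previous
print(ratios(pure_loop))           # roughly flat
```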
@koizumihiroo koizumihiroo added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Oct 16, 2023
@rhshadrach rhshadrach added Groupby Regression Functionality that used to work in a prior pandas version labels Oct 19, 2023
@rhshadrach rhshadrach added this to the 2.1.2 milestone Oct 19, 2023
@rhshadrach
Member

rhshadrach commented Oct 19, 2023

Edit: I thought #55518 was merged 🤦. With it applied, I'm seeing

DataFrame - 19999
0.07098388671875    # with #55518
0.06751871109008789 # v2.1.0

Closing as a duplicate of #55256.


I'm seeing:

DataFrame - 19999
2.975979804992676  # main
0.06751871109008789 # v2.1.0

My first guess was this is similar to #55518, but since this hasn't improved much, I'm guessing it is not. Still need to run a git bisect.

cc @phofl

@rhshadrach rhshadrach added Duplicate Report Duplicate issue or pull request and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 19, 2023