Skip to content

Resample on MultiIndex level takes much more time than on normal column #28635

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pgrudzinski opened this issue Sep 26, 2019 · 4 comments · Fixed by #49274
Closed

Resample on MultiIndex level takes much more time than on normal column #28635

pgrudzinski opened this issue Sep 26, 2019 · 4 comments · Fixed by #49274
Assignees
Labels
Benchmark Performance (ASV) benchmarks good first issue MultiIndex Resample resample method

Comments

@pgrudzinski
Copy link

pgrudzinski commented Sep 26, 2019

Code Sample

import pandas as pd
import numpy as np

n=80000
g=5

index = pd.MultiIndex.from_product([
    np.arange(g),
    pd.to_timedelta(np.arange(n), unit='s')
])
data = pd.DataFrame(
    np.random.randint(0,1000,size=(len(index))),
    index=index
)

%timeit data.groupby(level=0).resample('10s',level=1).mean()

# 3.93 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit data.reset_index(1).groupby(level=0).resample('10s',on='level_1').mean()

# 157 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Problem description

resample seem to take much more time when resampling on a level of MultiIndex instead of normal data column. The second, faster approach is more convoluted and is not what first comes to mind.

Expected Output

Both operations should around the same amount of time with second possibly slightly more, because of additional reset_index operation. If the difference is expected than first operation should show warning hinting on optimal solution.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.13
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.7
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

@pgrudzinski pgrudzinski changed the title Resample on MultiIndex level is much longer than on normal column Resample on MultiIndex level takes much more time than on normal column Oct 6, 2019
@jbrockmendel jbrockmendel added MultiIndex Performance Memory or execution speed performance Resample resample method labels Oct 16, 2019
@mroeschke
Copy link
Member

Performance looks comparable now. Could use an ASV

In [9]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: n=80000
   ...: g=5
   ...:
   ...: index = pd.MultiIndex.from_product([
   ...:     np.arange(g),
   ...:     pd.to_timedelta(np.arange(n), unit='s')
   ...: ])
   ...: data = pd.DataFrame(
   ...:     np.random.randint(0,1000,size=(len(index))),
   ...:     index=index
   ...: )
   ...:
   ...: %timeit data.groupby(level=0).resample('10s',level=1).mean()
116 ms ± 7.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: %timeit data.reset_index(1).groupby(level=0).resample('10s',on='level_1').mean()
111 ms ± 7.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@mroeschke mroeschke added Benchmark Performance (ASV) benchmarks good first issue and removed Performance Memory or execution speed performance labels Jul 21, 2021
@Matt711
Copy link

Matt711 commented Mar 19, 2022

Hey, @mroeschke I'm also getting similar performance for both of them. Is there anything else to do here? I'm happy to work on it if there is.

import pandas as pd
import numpy as np

n=80000
g=5

index = pd.MultiIndex.from_product([
    np.arange(g),
    pd.to_timedelta(np.arange(n), unit='s')
])
data = pd.DataFrame(
    np.random.randint(0,1000,size=(len(index))),
    index=index
)

%timeit data.groupby(level=0).resample('10s',level=1).mean()

%timeit data.reset_index(1).groupby(level=0).resample('10s',on='level_1').mean()
104 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
105 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@mroeschke
Copy link
Member

Is there anything else to do here?

Yes, to track performance regressions an ASV benchmark needs to be added in asv_bench/benchmarks/groupby.py

@krasch
Copy link
Contributor

krasch commented Oct 24, 2022

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Benchmark Performance (ASV) benchmarks good first issue MultiIndex Resample resample method
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants