Resample on MultiIndex level takes much more time than on normal column #28635

pgrudzinski · 2019-09-26T14:13:47Z

Code Sample

import pandas as pd
import numpy as np

n=80000
g=5

index = pd.MultiIndex.from_product([
    np.arange(g),
    pd.to_timedelta(np.arange(n), unit='s')
])
data = pd.DataFrame(
    np.random.randint(0,1000,size=(len(index))),
    index=index
)

%timeit data.groupby(level=0).resample('10s',level=1).mean()

# 3.93 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit data.reset_index(1).groupby(level=0).resample('10s',on='level_1').mean()

# 157 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Problem description

resample seem to take much more time when resampling on a level of MultiIndex instead of normal data column. The second, faster approach is more convoluted and is not what first comes to mind.

Expected Output

Both operations should around the same amount of time with second possibly slightly more, because of additional reset_index operation. If the difference is expected than first operation should show warning hinting on optimal solution.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.13
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.7
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

The text was updated successfully, but these errors were encountered:

mroeschke · 2021-07-21T05:03:28Z

Performance looks comparable now. Could use an ASV

In [9]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: n=80000
   ...: g=5
   ...:
   ...: index = pd.MultiIndex.from_product([
   ...:     np.arange(g),
   ...:     pd.to_timedelta(np.arange(n), unit='s')
   ...: ])
   ...: data = pd.DataFrame(
   ...:     np.random.randint(0,1000,size=(len(index))),
   ...:     index=index
   ...: )
   ...:
   ...: %timeit data.groupby(level=0).resample('10s',level=1).mean()
116 ms ± 7.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: %timeit data.reset_index(1).groupby(level=0).resample('10s',on='level_1').mean()
111 ms ± 7.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Matt711 · 2022-03-19T18:38:33Z

Hey, @mroeschke I'm also getting similar performance for both of them. Is there anything else to do here? I'm happy to work on it if there is.

import pandas as pd
import numpy as np

n=80000
g=5

index = pd.MultiIndex.from_product([
    np.arange(g),
    pd.to_timedelta(np.arange(n), unit='s')
])
data = pd.DataFrame(
    np.random.randint(0,1000,size=(len(index))),
    index=index
)

%timeit data.groupby(level=0).resample('10s',level=1).mean()

%timeit data.reset_index(1).groupby(level=0).resample('10s',on='level_1').mean()

104 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
105 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

mroeschke · 2022-03-19T18:51:01Z

Is there anything else to do here?

Yes, to track performance regressions an ASV benchmark needs to be added in asv_bench/benchmarks/groupby.py

krasch · 2022-10-24T08:20:37Z

take

pgrudzinski changed the title ~~Resample on MultiIndex level is much longer than on normal column~~ Resample on MultiIndex level takes much more time than on normal column Oct 6, 2019

jbrockmendel added MultiIndex Performance Memory or execution speed performance Resample resample method labels Oct 16, 2019

mroeschke added Benchmark Performance (ASV) benchmarks good first issue and removed Performance Memory or execution speed performance labels Jul 21, 2021

github-actions bot assigned krasch Oct 24, 2022

phofl mentioned this issue Oct 24, 2022

PERF: GH28635 Add ASV benchmark for resample after groupby #49274

Merged

5 tasks

phofl closed this as completed in #49274 Oct 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resample on MultiIndex level takes much more time than on normal column #28635

Resample on MultiIndex level takes much more time than on normal column #28635

pgrudzinski commented Sep 26, 2019 •

edited

Loading

INSTALLED VERSIONS

mroeschke commented Jul 21, 2021

Matt711 commented Mar 19, 2022

mroeschke commented Mar 19, 2022

krasch commented Oct 24, 2022

Resample on MultiIndex level takes much more time than on normal column #28635

Resample on MultiIndex level takes much more time than on normal column #28635

Comments

pgrudzinski commented Sep 26, 2019 • edited Loading

Code Sample

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

mroeschke commented Jul 21, 2021

Matt711 commented Mar 19, 2022

mroeschke commented Mar 19, 2022

krasch commented Oct 24, 2022

pgrudzinski commented Sep 26, 2019 •

edited

Loading

Output of `pd.show_versions()`