PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64 #23735

apm582 · 2018-11-16T06:11:54Z

Re-indexing a series with a two-level MultiIndex where the first level is datetime64 and the second level is int is 40x slower than in 0.22.0. Output first then repro code below. The issue persists if you change the first level to int instead of datetime, but the perf difference is less (0.40 seconds vs 0.03 seconds).

"""
pandas version: 0.23.4
reindex took 1.9770500659942627 seconds

pandas version: 0.22.0
reindex took 0.0306899547577 seconds
"""


import pandas as pd
import time
import numpy as np


if __name__ == '__main__':
    n_days = 300
    dr = pd.date_range(end="20181118", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(1440)])

    v = np.random.randn(len(mi))
    mask = np.random.rand(len(v)) < .3
    v[mask] = np.nan
    s = pd.Series(v, index=mi)
    s = s.sort_index()

    s2 = s.dropna()

    start = time.time()

    s2.reindex(index=s.index)

    end = time.time()
    print("pandas version: %s" % pd.__version__)
    print("reindex took %s seconds" % (end - start))

apm582 · 2018-11-16T07:05:58Z

I did some digging, seems the difference is coming from the new BaseMultiIndexCodesEngine.get_indexer vs the old MultiIndexHashEngine.get_indexer

TomAugspurger · 2018-11-16T12:55:57Z

cc @toobaz, in case this is related to #19074

@apm582 is a DatetimeIndex for one of the levels necessary to observe the slowdown, or can you reproduce with just integer levels?

apm582 · 2018-11-16T13:44:29Z

@TomAugspurger There is still a perf regression with just integer levels but it's to a smaller degree. With DatetimeIndex the perf regression was from 0.03 seconds to 1.97 seconds. With just integer levels the perf regression is from 0.03 seconds seconds to 0.40 seconds, so a slowdown of 10x instead of 60x.

When I change the dr variable to

dr = range(300) # instead of date_range(...)

I get the following output:
pandas version: 0.23.4
reindex took 0.41175103187561035 seconds

pandas version: 0.22.0
reindex took 0.0308949947357 seconds

qiuwei · 2021-03-10T09:22:45Z

This issue still persists with the latest 1.2.3 version and reindexing seems to get even slower.

pandas version: 1.2.3
reindex took 2.6638526916503906 seconds

qiuwei · 2021-04-20T06:49:20Z

I did some preliminary profiling.
It seems most time is wasted in the type conversion.

shaoyucheng · 2021-04-20T08:17:25Z

I met the same problem, even in the newest version, look forward to a official improvement.

jreback · 2021-04-20T10:48:06Z

@shaoyucheng you are welcome to submit a patch

toobaz · 2021-04-21T06:40:27Z

I did some preliminary profiling.
It seems most time is wasted in the type conversion.

Thanks for your in depth analysis. Should I interpret the two columns of the screenshot as coming from the two versions? Are the numbers such as 25895 the number of calls of a given method?

I'm happy to take a look at what caused the regression but having the stack would help a lot.

qiuwei · 2021-04-22T03:01:21Z

Hi, @toobaz Thanks for the reply!

The screenshot is only for the 0.23.4 version.
The first column is for the call count. The second column is the total time in ms used for the call. The third column is the own time used for the call.

I have located the cause in the code:

pandas/pandas/core/indexes/multi.py

Line 706 in 3513f59

if isinstance(vals.dtype, ExtensionDtype) or isinstance(

The conversion is really slow if the index contains a large number of DateTimeIndex.
I am not sure why the conversion to object is needed, but reindex seems to work for my case if I remove those lines.

BTW, the conversion to object only explains part of the regression of the performance.
After removing the conversion of DataTimeIndex to object, pandas version: 0.23.4
reindex took 0.7682747840881348 seconds, still around 30X times slower than 0.22.

The regression exists for all versions later than 0.23.4
I will do more investigation.

Glad that this issue got your interest. Hope this can be fixed soon.

qiuwei · 2021-04-22T03:53:31Z

ed15d8e

@jbrockmendel It seems that DateTimeIndex does not require a force conversion?

Using @apm582 's example, hasattr(vals, "_box_values") before the commit actually evaluates to False while
isinstance(vals, (ABCDatetimeIndex, ABCTimedeltaIndex)) evaluates to True. Thus the time consuming force conversion of DataTimeIndex is triggered.

qiuwei · 2021-04-22T08:21:46Z

@toobaz I did some further investigation and find out that _extract_level_codes is the other major cause of the performance regression.

A minimal example:

import pandas as pd
import time
import numpy as np


if __name__ == '__main__':
    n_days = 2500
    dr = pd.date_range(end="20120101", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(1440)])

    v = np.random.randn(len(mi))
    mask = np.random.rand(len(v)) < .3
    v[mask] = np.nan
    s = pd.Series(v, index=mi)
    s = s.sort_index()

    s2 = s.dropna()

    start = time.time()


    match_seq = s2.index.get_indexer_for(s.index)
    # result = s2.index.values
    # resul2 = s.index.values

    end1 = time.time()
    match_seq = s2.index.get_indexer_for(s.index)
    end2 = time.time()

    s2.index._engine._extract_level_codes(s.index)
    end3 = time.time()
    # print(s2)
    print("pandas version: %s" % pd.__version__)
    print("reindex for the first time(include time cost of populating the hash mapping) took %s seconds" % (end1 - start))
    print("reindex with mapping populated took %s seconds" % (end2 - end1))
    print("extrace level codes takes %s seconds" % (end3 - end2))

The result :

pandas version: 1.1.4
reindex for the first time(include of time cost of populating the hash mapping) took 5.976720809936523 seconds
reindex with mapping populated took 3.858426332473755 seconds
extract level codes takes 3.790076732635498 seconds

So extract_level_codes almost takes the same amount of time as get_indexer_for!
To be more specific, it is the following line which is causing the performance regression:

pandas/pandas/_libs/index.pyx

Line 602 in fd67546

level_codes = [lev.get_indexer(codes) + 1 for lev, codes

So the conlusion is that:

conversion from datatime to object contributes to the 6X speed difference (This can be easily fixed if ed15d8e can be reverted)
inefficiency of extract_level_codes contributes to other 10X speed difference.

jbrockmendel · 2021-04-22T21:05:22Z

@jbrockmendel It seems that DateTimeIndex does not require a force conversion?

For MultiIndex.values to be correct, DatetimeIndex elements should become Timestamp objects, not np.datetime64. If this is a major source of overhead, then maybe we should construct tuples under a different name that dont cast these to Timestamps.

Using @apm582 's example, hasattr(vals, "_box_values") before the commit actually evaluates to False while
isinstance(vals, (ABCDatetimeIndex, ABCTimedeltaIndex)) evaluates to True. Thus the time consuming force conversion of DataTimeIndex is triggered.

It correct, this suggests that inherit_names is not working as intended, as that should have pinned _box_values to DTI/TDI.

qiuwei · 2021-04-24T13:32:40Z

I think I have found a solution for the 2nd problem, e.g., slow extract_level_codes.

Working on a patch.

lukemanley · 2022-06-03T01:18:20Z

I believe this can be closed. #46235 seems to have helped here.

Using the example in the OP and comparing main to latest 1.4.x:

pandas version: 1.5.0.dev0+859.g4791cd75c0
reindex took 0.033670663833618164 seconds

pandas version: 1.4.2+33.gdc52fd6212
reindex took 0.8928627967834473 seconds

jreback · 2022-06-03T01:58:06Z

awesome

so we have asvs that cover this case?

lukemanley · 2022-06-04T02:45:14Z

so we have asvs that cover this case?

I opened #47221 which adds a benchmark that more directly tracks this case.

jreback · 2022-06-05T22:26:26Z

closing as fixed (indicated above & asv's added)

apm582 changed the title ~~Performance regression: Reindex in pandas 0.23.4 is 60x slower than in 0.22.0 with a MultiIndex with datetime64~~ PERF regression: Reindex in pandas 0.23.4 is 60x slower than in 0.22.0 with a MultiIndex with datetime64 Nov 16, 2018

apm582 changed the title ~~PERF regression: Reindex in pandas 0.23.4 is 60x slower than in 0.22.0 with a MultiIndex with datetime64~~ PERF: regression in reindex. Pandas 0.23.4 is 60x slower than in 0.22.0 with a MultiIndex with datetime64 Nov 16, 2018

apm582 changed the title ~~PERF: regression in reindex. Pandas 0.23.4 is 60x slower than in 0.22.0 with a MultiIndex with datetime64~~ PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64 Nov 16, 2018

gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Nov 16, 2018

jbrockmendel removed the Indexing Related to indexing on series/frames, not to indexes themselves label Oct 15, 2020

qiuwei mentioned this issue Apr 20, 2021

Extreme performance issue in pandas 1.0.3 when setting a new column with DatetimeIndex #34531

Closed

qiuwei mentioned this issue Apr 21, 2021

New engine for MultiIndex? #18519

Closed

qiuwei mentioned this issue Apr 22, 2021

CLN: make DTA/TDA check clearer #33665

Merged

qiuwei mentioned this issue Apr 27, 2021

WIP: avoid zip and unzip to improve the speed of get_indexer #41177

Closed

4 tasks

lukemanley added the Closing Candidate May be closeable, needs more eyeballs label Jun 3, 2022

lukemanley mentioned this issue Jun 4, 2022

ASV: add benchmark for reindex with MultiIndex containing dates #47221

Merged

2 tasks

jreback closed this as completed in #47221 Jun 5, 2022

jreback added this to the 1.5 milestone Jun 5, 2022

brynpickering mentioned this issue Sep 26, 2023

Fix long run time and reintroduce timing benchmark test arup-group/pam#258

Merged

3 tasks

morotti mentioned this issue Nov 3, 2023

PERF: avoid unnecessary method call in get_indexer_non_unique() on MultiIndex #55811

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64 #23735

PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64 #23735

apm582 commented Nov 16, 2018 •

edited

Loading

apm582 commented Nov 16, 2018

TomAugspurger commented Nov 16, 2018

apm582 commented Nov 16, 2018

qiuwei commented Mar 10, 2021 •

edited

Loading

qiuwei commented Apr 20, 2021

shaoyucheng commented Apr 20, 2021

jreback commented Apr 20, 2021

toobaz commented Apr 21, 2021

qiuwei commented Apr 22, 2021

qiuwei commented Apr 22, 2021 •

edited

Loading

qiuwei commented Apr 22, 2021 •

edited

Loading

jbrockmendel commented Apr 22, 2021

qiuwei commented Apr 24, 2021 •

edited

Loading

lukemanley commented Jun 3, 2022

jreback commented Jun 3, 2022

lukemanley commented Jun 4, 2022

jreback commented Jun 5, 2022

PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64 #23735

PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64 #23735

Comments

apm582 commented Nov 16, 2018 • edited Loading

apm582 commented Nov 16, 2018

TomAugspurger commented Nov 16, 2018

apm582 commented Nov 16, 2018

qiuwei commented Mar 10, 2021 • edited Loading

qiuwei commented Apr 20, 2021

shaoyucheng commented Apr 20, 2021

jreback commented Apr 20, 2021

toobaz commented Apr 21, 2021

qiuwei commented Apr 22, 2021

qiuwei commented Apr 22, 2021 • edited Loading

qiuwei commented Apr 22, 2021 • edited Loading

jbrockmendel commented Apr 22, 2021

qiuwei commented Apr 24, 2021 • edited Loading

lukemanley commented Jun 3, 2022

jreback commented Jun 3, 2022

lukemanley commented Jun 4, 2022

jreback commented Jun 5, 2022

apm582 commented Nov 16, 2018 •

edited

Loading

qiuwei commented Mar 10, 2021 •

edited

Loading

qiuwei commented Apr 22, 2021 •

edited

Loading

qiuwei commented Apr 22, 2021 •

edited

Loading

qiuwei commented Apr 24, 2021 •

edited

Loading