Skip to content

PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64 #23735

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
apm582 opened this issue Nov 16, 2018 · 17 comments · Fixed by #47221 or #55811
Labels
Closing Candidate May be closeable, needs more eyeballs Performance Memory or execution speed performance
Milestone

Comments

@apm582
Copy link

apm582 commented Nov 16, 2018

Re-indexing a series with a two-level MultiIndex where the first level is datetime64 and the second level is int is 40x slower than in 0.22.0. Output first then repro code below. The issue persists if you change the first level to int instead of datetime, but the perf difference is less (0.40 seconds vs 0.03 seconds).

"""
pandas version: 0.23.4
reindex took 1.9770500659942627 seconds

pandas version: 0.22.0
reindex took 0.0306899547577 seconds
"""


import pandas as pd
import time
import numpy as np


if __name__ == '__main__':
    n_days = 300
    dr = pd.date_range(end="20181118", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(1440)])

    v = np.random.randn(len(mi))
    mask = np.random.rand(len(v)) < .3
    v[mask] = np.nan
    s = pd.Series(v, index=mi)
    s = s.sort_index()

    s2 = s.dropna()

    start = time.time()

    s2.reindex(index=s.index)

    end = time.time()
    print("pandas version: %s" % pd.__version__)
    print("reindex took %s seconds" % (end - start))
@apm582
Copy link
Author

apm582 commented Nov 16, 2018

I did some digging, seems the difference is coming from the new BaseMultiIndexCodesEngine.get_indexer vs the old MultiIndexHashEngine.get_indexer

@apm582 apm582 changed the title Performance regression: Reindex in pandas 0.23.4 is 60x slower than in 0.22.0 with a MultiIndex with datetime64 PERF regression: Reindex in pandas 0.23.4 is 60x slower than in 0.22.0 with a MultiIndex with datetime64 Nov 16, 2018
@apm582 apm582 changed the title PERF regression: Reindex in pandas 0.23.4 is 60x slower than in 0.22.0 with a MultiIndex with datetime64 PERF: regression in reindex. Pandas 0.23.4 is 60x slower than in 0.22.0 with a MultiIndex with datetime64 Nov 16, 2018
@apm582 apm582 changed the title PERF: regression in reindex. Pandas 0.23.4 is 60x slower than in 0.22.0 with a MultiIndex with datetime64 PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64 Nov 16, 2018
@TomAugspurger
Copy link
Contributor

cc @toobaz, in case this is related to #19074

@apm582 is a DatetimeIndex for one of the levels necessary to observe the slowdown, or can you reproduce with just integer levels?

@apm582
Copy link
Author

apm582 commented Nov 16, 2018

@TomAugspurger There is still a perf regression with just integer levels but it's to a smaller degree. With DatetimeIndex the perf regression was from 0.03 seconds to 1.97 seconds. With just integer levels the perf regression is from 0.03 seconds seconds to 0.40 seconds, so a slowdown of 10x instead of 60x.

When I change the dr variable to

dr = range(300) # instead of date_range(...)

I get the following output:
pandas version: 0.23.4
reindex took 0.41175103187561035 seconds

pandas version: 0.22.0
reindex took 0.0308949947357 seconds

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Nov 16, 2018
@jbrockmendel jbrockmendel removed the Indexing Related to indexing on series/frames, not to indexes themselves label Oct 15, 2020
@qiuwei
Copy link

qiuwei commented Mar 10, 2021

This issue still persists with the latest 1.2.3 version and reindexing seems to get even slower.

pandas version: 1.2.3
reindex took 2.6638526916503906 seconds

@qiuwei
Copy link

qiuwei commented Apr 20, 2021

image

I did some preliminary profiling.
It seems most time is wasted in the type conversion.

@shaoyucheng
Copy link

I met the same problem, even in the newest version, look forward to a official improvement.

@jreback
Copy link
Contributor

jreback commented Apr 20, 2021

@shaoyucheng you are welcome to submit a patch

@toobaz
Copy link
Member

toobaz commented Apr 21, 2021

image

I did some preliminary profiling.
It seems most time is wasted in the type conversion.

Thanks for your in depth analysis. Should I interpret the two columns of the screenshot as coming from the two versions? Are the numbers such as 25895 the number of calls of a given method?

I'm happy to take a look at what caused the regression but having the stack would help a lot.

@qiuwei
Copy link

qiuwei commented Apr 22, 2021

Hi, @toobaz Thanks for the reply!

The screenshot is only for the 0.23.4 version.
The first column is for the call count. The second column is the total time in ms used for the call. The third column is the own time used for the call.

I have located the cause in the code:

if isinstance(vals.dtype, ExtensionDtype) or isinstance(

The conversion is really slow if the index contains a large number of DateTimeIndex.
I am not sure why the conversion to object is needed, but reindex seems to work for my case if I remove those lines.

BTW, the conversion to object only explains part of the regression of the performance.
After removing the conversion of DataTimeIndex to object, pandas version: 0.23.4
reindex took 0.7682747840881348 seconds, still around 30X times slower than 0.22.

The regression exists for all versions later than 0.23.4
I will do more investigation.

Glad that this issue got your interest. Hope this can be fixed soon.

@qiuwei
Copy link

qiuwei commented Apr 22, 2021

ed15d8e

@jbrockmendel It seems that DateTimeIndex does not require a force conversion?

Using @apm582 's example, hasattr(vals, "_box_values") before the commit actually evaluates to False while
isinstance(vals, (ABCDatetimeIndex, ABCTimedeltaIndex)) evaluates to True. Thus the time consuming force conversion of DataTimeIndex is triggered.

@qiuwei
Copy link

qiuwei commented Apr 22, 2021

@toobaz I did some further investigation and find out that _extract_level_codes is the other major cause of the performance regression.

A minimal example:

import pandas as pd
import time
import numpy as np


if __name__ == '__main__':
    n_days = 2500
    dr = pd.date_range(end="20120101", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(1440)])

    v = np.random.randn(len(mi))
    mask = np.random.rand(len(v)) < .3
    v[mask] = np.nan
    s = pd.Series(v, index=mi)
    s = s.sort_index()

    s2 = s.dropna()

    start = time.time()


    match_seq = s2.index.get_indexer_for(s.index)
    # result = s2.index.values
    # resul2 = s.index.values

    end1 = time.time()
    match_seq = s2.index.get_indexer_for(s.index)
    end2 = time.time()

    s2.index._engine._extract_level_codes(s.index)
    end3 = time.time()
    # print(s2)
    print("pandas version: %s" % pd.__version__)
    print("reindex for the first time(include time cost of populating the hash mapping) took %s seconds" % (end1 - start))
    print("reindex with mapping populated took %s seconds" % (end2 - end1))
    print("extrace level codes takes %s seconds" % (end3 - end2))

The result :

pandas version: 1.1.4
reindex for the first time(include of time cost of populating the hash mapping) took 5.976720809936523 seconds
reindex with mapping populated took 3.858426332473755 seconds
extract level codes takes 3.790076732635498 seconds

So extract_level_codes almost takes the same amount of time as get_indexer_for!
To be more specific, it is the following line which is causing the performance regression:

level_codes = [lev.get_indexer(codes) + 1 for lev, codes

So the conlusion is that:

  1. conversion from datatime to object contributes to the 6X speed difference (This can be easily fixed if ed15d8e can be reverted)
  2. inefficiency of extract_level_codes contributes to other 10X speed difference.

@jbrockmendel
Copy link
Member

@jbrockmendel It seems that DateTimeIndex does not require a force conversion?

For MultiIndex.values to be correct, DatetimeIndex elements should become Timestamp objects, not np.datetime64. If this is a major source of overhead, then maybe we should construct tuples under a different name that dont cast these to Timestamps.

Using @apm582 's example, hasattr(vals, "_box_values") before the commit actually evaluates to False while
isinstance(vals, (ABCDatetimeIndex, ABCTimedeltaIndex)) evaluates to True. Thus the time consuming force conversion of DataTimeIndex is triggered.

It correct, this suggests that inherit_names is not working as intended, as that should have pinned _box_values to DTI/TDI.

@qiuwei
Copy link

qiuwei commented Apr 24, 2021

I think I have found a solution for the 2nd problem, e.g., slow extract_level_codes.

Working on a patch.

@lukemanley
Copy link
Member

I believe this can be closed. #46235 seems to have helped here.

Using the example in the OP and comparing main to latest 1.4.x:

pandas version: 1.5.0.dev0+859.g4791cd75c0
reindex took 0.033670663833618164 seconds
pandas version: 1.4.2+33.gdc52fd6212
reindex took 0.8928627967834473 seconds

@lukemanley lukemanley added the Closing Candidate May be closeable, needs more eyeballs label Jun 3, 2022
@jreback
Copy link
Contributor

jreback commented Jun 3, 2022

awesome

so we have asvs that cover this case?

@lukemanley
Copy link
Member

so we have asvs that cover this case?

I opened #47221 which adds a benchmark that more directly tracks this case.

@jreback jreback added this to the 1.5 milestone Jun 5, 2022
@jreback
Copy link
Contributor

jreback commented Jun 5, 2022

closing as fixed (indicated above & asv's added)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs Performance Memory or execution speed performance
Projects
None yet
9 participants