-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64 #23735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I did some digging, seems the difference is coming from the new BaseMultiIndexCodesEngine.get_indexer vs the old MultiIndexHashEngine.get_indexer |
@TomAugspurger There is still a perf regression with just integer levels but it's to a smaller degree. With DatetimeIndex the perf regression was from 0.03 seconds to 1.97 seconds. With just integer levels the perf regression is from 0.03 seconds seconds to 0.40 seconds, so a slowdown of 10x instead of 60x. When I change the dr variable to
I get the following output: pandas version: 0.22.0 |
This issue still persists with the latest 1.2.3 version and reindexing seems to get even slower.
|
I met the same problem, even in the newest version, look forward to a official improvement. |
@shaoyucheng you are welcome to submit a patch |
Thanks for your in depth analysis. Should I interpret the two columns of the screenshot as coming from the two versions? Are the numbers such as 25895 the number of calls of a given method? I'm happy to take a look at what caused the regression but having the stack would help a lot. |
Hi, @toobaz Thanks for the reply! The screenshot is only for the 0.23.4 version. I have located the cause in the code: pandas/pandas/core/indexes/multi.py Line 706 in 3513f59
The conversion is really slow if the index contains a large number of DateTimeIndex. BTW, the conversion to object only explains part of the regression of the performance. The regression exists for all versions later than 0.23.4 Glad that this issue got your interest. Hope this can be fixed soon. |
@jbrockmendel It seems that DateTimeIndex does not require a force conversion? Using @apm582 's example, |
@toobaz I did some further investigation and find out that A minimal example:
The result :
So Line 602 in fd67546
So the conlusion is that:
|
For MultiIndex.values to be correct, DatetimeIndex elements should become Timestamp objects, not np.datetime64. If this is a major source of overhead, then maybe we should construct tuples under a different name that dont cast these to Timestamps.
It correct, this suggests that inherit_names is not working as intended, as that should have pinned _box_values to DTI/TDI. |
I think I have found a solution for the 2nd problem, e.g., slow Working on a patch. |
I believe this can be closed. #46235 seems to have helped here. Using the example in the OP and comparing main to latest 1.4.x:
|
awesome so we have asvs that cover this case? |
I opened #47221 which adds a benchmark that more directly tracks this case. |
closing as fixed (indicated above & asv's added) |
Re-indexing a series with a two-level MultiIndex where the first level is datetime64 and the second level is int is 40x slower than in 0.22.0. Output first then repro code below. The issue persists if you change the first level to int instead of datetime, but the perf difference is less (0.40 seconds vs 0.03 seconds).
The text was updated successfully, but these errors were encountered: