PERF: merge on sorted MultiIndex #48504

Merged
merged 7 commits into from
Sep 19, 2022

Conversation

lukemanley
Member

@lukemanley lukemanley commented Sep 11, 2022

Perf improvement when merging on a sorted MultiIndex. The improvement comes from avoiding MultiIndex._values.

Overall, the speedup seems to generalize well across dtypes and merge types. The ASVs show one slowdown: an outer merge with a datetime64 index, where the time is spent within MultiIndex._union. Coincidentally, @phofl just opened #48505, which shows a nice improvement for datetimes in MultiIndex._union, so that slowdown might well go away if #48505 is merged.
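For readers who want to reproduce the scenario, here is a minimal sketch of a merge on two sorted, partially overlapping MultiIndexes. It is not the benchmark code from this PR: the helper name `make_sorted_frames`, the sizes `n` and `offset`, and the column names are assumptions chosen for illustration, and the sizes are kept small so it runs quickly.

```python
import numpy as np
import pandas as pd


def make_sorted_frames(n=1_000, offset=500, dtypes=("int64", "int64")):
    """Build two frames indexed by sorted, partially overlapping MultiIndexes.

    Hypothetical helper for illustration; sizes and names are assumptions.
    """

    def build(start):
        values = np.arange(start, start + n)
        return pd.MultiIndex.from_arrays(
            [pd.array(values, dtype=dtypes[0]), pd.array(values, dtype=dtypes[1])]
        )

    df1 = pd.DataFrame({"col1": 1}, index=build(0))
    df2 = pd.DataFrame({"col2": 2}, index=build(offset))
    return df1, df2


df1, df2 = make_sorted_frames()

# Both indexes are sorted, which is the case this PR targets.
assert df1.index.is_monotonic_increasing
assert df2.index.is_monotonic_increasing

# Merge on the index for each merge type covered by the ASVs.
results = {
    how: pd.merge(df1, df2, how=how, left_index=True, right_index=True)
    for how in ("left", "right", "inner", "outer")
}
row_counts = {how: len(res) for how, res in results.items()}
print(row_counts)
```

Passing `dtypes=("datetime64[ns]", "int64")` or `dtypes=("Int64", "Int64")` should exercise the other dtype combinations listed in the timings; wrapping the `pd.merge` call in `timeit` gives a rough local comparison between pandas versions.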

ASVs added:

       before           after         ratio
     [fe9e5d02]       [367fbd89]
     <main>           <perf-merge-with-mulitindexes>
+         259±3ms          383±2ms     1.48  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'outer')
-         101±3ms         85.1±1ms     0.84  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'inner')
-        51.0±1ms       40.6±0.4ms     0.80  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'right')
-      50.6±0.5ms       39.3±0.3ms     0.78  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'left')
-         185±3ms        138±0.4ms     0.75  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'inner')
-        186±20ms          137±1ms     0.73  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'outer')
-      83.0±0.7ms       46.6±0.8ms     0.56  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'inner')
-      36.9±0.3ms       11.4±0.1ms     0.31  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'left')
-        37.6±1ms       11.5±0.2ms     0.31  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'right')
-         146±3ms       14.0±0.1ms     0.10  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'left')
-         146±3ms       11.7±0.4ms     0.08  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'right')

[update]: updated times following the merge of #48505:

       before           after         ratio
     [ac648eea]       [666d3990]
     <perf-merge-with-mulitindexes^2>       <perf-merge-with-mulitindexes>
-       103±0.4ms         85.2±2ms     0.83  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'inner')
-         212±1ms          165±2ms     0.78  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'outer')
-      53.4±0.9ms       41.2±0.5ms     0.77  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'right')
-      52.4±0.4ms       40.1±0.7ms     0.76  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'left')
-         186±2ms        139±0.7ms     0.75  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'inner')
-         265±3ms        184±0.8ms     0.69  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'outer')
-      85.5±0.4ms         48.3±2ms     0.57  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'inner')
-         196±3ms       78.9±0.5ms     0.40  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'outer')
-      40.7±0.4ms         14.0±3ms     0.34  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'right')
-        40.6±3ms       12.1±0.5ms     0.30  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'left')
-       150±0.6ms       12.7±0.2ms     0.09  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'right')
-         153±5ms       12.2±0.3ms     0.08  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'left')

@lukemanley lukemanley added the Performance, Reshaping, and MultiIndex labels Sep 11, 2022
Member

@phofl phofl left a comment

Is this faster in all cases, i.e. across different dtypes?

@lukemanley
Member Author

I just expanded the ASVs to cover additional dtypes and merge types. See the updated timings and the commentary re: MultiIndex._union.

@lukemanley
Member Author

@phofl - see updated times in the summary

Member

@phofl phofl left a comment

lgtm

@mroeschke mroeschke added this to the 1.6 milestone Sep 19, 2022
@mroeschke mroeschke merged commit 81b5f1d into pandas-dev:main Sep 19, 2022
@mroeschke
Member

Thanks @lukemanley

@lukemanley lukemanley deleted the perf-merge-with-mulitindexes branch September 24, 2022 00:48
@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* merge on sorted multiindex performance

* whatsnew

* faster asv

* additional asv cases

* avoid going through multi._values