REF/PERF: MultiIndex.get_locs to use boolean arrays internally #46330

Merged
merged 9 commits into from
Mar 18, 2022

Conversation

lukemanley
Member

@lukemanley lukemanley commented Mar 11, 2022

Use boolean arrays internally within MultiIndex.get_locs rather than int64 indexes. Combining boolean masks with logical operations is faster than intersecting int64 indexes, as the benchmarks below show. The output remains an integer positional indexer.

       before           after         ratio
     [17dda440]       [94121581]
     <main>           <multiindex-get-locs-bool-arrays>
-        563±10μs          519±8μs     0.92  indexing.MultiIndexing.time_loc_all_scalars(True)
-      33.9±0.6ms       30.4±0.4ms     0.89  indexing.MultiIndexing.time_loc_all_null_slices(True)
-        38.7±1ms         34.5±1ms     0.89  indexing.MultiIndexing.time_loc_all_null_slices(False)
-     1.62±0.02ms      1.43±0.01ms     0.88  indexing.MultiIndexing.time_loc_all_slices(True)
-     6.40±0.06ms       5.53±0.1ms     0.86  indexing.MultiIndexing.time_loc_all_bool_indexers(True)
-         107±1ms       41.6±0.6ms     0.39  indexing.MultiIndexing.time_loc_all_lists(True)
-      34.6±0.8ms       8.24±0.4ms     0.24  indexing.MultiIndexing.time_loc_all_slices(False)
-         236±4ms       23.6±0.2ms     0.10  indexing.MultiIndexing.time_loc_all_lists(False)
-      97.3±0.7ms       9.16±0.4ms     0.09  indexing.MultiIndexing.time_loc_null_slice_plus_slice(False)
-      36.5±0.5ms      1.36±0.03ms     0.04  indexing.MultiIndexing.time_loc_null_slice_plus_slice(True)
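
To illustrate the idea in the description, here is a minimal NumPy sketch of the two strategies. It is not the pandas implementation: the array names and the 10-row size are made up for the example, and it only shows that AND-ing boolean masks yields the same integer positions as intersecting int64 position arrays.

```python
import numpy as np

# Hypothetical setup: a 10-row index and the positions matched on two levels.
n = 10
level0_pos = np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)
level1_pos = np.array([0, 2, 4, 6, 8], dtype=np.int64)

# Integer approach: intersect the two position arrays.
int_result = np.intersect1d(level0_pos, level1_pos)

# Boolean approach: build one mask per level and combine them with a logical AND.
level0_mask = np.zeros(n, dtype=bool)
level0_mask[level0_pos] = True
level1_mask = np.zeros(n, dtype=bool)
level1_mask[level1_pos] = True
combined = level0_mask & level1_mask

# Convert back to an integer positional indexer only at the end.
bool_result = combined.nonzero()[0]
```

Both `int_result` and `bool_result` hold the same positions; the boolean path avoids the sorting work that an intersection typically needs.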

@lukemanley lukemanley added the Performance, Refactor, MultiIndex, and Indexing labels Mar 11, 2022
@jreback jreback added this to the 1.5 milestone Mar 11, 2022
    # if we have a provided indexer, then this need not consider
    # the entire labels set
    if step is not None and step < 0:
        # Switch elements for negative step size
        start, stop = stop - 1, start - 1
    r = np.arange(start, stop, step)
Contributor

Can you add an explanation (similar to the one below), say around L3160, so that a future reader can understand what this algorithm is doing?
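
One way to read the negative-step branch in the snippet under review: for positions in the half-open range [start, stop), walking backwards has to begin at `stop - 1` and end just before `start - 1`. The sketch below is a standalone illustration, not the pandas code; `slice_positions` is a hypothetical helper name.

```python
import numpy as np

def slice_positions(start, stop, step=None):
    """Positions selected from the half-open range [start, stop) with `step`.

    Hypothetical helper mirroring the snippet under review.
    """
    if step is not None and step < 0:
        # Switch the endpoints so np.arange walks from the last position
        # (stop - 1) down to, and including, the first one (start).
        start, stop = stop - 1, start - 1
    return np.arange(start, stop, step)

forward = slice_positions(2, 6)        # positions 2, 3, 4, 5
backward = slice_positions(2, 6, -1)   # positions 5, 4, 3, 2
strided = slice_positions(2, 6, -2)    # positions 5, 3
```

The results agree with first taking positions 2 through 5 and then applying the negative step to that selection, e.g. `np.arange(10)[2:6][::-2]`.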

@jreback
Contributor

jreback commented Mar 11, 2022

Wow. cc @mroeschke @phofl @jbrockmendel if you have comments.

@jreback
Contributor

jreback commented Mar 11, 2022

yep needs a rebase :->

@lukemanley
Member Author

> yep needs a rebase :->

rebased this one

@@ -310,7 +310,7 @@ Performance improvements
 - Performance improvement in :meth:`.GroupBy.diff` (:issue:`16706`)
 - Performance improvement in :meth:`.GroupBy.transform` when broadcasting values for user-defined functions (:issue:`45708`)
 - Performance improvement in :meth:`.GroupBy.transform` for user-defined functions when only a single group exists (:issue:`44977`)
-- Performance improvement in :meth:`MultiIndex.get_locs` (:issue:`45681`, :issue:`46040`)
+- Performance improvement in :meth:`MultiIndex.get_locs` (:issue:`45681`, :issue:`46040`, :issue:`46330`)
Member

Most users don't use get_locs directly; is there a more user-facing description?

Member Author

How about this:

Performance improvement in :meth:`DataFrame.loc` and :meth:`Series.loc` for tuple-based indexing of a :class:`MultiIndex`

Member Author

updated, thanks
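
For readers unfamiliar with the user-facing side, the sketch below shows the kind of tuple-based `.loc` call the reworded note refers to, next to the equivalent `MultiIndex.get_locs` call. The index, column name, and key are invented for illustration; that `.loc` reaches `get_locs` for such keys is my understanding of the discussion above, not something this example proves.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level index: ("a" | "b") x (1 | 2 | 3), six rows in total.
mi = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3]], names=["outer", "inner"])
df = pd.DataFrame({"value": np.arange(len(mi))}, index=mi)

# Tuple-based indexing: every outer label, inner labels 1 and 3.
key = (slice(None), [1, 3])
selected = df.loc[key, :]

# The same key passed to get_locs returns integer positions into the index.
locs = mi.get_locs(list(key))
```

Selecting with `df.iloc[locs]` gives the same rows as `selected`.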

@jbrockmendel
Member

LGTM

@jreback
Contributor

jreback commented Mar 16, 2022

Can you merge master once again?

@lukemanley
Member Author

@jreback - merged main, and CI is greenish. I don't think the remaining error is related, as I see it showing up in other PRs as well.

    )
    indexer &= lvl_indexer
    if not np.any(indexer) and np.any(lvl_indexer):
        raise KeyError(seq)
Contributor

Is this hit by tests?

Member Author

Yes, covered by test_loc.py > test_missing_key_combination
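
As a rough illustration of the branch discussed in this thread, the sketch below combines per-level boolean masks and raises `KeyError` when a level matches rows on its own but no row survives the combination. The function name `combine_level_masks` and the three-row example are hypothetical; the real method does more work around this check.

```python
import numpy as np

def combine_level_masks(level_masks, key):
    """AND per-level boolean masks and return integer positions.

    Hypothetical helper: raises KeyError when a level has matches of its
    own but the combined mask ends up empty.
    """
    indexer = np.ones(len(level_masks[0]), dtype=bool)
    for lvl_indexer in level_masks:
        indexer &= lvl_indexer
        if not np.any(indexer) and np.any(lvl_indexer):
            raise KeyError(key)
    return indexer.nonzero()[0]

# Rows of an invented index: ("a", 1), ("a", 2), ("b", 3)
outer_is_a = np.array([True, True, False])
inner_is_2 = np.array([False, True, False])
inner_is_3 = np.array([False, False, True])

# ("a", 2) exists at position 1.
found = combine_level_masks([outer_is_a, inner_is_2], ("a", 2))

# "a" and 3 each exist, but never on the same row, so the key is missing.
try:
    combine_level_masks([outer_is_a, inner_is_3], ("a", 3))
    missing_raised = False
except KeyError:
    missing_raised = True
```

Here `missing_raised` ends up `True`, which is the kind of missing key combination the test name above suggests.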

@jreback jreback merged commit 2278923 into pandas-dev:main Mar 18, 2022
@jreback
Contributor

jreback commented Mar 18, 2022

thanks @lukemanley

@lukemanley lukemanley deleted the multiindex-get-locs-bool-arrays branch March 20, 2022 23:18
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Labels
Indexing, MultiIndex, Performance, Refactor
3 participants