Skip to content

PERF: Indexing a multi-index is a lot slower #31648

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
valtron opened this issue Feb 4, 2020 · 7 comments · Fixed by #31651
Closed

PERF: Indexing a multi-index is a lot slower #31648

valtron opened this issue Feb 4, 2020 · 7 comments · Fixed by #31651
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@valtron
Copy link

valtron commented Feb 4, 2020

Indexing a multi-index seemingly went from O(1) to O(N):

bug

I did a bisect, and found this was caused by the _shallow_copy here: b0f33b3#diff-4ffd1c69d47e0ac9f2de4f9e3e4a118cR643.

Code Sample

from time import perf_counter as time
import pandas as pd

for N in [1000, 2000, 4000, 8000, 16000, 32000]:
	values = list(range(N))
	df = pd.DataFrame({ 'a': values })
	df['b'] = 1
	df.set_index(['a', 'b'], inplace = True)

	t = time()
	df.loc[values]
	t = time() - t
	print(N, t)
@TomAugspurger
Copy link
Contributor

Mmm, thanks for bisecting things. I'm a bit surprised to see that the shallow_copy is causing issues. It should be relatively cheap.

cc @topper-123.

@TomAugspurger TomAugspurger added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Feb 4, 2020
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 4, 2020

Changing levels to cache_readonly seems to help.

diff --git a/pandas/core/indexes/multi.py b/pandas/core/indexes/multi.py
index c560d81ba9..95921bd90e 100644
--- a/pandas/core/indexes/multi.py
+++ b/pandas/core/indexes/multi.py
@@ -677,7 +677,7 @@ class MultiIndex(Index):
     # --------------------------------------------------------------------
     # Levels Methods
 
-    @property
+    @cache_readonly
     def levels(self):
         result = [
             x._shallow_copy(name=name) for x, name in zip(self._levels, self._names)

With that, I see the timings

1000 0.1286114020000042
2000 0.26856144899999634
4000 0.5997097740000044
8000 1.2386244449999992
16000 3.030814161000002
32000 7.507717883000005

These are about 1.5x the 0.25 timings on my machine. So slower, but by a constant factor I think.

I'm not sure, but I suspect that the IndexEngine was previously cached, and MultiIndex.get_locs involves many calls to Index.get_loc. The change in b0f33b3#diff-4ffd1c69d47e0ac9f2de4f9e3e4a118cR643 was returning a new index object, and the cache was lost (maybe).

@topper-123
Copy link
Contributor

Thanks for the report @valtron.

MultiIndex._shallow_copy could benefit from a refactor and definately shouldn't cause a slowdown like here. I had a small discussion on the subject here (maybe unrelated to this specific issue, but it's the same theme).

If _shallow_engine doesn't move the reference to the indexing engine over to the new index, I'd think that should be considered a performance bug. I.e. we should probably have that mi._shallow_copy()._engine is mi._engine, which a casusual rundown shows is not the case in master.

@TomAugspurger , wouldn't you agree that idx._shallow_copy()._engine is idx._engine should be True in general? The indexing engines are expensive to create, but immutable, so we should copy reference where possible.

@TomAugspurger
Copy link
Contributor

wouldn't you agree that idx._shallow_copy()._engine is idx._engine should be True in general?

Yep, that's #28584. Caching those makes sense to me.

@jbrockmendel
Copy link
Member

Changing levels to cache_readonly seems to help.

IIRC there was an issue a few weeks ago with MultiIndex names propogating to levels. If we cache levels and then change the names on a MultiIndex, will subsequent levels calls have the wrong names?

wouldn't you agree that idx._shallow_copy()._engine is idx._engine should be True in general?

That sounds like a good idea.

@TomAugspurger
Copy link
Contributor

and then change the names on a MultiIndex, will subsequent levels calls have the wrong names?

See #31651. MultiIndex._set_names now includes a call to self._reset_cache

@topper-123
Copy link
Contributor

topper-123 commented Feb 4, 2020

Right, I had forgotten about that issue. Shouldn't be too difficult to fix.

EDIT: Want to try your hand on this one, @valtron? Will fix the underlying issue you're reporting and I'm sure this will have a large impact on performance in general for pandas.

TomAugspurger added a commit that referenced this issue Feb 4, 2020
* PERF: Cache MultiIndex.levels

Closes #31648

* fixup tests
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 4, 2020
* PERF: Cache MultiIndex.levels

Closes pandas-dev#31648

* fixup tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants