Skip to content

PERF: regression in MultiIndex get_loc indexing performance #29311

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Oct 31, 2019 · 3 comments · Fixed by #29469
Closed

PERF: regression in MultiIndex get_loc indexing performance #29311

jorisvandenbossche opened this issue Oct 31, 2019 · 3 comments · Fixed by #29469
Labels
MultiIndex Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@jorisvandenbossche
Copy link
Member

From our benchmarks (GetLoc class in https://github.com/pandas-dev/pandas/blob/master/asv_bench/benchmarks/multiindex_object.py):

In [65]: mi_med = pd.MultiIndex.from_product( 
    ...:     [np.arange(1000), np.arange(10), list("A")], names=["one", "two", "three"] 
    ...: ) 

In [66]: %timeit mi_med.get_loc((999, 9, "A"))   
9.58 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [67]: pd.__version__  
Out[67]: '0.25.0'

vs

In [18]: mi_med = pd.MultiIndex.from_product( 
    ...:     [np.arange(1000), np.arange(10), list("A")], names=["one", "two", "three"] 
    ...: )

In [19]: %timeit mi_med.get_loc((999, 9, "A"))  
34.7 µs ± 454 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [20]: pd.__version__
Out[20]: '0.26.0.dev0+691.g157495696'

Not directly sure what recently changed related to MultiIndexes.
And the benchmark suite is giving a rather broad range: 2b28454...7c8c8c8 (because the benchmarks were not running for a while I suppose)

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Oct 31, 2019

From a quick profile, seems to be related to shallow copying the levels.

With 0.25, it's mainly the actual engine:

image

On master there is a big chunk of "levels" related time:

image

So potentially related to the MI.levels changes (#27242, #29061)
(cc @topper-123 didn't yet further look at the actual cause though)

@jorisvandenbossche jorisvandenbossche added MultiIndex Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version labels Oct 31, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Oct 31, 2019
@jbrockmendel
Copy link
Member

Not sure what changed, but nlevels should be pretty quick. ATM nlevels is a property that gives len(self.levels), and looking levels, it looks like that could be changed to len(self._levels) and avoid the expensive levels call

@topper-123
Copy link
Contributor

topper-123 commented Nov 2, 2019

When copying the names in _shallow_copy(names=names), there's a lot of unnecessary validation going on for the names (when the names exist already, the name must have been validated already). It's a good guess that's where the problem lies. I'm planning on optimize that after #27138 (which was more demanding than anticipated, but is close to ready for review now)

EDIT: see the discussion on that here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MultiIndex Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants