Skip to content

New engine for MultiIndex? #18519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
toobaz opened this issue Nov 27, 2017 · 4 comments · Fixed by #19074
Closed

New engine for MultiIndex? #18519

toobaz opened this issue Nov 27, 2017 · 4 comments · Fixed by #19074
Labels
MultiIndex Performance Memory or execution speed performance
Milestone

Comments

@toobaz
Copy link
Member

toobaz commented Nov 27, 2017

Currently, MultiIndex.get_loc() and MultiIndex.get_indexer() both rely on an _engine which is either a MultiIndexObjectEngine or a MultiIndexHashEngine: but both of these are thin layers over the flat ObjectEngine. This means that the actual structure of labels and levels is completely discarded (except e.g. for partial indexing, see _get_level_indexer()).

In principle, a completely different scheme could be used:

  • first look for the key elements in levels, and find the corresponding code
  • then look for the code in the levels

In most cases, the second part should be the computationally expensive one. It would consist in running nlevels searches in arrays of dtype=int (the .labels) rather than (as it is now) one search in an object array in which each element is actually a tuple of nlevels elements. My guess is that thanks to vectorization the former should be much faster than the latter.

Moreover (and maybe more importantly), with the current engine fixing a bug such as #18485 is a nightmare. And the same applies to

In [2]: (4, True) in pd.MultiIndex.from_tuples([(4, 1)])
Out[2]: True

and probably others. This is because even though levels are not mixed, the elements are compared as objects.

One caveat is that the single levels would be very often non-unique, and I'm not sure what is the impact of this with the current implementation of hash tables.

@chris-b1
Copy link
Contributor

See also previous discussion in #1752.

@chris-b1 chris-b1 added MultiIndex Performance Memory or execution speed performance labels Nov 27, 2017
@toobaz
Copy link
Member Author

toobaz commented Nov 27, 2017

Aha, and #16324, which was a response to #16319, which was a reaction to #15245, which was a response to #13904, which was a follow-up to the issue @chris-b1 mentioned.

And while the MultiIndexHashEngine is not what I described above (it's probably better), indeed only now I see that

In [2]: mi = pd.MultiIndex.from_product([[True, False], range(1, 10000)])

In [3]: (1, 3) in mi
Out[3]: False

In [4]: mi = pd.MultiIndex.from_product([[1, np.nan], range(1, 10000)])

In [5]: (np.nan, 3) in mi
Out[5]: True

In [6]: mi = pd.MultiIndex.from_product([[1, np.nan], range(1, 10)])

In [7]: (np.nan, 3) in mi
Out[7]: False

that is, large indexes are free from #18485 and friends.

@toobaz
Copy link
Member Author

toobaz commented Nov 27, 2017

OK, after reviewing this evidence, I guess we probably don't need a new engine: but I think we should (be able to) improve the MultiIndexEngine to the point that we can stop relying on the ObjectEngine for small MI, by having it hash codes rather than values.

@jreback jreback added this to the 0.23.0 milestone Jan 17, 2018
@qiuwei
Copy link

qiuwei commented Apr 21, 2021

The new engine seems to cause performance issues for get_indexer.
See #34531
and #23735

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MultiIndex Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants