REF: codes-based MultiIndex engine #19074

toobaz · 2018-01-04T10:41:25Z

closes #18519
closes #18818
closes #18520
closes #18485
closes #15994
closes #19086

tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This PR provides a cleaner and more robust MultiIndex engine.

asv benchmarks

       before           after         ratio
     [93033151]       [6e1ecec1]
+     4.26±0.03ms      20.8±0.08ms     4.87  multiindex_object.GetLoc.time_med_get_loc_warm
+     4.28±0.02μs       19.6±0.1μs     4.58  multiindex_object.GetLoc.time_string_get_loc
+     4.12±0.02ms      18.1±0.09ms     4.40  multiindex_object.GetLoc.time_small_get_loc_warm
+      4.55±0.1μs       18.8±0.3μs     4.13  multiindex_object.GetLoc.time_med_get_loc
-         178±4ms        148±0.7ms     0.83  multiindex_object.GetLoc.time_large_get_loc
-         163±1ms        120±0.5ms     0.73  multiindex_object.Integer.time_get_indexer
-       336±0.6ms          167±1ms     0.50  multiindex_object.GetLoc.time_large_get_loc_warm

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

... clearly show that the engine increases performance for large indexes, and significantly decreases performance for get_loc on small/medium indexes. For reference, on the small test case (10 elements), 16 μs. are lost on each get_loc call, while on the large test case (1 million elements), 146 μs. are gained on each get_loc call. I think the tradeoff is acceptable, also considering that with this approach, any improvement to flat indexes automatically transfers to MultiIndex. But most importantly, I think we have no other options to fix several annoying open issues (and more bugs - I'm sure - which were never reported) about incoherence between flat indexes and MultiIndex.

Notice that moving the engine into index.pyx is problematic because cython does not support multiple inheritance. Anyway, trying to cythonize a couple of methods brought no gain at all (as the overhead is mostly due to the multiple get_loc(val) calls on the different levels).

pep8speaks · 2018-01-04T10:41:37Z

Hello @toobaz! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 27, 2018 at 18:16 Hours UTC

codecov · 2018-01-04T12:10:22Z

Codecov Report

Merging #19074 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #19074      +/-   ##
==========================================
- Coverage   91.63%   91.62%   -0.01%     
==========================================
  Files         150      150              
  Lines       48726    48722       -4     
==========================================
- Hits        44648    44640       -8     
- Misses       4078     4082       +4

Flag	Coverage Δ
#multiple	`89.99% <100%> (-0.01%)`	⬇️
#single	`41.75% <65.21%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/indexes/multi.py	`95.14% <100%> (-1.09%)`	⬇️
pandas/util/testing.py	`83.64% <0%> (-0.21%)`	⬇️
pandas/core/frame.py	`97.57% <0%> (-0.06%)`	⬇️
pandas/plotting/_converter.py	`66.95% <0%> (+1.73%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24d9509...468bb08. Read the comment docs.

jreback · 2018-01-04T14:08:36Z

.. clearly show that the engine increases performance for large indexes, and significantly decreases performance for get_loc on small/medium indexes. For reference, on the small test case (10 elements), 16 μs. are lost on each get_loc call, while on the large test case (1 million elements), 146 μs. are gained on each get_loc call. I think the tradeoff is acceptable, also considering that

This is exactly why we changed this previously. Please look at the previous issue and how @jorisvandenbossche made the tradeoff chart.

This needs a full asv run as well as significant code cleanup. But first if you can narrow the small indexing perf then that would be helpful.

toobaz · 2018-01-04T14:29:33Z

This is exactly why we changed this previously. Please look at the previous issue

What do you mean by "this"? I don't think pandas ever had a MultiIndex engine similar to the one I am proposing. I saw #16319, but there we were talking about a ~x1000, not ~x5, performance impact.

and how @jorisvandenbossche made the tradeoff chart.

pointer?

But first if you can narrow the small indexing perf then that would be helpful.

I guess so. If you have any ideas, great.

TomAugspurger · 2018-01-04T15:17:12Z

pointer?

#16324 (comment)

jreback · 2018-01-04T15:25:53Z

@toobaz what I want to see since this is a regression on perf, how much it is relative to where we are. yes these seem like tiny numbers, but so did the other ones.

toobaz · 2018-01-04T16:06:13Z

#16324 (comment)

Thanks, had missed that! So here is a comparison of the three engines:
http://nbviewer.jupyter.org/gist/toobaz/de76b3fbc6745266fd4c2a6e428bd2a4

And here is the figure equivalent to that in #16324 (comment) :

jreback · 2018-01-04T16:21:12Z

ok from a real world perspective then for < 1000 is slower, but how much in total time?

toobaz · 2018-01-04T18:06:45Z

ok from a real world perspective then for < 1000 is slower, but how much in total time?

Assuming you are asking for the difference between UInt and Object in the "warm combination" case (dashed lines), we are in the order of 100 μs:

9        0.000178
100      0.000147
900      0.000059
10000   -0.000757

(this is tested with more iterations than the figure above, see the updated notebook)

jbrockmendel · 2018-01-04T20:06:24Z

pandas/_libs/index.pyx

@@ -40,6 +40,10 @@ cdef inline is_definitely_invalid_key(object val):
            or PyList_Check(val) or hasattr(val, '_data'))


+def is_definitely_invalid_key(val):


Easier to just make the original cpdef bint is_definitely_invalid_key(object val)? Plus small perf improvement from adding the bint type.

Possible... I have no idea of how cdef bint compares to cdef inline, which should also bring a small performance improvement.

The main motivation would be to avoid needing both a is_definitely_invalid_key and _is_definitely_invalid_key. That comes from the cpdef part.

bint and inline are not mutually exclusive, so even if you don't use the cpdef suggestion, changing cdef inline to cdef inline bint is an improvement regardless.

That said, I expect the performance impact to be indistinguishable from noise.

(had missed the "p" in cpdef!)

jbrockmendel · 2018-01-04T20:11:04Z

pandas/core/indexes/multi.py

@@ -50,6 +50,101 @@
         target_klass='MultiIndex or list of tuples'))


+class BaseMultiIndexCodesEngine(object):


If cython multiple inheritance is a problem, this can still go in cython and be a cdef class and the subclasses below can still inherit from it.

I'm afraid I'm not following you: I have no problem in having the subclasses below inherit from this, but rather of also inheriting from their base classes.

The only way out I see is a template analogous to index_class_helper.pxi.in... but I really don't think it is worth the added complexity.

(My preference would be to not cythonize anyway, because I doubt any measurable gain can be attained, given that the cost overwhelmingly comes from numpy and Index.get_loc() calls - but I'm ready to be convinced I'm wrong)

OK, now I understand, you're suggesting to keep the two subclasses in Python. Makes sense.

... well, except that

apart from cdef class, the only internal method which could in principle be a cdef (_codes_to_ints) is precisely the one which would remain in multi.py (because it's type-specific)

I would have to import pandas.core.dtypes.missing.isna inside index.pyx (maybe harmless, but I see all other current imports are lower level)

... but if you think this second point is not a problem, I will give it a try.

toobaz · 2018-01-04T23:54:59Z

@jbrockmendel's suggestion to cythonize did help reducing a bit the performance decrease:

       before           after         ratio
     [93033151]       [b3ba47c6]
+     3.97±0.02μs      17.4±0.04μs     4.39  multiindex_object.GetLoc.time_med_get_loc
+     4.13±0.02ms      18.0±0.08ms     4.36  multiindex_object.GetLoc.time_small_get_loc_warm
+     4.26±0.03μs       17.6±0.1μs     4.13  multiindex_object.GetLoc.time_string_get_loc
+     4.32±0.02ms      16.6±0.05ms     3.84  multiindex_object.GetLoc.time_med_get_loc_warm
-         181±5ms        153±0.8ms     0.84  multiindex_object.GetLoc.time_large_get_loc
-         170±2ms        123±0.5ms     0.73  multiindex_object.Integer.time_get_indexer
-       321±0.8ms        169±0.6ms     0.53  multiindex_object.GetLoc.time_large_get_loc_warm

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

jreback · 2018-01-05T00:37:42Z

doc/source/whatsnew/v0.23.0.txt

@@ -318,6 +318,10 @@ Indexing
 - Bug in :func:`MultiIndex.get_level_values` which would return an invalid index on level of ints with missing values (:issue:`17924`)
 - Bug in :func:`MultiIndex.remove_unused_levels` which would fill nan values (:issue:`18417`)
 - Bug in :func:`MultiIndex.from_tuples`` which would fail to take zipped tuples in python3 (:issue:`18434`)
+- Bug in :func:`MultiIndex.get_loc`` which would fail to automatically cast values between float and int (:issue:`18818`, :issue:`15994`)


can you split all of the MultiIndex out to a separate sub-section (in bug fixes)

jreback · 2018-01-05T00:39:07Z

pandas/_libs/index.pyx

-    but use the IndexEngine for computation
+cdef class BaseMultiIndexCodesEngine(object):
+    def __init__(self, levels, labels, offsets, **kwargs):
+        self._levels = levels


these don't need to be private (levels & offsets)

you can type them (offsets is an uint64_t array?)

furthermore these can be made readonly as well (you can set them once), see how we do this in Timestamp for example

If I add

cdef readonly: object levels ndarray offsets

I get

In [1]: import pandas as pd --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-1-7dd3504c366f> in <module>() ----> 1 import pandas as pd /home/nobackup/repo/pandas/pandas/__init__.py in <module>() 40 import pandas.core.config_init 41 ---> 42 from pandas.core.api import * 43 from pandas.core.sparse.api import * 44 from pandas.stats.api import * /home/nobackup/repo/pandas/pandas/core/api.py in <module>() 8 from pandas.core.dtypes.missing import isna, isnull, notna, notnull 9 from pandas.core.categorical import Categorical ---> 10 from pandas.core.groupby import Grouper 11 from pandas.io.formats.format import set_eng_float_format 12 from pandas.core.index import (Index, CategoricalIndex, Int64Index, /home/nobackup/repo/pandas/pandas/core/groupby.py in <module>() 46 from pandas.core.base import (PandasObject, SelectionMixin, GroupByError, 47 DataError, SpecificationError) ---> 48 from pandas.core.index import (Index, MultiIndex, 49 CategoricalIndex, _ensure_index) 50 from pandas.core.categorical import Categorical /home/nobackup/repo/pandas/pandas/core/index.py in <module>() 1 # flake8: noqa ----> 2 from pandas.core.indexes.api import * 3 from pandas.core.indexes.multi import _sparsify /home/nobackup/repo/pandas/pandas/core/indexes/api.py in <module>() 6 InvalidIndexError) # noqa 7 from pandas.core.indexes.category import CategoricalIndex # noqa ----> 8 from pandas.core.indexes.multi import MultiIndex # noqa 9 from pandas.core.indexes.interval import IntervalIndex # noqa 10 from pandas.core.indexes.numeric import (NumericIndex, Float64Index, # noqa /home/nobackup/repo/pandas/pandas/core/indexes/multi.py in <module>() 52 53 class MultiIndexUIntEngine(libindex.BaseMultiIndexCodesEngine, ---> 54 libindex.UInt64Engine): 55 """ 56 Manage a MultiIndex by mapping label combinations to positive integers. TypeError: multiple bases have instance lay-out conflict

jreback · 2018-01-05T00:39:58Z

pandas/_libs/index.pyx

-        if hasattr(values, 'values'):
-            values = values.values
-        return super(MultiIndexObjectEngine, self).get_indexer(values)
+        # Map each combination to an integer


explain how you are doing the mapping, give an example in the comments, this should be explained as much as possible

jreback · 2018-01-05T00:40:24Z

pandas/_libs/index.pyx

-        return super(MultiIndexObjectEngine, self).get_indexer(values)
+        # Map each combination to an integer
+        codes = (np.array(labels, dtype='int64').T + 1).astype('uint64')
+        lab_ints = self._codes_to_ints(codes)


de-privatize method names, spell them out, e.g.codes_to_integer

This is is the only method name starting with _ as it's not meant to be used from outside (like e.g. IndexEngine._maybe_get_bool_indexer). Why should it not be private?

I understand but that a historic effect. These classes themselves are totally private, so making the methods is overkill.

I thought it would make it clear that e.g. BaseMultiIndexCodesEngine.get_loc is meant to be used from Index (internal) code, while BaseMultiIndexCodesEngine._codes_to_ints is only meant to be used from code specific to the engine. But fine, will change.

I agree with @toobaz: if something is a helper method that is only used within the class, make this clear by using a _... name (also for within the codebase is it useful to know which method is 'internally' public and which not, even if the class itself is fully private)

jreback · 2018-01-05T00:40:32Z

pandas/_libs/index.pyx

-    provide the same interface as the MultiIndexEngine
-    but use the IndexEngine for computation
+cdef class BaseMultiIndexCodesEngine(object):
+    def __init__(self, levels, labels, offsets, **kwargs):


type the arguments here as well

jreback · 2018-01-05T00:41:56Z

pandas/_libs/index.pyx

+        if not isinstance(key, tuple):
+            raise KeyError(key)
+        try:
+            idces = [0 if checknull(v) else self._levels[l].get_loc(v) + 1


don't use abbreviations (e.g. idces)

jreback · 2018-01-05T00:42:28Z

pandas/core/indexes/multi.py

+    _base = libindex.UInt64Engine
+
+    def _codes_to_ints(self, codes):
+        # Shift:


doc strings & explanation

jreback · 2018-01-05T00:43:10Z

pandas/core/indexes/multi.py

@@ -691,16 +721,15 @@ def _get_level_number(self, level):

    @cache_readonly
    def _engine(self):
+        # Find powers of 2 which dominate level sizes - including -1 for NaN:


more explanation

write out the lev_bits in a more clear way

jreback · 2018-01-05T00:43:24Z

pandas/core/indexes/multi.py

+        # Find powers of 2 which dominate level sizes - including -1 for NaN:
+        lev_bits = np.cumsum(np.ceil(np.log2([len(l) + 1 for l in
+                                              self.levels[::-1]])))[::-1]
+        offsets = np.concatenate([lev_bits[1:], [0]]).astype('uint')


you sure about uint here? (and not uint64)?

jreback · 2018-01-05T00:43:51Z

pandas/tests/indexes/test_multi.py

+    @pytest.mark.parametrize('level', [0, 1])
+    @pytest.mark.parametrize('null_val', [np.nan, pd.NaT, None])
+    def test_get_loc_nan(self, level, null_val):
+        # GH 18485


explain what you are testing

toobaz · 2018-01-08T08:29:26Z

Waiting to rebase on #19054 before pushing again

toobaz · 2018-01-11T18:01:33Z

New asv run shows some small additional improvement:

       before           after         ratio
     [9d8dbefb]       [b152cb80]
+     4.71±0.07μs       19.9±0.5μs     4.21  multiindex_object.GetLoc.time_string_get_loc
+     4.69±0.05ms       18.2±0.2ms     3.88  multiindex_object.GetLoc.time_small_get_loc_warm
+      5.80±0.9μs       19.8±0.1μs     3.41  multiindex_object.GetLoc.time_med_get_loc
+      5.86±0.5ms       18.7±0.4ms     3.19  multiindex_object.GetLoc.time_med_get_loc_warm
+      12.8±0.1ms      18.2±0.05ms     1.42  multiindex_object.Values.time_datetime_level_values_copy
-         464±8μs          421±5μs     0.91  multiindex_object.Duplicates.time_remove_unused_levels
-         201±7ms        165±0.9ms     0.82  multiindex_object.GetLoc.time_large_get_loc
-         197±4ms        132±0.4ms     0.67  multiindex_object.Integer.time_get_indexer
-        426±20ms          182±2ms     0.43  multiindex_object.GetLoc.time_large_get_loc_warm

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

toobaz · 2018-01-12T08:39:41Z

@jreback ping

toobaz · 2018-01-26T18:59:23Z

so the main issue number is a perf entry (note that its for larger MI & a small regression on smaller)

Added, ping

jreback · 2018-01-27T16:59:51Z

if you can rebase and ping on green, ok to merge.

closes pandas-dev#18519 closes pandas-dev#18818 closes pandas-dev#18520 closes pandas-dev#18485 closes pandas-dev#15994 closes pandas-dev#19086

toobaz · 2018-01-28T10:19:55Z

@jreback rebased, ping

jorisvandenbossche · 2018-01-28T11:53:20Z

@toobaz Great work!

jreback · 2018-01-29T02:51:24Z

this is failing the 32 bit builds
https://travis-ci.org/MacPython/pandas-wheels/jobs/334527944

if u can issue a follow up to fix
not easy to test this

toobaz force-pushed the mi_engine_asuint_18519 branch 3 times, most recently from 6219768 to 2cfec77 Compare January 4, 2018 12:10

toobaz force-pushed the mi_engine_asuint_18519 branch 2 times, most recently from 9b9173d to f1476d0 Compare January 4, 2018 13:54

jreback added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jan 4, 2018

jbrockmendel reviewed Jan 4, 2018

View reviewed changes

toobaz mentioned this pull request Jan 4, 2018

MultiIndex (and not flat Index) casts bool to int #19086

Closed

toobaz force-pushed the mi_engine_asuint_18519 branch 3 times, most recently from 544683d to b3ba47c Compare January 4, 2018 23:52

jreback requested changes Jan 5, 2018

View reviewed changes

toobaz mentioned this pull request Jan 8, 2018

pd.MultiIndex.get_loc(np.nan) #19132

Closed

toobaz force-pushed the mi_engine_asuint_18519 branch from b3ba47c to b152cb8 Compare January 10, 2018 08:46

toobaz force-pushed the mi_engine_asuint_18519 branch from b152cb8 to 36b9cee Compare January 11, 2018 22:35

toobaz force-pushed the mi_engine_asuint_18519 branch from f9d6970 to 1a71ffb Compare January 26, 2018 14:04

jreback approved these changes Jan 27, 2018

View reviewed changes

toobaz added 10 commits January 27, 2018 18:56

BUG: implement new engine for codes-based MultiIndex indexing

034ba25

closes pandas-dev#18519 closes pandas-dev#18818 closes pandas-dev#18520 closes pandas-dev#18485 closes pandas-dev#15994 closes pandas-dev#19086

TST: added test for implicit casting in MultiIndex.get_loc()

998ddcb

TST: test get_loc on MultiIndex with tuple containing NaN

805cb45

TST: test get_loc on MultiIndex with levels of different dtypes

ca967bc

TST: test MultiIndexPyIntEngine

a059c61

CLN: remove obsolete special-casing of MI

0606d80

CLN: remove obsolete MultiIndex engines

d3d526e

TST: test get_loc on MultiIndex with tuple containing bool

52e4a02

PERF: specify type of is_definitely_invalid_key

e7343af

PERF: transform single keys without going 2D

468bb08

toobaz force-pushed the mi_engine_asuint_18519 branch from 1a71ffb to 468bb08 Compare January 27, 2018 18:16

jorisvandenbossche changed the title ~~codes-based MultiIndex engine~~ REF: codes-based MultiIndex engine Jan 28, 2018

jorisvandenbossche merged commit 8cbee35 into pandas-dev:master Jan 28, 2018

toobaz deleted the mi_engine_asuint_18519 branch January 28, 2018 16:03

toobaz mentioned this pull request Jan 29, 2018

MultiIndexPyIntEngine fails on 32 bit builds #19439

Closed

chris-b1 mentioned this pull request Feb 2, 2018

BUG? Possible issue with multi indexing column names of different types #19517

Closed

Zaharid mentioned this pull request Feb 5, 2018

pandas >0.20 breaks alpha_s determination NNPDF/nnpdf#16

Closed

toobaz mentioned this pull request May 18, 2018

Validate that MultiIndex labels are Int64 type when not called internally. #5212

Closed

TomAugspurger mentioned this pull request Nov 16, 2018

PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64 #23735

Closed

hksonngan mentioned this pull request Feb 24, 2019

Fixed regression of Multi index with NaN #25424

Closed

4 tasks

Ahrimanox mentioned this pull request Apr 16, 2019

PERF: Performance regression (memory and time) on set_index method between 0.20.3 and 0.24.2 #26108

Closed

toobaz mentioned this pull request Oct 18, 2020

AttributeError when slicing a datetime MultiIndex level #15928

Closed

toobaz mentioned this pull request May 14, 2024

ENH: improve performance of lexsort for integers. numpy/numpy#26434

Open

3 tasks

		@@ -40,6 +40,10 @@ cdef inline is_definitely_invalid_key(object val):
		or PyList_Check(val) or hasattr(val, '_data'))


		def is_definitely_invalid_key(val):

		@@ -50,6 +50,101 @@
		target_klass='MultiIndex or list of tuples'))


		class BaseMultiIndexCodesEngine(object):

REF: codes-based MultiIndex engine #19074

REF: codes-based MultiIndex engine #19074

Conversation

toobaz commented Jan 4, 2018 • edited by jreback Loading

pep8speaks commented Jan 4, 2018 • edited Loading

Comment last updated on January 27, 2018 at 18:16 Hours UTC

codecov bot commented Jan 4, 2018 • edited Loading

Codecov Report

jreback commented Jan 4, 2018

toobaz commented Jan 4, 2018 • edited Loading

TomAugspurger commented Jan 4, 2018

jreback commented Jan 4, 2018

toobaz commented Jan 4, 2018

jreback commented Jan 4, 2018

toobaz commented Jan 4, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toobaz commented Jan 4, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toobaz commented Jan 8, 2018

toobaz commented Jan 11, 2018

toobaz commented Jan 12, 2018

toobaz commented Jan 26, 2018

jreback commented Jan 27, 2018

toobaz commented Jan 28, 2018

jorisvandenbossche commented Jan 28, 2018

jreback commented Jan 29, 2018

toobaz commented Jan 4, 2018 •

edited by jreback

Loading

pep8speaks commented Jan 4, 2018 •

edited

Loading

codecov bot commented Jan 4, 2018 •

edited

Loading

toobaz commented Jan 4, 2018 •

edited

Loading