Skip to content

REF: codes-based MultiIndex engine #19074

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jan 28, 2018

Conversation

toobaz
Copy link
Member

@toobaz toobaz commented Jan 4, 2018

closes #18519
closes #18818
closes #18520
closes #18485
closes #15994
closes #19086

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This PR provides a cleaner and more robust MultiIndex engine.

asv benchmarks

       before           after         ratio
     [93033151]       [6e1ecec1]
+     4.26±0.03ms      20.8±0.08ms     4.87  multiindex_object.GetLoc.time_med_get_loc_warm
+     4.28±0.02μs       19.6±0.1μs     4.58  multiindex_object.GetLoc.time_string_get_loc
+     4.12±0.02ms      18.1±0.09ms     4.40  multiindex_object.GetLoc.time_small_get_loc_warm
+      4.55±0.1μs       18.8±0.3μs     4.13  multiindex_object.GetLoc.time_med_get_loc
-         178±4ms        148±0.7ms     0.83  multiindex_object.GetLoc.time_large_get_loc
-         163±1ms        120±0.5ms     0.73  multiindex_object.Integer.time_get_indexer
-       336±0.6ms          167±1ms     0.50  multiindex_object.GetLoc.time_large_get_loc_warm

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

... clearly show that the engine increases performance for large indexes, and significantly decreases performance for get_loc on small/medium indexes. For reference, on the small test case (10 elements), 16 μs. are lost on each get_loc call, while on the large test case (1 million elements), 146 μs. are gained on each get_loc call. I think the tradeoff is acceptable, also considering that with this approach, any improvement to flat indexes automatically transfers to MultiIndex. But most importantly, I think we have no other options to fix several annoying open issues (and more bugs - I'm sure - which were never reported) about incoherence between flat indexes and MultiIndex.

Notice that moving the engine into index.pyx is problematic because cython does not support multiple inheritance. Anyway, trying to cythonize a couple of methods brought no gain at all (as the overhead is mostly due to the multiple get_loc(val) calls on the different levels).

@pep8speaks
Copy link

pep8speaks commented Jan 4, 2018

Hello @toobaz! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 27, 2018 at 18:16 Hours UTC

@toobaz toobaz force-pushed the mi_engine_asuint_18519 branch 3 times, most recently from 6219768 to 2cfec77 Compare January 4, 2018 12:10
@codecov
Copy link

codecov bot commented Jan 4, 2018

Codecov Report

Merging #19074 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19074      +/-   ##
==========================================
- Coverage   91.63%   91.62%   -0.01%     
==========================================
  Files         150      150              
  Lines       48726    48722       -4     
==========================================
- Hits        44648    44640       -8     
- Misses       4078     4082       +4
Flag Coverage Δ
#multiple 89.99% <100%> (-0.01%) ⬇️
#single 41.75% <65.21%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.14% <100%> (-1.09%) ⬇️
pandas/util/testing.py 83.64% <0%> (-0.21%) ⬇️
pandas/core/frame.py 97.57% <0%> (-0.06%) ⬇️
pandas/plotting/_converter.py 66.95% <0%> (+1.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24d9509...468bb08. Read the comment docs.

@toobaz toobaz force-pushed the mi_engine_asuint_18519 branch 2 times, most recently from 9b9173d to f1476d0 Compare January 4, 2018 13:54
@jreback
Copy link
Contributor

jreback commented Jan 4, 2018

.. clearly show that the engine increases performance for large indexes, and significantly decreases performance for get_loc on small/medium indexes. For reference, on the small test case (10 elements), 16 μs. are lost on each get_loc call, while on the large test case (1 million elements), 146 μs. are gained on each get_loc call. I think the tradeoff is acceptable, also considering that

This is exactly why we changed this previously. Please look at the previous issue and how @jorisvandenbossche made the tradeoff chart.

This needs a full asv run as well as significant code cleanup. But first if you can narrow the small indexing perf then that would be helpful.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jan 4, 2018
@toobaz
Copy link
Member Author

toobaz commented Jan 4, 2018

This is exactly why we changed this previously. Please look at the previous issue

What do you mean by "this"? I don't think pandas ever had a MultiIndex engine similar to the one I am proposing. I saw #16319, but there we were talking about a ~x1000, not ~x5, performance impact.

and how @jorisvandenbossche made the tradeoff chart.

pointer?

But first if you can narrow the small indexing perf then that would be helpful.

I guess so. If you have any ideas, great.

@TomAugspurger
Copy link
Contributor

pointer?

#16324 (comment)

@jreback
Copy link
Contributor

jreback commented Jan 4, 2018

@toobaz what I want to see since this is a regression on perf, how much it is relative to where we are. yes these seem like tiny numbers, but so did the other ones.

@toobaz
Copy link
Member Author

toobaz commented Jan 4, 2018

#16324 (comment)

Thanks, had missed that! So here is a comparison of the three engines:
http://nbviewer.jupyter.org/gist/toobaz/de76b3fbc6745266fd4c2a6e428bd2a4

And here is the figure equivalent to that in #16324 (comment) :

perf_warm_w_10

@jreback
Copy link
Contributor

jreback commented Jan 4, 2018

ok from a real world perspective then for < 1000 is slower, but how much in total time?

@toobaz
Copy link
Member Author

toobaz commented Jan 4, 2018

ok from a real world perspective then for < 1000 is slower, but how much in total time?

Assuming you are asking for the difference between UInt and Object in the "warm combination" case (dashed lines), we are in the order of 100 μs:

9        0.000178
100      0.000147
900      0.000059
10000   -0.000757

(this is tested with more iterations than the figure above, see the updated notebook)

@@ -40,6 +40,10 @@ cdef inline is_definitely_invalid_key(object val):
or PyList_Check(val) or hasattr(val, '_data'))


def is_definitely_invalid_key(val):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easier to just make the original cpdef bint is_definitely_invalid_key(object val)? Plus small perf improvement from adding the bint type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible... I have no idea of how cdef bint compares to cdef inline, which should also bring a small performance improvement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main motivation would be to avoid needing both a is_definitely_invalid_key and _is_definitely_invalid_key. That comes from the cpdef part.

bint and inline are not mutually exclusive, so even if you don't use the cpdef suggestion, changing cdef inline to cdef inline bint is an improvement regardless.

That said, I expect the performance impact to be indistinguishable from noise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(had missed the "p" in cpdef!)

@@ -50,6 +50,101 @@
target_klass='MultiIndex or list of tuples'))


class BaseMultiIndexCodesEngine(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If cython multiple inheritance is a problem, this can still go in cython and be a cdef class and the subclasses below can still inherit from it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I'm not following you: I have no problem in having the subclasses below inherit from this, but rather of also inheriting from their base classes.

The only way out I see is a template analogous to index_class_helper.pxi.in... but I really don't think it is worth the added complexity.

(My preference would be to not cythonize anyway, because I doubt any measurable gain can be attained, given that the cost overwhelmingly comes from numpy and Index.get_loc() calls - but I'm ready to be convinced I'm wrong)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now I understand, you're suggesting to keep the two subclasses in Python. Makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... well, except that

  • apart from cdef class, the only internal method which could in principle be a cdef (_codes_to_ints) is precisely the one which would remain in multi.py (because it's type-specific)
  • I would have to import pandas.core.dtypes.missing.isna inside index.pyx (maybe harmless, but I see all other current imports are lower level)

... but if you think this second point is not a problem, I will give it a try.

@toobaz toobaz force-pushed the mi_engine_asuint_18519 branch 3 times, most recently from 544683d to b3ba47c Compare January 4, 2018 23:52
@toobaz
Copy link
Member Author

toobaz commented Jan 4, 2018

@jbrockmendel's suggestion to cythonize did help reducing a bit the performance decrease:

       before           after         ratio
     [93033151]       [b3ba47c6]
+     3.97±0.02μs      17.4±0.04μs     4.39  multiindex_object.GetLoc.time_med_get_loc
+     4.13±0.02ms      18.0±0.08ms     4.36  multiindex_object.GetLoc.time_small_get_loc_warm
+     4.26±0.03μs       17.6±0.1μs     4.13  multiindex_object.GetLoc.time_string_get_loc
+     4.32±0.02ms      16.6±0.05ms     3.84  multiindex_object.GetLoc.time_med_get_loc_warm
-         181±5ms        153±0.8ms     0.84  multiindex_object.GetLoc.time_large_get_loc
-         170±2ms        123±0.5ms     0.73  multiindex_object.Integer.time_get_indexer
-       321±0.8ms        169±0.6ms     0.53  multiindex_object.GetLoc.time_large_get_loc_warm

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@@ -318,6 +318,10 @@ Indexing
- Bug in :func:`MultiIndex.get_level_values` which would return an invalid index on level of ints with missing values (:issue:`17924`)
- Bug in :func:`MultiIndex.remove_unused_levels` which would fill nan values (:issue:`18417`)
- Bug in :func:`MultiIndex.from_tuples`` which would fail to take zipped tuples in python3 (:issue:`18434`)
- Bug in :func:`MultiIndex.get_loc`` which would fail to automatically cast values between float and int (:issue:`18818`, :issue:`15994`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you split all of the MultiIndex out to a separate sub-section (in bug fixes)

but use the IndexEngine for computation
cdef class BaseMultiIndexCodesEngine(object):
def __init__(self, levels, labels, offsets, **kwargs):
self._levels = levels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these don't need to be private (levels & offsets)

you can type them (offsets is an uint64_t array?)

furthermore these can be made readonly as well (you can set them once), see how we do this in Timestamp for example

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I add

    cdef readonly:
        object levels
        ndarray offsets

I get

In [1]: import pandas as pd
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-7dd3504c366f> in <module>()
----> 1 import pandas as pd

/home/nobackup/repo/pandas/pandas/__init__.py in <module>()
     40 import pandas.core.config_init
     41 
---> 42 from pandas.core.api import *
     43 from pandas.core.sparse.api import *
     44 from pandas.stats.api import *

/home/nobackup/repo/pandas/pandas/core/api.py in <module>()
      8 from pandas.core.dtypes.missing import isna, isnull, notna, notnull
      9 from pandas.core.categorical import Categorical
---> 10 from pandas.core.groupby import Grouper
     11 from pandas.io.formats.format import set_eng_float_format
     12 from pandas.core.index import (Index, CategoricalIndex, Int64Index,

/home/nobackup/repo/pandas/pandas/core/groupby.py in <module>()
     46 from pandas.core.base import (PandasObject, SelectionMixin, GroupByError,
     47                               DataError, SpecificationError)
---> 48 from pandas.core.index import (Index, MultiIndex,
     49                                CategoricalIndex, _ensure_index)
     50 from pandas.core.categorical import Categorical

/home/nobackup/repo/pandas/pandas/core/index.py in <module>()
      1 # flake8: noqa
----> 2 from pandas.core.indexes.api import *
      3 from pandas.core.indexes.multi import _sparsify

/home/nobackup/repo/pandas/pandas/core/indexes/api.py in <module>()
      6                                       InvalidIndexError)  # noqa
      7 from pandas.core.indexes.category import CategoricalIndex  # noqa
----> 8 from pandas.core.indexes.multi import MultiIndex  # noqa
      9 from pandas.core.indexes.interval import IntervalIndex  # noqa
     10 from pandas.core.indexes.numeric import (NumericIndex, Float64Index,  # noqa

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in <module>()
     52 
     53 class MultiIndexUIntEngine(libindex.BaseMultiIndexCodesEngine,
---> 54                            libindex.UInt64Engine):
     55     """
     56     Manage a MultiIndex by mapping label combinations to positive integers.

TypeError: multiple bases have instance lay-out conflict

if hasattr(values, 'values'):
values = values.values
return super(MultiIndexObjectEngine, self).get_indexer(values)
# Map each combination to an integer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explain how you are doing the mapping, give an example in the comments, this should be explained as much as possible

return super(MultiIndexObjectEngine, self).get_indexer(values)
# Map each combination to an integer
codes = (np.array(labels, dtype='int64').T + 1).astype('uint64')
lab_ints = self._codes_to_ints(codes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

de-privatize method names, spell them out, e.g.codes_to_integer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is is the only method name starting with _ as it's not meant to be used from outside (like e.g. IndexEngine._maybe_get_bool_indexer). Why should it not be private?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand but that a historic effect. These classes themselves are totally private, so making the methods is overkill.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it would make it clear that e.g. BaseMultiIndexCodesEngine.get_loc is meant to be used from Index (internal) code, while BaseMultiIndexCodesEngine._codes_to_ints is only meant to be used from code specific to the engine. But fine, will change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @toobaz: if something is a helper method that is only used within the class, make this clear by using a _... name (also for within the codebase is it useful to know which method is 'internally' public and which not, even if the class itself is fully private)

provide the same interface as the MultiIndexEngine
but use the IndexEngine for computation
cdef class BaseMultiIndexCodesEngine(object):
def __init__(self, levels, labels, offsets, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type the arguments here as well

if not isinstance(key, tuple):
raise KeyError(key)
try:
idces = [0 if checknull(v) else self._levels[l].get_loc(v) + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use abbreviations (e.g. idces)

_base = libindex.UInt64Engine

def _codes_to_ints(self, codes):
# Shift:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc strings & explanation

@@ -691,16 +721,15 @@ def _get_level_number(self, level):

@cache_readonly
def _engine(self):
# Find powers of 2 which dominate level sizes - including -1 for NaN:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more explanation

write out the lev_bits in a more clear way

# Find powers of 2 which dominate level sizes - including -1 for NaN:
lev_bits = np.cumsum(np.ceil(np.log2([len(l) + 1 for l in
self.levels[::-1]])))[::-1]
offsets = np.concatenate([lev_bits[1:], [0]]).astype('uint')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you sure about uint here? (and not uint64)?

@pytest.mark.parametrize('level', [0, 1])
@pytest.mark.parametrize('null_val', [np.nan, pd.NaT, None])
def test_get_loc_nan(self, level, null_val):
# GH 18485
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explain what you are testing

@toobaz
Copy link
Member Author

toobaz commented Jan 8, 2018

Waiting to rebase on #19054 before pushing again

@toobaz toobaz force-pushed the mi_engine_asuint_18519 branch from b3ba47c to b152cb8 Compare January 10, 2018 08:46
@toobaz
Copy link
Member Author

toobaz commented Jan 11, 2018

New asv run shows some small additional improvement:

       before           after         ratio
     [9d8dbefb]       [b152cb80]
+     4.71±0.07μs       19.9±0.5μs     4.21  multiindex_object.GetLoc.time_string_get_loc
+     4.69±0.05ms       18.2±0.2ms     3.88  multiindex_object.GetLoc.time_small_get_loc_warm
+      5.80±0.9μs       19.8±0.1μs     3.41  multiindex_object.GetLoc.time_med_get_loc
+      5.86±0.5ms       18.7±0.4ms     3.19  multiindex_object.GetLoc.time_med_get_loc_warm
+      12.8±0.1ms      18.2±0.05ms     1.42  multiindex_object.Values.time_datetime_level_values_copy
-         464±8μs          421±5μs     0.91  multiindex_object.Duplicates.time_remove_unused_levels
-         201±7ms        165±0.9ms     0.82  multiindex_object.GetLoc.time_large_get_loc
-         197±4ms        132±0.4ms     0.67  multiindex_object.Integer.time_get_indexer
-        426±20ms          182±2ms     0.43  multiindex_object.GetLoc.time_large_get_loc_warm

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@toobaz toobaz force-pushed the mi_engine_asuint_18519 branch from b152cb8 to 36b9cee Compare January 11, 2018 22:35
@toobaz
Copy link
Member Author

toobaz commented Jan 12, 2018

@jreback ping

@toobaz toobaz force-pushed the mi_engine_asuint_18519 branch from f9d6970 to 1a71ffb Compare January 26, 2018 14:04
@toobaz
Copy link
Member Author

toobaz commented Jan 26, 2018

so the main issue number is a perf entry (note that its for larger MI & a small regression on smaller)

Added, ping

@jreback
Copy link
Contributor

jreback commented Jan 27, 2018

if you can rebase and ping on green, ok to merge.

@toobaz toobaz force-pushed the mi_engine_asuint_18519 branch from 1a71ffb to 468bb08 Compare January 27, 2018 18:16
@toobaz
Copy link
Member Author

toobaz commented Jan 28, 2018

@jreback rebased, ping

@jorisvandenbossche jorisvandenbossche changed the title codes-based MultiIndex engine REF: codes-based MultiIndex engine Jan 28, 2018
@jorisvandenbossche jorisvandenbossche merged commit 8cbee35 into pandas-dev:master Jan 28, 2018
@jorisvandenbossche
Copy link
Member

@toobaz Great work!

@toobaz toobaz deleted the mi_engine_asuint_18519 branch January 28, 2018 16:03
@jreback
Copy link
Contributor

jreback commented Jan 29, 2018

this is failing the 32 bit builds
https://travis-ci.org/MacPython/pandas-wheels/jobs/334527944

if u can issue a follow up to fix
not easy to test this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
6 participants