-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: codes-based MultiIndex engine #19074
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: codes-based MultiIndex engine #19074
Conversation
Hello @toobaz! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 27, 2018 at 18:16 Hours UTC |
6219768
to
2cfec77
Compare
Codecov Report
@@ Coverage Diff @@
## master #19074 +/- ##
==========================================
- Coverage 91.63% 91.62% -0.01%
==========================================
Files 150 150
Lines 48726 48722 -4
==========================================
- Hits 44648 44640 -8
- Misses 4078 4082 +4
Continue to review full report at Codecov.
|
9b9173d
to
f1476d0
Compare
This is exactly why we changed this previously. Please look at the previous issue and how @jorisvandenbossche made the tradeoff chart. This needs a full asv run as well as significant code cleanup. But first if you can narrow the small indexing perf then that would be helpful. |
What do you mean by "this"? I don't think pandas ever had a
pointer?
I guess so. If you have any ideas, great. |
|
@toobaz what I want to see since this is a regression on perf, how much it is relative to where we are. yes these seem like tiny numbers, but so did the other ones. |
Thanks, had missed that! So here is a comparison of the three engines: And here is the figure equivalent to that in #16324 (comment) : |
ok from a real world perspective then for < 1000 is slower, but how much in total time? |
Assuming you are asking for the difference between UInt and Object in the "warm combination" case (dashed lines), we are in the order of 100 μs:
(this is tested with more iterations than the figure above, see the updated notebook) |
pandas/_libs/index.pyx
Outdated
@@ -40,6 +40,10 @@ cdef inline is_definitely_invalid_key(object val): | |||
or PyList_Check(val) or hasattr(val, '_data')) | |||
|
|||
|
|||
def is_definitely_invalid_key(val): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easier to just make the original cpdef bint is_definitely_invalid_key(object val)
? Plus small perf improvement from adding the bint type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible... I have no idea of how cdef bint
compares to cdef inline
, which should also bring a small performance improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main motivation would be to avoid needing both a is_definitely_invalid_key
and _is_definitely_invalid_key
. That comes from the cpdef
part.
bint and inline are not mutually exclusive, so even if you don't use the cpdef suggestion, changing cdef inline
to cdef inline bint
is an improvement regardless.
That said, I expect the performance impact to be indistinguishable from noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(had missed the "p" in cpdef
!)
pandas/core/indexes/multi.py
Outdated
@@ -50,6 +50,101 @@ | |||
target_klass='MultiIndex or list of tuples')) | |||
|
|||
|
|||
class BaseMultiIndexCodesEngine(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If cython multiple inheritance is a problem, this can still go in cython and be a cdef class
and the subclasses below can still inherit from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid I'm not following you: I have no problem in having the subclasses below inherit from this, but rather of also inheriting from their base classes.
The only way out I see is a template analogous to index_class_helper.pxi.in
... but I really don't think it is worth the added complexity.
(My preference would be to not cythonize anyway, because I doubt any measurable gain can be attained, given that the cost overwhelmingly comes from numpy and Index.get_loc()
calls - but I'm ready to be convinced I'm wrong)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, now I understand, you're suggesting to keep the two subclasses in Python. Makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... well, except that
- apart from
cdef class
, the only internal method which could in principle be acdef
(_codes_to_ints
) is precisely the one which would remain inmulti.py
(because it's type-specific) - I would have to import
pandas.core.dtypes.missing.isna
insideindex.pyx
(maybe harmless, but I see all other current imports are lower level)
... but if you think this second point is not a problem, I will give it a try.
544683d
to
b3ba47c
Compare
@jbrockmendel's suggestion to cythonize did help reducing a bit the performance decrease:
|
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -318,6 +318,10 @@ Indexing | |||
- Bug in :func:`MultiIndex.get_level_values` which would return an invalid index on level of ints with missing values (:issue:`17924`) | |||
- Bug in :func:`MultiIndex.remove_unused_levels` which would fill nan values (:issue:`18417`) | |||
- Bug in :func:`MultiIndex.from_tuples`` which would fail to take zipped tuples in python3 (:issue:`18434`) | |||
- Bug in :func:`MultiIndex.get_loc`` which would fail to automatically cast values between float and int (:issue:`18818`, :issue:`15994`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you split all of the MultiIndex out to a separate sub-section (in bug fixes)
pandas/_libs/index.pyx
Outdated
but use the IndexEngine for computation | ||
cdef class BaseMultiIndexCodesEngine(object): | ||
def __init__(self, levels, labels, offsets, **kwargs): | ||
self._levels = levels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these don't need to be private (levels & offsets)
you can type them (offsets is an uint64_t array?)
furthermore these can be made readonly as well (you can set them once), see how we do this in Timestamp for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I add
cdef readonly:
object levels
ndarray offsets
I get
In [1]: import pandas as pd
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-1-7dd3504c366f> in <module>()
----> 1 import pandas as pd
/home/nobackup/repo/pandas/pandas/__init__.py in <module>()
40 import pandas.core.config_init
41
---> 42 from pandas.core.api import *
43 from pandas.core.sparse.api import *
44 from pandas.stats.api import *
/home/nobackup/repo/pandas/pandas/core/api.py in <module>()
8 from pandas.core.dtypes.missing import isna, isnull, notna, notnull
9 from pandas.core.categorical import Categorical
---> 10 from pandas.core.groupby import Grouper
11 from pandas.io.formats.format import set_eng_float_format
12 from pandas.core.index import (Index, CategoricalIndex, Int64Index,
/home/nobackup/repo/pandas/pandas/core/groupby.py in <module>()
46 from pandas.core.base import (PandasObject, SelectionMixin, GroupByError,
47 DataError, SpecificationError)
---> 48 from pandas.core.index import (Index, MultiIndex,
49 CategoricalIndex, _ensure_index)
50 from pandas.core.categorical import Categorical
/home/nobackup/repo/pandas/pandas/core/index.py in <module>()
1 # flake8: noqa
----> 2 from pandas.core.indexes.api import *
3 from pandas.core.indexes.multi import _sparsify
/home/nobackup/repo/pandas/pandas/core/indexes/api.py in <module>()
6 InvalidIndexError) # noqa
7 from pandas.core.indexes.category import CategoricalIndex # noqa
----> 8 from pandas.core.indexes.multi import MultiIndex # noqa
9 from pandas.core.indexes.interval import IntervalIndex # noqa
10 from pandas.core.indexes.numeric import (NumericIndex, Float64Index, # noqa
/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in <module>()
52
53 class MultiIndexUIntEngine(libindex.BaseMultiIndexCodesEngine,
---> 54 libindex.UInt64Engine):
55 """
56 Manage a MultiIndex by mapping label combinations to positive integers.
TypeError: multiple bases have instance lay-out conflict
pandas/_libs/index.pyx
Outdated
if hasattr(values, 'values'): | ||
values = values.values | ||
return super(MultiIndexObjectEngine, self).get_indexer(values) | ||
# Map each combination to an integer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explain how you are doing the mapping, give an example in the comments, this should be explained as much as possible
pandas/_libs/index.pyx
Outdated
return super(MultiIndexObjectEngine, self).get_indexer(values) | ||
# Map each combination to an integer | ||
codes = (np.array(labels, dtype='int64').T + 1).astype('uint64') | ||
lab_ints = self._codes_to_ints(codes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
de-privatize method names, spell them out, e.g.codes_to_integer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is is the only method name starting with _
as it's not meant to be used from outside (like e.g. IndexEngine._maybe_get_bool_indexer
). Why should it not be private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand but that a historic effect. These classes themselves are totally private, so making the methods is overkill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it would make it clear that e.g. BaseMultiIndexCodesEngine.get_loc
is meant to be used from Index
(internal) code, while BaseMultiIndexCodesEngine._codes_to_ints
is only meant to be used from code specific to the engine. But fine, will change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @toobaz: if something is a helper method that is only used within the class, make this clear by using a _...
name (also for within the codebase is it useful to know which method is 'internally' public and which not, even if the class itself is fully private)
pandas/_libs/index.pyx
Outdated
provide the same interface as the MultiIndexEngine | ||
but use the IndexEngine for computation | ||
cdef class BaseMultiIndexCodesEngine(object): | ||
def __init__(self, levels, labels, offsets, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type the arguments here as well
pandas/_libs/index.pyx
Outdated
if not isinstance(key, tuple): | ||
raise KeyError(key) | ||
try: | ||
idces = [0 if checknull(v) else self._levels[l].get_loc(v) + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use abbreviations (e.g. idces)
pandas/core/indexes/multi.py
Outdated
_base = libindex.UInt64Engine | ||
|
||
def _codes_to_ints(self, codes): | ||
# Shift: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc strings & explanation
pandas/core/indexes/multi.py
Outdated
@@ -691,16 +721,15 @@ def _get_level_number(self, level): | |||
|
|||
@cache_readonly | |||
def _engine(self): | |||
# Find powers of 2 which dominate level sizes - including -1 for NaN: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more explanation
write out the lev_bits in a more clear way
pandas/core/indexes/multi.py
Outdated
# Find powers of 2 which dominate level sizes - including -1 for NaN: | ||
lev_bits = np.cumsum(np.ceil(np.log2([len(l) + 1 for l in | ||
self.levels[::-1]])))[::-1] | ||
offsets = np.concatenate([lev_bits[1:], [0]]).astype('uint') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you sure about uint here? (and not uint64)?
pandas/tests/indexes/test_multi.py
Outdated
@pytest.mark.parametrize('level', [0, 1]) | ||
@pytest.mark.parametrize('null_val', [np.nan, pd.NaT, None]) | ||
def test_get_loc_nan(self, level, null_val): | ||
# GH 18485 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explain what you are testing
Waiting to rebase on #19054 before pushing again |
b3ba47c
to
b152cb8
Compare
New asv run shows some small additional improvement:
|
b152cb8
to
36b9cee
Compare
@jreback ping |
f9d6970
to
1a71ffb
Compare
Added, ping |
if you can rebase and ping on green, ok to merge. |
closes pandas-dev#18519 closes pandas-dev#18818 closes pandas-dev#18520 closes pandas-dev#18485 closes pandas-dev#15994 closes pandas-dev#19086
1a71ffb
to
468bb08
Compare
@jreback rebased, ping |
@toobaz Great work! |
this is failing the 32 bit builds if u can issue a follow up to fix |
closes #18519
closes #18818
closes #18520
closes #18485
closes #15994
closes #19086
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR provides a cleaner and more robust
MultiIndex
engine.asv benchmarks
... clearly show that the engine increases performance for large indexes, and significantly decreases performance for
get_loc
on small/medium indexes. For reference, on thesmall
test case (10 elements), 16 μs. are lost on eachget_loc
call, while on thelarge
test case (1 million elements), 146 μs. are gained on eachget_loc
call. I think the tradeoff is acceptable, also considering that with this approach, any improvement to flat indexes automatically transfers toMultiIndex
. But most importantly, I think we have no other options to fix several annoying open issues (and more bugs - I'm sure - which were never reported) about incoherence between flat indexes andMultiIndex
.Notice that moving the engine into
index.pyx
is problematic because cython does not support multiple inheritance. Anyway, trying to cythonize a couple of methods brought no gain at all (as the overhead is mostly due to the multipleget_loc(val)
calls on the different levels).