PERF: high memory in MI #15245
Conversation
pandas/indexes/multi.py
Outdated
# fast-path
if not self.levels[0].is_monotonic:
    return False
If someone comes up with a 'nice' way to do this (from a list of arrays), would be great.
@jreback are you trying to do this just from self.labels and self.levels, or are you okay with materializing the level values arrays? If the latter, I think you can do this with np.lexsort:
# lexsort_test.py
import numpy as np
import pandas as pd

sorted_index = pd.MultiIndex.from_product([[0, 1, 2], ['a', 'b', 'c']])
unsorted_first = pd.MultiIndex.from_product([[0, 2, 1], ['a', 'b', 'c']])
unsorted_second = pd.MultiIndex.from_product([[0, 1, 2], ['a', 'c', 'b']])
duplicates = pd.MultiIndex.from_product([[0, 1, 2], ['b', 'b', 'b']])

def level_values(mi, num):
    return np.take(mi.levels[num], mi.labels[num])

def int64_is_monotonic(a):
    # This is needlessly expensive in the non-monotonic case. I assume this is
    # already implemented efficiently somewhere else in pandas.
    return (np.diff(a) >= 0).all()

def multi_index_is_monotonic(mi):
    # reversed() because lexsort() wants the most significant key last.
    values = [level_values(mi, i) for i in reversed(range(len(mi.levels)))]
    sort_order = np.lexsort(values)
    return int64_is_monotonic(sort_order)

print("Sorted: ", multi_index_is_monotonic(sorted_index))
print("Sorted With Duplicates:", multi_index_is_monotonic(duplicates))
print("Unsorted First:", multi_index_is_monotonic(unsorted_first))
print("Unsorted Second:", multi_index_is_monotonic(unsorted_second))
$ python lexsort_test.py
Sorted: True
Unsorted First: False
Unsorted Second: False
Sorted With Duplicates: True
One caveat here is that I'm not 100% sure that lexsort is guaranteed to be stable, which I think you'd need to avoid false negatives in the case where there are duplicates. The docs for lexsort itself don't seem to mention stability, but the See Also from argsort says: lexsort : Indirect stable sort with multiple keys.
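A quick sanity check (not a proof) of that stability assumption, using fully duplicated keys:

import numpy as np

# all keys identical: a stable indirect sort must return 0, 1, ..., n-1
keys = np.zeros(10, dtype=np.int64)
print(np.lexsort((keys,)))   # expect [0 1 2 ... 9] if lexsort is stable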
before
In [1]: i = MultiIndex.from_product([np.arange(1000), np.arange(1000)], names=['one','two'])
In [2]: %time i.is_monotonic
CPU times: user 620 ms, sys: 42.1 ms, total: 662 ms
Wall time: 662 ms
after
In [2]: %time i.is_monotonic
CPU times: user 67.1 ms, sys: 14.1 ms, total: 81.2 ms
Wall time: 80.8 ms
@ssanderson thanks! this seems to pass, pushed up a commit with this. for sure an improvement.
Codecov Report
@@ Coverage Diff @@
## master #15245 +/- ##
==========================================
+ Coverage 90.34% 90.36% +0.01%
==========================================
Files 135 135
Lines 49378 49425 +47
==========================================
+ Hits 44613 44661 +48
+ Misses 4765 4764 -1
Continue to review full report at Codecov.
from pandas.tools.hashing import hash_tuples
key = self.mi._as_valid_indexing_key(key)
value = hash_tuples(key)
k = kh_get_uint64(self.table, key)
value is not used here?
oops, typo
value = hash_tuples(key)

# should check nlevels == len tuple
k = kh_get_uint64(self.table, value)
how does this handle hash collisions?
not sure. haven't really looked into how khash resolves collisions.
Well, the issue is that the keys in the table are now integers, so you can resolve collisions but only collisions on the hashed value. So when you have a collision you don't have the original tuple available to compare against for equality. So you need a way to say "give me the tuple for the value that's currently occupying this slot" to see if it's equal to the tuple you just hashed.
I think there is a 1-1 mapping on tuple -> hashed value. IOW, these have a unique hashing guarantee. This is quite similar to the original impl where the PyObject pointer was the hashing value (which also is unique). So I am not sure that it is possible to actually have collisions (sure you can have collisions inside the hashtable itself, but we don't directly manage that, we simply ask for the value at the hash and get it).
Or am I not understanding your point?
Having not looked at the details of pandas.tools.hashing too carefully (so take with a grain of salt), it seems very difficult to establish a tuple to uint64_t mapping that cannot produce collisions (I'm not sure how you could prove it). The probability of collisions may be extremely small, but they are still possible.
here are some references on the hashing schemes:
python using siphash
siphash
blog by ameuer about hashing
So the short answer is, we don't deal with collisions at all. This is different from hash(), which by definition has some idiosyncratic behavior built in (e.g. -1 is not a possible value), so it must deal with collisions.
Note that PyObject hashing doesn't deal with collisions either, though I suppose since it is memory-address based it is by definition 1-1 with the tuple.
For a unique MI, I don't think this is actually a problem in practice, and if it shows up it is a bug. A tuple should hash 1-1 to its value. I suppose there is a theoretical possibility that this is violated.
So I suppose I could check, and I think it would be efficient (e.g. only constructing the wanted tuples as needed), and raise if this is the case. Let me see about that.
OTOH, a duplicated MI by definition will not work with this scheme (in fact we don't use this indexer directly for a duplicated MI).
@mikegraham any light?
0dae56a should provide a run-time assertion for this.
it will slow down worst case checking a bit (still about 2x faster than master)
In [1]: i = MultiIndex.from_product([np.arange(1000), np.arange(1000)], names=['one','two'])
In [2]: v = i.values
In [3]: %time i.get_indexer(v)
CPU times: user 356 ms, sys: 84.4 ms, total: 441 ms
Wall time: 441 ms
Out[3]: array([ 0, 1, 2, ..., 999997, 999998, 999999])
but this type of hit would only occur if you are actually reindexing from tuples themselves (and not an actual MI).
pandas/tests/indexes/test_multi.py
Outdated
for i in [0, 1, len(index)-2, len(index)-1]:
    result = index.get_loc(index[i])
    self.assertEqual(result, i)
collisions could come from other tuples. rare, but possible
yep, that's why this is a non-smoke test :>
if you have any better ideas on how to test, all ears.
The only fool-proof way to prevent mistaken lookups is to compare the tuples for equality -- keeping around big arrays of tuples is how we got into this mess in the first place. So you can
- compute the hashed key value
- look up the index location
- construct the tuple for that index location
- compare the tuple in the index with your tuple key
- if they are unequal, raise a KeyError
With reindex operations, it gets more expensive. But at least we aren't keeping around a huge internal array of tuples like we were before.
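A minimal, self-contained sketch of that hash-then-verify flow (an illustration only: Python's built-in hash() and a plain dict stand in for the uint64 hashing and the khash table used in the PR, and the helper names are hypothetical):

import pandas as pd

def build_table(mi):
    # hashed key -> integer location (assumes a unique MultiIndex);
    # colliding tuples would silently overwrite each other here, which is
    # exactly why the verify step below is needed
    return {hash(t): i for i, t in enumerate(mi)}

def checked_get_loc(mi, table, key):
    loc = table.get(hash(key))
    if loc is None:
        raise KeyError(key)
    # rebuild only the single tuple at that location and compare it for
    # equality, so a hash collision cannot return the wrong row
    if mi[loc] != key:
        raise KeyError(key)
    return loc

mi = pd.MultiIndex.from_product([[0, 1, 2], ['a', 'b']])
table = build_table(mi)
assert checked_get_loc(mi, table, (1, 'b')) == 3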
I already pushed code to do exactly this. 99f7cc1
I do a run-time check if there is a collision by comparison. It actually turned up a bug (separate)
ok, updated. had an insidious condition in ...
pandas/indexes/multi.py
Outdated
# reversed() because lexsort() wants the most significant key last.
values = [level_values(i) for i in reversed(range(len(self.levels)))]
sort_order = np.lexsort(values)
return Index(sort_order).is_monotonic
@ssanderson this avoids the worst case scenario as this is a short-circuit operator (in cython)
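For reference, a rough sketch of how the pieces fit together (this mirrors the diff above rather than the PR's exact code; the MultiIndex attribute was called .labels in the pandas of this era and .codes later):

import numpy as np
from pandas import Index

def mi_is_monotonic(mi):
    codes = getattr(mi, "codes", None)
    if codes is None:            # older pandas calls this attribute .labels
        codes = mi.labels
    # materialize each level's values; -1 codes (missing values) are not handled here
    # reversed() because np.lexsort() wants the most significant key last
    values = [np.take(np.asarray(mi.levels[i]), codes[i])
              for i in reversed(range(len(mi.levels)))]
    try:
        sort_order = np.lexsort(values)
    except TypeError:
        # mixed types (e.g. NaN next to strings) make lexsort unhappy; fall back
        return Index(mi.values).is_monotonic
    # Index.is_monotonic runs in cython and can stop at the first out-of-order
    # element, unlike (np.diff(sort_order) >= 0).all()
    return Index(sort_order).is_monotonic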
pandas/indexes/multi.py
Outdated
unique = self.levels[level]
labels = self.labels[level]
return algos.take_1d(unique.values, labels,
                     fill_value=unique._na_value)
Can you have null values in the uniques here? If the level values contain NaNs, then I imagine the behavior of lexsort is undefined, which means you could get false positives.
for most dtypes this is not a problem as we fill with iNaT, for example, in i8; only for floats could this be an issue, I suppose
do you have an example that fails?
and I just added the try/except which will catch type errors and fall back
I think the case where this could potentially fail currently is something like:
pd.MultiIndex.from_product([[1.0, np.nan, 2.0], ['a', 'b', 'c']])
You've made some interesting life choices if you've got NaNs in your index (or really if you're intentionally using a float index at all IMO), but I'd expect is_monotonic to be reliably False in this case, which I don't think my proposed implementation guarantees, since any sorting algorithm is going to return garbage if the inputs contain NaN.
yeah, added as a test. it works, but is subject to numpy's sorting order for nan.
ok this is all set. any more comments?
small comment, but won't have time to look more in detail the coming days
pandas/indexes/base.py
Outdated
@@ -1429,6 +1429,10 @@ def inferred_type(self):
    """ return a string of the type inferred from the values """
    return lib.infer_dtype(self)

def is_memory_usage_qualified(self):
Can we make this a private method?
yep
Nice performance improvements! Did a review and added some comments (apart from the hashing-related code, that's not my cup of tea)
There is part of __delitem__ that is not covered anymore (https://codecov.io/gh/pandas-dev/pandas/pulls/15245/src/pandas/core/generic.py). This is due to the changed behaviour of __contains__ in the MultiIndexEngine:
In [22]: midx = pd.MultiIndex.from_product([['A', 'B'], [1,2]])
In [23]: 'A' in midx._engine
...
TypeError: must be convertible to a list-of-tuples
In [24]: ('A',) in midx._engine
...
KeyError:
Before this PR, both of the above returned False, taking the shortcut in that code.
The exact behaviour of __contains__ is maybe not that important/public, but this also results in some slight changes in the behaviour of del:
In [32]: df = pd.DataFrame(np.random.randn(4,4), columns=midx)
In [33]: del df['A']
In [34]: df
Out[34]:
B
1 2
0 -1.170604 -0.905548
1 -1.198860 -1.163344
2 2.352208 0.214405
3 -0.520114 -0.702562
In [35]: del df['A']
...
IndexError: index 0 is out of bounds for axis 0 with size 0
## -> the above should be KeyError
In [37]: df = pd.DataFrame(np.random.randn(4,4), columns=midx)
In [38]: del df[('A',)]
...
KeyError:
## -> the above worked previously
I have to say that I never use del, but still, the above changes can be fixed I think.
@@ -656,9 +665,74 @@ def _has_complex_internals(self):
    return True

@cache_readonly
def is_monotonic(self):
Docstring for public function
Something else: if you are updating is_monotonic, should is_monotonic_increasing/decreasing be updated accordingly?
done
pandas/indexes/multi.py
Outdated
# e.g. with embedded tuples, this might fail
# TODO: should prob not allow construction of a MI
# like this in the first place
return Index(self.values).get_loc(key)
This doesn't seem covered by a test
turns out this is some old code (added for another reason). deleted.
pandas/indexes/multi.py
Outdated
return False
if not isinstance(other, tuple):
    return False
other = Index([other])
Is there a reason for this change? I can remember we had some discussion before about whether or not Index.equals should coerce its argument to an index. But it seems a bit strange to do this solely for a single tuple.
this is to allow equivalence testing of a tuple and a MI, which is basically a list-of-tuples (though the impl is different).
I don't really understand your comment here. You mean you want pd.MultiIndex.from_tuples([('a', 1)]).equals(('a', 1)) to be True? But why? Also, for a single-level Index, a 1-length Index does not evaluate as equal to a scalar with the same value.
pandas/tests/indexes/test_multi.py
Outdated
@@ -1425,8 +1448,6 @@ def test_equals_multi(self):

    self.assertFalse(self.index.equals(self.index[:-1]))

    self.assertTrue(self.index.equals(self.index._tuple_index))
Still test this using pd.Index(self.index.values) instead of self.index._tuple_index? (unless this is covered somewhere else to test equality of MI against object Index of tuples).
_tuple_index is gone, that's part of the reason for this PR
Yes, I understand that, but we still want to test the fact that Index of tuples evaluates as equal.
added test. And they don't evaluate as equal (nor did they in master). They are equivalent, but not equal (you have to wrap it in a MI to make it equal, NOT an Index).
In [1]: major_axis = Index(['foo', 'bar', 'baz', 'qux'])
...: minor_axis = Index(['one', 'two'])
...:
...: major_labels = np.array([0, 0, 1, 2, 3, 3])
...: minor_labels = np.array([0, 1, 0, 1, 0, 1])
...: index=MultiIndex(levels=[major_axis, minor_axis],
...: labels=[major_labels, minor_labels
...: ], names=['foo','bar'],
...: verify_integrity=False)
...:
In [2]: index
Out[2]:
MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two']],
labels=[[0, 0, 1, 2, 3, 3], [0, 1, 0, 1, 0, 1]],
names=['foo', 'bar'])
In [3]: index.equals(index.values)
Out[3]: False
In [4]: index.equals(MultiIndex.from_tuples(index.values))
Out[4]: True
In [6]: from pandas.core.index import _ensure_index
In [7]: index.equals(_ensure_index(index.values))
Out[7]: True
In [8]: index.equals(Index(index.values))
Out[8]: True
so I could make them compare equal (let me try), it's more consistent this way
Sorry, that was not what I meant. I would just leave it as it was (we had a discussion about this before, and explicitly changed arrays to no longer evaluate as equal to Index objects). The behaviour was already good; my only question was about keeping the test (but with updated syntax of course).
See the discussion here: #13986 (comment). We changed Index.equals behaviour to return False for everything that is not an Index. I was not in favor of that change, but since we only changed that in 0.19.0, we should keep to that decision I think.
hmm, I remember that. Though [25] should not be True then (though I guess it's an index, and NOT a list-like). But it's NOT an MI either....
ok fixed back to original (with my small mod)
Though [25] should not be True then (though I guess it's an index, and NOT a list-like). But it's NOT an MI either....
Yeah, you are right. I would not really have a problem with changing this to False (the comparison of MI with an object index of tuples). That seems to follow our previous changes of making equals more strict.
assert not ind.is_monotonic
self.assertFalse(ind.is_monotonic)

def test_is_monotonic(self):
also include some tests for is_monotonic_increasing and is_monotonic_decreasing?
@jorisvandenbossche fixed up.
assert len(df.columns) == 2

with pytest.raises(KeyError):
    del df[('A',)]
Is it the purpose that this also raises if the A key is still in df? If so, I would put this before del df['A']. Otherwise we also need to test that this works on the initial df.
They already compare equal, also on your PR branch. So nothing to change (only the test).
(github is being annoying, and I cannot find this thread in the web interface, but the above is about the `midx.equals(Index(midx.values))`)
We have this in 0.19.2 / master
In [5] certainly seems like a bug. Possibly because it's looking in the full level and not the actually used one?
with pytest.raises(KeyError):
    del df[('A',)]

assert 'A' in df.columns
As in my other comment, shouldn't we regard this as a bug?
Not needed to fix in this PR, but I would at least add a FIXME comment here so it is clear this test may change
This is an old / long-discussed issue. I'll annotate the tests though.
# make sure take is not using -1
i = pd.MultiIndex.from_tuples([(0, pd.NaT),
                               (0, pd.Timestamp('20130101'))])
result = i[0:1].equals(i[0])
I assume this is the test related to my other comment. So what I also said there, is that this equivalence is not true for single-level indices:
In [15]: i = pd.Index([1,2,3])
In [16]: i[0:1]
Out[16]: Int64Index([1], dtype='int64')
In [17]: i[0:1].equals(i[0])
Out[17]: False
So why making this different for MultiIndex ?
was to make things a bit easier on the collision check, but you are right, not necessary (or correct), reverted.
LGTM modulo other pending comments
except TypeError:

    # we have mixed types and np.lexsort is not happy
    return Index(self.values).is_monotonic
I presume the memory use / perf here is no worse than it was before -- where might this get called unexpectedly (causing possibly ballooning memory use)?
yes, the only thing that hits it is under py3 with something like this (e.g. with a np.nan in a level):
In [1]: mi = MultiIndex(levels=[[1, 2, 3, 4], ['gb00b03mlx29', 'lu0197800237', 'nl0000289783', 'nl0000289965', 'nl0000301109']],
...: labels=[[0, 1, 1, 2, 2, 2, 3], [4, 2, 0, 0, 1, 3, -1]],
...: names=['household_id', 'asset_id'])
In [2]: mi.is_monotonic
Out[2]: False
In [3]: values = [mi._get_level_values(i) for i in reversed(range(len(mi.levels)))]
In [4]: values
Out[4]:
[array(['nl0000301109', 'nl0000289783', 'gb00b03mlx29', 'gb00b03mlx29',
'lu0197800237', 'nl0000289965', nan], dtype=object),
array([1, 2, 2, 3, 3, 3, 4])]
In [5]: np.lexsort(values)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-b1af67214d96> in <module>()
----> 1 np.lexsort(values)
TypeError: unorderable types: float() > str()
self._check_for_collisions(locs, key)
return True

return False
Here, perhaps

try:
    self.getitem(key)
    return True
except (KeyError, ValueError, TypeError):
    return False
done
all changes pushed. Will merge on pass sometime tomorrow unless any more comments.
closes pandas-dev#13904
BUG: a qualifier (+) would always display with a MultiIndex, regardless of whether it needed deep introspection for memory usage
PERF: rework MultiIndex.is_monotonic as per @ssanderson's idea
neat article about hash collisions: https://medium.com/@robertgrosse/generating-64-bit-hash-collisions-to-dos-python-5b21404a5306#.wi6ivpqtv (more about cryptographic hashing, but still neat)
closes pandas-dev#13904
Creates an efficient MultiIndexHashTable in cython. This allows us to efficiently store a multi-index for fast indexing (.get_loc() and .get_indexer()), compared with the current tuple-based (and GIL-holding) use of the PyObject hash table. This uses the pandas.tools.hashing routines to hash each of the 'values' of a MI to a single uint64. So this makes MI more memory friendly and much more efficient. You get these speedups because the creation of the hashtable is now much more efficient.
Author: Jeff Reback <[email protected]>
Closes pandas-dev#15245 from jreback/mi and squashes the following commits:
7df6c34 [Jeff Reback] PERF: high memory in MI
closes #13904
Creates an efficient MultiIndexHashTable in cython. This allows us to efficiently store a multi-index for fast indexing (.get_loc() and .get_indexer()), compared with the current tuple-based (and GIL-holding) use of the PyObject hash table.

This uses the pandas.tools.hashing routines to hash each of the 'values' of a MI to a single uint64. So this makes MI more memory friendly and much more efficient. You get these speedups because the creation of the hashtable is now much more efficient.

This doesn't quite close the above issue yet. Still working on getting rid of all/most of the actual instantiation of the array-of-tuples, but that's the next stage.
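As a rough back-of-the-envelope illustration of the memory argument (my own numbers, not from the PR; exact object sizes vary by CPython build):

import sys

n = 1000000                        # rows in the MultiIndex
hashed_bytes = n * 8               # one uint64 hash per row

# one 2-tuple object per row (PyObject header + 2 item pointers), ignoring
# the boxed elements themselves, which only add to the tuple-based cost
tuple_bytes = n * sys.getsizeof((1, 2))

print("hashed keys  : %.0f MB" % (hashed_bytes / 1e6))
print("tuple objects: %.0f MB (plus the elements they point to)" % (tuple_bytes / 1e6))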