PERF: high memory in MI #15245


Closed · jreback wants to merge 1 commit into pandas-dev:master from jreback:mi

Conversation

@jreback (Contributor) commented Jan 27, 2017

closes #13904

Creates an efficient MultiIndexHashTable in cython. This allows us to efficiently store a multi-index for fast indexing (`.get_loc()` and `.get_indexer()`), replacing the current tuple-based (and GIL-holding) use of the PyObject hash table.

This uses the pandas.tools.hashing routines to hash each of the 'values' of a MI to a single uint64.

So this makes a MultiIndex more memory friendly and much more efficient. You get these speedups because the creation of the hashtable is now much more efficient.

    before     after       ratio
  [48fc9d61] [f47131ce]
-  871.05ms   423.93ms      0.49  reindex.Reindexing.time_reindex_multiindex
-  230.31ms    72.51ms      0.31  indexing.MultiIndexing.time_is_monotonic
-  747.11ms   199.49ms      0.27  indexing.MultiIndexing.time_multiindex_get_indexer

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

This doesn't quite close the above issue yet. Still working on getting rid of all/most of the actual instantiation of the array-of-tuples, but that's the next stage.
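
For readers following along, a minimal sketch of the idea (not the PR's cython internals): each MultiIndex entry is hashed to a single uint64, and lookups go through those hashes instead of materialized tuples. The helper is exposed today as pandas.util.hash_pandas_object (the module was pandas.tools.hashing at the time of this PR).

# illustration_hashing.py -- a hedged sketch, not the PR's implementation
import numpy as np
import pandas as pd
from pandas.util import hash_pandas_object

mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000)],
                                names=['one', 'two'])

# one uint64 per MI entry; this is the kind of value the cython
# MultiIndexHashTable stores instead of materialized tuples
hashes = hash_pandas_object(mi)
print(hashes.dtype, len(hashes))          # uint64 1000000

# the fast paths this PR targets
print(mi.get_loc((5, 7)))                 # scalar lookup -> 5007
print(mi.get_indexer([(0, 1), (2, 3)]))   # vectorized lookup -> [   1 2003]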

@jreback jreback added MultiIndex Performance Memory or execution speed performance labels Jan 27, 2017
# fast-path
if not self.levels[0].is_monotonic:
    return False

Contributor Author (jreback):

If someone comes up with a 'nice' way to do this (from a list of arrays), would be great.

@ssanderson (Contributor), Jan 31, 2017:

@jreback are you trying to do this just from self.labels and self.levels, or are you okay with materializing the level values arrays? If the latter, I think you can do this with np.lexsort:

# lexsort_test.py

import numpy as np
import pandas as pd

sorted_index = pd.MultiIndex.from_product([[0, 1, 2], ['a', 'b', 'c']])
unsorted_first = pd.MultiIndex.from_product([[0, 2, 1], ['a', 'b', 'c']])
unsorted_second = pd.MultiIndex.from_product([[0, 1, 2], ['a', 'c', 'b']])
duplicates = pd.MultiIndex.from_product([[0, 1, 2], ['b', 'b', 'b']])


def level_values(mi, num):
    return np.take(mi.levels[num], mi.labels[num])


def int64_is_monotonic(a):
    # This is needlessly expensive in the non-monotonic case. I assume this is
    # already implemented efficiently somewhere else in pandas.
    return (np.diff(a) >= 0).all()


def multi_index_is_monotonic(mi):
    # reversed() because lexsort() wants the most significant key last.
    values = [level_values(mi, i) for i in reversed(range(len(mi.levels)))]
    sort_order = np.lexsort(values)
    return int64_is_monotonic(sort_order)


print("Sorted: ", multi_index_is_monotonic(sorted_index))
print("Sorted With Duplicates:", multi_index_is_monotonic(duplicates))
print("Unsorted First:", multi_index_is_monotonic(unsorted_first))
print("Unsorted Second:", multi_index_is_monotonic(unsorted_second))
$ python lexsort_test.py
Sorted:  True
Sorted With Duplicates: True
Unsorted First: False
Unsorted Second: False

One caveat here is that I'm not 100% sure that lexsort is guaranteed to be stable, which I think you'd need to avoid false negatives in the case where there are duplicates. The docs for lexsort itself don't seem to mention stability, but the See Also from argsort says: lexsort : Indirect stable sort with multiple keys.
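
A quick check of the stability question: numpy documents lexsort as an indirect stable sort (per the argsort "See Also" entry quoted above), so ties on all keys keep their original relative order, which is what the duplicates case relies on.

import numpy as np

primary = np.array([1, 1, 1, 0])
secondary = np.array([5, 5, 5, 5])        # every row ties on the secondary key
order = np.lexsort((secondary, primary))  # most significant key passed last
print(order)  # [3 0 1 2] -- the tied rows 0, 1, 2 stay in input order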

@jreback (Contributor, Author), Jan 31, 2017:

before

In [1]: i = MultiIndex.from_product([np.arange(1000), np.arange(1000)], names=['one','two'])

In [2]: %time i.is_monotonic
CPU times: user 620 ms, sys: 42.1 ms, total: 662 ms
Wall time: 662 ms

after

In [2]: %time i.is_monotonic
CPU times: user 67.1 ms, sys: 14.1 ms, total: 81.2 ms
Wall time: 80.8 ms

@ssanderson thanks! this seems to pass, pushed up a commit with this. for sure an improvement.

@jreback jreback changed the title WIP: high memory in MI PERF: high memory in MI Jan 27, 2017
@jreback jreback force-pushed the mi branch 2 times, most recently from e070cf0 to 0411455 Compare January 27, 2017 20:41
@codecov-io commented Jan 27, 2017

Codecov Report

Merging #15245 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15245      +/-   ##
==========================================
+ Coverage   90.34%   90.36%   +0.01%     
==========================================
  Files         135      135              
  Lines       49378    49425      +47     
==========================================
+ Hits        44613    44661      +48     
+ Misses       4765     4764       -1
Impacted Files Coverage Δ
pandas/types/cast.py 85.42% <ø> (ø)
pandas/io/pytables.py 93.9% <100%> (ø)
pandas/core/frame.py 97.87% <100%> (ø)
pandas/indexes/multi.py 96.72% <100%> (+0.14%)
pandas/core/algorithms.py 94.47% <100%> (-0.01%)
pandas/tools/hashing.py 99.01% <100%> (ø)
pandas/indexes/base.py 96.23% <100%> (ø)
pandas/tseries/index.py 95.42% <ø> (+0.09%)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bbb583c...7df6c34. Read the comment docs.

@jreback jreback force-pushed the mi branch 2 times, most recently from 7d430da to e13b5e0 Compare January 30, 2017 14:25
@jreback jreback added this to the 0.20.0 milestone Jan 30, 2017
@jreback (Contributor, Author) commented Jan 30, 2017

from pandas.tools.hashing import hash_tuples
key = self.mi._as_valid_indexing_key(key)
value = hash_tuples(key)
k = kh_get_uint64(self.table, key)
Member:

value is not used here?

Contributor Author (jreback):

oops, typo

value = hash_tuples(key)

# should check nlevels == len tuple
k = kh_get_uint64(self.table, value)
Member:

how does this handle hash collisions?

Contributor Author (jreback):

not sure. haven't really looked into how khash resolves collisions.

Member:

Well, the issue is that the keys in the table are now integers, so you can resolve collisions but only collisions on the hashed value. So when you have a collision you don't have the original tuple available to compare against for equality. So you need a way to say "give me the tuple for the value that's currently occupying this slot" to see if it's equal to the tuple you just hashed.

Contributor Author (jreback):

I think there is a 1-1 mapping of tuple -> hashed value. IOW, these have a unique hashing guarantee. This is quite similar to the original impl where the PyObject pointer was the hashing value (which also is unique). So I am not sure that it is possible to actually have collisions (sure, you can have collisions inside the hashtable itself, but we don't directly manage that; we simply ask for the value at the hash and get it).

Or am I not understanding your point?

Member:

Having not looked at the details of pandas.tools.hashing too carefully (so take this with a grain of salt), it seems very difficult to establish a tuple to uint64_t mapping that cannot produce collisions (I'm not sure how you could prove it). The probability of collisions may be extremely small, but they are still possible.
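
For a rough sense of scale, the standard birthday approximation puts the probability of any collision among n distinct keys hashed uniformly into 64 bits at roughly n^2 / 2^65. This is a back-of-the-envelope estimate only; it assumes the uint64 hashes behave as if uniformly random, which the pandas hashing does not formally guarantee.

def collision_probability(n, bits=64):
    # birthday approximation: p ~= n * (n - 1) / 2 ** (bits + 1)
    return n * (n - 1) / 2.0 ** (bits + 1)

for n in (10**6, 10**8, 10**9):
    print(n, collision_probability(n))
# ~2.7e-08 at a million rows, ~2.7e-04 at 100 million, ~2.7e-02 at a billion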

@jreback (Contributor, Author), Jan 31, 2017:

here are some references on the hashing schemes:

python using siphash
siphash
blog by ameuer about hashing

So the short answer is, we don't deal with collisions at all. This is different from hash(), which by definition must deal with collisions, as it has some idiosyncratic behavior built in (e.g. -1 is not a possible hash value).

Note that PyObject hashing doesn't deal with collisions either, though I suppose since it's memory-address based it is by definition 1-1 with the tuple.

For a unique MI, I don't think in practice this is actually a problem, and if it shows up it is a bug. A tuple should hash 1-1 to its hashed value. I suppose there is a theoretical possibility that this is violated.

So I suppose I could check, and I think it would be efficient (e.g. only constructing the wanted tuples as needed), and raise if this is the case. Let me see about that.

OTOH, a duplicated MI by definition will not work with this scheme (in fact we don't use this indexer directly for a duplicated MI).

@mikegraham any light?
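
As a small aside on the hash() quirk mentioned above: CPython reserves -1 as an error code for object hashing, so any object whose hash would be -1 is silently remapped to -2.

# CPython detail only; unrelated to the uint64 hashing used in this PR
print(hash(-1))              # -2
print(hash(-2))              # -2
print(hash(-1) == hash(-2))  # True -- a built-in, harmless "collision"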

Contributor Author (jreback):

0dae56a should provide a run-time assertion for this.

Contributor Author (jreback):

it will slow down worst case checking a bit (still about 2x faster than master)

In [1]: i = MultiIndex.from_product([np.arange(1000), np.arange(1000)], names=['one','two'])

In [2]: v = i.values

In [3]: %time i.get_indexer(v)
CPU times: user 356 ms, sys: 84.4 ms, total: 441 ms
Wall time: 441 ms
Out[3]: array([     0,      1,      2, ..., 999997, 999998, 999999])

but this type of hit would only occur if you are actually reindexing from the tuples themselves (and not an actual MI).

@jreback jreback force-pushed the mi branch 2 times, most recently from 9a2cb69 to 0dae56a Compare January 31, 2017 16:28

for i in [0, 1, len(index)-2, len(index)-1]:
    result = index.get_loc(index[i])
    self.assertEqual(result, i)
Member:

collisions could come from other tuples. rare, but possible

Contributor Author (jreback):

yep, that's why this is a non-smoke test :>

if you have any better ideas on how to test, all ears.

Member:

The only fool-proof way to prevent mistaken lookups is to compare the tuples for equality -- keeping around big arrays of tuples is how we got into this mess in the first place. So you can

  • compute the hashed key value
  • look up the index location
  • construct the tuple for that index location
  • compare the tuple in the index with your tuple key
  • if they are unequal, raise a KeyError

With reindex operations, it gets more expensive. But at least we aren't keeping around a huge internal array of tuples like we were before.
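
A minimal pure-Python sketch of the lookup-and-verify scheme described above (the PR does this in cython; build_table and checked_get_loc are hypothetical names used only for illustration, and the plain dict stands in for the khash uint64 table):

import pandas as pd
from pandas.util import hash_pandas_object


def build_table(mi):
    # hypothetical stand-in for the cython uint64 hash table:
    # map each row's uint64 hash to its integer location
    return {h: i for i, h in enumerate(hash_pandas_object(mi))}


def checked_get_loc(mi, table, key):
    # 1. compute the hashed key value
    h = hash_pandas_object(pd.MultiIndex.from_tuples([key])).iloc[0]
    # 2. look up the index location
    try:
        loc = table[h]
    except KeyError:
        raise KeyError(key)
    # 3. construct the tuple for that location and 4. compare it with the key
    if mi[loc] != key:
        # 5. unequal means a hash collision (or a genuinely missing key)
        raise KeyError(key)
    return loc


mi = pd.MultiIndex.from_product([range(3), ['a', 'b']])
table = build_table(mi)
print(checked_get_loc(mi, table, (1, 'b')))  # 3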

Contributor Author (jreback):

I already pushed code to do exactly this. 99f7cc1

I do a run-time check if there is a collision by comparison. It actually turned up a bug (separate)

@jreback (Contributor, Author) commented Jan 31, 2017

ok, updated. Had an insidious condition in MultiIndex.equals with NaT-only tuples. This was a latent bug.

@jreback jreback force-pushed the mi branch 2 times, most recently from 594f653 to 1583099 Compare January 31, 2017 21:35
# reversed() because lexsort() wants the most significant key last.
values = [level_values(i) for i in reversed(range(len(self.levels)))]
sort_order = np.lexsort(values)
return Index(sort_order).is_monotonic
Contributor Author (jreback):

@ssanderson this avoids the worst case scenario as this is a short-circuit operator (in cython)

unique = self.levels[level]
labels = self.labels[level]
return algos.take_1d(unique.values, labels,
                     fill_value=unique._na_value)
Contributor:

Can you have null values in the uniques here? If the level values contain NaNs, then I imagine the behavior of lexsort is undefined, which means you could get false positives.

Contributor Author (jreback):

for most dtypes this is not a problem, as we fill with iNaT (for example for i8); only for floats could this be an issue, I suppose

do you have an example that fails?

Contributor Author (jreback):

and I just added the try/except, which will catch TypeErrors and fall back

Contributor:

I think the case where this could potentially fail currently is something like:

pd.MultiIndex.from_product([[1.0, np.nan, 2.0], ['a', 'b', 'c']])

You've made some interesting life choices if you've got NaNs in your index (or really if you're intentionally using a float index at all IMO), but I'd expect is_monotonic to be reliably False in this case, which I don't think my proposed implementation guarantees, since any sorting algorithm is going to return garbage if the inputs contain NaN.

Contributor Author (jreback):

yeah, added as a test. It works, but is subject to numpy's sort order for nan.
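
For reference, the numpy convention being relied on here: NaN sorts after every other float value, so sorting a float level that contains NaN is deterministic (with NaN treated as the largest key) rather than garbage.

import numpy as np

level = np.array([1.0, np.nan, 2.0])
print(np.sort(level))        # [ 1.  2. nan] -- NaN is placed last
print(np.lexsort((level,)))  # [0 2 1] -- same convention for the indirect sort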

@jreback jreback force-pushed the mi branch 4 times, most recently from f47131c to c3fd7e8 Compare February 2, 2017 13:11
@jreback (Contributor, Author) commented Feb 2, 2017

ok this is all set. any more comments?

@wesm @jorisvandenbossche

@jorisvandenbossche (Member) left a comment:

small comment, but I won't have time to look at this in more detail in the coming days

@@ -1429,6 +1429,10 @@ def inferred_type(self):
        """ return a string of the type inferred from the values """
        return lib.infer_dtype(self)

    def is_memory_usage_qualified(self):
Member:

Can we make this a private method?

Contributor Author (jreback):

yep

@jreback jreback force-pushed the mi branch 2 times, most recently from 6b2001f to 780f1dc Compare February 6, 2017 15:26
@jorisvandenbossche (Member) left a comment:

Nice performance improvements! Did a review and added some comments (apart from the hashing-related code, that's not my cup of tea)

There is part of __delitem__ that is not covered anymore (https://codecov.io/gh/pandas-dev/pandas/pulls/15245/src/pandas/core/generic.py). This is due to the changed behaviour of __contains__ in the MultiIndexEngine:

In [22]: midx = pd.MultiIndex.from_product([['A', 'B'], [1,2]])

In [23]: 'A' in midx._engine
...
TypeError: must be convertible to a list-of-tuples

In [24]: ('A',) in midx._engine
...
KeyError: 

Before this PR, both of the above returned False, taking the shortcut in that code.
The exact behaviour of __contains__ is maybe not that important/public, but this also results in some slight changes in the behaviour of del:

In [32]: df = pd.DataFrame(np.random.randn(4,4), columns=midx)

In [33]: del df['A']

In [34]: df
Out[34]: 
          B          
          1         2
0 -1.170604 -0.905548
1 -1.198860 -1.163344
2  2.352208  0.214405
3 -0.520114 -0.702562

In [35]: del df['A']
...
IndexError: index 0 is out of bounds for axis 0 with size 0

## -> the above should be KeyError

In [37]: df = pd.DataFrame(np.random.randn(4,4), columns=midx)

In [38]: del df[('A',)]
...
KeyError: 

## -> the above worked previously

I have to say that I never use del, but still, the above changes can be fixed I think.

@@ -656,9 +665,74 @@ def _has_complex_internals(self):
        return True

    @cache_readonly
    def is_monotonic(self):
Member:

Docstring for public function

Member:

Something else, if you are updating is_monotonic, should is_monotonic_increasing/decreasing be updated accordingly?

Contributor Author (jreback):

done

# e.g. with embedded tuples, this might fail
# TODO: should prob not allow construction of a MI
# like this in the first place
return Index(self.values).get_loc(key)
Member:

This doesn't seem covered by a test

Contributor Author (jreback):

turns out this is some old code (added for another reason). deleted.

    return False
if not isinstance(other, tuple):
    return False
other = Index([other])
Member:

Is there a reason for this change?
I remember we had some discussion about this before, about whether or not Index.equals should coerce its argument to an Index. But it seems a bit strange to do this solely for a single tuple.

Contributor Author (jreback):

this is to allow equivalence testing of a tuple and a MI, which is basically a list-of-tuples (though the impl is different).

Member:

I don't really understand your comment here.
You mean you want pd.MultiIndex.from_tuples([('a', 1)]).equals(('a', 1)) to be True? But why? Also, for a single-level Index, a 1-length Index does not evaluate equal to a scalar with the same value.

@@ -1425,8 +1448,6 @@ def test_equals_multi(self):

self.assertFalse(self.index.equals(self.index[:-1]))

self.assertTrue(self.index.equals(self.index._tuple_index))
Member:

Still test this using pd.Index(self.index.values) instead of self.index._tuple_index? (unless this is covered somewhere else to test equality of MI against object Index of tuples).

Contributor Author (jreback):

_tuple_index is gone, that's part of the reason for this PR

Member:

Yes, I understand that, but we still want to test the fact that Index of tuples evaluates as equal.

Contributor Author (jreback):

added a test. And they don't evaluate as equal (nor did they in master). They are equivalent, but not equal (you have to wrap it in a MI to make it equal, NOT an Index).

Contributor Author (jreback):

In [1]: major_axis = Index(['foo', 'bar', 'baz', 'qux'])
   ...: minor_axis = Index(['one', 'two'])
   ...:
   ...: major_labels = np.array([0, 0, 1, 2, 3, 3])
   ...: minor_labels = np.array([0, 1, 0, 1, 0, 1])
   ...: index = MultiIndex(levels=[major_axis, minor_axis],
   ...:                    labels=[major_labels, minor_labels],
   ...:                    names=['foo', 'bar'],
   ...:                    verify_integrity=False)
   ...:

In [2]: index
Out[2]: 
MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 2, 3, 3], [0, 1, 0, 1, 0, 1]],
           names=['foo', 'bar'])

In [3]: index.equals(index.values)
Out[3]: False

In [4]: index.equals(MultiIndex.from_tuples(index.values))
Out[4]: True

In [6]: from pandas.core.index import _ensure_index

In [7]: index.equals(_ensure_index(index.values))
Out[7]: True

In [8]: index.equals(Index(index.values))
Out[8]: True

so I could make them compare equal (let me try); it's more consistent this way

Member:

Sorry, that was not what I meant. I would just leave it as it was (we had a discussion about this before, and explicitly changed arrays to no longer evaluate equal to Index objects). The behaviour was already good; my only question was about keeping the test (but with updated syntax, of course).

Member:

See the discussion here: #13986 (comment). We changed Index.equals behaviour to return False for everything that is not an Index. I was not in favor of that change, but since we only changed that in 0.19.0, we should keep to that decision I think

Contributor Author (jreback):

hmm, I remember that. Though [25] should not be True then. (though I guess it's an Index, and NOT a list-like). But it's NOT an MI either....

Contributor Author (jreback):

ok fixed back to original (with my small mod)

Member:

Though [25] should not be True then. (though I guess its an index, and NOT a list-like). But its NOT an MI either....

Yeah, you are right. I would not really have a problem with changing this to False (the comparison of MI with object index of tuples). That seems to follow our previous changes of making equals more strict.

assert not ind.is_monotonic
self.assertFalse(ind.is_monotonic)

def test_is_monotonic(self):
Member:

also include some tests for is_monotonic_increasing and is_monotonic_decreasing?

@jreback (Contributor, Author) commented Feb 11, 2017

@jorisvandenbossche fixed up. __contains__ wasn't tested much, so the behavior wasn't nailed down.

assert len(df.columns) == 2

with pytest.raises(KeyError):
    del df[('A',)]
Member:

Is it intentional that this also raises if the 'A' key is still in df? If so, I would put this before del df['A']. Otherwise, we also need to test that this works on the initial df.

@jorisvandenbossche (Member) commented Feb 11, 2017 via email

@jreback (Contributor, Author) commented Feb 11, 2017

We have this in 0.19.2 / master
seems odd

In [1]:         midx = MultiIndex.from_product([['A', 'B'], [1, 2]])
   ...:         df = DataFrame(np.random.randn(4, 4), columns=midx)
   ...: 

In [2]: 'A' in df
Out[2]: True

In [3]: ('A', ) in df
Out[3]: True

In [4]: del df['A']

In [5]: 'A' in df
Out[5]: True

In [6]: ('A', ) in df
Out[6]: False

@jreback jreback force-pushed the mi branch 2 times, most recently from b07d50a to 97e1080 Compare February 12, 2017 17:39
@jorisvandenbossche (Member):
In [5] seems certainly like a bug. Possibly because it's looking in the full level and not the actually used one?

with pytest.raises(KeyError):
    del df[('A',)]

assert 'A' in df.columns
Member:

As in my other comment, shouldn't we regard this as a bug?
Not needed to fix in this PR, but I would at least add a FIXME comment here so it is clear this test may change

Contributor Author (jreback):

#2770

This is an old / long-discussed issue. I'll annotate the tests though.

# make sure take is not using -1
i = pd.MultiIndex.from_tuples([(0, pd.NaT),
                               (0, pd.Timestamp('20130101'))])
result = i[0:1].equals(i[0])
Member:

I assume this is the test related to my other comment. So what I also said there, is that this equivalence is not true for single-level indices:

In [15]: i = pd.Index([1,2,3])

In [16]: i[0:1]
Out[16]: Int64Index([1], dtype='int64')

In [17]: i[0:1].equals(i[0])
Out[17]: False

So why make this different for MultiIndex?

Contributor Author (jreback):

was to make things a bit easier on the collision check, but you are right, not necessary (or correct); reverted.

@wesm (Member) left a comment:

LGTM modulo other pending comments

except TypeError:

    # we have mixed types and np.lexsort is not happy
    return Index(self.values).is_monotonic
Member:

I presume the memory use / perf here is no worse than it was before -- where might this get called unexpectedly (possibly causing ballooning memory use)?

@jreback (Contributor, Author), Feb 15, 2017:

yes, the only thing that hits it is under py3 with something like this (e.g. with an np.nan in a level)

In [1]: mi = MultiIndex(levels=[[1, 2, 3, 4], ['gb00b03mlx29', 'lu0197800237', 'nl0000289783', 'nl0000289965', 'nl0000301109']],
   ...:            labels=[[0, 1, 1, 2, 2, 2, 3], [4, 2, 0, 0, 1, 3, -1]],
   ...:            names=['household_id', 'asset_id'])

In [2]: mi.is_monotonic
Out[2]: False

In [3]:  values = [mi._get_level_values(i) for i in reversed(range(len(mi.levels)))]

In [4]: values
Out[4]: 
[array(['nl0000301109', 'nl0000289783', 'gb00b03mlx29', 'gb00b03mlx29',
        'lu0197800237', 'nl0000289965', nan], dtype=object),
 array([1, 2, 2, 3, 3, 3, 4])]

In [5]: np.lexsort(values)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-b1af67214d96> in <module>()
----> 1 np.lexsort(values)

TypeError: unorderable types: float() > str()

self._check_for_collisions(locs, key)
return True

return False
Member:

Here, perhaps

try:
    self.getitem(key)
    return True
except (KeyError, ValueError, TypeError):
    return False

Contributor Author (jreback):

done

@jreback (Contributor, Author) commented Feb 15, 2017

all changes pushed. Will merge on pass sometime tomorrow unless there are any more comments.

closes pandas-dev#13904

BUG: a qualifier (+) would always display with a MultiIndex, regardless of whether it needed deep introspection for memory usage
PERF: rework MultiIndex.is_monotonic as per @ssanderson idea
@jreback (Contributor, Author) commented Mar 2, 2017

neat article about hash collisions: https://medium.com/@robertgrosse/generating-64-bit-hash-collisions-to-dos-python-5b21404a5306#.wi6ivpqtv (more about cryptographic hashing, but still neat)

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#13904

Creates an efficient MultiIndexHashTable in cython.
This allows us to efficiently store a multi-index for fast indexing
(.get_loc() and .get_indexer()), with the current tuple-based
(and gil holding) use of the PyObject Hash Table.

This uses the pandas.tools.hashing routines to hash each of the 'values' of a MI
to a single uint64. So this makes MI more memory friendly and much
more efficient. You get these speedups, because the creation of the
hashtable is now much more efficient.

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15245 from jreback/mi and squashes the following commits:

7df6c34 [Jeff Reback] PERF: high memory in MI
Labels: MultiIndex, Performance (memory or execution speed)

Linked issue: Example of High Memory Usage of MultiIndex (#13904)