PERF: improve MultiIndex get_loc performance #16346
Changes from all commits: 3d3074b, d1d1513, 664d2b3, 7cd3cc1, 3bd0404, 287817a, 638f011, 8acc9e8
```diff
@@ -4,14 +4,17 @@
 import itertools

 import numpy as np
-from pandas._libs import hashing
+from pandas._libs import hashing, tslib
 from pandas.core.dtypes.generic import (
     ABCMultiIndex,
     ABCIndexClass,
     ABCSeries,
     ABCDataFrame)
 from pandas.core.dtypes.common import (
     is_categorical_dtype, is_list_like)
+from pandas.core.dtypes.missing import isnull
+from pandas.core.dtypes.cast import infer_dtype_from_scalar

 # 16 byte long hashing key
 _default_hash_key = '0123456789123456'
```
```diff
@@ -164,6 +167,29 @@ def hash_tuples(vals, encoding='utf8', hash_key=None):
     return h


+def hash_tuple(val, encoding='utf8', hash_key=None):
+    """
+    Hash a single tuple efficiently
+
+    Parameters
+    ----------
+    val : single tuple
+    encoding : string, default 'utf8'
+    hash_key : string key to encode, default to _default_hash_key
+
+    Returns
+    -------
+    hash
+
+    """
+    hashes = (_hash_scalar(v, encoding=encoding, hash_key=hash_key)
+              for v in val)
+
+    h = _combine_hash_arrays(hashes, len(val))[0]
+
+    return h
+
+
 def _hash_categorical(c, encoding, hash_key):
     """
     Hash a Categorical by hashing its categories, and then mapping the codes
```
```diff
@@ -276,3 +302,31 @@ def hash_array(vals, encoding='utf8', hash_key=None, categorize=True):
     vals *= np.uint64(0x94d049bb133111eb)
     vals ^= vals >> 31
     return vals


+def _hash_scalar(val, encoding='utf8', hash_key=None):
+    """
+    Hash scalar value
+
+    Returns
+    -------
+    1d uint64 numpy array of hash value, of length 1
+    """
+
+    if isnull(val):
+        # this is to be consistent with the _hash_categorical implementation
+        return np.array([np.iinfo(np.uint64).max], dtype='u8')
+
+    if getattr(val, 'tzinfo', None) is not None:
+        # for tz-aware datetimes, we need the underlying naive UTC value and
+        # not the tz aware object or pd extension type (as
+        # infer_dtype_from_scalar would do)
+        if not isinstance(val, tslib.Timestamp):
+            val = tslib.Timestamp(val)
+        val = val.tz_convert(None)
+
+    dtype, val = infer_dtype_from_scalar(val)
+    vals = np.array([val], dtype=dtype)
+
+    return hash_array(vals, hash_key=hash_key, encoding=encoding,
+                      categorize=False)
```

Review comments on this hunk:

- Reviewer: "if you need to handle datetime w/tz directly (IOW, we basically ignore it), then I would. I suppose an option to ignore tz is fine for …"
- Author: "yes, I can certainly do that check here as well. That is maybe better, to keep the custom logic here, as the keyword added to …"
- Reviewer: "heh, i think better locally"
- Reviewer, on the `isinstance(val, tslib.Timestamp)` check: "see #16372 (for later)"
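The context lines at the top of this hunk show the tail of `hash_array`: a splitmix64-style bit-mixing finalizer that scrambles the raw uint64 values so that nearby inputs map to distant outputs. A standalone sketch of that finalizer (the full splitmix64 mix, of which the diff shows only the last two steps; not pandas' code verbatim):

```python
import numpy as np


def finalize(vals):
    """Apply a splitmix64-style bit mix to a uint64 array.

    Each xor-shift and each multiplication by an odd constant is a
    bijection on uint64, so distinct inputs stay distinct. Overflow
    wraps modulo 2**64, which is intended.
    """
    vals = vals.astype('uint64', copy=True)
    vals ^= vals >> 30
    vals *= np.uint64(0xbf58476d1ce4e5b9)
    vals ^= vals >> 27
    vals *= np.uint64(0x94d049bb133111eb)
    vals ^= vals >> 31
    return vals


out = finalize(np.arange(4, dtype='uint64'))
# Deterministic and injective: 4 distinct inputs give 4 distinct hashes.
assert len(set(out.tolist())) == 4
```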
General discussion on the PR:

Reviewer: "this is duplicating lots of code. I'm sure there is a way to do this a bit more generically."

Author: "Yes, I know, but this was mainly to test. And from that, this makes it a lot faster than using `hash_array` (the commented part in `hash_tuple`). So not sure how to solve that. In principle I can put the common parts in helper functions (e.g. the redistributing part), but for most of it that is not possible, as there are slight differences."

Reviewer: "what does all of this buy you? (IOW can you post updated timings)? maintaining a separate code path for scalars will cause future issues. these will need to be kept in sync (with the array hashing), if any code changes are made. you can easily share code here which would make this more palatable."

Author: "`hash_tuple2` uses `hash_array` (the commented-out version in the current branch), `hash_tuple` uses `_hash_scalar`."

Reviewer: "ok, its prob reasonable, but willing to sacrifice some perf to get some shared code (IOW between old and new prob a good compromise to repeating lots of code)."

Author: "OK, I pushed a new version that almost fully reuses the `hash_array` function for the actual hashing, and only has some specific logic before that to convert the scalar to a proper array. This reduces a lot of the code duplication, and has only minor perf impact. Apart from that, it still has the separate code path to first convert the scalar to an array, which might be a bit brittle, and I agree it is certainly not ideal for code maintenance, but using something more general (e.g. `infer_dtype`) gives a big perf penalty."
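The compromise the author describes is to convert the scalar into a length-1 array with an inferred dtype and then hand off to the shared array-hashing routine. A hypothetical sketch of that conversion step (`scalar_to_array` is illustrative only; pandas uses `infer_dtype_from_scalar`, and the tz handling mirrors the `_hash_scalar` diff above):

```python
import numpy as np
from datetime import datetime, timezone


def scalar_to_array(val):
    """Wrap a scalar in a length-1 ndarray with an inferred dtype,
    so a shared array-hashing routine can be reused for scalars.
    Illustrative stand-in for pandas' infer_dtype_from_scalar path.
    """
    if isinstance(val, datetime):
        if val.tzinfo is not None:
            # drop the tz by converting to naive UTC, as _hash_scalar does
            val = val.astimezone(timezone.utc).replace(tzinfo=None)
        return np.array([val], dtype='datetime64[ns]')
    if isinstance(val, bool):  # check bool before int: bool is an int subclass
        return np.array([val], dtype=bool)
    if isinstance(val, int):
        return np.array([val], dtype='int64')
    if isinstance(val, float):
        return np.array([val], dtype='float64')
    return np.array([val], dtype=object)


# A tz-aware datetime becomes a naive datetime64[ns] value:
arr = scalar_to_array(datetime(2017, 5, 1, 12, 0, tzinfo=timezone.utc))
assert arr.dtype == np.dtype('datetime64[ns]')
```

This keeps all of the actual hashing in one place; only the scalar-to-array conversion is duplicated logic, which is the trade-off discussed in the thread above.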