-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Introducing HashTables for datatypes with 8,16 and 32 bits #37920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
FYI, @jbrockmendel This is my idea, how adding further HashMap-versions could work. It probably needs more (direct) tests of HashTables, because right now they are tested somewhat indirectly via |
Definitely needs more robust testing, but this looks like a really good start. Thanks for following upon this. |
couple of ideas for either follow-ups or to get testing for "free": in core.algorithms add Int32HashTable to _hashtables dict and update algorithms._ensure_data to avoid casting int32 to int64 in _libs.index_class_helper update hashtable_name for int32 (maybe also int16 and int8) |
4967351
to
d9ab327
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow this looks pretty good. cc @jbrockmendel
('Int64', 'int64', False, 'NPY_NAT')] | ||
('Int64', 'int64', False, 'NPY_NAT'), | ||
('UInt32', 'uint32', False, 0), | ||
('Int32', 'int32', False, 0)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this migth be a problem not sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you referring to the 0 here?
i looked at this and dont think its an issue, bc it is only a placeholder that never gets used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field (default_na_value
) is probably not really needed.
Just set na_value2
to 0 here:
{{dtype}}_t val, na_value2 |
and drop the whole else-branch here:
pandas/pandas/_libs/hashtable_class_helper.pxi.in
Lines 432 to 433 in 4cfa97a
else: | |
na_value2 = {{default_na_value}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that was my comment, but agree not germane for now.
@@ -591,6 +591,9 @@ PANDAS_INLINE khint_t __ac_Wang_hash(khint_t key) | |||
#define KHASH_MAP_INIT_INT(name, khval_t) \ | |||
KHASH_INIT(name, khint32_t, khval_t, 1, kh_int_hash_func, kh_int_hash_equal) | |||
|
|||
#define KHASH_MAP_INIT_UINT(name, khval_t) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment about why the int32 version can be reused verbatim for the uint32 case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it is another way around: a comment is needed why int32-version uses verbatim the uint32 case:) khint32_t
is an unsigned int:
pandas/pandas/_libs/src/klib/khash.h
Line 119 in 4cfa97a
typedef unsigned int khint32_t; |
The trick for int32 is that the conversion from signed int to unsigned int is well defined by the standard (see e.g. https://stackoverflow.com/a/50632), thus we can safely pass an int as key (and we don't have something like "get_key", thus it is never casted back to signed int32, which would be implementation defined behavior (and thus not the original value).
Why it is done this way? For example, pandas has tried to improve the situation for int64:
pandas/pandas/_libs/src/klib/khash.h
Lines 125 to 126 in 4cfa97a
typedef unsigned long khuint64_t; | |
typedef signed long khint64_t; |
The problem now is, that for signed int64 there is implementation defined behavior (>>) and undefined behavior (<<) for negative or "sufficiently large" keys in the hash function:
pandas/pandas/_libs/src/klib/khash.h
Line 411 in 4cfa97a
#define kh_int64_hash_func(key) (khint32_t)((key)>>33^(key)^(key)<<11) |
as key
is now a signed int (see e.g. https://stackoverflow.com/a/4009922).
I would be surprised to see a (sane) compiler which wouldn't do "the right thing" on x86/x86_64, but still...
The hash-function should be probably be
#define kh_int64_hash_func(key) (khint32_t)(((khint64_t)(key))>>33^((khint64_t)(key))^((khint64_t)(key))<<11)
just to be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it is another way around: a comment is needed why int32-version uses verbatim the uint32 case:)
fine by me as long as its clear to the next reader
be still my beating heart you got the 8s and the 16s too! |
I've added also 8bit versions: mostly for the sake of uniformity - otherwise there is always a special case. While using 8bit hashtable is better than converting to 16bit-datatype and using 16bit hash-table, it is still an overkill. If important, later on a special version could be written for 8bit data types. Apart from some polishing and small(ish) test-issues, I consider this PR complete. Once it is merged, the upcasting can be dropped at one place after another. I'm however not really a pandas power user and cannot estimate the necessary work and hope somebody will be able to help. One might not be satisfied with the simple hash-function used (identity), but it is more or less consistent with functions we use for (u)int64. The comments say: pandas/pandas/_libs/src/klib/khash.h Lines 62 to 63 in 4cfa97a
So if there are some problems, we could switch to Wang's or murmur2-hash. |
pandas/tests/libs/test_hashtable.py
Outdated
duplicated = get_ht_function("duplicated", type_suffix) | ||
values = np.repeat(np.arange(N).astype(dtype), 5) | ||
result = duplicated(values) | ||
expected = np.ones_like(values, dtype=np.bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like these need to be np.bool_ instead of np.bool
…lso to avoid undefined/implementation defined behaviors in case of an overflow
@realead can you measure what this does to the size of the distribution? |
@TomAugspurger it adds about 0.7-0.8MB to the wheel. This PR:
master:
|
0b9d94a
to
3a4c2bc
Compare
I could not help myself and sneaked some changes in, which are not really necessary:
You can decide whether you want to keep some of them in this PR, or rather in as a new PR or just to drop. |
not nothing, but nbd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. cc @jbrockmendel
LGTM Travis failure is an unrelated frame plotting test that im seeing locally too |
thanks @realead very nice! |
This is exciting thanks @realead |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Starts work on #33287
Proof of concept for 32bit/16bit hash tables. If works it should become blue print for UInt32/Int16/UInt16/Float32-HashTables.