PERF: Introducing HashTables for datatypes with 8,16 and 32 bits #37920

realead · 2020-11-17T22:35:22Z

tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Starts work on #33287

Proof of concept for 32bit/16bit hash tables. If it works, it should become the blueprint for UInt32/Int16/UInt16/Float32 HashTables.

realead · 2020-11-17T22:42:31Z

FYI, @jbrockmendel

This is my idea of how adding further HashMap versions could work. It probably needs more (direct) tests of HashTables, because right now they are only tested indirectly via test_algos.py, and a lot of the infrastructure for testing the functionality of new versions is missing.

jbrockmendel · 2020-11-17T22:58:23Z

Definitely needs more robust testing, but this looks like a really good start. Thanks for following up on this.

jbrockmendel · 2020-11-17T23:02:43Z

couple of ideas for either follow-ups or to get testing for "free":

in core.algorithms add Int32HashTable to _hashtables dict and update algorithms._ensure_data to avoid casting int32 to int64

in _libs.index_class_helper update hashtable_name for int32 (maybe also int16 and int8)
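The first idea might look roughly like the sketch below. It is not pandas code: `HASHTABLE_BY_DTYPE`, `UPCAST` and `pick_hashtable` are hypothetical names standing in for the `_hashtables` dict and the casting done in `_ensure_data`, and dtypes are plain strings to keep the example self-contained.

```python
# Hypothetical stand-in for a dtype -> hashtable registry (not the pandas dict).
HASHTABLE_BY_DTYPE = {
    "int64": "Int64HashTable",
    "uint64": "UInt64HashTable",
    "float64": "Float64HashTable",
}

# Wider dtype to fall back to when no dedicated table exists.
# This is the upcast the suggestion wants to avoid.
UPCAST = {
    "int8": "int64", "int16": "int64", "int32": "int64",
    "uint8": "uint64", "uint16": "uint64", "uint32": "uint64",
    "float32": "float64",
}


def pick_hashtable(dtype, registry=HASHTABLE_BY_DTYPE):
    """Return (table_name, dtype_to_use), upcasting only if no native table exists."""
    if dtype in registry:
        return registry[dtype], dtype
    wide = UPCAST[dtype]
    return registry[wide], wide


# Without a dedicated table, int32 data has to be cast to int64 first.
print(pick_hashtable("int32"))
# Once Int32HashTable is registered, the cast is no longer needed.
extended = dict(HASHTABLE_BY_DTYPE, int32="Int32HashTable")
print(pick_hashtable("int32", extended))
```

Registering the narrower tables is then a one-line change per dtype, and existing tests that feed int32 data through this path would exercise the new table without further work.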

jreback

wow this looks pretty good. cc @jbrockmendel

jreback · 2020-11-19T02:04:32Z

pandas/_libs/hashtable_class_helper.pxi.in

-          ('Int64', 'int64', False, 'NPY_NAT')]
+          ('Int64', 'int64', False, 'NPY_NAT'),
+          ('UInt32', 'uint32', False, 0),
+          ('Int32', 'int32', False, 0)]


this might be a problem, not sure

are you referring to the 0 here?

i looked at this and don't think it's an issue, because it is only a placeholder that never gets used

This field (default_na_value) is probably not really needed.

Just set na_value2 to 0 here:

pandas/pandas/_libs/hashtable_class_helper.pxi.in

Line 410 in 4cfa97a

{{dtype}}_t val, na_value2

and drop the whole else-branch here:

pandas/pandas/_libs/hashtable_class_helper.pxi.in

Lines 432 to 433 in 4cfa97a

else:

na_value2 = {{default_na_value}}

yeah that was my comment, but agree not germane for now.

jbrockmendel · 2020-11-19T16:57:10Z

pandas/_libs/src/klib/khash.h

@@ -591,6 +591,9 @@ PANDAS_INLINE khint_t __ac_Wang_hash(khint_t key)
 #define KHASH_MAP_INIT_INT(name, khval_t)								\
 	KHASH_INIT(name, khint32_t, khval_t, 1, kh_int_hash_func, kh_int_hash_equal)

+#define KHASH_MAP_INIT_UINT(name, khval_t)								\


can you add a comment about why the int32 version can be reused verbatim for the uint32 case

Actually, it is the other way around: a comment is needed on why the int32 version reuses the uint32 case verbatim :) khint32_t is an unsigned int:

pandas/pandas/_libs/src/klib/khash.h

Line 119 in 4cfa97a

typedef unsigned int khint32_t;

so it is only natural to use it for the uint32 version.

The trick for int32 is that the conversion from signed int to unsigned int is well defined by the standard (see e.g. https://stackoverflow.com/a/50632), thus we can safely pass an int as key. We also don't have something like "get_key", so the key is never cast back to signed int32, which would be implementation-defined behavior (and thus might not yield the original value).
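As a rough illustration of why this is safe, the C conversion from signed to unsigned reduces the value modulo 2**32, which maps distinct int32 keys to distinct uint32 keys. The sketch below emulates that in Python; `as_uint32` is a hypothetical helper, not part of khash or pandas.

```python
import ctypes

UINT32_MOD = 1 << 32


def as_uint32(key):
    """Emulate C's signed -> unsigned conversion: reduce modulo 2**32."""
    return key % UINT32_MOD


# ctypes wraps the same way when a negative value is stored as unsigned 32-bit.
assert as_uint32(-1) == ctypes.c_uint32(-1).value

# Distinct int32 keys stay distinct after conversion, so lookups never confuse
# a negative key with a positive one.
sample = [-(2 ** 31), -2, -1, 0, 1, 2, 2 ** 31 - 1]
assert len({as_uint32(k) for k in sample}) == len(sample)
```

Because the table only ever compares converted keys with other converted keys, nothing depends on converting back to the signed type.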

Why it is done this way? For example, pandas has tried to improve the situation for int64:

pandas/pandas/_libs/src/klib/khash.h

Lines 125 to 126 in 4cfa97a

typedef unsigned long khuint64_t;

typedef signed long khint64_t;

The problem now is that for signed int64 there is implementation-defined behavior (>>) and undefined behavior (<<) for negative or "sufficiently large" keys in the hash function:

pandas/pandas/_libs/src/klib/khash.h

Line 411 in 4cfa97a

#define kh_int64_hash_func(key) (khint32_t)((key)>>33^(key)^(key)<<11)

as key is now a signed int (see e.g. https://stackoverflow.com/a/4009922).

I would be surprised to see a (sane) compiler which wouldn't do "the right thing" on x86/x86_64, but still...

The hash function should probably cast the key to the unsigned type first:

#define kh_int64_hash_func(key) (khint32_t)(((khuint64_t)(key))>>33^((khuint64_t)(key))^((khuint64_t)(key))<<11)

just to be safe.
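To make the unsigned evaluation concrete, here is a Python emulation of the same shift-xor expression computed on the key viewed as an unsigned 64-bit integer. `int64_hash_unsigned` is a hypothetical name used for illustration only; Python integers are unbounded, so the 64-bit wrap-around is applied explicitly with masks.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def int64_hash_unsigned(key):
    """Shift-xor hash from the macro, evaluated on the key as a uint64."""
    k = key & MASK64                               # signed -> unsigned (mod 2**64)
    mixed = (k >> 33) ^ k ^ ((k << 11) & MASK64)   # left shift wraps at 64 bits
    return mixed & MASK32                          # truncate to the 32-bit hash type


for key in (0, 1, -1, 2 ** 63 - 1, -(2 ** 63)):
    print(key, hex(int64_hash_unsigned(key)))
```

With unsigned arithmetic every step is fully defined for all 64-bit inputs, including negative keys and keys whose left shift would overflow a signed type.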

Actually, it is the other way around: a comment is needed on why the int32 version reuses the uint32 case verbatim :)

fine by me as long as it's clear to the next reader

pandas/tests/libs/test_hashtable.py

jbrockmendel · 2020-11-19T22:47:15Z

be still my beating heart you got the 8s and the 16s too!

realead · 2020-11-19T22:48:16Z

@jreback @jbrockmendel

I've also added 8bit versions, mostly for the sake of uniformity: otherwise there is always a special case. While using an 8bit hash table is better than converting to a 16bit datatype and using a 16bit hash table, it is still overkill. If it turns out to be important, a special version could be written for 8bit data types later on.

Apart from some polishing and small(ish) test-issues, I consider this PR complete. Once it is merged, the upcasting can be dropped at one place after another. I'm however not really a pandas power user and cannot estimate the necessary work and hope somebody will be able to help.

One might not be satisfied with the simple hash-function used (identity), but it is more or less consistent with functions we use for (u)int64. The comments say:

pandas/pandas/_libs/src/klib/khash.h

Lines 62 to 63 in 4cfa97a

* Added Wang's integer hash function (not used by default). This hash
  function is more robust to certain non-random input.

So if there are some problems, we could switch to Wang's or murmur2-hash.
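A small sketch of the kind of "non-random input" that can hurt an identity hash. It assumes a table with a power-of-two number of buckets that picks the initial slot by masking the hash; that layout is an assumption of this example and is not spelled out in the thread. `bucket` and `bucket_load` are hypothetical helpers.

```python
from collections import Counter


def bucket(key, n_buckets):
    """Identity hash followed by masking; n_buckets must be a power of two."""
    return key & (n_buckets - 1)


def bucket_load(keys, n_buckets):
    """Count how many keys start out in each bucket."""
    return Counter(bucket(k, n_buckets) for k in keys)


dense = range(8)                 # consecutive keys spread over 8 buckets
strided = range(0, 16 * 8, 16)   # 8 keys, all multiples of 16

print(bucket_load(dense, 16))
print(bucket_load(strided, 16))  # every key starts in bucket 0
```

A mixing hash spreads such strided keys over more buckets, at the cost of extra work per lookup, which is the trade-off behind keeping the simple function unless problems show up.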

jbrockmendel · 2020-11-19T22:48:19Z

pandas/tests/libs/test_hashtable.py

+        duplicated = get_ht_function("duplicated", type_suffix)
+        values = np.repeat(np.arange(N).astype(dtype), 5)
+        result = duplicated(values)
+        expected = np.ones_like(values, dtype=np.bool)


looks like these need to be np.bool_ instead of np.bool
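For readers who want to see what the test is checking, below is a pure-Python reference for "duplicated", assuming keep-first semantics (the first occurrence of a value is not flagged, later ones are). `duplicated_keep_first` is a hypothetical helper, not the function under test.

```python
def duplicated_keep_first(values):
    """Flag every element that has already been seen earlier in the sequence."""
    seen = set()
    flags = []
    for value in values:
        flags.append(value in seen)
        seen.add(value)
    return flags


# Same shape as the test input: each of N values repeated 5 times in a row.
N = 3
values = [i for i in range(N) for _ in range(5)]
print(duplicated_keep_first(values))
```

Under that assumption, every fifth position starting at index 0 is a first occurrence and is not flagged, while all other positions are.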

TomAugspurger · 2020-11-20T21:35:04Z

@realead can you measure what this does to the size of the distribution? python setup.py bdist_wheel should do the trick.

realead · 2020-11-20T22:07:33Z

@TomAugspurger it adds about 0.7-0.8MB to the wheel.

This PR:

31.7M Nov 20 22:36 pandas-1.2.0.dev0+1209.g0b9d94a-cp38-cp38-linux_x86_64.whl
7.4M Nov 20 22:24 hashtable.cpython-38-x86_64-linux-gnu.so*

master:

31M Nov 20 22:51 pandas-1.2.0.dev0+1191.g8d1b8ab-cp38-cp38-linux_x86_64.whl
3.8M Nov 20 22:42 hashtable.cpython-38-x86_64-linux-gnu.so*
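A wheel is a zip archive, so the per-file contribution can be inspected with the standard library. The sketch below lists the largest members with their uncompressed and compressed sizes; `largest_members` is a hypothetical helper and the wheel path is whatever `bdist_wheel` produced locally.

```python
import zipfile


def largest_members(wheel_path, top=5):
    """Return (name, uncompressed_bytes, compressed_bytes) for the biggest files."""
    with zipfile.ZipFile(wheel_path) as wheel:
        infos = sorted(wheel.infolist(), key=lambda info: info.file_size, reverse=True)
        return [(info.filename, info.file_size, info.compress_size) for info in infos[:top]]
```

Comparing the two size columns helps explain why a shared object can grow by several MB on disk while the wheel grows by much less.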

realead · 2020-11-20T22:42:19Z

@jreback @jbrockmendel

I could not help myself and sneaked in some changes that are not strictly necessary:

for the sake of consistency and to avoid undefined behavior, using unsigned ints also for 64bit integers. This is somewhat hacky (but it is the way khash did it in the first place). See also this comment: PERF: Introducing HashTables for datatypes with 8,16 and 32 bits #37920 (comment). Commit: 897512d
adding a dependency on khash.h, otherwise changing it doesn't trigger a rebuild (probably anything cimporting khash.pxd, e.g. via cimport hashtable.pxd, should have it as well, but I'm not sure). Commit: 17a3fee
getting rid of the unnecessary default_na_value, see this comment: PERF: Introducing HashTables for datatypes with 8,16 and 32 bits #37920 (comment). Commit: 3a4c2bc

You can decide whether you want to keep some of them in this PR, move them to a new PR, or just drop them.

jreback · 2020-11-20T23:00:14Z

@TomAugspurger it adds about 0.7-0.8MB to the wheel.

This PR:

31.7M Nov 20 22:36 pandas-1.2.0.dev0+1209.g0b9d94a-cp38-cp38-linux_x86_64.whl
7.4M Nov 20 22:24 hashtable.cpython-38-x86_64-linux-gnu.so*

master:

31M Nov 20 22:51 pandas-1.2.0.dev0+1191.g8d1b8ab-cp38-cp38-linux_x86_64.whl
3.8M Nov 20 22:42 hashtable.cpython-38-x86_64-linux-gnu.so*

not nothing, but nbd

jreback

lgtm. cc @jbrockmendel

jbrockmendel · 2020-11-21T01:27:25Z

LGTM

Travis failure is an unrelated frame plotting test that I'm seeing locally too

jreback · 2020-11-21T22:22:29Z

thanks @realead very nice!

jbrockmendel · 2020-11-21T22:59:07Z

This is exciting thanks @realead

realead added 4 commits November 17, 2020 21:18

extracting khash for primitive types into a helper-file

b871012

use template for int64-map

8983425

use template for uint64/float64/int32-map

9b3c5a5

remove unused define

e2f062b

realead added 3 commits November 17, 2020 23:49

introducing Int32HashTable

8a7fc6c

expanding some tests to test Int32HashTable

5ab4d68

moving cimport to helper, so it can become a template

d9ab327

realead force-pushed the 32bit_hashmap branch from 4967351 to d9ab327 Compare November 17, 2020 23:12

realead added 2 commits November 18, 2020 23:38

adding some tests for hashtables

41d4b57

introducing UInt32HashTable

70c6fc5

jreback reviewed Nov 19, 2020

jreback added the Dtype Conversions label Nov 19, 2020

jreback added this to the 1.2 milestone Nov 19, 2020

jbrockmendel reviewed Nov 19, 2020

realead added 4 commits November 19, 2020 20:35

formating test case (and adding some missing asserts)

15dfe49

introducing Float32HashMap

8975f06

introducing Int16HashTable and UInt16HashTable

0ffd3b2

introducing UInt8HashTable and Int8HashTable

c952d68

realead changed the title ~~PERF: Introducing Int32HashTable~~ PERF: Introducing HashTables for datatypes with 8,16 and 32 bits Nov 19, 2020

jbrockmendel reviewed Nov 19, 2020

realead added 3 commits November 20, 2020 21:44

fixing minor issues with tests

6d026b2

adding comment why unsigned int is used for maps

a2e5679

use unsigned ints also for 64 maps, for the sake of consistance but a…

897512d

…lso to avoid undefined/implementation defined behaviors in case of an overflow

realead added 2 commits November 20, 2020 23:09

adding missing dependency, otherwise changing khash.h does not trigge…

17a3fee

…r rebuild

removing not really needed default_na_value

3a4c2bc

realead force-pushed the 32bit_hashmap branch from 0b9d94a to 3a4c2bc Compare November 20, 2020 22:17

jreback approved these changes Nov 20, 2020

jreback merged commit fb2bd10 into pandas-dev:master Nov 21, 2020

realead mentioned this pull request Nov 24, 2020

ENH: adding support for Py3.6+ memory tracing for khash-maps #38048

Merged

4 tasks

realead mentioned this pull request Jan 1, 2021

CLN: Use signed integers in khash maps for signed integer keys #38882

Merged

3 tasks

realead mentioned this pull request Feb 14, 2021

ENH: consider using sets and not maps for isin, unique and duplicated #39799

Open
