Skip to content

PERF: Introducing HashTables for datatypes with 8,16 and 32 bits #37920

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Nov 21, 2020

Conversation

realead
Copy link
Contributor

@realead realead commented Nov 17, 2020

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Starts work on #33287

Proof of concept for 32bit/16bit hash tables. If works it should become blue print for UInt32/Int16/UInt16/Float32-HashTables.

@realead
Copy link
Contributor Author

realead commented Nov 17, 2020

FYI, @jbrockmendel

This is my idea, how adding further HashMap-versions could work. It probably needs more (direct) tests of HashTables, because right now they are tested somewhat indirectly via test_algos.py and a lot of infrastructure is missing for testing the functionality of new versions.

@jbrockmendel
Copy link
Member

Definitely needs more robust testing, but this looks like a really good start. Thanks for following upon this.

@jbrockmendel
Copy link
Member

couple of ideas for either follow-ups or to get testing for "free":

in core.algorithms add Int32HashTable to _hashtables dict and update algorithms._ensure_data to avoid casting int32 to int64

in _libs.index_class_helper update hashtable_name for int32 (maybe also int16 and int8)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow this looks pretty good. cc @jbrockmendel

('Int64', 'int64', False, 'NPY_NAT')]
('Int64', 'int64', False, 'NPY_NAT'),
('UInt32', 'uint32', False, 0),
('Int32', 'int32', False, 0)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this migth be a problem not sure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you referring to the 0 here?

i looked at this and dont think its an issue, bc it is only a placeholder that never gets used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field (default_na_value) is probably not really needed.

Just set na_value2 to 0 here:

{{dtype}}_t val, na_value2

and drop the whole else-branch here:
else:
na_value2 = {{default_na_value}}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that was my comment, but agree not germane for now.

@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Nov 19, 2020
@jreback jreback added this to the 1.2 milestone Nov 19, 2020
@@ -591,6 +591,9 @@ PANDAS_INLINE khint_t __ac_Wang_hash(khint_t key)
#define KHASH_MAP_INIT_INT(name, khval_t) \
KHASH_INIT(name, khint32_t, khval_t, 1, kh_int_hash_func, kh_int_hash_equal)

#define KHASH_MAP_INIT_UINT(name, khval_t) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment about why the int32 version can be reused verbatim for the uint32 case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it is another way around: a comment is needed why int32-version uses verbatim the uint32 case:) khint32_t is an unsigned int:

typedef unsigned int khint32_t;
so it only naturally to use it for uint32-version

The trick for int32 is that the conversion from signed int to unsigned int is well defined by the standard (see e.g. https://stackoverflow.com/a/50632), thus we can safely pass an int as key (and we don't have something like "get_key", thus it is never casted back to signed int32, which would be implementation defined behavior (and thus not the original value).

Why it is done this way? For example, pandas has tried to improve the situation for int64:

typedef unsigned long khuint64_t;
typedef signed long khint64_t;

The problem now is, that for signed int64 there is implementation defined behavior (>>) and undefined behavior (<<) for negative or "sufficiently large" keys in the hash function:

#define kh_int64_hash_func(key) (khint32_t)((key)>>33^(key)^(key)<<11)

as key is now a signed int (see e.g. https://stackoverflow.com/a/4009922).

I would be surprised to see a (sane) compiler which wouldn't do "the right thing" on x86/x86_64, but still...

The hash-function should be probably be

#define kh_int64_hash_func(key) (khint32_t)(((khint64_t)(key))>>33^((khint64_t)(key))^((khint64_t)(key))<<11)

just to be safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it is another way around: a comment is needed why int32-version uses verbatim the uint32 case:)

fine by me as long as its clear to the next reader

@realead realead changed the title PERF: Introducing Int32HashTable PERF: Introducing HashTables for datatypes with 8,16 and 32 bits Nov 19, 2020
@jbrockmendel
Copy link
Member

be still my beating heart you got the 8s and the 16s too!

@realead
Copy link
Contributor Author

realead commented Nov 19, 2020

@jreback @jbrockmendel

I've added also 8bit versions: mostly for the sake of uniformity - otherwise there is always a special case. While using 8bit hashtable is better than converting to 16bit-datatype and using 16bit hash-table, it is still an overkill. If important, later on a special version could be written for 8bit data types.

Apart from some polishing and small(ish) test-issues, I consider this PR complete. Once it is merged, the upcasting can be dropped at one place after another. I'm however not really a pandas power user and cannot estimate the necessary work and hope somebody will be able to help.

One might not be satisfied with the simple hash-function used (identity), but it is more or less consistent with functions we use for (u)int64. The comments say:

* Added Wang's integer hash function (not used by default). This hash
function is more robust to certain non-random input.

So if there are some problems, we could switch to Wang's or murmur2-hash.

duplicated = get_ht_function("duplicated", type_suffix)
values = np.repeat(np.arange(N).astype(dtype), 5)
result = duplicated(values)
expected = np.ones_like(values, dtype=np.bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like these need to be np.bool_ instead of np.bool

@TomAugspurger
Copy link
Contributor

@realead can you measure what this does to the size of the distribution? python setup.py bdist_wheel should do the trick.

@realead
Copy link
Contributor Author

realead commented Nov 20, 2020

@TomAugspurger it adds about 0.7-0.8MB to the wheel.

This PR:

31.7 Nov 20 22:36 pandas-1.2.0.dev0+1209.g0b9d94a-cp38-cp38-linux_x86_64.whl
7,4M Nov 20 22:24 hashtable.cpython-38-x86_64-linux-gnu.so*

master:

31M Nov 20 22:51 pandas-1.2.0.dev0+1191.g8d1b8ab-cp38-cp38-linux_x86_64.whl
3.8M Nov 20 22:42 hashtable.cpython-38-x86_64-linux-gnu.so*

@realead
Copy link
Contributor Author

realead commented Nov 20, 2020

@jreback @jbrockmendel

I could not help myself and sneaked some changes in, which are not really necessary:

You can decide whether you want to keep some of them in this PR, or rather in as a new PR or just to drop.

@jreback
Copy link
Contributor

jreback commented Nov 20, 2020

@TomAugspurger it adds about 0.7-0.8MB to the wheel.

This PR:

31.7 Nov 20 22:36 pandas-1.2.0.dev0+1209.g0b9d94a-cp38-cp38-linux_x86_64.whl
7,4M Nov 20 22:24 hashtable.cpython-38-x86_64-linux-gnu.so*

master:

31M Nov 20 22:51 pandas-1.2.0.dev0+1191.g8d1b8ab-cp38-cp38-linux_x86_64.whl
3.8M Nov 20 22:42 hashtable.cpython-38-x86_64-linux-gnu.so*

not nothing, but nbd

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. cc @jbrockmendel

@jbrockmendel
Copy link
Member

LGTM

Travis failure is an unrelated frame plotting test that im seeing locally too

@jreback jreback merged commit fb2bd10 into pandas-dev:master Nov 21, 2020
@jreback
Copy link
Contributor

jreback commented Nov 21, 2020

thanks @realead very nice!

@jbrockmendel
Copy link
Member

This is exciting thanks @realead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants