
PERF: upgrade khash lib to 0.2.8 #8547


Closed
wants to merge 10 commits into from

Conversation

immerrr
Contributor

@immerrr immerrr commented Oct 13, 2014

This should close #8524.

The idea is that quadratic probing with offsets (i**2 + i)/2 is faster than double hashing and can be shown to traverse all buckets when nbuckets == 2**N.
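
As a quick sanity check of that coverage claim, here is a small standalone program (not part of this branch; the bucket count and start position are arbitrary) verifying that the triangular-number offsets 0, 1, 3, 6, 10, ... visit every bucket exactly once when the table size is a power of two:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
    const uint32_t nbuckets = 1u << 10;      /* 2**N buckets */
    const uint32_t mask = nbuckets - 1;
    char *seen = calloc(nbuckets, 1);
    uint32_t pos = 12345 & mask;             /* arbitrary start bucket */
    uint32_t visited = 0;

    for (uint32_t step = 0; step < nbuckets; step++) {
        if (!seen[pos]) { seen[pos] = 1; visited++; }
        pos = (pos + step + 1) & mask;       /* add 1, 2, 3, ... => offsets 1, 3, 6, 10, ... */
    }
    /* prints "visited 1024 of 1024 buckets" */
    printf("visited %u of %u buckets\n", visited, nbuckets);
    free(seen);
    return 0;
}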

This branch compiles and passes tests, but I haven't done any benchmarks yet.

TODO

(probably in other issues)

  • see if the "flag compression" hack can be removed; it complicates syncing
    with upstream and I'm not sure it gave any noticeable performance increase
  • request upstream klib maintainers to add a way to override "kh_inline",
    e.g. by wrapping its definition with #ifndef kh_inline, so that the
    PANDAS_INLINE flag can be moved out of khash.h and specified in
    khash_python.h with
#define kh_inline PANDAS_INLINE
#include "khash.h"
  • I don't like that pandas makes the khint64_t definition signed. Upstream
    it is unsigned, which -- I agree -- is inconsistent, but bit-shifting
    operations on signed integers have a) implementation-defined or undefined
    behaviour according to the standard and b) different semantics, which might
    cause performance degradation with some hash functions:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
  int64_t foo = -1;
  uint64_t ufoo = (uint64_t)foo;  /* same bit pattern, reinterpreted as unsigned */
  /* arithmetic shift keeps the sign bit, logical shift does not */
  printf("foo: %" PRIx64 "; foo>>1: %" PRIx64 "\n",
         (uint64_t)foo, (uint64_t)(foo >> 1));
  printf("ufoo: %" PRIx64 "; ufoo>>1: %" PRIx64 "\n", ufoo, ufoo >> 1);
  return 0;
}
$ ./a.out 
foo: ffffffffffffffff; foo>>1: ffffffffffffffff
ufoo: ffffffffffffffff; ufoo>>1: 7fffffffffffffff
  • memory allocation functions kmalloc/krealloc/kfree should be overridden to
    use the PyMem interface so that allocations show up in tracemalloc output.
    PyMem_Calloc is only provided from 3.5 onwards, so it may require more work
    than just a macro rename (see the sketch after this list).
  • the khash-0.2.8 API introduces -1 return values for operations that have
    failed, most likely because of memory allocation errors. These should be
    handled and reported as MemoryError to the interpreter.

@jreback jreback added the Performance Memory or execution speed performance label Oct 13, 2014
@jreback jreback added this to the 0.15.1 milestone Oct 13, 2014
@immerrr immerrr force-pushed the upgrade-khash-lib branch 2 times, most recently from 9b3faa3 to 9745cc8 on October 13, 2014 12:54
@immerrr
Contributor Author

immerrr commented Oct 23, 2014

tl;dr let's adopt xxhash for string hashing. cc-ing performance guys: @jreback, @bjonen, what do you think? Are there any other hash function uses that might benefit from xxhash?

As reported in #8524, the new probing algorithm introduces a corner case: with the x31 hash function, strings that share a common prefix and differ only in the last few characters fall into neighbouring buckets, which causes many more collisions because the probe step sequence is 1, 3, 6, 10, ... This causes serious performance degradation, as the benchmarks below show.
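
To illustrate the clustering outside of pandas (this snippet is not from the branch): klib's x31 string hash is essentially h = h*31 + c over the bytes, so strings that differ only in the last character produce consecutive hash values and land in consecutive buckets once masked.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* x31 string hash, as in klib's kh_str_hash_func */
static uint32_t x31_hash(const char *s)
{
    uint32_t h = (uint32_t)*s;
    if (h)
        for (++s; *s; ++s)
            h = (h << 5) - h + (uint32_t)*s;
    return h;
}

int main(void)
{
    const uint32_t mask = (1u << 16) - 1;   /* 2**16 buckets */
    char buf[128];
    memset(buf, ' ', 100);                  /* 100-char shared prefix */
    buf[101] = '\0';

    /* keys differing only in the final byte map to neighbouring buckets,
     * so the early probe steps (1, 3, 6, ...) keep hitting occupied slots */
    for (char c = 'a'; c <= 'e'; ++c) {
        buf[100] = c;
        printf("'...%c' -> bucket %u\n", c, x31_hash(buf) & mask);
    }
    return 0;
}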

double hashing, hash=x31

In [9]: %%timeit np.random.seed(0); s = pd.util.testing.rands_array(1, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ...: 
10000 loops, best of 3: 157 µs per loop


In [2]: %%timeit np.random.seed(0); s = pd.util.testing.rands_array(2, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ....: 
1000 loops, best of 3: 537 µs per loop

In [11]: %%timeit np.random.seed(0); s = ' ' * 100 + pd.util.testing.rands_array(2, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ....: 
1000 loops, best of 3: 1.87 ms per loop

quadratic probing, hash=x31

In [1]: %%timeit np.random.seed(0); s = pd.util.testing.rands_array(1, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ...: 
10000 loops, best of 3: 150 µs per loop

In [2]: %%timeit np.random.seed(0); s = pd.util.testing.rands_array(2, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ...: 
100 loops, best of 3: 2.51 ms per loop

In [3]: %%timeit np.random.seed(0); s = ' ' * 100 + pd.util.testing.rands_array(2, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ...: 
100 loops, best of 3: 5.17 ms per loop

quadratic probing, hash=sbox

I've tried using sbox hash as described here:

In [1]: %%timeit np.random.seed(0); s = pd.util.testing.rands_array(1, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ...: 
10000 loops, best of 3: 159 µs per loop

In [2]: %%timeit np.random.seed(0); s = pd.util.testing.rands_array(2, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ...: 
1000 loops, best of 3: 409 µs per loop

In [3]: %%timeit np.random.seed(0); s = ' ' * 100 + pd.util.testing.rands_array(2, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ...: 
1000 loops, best of 3: 1.6 ms per loop

It beats the old version in all tested scenarios, but I was unable to find any licensing information, so bringing it in may be a legal grey area.

quadratic probing, hash=xxhash

Then I went looking for comparative benchmarks involving sbox and found this one mentioning xxHash. I pulled it in, and it performed great on longer strings but was somewhat slower on the shortest ones:

In [1]: %%timeit np.random.seed(0); s = pd.util.testing.rands_array(1, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ...: 
1000 loops, best of 3: 211 µs per loop

In [2]: %%timeit np.random.seed(0); s = pd.util.testing.rands_array(2, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ...: 
1000 loops, best of 3: 505 µs per loop

In [3]: %%timeit np.random.seed(0); s = ' ' * 100 + pd.util.testing.rands_array(2, 10000)
ht = pd.hashtable.StringHashTable(len(s)); ht.factorize(s)
   ...: 
1000 loops, best of 3: 867 µs per loop

@jreback
Contributor

jreback commented Oct 23, 2014

@immerrr would it be possible to do something like this (have to weigh code complexity / benefit here)

sample 3 values from the array, get lengths
if avg length is small:
   use existing hasher
else:
   use new one

(obviously this could only be used INTERNALLY in a single function as the hashes are different)

@immerrr
Contributor Author

immerrr commented Oct 23, 2014

This would require keeping two versions of khash; I strongly object to that.

It should be a lot easier to precompute xxhash values for 0- and 1-length strings and look them up from a table.
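
Something along these lines, purely as a sketch (kh_xxhash_str is a placeholder name for whatever xxhash-based string hasher gets used):

#include <stdint.h>

/* placeholder declaration for the xxhash-based string hasher */
uint32_t kh_xxhash_str(const char *s);

/* precomputed hashes for the empty string and all 256 one-byte strings */
static uint32_t short_str_hash[256];

static void init_short_str_hash(void)
{
    char buf[2] = {0, 0};
    short_str_hash[0] = kh_xxhash_str("");
    for (int c = 1; c < 256; ++c) {
        buf[0] = (char)c;
        short_str_hash[c] = kh_xxhash_str(buf);
    }
}

static inline uint32_t str_hash(const char *s)
{
    if (s[0] == '\0')
        return short_str_hash[0];
    if (s[1] == '\0')
        return short_str_hash[(unsigned char)s[0]];
    return kh_xxhash_str(s);   /* longer strings take the normal path */
}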

@immerrr
Contributor Author

immerrr commented Oct 24, 2014

Huh, so it turns out I was benchmarking the wrong thing :)

hashtable.StringHashTable is not used anywhere; pd.factorize only uses the int64, float64 and generic object tables. I wonder if we could use the string table for factorization anywhere at all (without incurring a type inference penalty).

The strbox hash table, on the other hand, is used, but at first sight I couldn't say what purpose it serves there or how to put together a representative benchmark for it.

@jreback jreback modified the milestones: 0.15.2, 0.16.0 Nov 26, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jreback
Contributor

jreback commented May 9, 2015

closing pls reopen if/when updated

@jreback jreback closed this May 9, 2015
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Klib upgrade (factorizing performance increase)
2 participants