
PERF: float hash slow in py3 #13436


Closed
chris-b1 wants to merge 2 commits

Conversation

@chris-b1 (Contributor) commented Jun 14, 2016

closes #13166, closes #13335

Using exactly the approach suggested by @ruoyu0088.

Significant changes in asv below:

     before     after       ratio
-     8.88s    78.12ms      0.01  indexing.float_loc.time_float_loc
-    13.11s    78.12ms      0.01  groupby.groupby_float32.time_groupby_sum

With the benches made a factor of 10 smaller:

     before     after       ratio
-  171.88ms    43.29ms      0.25  indexing.float_loc.time_float_loc
-     1.42s    11.23ms      0.01  groupby.groupby_float32.time_groupby_sum

@jreback jreback added the Performance Memory or execution speed performance label Jun 14, 2016
@jreback (Contributor) commented Jun 14, 2016

Wow, those were slow before. Can you make the asv benches 10x smaller (so the slow ones don't take as much time)? You should still see the same ratio.

@jreback jreback added this to the 0.18.2 milestone Jun 14, 2016

#define kh_float64_hash_func _Py_HashDouble
inline khint64_t asint64(double key) {
    return *(khint64_t *)(&key);
}
@jreback (Contributor) commented on this diff, Jun 14, 2016

Add a comment (with a link to the Python source code) on why we are not using _Py_HashDouble here, and reference this issue number as well. Maybe also an explanation of what this is doing and why it's better (in pandas).
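For reference, here is a minimal Python sketch of what the `asint64` cast in the hunk above does: reinterpret the IEEE-754 bytes of a double as a 64-bit integer, using `struct` in place of the C pointer cast.

```python
import struct

def asint64(key: float) -> int:
    # Reinterpret the 8 bytes of an IEEE-754 double as a signed 64-bit int,
    # mirroring the *(khint64_t *)(&key) pointer cast in the C code.
    return struct.unpack("<q", struct.pack("<d", key))[0]

# Distinct doubles keep distinct bit patterns, so no information is lost
# before the value is reduced to khash's 32-bit key space.
assert asint64(0.5) == 0x3FE0000000000000
assert asint64(1.0) != asint64(2.0)
```

(In C, a `memcpy` into a `khint64_t` would express the same reinterpretation without relying on the pointer cast.)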

@chris-b1 (Contributor, Author) commented

@jreback - I've updated this. It took a while to figure out, but the issue isn't actually the speed of the Python hash function itself: in py3 it returns a Py_ssize_t (it was a long in py2), while our khash keys are 32-bit, so the truncation is causing the collisions. I updated the asv at the top with the smaller benches - because it's a collision issue the two ratios aren't the same, but they still show the problem.

In [50]: a = np.arange(1000000)
    ...: ind = pd.Float64Index(a * 4.8000000418824129e-08)

In [51]: hashes = np.array([hash(x) for x in ind])

In [52]: len(hashes), len(pd.unique(hashes))
Out[52]: (1000000, 1000000)

In [53]: truncated = hashes.view('int32')[::2]

In [54]: len(truncated), len(pd.unique(truncated))
Out[54]: (1000000, 524288)
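To illustrate the fix on the same kind of data, here is a hedged NumPy sketch: reinterpret each double's bits as a uint64, then XOR-fold the high word into the low word before reducing to a 32-bit key, instead of plainly truncating a 64-bit hash. The shift constants (33 and 11) mirror what the merged patch appears to use, but treat the exact mixing as illustrative.

```python
import numpy as np

# Same pathological spacing as the Float64Index example above, scaled down.
vals = np.arange(100_000) * 4.8000000418824129e-08

# asint64 step: reinterpret the IEEE-754 bit patterns as unsigned 64-bit ints.
bits = vals.view(np.uint64)

# Mix the high word into the low word before reducing to khash's 32-bit keys,
# so the entropy in the upper bits isn't simply discarded by truncation.
keys = ((bits >> np.uint64(33)) ^ bits ^ (bits << np.uint64(11))).astype(np.uint32)

# Nearly every value keeps a distinct 32-bit key.
print(len(vals), len(np.unique(keys)))
```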

@jreback (Contributor) commented Jun 15, 2016

thanks @chris-b1 - yeah, I briefly looked at the hash code in Python itself, and it does a lot of extra work to guarantee, for example, that hashes of fractions and floats agree. This is simpler.

@codecov-io commented
Current coverage is 84.28%

Merging #13436 into master will increase coverage by 0.05%

@@             master     #13436   diff @@
==========================================
  Files           138        138          
  Lines         50805      50929   +124   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42796      42926   +130   
+ Misses         8009       8003     -6   
  Partials          0          0          

Powered by Codecov. Last updated by 62b4327...3aec078

@jreback jreback closed this in f98b4b5 Jun 15, 2016
@jreback (Contributor) commented Jun 15, 2016

ty sir!
