
PERF: float hash slow in py3 #13436


Closed
chris-b1 wants to merge 2 commits

Conversation

@chris-b1 (Contributor) commented Jun 14, 2016

closes #13166, closes #13335

Using exactly the approach suggested by @ruoyu0088.

Significant changes in asv below:

     before     after       ratio
-     8.88s    78.12ms      0.01  indexing.float_loc.time_float_loc
-    13.11s    78.12ms      0.01  groupby.groupby_float32.time_groupby_sum

With the benches made a factor of 10 smaller:

     before     after       ratio
-  171.88ms    43.29ms      0.25  indexing.float_loc.time_float_loc
-     1.42s    11.23ms      0.01  groupby.groupby_float32.time_groupby_sum

@jreback jreback added the Performance Memory or execution speed performance label Jun 14, 2016
@jreback (Contributor) commented Jun 14, 2016

Wow, those were slow before. Can you make the asv benches 10x smaller (so the slow ones don't take as much time)? You should still see the same ratio.

@jreback jreback added this to the 0.18.2 milestone Jun 14, 2016

#define kh_float64_hash_func _Py_HashDouble
inline khint64_t asint64(double key) {
    return *(khint64_t *)(&key);
}
@jreback (Contributor) commented on this diff, Jun 14, 2016

Add a comment (with a link to the Python source code) on why we are not using _Py_HashDouble here, and reference this issue number as well. Maybe also an explanation of what this is doing and why it's better (in pandas).
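For reference, here is a minimal Python sketch of what the `asint64` cast in the hunk above does: reinterpret the IEEE-754 bytes of a double as a 64-bit integer, using `struct` in place of the C pointer cast.

```python
import struct

def asint64(key: float) -> int:
    # Reinterpret the 8 bytes of an IEEE-754 double as a signed 64-bit int,
    # mirroring the *(khint64_t *)(&key) pointer cast in the C code.
    return struct.unpack("<q", struct.pack("<d", key))[0]

# Distinct doubles keep distinct bit patterns, so no information is lost
# before the value is reduced to khash's 32-bit key space.
assert asint64(0.5) == 0x3FE0000000000000
assert asint64(1.0) != asint64(2.0)
```

(In C, a `memcpy` into a `khint64_t` would express the same reinterpretation without relying on the pointer cast.)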

@chris-b1 (Contributor, Author) commented

@jreback - I've updated this. It took a while to figure out, but the issue isn't actually the speed of the Python hash function itself: in py3 it returns a Py_ssize_t (it was a long in py2), while our khash keys are 32-bit, so the truncation is causing the collisions. I updated the asv at the top with the smaller benches - because it's a collision issue the two ratios aren't the same, but they still show the problem.

In [50]: a = np.arange(1000000)
    ...: ind = pd.Float64Index(a * 4.8000000418824129e-08)

In [51]: hashes = np.array([hash(x) for x in ind])

In [52]: len(hashes), len(pd.unique(hashes))
Out[52]: (1000000, 1000000)

In [53]: truncated = hashes.view('int32')[::2]

In [54]: len(truncated), len(pd.unique(truncated))
Out[54]: (1000000, 524288)
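To illustrate the fix on the same kind of data, here is a hedged NumPy sketch: reinterpret each double's bits as a uint64, then XOR-fold the high word into the low word before reducing to a 32-bit key, instead of plainly truncating a 64-bit hash. The shift constants (33 and 11) mirror what the merged patch appears to use, but treat the exact mixing as illustrative.

```python
import numpy as np

# Same pathological spacing as the Float64Index example above, scaled down.
vals = np.arange(100_000) * 4.8000000418824129e-08

# asint64 step: reinterpret the IEEE-754 bit patterns as unsigned 64-bit ints.
bits = vals.view(np.uint64)

# Mix the high word into the low word before reducing to khash's 32-bit keys,
# so the entropy in the upper bits isn't simply discarded by truncation.
keys = ((bits >> np.uint64(33)) ^ bits ^ (bits << np.uint64(11))).astype(np.uint32)

# Nearly every value keeps a distinct 32-bit key.
print(len(vals), len(np.unique(keys)))
```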

@jreback (Contributor) commented Jun 15, 2016

thanks @chris-b1 - yeah, I briefly looked at the hash code in Python itself, and it does a lot of extra work to guarantee, for example, that hashes of fractions and floats agree. This is simpler.

@codecov-io commented
Current coverage is 84.28%

Merging #13436 into master will increase coverage by 0.05%

@@             master     #13436   diff @@
==========================================
  Files           138        138          
  Lines         50805      50929   +124   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42796      42926   +130   
+ Misses         8009       8003     -6   
  Partials          0          0          

Powered by Codecov. Last updated by 62b4327...3aec078

@jreback jreback closed this in f98b4b5 Jun 15, 2016
@jreback (Contributor) commented Jun 15, 2016

ty sir!
