PERF: float hash slow in py3 #13436
Wow, those were slow before. Can you make the asv benchmarks 10x smaller (so the slow ones don't take as much time)? You should still see the same ratio.
#define kh_float64_hash_func _Py_HashDouble
inline khint64_t asint64(double key) {
    return *(khint64_t *)(&key);
Add a comment here (with a link to the Python source code) explaining why we are not using _Py_HashDouble, and include this issue number as well. Maybe also explain what this is doing and why it's better (in pandas).
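For readers following along, here is a minimal Python sketch of what `asint64` appears to do: reinterpret the 8 bytes of a double as a 64-bit integer so that an integer hash can be applied to it. The fold in `float64_hash` is an assumption based on khash's 64-bit integer hash macro, and treating `khint64_t` as unsigned is also an assumption; the name `float64_hash` is hypothetical and is not the pandas implementation.

```python
import struct


def asint64(key: float) -> int:
    """Reinterpret the raw IEEE 754 bits of a double as an unsigned 64-bit int.

    Mirrors the C cast *(khint64_t *)(&key); khint64_t is assumed unsigned.
    """
    return struct.unpack("<Q", struct.pack("<d", key))[0]


def float64_hash(key: float) -> int:
    """Fold the 64 raw bits into a 32-bit hash (assumed khash-style fold)."""
    bits = asint64(key)
    return ((bits >> 33) ^ bits ^ (bits << 11)) & 0xFFFFFFFF


if __name__ == "__main__":
    for value in (0.0, 1.0, 4.8000000418824129e-08):
        print(value, hex(asint64(value)), hex(float64_hash(value)))
```

Because the fold mixes the high bits (`>> 33`) into the low 32, the result does not depend on the low half of the bit pattern alone.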
@jreback - I've updated this. It took a while to figure out, but the issue here isn't actually the time of the python hash function, it's that in py3 it returns a
In [50]: a = np.arange(1000000)
    ...: ind = pd.Float64Index(a * 4.8000000418824129e-08)
In [51]: hashes = np.array([hash(x) for x in ind])
In [52]: len(hashes), len(pd.unique(hashes))
Out[52]: (1000000, 1000000)
In [53]: truncated = hashes.view('int32')[::2]
In [54]: len(truncated), len(pd.unique(truncated))
Out[54]: (1000000, 524288)
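The same experiment can be sketched with only the standard library. This is a rough illustration, assuming a 64-bit CPython 3 build where `hash(float)` is wider than 32 bits; `count_unique` and `fold_bits` are hypothetical helper names, and the fold is an assumed khash-style mix rather than the exact pandas code. Counts will vary with `n` and platform, so none are stated here.

```python
import struct


def low32(h: int) -> int:
    """Keep only the low 32 bits, as a truncating cast to a 32-bit hash would."""
    return h & 0xFFFFFFFF


def fold_bits(x: float) -> int:
    """Hash the raw bits of the double instead of calling hash() (assumed fold)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return ((bits >> 33) ^ bits ^ (bits << 11)) & 0xFFFFFFFF


def count_unique(n: int = 100_000, step: float = 4.8000000418824129e-08):
    """Return counts of distinct (full hash, truncated hash, folded-bits hash)."""
    values = [i * step for i in range(n)]
    full = {hash(v) for v in values}
    truncated = {low32(hash(v)) for v in values}
    folded = {fold_bits(v) for v in values}
    return len(full), len(truncated), len(folded)


if __name__ == "__main__":
    # A truncated count well below the full count means many keys collide.
    print(count_unique())
```

Comparing the second number with the first shows how many distinct hashes are lost to truncation; the third shows how the bit-based hash behaves on the same keys.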
Thanks @chris-b1. Yeah, I briefly looked at the hash code in Python itself, and it does a lot of work to guarantee, for example, consistent hashes for fractions and such. This is simpler.
Current coverage is 84.28%
@@             master   #13436   diff @@
==========================================
  Files         138      138
  Lines       50805    50929   +124
  Methods         0        0
  Messages        0        0
  Branches        0        0
==========================================
+ Hits        42796    42926   +130
+ Misses       8009     8003     -6
  Partials        0        0
ty sir!
closes #13166, closes #13335
Using exactly the approach suggested by @ruoyu0088
significant changes in asv below
Factor of 10 smaller benches