BUG: fixing mixup of bucket and element counts in hash tables #39457
Conversation
I have run a quick asv benchmark and will run the whole test suite overnight.
Running the whole test suite yields somewhat mixed results, which is not surprising per se: if there are many duplicates, preallocating bigger tables than really needed means worse performance. However, the issue in these cases is the heuristic for how much should be preallocated. I'm surprised to see a decrease of performance for
What if you allocated, say, 2x what is needed? Does this affect things? E.g. that's what the current code is doing, trading memory for time.
Allocating a bigger table than actually needed should have no positive effect on performance (in the average case). It could, however, have a negative impact: for example, if the smaller table fits into the cache but the bigger one does not, there would be additional cache misses and quite a big performance regression. Something similar happens, for example, in one of the benchmarks:
the series has 5e5 elements but only 1e5 unique elements, so prior to this fix the table was about 2 times bigger than needed, and after the fix about 4 times bigger than needed - this leads to additional costs. But the issue is not this fix, but rather the simple heuristic used ("preallocate as much space as needed for the worst case"), which is good for the worst case, as for example in the benchmark
where, due to the fix, a rehashing of the table was avoided and thus about 50% of the running time could be saved. To me it looks like a sane strategy to improve the worst case and to take some potential hits in other cases. PS: I have changed the behavior of
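To make the numbers above concrete, here is a small back-of-the-envelope sketch (my own illustration, not code from this PR). It assumes a khash-style occupancy limit of roughly 0.75 and bucket counts rounded up to the next power of two, and reproduces the "about 2x too big before the fix, about 4x too big after the fix" figures for a series with 5e5 elements and 1e5 unique values:

```c
/* Back-of-the-envelope sketch (assumed load-factor limit ~0.75, bucket counts
 * rounded up to the next power of two as khash does): why a series with 5e5
 * elements but only 1e5 unique values gets a table ~2x too big before the fix
 * and ~4x too big after it. */
#include <stdio.h>

/* round up to the next power of two, mimicking khash's kroundup32 */
static unsigned round_pow2(unsigned x) {
    unsigned p = 1;
    while (p < x) p <<= 1;
    return p;
}

int main(void) {
    const double load_factor = 0.75;   /* assumed upper bound on occupancy */
    const unsigned n_total  = 500000;  /* elements in the series */
    const unsigned n_unique = 100000;  /* distinct values actually stored */

    unsigned needed   = round_pow2((unsigned)(n_unique / load_factor) + 1); /* 262144  */
    unsigned pre_fix  = round_pow2(n_total);                                /* 524288  */
    unsigned post_fix = round_pow2((unsigned)(n_total / load_factor) + 1);  /* 1048576 */

    printf("needed: %u, pre-fix: %u (%.0fx), post-fix: %u (%.0fx)\n",
           needed, pre_fix, (double)pre_fix / needed,
           post_fix, (double)post_fix / needed);
    return 0;
}
```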
thanks @realead nice as always! keep em coming!
The issue is that `kh_resize_##name` expects the number of buckets in the map as input (pandas/pandas/_libs/src/klib/khash.h, line 324 in 39e1261). It is however used with the number of elements throughout the pandas code, e.g. in pandas/pandas/core/algorithms.py (line 412 in 39e1261), as can be seen in the `__cinit__` function (pandas/pandas/_libs/hashtable_class_helper.pxi.in, lines 400 to 404 in 39e1261).
Thus, for some sizes (like 10^3) there could still be a rehashing even if the size hint was used with the intention of avoiding it (because at most about 75% of the buckets can be occupied).
I think it makes sense to do the less surprising thing with the size hint, i.e. interpret it as the number of elements and not the number of buckets.
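To illustrate the mixup, here is a hypothetical sketch using klib's khash API directly (not the pandas code itself; the ~0.75 occupancy limit and the int64 map type are assumptions for the example). Passing the number of elements to `kh_resize` rounds 1000 up to 1024 buckets, of which only about 75% may be occupied, so inserting 1000 unique keys still triggers a rehash; converting the element count to a bucket count via the load factor first avoids it.

```c
/* Hypothetical sketch using klib's khash directly (not the pandas code):
 * shows why passing the number of *elements* to kh_resize can still lead to
 * a rehash, assuming an occupancy limit of roughly 0.75 as described above. */
#include <stdint.h>
#include <stdio.h>
#include "khash.h"                      /* klib khash, vendored under pandas/_libs/src/klib */

KHASH_MAP_INIT_INT64(i64map, size_t)    /* example map: int64 keys -> size_t values */

int main(void) {
    const khint_t n_elements = 1000;    /* the 10^3 case mentioned above */
    khash_t(i64map) *h = kh_init(i64map);

    /* Buggy interpretation: treat the size hint as a bucket count.
     * 1000 is rounded up to 1024 buckets, but only ~75% of the buckets may be
     * occupied, so inserting 1000 unique keys forces at least one rehash. */
    kh_resize(i64map, h, n_elements);

    /* Fixed interpretation: convert elements to buckets via the load factor,
     * so n_elements unique keys fit without any rehash:
     * kh_resize(i64map, h, (khint_t)(n_elements / 0.75) + 1);             */

    int ret;
    for (int64_t key = 0; key < (int64_t)n_elements; key++) {
        khint_t it = kh_put(i64map, h, key, &ret);
        kh_value(h, it) = (size_t)key;
    }
    printf("buckets: %u, elements: %u\n",
           (unsigned)kh_n_buckets(h), (unsigned)kh_size(h));

    kh_destroy(i64map, h);
    return 0;
}
```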