You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I came across the combine_hash_arrays functions in pandas library here. How does it keep good collision rate after combination ? Does anyone have the mathematical proof for this code ?
The below code is retrieved from pandas implementation.
def combine_hash_arrays(
arrays: Iterator[np.ndarray], num_items: int
) -> npt.NDArray[np.uint64]:
"""
Parameters
----------
arrays : Iterator[np.ndarray]
num_items : int
Returns
-------
np.ndarray[uint64]
Should be the same as CPython's tupleobject.c
"""
try:
first = next(arrays)
except StopIteration:
return np.array([], dtype=np.uint64)
arrays = itertools.chain([first], arrays)
mult = np.uint64(1000003)
out = np.zeros_like(first) + np.uint64(0x345678)
last_i = 0
for i, a in enumerate(arrays):
inverse_i = num_items - i
out ^= a
out *= mult
mult += np.uint64(82520 + inverse_i + inverse_i)
last_i = i
assert last_i + 1 == num_items, "Fed in wrong num_items"
out += np.uint64(97531)
return out
I have already checked whether they work well by checking their combined hashes' collision rate.
The text was updated successfully, but these errors were encountered:
Research
I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.
Link to question on StackOverflow
https://stackoverflow.com/questions/77784124/the-concept-behind-how-hashes-are-combined-in-pandas
Question about pandas
I came across the
combine_hash_arrays
functions in pandas library here. How does it keep good collision rate after combination ? Does anyone have the mathematical proof for this code ?The below code is retrieved from pandas implementation.
I have already checked whether they work well by checking their combined hashes' collision rate.
The text was updated successfully, but these errors were encountered: