Skip to content

QST: The concept behind how hashes are combined in pandas? #57260

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 tasks done
phphuc612 opened this issue Feb 5, 2024 · 2 comments
Open
2 tasks done

QST: The concept behind how hashes are combined in pandas? #57260

phphuc612 opened this issue Feb 5, 2024 · 2 comments
Labels
hashing hash_pandas_object Usage Question

Comments

@phphuc612
Copy link

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/77784124/the-concept-behind-how-hashes-are-combined-in-pandas

Question about pandas

I came across the combine_hash_arrays functions in pandas library here. How does it keep good collision rate after combination ? Does anyone have the mathematical proof for this code ?

The below code is retrieved from pandas implementation.

def combine_hash_arrays(
    arrays: Iterator[np.ndarray], num_items: int
) -> npt.NDArray[np.uint64]:
    """
    Parameters
    ----------
    arrays : Iterator[np.ndarray]
    num_items : int

    Returns
    -------
    np.ndarray[uint64]

    Should be the same as CPython's tupleobject.c
    """
    try:
        first = next(arrays)
    except StopIteration:
        return np.array([], dtype=np.uint64)

    arrays = itertools.chain([first], arrays)

    mult = np.uint64(1000003)
    out = np.zeros_like(first) + np.uint64(0x345678)
    last_i = 0
    for i, a in enumerate(arrays):
        inverse_i = num_items - i
        out ^= a
        out *= mult
        mult += np.uint64(82520 + inverse_i + inverse_i)
        last_i = i
    assert last_i + 1 == num_items, "Fed in wrong num_items"
    out += np.uint64(97531)
    return out

I have already checked whether they work well by checking their combined hashes' collision rate.

@phphuc612 phphuc612 added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Feb 5, 2024
@rhshadrach
Copy link
Member

#15227 indicates it was taken from https://github.com/python/cpython/blob/main/Objects/tupleobject.c; perhaps researching how CPython does this may be fruitful. I'd recommend searching the Python dev mailing list.

@rhshadrach rhshadrach added hashing hash_pandas_object and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 5, 2024
@rhshadrach
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hashing hash_pandas_object Usage Question
Projects
None yet
Development

No branches or pull requests

2 participants