Skip to content

BUG: Hash-function doesn't take upper bits into consideration for PyObjects #37615

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
realead opened this issue Nov 3, 2020 · 1 comment · Fixed by #39592
Closed
3 tasks done

BUG: Hash-function doesn't take upper bits into consideration for PyObjects #37615

realead opened this issue Nov 3, 2020 · 1 comment · Fixed by #39592
Labels
Bug hashing hash_pandas_object Needs Triage Issue that has not been reviewed by a pandas team member
Milestone

Comments

@realead
Copy link
Contributor

realead commented Nov 3, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

def create(M, shift=0):
    lst=[x<<shift for x in range(M)]
    return np.array(lst, dtype=np.object)

#linear running time:
%timeit pd.unique(create(10**3, 0))
%timeit pd.unique(create(10**4, 0))

# quadratic running time:
# for shift=32, the lower 32bits returned 
# by Py_Hash are 0 for 64bit builds
%timeit pd.unique(create(10**3, 32))
%timeit pd.unique(create(10**4, 32))

Problem description

The times for small integers are: 222 µs vs 1.6 ms, i.e. about factor 10
The times for big integers are: 16.5 ms vs 1.68 s, i.e. about factor 100

Expected Output

Also for big integers, the factor should be about 10 (i.e. linear) and not quadratic.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 824869f
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-53-generic
Version : #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0.dev0+1033.g824869f
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.36.0
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.50.1

@realead realead added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 3, 2020
@realead
Copy link
Contributor Author

realead commented Nov 3, 2020

The issue, is that for PyObject_Hash isn't really a great hash-function, e.g. for integers < 2**60 it returns the integer itself for 64bit-builds.

PyObject_Hash is used directly:

#define kh_python_hash_func(key) (PyObject_Hash(key))

However, for 64bit-builds that line means that a int64-value is casted to uint32 thus losing all information in the upper bits, in the example above all hash-values are just 0 leading to quadratic running time due to hash-collisions.

The value returned by PyObject_Hash should be mangled similar to int64-value:

#define kh_int64_hash_func(key) (khint32_t)((key)>>33^(key)^(key)<<11)

@arw2019 arw2019 added the hashing hash_pandas_object label Nov 4, 2020
@jreback jreback added this to the 1.3 milestone Feb 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug hashing hash_pandas_object Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants