BUG: hash_pandas_object(obj).sum() changed value with pandas 2.1.0 #55452

Closed
2 of 3 tasks
hagenw opened this issue Oct 9, 2023 · 3 comments
Labels
Closing Candidate (may be closeable, needs more eyeballs), Usage Question

Comments

hagenw commented Oct 9, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

index = pd.Index(['f1', 'f2'], dtype='string', name='file')
pd.util.hash_pandas_object(index).sum()

Issue Description

Since version 2.1.0 of pandas, the sum of the hashes returns a different value even though the individual hashes are still the same:

With pandas >= 2.1.0

>>> pd.util.hash_pandas_object(index)
file
f1    18078623642821437999
f2    14583249088160825270
dtype: uint64

>>> pd.util.hash_pandas_object(index).sum()
14215128657272711653

With pandas <= 2.0.3

>>> pd.util.hash_pandas_object(index)
file
f1    18078623642821437999
f2    14583249088160825270
dtype: uint64

>>> pd.util.hash_pandas_object(index).sum()
-4231615416436839963

Expected Behavior

No related changes are mentioned in https://pandas.pydata.org/pandas-docs/version/2.1.0/whatsnew/v2.1.0.html, so I would not expect different behavior between the pandas versions.

Installed Versions

INSTALLED VERSIONS

commit : e86ed37
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-150-generic
Version : #167~18.04.1-Ubuntu SMP Wed May 24 00:51:42 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

hagenw added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Oct 9, 2023
@lithomas1
Member

I haven't looked too closely, but I think we used to cast to int64 before summing, which was incorrect and was fixed in #53418?

I can get your old result if I do

pd.util.hash_pandas_object(index).astype("int64").sum()
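
The relationship between the two results can be checked with plain integer arithmetic. The following is a minimal sketch, not pandas code: the hash values are copied from the output above, and the helper names `as_uint64` and `as_int64` are illustrative assumptions rather than part of any pandas or NumPy API. It suggests that both sums carry the same 64 bits, read once as an unsigned and once as a signed integer.

```python
# Hash values copied from the example output in this issue.
hashes = [18078623642821437999, 14583249088160825270]

MODULUS = 2**64  # 64-bit integer arithmetic wraps around modulo 2**64


def as_uint64(value: int) -> int:
    """Reduce an integer to the range of an unsigned 64-bit value."""
    return value % MODULUS


def as_int64(value: int) -> int:
    """Reinterpret the low 64 bits of an integer as a signed (two's complement) value."""
    value %= MODULUS
    return value - MODULUS if value >= 2**63 else value


exact_sum = sum(hashes)  # Python ints do not overflow

unsigned_sum = as_uint64(exact_sum)  # matches the pandas >= 2.1.0 result
signed_sum = as_int64(exact_sum)     # matches the pandas <= 2.0.3 result

print(unsigned_sum)  # 14215128657272711653
print(signed_sum)    # -4231615416436839963

# Mapping the old signed value back into the unsigned range gives the new value.
print(as_uint64(signed_sum) == unsigned_sum)  # True
```

If checksums computed with pandas <= 2.0.3 have been stored somewhere, a conversion along these lines may be enough to compare them with values produced by newer versions, assuming the only difference is the signed versus unsigned interpretation described above.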

lithomas1 added the Usage Question and Closing Candidate (may be closeable, needs more eyeballs) labels and removed the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Oct 9, 2023
@hagenw
Author

hagenw commented Oct 10, 2023

I can confirm that your workaround works to get the old results, thank you.

I see the point of fixing #53418, but it is also risky, as this was the default behavior for a very long time and the change can break existing code. Next time, it would be great to announce this in the release notes or introduce the change with a deprecation warning.

@lithomas1
Member

There's a release note in the 2.1 whatsnew that goes something like "Bug in :meth:`Series.sum` converting dtype uint64 to int64".

It's definitely very small, and easy to miss, though.

I'll close this now, since it's not a bug, but we'll try to make these kinds of fixes more visible in the notable bug fixes section going forward.
