BUG: hashing's are the same for different key values for hash_pandas_object #41404

Closed
Sandy4321 opened this issue May 9, 2021 · 13 comments · Fixed by #42049
Labels
Bug hashing hash_pandas_object
Comments

@Sandy4321

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

hash_pandas_object(test[columns_names[i]], index=True, encoding='utf8', hash_key='012', categorize=True)
0      3713087409444908179
1      7478705303072568462
2     12024724921319894105
3     12785939622558835299
4      9788992550609991128
5      1239052552041868816
6      9610202078597672705
7     12287384021013641209
8     10264240190786022141
9     10535148974563425818
10    10238940258630658604
11    15446383648481672096
12    14265484681526586699
13     8862960024351814462
dtype: uint64

hash_pandas_object(test[columns_names[i]], index=True, encoding='utf8', hash_key='01298768755', categorize=True)
0      3713087409444908179
1      7478705303072568462
2     12024724921319894105
3     12785939622558835299
4      9788992550609991128
5      1239052552041868816
6      9610202078597672705
7     12287384021013641209
8     10264240190786022141
9     10535148974563425818
10    10238940258630658604
11    15446383648481672096
12    14265484681526586699
13     8862960024351814462
dtype: uint64


hash_pandas_object(test[[columns_names[i], columns_names[j]]], index=True, encoding='utf8', hash_key='01', categorize=True)
0     11107058607426530111
1     15666232225746534312
2      1136675766145783381
3     14892489092684772659
4      8519430825150424018
5       550646855301521146
6      3846031041217881485
7      2936614219041217571
8     16182698869780262111
9      2895548739675332954
10      677258434224654732
11     6105852029672525672
12    15095703462911844621
13     6081994522921680694
dtype: uint64
hash_pandas_object(test[[columns_names[i], columns_names[j]]], index=True, encoding='utf8', hash_key='0198076674534', categorize=True)
0     11107058607426530111
1     15666232225746534312
2      1136675766145783381
3     14892489092684772659
4      8519430825150424018
5       550646855301521146
6      3846031041217881485
7      2936614219041217571
8     16182698869780262111
9      2895548739675332954
10      677258434224654732
11     6105852029672525672
12    15095703462911844621
13     6081994522921680694
dtype: uint64


test[[columns_names[i], columns_names[j]]]
    A  B
0   0 -1
1   0 -1
2   0  0
3   0  0
4   1  0
5   1  0
6   2  0
7   2  2
8   2  2
9   2  2
10  2  2
11  2  2
12 -1  2
13 -1  2


The hashes are the same for different key values.


Problem description

Hash values should be different for different keys.

Expected Output

Different hash values for different hash_key values.

Output of pd.show_versions()

print("sklearn.__version__ = ", sklearn.__version__)
sklearn.__version__ =  0.24.2

pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.2.4
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 20.3.3
setuptools : 51.3.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@Sandy4321 Sandy4321 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 9, 2021
@attack68
Contributor

Although I can confirm your example:

s = pd.Series([1, 2])
from pandas.util import hash_pandas_object
print(hash_pandas_object(s))
print(hash_pandas_object(s, hash_key='aaa'))

0    14639053686158035780
1     3869563279212530728
dtype: uint64
0    14639053686158035780
1     3869563279212530728
dtype: uint64

the documentation for this method alludes to the key being relevant only to string dtypes:

    hash_key : str, default _default_hash_key
        Hash_key for string key to encode.

Therefore:

s = pd.Series(["a", "b"])
from pandas.util import hash_pandas_object
print(hash_pandas_object(s))
print(hash_pandas_object(s, hash_key='aaa'))
print(hash_pandas_object(s, hash_key='abcdefghabcdefgh'))

0     4578374827886788867
1    17338122309987883691
dtype: uint64

ValueError: key should be a 16-byte string encoded, got b'aaa' (len 3)

0    13302670878694307853
1    18287704822077725462
dtype: uint64
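
Since the error message shows the key must encode to exactly 16 bytes, one way to build a valid key from an arbitrary passphrase is to take 16 hex characters of a digest. A minimal sketch (`make_hash_key` is a hypothetical helper, not part of the pandas API):

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object

def make_hash_key(passphrase: str) -> str:
    # Hypothetical helper: derive a 16-character ASCII key (16 bytes in
    # UTF-8), the length the underlying hash function requires.
    return hashlib.md5(passphrase.encode("utf-8")).hexdigest()[:16]

s = pd.Series(["a", "b"])
key = make_hash_key("any passphrase, any length")
print(len(key.encode("utf-8")))   # 16, so no ValueError is raised
print(hash_pandas_object(s, hash_key=key))
```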

@attack68 attack68 added hashing hash_pandas_object and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 10, 2021
@Sandy4321
Author

Thanks for the quick answer, but I still cannot get different hashes for different keys:
z = test[[columns_names[i], columns_names[j]]].astype(str)
z
    A  B
0   1  0
1   1  0
2   1  1
3   1  1
4   2  1
5   2  1
6   3  1
7   3  3
8   3  3
9   3  3
10  3  3
11  3  3
12  0  3
13  0  3
z.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       14 non-null     object
 1   B       14 non-null     object
dtypes: object(2)
memory usage: 352.0+ bytes
b = test[[columns_names[i], columns_names[j]]]
b.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       14 non-null     int32
 1   B       14 non-null     int32
dtypes: int32(2)
memory usage: 240.0 bytes
hash_pandas_object(z, hash_key='abcdefghabc')
0     11066859559894451155
1     10128852264039700196
2     10846863718361031705
3       832869700160088919
4      7201883610395252339
5     10980860498243417291
6      9270476748602192016
7     14708764044410797073
8      5526114605206198501
9     14995225442755034452
10    16782707353416019346
11    11260739996732753750
12    17354760054521679658
13    11692895254784616729
dtype: uint64
hash_pandas_object(z, hash_key='abcdefghabckkkk')
0     11066859559894451155
1     10128852264039700196
2     10846863718361031705
3       832869700160088919
4      7201883610395252339
5     10980860498243417291
6      9270476748602192016
7     14708764044410797073
8      5526114605206198501
9     14995225442755034452
10    16782707353416019346
11    11260739996732753750
12    17354760054521679658
13    11692895254784616729
dtype: uint64

@Sandy4321
Author

So should I convert the data to str?

By the way, can you share a link to the documentation saying
"the key being relevant only to string dtypes"?

@Sandy4321
Author

Thanks for the quick answer about the link.

Could you please help with the main problem as soon as possible: I converted the data to str but still get the same hashes, as written above.

@Sandy4321
Author

test[[columns_names[i], columns_names[j]]]
     A  B
0   -1 -1
1   -1 -1
2   -1 -1
3   -1 -1
4    0 -1
5    0 -1
6    1 -1
7    1  1
8    1  1
9    1  1
10   1  1
11   1  1
12  -1  1
13  -1  1
hash_pandas_object(test[[columns_names[i], columns_names[j]]].astype(str), hash_key='abcdefghabcdef')
0     13822031625715865093
1     16059071767624278230
2     14974165724741982835
3     14777292977639537941
4      4777529079098660000
5     14919852986134491896
6     13564712168478105804
7      7707465176835891907
8     10257841727640903743
9     16228073358228106746
10     8706499784337997036
11     4240757573166896840
12     3108686149513081834
13    17171217657043349337
dtype: uint64
hash_pandas_object(test[[columns_names[i], columns_names[j]]].astype(str), hash_key='abcdefgh')
0     13822031625715865093
1     16059071767624278230
2     14974165724741982835
3     14777292977639537941
4      4777529079098660000
5     14919852986134491896
6     13564712168478105804
7      7707465176835891907
8     10257841727640903743
9     16228073358228106746
10     8706499784337997036
11     4240757573166896840
12     3108686149513081834
13    17171217657043349337
dtype: uint64

@Sandy4321
Author

A more detailed example:

test[[columns_names[i], columns_names[j]]]
     A  B
0   -1 -1
1   -1 -1
2   -1 -1
3   -1 -1
4    0 -1
5    0 -1
6    1 -1
7    1  1
8    1  1
9    1  1
10   1  1
11   1  1
12  -1  1
13  -1  1
z = test[[columns_names[i], columns_names[j]]].astype(str)
z.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       14 non-null     object
 1   B       14 non-null     object
dtypes: object(2)
memory usage: 352.0+ bytes
hash_pandas_object(z, index=False, hash_key='abcdefgha')
0      9086491323826902825
1      9086491323826902825
2      9086491323826902825
3      9086491323826902825
4      8310200602398290526
5      8310200602398290526
6     17265937606657724210
7      4417460742323483519
8      4417460742323483519
9      4417460742323483519
10     4417460742323483519
11     4417460742323483519
12      604622797768995028
13      604622797768995028
dtype: uint64
hash_pandas_object(z, index=False, hash_key='abcdefghakkkkk')
0      9086491323826902825
1      9086491323826902825
2      9086491323826902825
3      9086491323826902825
4      8310200602398290526
5      8310200602398290526
6     17265937606657724210
7      4417460742323483519
8      4417460742323483519
9      4417460742323483519
10     4417460742323483519
11     4417460742323483519
12      604622797768995028
13      604622797768995028
dtype: uint64
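
If a workaround is needed on affected versions, one option is to hash each column as a Series, where hash_key is honored for string data, and combine the per-column hashes yourself. A sketch (`hash_frame_per_column` is a hypothetical helper, and XOR is a deliberately crude combiner, not what pandas uses internally):

```python
import numpy as np
import pandas as pd
from pandas.util import hash_pandas_object

def hash_frame_per_column(df: pd.DataFrame, hash_key: str) -> pd.Series:
    # Sketch of a workaround: hash column-by-column so hash_key reaches
    # the string-hashing path, then combine the per-column hashes with XOR.
    combined = np.zeros(len(df), dtype=np.uint64)
    for col in df.columns:
        combined ^= hash_pandas_object(
            df[col], index=False, hash_key=hash_key
        ).to_numpy()
    return pd.Series(combined, index=df.index, dtype="uint64")

df = pd.DataFrame({"A": ["-1", "0"], "B": ["1", "2"]})
print(hash_frame_per_column(df, "abcdefghabcdefgh"))
print(hash_frame_per_column(df, "0123456789abcdef"))
```

Note that XOR cancels two identical columns row-wise, so this is only a sketch of the idea, not a drop-in replacement for the fixed behavior.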

@attack68
Contributor

I see the same as you when you use .astype(str).

Please reduce your examples to the bare minimum; there is much unnecessary information across the 14 rows of your dataframe, and you can express the column selection much more clearly than test[[columns_names[i], columns_names[j]]]. You don't even need the index, really.
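
For reference, a bare-minimum reproduction might look like the following sketch (a made-up two-row frame; both keys encode to 16 bytes, so the key-length check cannot interfere):

```python
import pandas as pd
from pandas.util import hash_pandas_object

# Two object-dtype columns, two rows -- the smallest shape that shows
# the reported behaviour.
df = pd.DataFrame({"A": ["-1", "0"], "B": ["1", "1"]})

h1 = hash_pandas_object(df, index=False, hash_key="abcdefghabcdefgh")
h2 = hash_pandas_object(df, index=False, hash_key="0123456789abcdef")

# On affected versions such as 1.2.4, the two outputs are identical.
print(h1)
print(h2)
```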

@Sandy4321
Author

"and you can express the column selection much more clearly than test[[columns_names[i], columns_names[j]]]"
What do you mean? I explicitly removed the index, as you can see in the code: index=False.

"please reduce your examples to the bare minimum. there is much unnecessary information across the 14 rows of your dataframe."

How many rows do you need, 2 or 3?

@Sandy4321
Author

hash_pandas_object(df[[columns_names[i], columns_names[j]]].astype(str), index=False, hash_key='abcdefghabcdef')
0    9086491323826902825
1    9086491323826902825
2    9086491323826902825
3    9086491323826902825
dtype: uint64
hash_pandas_object(df[[columns_names[i], columns_names[j]]].astype(str), index=False, hash_key='abcdefgkkkkk')
0    9086491323826902825
1    9086491323826902825
2    9086491323826902825
3    9086491323826902825
dtype: uint64
df
   A  B  C
0 -1 -1  0
1 -1 -1  1
2 -1 -1 -1
3 -1 -1 -1

@Sandy4321
Author

where i = 0 and j = 1

@Sandy4321
Author

Meanwhile, this seems to be a bug in pandas. Can you fix it ASAP?

@attack68
Contributor

"Meanwhile, this seems to be a bug in pandas. Can you fix it ASAP?"

Pandas is a volunteer open source project. There are over 3.5k outstanding issues. I will not be focusing my efforts on this. You are very welcome and encouraged to submit your own PR to address this issue.
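
For anyone picking this up, a regression test for such a PR might be sketched as follows (the test name is made up, and it assumes the fix makes hash_key reach each DataFrame column):

```python
import pandas as pd
from pandas.util import hash_pandas_object

def test_hash_key_changes_dataframe_hashes():
    # Hypothetical regression test for GH 41404: two different 16-byte
    # keys should yield different hashes for object-dtype columns.
    df = pd.DataFrame({"A": ["-1", "0"], "B": ["1", "1"]})
    h1 = hash_pandas_object(df, index=False, hash_key="abcdefghabcdefgh")
    h2 = hash_pandas_object(df, index=False, hash_key="0123456789abcdef")
    assert not h1.equals(h2)
```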
