BUG: hashing's are the same for different key values for hash_pandas_object #41404

Closed
Sandy4321 opened this issue May 9, 2021 · 13 comments · Fixed by #42049
Labels
Bug hashing hash_pandas_object
Comments

@Sandy4321

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

hash_pandas_object(test[columns_names[i]], index=True, encoding='utf8', hash_key='012', categorize=True)
0      3713087409444908179
1      7478705303072568462
2     12024724921319894105
3     12785939622558835299
4      9788992550609991128
5      1239052552041868816
6      9610202078597672705
7     12287384021013641209
8     10264240190786022141
9     10535148974563425818
10    10238940258630658604
11    15446383648481672096
12    14265484681526586699
13     8862960024351814462
dtype: uint64

hash_pandas_object(test[columns_names[i]], index=True, encoding='utf8', hash_key='01298768755', categorize=True)
0      3713087409444908179
1      7478705303072568462
2     12024724921319894105
3     12785939622558835299
4      9788992550609991128
5      1239052552041868816
6      9610202078597672705
7     12287384021013641209
8     10264240190786022141
9     10535148974563425818
10    10238940258630658604
11    15446383648481672096
12    14265484681526586699
13     8862960024351814462
dtype: uint64


hash_pandas_object(test[[columns_names[i], columns_names[j]]], index=True, encoding='utf8', hash_key='01', categorize=True)
0     11107058607426530111
1     15666232225746534312
2      1136675766145783381
3     14892489092684772659
4      8519430825150424018
5       550646855301521146
6      3846031041217881485
7      2936614219041217571
8     16182698869780262111
9      2895548739675332954
10      677258434224654732
11     6105852029672525672
12    15095703462911844621
13     6081994522921680694
dtype: uint64
hash_pandas_object(test[[columns_names[i], columns_names[j]]], index=True, encoding='utf8', hash_key='0198076674534', categorize=True)
0     11107058607426530111
1     15666232225746534312
2      1136675766145783381
3     14892489092684772659
4      8519430825150424018
5       550646855301521146
6      3846031041217881485
7      2936614219041217571
8     16182698869780262111
9      2895548739675332954
10      677258434224654732
11     6105852029672525672
12    15095703462911844621
13     6081994522921680694
dtype: uint64


test[[columns_names[i], columns_names[j]]]
    A  B
0   0 -1
1   0 -1
2   0  0
3   0  0
4   1  0
5   1  0
6   2  0
7   2  2
8   2  2
9   2  2
10  2  2
11  2  2
12 -1  2
13 -1  2


The hashes are the same for different key values.


Problem description

Hash values should be different for different keys.

Expected Output

Different hash values for different hash_key values.

Output of pd.show_versions()

print("sklearn.__version__ = ", sklearn.__version__)
sklearn.__version__ =  0.24.2

pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.2.4
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 20.3.3
setuptools : 51.3.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@Sandy4321 Sandy4321 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 9, 2021
@attack68
Contributor

Although I can confirm your example:

s = pd.Series([1, 2])
from pandas.util import hash_pandas_object
print(hash_pandas_object(s))
print(hash_pandas_object(s, hash_key='aaa'))

0    14639053686158035780
1     3869563279212530728
dtype: uint64
0    14639053686158035780
1     3869563279212530728
dtype: uint64

the documentation for this method alludes to the key being relevant only to string dtypes:

    hash_key : str, default _default_hash_key
        Hash_key for string key to encode.

Therefore:

s = pd.Series(["a", "b"])
from pandas.util import hash_pandas_object
print(hash_pandas_object(s))
print(hash_pandas_object(s, hash_key='aaa'))
print(hash_pandas_object(s, hash_key='abcdefghabcdefgh'))

0     4578374827886788867
1    17338122309987883691
dtype: uint64

ValueError: key should be a 16-byte string encoded, got b'aaa' (len 3)

0    13302670878694307853
1    18287704822077725462
dtype: uint64
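
Since the error message shows the key must encode to exactly 16 bytes, one way to build a valid key from an arbitrary passphrase is to take 16 hex characters of a digest. A minimal sketch (`make_hash_key` is a hypothetical helper, not part of the pandas API):

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object

def make_hash_key(passphrase: str) -> str:
    # Hypothetical helper: derive a 16-character ASCII key (16 bytes in
    # UTF-8), the length the underlying hash function requires.
    return hashlib.md5(passphrase.encode("utf-8")).hexdigest()[:16]

s = pd.Series(["a", "b"])
key = make_hash_key("any passphrase, any length")
print(len(key.encode("utf-8")))   # 16, so no ValueError is raised
print(hash_pandas_object(s, hash_key=key))
```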

@attack68 attack68 added hashing hash_pandas_object and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 10, 2021
@Sandy4321
Author

Thanks for the quick answer, but I still cannot get different hashes for different keys:
z = test[[columns_names[i], columns_names[j]]].astype(str)
z
    A  B
0   1  0
1   1  0
2   1  1
3   1  1
4   2  1
5   2  1
6   3  1
7   3  3
8   3  3
9   3  3
10  3  3
11  3  3
12  0  3
13  0  3
z.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       14 non-null     object
 1   B       14 non-null     object
dtypes: object(2)
memory usage: 352.0+ bytes
b = test[[columns_names[i], columns_names[j]]]
b.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       14 non-null     int32
 1   B       14 non-null     int32
dtypes: int32(2)
memory usage: 240.0 bytes
hash_pandas_object(z, hash_key='abcdefghabc')
0     11066859559894451155
1     10128852264039700196
2     10846863718361031705
3       832869700160088919
4      7201883610395252339
5     10980860498243417291
6      9270476748602192016
7     14708764044410797073
8      5526114605206198501
9     14995225442755034452
10    16782707353416019346
11    11260739996732753750
12    17354760054521679658
13    11692895254784616729
dtype: uint64
hash_pandas_object(z, hash_key='abcdefghabckkkk')
0     11066859559894451155
1     10128852264039700196
2     10846863718361031705
3       832869700160088919
4      7201883610395252339
5     10980860498243417291
6      9270476748602192016
7     14708764044410797073
8      5526114605206198501
9     14995225442755034452
10    16782707353416019346
11    11260739996732753750
12    17354760054521679658
13    11692895254784616729
dtype: uint64

@Sandy4321
Author

So should I convert the data to str?

By the way, can you share a link to the documentation saying
"the key being relevant only to string dtypes"?

@Sandy4321
Author

Thanks for the quick answer about the link.

Could you please help with the main problem as soon as possible: I converted the data to str but still get the same hashes, as written above.

@Sandy4321
Author

test[[columns_names[i], columns_names[j]]]
     A  B
0   -1 -1
1   -1 -1
2   -1 -1
3   -1 -1
4    0 -1
5    0 -1
6    1 -1
7    1  1
8    1  1
9    1  1
10   1  1
11   1  1
12  -1  1
13  -1  1
hash_pandas_object(test[[columns_names[i], columns_names[j]]].astype(str), hash_key='abcdefghabcdef')
0     13822031625715865093
1     16059071767624278230
2     14974165724741982835
3     14777292977639537941
4      4777529079098660000
5     14919852986134491896
6     13564712168478105804
7      7707465176835891907
8     10257841727640903743
9     16228073358228106746
10     8706499784337997036
11     4240757573166896840
12     3108686149513081834
13    17171217657043349337
dtype: uint64
hash_pandas_object(test[[columns_names[i], columns_names[j]]].astype(str), hash_key='abcdefgh')
0     13822031625715865093
1     16059071767624278230
2     14974165724741982835
3     14777292977639537941
4      4777529079098660000
5     14919852986134491896
6     13564712168478105804
7      7707465176835891907
8     10257841727640903743
9     16228073358228106746
10     8706499784337997036
11     4240757573166896840
12     3108686149513081834
13    17171217657043349337
dtype: uint64

@Sandy4321
Author

A more detailed example:

test[[columns_names[i], columns_names[j]]]
     A  B
0   -1 -1
1   -1 -1
2   -1 -1
3   -1 -1
4    0 -1
5    0 -1
6    1 -1
7    1  1
8    1  1
9    1  1
10   1  1
11   1  1
12  -1  1
13  -1  1
z = test[[columns_names[i], columns_names[j]]].astype(str)
z.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       14 non-null     object
 1   B       14 non-null     object
dtypes: object(2)
memory usage: 352.0+ bytes
hash_pandas_object(z, index=False, hash_key='abcdefgha')
0      9086491323826902825
1      9086491323826902825
2      9086491323826902825
3      9086491323826902825
4      8310200602398290526
5      8310200602398290526
6     17265937606657724210
7      4417460742323483519
8      4417460742323483519
9      4417460742323483519
10     4417460742323483519
11     4417460742323483519
12      604622797768995028
13      604622797768995028
dtype: uint64
hash_pandas_object(z, index=False, hash_key='abcdefghakkkkk')
0      9086491323826902825
1      9086491323826902825
2      9086491323826902825
3      9086491323826902825
4      8310200602398290526
5      8310200602398290526
6     17265937606657724210
7      4417460742323483519
8      4417460742323483519
9      4417460742323483519
10     4417460742323483519
11     4417460742323483519
12      604622797768995028
13      604622797768995028
dtype: uint64
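
If a workaround is needed on affected versions, one option is to hash each column as a Series, where hash_key is honored for string data, and combine the per-column hashes yourself. A sketch (`hash_frame_per_column` is a hypothetical helper, and XOR is a deliberately crude combiner, not what pandas uses internally):

```python
import numpy as np
import pandas as pd
from pandas.util import hash_pandas_object

def hash_frame_per_column(df: pd.DataFrame, hash_key: str) -> pd.Series:
    # Sketch of a workaround: hash column-by-column so hash_key reaches
    # the string-hashing path, then combine the per-column hashes with XOR.
    combined = np.zeros(len(df), dtype=np.uint64)
    for col in df.columns:
        combined ^= hash_pandas_object(
            df[col], index=False, hash_key=hash_key
        ).to_numpy()
    return pd.Series(combined, index=df.index, dtype="uint64")

df = pd.DataFrame({"A": ["-1", "0"], "B": ["1", "2"]})
print(hash_frame_per_column(df, "abcdefghabcdefgh"))
print(hash_frame_per_column(df, "0123456789abcdef"))
```

Note that XOR cancels two identical columns row-wise, so this is only a sketch of the idea, not a drop-in replacement for the fixed behavior.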

@attack68
Contributor

I see the same as you when you use .astype(str).

Please reduce your examples to the bare minimum; there is much unnecessary information across the 14 rows of your dataframe, and you can express the column selection much more clearly than test[[columns_names[i], columns_names[j]]]. You don't even need the index, really.
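
For reference, a bare-minimum reproduction might look like the following sketch (a made-up two-row frame; both keys encode to 16 bytes, so the key-length check cannot interfere):

```python
import pandas as pd
from pandas.util import hash_pandas_object

# Two object-dtype columns, two rows -- the smallest shape that shows
# the reported behaviour.
df = pd.DataFrame({"A": ["-1", "0"], "B": ["1", "1"]})

h1 = hash_pandas_object(df, index=False, hash_key="abcdefghabcdefgh")
h2 = hash_pandas_object(df, index=False, hash_key="0123456789abcdef")

# On affected versions such as 1.2.4, the two outputs are identical.
print(h1)
print(h2)
```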

@Sandy4321
Author

"and you can express the column selection much more clearly than test[[columns_names[i], columns_names[j]]]"
What do you mean? I explicitly removed the index, as you can see in the code: index=False.

"please reduce your examples to the bare minimum. there is much unnecessary information across the 14 rows of your dataframe."

How many rows do you need, 2 or 3?

@Sandy4321
Author

hash_pandas_object(df[[columns_names[i], columns_names[j]]].astype(str), index=False, hash_key='abcdefghabcdef')
0    9086491323826902825
1    9086491323826902825
2    9086491323826902825
3    9086491323826902825
dtype: uint64
hash_pandas_object(df[[columns_names[i], columns_names[j]]].astype(str), index=False, hash_key='abcdefgkkkkk')
0    9086491323826902825
1    9086491323826902825
2    9086491323826902825
3    9086491323826902825
dtype: uint64
df
   A  B  C
0 -1 -1  0
1 -1 -1  1
2 -1 -1 -1
3 -1 -1 -1

@Sandy4321
Author

where i = 0 and j = 1

@Sandy4321
Author

Meanwhile, this seems to be a bug in pandas. Can you fix it ASAP?

@attack68
Contributor

"Meanwhile, this seems to be a bug in pandas. Can you fix it ASAP?"

Pandas is a volunteer open source project. There are over 3.5k outstanding issues. I will not be focusing my efforts on this. You are very welcome and encouraged to submit your own PR to address this issue.
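
For anyone picking this up, a regression test for such a PR might be sketched as follows (the test name is made up, and it assumes the fix makes hash_key reach each DataFrame column):

```python
import pandas as pd
from pandas.util import hash_pandas_object

def test_hash_key_changes_dataframe_hashes():
    # Hypothetical regression test for GH 41404: two different 16-byte
    # keys should yield different hashes for object-dtype columns.
    df = pd.DataFrame({"A": ["-1", "0"], "B": ["1", "1"]})
    h1 = hash_pandas_object(df, index=False, hash_key="abcdefghabcdefgh")
    h2 = hash_pandas_object(df, index=False, hash_key="0123456789abcdef")
    assert not h1.equals(h2)
```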
