BUG: hash_pandas_object ignores optional arguments when the input is a DataFrame. #42049

Merged: 3 commits merged into pandas-dev:master on Jun 18, 2021

Contributor

@i-aki-y i-aki-y commented Jun 16, 2021

This PR fixes issue #41404.
I would appreciate a review.

About the issue

In the current implementation of pandas.util.hash_pandas_object, the hash_key argument is ignored when the input is a DataFrame.
Here is a reproducible example:

import pandas as pd

df = pd.DataFrame({"x": list("abc")})
new_hash_key = "a" * 16  # hash_key must encode to 16 bytes
print("Series")
print(pd.util.hash_pandas_object(df["x"]).values)
print(pd.util.hash_pandas_object(df["x"], hash_key=new_hash_key).values)
print("DataFrame")
print(pd.util.hash_pandas_object(df).values)
print(pd.util.hash_pandas_object(df, hash_key=new_hash_key).values)

This produces the following output:

Series
[ 4578374827886788867 17338122309987883691  5473791562133574857]
[14372590553486244284 12017144478307119910    34790805896362026]
DataFrame
[ 4578374827886788867 17338122309987883691  5473791562133574857]
[ 4578374827886788867 17338122309987883691  5473791562133574857]

The last line of the output should instead match the Series result for the same key:

[14372590553486244284 12017144478307119910    34790805896362026]
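
The same reproduction can also be written as assertions, which makes the expected behavior explicit. This is a minimal sketch, assuming a pandas version that includes this fix (1.3 or later); the variable names are illustrative and are not taken from the pandas test suite.

```python
import pandas as pd

df = pd.DataFrame({"x": list("abc")})
new_hash_key = "a" * 16  # hash_key must encode to 16 bytes

# Hash the single-column DataFrame with the default key and with the custom key.
df_default = pd.util.hash_pandas_object(df)
df_keyed = pd.util.hash_pandas_object(df, hash_key=new_hash_key)

# Hash the same column as a Series with the custom key, for comparison.
series_keyed = pd.util.hash_pandas_object(df["x"], hash_key=new_hash_key)

# With the fix, the custom key changes every row hash ...
assert (df_default != df_keyed).all()
# ... and the single-column DataFrame agrees with the Series result.
assert (df_keyed == series_keyed).all()
```

On a pandas version without the fix, the first assertion is expected to fail, because the DataFrame path falls back to the default key.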

About the fix

When the input object is a DataFrame, the hash values are calculated by applying the hash_array function to each column (Series).
However, in the current implementation the optional arguments, including hash_key, are not passed to hash_array, so it always falls back to its defaults.
To fix this, I pass the optional arguments through to hash_array.
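
The idea behind the fix can be sketched with the public pd.util.hash_array function. The helper hash_columns below is hypothetical and is not the pandas implementation: the real DataFrame branch additionally combines the per-column hashes with the index hashes. The sketch only shows that the optional arguments take effect once they are forwarded to hash_array.

```python
import pandas as pd


def hash_columns(df, encoding="utf8", hash_key="0123456789123456", categorize=True):
    """Hypothetical helper: hash each column and forward the optional arguments."""
    return {
        name: pd.util.hash_array(
            column.to_numpy(),
            encoding=encoding,
            hash_key=hash_key,
            categorize=categorize,
        )
        for name, column in df.items()
    }


# Object dtype is requested explicitly so the example does not depend on
# which string dtype pandas infers by default.
df = pd.DataFrame({"x": list("abc")}, dtype=object)

default_hashes = hash_columns(df)
keyed_hashes = hash_columns(df, hash_key="a" * 16)

# Because hash_key reaches hash_array, the per-column hashes change.
print((default_hashes["x"] != keyed_hashes["x"]).all())
```

Dropping the keyword arguments from the hash_array call inside the helper would reproduce the reported behavior: both calls would return identical hashes.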

Member

@mzeitlin11 mzeitlin11 left a comment

Thanks for the PR, @i-aki-y! Some quick initial comments:

@@ -255,6 +255,16 @@ def test_hash_keys():
     assert (a != b).all()


+def test_df_hash_keys():
+    # DataFrame version of the test_hash_keys
Member

Can you please link the issue here?

Contributor Author

@mzeitlin11 Thank you for your comment. I added a link to the issue.

@@ -139,7 +139,10 @@ def hash_pandas_object(
         ser = Series(h, index=obj.index, dtype="uint64", copy=False)

     elif isinstance(obj, ABCDataFrame):
-        hashes = (hash_array(series._values) for _, series in obj.items())
+        hashes = (
+            hash_array(series._values, encoding, hash_key, categorize)
Member

Since encoding also wasn't passed before, can you add a test that ensures we respect encoding too?

Contributor Author

I added test_df_encoding to confirm that the hash values change when a different encoding is specified.
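
A check along these lines could look like the sketch below. It is not the test_df_encoding added in this PR, whose body is not shown here; the data and variable names are assumptions chosen for illustration, and the sketch assumes a pandas version that includes the fix. It relies on the fact that "+" is encoded as "+-" in UTF-7, while plain ASCII letters encode to the same bytes in UTF-7 and UTF-8.

```python
import pandas as pd

# Only the first row encodes differently under UTF-7. Object dtype is
# requested explicitly so the strings are hashed via their encoded bytes.
df = pd.DataFrame({"x": ["+", "b", "c"]}, dtype=object)

utf8_hashes = pd.util.hash_pandas_object(df, encoding="utf8")
utf7_hashes = pd.util.hash_pandas_object(df, encoding="utf7")

# The row whose encoded bytes differ should get a different hash,
# while rows with identical bytes should keep the same hash.
encoding_changes_hash = bool(utf8_hashes[0] != utf7_hashes[0])
other_rows_unchanged = bool((utf8_hashes[1:] == utf7_hashes[1:]).all())
```

If the encoding argument were ignored for DataFrame input, encoding_changes_hash would be False.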

@mzeitlin11 mzeitlin11 added Bug hashing hash_pandas_object labels Jun 16, 2021
@jreback jreback added this to the 1.3 milestone Jun 17, 2021
Contributor

jreback commented Jun 17, 2021

LGTM. @i-aki-y, can you add a release note in the bug fix section (Other is fine) for 1.3? Ping on green.

Contributor Author

i-aki-y commented Jun 18, 2021

@jreback Thanks, I have added a whatsnew entry.

@jreback jreback merged commit 1ff6970 into pandas-dev:master Jun 18, 2021
Contributor

jreback commented Jun 18, 2021

thanks @i-aki-y

Contributor

jreback commented Jun 18, 2021

@meeseeksdev backport 1.3.x


lumberbot-app bot commented Jun 18, 2021

Something went wrong ... Please have a look at my logs.

simonjayhawkins pushed a commit that referenced this pull request Jun 18, 2021
…s when the input is a DataFrame. (#42108)

Co-authored-by: i-aki-y
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Bug hashing hash_pandas_object
Successfully merging this pull request may close these issues.

BUG: hashing's are the same for different key values for hash_pandas_object
3 participants