BUG: hash_pandas_object ignores optional arguments when the input is a DataFrame. #42049
Thanks for the pr @i-aki-y! Some quick initial comments
pandas/tests/util/test_hashing.py
@@ -255,6 +255,16 @@ def test_hash_keys():
     assert (a != b).all()


+def test_df_hash_keys():
+    # DataFrame version of the test_hash_keys
Can you please link the issue here?
@mzeitlin11 Thank you for your comment. I added a link to the issue.
@@ -139,7 +139,10 @@ def hash_pandas_object(
         ser = Series(h, index=obj.index, dtype="uint64", copy=False)

     elif isinstance(obj, ABCDataFrame):
-        hashes = (hash_array(series._values) for _, series in obj.items())
+        hashes = (
+            hash_array(series._values, encoding, hash_key, categorize)
Since encoding also wasn't passed before, can you add a test that ensures we respect encoding too?
I added a test_df_encoding test to confirm that the hash values change when different encodings are specified.
lgtm @i-aki-y. Can you add a release note in the bug fix section (Other is fine) for 1.3? Ping on green.
@jreback Thanks, I have added a whatsnew entry.
thanks @i-aki-y
@meeseeksdev backport 1.3.x
…l arguments when the input is a DataFrame.
Something went wrong ... Please have a look at my logs.
…s when the input is a DataFrame. (#42108) Co-authored-by: i-aki-y <[email protected]>
In this PR, I will try to solve issue #41404.
I would appreciate it if you could review this.
About the issue
In the current pandas.util.hash_pandas_object function implementation, the hash_key argument is ignored when the input object type is a DataFrame. This is a reproducible example.
The output is here.
The last output should be
About the fix
When the input object is a DataFrame, the hash values are calculated by applying the hash_array function to each Series. However, optional arguments, including hash_key, are not passed to the hash_array function in the current implementation. To fix this, I passed the optional arguments to the hash_array function.
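The behavior described above can be checked with a short script. This is only a minimal sketch, not the reproducible example from the issue: the DataFrame contents and the alternative hash_key value are arbitrary choices made for illustration, and the checks assume a pandas version that includes this fix (1.3 or later). An object-dtype string column is used because string hashing is where hash_key and encoding come into play.

```python
import pandas as pd
from pandas.util import hash_pandas_object

# Illustrative data (an assumption, not taken from the issue).
# The "y" column is object dtype so that hash_key and encoding are exercised.
df = pd.DataFrame({"x": [0, 1, 2], "y": pd.Series(["a", "+", "c"], dtype=object)})

# Hash each row with the default key and with a different key.
# The key must encode to exactly 16 bytes; this value is arbitrary.
default_hashes = hash_pandas_object(df)
keyed_hashes = hash_pandas_object(df, hash_key="9876543210123456")

# Hash each row with two different string encodings.
# "+" is encoded differently in UTF-7 than in UTF-8, while "a" and "c" are not.
utf8_hashes = hash_pandas_object(df, encoding="utf8")
utf7_hashes = hash_pandas_object(df, encoding="utf7")

# With the fix applied, every row hash should depend on hash_key ...
print((default_hashes != keyed_hashes).all())
# ... and only the row containing "+" should depend on the encoding.
print((utf8_hashes != utf7_hashes).tolist())
```

Before the fix, both comparisons would be expected to report no differences for a DataFrame input, because the optional arguments never reached hash_array.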