ENH: Add a safe Option to hash_pandas_object with Default Value Set to True #60428

Open
2 of 3 tasks
cryptochecktool opened this issue Nov 27, 2024 · 6 comments
Labels
Enhancement, hashing (hash_pandas_object), Needs Discussion (Requires discussion from core team before further action)

Comments

@cryptochecktool

cryptochecktool commented Nov 27, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The current implementation of hash_pandas_object does not meet collision resistance requirements, although this is known to the developers. This limitation is not prominently documented, and the function is already widely used by downstream AI platforms such as MLflow and AutoGluon. These platforms use hash_pandas_object to convert DataFrame structures into hash values and then apply MD5 or SHA-256 to the result for uniqueness checks, enabling caching and related functionality. This makes these platforms more vulnerable to malicious datasets.

Therefore, I propose adding a safe option with a default value set to True. This would directly benefit the security of a large number of downstream applications. If not, the documentation should explicitly state that the function does not provide collision resistance and should not be used for caching or similar tasks.
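
To illustrate why applying MD5 or SHA-256 on top of a non-collision-resistant hash does not help, here is a minimal sketch. The function toy_row_hash is a deliberately weak stand-in chosen for illustration; it is an assumption and is not the algorithm pandas uses. Any collision in the inner hash carries straight through the outer digest.

```python
import hashlib
import struct

MASK64 = (1 << 64) - 1


def toy_row_hash(row):
    """Toy 64-bit hash (NOT pandas' algorithm): sum of the values modulo 2**64."""
    return sum(row) & MASK64


def digest_of_hash(row):
    """Pattern under discussion: SHA-256 applied to a 64-bit pre-hash."""
    return hashlib.sha256(struct.pack("<Q", toy_row_hash(row))).hexdigest()


def digest_of_values(row):
    """SHA-256 applied directly to the raw 64-bit values."""
    return hashlib.sha256(struct.pack(f"<{len(row)}q", *row)).hexdigest()


row_a = [1, 3]
row_b = [2, 2]  # different data, same toy hash (1 + 3 == 2 + 2)

print(digest_of_hash(row_a) == digest_of_hash(row_b))      # True
print(digest_of_values(row_a) == digest_of_values(row_b))  # False
```

The outer digest can only be as collision resistant as its input: once two datasets share the same pre-hash, the SHA-256 step sees identical bytes.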

Feature Description

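def hash_pandas_object(obj, *args, safe=True, **kwargs):
    if safe:
        # Proposed collision-resistant path; safe_hash_pandas_object does not exist yet
        return safe_hash_pandas_object(obj, *args, **kwargs)
    # Existing code
    ...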

Alternative Solutions

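Alternatively, if users need to modify the function themselves, they can pickle the DataFrame to bytes before hashing. Note that to_pickle() writes to a path or buffer and returns None, so pickle.dumps is used here instead.

import hashlib
import pickle
df_bytes = pickle.dumps(df)  # serialize the DataFrame to bytes in memory
hash_object = hashlib.sha256(df_bytes)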

Additional Context

autogluon code:
https://github.com/autogluon/autogluon/blob/082d8bae7343f02e9dc9ce3db76bc3f305027b10/common/src/autogluon/common/utils/utils.py#L176

mlflow code at:
https://github.com/mlflow/mlflow/blob/615c4cbafd616e818ff17bfcd964e8366a5cd3ed/mlflow/data/digest_utils.py#L39

graphistry code at:
https://github.com/graphistry/pygraphistry/blob/52ea49afbea55291c41962f79a90d74d76c721b9/graphistry/util.py#L84

Developer discussion on pandas functionality: #16372 (comment)

Documentation link for hash_pandas_object: https://pandas.pydata.org/docs/reference/api/pandas.util.hash_pandas_object.html#pandas.util.hash_pandas_object

one demo:

import pandas as pd

# Define two data dictionaries that differ in the first row
data1 = {
    'A': [1604090909467468979, 2],
    'B': [4, 4]
}
data2 = {
    'A': [1, 2],
    'B': [3, 4]
}
# Convert dictionaries to DataFrame
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Calculate the per-row hash values for each DataFrame
hash_df1 = pd.util.hash_pandas_object(df1)
hash_df2 = pd.util.hash_pandas_object(df2)
# Print both results so the row hashes can be compared
print(hash_df1)
print(hash_df2)

cryptochecktool added the Enhancement and Needs Triage labels on Nov 27, 2024
cryptochecktool changed the title from "ENH: Add a safe Option to pandas_hash_object with Default Value Set to True" to "ENH: Add a safe Option to hash_pandas_object with Default Value Set to True" on Nov 27, 2024
@rhshadrach
Member

Thanks for the report!

The current implementation of hash_pandas_object does not meet collision resistance requirements, although this is known to the developers.

Can you provide support for this statement? You link to #16372, but I do not find anything supporting this there, only that there is currently a deficiency in hashing a particular dtype (which would be desirable to address).

Therefore, I propose adding a safe option with a default value set to True.

What would this do that is different from the current hashing method?

rhshadrach added the Needs Discussion, Needs Info, and hashing labels and removed the Needs Triage label on Dec 1, 2024
@cryptochecktool
Author

cryptochecktool commented Dec 2, 2024

It is evident that the function does not satisfy collision resistance: the demo I gave yields the same hash value for different data. Moreover, the algorithm used by the developer is copied from Python's hash, which does not meet this requirement either. This follows from the inherent mathematical properties of the algorithm.
I am not clear on how pandas uses this hash internally. However, Python's hash-based containers such as dict and set compare the built-in hash first and, only if the hashes match, go on to compare the elements themselves. Since equal hashes for unequal values are a very low probability event, this saves considerable computation time. On average, calling Python's built-in hash is about 10x faster than calculating SHA-256, which is very helpful for performance.
However, if an external interface is provided, users may not have a clear safety awareness for further handling of hash collisions, which could pose a security risk to users. Users who need performance can opt into the unsafe mode and handle collisions themselves in whatever way suits their environment.
If using a safe_hash, then one could serialize the data first and then generate a hash with sha256, which makes collisions impractical to construct and removes many security risks. Of course, to avoid breaking changes, the output format of the original algorithm may need to be preserved.
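
The difference between using a hash as a shortcut and using it as an identity can be sketched as follows. CollidingKey is a hypothetical class, introduced only for illustration, whose hash always collides. A dict still behaves correctly because it falls back to an equality check, while a cache keyed on the hash value alone silently returns the wrong entry.

```python
class CollidingKey:
    """Hypothetical key type whose hash always collides (illustration only)."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42  # every instance gets the same hash

    def __eq__(self, other):
        return isinstance(other, CollidingKey) and self.value == other.value


# dict lookup: the hash narrows the search, __eq__ makes the final decision,
# so colliding keys are still kept apart.
table = {CollidingKey(1): "first", CollidingKey(2): "second"}
print(len(table))              # 2
print(table[CollidingKey(2)])  # second

# Cache keyed on the hash value alone: there is no equality check,
# so a colliding key retrieves another key's entry.
cache = {hash(CollidingKey(1)): "first"}
stale = cache.get(hash(CollidingKey(2)))
print(stale)                   # first
```

The second pattern is the one at risk when a per-row hash is treated as a unique fingerprint without any follow-up comparison.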

@rhshadrach
Member

However, if an external interface is provided, users may not have a clear safety awareness for further handling of hash collisions, which could pose a security risk to users.

In general, relying on security from a non-cryptographic library that is written by experts is a security risk. I think pandas should not be in the business of promising security at all.

If using a safe_hash, then one could serialize the data first and then generate a hash with sha256

I'm personally open to including a safer version for hash_pandas_object. However I'm opposed to calling it "safe" or anything synonymous with that. In addition, I wonder if there are better options than serialization.

rhshadrach removed the Needs Info label on Dec 2, 2024
@cryptochecktool
Author

In cryptography, the input object for any hash function is a byte stream. The process of converting a data structure into bytes is called serialization. Specifically, you might consider using DataFrame.to_pickle(path, compression=None), as skipping compression could be faster. From a brief look at how this function is used on GitHub, I found that it is mostly called when importing datasets. In that case the function is called very infrequently.
However, I discovered that static-frame provides a solution that works well for generating hashes from DataFrame structures. You can find it here: https://github.com/static-frame/static-frame/blob/9eeaad79852b5b9ab86c8cc71404652ea8a6b0e6/doc/source/articles/hash.rst#creating-a-message-digest-from-a-dataframe

@rhshadrach
Member

rhshadrach commented Dec 3, 2024

In cryptography, the input object for any hash function is a byte stream. The process of converting a data structure into bytes is called serialization.

Serialization requires being able to reconstruct said object; producing a byte stream to apply a hashing algorithm to does not.
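
As a rough sketch of this distinction, the function below feeds column data to SHA-256 incrementally without going through pickle. The name digest_columns and the byte layout (length-prefixed column names followed by 64-bit integer values) are assumptions made for illustration; this is not an existing pandas API, and it handles only integer columns. The only requirement on the stream is that distinct inputs produce distinct bytes; no loader needs to exist.

```python
import hashlib
import struct


def digest_columns(columns: dict[str, list[int]]) -> str:
    """SHA-256 over a length-prefixed byte stream of integer columns."""
    h = hashlib.sha256()
    for name, values in columns.items():
        encoded = name.encode("utf-8")
        # Length prefixes keep field boundaries unambiguous.
        h.update(struct.pack("<Q", len(encoded)))
        h.update(encoded)
        h.update(struct.pack("<Q", len(values)))
        for value in values:
            h.update(struct.pack("<q", value))  # signed 64-bit, little-endian
    return h.hexdigest()


# The two datasets from the demo in the issue description.
data1 = {"A": [1604090909467468979, 2], "B": [4, 4]}
data2 = {"A": [1, 2], "B": [3, 4]}

print(digest_columns(data1) == digest_columns(data2))  # False
```

Column order affects the digest in this sketch, and a real implementation would also need to cover dtypes, the index, and missing values.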

@cryptochecktool
Author

OK, but I'm not an expert in performance optimization and don't have an effective way to handle this. I hope you can refer to the approach used by static-frame.
