ENH: Add a safe Option to hash_pandas_object with Default Value Set to True #60428
Comments
Thanks for the report!
Can you provide support for this statement? You link to #16372, but I do not find anything supporting this there, only that there is currently a deficiency in hashing a particular dtype (which would be desirable to address).
What would this do that is different from the current hashing method?
It clearly does not satisfy collision resistance. For example, I gave a demo in which two different inputs produce the same hash value. Moreover, the algorithm the developers use is adapted from Python's hash(), which also does not meet this requirement; this follows from its inherent mathematical properties.
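The claim about Python's built-in hash() can be illustrated directly (a minimal sketch; these specific collisions are CPython implementation details on 64-bit builds, not guaranteed by the language):

```python
# CPython's hash() is not collision resistant by design: distinct inputs
# can share a hash value.

# -1 is reserved as an error code in the C API, so hash(-1) maps to -2,
# colliding with hash(-2).
print(hash(-1) == hash(-2))        # True

# Integer hashing is reduction modulo the Mersenne prime 2**61 - 1 on
# 64-bit builds, so values that differ by the modulus collide.
print(hash(2**61 - 1) == hash(0))  # True
```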
In general, relying on security from a non-cryptographic library that is written by experts is a security risk. I think pandas should not be in the business of promising security at all.
I'm personally open to including a safer version.
In cryptography, the input to any hash function is a byte stream; the process of converting a data structure into bytes is called serialization. Specifically, you might consider using DataFrame.to_pickle(compression=None), as this could be faster. In my brief survey of uses of this function on GitHub, I found that it is most often used when importing datasets, and in that setting it is called very infrequently.
Serialization requires being able to reconstruct said object; producing a byte stream to apply a hashing algorithm to does not.
OK, but I'm not an expert in performance optimization and don't have effective methods for handling that. I suggest looking at how static-frame approaches it.
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
The current implementation of hash_pandas_object does not meet collision resistance requirements. This is known to the developers, but it is not prominently documented, and the function is already widely used in many downstream AI platforms, such as MLflow, AutoGluon, and others. These platforms use hash_pandas_object to convert DataFrame contents to hashes and then apply MD5 or SHA-256 for uniqueness checks, enabling caching and related functionality. This makes these platforms more vulnerable to malicious datasets.
Therefore, I propose adding a safe option with a default value set to True. This would directly benefit the security of a large number of downstream applications. If not, the documentation should explicitly state that the function does not provide collision resistance and should not be used for caching or similar tasks.
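For context, the downstream pattern described above looks roughly like this (a hedged sketch, not the exact code of MLflow or AutoGluon; `dataset_digest` is a hypothetical helper name):

```python
import hashlib

import pandas as pd


def dataset_digest(df: pd.DataFrame) -> str:
    # Per-row uint64 hashes from hash_pandas_object, then a cryptographic
    # digest over their bytes. The weak link is the first step: if two
    # different frames yield the same row hashes, the final digests collide
    # no matter how strong the outer hash is.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.sha256(row_hashes.to_numpy().tobytes()).hexdigest()


df = pd.DataFrame({"a": [1, 2, 3]})
print(dataset_digest(df))  # a 64-character hex digest
```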
Feature Description
Alternative Solutions
Alternatively, if users need to modify the function themselves, they can use to_pickle() to serialize the DataFrame before hashing.
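A minimal sketch of that alternative, assuming pickle plus hashlib (note that pickle output is only stable within a pinned environment, so digests should not be compared across pandas or pickle-protocol versions):

```python
import hashlib
import pickle

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Serialize the whole object to a byte stream, then apply a cryptographic
# hash. Collision resistance now rests on SHA-256 instead of on
# hash_pandas_object's non-cryptographic mixing.
digest = hashlib.sha256(pickle.dumps(df)).hexdigest()
print(digest)  # 64-character hex digest
```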
Additional Context
autogluon code:
https://github.com/autogluon/autogluon/blob/082d8bae7343f02e9dc9ce3db76bc3f305027b10/common/src/autogluon/common/utils/utils.py#L176
mlflow code at:
https://github.com/mlflow/mlflow/blob/615c4cbafd616e818ff17bfcd964e8366a5cd3ed/mlflow/data/digest_utils.py#L39
graphistry code at:
https://github.com/graphistry/pygraphistry/blob/52ea49afbea55291c41962f79a90d74d76c721b9/graphistry/util.py#L84
Developer discussion on pandas functionality: #16372 (comment)
Documentation link for hash_pandas_object: https://pandas.pydata.org/docs/reference/api/pandas.util.hash_pandas_object.html

One demo: