Skip to content

CLN: Verify non-ndarray is an ExtensionArray in hash_array GH42003 #42008

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions pandas/core/util/hashing.py
Original file line number Diff line number Diff line change
Expand Up @@ -286,6 +286,8 @@ def hash_array(
return _hash_categorical(vals, encoding, hash_key)
elif not isinstance(vals, np.ndarray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would rather explictily check for an extension array here.

also pls add a test for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already

    if not hasattr(vals, "dtype"):
        raise TypeError("must pass a ndarray-like")

at the top of the function for validation. Could maybe change that to check for EA/np.ndarray explicitly and change the message to be more user friendly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep agreed

Copy link
Contributor Author

@TheNeuralBit TheNeuralBit Jun 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

importing ExtensionArray so I can use it in an isinstance check is resulting in a circular import:

❯ python -m pytest pandas/tests/util/test_hashing.py 
ImportError while loading conftest '.../pandas/pandas/conftest.py'.
pandas/__init__.py:22: in <module>
    from pandas.compat import (
pandas/compat/__init__.py:15: in <module>
    from pandas.compat.numpy import (
pandas/compat/numpy/__init__.py:7: in <module>
    from pandas.util.version import Version
pandas/util/__init__.py:7: in <module>
    from pandas.core.util.hashing import (  # noqa
pandas/core/util/hashing.py:23: in <module>
    from pandas.core.arrays.base import ExtensionArray
pandas/core/arrays/__init__.py:1: in <module>
    from pandas.core.arrays.base import (
pandas/core/arrays/base.py:32: in <module>
    from pandas.compat import set_function_name
E   ImportError: cannot import name 'set_function_name' from partially initialized module 'pandas.compat' (most likely due to a circular import) (.../pandas/pandas/compat/__init__.py)

Any advice on how to resolve this? I could just move the import inside hash_array, but not sure if that's consistent with pandas coding conventions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a is_array_like in pandas/core/dtypes/inference.py that also includes Index and Series so is not suitable here.

Most public functions that accept an numpy array/EA probably accept a array-like even if not documented, like this one used to.

It is mainly internal functions where we would be more strict on the types accepted.

So we could maybe add a is_array_like_strict but we would normally already have validated the arguments that are passed internally so may not have much utility outside this public method which is more restrictive than most.

For now, I think OK to move the imports inside hash_array until we know we want to do this type of check elsewhere.

# i.e. ExtensionArray
if not hasattr(vals, '_values_for_factorize'):
raise TypeError("must pass ndarray or ExtensionArray")
vals, _ = vals._values_for_factorize()

return _hash_ndarray(vals, encoding, hash_key, categorize)
Expand Down