ENH: EA._hash_pandas_object #51319

Merged (7 commits), Mar 1, 2023
1 change: 1 addition & 0 deletions ci/code_checks.sh
@@ -512,6 +512,7 @@ if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then
pandas.api.extensions.ExtensionArray._from_factorized \
pandas.api.extensions.ExtensionArray._from_sequence \
pandas.api.extensions.ExtensionArray._from_sequence_of_strings \
pandas.api.extensions.ExtensionArray._hash_pandas_object \
pandas.api.extensions.ExtensionArray._reduce \
pandas.api.extensions.ExtensionArray._values_for_argsort \
pandas.api.extensions.ExtensionArray._values_for_factorize \
1 change: 1 addition & 0 deletions doc/source/reference/extensions.rst
@@ -38,6 +38,7 @@ objects.
api.extensions.ExtensionArray._from_factorized
api.extensions.ExtensionArray._from_sequence
api.extensions.ExtensionArray._from_sequence_of_strings
api.extensions.ExtensionArray._hash_pandas_object
api.extensions.ExtensionArray._reduce
api.extensions.ExtensionArray._values_for_argsort
api.extensions.ExtensionArray._values_for_factorize
10 changes: 10 additions & 0 deletions pandas/core/arrays/_mixins.py
@@ -192,6 +192,16 @@ def _values_for_argsort(self) -> np.ndarray:
def _values_for_factorize(self):
return self._ndarray, self._internal_fill_value

def _hash_pandas_object(
self, *, encoding: str, hash_key: str, categorize: bool
) -> npt.NDArray[np.uint64]:
from pandas.core.util.hashing import hash_array

values = self._ndarray
return hash_array(
values, encoding=encoding, hash_key=hash_key, categorize=categorize
)

# Signature of "argmin" incompatible with supertype "ExtensionArray"
def argmin(self, axis: AxisInt = 0, skipna: bool = True): # type: ignore[override]
# override base class by adding axis keyword
31 changes: 26 additions & 5 deletions pandas/core/arrays/base.py
@@ -140,6 +140,7 @@ class ExtensionArray:
_from_factorized
_from_sequence
_from_sequence_of_strings
_hash_pandas_object
_reduce
_values_for_argsort
_values_for_factorize
@@ -1002,11 +1003,6 @@ def _values_for_factorize(self) -> tuple[np.ndarray, Any]:
as NA in the factorization routines, so it will be coded as
`-1` and not included in `uniques`. By default,
``np.nan`` is used.

Notes
-----
The values returned by this method are also used in
:func:`pandas.util.hash_pandas_object`.
"""
return self.astype(object), np.nan

@@ -1452,6 +1448,31 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs):
# Non-Optimized Default Methods; in the case of the private methods here,
# these are not guaranteed to be stable across pandas versions.

def _hash_pandas_object(
[Inline review thread on this line]

Member:

Something else: you added this method in the "Non-Optimized Default Methods" section, which says that private methods in this section are not guaranteed to be stable (in contrast to well-established underscore methods that are meant to be overridden, like _values_for_factorize, _from_sequence, etc., which are defined further up).

Is that indeed the intention? In that case it should not be mentioned in the documentation as a method to override (currently it is added to the docstring higher up in this file). And if we don't want external EAs to override this, that's another reason we should certainly keep using _values_for_factorize.

Member Author:

> in contrast to well established methods with an underscore

Is that what underscores are supposed to mean? I'm not eager to revive the EA-namespace-naming-scheme discussion, but that would fit well there.

IIRC the main motivation was allowing for a performant implementation for ArrowExtensionArray (without special-casing ArrowEA internally).

Member (@jorisvandenbossche, Jun 19, 2023):

> Is that what underscores are supposed to mean? I'm not eager to revive the EA-namespace-naming-scheme discussion, but that would fit well there.

It's what the documented methods (whether or not they happen to have an underscore) mean: we currently document which methods can (or have to / are meant to) be overridden by EA implementations. And you added _hash_pandas_object to that documentation.

For example, _from_sequence and _reduce have an underscore, but external EAs can certainly rely on implementing those, regardless of the underscore.

I am not sure which discussion needs to be revived (unless you don't agree that some of the underscore methods can be overridden by external EAs?).

> IIRC the main motivation was allowing for a performant implementation for ArrowExtensionArray (without special-casing ArrowEA internally)

Yes, and that's fine I think. But in general, when adding methods to the EA base class, we need to be conscious about whether we see such a method as something external EA authors can also override. If we are explicitly OK with external EA authors using it, we need to give it some higher level of stability.

We could also consider adding methods to our own classes and not to the base class, or have some pandas BaseArray(ExtensionArray) class that does a bunch of extra things we only want for our own arrays, where we can be more flexible and don't have to worry about backwards compatibility. That of course has the disadvantage of creating a split (and having to check, whenever we call such a method, whether it exists, with some fallback), but it could be easier for certain optimizations where we don't directly want to promise stability or expose the method to external EAs. Speaking as such an external EA author, I think I would be fine with this (and when there is demand, we could "promote" a method to the official public EA interface). As a pandas maintainer, it might add some complexity to keep track of which category a method is in.

Member Author:

> I am not sure which discussion needs to be revived (unless you don't agree that some of the methods with an underscore can be overridden by external EAs?)

I'm specifically not interested in reviving the naming-convention discussion.

I assume anything in the EA namespace is liable to be overridden by subclass authors.

Member:

> I assume anything in the EA namespace is liable to be overridden by subclass authors.

The reason for my original comment is that we have the following comment in the base EA class (one that you added, I think):

    # ------------------------------------------------------------------------
    # Non-Optimized Default Methods; in the case of the private methods here,
    # these are not guaranteed to be stable across pandas versions.

So while you say you assume anything in the EA namespace can be overridden by EA authors, the comment above at least seems to indicate that there are some methods for which we guarantee stability, while for others it's at your own risk (to keep them compatible with every new minor version).

So essentially all my original comment wanted to say is: if we want EA authors to implement _hash_pandas_object (as you also suggest in #53501), we should move the method a bit higher up in this file, above that comment about private methods not being stable.

[End of thread; the method continues below]

self, *, encoding: str, hash_key: str, categorize: bool
) -> npt.NDArray[np.uint64]:
"""
Hook for hash_pandas_object.

Default is likely non-performant.

Parameters
----------
encoding : str
hash_key : str
categorize : bool

Returns
-------
np.ndarray[uint64]
"""
from pandas.core.util.hashing import hash_array

values = self.to_numpy(copy=False)
[Inline review thread on this line]

Member:

@jbrockmendel is there a reason you didn't use self._values_for_factorize() as the values to hash (as the default), which would preserve the current behaviour?

In addition, if we want to allow EAs to override this, we should either document what the keyword arguments mean, and/or expose hash_array publicly (e.g. in pandas.api.extensions)?

Member Author:

> is there a reason you didn't use self._values_for_factorize()

IIRC it was some combination of _values_for_factorize not being required and trying to respect the policy that _values_for_factorize shouldn't be used for anything but factorize.

> In addition, if we want to allow EAs to override this, we should either document what the keyword arguments mean, and/or expose hash_array publicly (e.g. in pandas.api.extensions)?

Documenting makes sense.

Member:

> IIRC it was some combination of _values_for_factorize not being required and trying to respect the policy that _values_for_factorize shouldn't be used for anything but factorize.

I have said this before, but I don't see any problem with using it for both, as we have always done. They are closely related, and values suitable for factorization will typically also be suitable for hashing, given that factorization is hash-based. And now, with this new method, EAs can still override it if they want to use something other than _values_for_factorize.

We actually have a bunch of internal doc comments indicating that _values_for_factorize is also used for hash_pandas_object. And we can add this to the actual _values_for_factorize docstring in the base class to make that use case explicit.

For external EAs that relied on this, it is a regression that we no longer use it by default (e.g. for geopandas this will now fail in hash_object_array and fall back to casting to string before trying again; but the string representation is not that faithful, and can thus give wrong hashes).

Member:

> And we can add this to the actual _values_for_factorize docstring in the base class to make this use case explicit.

It was actually documented; that note is removed in this PR.


return hash_array(
values, encoding=encoding, hash_key=hash_key, categorize=categorize
)

def tolist(self) -> list:
"""
Return a list of the values.
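The default hook added to the base class just hashes `self.to_numpy(copy=False)` with `hash_array`. A minimal sketch of what that amounts to, calling `hash_array` directly on an object ndarray (the hash key shown is pandas' documented default; `pandas.core.util.hashing` is an internal module, as noted in the thread above):

```python
import numpy as np

from pandas.core.util.hashing import hash_array

# Hash an object ndarray the way the default hook would after to_numpy().
vals = np.array(["a", "b", "a"], dtype=object)
out = hash_array(vals, encoding="utf8", hash_key="0123456789123456", categorize=False)

# The result is a uint64 array, one hash per element; equal inputs
# produce equal hashes, which is what hash-based operations rely on.
```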
43 changes: 43 additions & 0 deletions pandas/core/arrays/categorical.py
@@ -1762,6 +1762,49 @@ def _values_for_rank(self):
)
return values

def _hash_pandas_object(
self, *, encoding: str, hash_key: str, categorize: bool
) -> npt.NDArray[np.uint64]:
"""
Hash a Categorical by hashing its categories, and then mapping the codes
to the hashes.

Parameters
----------
encoding : str
hash_key : str
categorize : bool
Ignored for Categorical.

Returns
-------
np.ndarray[uint64]
"""
# Note we ignore categorize, as we are already Categorical.
from pandas.core.util.hashing import hash_array

# Convert ExtensionArrays to ndarrays
values = np.asarray(self.categories._values)
hashed = hash_array(values, encoding, hash_key, categorize=False)

# we have uint64, as we don't directly support missing values
# we don't want to use take_nd which will coerce to float
# instead, directly construct the result with a
# max(np.uint64) as the missing value indicator
#
# TODO: GH#15362

mask = self.isna()
if len(hashed):
result = hashed.take(self._codes)
else:
result = np.zeros(len(mask), dtype="uint64")

if mask.any():
result[mask] = lib.u8max

return result

# ------------------------------------------------------------------
# NDArrayBackedExtensionArray compat

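The hash-the-categories-then-take-the-codes trick used in the Categorical method above can be sketched standalone with numpy. This is illustrative only: `categories` and `codes` mirror the Categorical internals, and the missing-value sentinel is max uint64 (`lib.u8max`), as in the method:

```python
import numpy as np

from pandas.core.util.hashing import hash_array

categories = np.array(["a", "b"], dtype=object)
codes = np.array([0, 1, 0, -1])  # -1 marks a missing value

# Hash only the (small) categories array, then broadcast via the codes;
# this avoids hashing every element individually.
hashed_cats = hash_array(
    categories, encoding="utf8", hash_key="0123456789123456", categorize=False
)
result = hashed_cats.take(codes)  # -1 wraps around; overwritten below
result[codes == -1] = np.uint64(np.iinfo(np.uint64).max)  # missing sentinel
```

The payoff is that the expensive string hashing runs once per category rather than once per row, which is the performance rationale behind delegating to Categorical in the first place.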
69 changes: 10 additions & 59 deletions pandas/core/util/hashing.py
@@ -9,22 +9,17 @@
Hashable,
Iterable,
Iterator,
cast,
)

import numpy as np

from pandas._libs import lib
from pandas._libs.hashing import hash_object_array
from pandas._typing import (
ArrayLike,
npt,
)

from pandas.core.dtypes.common import (
is_categorical_dtype,
is_list_like,
)
from pandas.core.dtypes.common import is_list_like
from pandas.core.dtypes.generic import (
ABCDataFrame,
ABCExtensionArray,
@@ -35,7 +30,6 @@

if TYPE_CHECKING:
from pandas import (
Categorical,
DataFrame,
Index,
MultiIndex,
@@ -214,53 +208,14 @@ def hash_tuples(

# hash the list-of-ndarrays
hashes = (
_hash_categorical(cat, encoding=encoding, hash_key=hash_key) for cat in cat_vals
cat._hash_pandas_object(encoding=encoding, hash_key=hash_key, categorize=False)
for cat in cat_vals
)
h = combine_hash_arrays(hashes, len(cat_vals))

return h


def _hash_categorical(
cat: Categorical, encoding: str, hash_key: str
) -> npt.NDArray[np.uint64]:
"""
Hash a Categorical by hashing its categories, and then mapping the codes
to the hashes

Parameters
----------
cat : Categorical
encoding : str
hash_key : str

Returns
-------
ndarray[np.uint64] of hashed values, same size as len(c)
"""
# Convert ExtensionArrays to ndarrays
values = np.asarray(cat.categories._values)
hashed = hash_array(values, encoding, hash_key, categorize=False)

# we have uint64, as we don't directly support missing values
# we don't want to use take_nd which will coerce to float
# instead, directly construct the result with a
# max(np.uint64) as the missing value indicator
#
# TODO: GH 15362

mask = cat.isna()
if len(hashed):
result = hashed.take(cat.codes)
else:
result = np.zeros(len(mask), dtype="uint64")

if mask.any():
result[mask] = lib.u8max

return result


def hash_array(
vals: ArrayLike,
encoding: str = "utf8",
@@ -288,17 +243,11 @@
"""
if not hasattr(vals, "dtype"):
raise TypeError("must pass a ndarray-like")
dtype = vals.dtype

# For categoricals, we hash the categories, then remap the codes to the
# hash values. (This check is above the complex check so that we don't ask
# numpy if categorical is a subdtype of complex, as it will choke).
if is_categorical_dtype(dtype):
vals = cast("Categorical", vals)
return _hash_categorical(vals, encoding, hash_key)

elif isinstance(vals, ABCExtensionArray):
vals, _ = vals._values_for_factorize()
if isinstance(vals, ABCExtensionArray):
return vals._hash_pandas_object(
encoding=encoding, hash_key=hash_key, categorize=categorize
)

elif not isinstance(vals, np.ndarray):
# GH#42003
@@ -347,7 +296,9 @@ def _hash_ndarray(

codes, categories = factorize(vals, sort=False)
cat = Categorical(codes, Index(categories), ordered=False, fastpath=True)
return _hash_categorical(cat, encoding, hash_key)
return cat._hash_pandas_object(
encoding=encoding, hash_key=hash_key, categorize=False
)

try:
vals = hash_object_array(vals, hash_key, encoding)
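After this change, `hash_array` no longer special-cases categoricals: any ExtensionArray input is routed through its own `_hash_pandas_object` hook. A quick sketch with the nullable Int64 array (the concrete hash values are implementation details and not asserted here):

```python
import numpy as np
import pandas as pd

from pandas.core.util.hashing import hash_array

arr = pd.array([1, 2, None], dtype="Int64")  # an ExtensionArray
h = hash_array(arr)  # after this PR, delegates to arr._hash_pandas_object(...)

# Regardless of the EA's dtype, the contract is a uint64 ndarray with
# one hash per element, including for missing values.
```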
11 changes: 11 additions & 0 deletions pandas/tests/extension/base/methods.py
@@ -18,6 +18,17 @@
class BaseMethodsTests(BaseExtensionTests):
"""Various Series and DataFrame methods."""

def test_hash_pandas_object(self, data):
# _hash_pandas_object should return a uint64 ndarray of the same length
# as the data
res = data._hash_pandas_object(
encoding="utf-8",
hash_key=pd.core.util.hashing._default_hash_key,
categorize=False,
)
assert res.dtype == np.uint64
assert res.shape == data.shape

def test_value_counts_default_dropna(self, data):
# make sure we have consistent default dropna kwarg
if not hasattr(data, "value_counts"):
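The new base test exercises the hook directly; end users reach it through the public `pandas.util.hash_pandas_object`, for example:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a"])

# index=False hashes only the values, so identical values in different
# rows get identical hashes; the result is a uint64 Series.
h = pd.util.hash_pandas_object(s, index=False)
```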
4 changes: 4 additions & 0 deletions pandas/tests/extension/json/test_json.py
@@ -240,6 +240,10 @@ class TestReduce(base.BaseNoReduceTests):


class TestMethods(BaseJSON, base.BaseMethodsTests):
@pytest.mark.xfail(reason="ValueError: setting an array element with a sequence")
def test_hash_pandas_object(self, data):
super().test_hash_pandas_object(data)
[Inline review thread on lines +243 to +245]

Member:

Could you expand on this please? Why are we adding it, why does it fail, and is it expected that it should pass?

Member Author:

JSONArray is kind of semi-abandoned; about 20% of the existing tests in this file are xfailed. So it is not "expected" in that I do not expect it to ever pass, but it is "expected" in that anyone who wanted to make JSONArray usable would want to make this test pass.



@unhashable
def test_value_counts(self, all_data, dropna):
super().test_value_counts(all_data, dropna)