-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: EA._hash_pandas_object #51319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: EA._hash_pandas_object #51319
Changes from all commits
089aa0c
aa7729a
aad544c
318adcf
b024c7d
9a7d669
ae3f0cb
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -140,6 +140,7 @@ class ExtensionArray: | |
_from_factorized | ||
_from_sequence | ||
_from_sequence_of_strings | ||
_hash_pandas_object | ||
_reduce | ||
_values_for_argsort | ||
_values_for_factorize | ||
|
@@ -1002,11 +1003,6 @@ def _values_for_factorize(self) -> tuple[np.ndarray, Any]: | |
as NA in the factorization routines, so it will be coded as | ||
`-1` and not included in `uniques`. By default, | ||
``np.nan`` is used. | ||
|
||
Notes | ||
----- | ||
The values returned by this method are also used in | ||
:func:`pandas.util.hash_pandas_object`. | ||
""" | ||
return self.astype(object), np.nan | ||
|
||
|
@@ -1452,6 +1448,31 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs): | |
# Non-Optimized Default Methods; in the case of the private methods here, | ||
# these are not guaranteed to be stable across pandas versions. | ||
|
||
def _hash_pandas_object( | ||
self, *, encoding: str, hash_key: str, categorize: bool | ||
) -> npt.NDArray[np.uint64]: | ||
""" | ||
Hook for hash_pandas_object. | ||
|
||
Default is likely non-performant. | ||
|
||
Parameters | ||
---------- | ||
encoding : str | ||
hash_key : str | ||
categorize : bool | ||
|
||
Returns | ||
------- | ||
np.ndarray[uint64] | ||
""" | ||
from pandas.core.util.hashing import hash_array | ||
|
||
values = self.to_numpy(copy=False) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jbrockmendel is there a reason you didn't use In addition, if we want to allow EAs to override this, we should either document what the keyword arguments mean, and/or we should expose There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
IIRC it was some combination of _values_for_factorize not being required and trying to respect the _vfc shouldn't be used for anything but factorize policy.
Documenting makes sense. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I have said that before, but I don't see any problem with using that for both, as we have always done. They are very related, and typically values suitable for factorization will also be suitable for hashing, given that factorization is hash-based. And now with this new method, EAs can still override it if they want to use something else than We actually have a bunch of internal doc comments indicating that For external EAs that relied on this, this is a regression that we no longer use this by default (e.g. for geopandas this will now fail in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It was actually documented, which was removed here. |
||
return hash_array( | ||
values, encoding=encoding, hash_key=hash_key, categorize=categorize | ||
) | ||
|
||
def tolist(self) -> list: | ||
""" | ||
Return a list of the values. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -240,6 +240,10 @@ class TestReduce(base.BaseNoReduceTests): | |
|
||
|
||
class TestMethods(BaseJSON, base.BaseMethodsTests): | ||
@pytest.mark.xfail(reason="ValueError: setting an array element with a sequence") | ||
def test_hash_pandas_object(self, data): | ||
super().test_hash_pandas_object(data) | ||
Comment on lines
+243
to
+245
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. could you expand on this please? why are we adding it, why does it fail, is it expected that it should pass? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. JSONArray is kind of semi-abandoned, about 20% of the existing tests in this file are xfailed. So it is not "expected" in that I do not expect it to pass ever, but it is "expected" in that if anyone ever wanted to make it usable they would want to make this test pass. |
||
|
||
@unhashable | ||
def test_value_counts(self, all_data, dropna): | ||
super().test_value_counts(all_data, dropna) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something else: you added this method in the "Non-Optimized Default Methods" section, which says that private methods in this section are not guaranteed to be stable (in contrast to well established methods with an underscore that are meant to be overridden like
_values_for_factorize
,_from_sequence
, etc, which are defined more above).Is that indeed the intention? (in that case it should not be mentioned in the documentation that this is a method to override? Currently it is added to the docstring higher up in this file)
(If we don't want that external EAs override this, that's a another reason we should certainly keep using
_values_for_factorize
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that what underscores are supposed to mean? I'm not eager to revive the EA-namespace-naming-scheme discussion, but that would fit well there.
IIRC the main motivation was allowing for a performant implementation for ArrowExtensionArray (without special-casing internally ArrowEA internally)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's what the documented methods (that happen to have a underscore or not) mean: we currently document which methods can (or have to / are meant to) be overriden by EA implementations. And you added
_hash_pandas_object
to the documentation.For example
_from_sequence
or_reduce
have an underscore, but for sure external EAs can rely on implementing those, regardless of the underscore.I am not sure which discussion needs to be revived (unless you don't agree that some of the methods with an underscore can be overridden by external EAs?)
Yes, and that's fine I think. But in general when adding methods to the EA base class, we need to be conscious about whether we see such a method as something external EA authors can also override or not. Because if we explicitly are OK with external EA authors using it, we need to give it some higher level of stability.
We could also consider adding methods to our own classes and not to the base class, or have a some pandas
BaseArray(ExtensionArray)
class that does a bunch of extra things we only want to do for our own ones, where we can be more flexible and don't have to worry about backwards compatibility.That of course has the disadvantage of creating a split (and having to check, whenever we call that, if the methods exists and have some fallback), but could be easier for certain optimizations where we don't directly want to promise stability / expose this to external EAs.
Speaking as such an external EA author, I think I would be fine with this (and when there is request, we could "promote" a method to the official public EA interface). As a pandas maintainer, it might give some more complexity to know if some method is in which category.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm specifically not interested in reviving the naming convention discussion.
I assume anything in the EA namespace is liable to be overridden by subclass authors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason of my original comment here is because we have the following comment in the base EA class (one that you added I think):
pandas/pandas/core/arrays/base.py
Lines 1447 to 1450 in d06f2d3
So while you say that you think anything in the EA namespace can be overridden by EA authors, the above comment at least seems to indicate there are some methods for which we guarantee stability, while for others it's at your own risk (to potentially keep it compatible with every new minor version).
So essentially all that my original comment wanted to say is: if we want that EA authors need to implement
_hash_object_array
(as you also suggest in #53501), we should move that method a bit higher up in this file to be above that comment about private methods not being stable.