-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: .equals for Extension Arrays #30652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
thanks for your contribution! @dwhu in general here, if it is Good luck! and hope this helps |
@charlesdong1991 - Sounds good. Thanks for the feedback. Wanted to make sure I wasn't missing something or breaking a behavior contract by adding this behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think most of the __eq__
implementations you've added can be removed.
@jschendel is in the middle of a PR fixing IntervalArray.__eq__
, so ignore that for now if it's causing failures.
We'll need some base tests in pandas/tests/extensions/base/ops.py
ensuring that __eq__
and __ne__
are implemented correctly. We'll want a few tests in tests/extensions/base/methods.py
ensuring that .equals
is tested. These tests are a bit tricky to write, so let us know if you have issues.
pandas/core/arrays/base.py
Outdated
bool | ||
""" | ||
|
||
raise AbstractMethodError(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be return ~(self == other)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/arrays/base.py
Outdated
Whether the arrays are equivalent. | ||
|
||
""" | ||
return ((self == other) | (self.isna() == other.isna())).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this also needs to ensure that the type of other
matches the type of self
. We don't want pd.array([1, 2]).equals(np.array([1, 2]))
to be True (I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/arrays/boolean.py
Outdated
if not isinstance(other, BooleanArray): | ||
return NotImplemented | ||
return ( | ||
hasattr(other, "_data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BooleanArray should already have a __eq__
method (see create_comparison_method
Was this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed. Missed that in my WIP PR.
Adding __eq__ to ExtensionArray Abstract method doc string. Adding ne implementation to EA base class. Also removing other implementations. Updating EA equals method and adding tests. pandas-devGH-27081
pandas/core/arrays/base.py
Outdated
@@ -350,6 +352,38 @@ def __iter__(self): | |||
for i in range(len(self)): | |||
yield self[i] | |||
|
|||
def __eq__(self, other: ABCExtensionArray) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type of other
should be Any I think. Passing a list / etc here is valid, it just returns False.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to update the type in the docstring too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/arrays/base.py
Outdated
|
||
raise AbstractMethodError(self) | ||
|
||
def __ne__(self, other: ABCExtensionArray) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about the type on other
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/arrays/base.py
Outdated
bool | ||
""" | ||
|
||
raise AbstractMethodError(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, does this need to be abstract? Or can we define it as something like
# check that the types, length match, etc.
# check that NA values in self match NA values in other...
return np.all(np.array(self) == np.array(other))
Then subclasses can override with implementations that may be faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. I like that much better.
pandas/core/arrays/base.py
Outdated
Whether the arrays are equivalent. | ||
|
||
""" | ||
return isinstance(other, self.__class__) and ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps type(self) == type(other)
? It's not clear to me whether a subclass should compare .equals
to a parent class. Thoughts @jreback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will want some length checking here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added type(self) == type(other)
and length checks.
pandas/core/arrays/integer.py
Outdated
def __eq__(self, other): | ||
return ( | ||
isinstance(other, IntegerArray) | ||
and hasattr(other, "_data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have IntegerArray.__eq__
on master. This shouldn't be necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
ser = pd.Series(cls._from_sequence(data, dtype=data.dtype)) | ||
na_ser = pd.Series(cls._from_sequence([na_value], dtype=data.dtype)) | ||
|
||
assert data.equals(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make all these assert <condition> is True
, to ensure that True
or False
is actually returned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into a funny issue doing this: numpy/numpy#5942
I'm going to keep this as is now to keep the test consistent within itself.
.equals()
returns a np.bool_
type. As result when you compare it to bool type True
using is
(i.e. some_np_bool_True_val is True == False
Coercing np.bool_
to bool
using the bool()
constructor did not work either. == True
causes PEP8 errors asking me to use assert <condition>
or assert <condition> is True
.
Open to suggestions if you have any.
assert not na_ser.equals(ser) | ||
assert not ser.equals(na_ser) | ||
assert not ser.equals(0) | ||
assert not na_ser.equals(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add test for other inputs like an ndarray.
Add a test for unequal length things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/tests/extension/base/ops.py
Outdated
@@ -158,13 +156,3 @@ def test_compare_array(self, data, all_compare_operators): | |||
s = pd.Series(data) | |||
other = pd.Series([data[0]] * len(data)) | |||
self._compare_other(s, data, op_name, other) | |||
|
|||
def test_direct_arith_with_series_returns_not_implemented(self, data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undo this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the other comment. The whole PR is intended to break this test since we implement __eq__
and __ne__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the other comment. The whole PR is intended to break this test since we implement eq and ne.
The reason this needs to be kept is that, although __eq__
is a required method on an ExtensionArray now, we still require it to return NotImplemented when other
is a Series or Index.
@dwhu looks like CI is red - can you take a look and fix up the merge conflict? |
Closing as I think this has gone stale but ping if you'd like to pick back up. Thanks! |
@dwhu I am going to look into further updating this, if that's OK with you |
OK, I updated this with the following things:
Should be ready for a new review now @TomAugspurger @jbrockmendel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice. Can you add a release note?
@@ -333,6 +335,18 @@ def __iter__(self): | |||
for i in range(len(self)): | |||
yield self[i] | |||
|
|||
def __eq__(self, other: Any) -> Union[np.ndarray, "ExtensionArray"]: | |||
""" | |||
Return for `self == other` (element-wise equality). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have some guidance here that if other
is a Series then you should return NotImplemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could use ops.common.unpack_zerodim_and_defer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could use ops.common.unpack_zerodim_and_defer
This method is to be implemented by EA authors, so those can't use that helper (unless we expose somewhere a public version of this).
(we could of course use that for our own EAs, but this PR is not changing any existing __eq__
implementation at the moment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have some guidance here that if other is a Series then you should return NotImplemented.
Added a comment about that
if not type(self) == type(other): | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should verify that this is the behavior we want. Namely
- other array-likes are not equivalent, even if they are all equal.
- subclasses are not equivalent, even if they are all equal.
The first seems fine. Not sure about the second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was planning to open an issue about this after this PR (and keep it strict here), because this is right now a bit inconsistent within pandas, and might require a more general discussion / clean-up (eg Series.equals is more strict (requires same dtype) than Index.equals ...)
But we can certainly also have the discussion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to be strict for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it even more strict (same dtype, not just same class), and added a test for that.
Will open an issue for the general discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened an issue for this at #33940
In the end, it seems mainly to come to whether the dtype should exactly be equal or not.
Since for EAs, the dtype is right now tied to the array class, using equal dtype for now also implies the same class (no additional check to allow sublcasses).
@@ -421,3 +421,29 @@ def test_repeat_raises(self, data, repeats, kwargs, error, msg, use_numpy): | |||
np.repeat(data, repeats, **kwargs) | |||
else: | |||
data.repeat(repeats, **kwargs) | |||
|
|||
def test_equals(self, data, na_value, as_series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also include DataFrame
in this parameterization, or do it somewhere else? Just to ensure coverage of the changes to blocks.py
.
pandas/core/arrays/base.py
Outdated
""" | ||
raise AbstractMethodError(self) | ||
|
||
def __ne__(self, other: Any) -> Union[np.ndarray, "ExtensionArray"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should allow us to remove the equivalent IntervalArray
definition here:
pandas/pandas/core/arrays/interval.py
Lines 609 to 610 in d149f41
def __ne__(self, other): | |
return ~self.__eq__(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Union[np.ndarray, "ExtensionArray"]
-> ArrayLike
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jschendel done!
# boolean array with NA -> fill with False | ||
equal_values = equal_values.fillna(False) | ||
equal_na = self.isna() & other.isna() | ||
return (equal_values | equal_na).all().item() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the .item()
necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the .item() necessary?
It's to have a python bool, instead of a numpy bool, as result.
I added a comment to the tests to make it explicit this is the reason we are asserting with is True/False
(and later on we don't inadvertedly "clean" that up)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for clarifying. i was recently reminded that np.array(np.timedelta64(1234, "ns")).item()
gives an int instead of timedelta64, so im now cautious around .item()
@simonjayhawkins how do I solve this typing error?
So our annotation of |
Any more comments on this PR? |
will look soon |
pandas/core/arrays/base.py
Outdated
@@ -335,7 +335,7 @@ def __iter__(self): | |||
for i in range(len(self)): | |||
yield self[i] | |||
|
|||
def __eq__(self, other: Any) -> ArrayLike: | |||
def __eq__(self, other: Any) -> ArrayLike: # type: ignore[override] # NOQA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im not familiar with the ignore[overrride] # NOQA
pattern. do we use that elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I posted the mypy issue, but apparently forgot.
So the problem here is that mypy expects a "bool" return for __eq__
since the base object
is typed that way in the stubs. In python/mypy#2783, the recommended way to solve this is the above with # type: ignore[override]
.
But flake8 doesn't like that, hence also the #NOQA (PyCQA/pyflakes#475).
This was already done in another PR as well:
Line 5062 in 7aa710a
def sort_values( # type: ignore[override] # NOQA # issue 27237 |
@jreback you wanted to take a look here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. a couple of minor comment.s
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -150,6 +150,7 @@ Other enhancements | |||
such as ``dict`` and ``list``, mirroring the behavior of :meth:`DataFrame.update` (:issue:`33215`) | |||
- :meth:`~pandas.core.groupby.GroupBy.transform` and :meth:`~pandas.core.groupby.GroupBy.aggregate` has gained ``engine`` and ``engine_kwargs`` arguments that supports executing functions with ``Numba`` (:issue:`32854`, :issue:`33388`) | |||
- :meth:`~pandas.core.resample.Resampler.interpolate` now supports SciPy interpolation method :class:`scipy.interpolate.CubicSpline` as method ``cubicspline`` (:issue:`33670`) | |||
- The ``ExtensionArray`` class has now an ``equals`` method, similarly to ``Series.equals()`` (:issue:`27081`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can add the doc reference here
pandas/core/internals/blocks.py
Outdated
@@ -1864,6 +1864,11 @@ def where( | |||
|
|||
return [self.make_block_same_class(result, placement=self.mgr_locs)] | |||
|
|||
def equals(self, other) -> bool: | |||
if self.dtype != other.dtype or self.shape != other.shape: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need this check here? (its already done on the .values no?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good question. I just copied that from the base Block.equals method (so it's done there as well). There, array_equivalent
is used. Maybe that is less strict and the extra check is needed, which is not needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least from the docstring of array_equivalent
(not sure how up to date that is), using that method indeed requires a more strict check in advance of calling it:
pandas/pandas/core/dtypes/missing.py
Lines 360 to 363 in 3ed7dff
in corresponding locations. False otherwise. It is assumed that left and | |
right are NumPy arrays of the same dtype. The behavior of this function | |
(particularly with respect to NaNs) is not defined if the dtypes are | |
different. |
But so, for EA.equals that is not needed, so will remove that check here.
Travis failures are unrelated (network issues, S3) |
Co-authored-by: Joris Van den Bossche <[email protected]>
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
--- Jan 3 10:00 PST
Hey team,
Looking for a bit of feedback on this API design and appropriate testing for this PR before I get too deep into it. Could you provide some guidance regarding how to appropriately handle updating this test?
pandas/pandas/tests/extension/base/ops.py
Line 162 in 6f03e76
Appreciate the help,
Dave