Skip to content

ENH: .equals for Extension Arrays #30652

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
May 9, 2020
Merged

Conversation

dwhu
Copy link
Contributor

@dwhu dwhu commented Jan 3, 2020

--- Jan 3 10:00 PST

Hey team,

Looking for a bit of feedback on this API design and appropriate testing for this PR before I get too deep into it. Could you provide some guidance regarding how to appropriately handle updating this test?

def test_direct_arith_with_series_returns_not_implemented(self, data):
It appears to contradict the issue #27081.

Appreciate the help,
Dave

@charlesdong1991
Copy link
Member

thanks for your contribution! @dwhu

in general here, if it is NotImplemted, and you implement some new functionalities, you could remove this one, and replace it with the corresponding tests to test your new implementations!

Good luck! and hope this helps

@dwhu
Copy link
Contributor Author

dwhu commented Jan 3, 2020

in general here, if it is NotImplemted, and you implement some new functionalities, you could remove this one, and replace it with the corresponding tests to test your new implementations!

@charlesdong1991 - Sounds good. Thanks for the feedback. Wanted to make sure I wasn't missing something or breaking a behavior contract by adding this behavior.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most of the __eq__ implementations you've added can be removed.

@jschendel is in the middle of a PR fixing IntervalArray.__eq__, so ignore that for now if it's causing failures.

We'll need some base tests in pandas/tests/extensions/base/ops.py ensuring that __eq__ and __ne__ are implemented correctly. We'll want a few tests in tests/extensions/base/methods.py ensuring that .equals is tested. These tests are a bit tricky to write, so let us know if you have issues.

bool
"""

raise AbstractMethodError(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be return ~(self == other)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Whether the arrays are equivalent.

"""
return ((self == other) | (self.isna() == other.isna())).all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this also needs to ensure that the type of other matches the type of self. We don't want pd.array([1, 2]).equals(np.array([1, 2])) to be True (I think).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if not isinstance(other, BooleanArray):
return NotImplemented
return (
hasattr(other, "_data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BooleanArray should already have a __eq__ method (see create_comparison_method Was this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed. Missed that in my WIP PR.

dwhu added 2 commits January 3, 2020 13:40
Adding __eq__ to ExtensionArray Abstract method doc string.

Adding ne implementation to EA base class.

Also removing other implementations.

Updating EA equals method and adding tests.

pandas-devGH-27081
@dwhu dwhu changed the title [WIP] ENH: .equals for Extension Arrays ENH: .equals for Extension Arrays Jan 3, 2020
@dwhu dwhu requested a review from TomAugspurger January 3, 2020 21:54
@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 3, 2020
@@ -350,6 +352,38 @@ def __iter__(self):
for i in range(len(self)):
yield self[i]

def __eq__(self, other: ABCExtensionArray) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of other should be Any I think. Passing a list / etc here is valid, it just returns False.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to update the type in the docstring too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


raise AbstractMethodError(self)

def __ne__(self, other: ABCExtensionArray) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about the type on other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

bool
"""

raise AbstractMethodError(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, does this need to be abstract? Or can we define it as something like

# check that the types, length match, etc.
# check that NA values in self match NA values in other...
return np.all(np.array(self) == np.array(other))

Then subclasses can override with implementations that may be faster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. I like that much better.

Whether the arrays are equivalent.

"""
return isinstance(other, self.__class__) and (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps type(self) == type(other)? It's not clear to me whether a subclass should compare .equals to a parent class. Thoughts @jreback?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will want some length checking here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added type(self) == type(other) and length checks.

def __eq__(self, other):
return (
isinstance(other, IntegerArray)
and hasattr(other, "_data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have IntegerArray.__eq__ on master. This shouldn't be necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

ser = pd.Series(cls._from_sequence(data, dtype=data.dtype))
na_ser = pd.Series(cls._from_sequence([na_value], dtype=data.dtype))

assert data.equals(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make all these assert <condition> is True, to ensure that True or False is actually returned

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into a funny issue doing this: numpy/numpy#5942
I'm going to keep this as is now to keep the test consistent within itself.

.equals() returns a np.bool_ type. As result when you compare it to bool type True using is (i.e. some_np_bool_True_val is True == False

Coercing np.bool_ to bool using the bool() constructor did not work either. == True causes PEP8 errors asking me to use assert <condition> or assert <condition> is True.

Open to suggestions if you have any.

assert not na_ser.equals(ser)
assert not ser.equals(na_ser)
assert not ser.equals(0)
assert not na_ser.equals(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add test for other inputs like an ndarray.

Add a test for unequal length things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -158,13 +156,3 @@ def test_compare_array(self, data, all_compare_operators):
s = pd.Series(data)
other = pd.Series([data[0]] * len(data))
self._compare_other(s, data, op_name, other)

def test_direct_arith_with_series_returns_not_implemented(self, data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undo this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the other comment. The whole PR is intended to break this test since we implement __eq__ and __ne__.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the other comment. The whole PR is intended to break this test since we implement eq and ne.

The reason this needs to be kept is that, although __eq__ is a required method on an ExtensionArray now, we still require it to return NotImplemented when other is a Series or Index.

@dwhu dwhu requested a review from TomAugspurger January 3, 2020 23:33
@WillAyd
Copy link
Member

WillAyd commented Feb 12, 2020

@dwhu looks like CI is red - can you take a look and fix up the merge conflict?

@WillAyd
Copy link
Member

WillAyd commented Mar 14, 2020

Closing as I think this has gone stale but ping if you'd like to pick back up. Thanks!

@jorisvandenbossche
Copy link
Member

@dwhu I am going to look into further updating this, if that's OK with you

@jorisvandenbossche
Copy link
Member

OK, I updated this with the following things:

  • Made __eq__ into an abstract method so it is required to implement (another option could be to return NotImplemented as default, but that's actually wrong, as it results in not doing element-wise comparison for EA == EA, but just returning False)
  • Fixed the equals imlementation: it became a bit longer to ensure we check for BooleanArray (if the == comparison returns a nullable boolean array, we need to fill NAs with False)
  • Added ExtensionBlock.equals (overriding the base Block method) to ensure we dispatch to the new EA method
  • A bit of clean-up of the tests

Should be ready for a new review now @TomAugspurger @jbrockmendel

@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone May 1, 2020
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice. Can you add a release note?

@@ -333,6 +335,18 @@ def __iter__(self):
for i in range(len(self)):
yield self[i]

def __eq__(self, other: Any) -> Union[np.ndarray, "ExtensionArray"]:
"""
Return for `self == other` (element-wise equality).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have some guidance here that if other is a Series then you should return NotImplemented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use ops.common.unpack_zerodim_and_defer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use ops.common.unpack_zerodim_and_defer

This method is to be implemented by EA authors, so those can't use that helper (unless we expose somewhere a public version of this).

(we could of course use that for our own EAs, but this PR is not changing any existing __eq__ implementation at the moment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have some guidance here that if other is a Series then you should return NotImplemented.

Added a comment about that

Comment on lines +714 to +715
if not type(self) == type(other):
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should verify that this is the behavior we want. Namely

  1. other array-likes are not equivalent, even if they are all equal.
  2. subclasses are not equivalent, even if they are all equal.

The first seems fine. Not sure about the second.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning to open an issue about this after this PR (and keep it strict here), because this is right now a bit inconsistent within pandas, and might require a more general discussion / clean-up (eg Series.equals is more strict (requires same dtype) than Index.equals ...)
But we can certainly also have the discussion here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to be strict for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it even more strict (same dtype, not just same class), and added a test for that.
Will open an issue for the general discussion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened an issue for this at #33940

In the end, it seems mainly to come to whether the dtype should exactly be equal or not.

Since for EAs, the dtype is right now tied to the array class, using equal dtype for now also implies the same class (no additional check to allow sublcasses).

@@ -421,3 +421,29 @@ def test_repeat_raises(self, data, repeats, kwargs, error, msg, use_numpy):
np.repeat(data, repeats, **kwargs)
else:
data.repeat(repeats, **kwargs)

def test_equals(self, data, na_value, as_series):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also include DataFrame in this parameterization, or do it somewhere else? Just to ensure coverage of the changes to blocks.py.

"""
raise AbstractMethodError(self)

def __ne__(self, other: Any) -> Union[np.ndarray, "ExtensionArray"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should allow us to remove the equivalent IntervalArray definition here:

def __ne__(self, other):
return ~self.__eq__(other)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Union[np.ndarray, "ExtensionArray"] -> ArrayLike

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jschendel done!

# boolean array with NA -> fill with False
equal_values = equal_values.fillna(False)
equal_na = self.isna() & other.isna()
return (equal_values | equal_na).all().item()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the .item() necessary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the .item() necessary?

It's to have a python bool, instead of a numpy bool, as result.
I added a comment to the tests to make it explicit this is the reason we are asserting with is True/False (and later on we don't inadvertedly "clean" that up)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for clarifying. i was recently reminded that np.array(np.timedelta64(1234, "ns")).item() gives an int instead of timedelta64, so im now cautious around .item()

@jorisvandenbossche
Copy link
Member

@simonjayhawkins how do I solve this typing error?

pandas/core/arrays/base.py:338: error: Return type "Union[Any, ExtensionArray]" of "__eq__" incompatible with return type "bool" in supertype "object"

So our annotation of __eq__ is incompatible with the base "object" class, but we do this on purpose. So is there another way to solve this than to suppress the error in some way?

@jorisvandenbossche
Copy link
Member

Any more comments on this PR?

@jreback
Copy link
Contributor

jreback commented May 5, 2020

will look soon

@@ -335,7 +335,7 @@ def __iter__(self):
for i in range(len(self)):
yield self[i]

def __eq__(self, other: Any) -> ArrayLike:
def __eq__(self, other: Any) -> ArrayLike: # type: ignore[override] # NOQA
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not familiar with the ignore[overrride] # NOQA pattern. do we use that elsewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I posted the mypy issue, but apparently forgot.

So the problem here is that mypy expects a "bool" return for __eq__ since the base object is typed that way in the stubs. In python/mypy#2783, the recommended way to solve this is the above with # type: ignore[override].

But flake8 doesn't like that, hence also the #NOQA (PyCQA/pyflakes#475).

This was already done in another PR as well:

def sort_values( # type: ignore[override] # NOQA # issue 27237

@jorisvandenbossche
Copy link
Member

@jreback you wanted to take a look here?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. a couple of minor comment.s

@@ -150,6 +150,7 @@ Other enhancements
such as ``dict`` and ``list``, mirroring the behavior of :meth:`DataFrame.update` (:issue:`33215`)
- :meth:`~pandas.core.groupby.GroupBy.transform` and :meth:`~pandas.core.groupby.GroupBy.aggregate` has gained ``engine`` and ``engine_kwargs`` arguments that supports executing functions with ``Numba`` (:issue:`32854`, :issue:`33388`)
- :meth:`~pandas.core.resample.Resampler.interpolate` now supports SciPy interpolation method :class:`scipy.interpolate.CubicSpline` as method ``cubicspline`` (:issue:`33670`)
- The ``ExtensionArray`` class has now an ``equals`` method, similarly to ``Series.equals()`` (:issue:`27081`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can add the doc reference here

@@ -1864,6 +1864,11 @@ def where(

return [self.make_block_same_class(result, placement=self.mgr_locs)]

def equals(self, other) -> bool:
if self.dtype != other.dtype or self.shape != other.shape:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need this check here? (its already done on the .values no?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good question. I just copied that from the base Block.equals method (so it's done there as well). There, array_equivalent is used. Maybe that is less strict and the extra check is needed, which is not needed here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least from the docstring of array_equivalent (not sure how up to date that is), using that method indeed requires a more strict check in advance of calling it:

in corresponding locations. False otherwise. It is assumed that left and
right are NumPy arrays of the same dtype. The behavior of this function
(particularly with respect to NaNs) is not defined if the dtypes are
different.

But so, for EA.equals that is not needed, so will remove that check here.

@jorisvandenbossche
Copy link
Member

Travis failures are unrelated (network issues, S3)

@jorisvandenbossche jorisvandenbossche merged commit f21bc99 into pandas-dev:master May 9, 2020
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: .equals for Extension Arrays
8 participants