Skip to content

ENH: .equals for Extension Arrays #30652

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
May 9, 2020
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
Show all changes
19 commits
Select commit Hold shift + click to select a range
36c8b88
ENH: .equals for Extension Arrays
dwhu Jan 3, 2020
786963c
ENH: Updating eq and ne methods for extension arrays.
dwhu Jan 3, 2020
6800315
Removing interval.py's __eq__ implementation due to conflict with @js…
dwhu Jan 3, 2020
a3e7b7f
ENH: Making EA eq and ne typed as Any.
dwhu Jan 3, 2020
860013f
ENH: Adding default implementation to ExtensionArray equals() and tests.
dwhu Jan 3, 2020
fc3d2c2
Merge remote-tracking branch 'upstream/master' into gh-27081
dwhu Jan 9, 2020
3da5726
Merge remote-tracking branch 'upstream/master' into gh-27081
jorisvandenbossche May 1, 2020
c5027dd
correct __eq/ne__ to be element-wise
jorisvandenbossche May 1, 2020
375664c
fix equals implementation (& instead of ==)
jorisvandenbossche May 1, 2020
b6ad2fb
base tests
jorisvandenbossche May 1, 2020
365362a
ensure to dispatch Series.equals to EA.equals
jorisvandenbossche May 1, 2020
aae2f94
Merge remote-tracking branch 'upstream/master' into gh-27081
jorisvandenbossche May 2, 2020
8d052ad
feedback: docs, whatsnew, dataframe test, strict dtype test
jorisvandenbossche May 2, 2020
9ee034e
add to reference docs
jorisvandenbossche May 2, 2020
38501e6
remove IntervalArray.__ne__
jorisvandenbossche May 2, 2020
dccec7f
type ignore following mypy issue (mypy/2783)
jorisvandenbossche May 5, 2020
0b1255f
Merge remote-tracking branch 'upstream/master' into gh-27081
jorisvandenbossche May 7, 2020
b8be858
try again without type: ignore
jorisvandenbossche May 7, 2020
4c7273f
updates
jorisvandenbossche May 8, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
53 changes: 53 additions & 0 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,7 @@ class ExtensionArray:
dropna
factorize
fillna
equals
isna
ravel
repeat
Expand All @@ -104,6 +105,7 @@ class ExtensionArray:
* _from_sequence
* _from_factorized
* __getitem__
* __eq__
* __len__
* dtype
* nbytes
Expand Down Expand Up @@ -350,6 +352,38 @@ def __iter__(self):
for i in range(len(self)):
yield self[i]

def __eq__(self, other: ABCExtensionArray) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of other should be Any I think. Passing a list / etc here is valid, it just returns False.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to update the type in the docstring too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"""
Whether the two arrays are equivalent.

Parameters
----------
other: ExtensionArray
The array to compare to this array.

Returns
-------
bool
"""

raise AbstractMethodError(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, does this need to be abstract? Or can we define it as something like

# check that the types, length match, etc.
# check that NA values in self match NA values in other...
return np.all(np.array(self) == np.array(other))

Then subclasses can override with implementations that may be faster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. I like that much better.


def __ne__(self, other: ABCExtensionArray) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about the type on other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"""
Whether the two arrays are not equivalent.

Parameters
----------
other: ExtensionArray
The array to compare to this array.

Returns
-------
bool
"""

return ~(self == other)

# ------------------------------------------------------------------------
# Required attributes
# ------------------------------------------------------------------------
Expand Down Expand Up @@ -657,6 +691,25 @@ def searchsorted(self, value, side="left", sorter=None):
arr = self.astype(object)
return arr.searchsorted(value, side=side, sorter=sorter)

def equals(self, other: ABCExtensionArray) -> bool:
"""
Return if another array is equivalent to this array.

Parameters
----------
other: ExtensionArray
Array to compare to this Array.

Returns
-------
boolean
Whether the arrays are equivalent.

"""
return isinstance(other, self.__class__) and (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps type(self) == type(other)? It's not clear to me whether a subclass should compare .equals to a parent class. Thoughts @jreback?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will want some length checking here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added type(self) == type(other) and length checks.

((self == other) | (self.isna() == other.isna())).all()
)

def _values_for_factorize(self) -> Tuple[np.ndarray, Any]:
"""
Return an array and missing value suitable for factorization.
Expand Down
9 changes: 9 additions & 0 deletions pandas/core/arrays/integer.py
Original file line number Diff line number Diff line change
Expand Up @@ -376,6 +376,15 @@ def __getitem__(self, item):

return type(self)(self._data[item], self._mask[item])

def __eq__(self, other):
return (
isinstance(other, IntegerArray)
and hasattr(other, "_data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have IntegerArray.__eq__ on master. This shouldn't be necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

and self._data == other._data
and hasattr(other, "_mask")
and self._mask == other._mask
)

def _coerce_to_ndarray(self, dtype=None, na_value=lib._no_default):
"""
coerce to an ndarary of object dtype
Expand Down
15 changes: 15 additions & 0 deletions pandas/tests/extension/base/methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -358,3 +358,18 @@ def test_repeat_raises(self, data, repeats, kwargs, error, msg, use_numpy):
np.repeat(data, repeats, **kwargs)
else:
data.repeat(repeats, **kwargs)

def test_equals(self, data, na_value):
cls = type(data)
ser = pd.Series(cls._from_sequence(data, dtype=data.dtype))
na_ser = pd.Series(cls._from_sequence([na_value], dtype=data.dtype))

assert data.equals(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make all these assert <condition> is True, to ensure that True or False is actually returned

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into a funny issue doing this: numpy/numpy#5942
I'm going to keep this as is now to keep the test consistent within itself.

.equals() returns a np.bool_ type. As result when you compare it to bool type True using is (i.e. some_np_bool_True_val is True == False

Coercing np.bool_ to bool using the bool() constructor did not work either. == True causes PEP8 errors asking me to use assert <condition> or assert <condition> is True.

Open to suggestions if you have any.

assert ser.equals(ser)
assert na_ser.equals(na_ser)

assert not data.equals(na_value)
assert not na_ser.equals(ser)
assert not ser.equals(na_ser)
assert not ser.equals(0)
assert not na_ser.equals(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add test for other inputs like an ndarray.

Add a test for unequal length things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

12 changes: 0 additions & 12 deletions pandas/tests/extension/base/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -132,10 +132,8 @@ class BaseComparisonOpsTests(BaseOpsUtil):
def _compare_other(self, s, data, op_name, other):
op = self.get_op_from_name(op_name)
if op_name == "__eq__":
assert getattr(data, op_name)(other) is NotImplemented
assert not op(s, other).all()
elif op_name == "__ne__":
assert getattr(data, op_name)(other) is NotImplemented
assert op(s, other).all()

else:
Expand All @@ -158,13 +156,3 @@ def test_compare_array(self, data, all_compare_operators):
s = pd.Series(data)
other = pd.Series([data[0]] * len(data))
self._compare_other(s, data, op_name, other)

def test_direct_arith_with_series_returns_not_implemented(self, data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undo this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the other comment. The whole PR is intended to break this test since we implement __eq__ and __ne__.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the other comment. The whole PR is intended to break this test since we implement eq and ne.

The reason this needs to be kept is that, although __eq__ is a required method on an ExtensionArray now, we still require it to return NotImplemented when other is a Series or Index.

# EAs should return NotImplemented for ops with Series.
# Pandas takes care of unboxing the series and calling the EA's op.
other = pd.Series(data)
if hasattr(data, "__eq__"):
result = data.__eq__(other)
assert result is NotImplemented
else:
raise pytest.skip(f"{type(data).__name__} does not implement __eq__")
7 changes: 7 additions & 0 deletions pandas/tests/extension/json/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -110,6 +110,13 @@ def __setitem__(self, key, value):
assert isinstance(v, self.dtype.type)
self.data[k] = v

def __eq__(self, other):
return (
isinstance(other, JSONArray)
and hasattr(other, "data")
and self.data == other.data
)

def __len__(self) -> int:
return len(self.data)

Expand Down
4 changes: 4 additions & 0 deletions pandas/tests/extension/test_sparse.py
Original file line number Diff line number Diff line change
Expand Up @@ -302,6 +302,10 @@ def test_searchsorted(self, data_for_sorting, as_series):
with tm.assert_produces_warning(PerformanceWarning):
super().test_searchsorted(data_for_sorting, as_series)

def test_equals(self, data, na_value):
self._check_unsupported(data)
super().test_equals(data, na_value)


class TestCasting(BaseSparseTests, base.BaseCastingTests):
pass
Expand Down