Skip to content

PERF: array_equivalent #35256

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 18 additions & 9 deletions pandas/core/dtypes/missing.py
Original file line number Diff line number Diff line change
Expand Up @@ -385,15 +385,23 @@ def array_equivalent(left, right, strict_nan: bool = False) -> bool:
... np.array([1, 2, np.nan]))
False
"""
# TODO: pass strict_nane to the .equals methods?
if isinstance(left, ABCExtensionArray):
return left.equals(right)
elif isinstance(right, ABCExtensionArray):
return right.equals(left)
Comment on lines +388 to +392
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? array_equivalent is supposed to operate on ndarrays (and the BlockManager.equals already handles dispatching the EA.equals when needed).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the PR that removed Block.equals I thought you agreed that array_equivalent should dispatch for EAs.

I'm OK with removing this check for now; its orthogonal to the perf goal.


left, right = np.asarray(left), np.asarray(right)
ldtype = left.dtype
rdtype = right.dtype

# shape compat
if left.shape != right.shape:
return False

# Object arrays can contain None, NaN and NaT.
# string dtypes must be come to this path for NumPy 1.7.1 compat
if is_string_dtype(left.dtype) or is_string_dtype(right.dtype):
if is_string_dtype(ldtype) or is_string_dtype(rdtype):

if not strict_nan:
# isna considers NaN and None to be equivalent.
Expand Down Expand Up @@ -425,28 +433,29 @@ def array_equivalent(left, right, strict_nan: bool = False) -> bool:
return True

# NaNs can occur in float and complex arrays.
if is_float_dtype(left.dtype) or is_complex_dtype(left.dtype):
if is_float_dtype(ldtype) or is_complex_dtype(ldtype):
if ldtype == rdtype and ldtype.itemsize != 16:
ileft = left.view(f"i{ldtype.itemsize}")
iright = right.view(f"i{rdtype.itemsize}")
return (ileft == iright).all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure this is always going to work correctly? It's certainly a minor corner case, but there are ways to get different NaNs, see eg https://stackoverflow.com/a/33371882/653364

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm yeah i agree here, nan's would compare not equal, this is eactly why we need the nan mask (you can use np.isnan here though)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

darn, fair point. if we cant compare after int-casting, then i think we lose pretty much all of the perf gain. closing.


# empty
if not (np.prod(left.shape) and np.prod(right.shape)):
return True
return ((left == right) | (isna(left) & isna(right))).all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compared to the original fastpath in Block.equals (#34962 (comment)), I think an important factor might also be to use np.isnan here instead of pd.isnan (which should be equivalent for float dtype).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, except for things like inf_as_na ... But which we were not using in Block.equals before anyway ..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point (two, really). Because of the inf_as_na thing, we should probably consider it as a bugfix and address it separately


elif is_datetimelike_v_numeric(left, right):
# GH#29553 avoid numpy deprecation warning
return False

elif needs_i8_conversion(left.dtype) or needs_i8_conversion(right.dtype):
elif needs_i8_conversion(ldtype) or needs_i8_conversion(rdtype):
# datetime64, timedelta64, Period
if not is_dtype_equal(left.dtype, right.dtype):
if not is_dtype_equal(ldtype, rdtype):
return False

left = left.view("i8")
right = right.view("i8")

# if we have structured dtypes, compare first
if left.dtype.type is np.void or right.dtype.type is np.void:
if left.dtype != right.dtype:
if ldtype.type is np.void or rdtype.type is np.void:
if ldtype != rdtype:
return False

return np.array_equal(left, right)
Expand Down