BUG: pd.compare does not recognize differences when comparing values with null Int64 data type #48966

Merged
merged 11 commits into from
Oct 14, 2022
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.6.0.rst
@@ -209,6 +209,8 @@ Indexing
- Bug in :meth:`DataFrame.reindex` filling with wrong values when indexing columns and index for ``uint`` dtypes (:issue:`48184`)
- Bug in :meth:`DataFrame.reindex` casting dtype to ``object`` when :class:`DataFrame` has single extension array column when re-indexing ``columns`` and ``index`` (:issue:`48190`)
- Bug in :func:`~DataFrame.describe` when formatting percentiles in the resulting index showed more decimals than needed (:issue:`46362`)
- Bug in :meth:`DataFrame.compare` does not recognize differences when comparing values with null Int64 data type (:issue:`48939`)
Member

when comparing NA with value in nullable dtypes

Contributor Author

Modified the note.

-

Missing
^^^^^^^
1 change: 1 addition & 0 deletions pandas/core/generic.py
@@ -9262,6 +9262,7 @@ def compare(
)

mask = ~((self == other) | (self.isna() & other.isna()))
mask.fillna(True, inplace=True)

if not keep_equal:
self = self.where(mask)
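
A minimal sketch of why the added `fillna(True)` line matters, assuming two small frames with a nullable `Int64` column. The names `left`, `right`, `eq`, `both_na`, and `filled` are illustrative and do not come from the PR. Comparing a value with `pd.NA` yields `<NA>` rather than `False`, so the mask carries a missing entry until it is filled:

```python
import pandas as pd

# Hypothetical inputs: the second row compares a real value (4) with pd.NA.
left = pd.DataFrame({"a": pd.Series([4, 4], dtype="Int64")})
right = pd.DataFrame({"a": pd.Series([1, pd.NA], dtype="Int64")})

# Element-wise equality on nullable dtypes propagates NA: [False, <NA>]
eq = left == right

# Only rows where BOTH sides are missing count as "matching nulls": [False, False]
both_na = left.isna() & right.isna()

# Same expression as in compare(); the second row is still undecided: [True, <NA>]
mask = ~(eq | both_na)

# Treat "unknown" as "different" so the row is reported: [True, True]
filled = mask.fillna(True)

print(mask["a"].tolist())
print(filled["a"].tolist())
```

Without the fill, the `<NA>` entry is not a definite `True`, which appears to be how the difference in the second row went unreported.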
34 changes: 33 additions & 1 deletion pandas/tests/frame/methods/test_compare.py
@@ -245,7 +245,39 @@ def test_compare_ea_and_np_dtype():
result = df1.compare(df2, keep_shape=True)
expected = pd.DataFrame(
{
- ("a", "self"): [4.0, np.nan],
+ ("a", "self"): [4.0, 4],
("a", "other"): pd.Series([1, pd.NA], dtype="Int64"),
("b", "self"): np.nan,
("b", "other"): np.nan,
}
)
tm.assert_frame_equal(result, expected)


def test_compare_with_equal_null_int64_dtypes():
# GH #48939
# two int64 nulls are considered same.
df1 = pd.DataFrame({"a": pd.Series([4.0, pd.NA], dtype="Int64"), "b": [1.0, 2]})
Member

cc @jorisvandenbossche should pd.NA be considered equal here?

Member

The docstring of compare says: "Matching NaNs will not appear as a difference.", so yes, that seems the intention?
(which should be updated to be "matching NaN / NA")

Member

I think that wasn’t written with pd.NA in mind. So we could decide differently here, if it makes sense in the overall picture

Contributor Author

compare uses pd.isna to check missing values, and because of that it already considers NaN and NA the same.
If compare currently says these two are equal, then users would expect it to say two NAs are equal.

Current behavior:

In [3]: df1 = pd.Series([4, np.nan])

In [4]: df2 = pd.Series([4, pd.NA], dtype='Int64')

In [5]: df1.compare(df2)
Out[5]:
Empty DataFrame
Columns: [self, other]
Index: []
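
The point about `pd.isna` can be checked directly. The snippet below is a small sketch, with illustrative variable names, showing that `pd.isna` treats `np.nan` and `pd.NA` alike even though direct equality behaves differently for each:

```python
import numpy as np
import pandas as pd

# pd.isna treats both missing-value markers as missing.
nan_is_missing = pd.isna(np.nan)   # True
na_is_missing = pd.isna(pd.NA)     # True

# Direct equality differs between the two markers.
nan_equals_nan = np.nan == np.nan  # False, per IEEE 754 floating-point rules
na_equals_na = pd.NA == pd.NA      # <NA>, the result is itself unknown

print(nan_is_missing, na_is_missing, nan_equals_nan, na_equals_na)
```

This is why a check based on `isna()` on both sides can treat two missing values as matching, while the `==` part of the mask cannot.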

df2 = pd.DataFrame({"a": pd.Series([3, pd.NA], dtype="Int64"), "b": [1.0, 2]})
result = df1.compare(df2)
expected = pd.DataFrame(
{
("a", "self"): pd.Series([4], dtype="Int64"),
("a", "other"): pd.Series([3], dtype="Int64"),
}
)
tm.assert_frame_equal(result, expected)


def test_compare_with_unequal_null_int64_dtypes():
# GH #48939
Member

You could parametrize the first test, would reduce the duplicated code (test_compare_ea_and_np_dtype)

Contributor Author

Parametrized the test_compare_ea_and_np_dtype test case and added 2 cases that compare EA and Series:

  1. compares a non-null value to NA
  2. compares an NA value to NA

Added one more test with 2 cases comparing only Int64 dtype frames:

  1. with different values
  2. with same values

all with keep_shape=True
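
As a rough illustration of the Int64-only cases, the sketch below runs two inputs through one shared check. It is not the PR's test code: the helper `differing_rows` and the `cases` table are hypothetical, it uses the default `keep_shape=False` (unlike the tests described above) so that only differing rows remain, and it assumes a pandas version that includes this fix. In a test suite the same table could be passed to `pytest.mark.parametrize`.

```python
import pandas as pd


def differing_rows(self_vals, other_vals):
    """Return the row labels that DataFrame.compare reports as different."""
    df1 = pd.DataFrame({"a": pd.Series(self_vals, dtype="Int64"), "b": [1.0, 2.0]})
    df2 = pd.DataFrame({"a": pd.Series(other_vals, dtype="Int64"), "b": [1.0, 2.0]})
    return df1.compare(df2).index.tolist()


# (values in df1["a"], values in df2["a"], rows expected to differ)
cases = [
    ([4, pd.NA], [3, pd.NA], [0]),   # matching NAs are not a difference
    ([4, 4], [1, pd.NA], [0, 1]),    # value vs. NA is a difference
]

results = [differing_rows(self_vals, other_vals) for self_vals, other_vals, _ in cases]
for (_, _, expected_rows), got in zip(cases, results):
    assert got == expected_rows
```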

# comparison with int64 null dtype shouldn't obscure result.
df1 = pd.DataFrame({"a": pd.Series([4.0, 4], dtype="Int64"), "b": [1.0, 2]})
df2 = pd.DataFrame({"a": pd.Series([1, pd.NA], dtype="Int64"), "b": [1.0, 2]})
result = df1.compare(df2, keep_shape=True)
expected = pd.DataFrame(
{
("a", "self"): pd.Series([4.0, 4], dtype="Int64"),
("a", "other"): pd.Series([1, pd.NA], dtype="Int64"),
("b", "self"): np.nan,
("b", "other"): np.nan,