BUG: pd.compare does not recognize differences when comparing values with null Int64 data type #48966

vamsi-verma-s · 2022-10-06T05:55:50Z

In pd.compare, if any elementwise comparison results in a null value such as pd.NA, consider the values unequal.

closes BUG: pd.compare does not recognize differences with the nullable Int64 data type #48939
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
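
To illustrate the description above: on nullable dtypes an elementwise `==` against a missing value yields pd.NA rather than False, so a "differs" mask built from it is undetermined at that position. The snippet below is a minimal sketch of the idea, not necessarily the code in this patch; the variable names (`left`, `right`, `raw_mask`, `diff_mask`) are made up for illustration.

```python
import pandas as pd

left = pd.Series([4, pd.NA], dtype="Int64")
right = pd.Series([4, 5], dtype="Int64")

# Elementwise equality on nullable dtypes propagates NA instead of returning False.
eq = left == right

# Positions where both sides are missing count as matching.
both_missing = left.isna() & right.isna()

# Without further handling, the "differs" mask still carries NA at position 1.
raw_mask = ~(eq | both_missing)

# Treat an undetermined comparison as a difference.
diff_mask = raw_mask.fillna(True)

print(raw_mask.isna().tolist())
print(diff_mask.tolist())
```

Filling the undetermined positions with True is what "consider the values unequal" amounts to in this sketch.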

phofl

small comments

phofl · 2022-10-06T08:25:12Z

doc/source/whatsnew/v1.6.0.rst

@@ -209,6 +209,8 @@ Indexing
 - Bug in :meth:`DataFrame.reindex` filling with wrong values when indexing columns and index for ``uint`` dtypes (:issue:`48184`)
 - Bug in :meth:`DataFrame.reindex` casting dtype to ``object`` when :class:`DataFrame` has single extension array column when re-indexing ``columns`` and ``index`` (:issue:`48190`)
 - Bug in :func:`~DataFrame.describe` when formatting percentiles in the resulting index showed more decimals than needed (:issue:`46362`)
+- Bug in :meth:`DataFrame.compare` does not recognize differences when comparing values with null Int64 data type (:issue:`48939`)


when comparing NA with value in nullable dtypes

modified the note

phofl · 2022-10-06T08:26:17Z

pandas/tests/frame/methods/test_compare.py

+def test_compare_with_equal_null_int64_dtypes():
+    # GH #48939
+    # two int64 nulls are considered same.
+    df1 = pd.DataFrame({"a": pd.Series([4.0, pd.NA], dtype="Int64"), "b": [1.0, 2]})


cc @jorisvandenbossche should pd.NA be considered equal here?

The docstring of compare says: "Matching NaNs will not appear as a difference.", so yes, that seems the intention?
(which should be updated to be "matching NaN / NA")

I think that wasn’t written with pd.NA in mind. So we could decide differently here, if it makes sense in the overall picture

compare uses pd.isna to check for missing values, and because of that it already considers NaN and NA the same.
If compare currently says these two are equal, then users would expect it to say that two NAs are equal.

current behavior -

In [3]: df1 = pd.Series([4, np.nan])

In [4]: df2 = pd.Series([4, pd.NA], dtype='Int64')

In [5]: df1.compare(df2)
Out[5]:
Empty DataFrame
Columns: [self, other]
Index: []
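
As a small, hedged illustration of the point about pd.isna made in this thread: pd.isna reports every pandas missing-value marker as missing, which is why a check built on it cannot tell NaN and NA apart. The list name `markers` is only for this example.

```python
import numpy as np
import pandas as pd

# Different missing-value markers that can show up in pandas objects.
markers = [np.nan, pd.NA, pd.NaT, None]

# pd.isna treats all of them as missing.
flags = [bool(pd.isna(value)) for value in markers]

print(flags)
```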

phofl · 2022-10-06T08:27:44Z

pandas/tests/frame/methods/test_compare.py

+
+
+def test_compare_with_unequal_null_int64_dtypes():
+    # GH #48939


You could parametrize the first test, would reduce the duplicated code (test_compare_ea_and_np_dtype)

Parametrized the test_compare_ea_and_np_dtype test case and added 2 cases that compare an EA and a Series:

- compares a non-null value to NA
- compares an NA value to NA

Added one more test with 2 cases comparing only Int64 dtype frames:

- with different values
- with same values

All with keep_shape=True.

pandas/tests/frame/methods/test_compare.py

vamsi-verma-s · 2022-10-07T02:25:25Z

Do I need to change anything, or will the cancelled check be rerun?

vamsi-verma-s · 2022-10-09T06:54:39Z

Hi @phofl

It seems that compare considers all missing values the same.
Considering this is the current behavior of the method, can we go ahead?

In [57]: df2 = pd.DataFrame({'a': pd.Series([4, np.nan]), 'b': [3, pd.NaT] })

In [58]: df1 = pd.DataFrame({'a': pd.Series([4, pd.NaT]), 'b': [3, np.nan] })

In [59]: df1.compare(df2)
Out[59]:
Empty DataFrame
Columns: []
Index: []

Also, something to note - equals considers two pd.NAs the same.

In [61]: df1 = pd.DataFrame({'a': pd.Series([4, pd.NA], dtype='Int64'), 'b': [3, pd.NA] })

In [62]: df2 = pd.DataFrame({'a': pd.Series([4, pd.NA], dtype='Int64'), 'b': [3, pd.NA] })

In [63]: df1.equals(df2)
Out[63]: True
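
The contrast behind this observation can be sketched as follows, under the assumption of a single nullable Int64 column: `equals` returns one boolean and counts missing values in the same position as equal, while elementwise `==` leaves that position as pd.NA.

```python
import pandas as pd

df1 = pd.DataFrame({"a": pd.Series([4, pd.NA], dtype="Int64")})
df2 = pd.DataFrame({"a": pd.Series([4, pd.NA], dtype="Int64")})

# equals: NA in the same location on both sides counts as equal.
same = df1.equals(df2)

# ==: the comparison at the NA position is itself NA, neither True nor False.
elementwise = df1 == df2

print(same)
print(elementwise["a"].isna().tolist())
```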

phofl · 2022-10-11T20:46:06Z

pandas/tests/frame/methods/test_compare.py

+    "df1,df2,expected",
+    [
+        (
+            pd.DataFrame({"a": [4.0, 4], "b": [1.0, 2]}),


I meant parametrising single values, not the whole object. This is really hard to read

I thought this would be easier to understand. I've changed it now. Please take a look.
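
For readers following the review, parametrizing single values rather than whole DataFrames might look like the sketch below. It is not the test merged in this PR: `diff_mask` is a hypothetical helper and `test_diff_mask_nullable` a made-up test name, used only to show the shape of the parametrization.

```python
import pandas as pd
import pytest


def diff_mask(left, right):
    # Hypothetical helper: True where the two Series differ,
    # treating an undetermined (NA) comparison as a difference.
    mask = ~((left == right) | (left.isna() & right.isna()))
    return mask.fillna(True)


@pytest.mark.parametrize(
    "val1,val2,expected",
    [(4, pd.NA, True), (pd.NA, pd.NA, False), (pd.NA, 4, True), (4, 4, False)],
)
def test_diff_mask_nullable(val1, val2, expected):
    # Only the second element varies; the objects are built inside the test.
    left = pd.Series([1, val1], dtype="Int64")
    right = pd.Series([1, val2], dtype="Int64")
    assert diff_mask(left, right).tolist() == [False, expected]
```

Keeping the parameters down to scalars leaves the frame construction in one place, which is the readability gain the reviewer asked for.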

vamsi-verma-s · 2022-10-12T09:38:36Z

I don't know why the test failed on Ubuntu; it passes for me on WSL.

phofl · 2022-10-13T22:34:24Z

pandas/tests/frame/methods/test_compare.py

-    df2 = pd.DataFrame({"a": pd.Series([1, pd.NA], dtype="Int64"), "b": [1.0, 2]})
-    result = df1.compare(df2, keep_shape=True)
+@pytest.mark.parametrize(
+    "ea_val,np_dtype_val",


these names are confusing, can you clarify

ea_val and np_dtype_val are intended to communicate the value for the extension array and the value for the NumPy dtype array, respectively.

is this ok?

Suggested change

-    "ea_val,np_dtype_val",
+    "df1_val,df2_val",

yep this is clearer

phofl · 2022-10-13T22:34:44Z

pandas/tests/frame/methods/test_compare.py

+    [(4, pd.NA), (pd.NA, pd.NA), (pd.NA, 4)],
+)
+def test_compare_ea_and_np_dtype(ea_val, np_dtype_val):
+    ea = [4.0, ea_val]


please add gh refs

please rename variable, ea stands for extension array

will add gh refs

is arr ok?

Suggested change

-    ea = [4.0, ea_val]
+    arr = [4.0, df1_val]

yep works for me

Thanks for the quick review.

vamsi-verma-s · 2022-10-14T03:19:00Z

pandas/tests/frame/methods/test_compare.py

+    "val1,val2",
+    [(4, pd.NA), (pd.NA, pd.NA), (pd.NA, 4)],
+)
+def test_compare_ea_and_np_dtype(val1, val2):
+    # GH 48966
+    arr = [4.0, val1]


Changed the variable names; please suggest an alternative if this does not make sense.

pandas/tests/frame/methods/test_compare.py

phofl · 2022-10-14T20:39:22Z

Thx @vamsi-verma-s

…with null Int64 data type (pandas-dev#48966)

vamsi-verma-s added 4 commits October 5, 2022 20:55

BUG: df.compare returns incorrect result when comparing NA values

f027ff9

add bug fix note

0f3fcef

add pd.compare tests specific to int64 null dtype

9c0a19b

change bug fix note to be specifically about int64 dtype null values

0b5ec5c

phofl reviewed Oct 6, 2022

vamsi-verma-s added 2 commits October 6, 2022 09:24

change bug fix note

d72e0e3

add nullable int64 test and parameterize ea and np dtype test

4412e33

mroeschke added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Oct 6, 2022

phofl reviewed Oct 11, 2022

vamsi-verma-s added 2 commits October 12, 2022 04:23

fix test parametrization

9b97320

Merge branch 'master' into res-48939

ffbb131

vamsi-verma-s requested a review from phofl October 12, 2022 06:13

Merge branch 'master' into res-48939

bae0808

phofl reviewed Oct 13, 2022

rename variables, add gh refs

4838dc0

vamsi-verma-s commented Oct 14, 2022

phofl reviewed Oct 14, 2022

change var name

3f739dd

vamsi-verma-s requested a review from phofl October 14, 2022 20:15

phofl approved these changes Oct 14, 2022

phofl merged commit 75d91c8 into pandas-dev:main Oct 14, 2022

phofl added this to the 2.0 milestone Oct 14, 2022

vamsi-verma-s deleted the res-48939 branch October 14, 2022 20:55

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

BUG: pd.compare does not recognize differences when comparing values …

dfcadab

…with null Int64 data type (pandas-dev#48966)


