Skip to content

BUG: pd.compare does not recognize differences when comparing values with null Int64 data type #48966

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Oct 14, 2022

Conversation

vamsi-verma-s
Copy link
Contributor

@vamsi-verma-s vamsi-verma-s commented Oct 6, 2022

in pd.compare if any comparisons result in null values like pd.NA, consider the values unequal.

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments

@@ -209,6 +209,8 @@ Indexing
- Bug in :meth:`DataFrame.reindex` filling with wrong values when indexing columns and index for ``uint`` dtypes (:issue:`48184`)
- Bug in :meth:`DataFrame.reindex` casting dtype to ``object`` when :class:`DataFrame` has single extension array column when re-indexing ``columns`` and ``index`` (:issue:`48190`)
- Bug in :func:`~DataFrame.describe` when formatting percentiles in the resulting index showed more decimals than needed (:issue:`46362`)
- Bug in :meth:`DataFrame.compare` does not recognize differences when comparing values with null Int64 data type (:issue:`48939`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when comparing NA with value in nullable dtypes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified the note

def test_compare_with_equal_null_int64_dtypes():
# GH #48939
# two int64 nulls are considered same.
df1 = pd.DataFrame({"a": pd.Series([4.0, pd.NA], dtype="Int64"), "b": [1.0, 2]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @jorisvandenbossche should pd.NA be considered equal here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring of compare says: "Matching NaNs will not appear as a difference.", so yes, that seems the intention?
(which should be updated to be "matching NaN / NA")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that wasn’t written with pd.NA in mind. So we could decide differently here, if it makes sense in the overall picture

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare uses pd.isna to check missing values, and because of that it already considers NaN and NA same.
If compare currently says these two are equal then users would expected it to say 2 NA are equal.

current behavior -

In [3]: df1 = pd.Series([4, np.nan])

In [4]: df2 = pd.Series([4, pd.NA], dtype='Int64')

In [5]: df1.compare(df2)
Out[5]:
Empty DataFrame
Columns: [self, other]
Index: []



def test_compare_with_unequal_null_int64_dtypes():
# GH #48939
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could parametrize the first test, would reduce the duplicated code (test_compare_ea_and_np_dtype)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parameterized the test_compare_ea_and_np_dtype test case, added 2 cases that compare EA and Series

  1. compares non null value to NA
  2. compares NA value to NA

Added one more test with 2 cases comparing only int64 dtype frames.

  1. with different values
  2. with same values

all with keep_shape=True

@mroeschke mroeschke added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Oct 6, 2022
@vamsi-verma-s
Copy link
Contributor Author

vamsi-verma-s commented Oct 7, 2022

do I need to change anything or the cancelled check will be rerun?

@vamsi-verma-s
Copy link
Contributor Author

vamsi-verma-s commented Oct 9, 2022

Hi @phofl

seems like compare considers all missing values the same.
Considering this is the current behavior of the method, can we go ahead?

In [57]: df2 = pd.DataFrame({'a': pd.Series([4, np.nan]), 'b': [3, pd.NaT] })

In [58]: df1 = pd.DataFrame({'a': pd.Series([4, pd.NaT]), 'b': [3, np.nan] })

In [59]: df1.compare(df2)
Out[59]:
Empty DataFrame
Columns: []
Index: []

Also, something to note - equals considers two pd.NAs the same.

In [61]: df1 = pd.DataFrame({'a': pd.Series([4, pd.NA], dtype='Int64'), 'b': [3, pd.NA] })

In [62]: df2 = pd.DataFrame({'a': pd.Series([4, pd.NA], dtype='Int64'), 'b': [3, pd.NA] })

In [63]: df1.equals(df2)
Out[63]: True

"df1,df2,expected",
[
(
pd.DataFrame({"a": [4.0, 4], "b": [1.0, 2]}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant parametrising single values, not the whole object. This is really hard to read

Copy link
Contributor Author

@vamsi-verma-s vamsi-verma-s Oct 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this would be easier to understand. I've changed it now. please take a look.

@vamsi-verma-s vamsi-verma-s requested a review from phofl October 12, 2022 06:13
@vamsi-verma-s
Copy link
Contributor Author

Don't know why the test failed in ubuntu. the test passes for me on WSL.

df2 = pd.DataFrame({"a": pd.Series([1, pd.NA], dtype="Int64"), "b": [1.0, 2]})
result = df1.compare(df2, keep_shape=True)
@pytest.mark.parametrize(
"ea_val,np_dtype_val",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these names are confusing, can you clarify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ea_val and np_dtype_val intended to communicate value for extension array and value for np dtype array respectively.

is this ok?

Suggested change
"ea_val,np_dtype_val",
"df1_val,df2_val",

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep this is clearer

[(4, pd.NA), (pd.NA, pd.NA), (pd.NA, 4)],
)
def test_compare_ea_and_np_dtype(ea_val, np_dtype_val):
ea = [4.0, ea_val]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add gh refs

please rename variable, ea stands for extension array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add gh refs

is arr ok?

Suggested change
ea = [4.0, ea_val]
arr = [4.0, df1_val]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep works for me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick review.

Comment on lines +242 to +247
"val1,val2",
[(4, pd.NA), (pd.NA, pd.NA), (pd.NA, 4)],
)
def test_compare_ea_and_np_dtype(val1, val2):
# GH 48966
arr = [4.0, val1]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed variable names, please do suggest any alternate if this does not make sense

@vamsi-verma-s vamsi-verma-s requested a review from phofl October 14, 2022 20:15
@phofl phofl merged commit 75d91c8 into pandas-dev:main Oct 14, 2022
@phofl
Copy link
Member

phofl commented Oct 14, 2022

Thx @vamsi-verma-s

@phofl phofl added this to the 2.0 milestone Oct 14, 2022
@vamsi-verma-s vamsi-verma-s deleted the res-48939 branch October 14, 2022 20:55
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pd.compare does not recognize differences with the nullable Int64 data type
4 participants