-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: pd.compare does not recognize differences when comparing values with null Int64 data type #48966
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments
doc/source/whatsnew/v1.6.0.rst
Outdated
@@ -209,6 +209,8 @@ Indexing | |||
- Bug in :meth:`DataFrame.reindex` filling with wrong values when indexing columns and index for ``uint`` dtypes (:issue:`48184`) | |||
- Bug in :meth:`DataFrame.reindex` casting dtype to ``object`` when :class:`DataFrame` has single extension array column when re-indexing ``columns`` and ``index`` (:issue:`48190`) | |||
- Bug in :func:`~DataFrame.describe` when formatting percentiles in the resulting index showed more decimals than needed (:issue:`46362`) | |||
- Bug in :meth:`DataFrame.compare` does not recognize differences when comparing values with null Int64 data type (:issue:`48939`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when comparing NA
with value in nullable dtypes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modified the note
def test_compare_with_equal_null_int64_dtypes(): | ||
# GH #48939 | ||
# two int64 nulls are considered same. | ||
df1 = pd.DataFrame({"a": pd.Series([4.0, pd.NA], dtype="Int64"), "b": [1.0, 2]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jorisvandenbossche should pd.NA be considered equal here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring of compare
says: "Matching NaNs will not appear as a difference.", so yes, that seems the intention?
(which should be updated to be "matching NaN / NA")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that wasn’t written with pd.NA in mind. So we could decide differently here, if it makes sense in the overall picture
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compare uses pd.isna
to check missing values, and because of that it already considers NaN
and NA
same.
If compare currently says these two are equal then users would expected it to say 2 NA
are equal.
current behavior -
In [3]: df1 = pd.Series([4, np.nan])
In [4]: df2 = pd.Series([4, pd.NA], dtype='Int64')
In [5]: df1.compare(df2)
Out[5]:
Empty DataFrame
Columns: [self, other]
Index: []
|
||
|
||
def test_compare_with_unequal_null_int64_dtypes(): | ||
# GH #48939 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could parametrize the first test, would reduce the duplicated code (test_compare_ea_and_np_dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parameterized the test_compare_ea_and_np_dtype
test case, added 2 cases that compare EA and Series
- compares non null value to
NA
- compares
NA
value toNA
Added one more test with 2 cases comparing only int64 dtype frames.
- with different values
- with same values
all with keep_shape=True
do I need to change anything or the cancelled check will be rerun? |
Hi @phofl seems like In [57]: df2 = pd.DataFrame({'a': pd.Series([4, np.nan]), 'b': [3, pd.NaT] })
In [58]: df1 = pd.DataFrame({'a': pd.Series([4, pd.NaT]), 'b': [3, np.nan] })
In [59]: df1.compare(df2)
Out[59]:
Empty DataFrame
Columns: []
Index: [] Also, something to note - In [61]: df1 = pd.DataFrame({'a': pd.Series([4, pd.NA], dtype='Int64'), 'b': [3, pd.NA] })
In [62]: df2 = pd.DataFrame({'a': pd.Series([4, pd.NA], dtype='Int64'), 'b': [3, pd.NA] })
In [63]: df1.equals(df2)
Out[63]: True |
"df1,df2,expected", | ||
[ | ||
( | ||
pd.DataFrame({"a": [4.0, 4], "b": [1.0, 2]}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant parametrising single values, not the whole object. This is really hard to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this would be easier to understand. I've changed it now. please take a look.
Don't know why the test failed in ubuntu. the test passes for me on WSL. |
df2 = pd.DataFrame({"a": pd.Series([1, pd.NA], dtype="Int64"), "b": [1.0, 2]}) | ||
result = df1.compare(df2, keep_shape=True) | ||
@pytest.mark.parametrize( | ||
"ea_val,np_dtype_val", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these names are confusing, can you clarify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ea_val
and np_dtype_val
intended to communicate value for extension array and value for np dtype array respectively.
is this ok?
"ea_val,np_dtype_val", | |
"df1_val,df2_val", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep this is clearer
[(4, pd.NA), (pd.NA, pd.NA), (pd.NA, 4)], | ||
) | ||
def test_compare_ea_and_np_dtype(ea_val, np_dtype_val): | ||
ea = [4.0, ea_val] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add gh refs
please rename variable, ea stands for extension array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will add gh refs
is arr ok?
ea = [4.0, ea_val] | |
arr = [4.0, df1_val] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep works for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick review.
"val1,val2", | ||
[(4, pd.NA), (pd.NA, pd.NA), (pd.NA, 4)], | ||
) | ||
def test_compare_ea_and_np_dtype(val1, val2): | ||
# GH 48966 | ||
arr = [4.0, val1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed variable names, please do suggest any alternate if this does not make sense
Thx @vamsi-verma-s |
…with null Int64 data type (pandas-dev#48966)
in
pd.compare
if any comparisons result innull
values likepd.NA
, consider the values unequal.doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.