-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Add fillna at the beginning of _where not to fill NA. #60729 #60772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 8 commits
bc9a942
558569f
bbbc720
e2f32cb
6fd8986
d2d5f62
475f2d1
db30b58
55fe420
cb94cf7
71e442e
b6bd3af
8bac997
89bc1b4
f154cf5
eed6121
9ac81f0
7e3fd3a
8c5ffff
5516517
2437ce2
9556aa4
b64b8a7
9574746
c073c0b
4eea08e
bbc5612
0851593
98fb602
915b8a7
7611f59
88b0530
044d0a9
601f6c9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -9674,6 +9674,13 @@ def _where( | |||||||||||||||||||
if axis is not None: | ||||||||||||||||||||
axis = self._get_axis_number(axis) | ||||||||||||||||||||
|
||||||||||||||||||||
# We should not be filling NA. See GH#60729 | ||||||||||||||||||||
if isinstance(cond, np.ndarray): | ||||||||||||||||||||
cond = np.array(cond) | ||||||||||||||||||||
cond[np.isnan(cond)] = True | ||||||||||||||||||||
elif isinstance(cond, NDFrame): | ||||||||||||||||||||
cond = cond.fillna(True) | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We also do fillna on L9704 below. Can these be combined? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, what other types besides ndarray and NDFrame get here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Hi @rhshadrach, thanks for the review! According to the documentation, cond should be one of these : bool In case we input Lines 10096 to 10098 in 57fd502
In case we input Lines 9712 to 9717 in 57fd502
Please let me know if I'm missing anything. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi @rhshadrach, sorry for the confusion. I just realized that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @rhshadrach, tests.frame.indexing.test_mask.test_mask_stringdtype[Series]def test_mask_stringdtype(frame_or_series):
# GH 40824
obj = DataFrame(
{"A": ["foo", "bar", "baz", NA]},
index=["id1", "id2", "id3", "id4"],
dtype=StringDtype(),
)
filtered_obj = DataFrame(
{"A": ["this", "that"]}, index=["id2", "id3"], dtype=StringDtype()
)
expected = DataFrame(
{"A": [NA, "this", "that", NA]},
index=["id1", "id2", "id3", "id4"],
dtype=StringDtype(),
)
if frame_or_series is Series:
obj = obj["A"]
filtered_obj = filtered_obj["A"]
expected = expected["A"]
filter_ser = Series([False, True, True, False])
result = obj.mask(filter_ser, filtered_obj)
tm.assert_equal(result, expected) result
>>>
id1 foo
id2 bar
id3 baz
id4 <NA>
Name: A, dtype: string
expected
>>>
id1 <NA>
id2 this
id3 that
id4 <NA>
Name: A, dtype: string I suspect the behaviour of obj
>>>
id1 foo
id2 bar
id3 baz
id4 <NA>
Name: A, dtype: string
filtered_obj
>>>
id2 this
id3 that
Name: A, dtype: string
filter_ser = pd.Series([False, True, True, False])
filter_ser_2 = pd.Series([False, False, False, False])
filter_ser_3 = pd.Series([True, True, True, True])
result = obj.mask(filter_ser, filtered_obj) # Should return ["foo", "this", "that", pd.NA]. But this test is currently expecthing to be [pd.NA, "this", "that", pd.NA]
result
>>>
id1 <NA>
id2 this
id3 that
id4 <NA>
Name: A, dtype: string
result_2 = obj.mask(filter_ser_2, filtered_obj) # Should return ["foo", "bar", "baz", pd.NA]
result_2
>>>
id1 <NA>
id2 this
id3 that
id4 <NA>
Name: A, dtype: string
result_3 = obj.mask(filter_ser_3, filtered_obj) # Should reutrn ["pd.NA, "this", "that", pd.NA]
result_3
>>>
id1 <NA>
id2 this
id3 that
id4 <NA>
Name: A, dtype: string I think I'd better open another issue regarding this, but for now, I suppose we'd best to leave failing teststests.frame.indexing.test_getitem.TestGetitemBooleanMask.test_getitem_boolean_series_with_duplicate_columns |
||||||||||||||||||||
|
||||||||||||||||||||
# align the cond to same shape as myself | ||||||||||||||||||||
cond = common.apply_if_callable(cond, self) | ||||||||||||||||||||
if isinstance(cond, NDFrame): | ||||||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,10 @@ | ||
import numpy as np | ||
import pytest | ||
|
||
from pandas import Series | ||
from pandas import ( | ||
Int64Dtype, | ||
Series, | ||
) | ||
import pandas._testing as tm | ||
|
||
|
||
|
@@ -67,3 +70,12 @@ def test_mask_inplace(): | |
rs = s.copy() | ||
rs.mask(cond, -s, inplace=True) | ||
tm.assert_series_equal(rs, s.mask(cond, -s)) | ||
|
||
|
||
def test_mask_na(): | ||
# We should not be filling pd.NA. See GH#60729 | ||
series = Series([None, 1, 2, None, 3, 4, None], dtype=Int64Dtype()) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you also add a test for arrow. Can parametrize with e.g.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks, just added a test for pyarrow. |
||
result = series.mask(series <= 2, -99) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you also test the case where the condition is an ndarray and a list. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks @rhshadrach , add tests for a list and an ndarray. Could you review the code change? |
||
expected = Series([None, -99, -99, None, 3, 4, None], dtype=Int64Dtype()) | ||
|
||
tm.assert_series_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either - the missing values should propogate for all types. We may just be missing coverage for the NaN case (which should be added to the test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, @WillAyd .
I thought we could make the values propagate by filling
cond
withTrue
, since_where()
would finally keep the values inself
alive where itscond
isTrue
.Even if I don't fill those values here,
_where
would callfillna()
using inplace at the below code. That's also why the result varies depending on whetherinpalce=True
or not.pandas/pandas/core/generic.py
Lines 9695 to 9698 in e3b2de8
Could you explain in more detail what you mean by propagate for all type? Do you mean we need to keep NA as it is even after this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @WillAyd,
I've done some further investigations on this, but I still belive the current code is the simplest way to make the missing values propagate.
If we want to let NA propagate without calling
fillna()
here, there might be too many code changes needed. See below codes :where()
ormask()
. If we don't,fillna()
will fill them withinplace
.pandas/pandas/core/generic.py
Lines 9695 to 9698 in f1441b2
to_numpy()
will fill the missing value usinginplace
when cond is a DataFrame.pandas/pandas/core/generic.py
Lines 9703 to 9716 in f1441b2
extract_bool_array()
fills the missing values using argna_value=False
atEABackedBlock.where()
, we might need to find every single NA index from cond before we call this function(using isna() for example) and then implement additional behaviour to make those values propagate atExtensionArray._where()
.pandas/pandas/core/internals/blocks.py
Lines 1664 to 1668 in f1441b2
pandas/pandas/core/array_algos/putmask.py
Lines 116 to 127 in f1441b2
If _where() is trying to fill the missing values for cond anyway, I think we don't necessarily have to disfavour the current code change. Could you give me some feedback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd
By filling in the missing values on
cond
with True, the missing value in the caller propagates. It's not filling in this missing values oncond
that then fails to properly propagate the caller's missing value.