-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Add fillna at the beginning of _where not to fill NA. #60729 #60772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 13 commits
bc9a942
558569f
bbbc720
e2f32cb
6fd8986
d2d5f62
475f2d1
db30b58
55fe420
cb94cf7
71e442e
b6bd3af
8bac997
89bc1b4
f154cf5
eed6121
9ac81f0
7e3fd3a
8c5ffff
5516517
2437ce2
9556aa4
b64b8a7
9574746
c073c0b
4eea08e
bbc5612
0851593
98fb602
915b8a7
7611f59
88b0530
044d0a9
601f6c9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9698,8 +9698,19 @@ def _where( | |
if axis is not None: | ||
axis = self._get_axis_number(axis) | ||
|
||
# align the cond to same shape as myself | ||
cond = common.apply_if_callable(cond, self) | ||
|
||
# We should not be filling NA. See GH#60729 | ||
if isinstance(cond, np.ndarray): | ||
cond = np.array(cond) | ||
cond[isna(cond)] = True | ||
elif isinstance(cond, NDFrame): | ||
cond = cond.fillna(True) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Below we s = Series([1.0, 2.0, 3.0, 4.0])
cond = Series([True, False])
print(s.mask(cond))
# 0 NaN
# 1 2.0
# 2 NaN
# 3 NaN
# dtype: float64
s.mask(cond, inplace=True)
print(s)
# 0 NaN
# 1 2.0
# 2 3.0
# 3 4.0
# dtype: float64 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks @rhshadrach , My suggestion is to add an argument to |
||
elif isinstance(cond, (list, tuple)): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The docs for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've change the code, but it seems this causes test failures. I'll convert the status to draft and revise the code. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. just confirmed that all the tests passed. Thanks! |
||
cond = np.array(cond) | ||
cond[isna(cond)] = True | ||
|
||
# align the cond to same shape as myself | ||
if isinstance(cond, NDFrame): | ||
# CoW: Make sure reference is not kept alive | ||
if cond.ndim == 1 and self.ndim == 2: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,9 @@ | ||
import numpy as np | ||
import pytest | ||
|
||
from pandas import Series | ||
from pandas import ( | ||
Series, | ||
) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you revert this change. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Code reverted. Thanks! |
||
import pandas._testing as tm | ||
|
||
|
||
|
@@ -67,3 +69,15 @@ def test_mask_inplace(): | |
rs = s.copy() | ||
rs.mask(cond, -s, inplace=True) | ||
tm.assert_series_equal(rs, s.mask(cond, -s)) | ||
|
||
|
||
@pytest.mark.parametrize("dtype", ["Int64", "int64[pyarrow]"]) | ||
def test_mask_na(dtype): | ||
if dtype == "int64[pyarrow]": | ||
pytest.importorskip("pyarrow") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead, can you change the parametrization to be:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the suggestion. just changed the code. |
||
# We should not be filling pd.NA. See GH#60729 | ||
series = Series([None, 1, 2, None, 3, 4, None], dtype=dtype) | ||
result = series.mask(series <= 2, -99) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you also test the case where the condition is an ndarray and a list. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks @rhshadrach , add tests for a list and an ndarray. Could you review the code change? |
||
expected = Series([None, -99, -99, None, 3, 4, None], dtype=dtype) | ||
|
||
tm.assert_series_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either - the missing values should propogate for all types. We may just be missing coverage for the NaN case (which should be added to the test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, @WillAyd .
I thought we could make the values propagate by filling
cond
withTrue
, since_where()
would finally keep the values inself
alive where itscond
isTrue
.Even if I don't fill those values here,
_where
would callfillna()
using inplace at the below code. That's also why the result varies depending on whetherinpalce=True
or not.pandas/pandas/core/generic.py
Lines 9695 to 9698 in e3b2de8
Could you explain in more detail what you mean by propagate for all type? Do you mean we need to keep NA as it is even after this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @WillAyd,
I've done some further investigations on this, but I still belive the current code is the simplest way to make the missing values propagate.
If we want to let NA propagate without calling
fillna()
here, there might be too many code changes needed. See below codes :where()
ormask()
. If we don't,fillna()
will fill them withinplace
.pandas/pandas/core/generic.py
Lines 9695 to 9698 in f1441b2
to_numpy()
will fill the missing value usinginplace
when cond is a DataFrame.pandas/pandas/core/generic.py
Lines 9703 to 9716 in f1441b2
extract_bool_array()
fills the missing values using argna_value=False
atEABackedBlock.where()
, we might need to find every single NA index from cond before we call this function(using isna() for example) and then implement additional behaviour to make those values propagate atExtensionArray._where()
.pandas/pandas/core/internals/blocks.py
Lines 1664 to 1668 in f1441b2
pandas/pandas/core/array_algos/putmask.py
Lines 116 to 127 in f1441b2
If _where() is trying to fill the missing values for cond anyway, I think we don't necessarily have to disfavour the current code change. Could you give me some feedback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd
By filling in the missing values on
cond
with True, the missing value in the caller propagates. It's not filling in this missing values oncond
that then fails to properly propagate the caller's missing value.