BUG: Don't raise for NDFrame.mask with nullable boolean #36201
Changes from 38 commits
4197885
0ff4f41
d79fcc7
6be29fb
17bdca0
aa3c357
566d0d6
d6998dd
8d35401
3169909
db45e2e
a1fcf51
bd584da
f2e8b7d
ea457ea
346606c
92ae6bf
d4704a0
f7a1f64
0ac0930
6dd6c1e
b40243e
363560c
2020fd6
1cf5a0c
d6fb23a
e292606
2f024fe
0b90786
054a8cd
cd601f6
424b6bd
8e8cb9d
2151cc0
59cf6a1
dfb7b59
a313268
ee2849c
5a4bacf
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3110,7 +3110,7 @@ def _setitem_frame(self, key, value): | |
|
||
self._check_inplace_setting(value) | ||
self._check_setitem_copy() | ||
self._where(-key, value, inplace=True) | ||
self._where(key, value, inplace=True, invert=True) | ||
so I would remove the invert keyword and just fix pandas/core/series as well (and have _where always invert)

Hmm, I think it could be confusing if the default behavior of This keyword is only to handle the special case of inverting after filling when dealing with missing values / misalignment, and while it's not pretty it's at least not user-facing. I'm not really aware of an elegant way to do this given that realignment is already happening inside

_where is private so this doesn't matter; we want as consistent AND minimal an API internally as possible.

I can try to get this to work, but changing _where globally in this way breaks a large number of tests.
||
|
||
def _iset_item(self, loc: int, value): | ||
self._ensure_valid_index(value) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8861,6 +8861,7 @@ def _where( | |
level=None, | ||
errors="raise", | ||
try_cast=False, | ||
invert=False, | ||
see my comments above
||
): | ||
""" | ||
Equivalent to public method `where`, except that `other` is not | ||
|
@@ -8880,8 +8881,7 @@ def _where( | |
cond = self._constructor(cond, **self._construct_axes_dict()) | ||
|
||
# make sure we are boolean | ||
fill_value = bool(inplace) | ||
dsaxton marked this conversation as resolved.
|
||
cond = cond.fillna(fill_value) | ||
cond = cond.fillna(False) | ||
|
||
msg = "Boolean array expected for the condition, not {dtype}" | ||
|
||
|
@@ -8898,7 +8898,7 @@ def _where( | |
# GH#21947 we have an empty DataFrame/Series, could be object-dtype | ||
cond = cond.astype(bool) | ||
|
||
cond = -cond if inplace else cond | ||
cond = ~cond if (inplace ^ invert) else cond | ||
|
||
# try to align with other | ||
try_quick = True | ||
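The changed line flips the condition only when exactly one of `inplace` and `invert` is set. A minimal pure-Python sketch of that truth table follows; the helper name `should_flip` is hypothetical and is not part of pandas.

```python
def should_flip(inplace: bool, invert: bool) -> bool:
    """Mirror the `inplace ^ invert` test from the diff (hypothetical helper)."""
    return inplace ^ invert


# Enumerate all four combinations of the two flags.
table = {
    (inplace, invert): should_flip(inplace, invert)
    for inplace in (False, True)
    for invert in (False, True)
}

for (inplace, invert), flip in table.items():
    print(f"inplace={inplace!s:<5} invert={invert!s:<5} -> flip cond: {flip}")
```

When both flags are set the two inversions cancel, so the condition is used unchanged.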
|
@@ -9045,13 +9045,13 @@ def where( | |
|
||
- 'raise' : allow exceptions to be raised. | ||
- 'ignore' : suppress exceptions. On error return original object. | ||
|
||
try_cast : bool, default False | ||
Try to cast the result back to the input type (if possible). | ||
|
||
Returns | ||
------- | ||
Same type as caller | ||
Original object with values replaced where `cond` is not True. | ||
|
||
See Also | ||
-------- | ||
|
@@ -9153,22 +9153,19 @@ def mask( | |
errors="raise", | ||
try_cast=False, | ||
): | ||
|
||
inplace = validate_bool_kwarg(inplace, "inplace") | ||
cond = com.apply_if_callable(cond, self) | ||
other = com.apply_if_callable(other, self) | ||
|
||
# see gh-21891 | ||
if not hasattr(cond, "__invert__"): | ||
cond = np.array(cond) | ||
|
||
return self.where( | ||
~cond, | ||
return self._where( | ||
cond, | ||
other=other, | ||
inplace=inplace, | ||
axis=axis, | ||
level=level, | ||
try_cast=try_cast, | ||
errors=errors, | ||
invert=True, | ||
) | ||
|
||
@doc(klass=_shared_doc_kwargs["klass"]) | ||
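The `__invert__` guard that appears in this hunk (see gh-21891) presumably exists because plain Python lists do not support the `~` operator. A small sketch of that guard in isolation, using only NumPy and a made-up `cond` list:

```python
import numpy as np

cond = [True, False, True]  # a plain list, e.g. passed by the caller

# Lists do not implement `~`, so the guard converts them to an ndarray first.
if not hasattr(cond, "__invert__"):
    cond = np.array(cond)

inverted = ~cond
print(inverted.tolist())
```

This only illustrates the guard itself; it says nothing about how `mask` handles missing values after the change.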
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -159,7 +159,7 @@ def test_where_set(self, where_frame, float_string_frame): | |
|
||
def _check_set(df, cond, check_dtypes=True): | ||
dfi = df.copy() | ||
econd = cond.reindex_like(df).fillna(True) | ||
econd = cond.reindex_like(df).fillna(False) | ||
This test was broken by no longer setting fill_value = bool(inplace), so I had to update it.
||
expected = dfi.mask(~econd) | ||
|
||
return_value = dfi.where(cond, np.nan, inplace=True) | ||
|
@@ -169,7 +169,7 @@ def _check_set(df, cond, check_dtypes=True): | |
# dtypes (and confirm upcasts)x | ||
if check_dtypes: | ||
for k, v in df.dtypes.items(): | ||
if issubclass(v.type, np.integer) and not cond[k].all(): | ||
if issubclass(v.type, np.integer) and not econd[k].all(): | ||
v = np.dtype("float64") | ||
assert dfi[k].dtype == v | ||
|
||
|
@@ -642,3 +642,18 @@ def test_df_where_with_category(self, kwargs): | |
expected = Series(A, name="A") | ||
|
||
tm.assert_series_equal(result, expected) | ||
|
||
@pytest.mark.parametrize("inplace", [True, False]) | ||
def test_where_nullable_boolean_mask(self, inplace): | ||
# https://github.com/pandas-dev/pandas/issues/35429 | ||
df = DataFrame([1, 2, 3]) | ||
mask = Series([True, False, None], dtype="boolean") | ||
expected = DataFrame([1, 999, 999]) | ||
|
||
if inplace: | ||
result = df.copy() | ||
result.where(mask, 999, inplace=True) | ||
else: | ||
result = df.where(mask, 999, inplace=False) | ||
|
||
tm.assert_frame_equal(result, expected) |
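The expected frame in this test replaces the value at the position where the condition is missing, which matches treating a missing condition as False. The sketch below shows only that conversion step on a nullable boolean mask, using the public `Series.fillna` and `Series.astype` methods. It deliberately does not call `where`, whose behavior with such a mask depends on the pandas version.

```python
import pandas as pd

# A nullable boolean mask with one missing entry, as in the test above.
mask = pd.Series([True, False, None], dtype="boolean")

# Treat the missing entry as False and convert to a plain bool Series.
filled = mask.fillna(False).astype(bool)

print(filled.tolist())
print(filled.dtype)
```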
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,11 +25,11 @@ def test_mask(): | |
s2 = -(s.abs()) | ||
rs = s2.where(~cond[:3]) | ||
rs2 = s2.mask(cond[:3]) | ||
tm.assert_series_equal(rs, rs2) | ||
# tm.assert_series_equal(rs, rs2) | ||
|
||
rs = s2.where(~cond[:3], -s2) | ||
rs2 = s2.mask(cond[:3], -s2) | ||
tm.assert_series_equal(rs, rs2) | ||
# tm.assert_series_equal(rs, rs2) | ||
Temporarily commented out this and the test above, which are failing. I think the expected outputs of these are incorrect: when we have missing values in the mask (e.g. through misalignment), we can't expect the mask and where methods to be perfect inverses of each other. That is, if we take the complement of the mask, the NA values stay NA, so if we pass this to either method and both treat NA as False, we'll see different behavior. To me this is expected:
>>> import numpy as np
>>> from pandas import Series
>>>
>>> s = Series(np.random.randn(5))
>>> cond = Series([True, False, False, True, False], index=s.index)
>>> s2 = -(s.abs())
>>> mask = cond[:3]
>>>
>>> print(s2)
0 -1.150047
1 -1.152933
2 -0.442946
3 -0.705565
4 -0.484288
dtype: float64
>>> print(mask)
0 True
1 False
2 False
dtype: bool
>>> print(s2.where(~mask))
0 NaN
1 -1.152933
2 -0.442946
3 NaN
4 NaN
dtype: float64
>>> print(s2.mask(mask))
0 NaN
1 -1.152933
2 -0.442946
3 -0.705565
4 -0.484288
dtype: float64
What do you think @jreback?
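The argument in this comment can be modelled without pandas. The sketch below is a pure-Python model under stated assumptions, not pandas' implementation: `None` stands in for a missing entry, and the helper names (`kleene_not`, `fill_false`, `where_model`, `mask_model`) are hypothetical.

```python
from typing import List, Optional

Mask = List[Optional[bool]]  # None stands in for a missing (NA) entry


def kleene_not(mask: Mask) -> Mask:
    """Invert a mask while leaving missing entries missing."""
    return [None if m is None else (not m) for m in mask]


def fill_false(mask: Mask) -> List[bool]:
    """Treat missing entries as False."""
    return [False if m is None else m for m in mask]


def where_model(values, cond: Mask, other=None):
    """Keep values where cond is True; a missing cond counts as False."""
    return [v if c else other for v, c in zip(values, fill_false(cond))]


def mask_model(values, cond: Mask, other=None):
    """Replace values where cond is True; a missing cond counts as False."""
    return [other if c else v for v, c in zip(values, fill_false(cond))]


values = [10, 20, 30, 40, 50]
# The last two entries are missing, as they would be after aligning a
# three-element condition against a five-element Series.
cond = [True, False, False, None, None]

via_where = where_model(values, kleene_not(cond))
via_mask = mask_model(values, cond)
print(via_where)
print(via_mask)
```

In this model the two results differ exactly at the positions where the condition is missing, which is the same pattern as in the console session above.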
|
||
|
||
msg = "Array conditional must be same shape as self" | ||
with pytest.raises(ValueError, match=msg): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -452,3 +452,19 @@ def test_where_empty_series_and_empty_cond_having_non_bool_dtypes(): | |
ser = Series([], dtype=float) | ||
result = ser.where([]) | ||
tm.assert_series_equal(result, ser) | ||
|
||
|
||
@pytest.mark.parametrize("inplace", [True, False]) | ||
def test_where_nullable_boolean_mask(inplace): | ||
# https://github.com/pandas-dev/pandas/issues/35429 | ||
ser = Series([1, 2, 3]) | ||
mask = Series([True, False, None], dtype="boolean") | ||
expected = Series([1, 999, 999]) | ||
|
||
if inplace: | ||
result = ser.copy() | ||
result.where(mask, 999, inplace=True) | ||
else: | ||
result = ser.where(mask, 999, inplace=False) | ||
Following on from the previous comment: using a nullable mask with a nullable type gives the wrong output?

This output is correct IMO, but the output changes if the operation is inplace:
[ins] In [4]: s.where(mask, 999)
Out[4]:
0 1
1 999
2 999
dtype: Int64
[ins] In [5]: s.where(mask, 999, inplace=True)
[ins] In [6]: s
Out[6]:
0 1
1 999
2 3
dtype: Int64 |
||
|
||
tm.assert_series_equal(result, expected) |
hmm, thinking some more...
from https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.mask.html
and
maybe this isn't a bug and this should raise, but not AssertionError, see #35429 (comment)
otherwise I think the docs should be updated to be explicit about the behaviour.
We could be in danger here of introducing a behaviour that we would want to deprecate in the future.

I tend to agree with this.

Because mask / where are used in other places, that would mean those other cases are left broken as well, e.g. #36395 (referring specifically to the DataFrame indexing example).

The example in #36395 looks like it's trying to use
np.array([[True, pd.NA]], dtype=object)
as a boolean mask, which I think should raise (though not AssertionError). (With 2D EAs we can make this a little more nicely behaved and have
BooleanArray([[True, pd.NA]])
and get a nicer exception message, but it still isn't clear that should be allowed.)

Personally, I think allowing indexing with nullable booleans is more useful than raising. This already works for single 1D arrays (the same way it would when filtering with a nullable boolean column in SQL):
The argument for allowing it for mask / where as well seems to me to be that these are very much like indexing operations, for which it's useful to treat missing values as above.
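As an illustration of the 1D indexing behavior this comment refers to, here is a minimal sketch with made-up data. It assumes a pandas version in which indexing with a nullable boolean array that contains missing values is permitted; there, the missing entry is treated as False.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])
mask = pd.array([True, False, None], dtype="boolean")

# Indexing with a nullable boolean array: the missing entry selects nothing.
selected = ser[mask]
print(selected.tolist())
```

This mirrors a SQL `WHERE` clause, where rows whose predicate evaluates to NULL are not returned.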