Skip to content

BUG: Don't raise for NDFrame.mask with nullable boolean #36201

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 39 commits into from
Closed
Show file tree
Hide file tree
Changes from 38 commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
4197885
BUG: Don't raise for NDFrame.mask with nullable boolean
dsaxton Sep 7, 2020
0ff4f41
Note
dsaxton Sep 7, 2020
d79fcc7
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 7, 2020
6be29fb
Dummy commit
dsaxton Sep 7, 2020
17bdca0
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 8, 2020
aa3c357
Move note
dsaxton Sep 8, 2020
566d0d6
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 9, 2020
d6998dd
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 11, 2020
8d35401
Move note
dsaxton Sep 11, 2020
3169909
Invert in _where
dsaxton Sep 11, 2020
db45e2e
Comment out some tests for now
dsaxton Sep 11, 2020
a1fcf51
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 11, 2020
bd584da
xor
dsaxton Sep 11, 2020
f2e8b7d
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 13, 2020
ea457ea
Doc
dsaxton Sep 13, 2020
346606c
Version added
dsaxton Sep 13, 2020
92ae6bf
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 13, 2020
d4704a0
Update whatsnew
dsaxton Sep 13, 2020
f7a1f64
Use invert
dsaxton Sep 13, 2020
0ac0930
One line
dsaxton Sep 13, 2020
6dd6c1e
Undo
dsaxton Sep 13, 2020
b40243e
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 13, 2020
363560c
Boolean magic
dsaxton Sep 13, 2020
2020fd6
Only comment
dsaxton Sep 13, 2020
1cf5a0c
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 16, 2020
d6fb23a
Add frame indexing test
dsaxton Sep 16, 2020
e292606
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 17, 2020
2f024fe
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 19, 2020
0b90786
Remove arg
dsaxton Sep 19, 2020
054a8cd
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 19, 2020
cd601f6
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 21, 2020
424b6bd
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 23, 2020
8e8cb9d
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 25, 2020
2151cc0
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 25, 2020
59cf6a1
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 28, 2020
dfb7b59
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 28, 2020
a313268
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Sep 29, 2020
ee2849c
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Oct 7, 2020
5a4bacf
Merge remote-tracking branch 'upstream/master' into mask-with-nullabl…
dsaxton Oct 10, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -390,6 +390,7 @@ Indexing
Missing
^^^^^^^

- Bug in :meth:`DataFrame.mask` and :meth:`DataFrame.where` raising an ``AssertionError`` when using a nullable boolean mask (:issue:`35429`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is taking the approach of treating NA as False within NDFrame.mask and NDFrame.where when dealing with a nullable boolean mask.

hmm, thinking some more...

from https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.mask.html

Where cond is False, keep the original value. Where True, replace with corresponding value from other.

and

>>> bool(pd.NA)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas\_libs\missing.pyx", line 360, in pandas._libs.missing.NAType.__bool__
    raise TypeError("boolean value of NA is ambiguous")
TypeError: boolean value of NA is ambiguous
>>>

maybe this isn't a bug and this should raise, but not AssertionError, see #35429 (comment)

otherwise I think the docs should to be updated to be explicit about the behaviour.

We could be in danger here of introducing a behaviour that we would want to deprecate in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i tend to agree with this

Copy link
Member Author

@dsaxton dsaxton Sep 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because mask / where are used in other places that would mean those other cases are left broken as well, e.g., #36395 (referring specifically to the DataFrame indexing example)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example in #36395 looks like its trying to use np.array([[True, pd.NA]], dtype=object) as a boolean mask, which i think should raise (thought not AssertionError)

(with 2D EAs we can make this a little more nicely behaved and have BooleanArray([[True, pd.NA]])) and get a nicer exception message, but it still isn't clear that should be allowed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think allowing indexing with nullable booleans is more useful than raising personally. This already works for single 1D arrays (the same way it would when filtering with a nullable boolean column in SQL):

[ins] In [11]: s
Out[11]: 
0    1
1    2
2    3
dtype: Int64

[ins] In [12]: mask
Out[12]: 
0     True
1    False
2     <NA>
dtype: boolean

[ins] In [13]: s[mask]
Out[13]: 
0    1
dtype: Int64

The argument for allowing it for mask / where as well for me seems to be that these are very much like indexing operations, for which it's useful to treat as above.

- Bug in :meth:`SeriesGroupBy.transform` now correctly handles missing values for ``dropna=False`` (:issue:`35014`)
-

Expand Down
2 changes: 1 addition & 1 deletion pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3110,7 +3110,7 @@ def _setitem_frame(self, key, value):

self._check_inplace_setting(value)
self._check_setitem_copy()
self._where(-key, value, inplace=True)
self._where(key, value, inplace=True, invert=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I would remove the invert keyword and just fix panda/core/series as well (and have _where always invert)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think it could be confusing if the default behavior of _where is to do essentially the opposite of what it's done previously.

This keyword is only to handle the special case of inverting after filling when dealing with missing values / misalignment, and while it's not pretty it's at least not user-facing. I'm not really aware of an elegant way to do this given that realignment is already happening inside _where.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_where is private so this doesn't matter, we want as consistent AND minimal api internally as possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try to get this to work, but changing _where globally in this way breaks a large number of tests


def _iset_item(self, loc: int, value):
self._ensure_valid_index(value)
Expand Down
19 changes: 8 additions & 11 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -8861,6 +8861,7 @@ def _where(
level=None,
errors="raise",
try_cast=False,
invert=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments above

):
"""
Equivalent to public method `where`, except that `other` is not
Expand All @@ -8880,8 +8881,7 @@ def _where(
cond = self._constructor(cond, **self._construct_axes_dict())

# make sure we are boolean
fill_value = bool(inplace)
cond = cond.fillna(fill_value)
cond = cond.fillna(False)

msg = "Boolean array expected for the condition, not {dtype}"

Expand All @@ -8898,7 +8898,7 @@ def _where(
# GH#21947 we have an empty DataFrame/Series, could be object-dtype
cond = cond.astype(bool)

cond = -cond if inplace else cond
cond = ~cond if (inplace ^ invert) else cond

# try to align with other
try_quick = True
Expand Down Expand Up @@ -9045,13 +9045,13 @@ def where(

- 'raise' : allow exceptions to be raised.
- 'ignore' : suppress exceptions. On error return original object.

try_cast : bool, default False
Try to cast the result back to the input type (if possible).

Returns
-------
Same type as caller
Original object with values replaced where `cond` is not True.

See Also
--------
Expand Down Expand Up @@ -9153,22 +9153,19 @@ def mask(
errors="raise",
try_cast=False,
):

inplace = validate_bool_kwarg(inplace, "inplace")
cond = com.apply_if_callable(cond, self)
other = com.apply_if_callable(other, self)

# see gh-21891
if not hasattr(cond, "__invert__"):
cond = np.array(cond)

return self.where(
~cond,
return self._where(
cond,
other=other,
inplace=inplace,
axis=axis,
level=level,
try_cast=try_cast,
errors=errors,
invert=True,
)

@doc(klass=_shared_doc_kwargs["klass"])
Expand Down
5 changes: 3 additions & 2 deletions pandas/core/internals/blocks.py
Original file line number Diff line number Diff line change
Expand Up @@ -2936,7 +2936,8 @@ def _extract_bool_array(mask: ArrayLike) -> np.ndarray:
# Except for BooleanArray, this is equivalent to just
# np.asarray(mask, dtype=bool)
mask = mask.to_numpy(dtype=bool, na_value=False)
else:
assert isinstance(mask, np.ndarray), type(mask)
mask = mask.astype(bool, copy=False)

assert isinstance(mask, np.ndarray), type(mask)
assert mask.dtype == bool, mask.dtype
return mask
10 changes: 10 additions & 0 deletions pandas/tests/frame/indexing/test_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -2243,6 +2243,16 @@ def test_object_casting_indexing_wraps_datetimelike():
assert isinstance(val, pd.Timedelta)


def test_indexing_with_nullable_boolean_frame():
# https://github.com/pandas-dev/pandas/issues/36395
df = pd.DataFrame({"a": pd.array([1, 2, None]), "b": pd.array([1, 2, None])})
result = df[df == 1]
expected = pd.DataFrame(
{"a": pd.array([1, None, None]), "b": pd.array([1, None, None])}
)
tm.assert_frame_equal(result, expected)


def test_lookup_deprecated():
# GH18262
df = pd.DataFrame(
Expand Down
18 changes: 17 additions & 1 deletion pandas/tests/frame/indexing/test_mask.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,9 @@
"""

import numpy as np
import pytest

from pandas import DataFrame, isna
from pandas import DataFrame, Series, isna
import pandas._testing as tm


Expand Down Expand Up @@ -83,3 +84,18 @@ def test_mask_dtype_conversion(self):
expected = bools.astype(float).mask(mask)
result = bools.mask(mask)
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize("inplace", [True, False])
def test_mask_nullable_boolean(self, inplace):
# https://github.com/pandas-dev/pandas/issues/35429
df = DataFrame([1, 2, 3])
mask = Series([True, False, None], dtype="boolean")
expected = DataFrame([999, 2, 3])

if inplace:
result = df.copy()
result.mask(mask, 999, inplace=True)
else:
result = df.mask(mask, 999, inplace=False)

tm.assert_frame_equal(result, expected)
19 changes: 17 additions & 2 deletions pandas/tests/frame/indexing/test_where.py
Original file line number Diff line number Diff line change
Expand Up @@ -159,7 +159,7 @@ def test_where_set(self, where_frame, float_string_frame):

def _check_set(df, cond, check_dtypes=True):
dfi = df.copy()
econd = cond.reindex_like(df).fillna(True)
econd = cond.reindex_like(df).fillna(False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was broken by no longer setting fill_value = bool(inplace) so had to update

expected = dfi.mask(~econd)

return_value = dfi.where(cond, np.nan, inplace=True)
Expand All @@ -169,7 +169,7 @@ def _check_set(df, cond, check_dtypes=True):
# dtypes (and confirm upcasts)x
if check_dtypes:
for k, v in df.dtypes.items():
if issubclass(v.type, np.integer) and not cond[k].all():
if issubclass(v.type, np.integer) and not econd[k].all():
v = np.dtype("float64")
assert dfi[k].dtype == v

Expand Down Expand Up @@ -642,3 +642,18 @@ def test_df_where_with_category(self, kwargs):
expected = Series(A, name="A")

tm.assert_series_equal(result, expected)

@pytest.mark.parametrize("inplace", [True, False])
def test_where_nullable_boolean_mask(self, inplace):
# https://github.com/pandas-dev/pandas/issues/35429
df = DataFrame([1, 2, 3])
mask = Series([True, False, None], dtype="boolean")
expected = DataFrame([1, 999, 999])

if inplace:
result = df.copy()
result.where(mask, 999, inplace=True)
else:
result = df.where(mask, 999, inplace=False)

tm.assert_frame_equal(result, expected)
4 changes: 2 additions & 2 deletions pandas/tests/series/indexing/test_mask.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,11 +25,11 @@ def test_mask():
s2 = -(s.abs())
rs = s2.where(~cond[:3])
rs2 = s2.mask(cond[:3])
tm.assert_series_equal(rs, rs2)
# tm.assert_series_equal(rs, rs2)

rs = s2.where(~cond[:3], -s2)
rs2 = s2.mask(cond[:3], -s2)
tm.assert_series_equal(rs, rs2)
# tm.assert_series_equal(rs, rs2)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporarily commented this and the test above which are failing. I think the expected output of these are incorrect, since when we have missing values in the mask (e.g. through misalignment), we can't expect the mask and where methods to be perfect inverses of each other.

That is, if we take the complement of the mask then we aren't doing anything to the NA values, so if we pass this to either method and both treat this as False, then we'll see different behavior. To me this is expected:

>>> import numpy as np
>>> from pandas import Series
>>> 
>>> s = Series(np.random.randn(5))
>>> cond = Series([True, False, False, True, False], index=s.index)
>>> s2 = -(s.abs())
>>> mask = cond[:3]
>>> 
>>> print(s2)
0   -1.150047
1   -1.152933
2   -0.442946
3   -0.705565
4   -0.484288
dtype: float64
>>> print(mask)
0     True
1    False
2    False
dtype: bool
>>> print(s2.where(~mask))
0         NaN
1   -1.152933
2   -0.442946
3         NaN
4         NaN
dtype: float64
>>> print(s2.mask(mask))
0         NaN
1   -1.152933
2   -0.442946
3   -0.705565
4   -0.484288
dtype: float64

What do you think @jreback ?

Copy link
Member Author

@dsaxton dsaxton Sep 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I switch the expected to use invert=True the tests pass so doing that instead


msg = "Array conditional must be same shape as self"
with pytest.raises(ValueError, match=msg):
Expand Down
16 changes: 16 additions & 0 deletions pandas/tests/series/indexing/test_where.py
Original file line number Diff line number Diff line change
Expand Up @@ -452,3 +452,19 @@ def test_where_empty_series_and_empty_cond_having_non_bool_dtypes():
ser = Series([], dtype=float)
result = ser.where([])
tm.assert_series_equal(result, ser)


@pytest.mark.parametrize("inplace", [True, False])
def test_where_nullable_boolean_mask(inplace):
# https://github.com/pandas-dev/pandas/issues/35429
ser = Series([1, 2, 3])
mask = Series([True, False, None], dtype="boolean")
expected = Series([1, 999, 999])

if inplace:
result = ser.copy()
result.where(mask, 999, inplace=True)
else:
result = ser.where(mask, 999, inplace=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

following on from previous comment.

>>> ser = pd.Series([1, 2, 3], dtype="Int64")
>>> mask = pd.Series([True, False, None], dtype="boolean")
>>> ser.where(mask, 999)
0      1
1    999
2    999
dtype: Int64

using a nullable mask with a nullable type gives the wrong output?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This output is correct IMO, but the output changes if the operation is inplace:

[ins] In [4]: s.where(mask, 999)
Out[4]: 
0      1
1    999
2    999
dtype: Int64

[ins] In [5]: s.where(mask, 999, inplace=True)

[ins] In [6]: s
Out[6]: 
0      1
1    999
2      3
dtype: Int64


tm.assert_series_equal(result, expected)