-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Don't raise for NDFrame.mask with nullable boolean #36201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -159,7 +159,7 @@ def test_where_set(self, where_frame, float_string_frame): | |||
|
|||
def _check_set(df, cond, check_dtypes=True): | |||
dfi = df.copy() | |||
econd = cond.reindex_like(df).fillna(True) | |||
econd = cond.reindex_like(df).fillna(False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test was broken by no longer setting fill_value = bool(inplace) so had to update
ok to target 1.1.2 if ready before tagged |
I can add a v1.1.2 note just in case |
@simonjayhawkins I think having a GitHub Action for merge conflicts like you mentioned would be super useful, do you know by chance how that could be set up? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an intricate change and not appropriate for 1.1.2
i've added the 1.1.3 milestone back while the change note is still in 1.1.3.rst |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this patch needs to go in 1.2 as its a core change.
pandas/core/generic.py
Outdated
@@ -8835,7 +8834,7 @@ def _where( | |||
# GH#21947 we have an empty DataFrame/Series, could be object-dtype | |||
cond = cond.astype(bool) | |||
|
|||
cond = -cond if inplace else cond | |||
cond = ~cond if inplace else cond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are goin to change this then let's clean starting on L834.
|
||
rs = s2.where(~cond[:3], -s2) | ||
rs2 = s2.mask(cond[:3], -s2) | ||
tm.assert_series_equal(rs, rs2) | ||
# tm.assert_series_equal(rs, rs2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temporarily commented this and the test above which are failing. I think the expected output of these are incorrect, since when we have missing values in the mask (e.g. through misalignment), we can't expect the mask and where methods to be perfect inverses of each other.
That is, if we take the complement of the mask then we aren't doing anything to the NA values, so if we pass this to either method and both treat this as False, then we'll see different behavior. To me this is expected:
>>> import numpy as np
>>> from pandas import Series
>>>
>>> s = Series(np.random.randn(5))
>>> cond = Series([True, False, False, True, False], index=s.index)
>>> s2 = -(s.abs())
>>> mask = cond[:3]
>>>
>>> print(s2)
0 -1.150047
1 -1.152933
2 -0.442946
3 -0.705565
4 -0.484288
dtype: float64
>>> print(mask)
0 True
1 False
2 False
dtype: bool
>>> print(s2.where(~mask))
0 NaN
1 -1.152933
2 -0.442946
3 NaN
4 NaN
dtype: float64
>>> print(s2.mask(mask))
0 NaN
1 -1.152933
2 -0.442946
3 -0.705565
4 -0.484288
dtype: float64
What do you think @jreback ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I switch the expected to use invert=True the tests pass so doing that instead
seems like this is a workaround because BooleanArray doesn't have a |
this is a good point @jbrockmendel |
Do you mean the issue with - vs ~? I might be misunderstanding but inverting a BooleanArray seems to work okay: [ins] In [3]: arr
Out[3]:
<BooleanArray>
[True, False, <NA>]
Length: 3, dtype: boolean
[ins] In [4]: ~arr
Out[4]:
<BooleanArray>
[False, True, <NA>]
Length: 3, dtype: boolean This PR mostly became complicated because of the existing logic surrounding inplace inside _where, and when the inversion should happen in the presence of missing values. |
@@ -3104,7 +3104,7 @@ def _setitem_frame(self, key, value): | |||
|
|||
self._check_inplace_setting(value) | |||
self._check_setitem_copy() | |||
self._where(-key, value, inplace=True) | |||
self._where(key, value, inplace=True, invert=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I would remove the invert keyword and just fix panda/core/series as well (and have _where always invert)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think it could be confusing if the default behavior of _where
is to do essentially the opposite of what it's done previously.
This keyword is only to handle the special case of inverting after filling when dealing with missing values / misalignment, and while it's not pretty it's at least not user-facing. I'm not really aware of an elegant way to do this given that realignment is already happening inside _where
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_where is private so this doesn't matter, we want as consistent AND minimal api internally as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can try to get this to work, but changing _where globally in this way breaks a large number of tests
@@ -8857,13 +8857,18 @@ def _where( | |||
level=None, | |||
errors="raise", | |||
try_cast=False, | |||
invert=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comments above
As this PR evolves, can we review the milestone when this PR is ready but before this is merged. |
@@ -305,6 +305,7 @@ Indexing | |||
Missing | |||
^^^^^^^ | |||
|
|||
- Bug in :meth:`DataFrame.mask` and :meth:`DataFrame.where` raising an ``AssertionError`` when using a nullable boolean mask (:issue:`35429`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is taking the approach of treating NA as False within NDFrame.mask and NDFrame.where when dealing with a nullable boolean mask.
hmm, thinking some more...
from https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.mask.html
Where cond is False, keep the original value. Where True, replace with corresponding value from other.
and
>>> bool(pd.NA)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pandas\_libs\missing.pyx", line 360, in pandas._libs.missing.NAType.__bool__
raise TypeError("boolean value of NA is ambiguous")
TypeError: boolean value of NA is ambiguous
>>>
maybe this isn't a bug and this should raise, but not AssertionError, see #35429 (comment)
otherwise I think the docs should to be updated to be explicit about the behaviour.
We could be in danger here of introducing a behaviour that we would want to deprecate in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i tend to agree with this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because mask / where are used in other places that would mean those other cases are left broken as well, e.g., #36395 (referring specifically to the DataFrame indexing example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example in #36395 looks like its trying to use np.array([[True, pd.NA]], dtype=object)
as a boolean mask, which i think should raise (thought not AssertionError)
(with 2D EAs we can make this a little more nicely behaved and have BooleanArray([[True, pd.NA]])
) and get a nicer exception message, but it still isn't clear that should be allowed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think allowing indexing with nullable booleans is more useful than raising personally. This already works for single 1D arrays (the same way it would when filtering with a nullable boolean column in SQL):
[ins] In [11]: s
Out[11]:
0 1
1 2
2 3
dtype: Int64
[ins] In [12]: mask
Out[12]:
0 True
1 False
2 <NA>
dtype: boolean
[ins] In [13]: s[mask]
Out[13]:
0 1
dtype: Int64
The argument for allowing it for mask / where as well for me seems to be that these are very much like indexing operations, for which it's useful to treat as above.
result = ser.copy() | ||
result.where(mask, 999, inplace=True) | ||
else: | ||
result = ser.where(mask, 999, inplace=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
following on from previous comment.
>>> ser = pd.Series([1, 2, 3], dtype="Int64")
>>> mask = pd.Series([True, False, None], dtype="boolean")
>>> ser.where(mask, 999)
0 1
1 999
2 999
dtype: Int64
using a nullable mask with a nullable type gives the wrong output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This output is correct IMO, but the output changes if the operation is inplace:
[ins] In [4]: s.where(mask, 999)
Out[4]:
0 1
1 999
2 999
dtype: Int64
[ins] In [5]: s.where(mask, 999, inplace=True)
[ins] In [6]: s
Out[6]:
0 1
1 999
2 3
dtype: Int64
@@ -3104,7 +3104,7 @@ def _setitem_frame(self, key, value): | |||
|
|||
self._check_inplace_setting(value) | |||
self._check_setitem_copy() | |||
self._where(-key, value, inplace=True) | |||
self._where(key, value, inplace=True, invert=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_where is private so this doesn't matter, we want as consistent AND minimal api internally as possible.
Closing for now since it seems it isn't clear if / how we want to do this |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
(Need to wait until the v1.1.3 whatsnew gets merged before adding a release note.)This is taking the approach of treating NA as False within NDFrame.mask and NDFrame.where when dealing with a nullable boolean mask.