Skip to content

BUG: Don't raise for NDFrame.mask with nullable boolean #36201

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 39 commits into from
Closed

BUG: Don't raise for NDFrame.mask with nullable boolean #36201

wants to merge 39 commits into from

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Sep 7, 2020

(Need to wait until the v1.1.3 whatsnew gets merged before adding a release note.)

This is taking the approach of treating NA as False within NDFrame.mask and NDFrame.where when dealing with a nullable boolean mask.

@dsaxton dsaxton added Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Sep 7, 2020
@dsaxton dsaxton added this to the 1.1.3 milestone Sep 7, 2020
@@ -159,7 +159,7 @@ def test_where_set(self, where_frame, float_string_frame):

def _check_set(df, cond, check_dtypes=True):
dfi = df.copy()
econd = cond.reindex_like(df).fillna(True)
econd = cond.reindex_like(df).fillna(False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was broken by no longer setting fill_value = bool(inplace) so had to update

@simonjayhawkins
Copy link
Member

(Need to wait until the v1.1.3 whatsnew gets merged before adding a release note.)

ok to target 1.1.2 if ready before tagged

@dsaxton
Copy link
Member Author

dsaxton commented Sep 7, 2020

ok to target 1.1.2 if ready before tagged

I can add a v1.1.2 note just in case

@dsaxton
Copy link
Member Author

dsaxton commented Sep 7, 2020

@simonjayhawkins I think having a GitHub Action for merge conflicts like you mentioned would be super useful, do you know by chance how that could be set up?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an intricate change and not appropriate for 1.1.2

@dsaxton dsaxton removed this from the 1.1.3 milestone Sep 8, 2020
@simonjayhawkins simonjayhawkins added this to the 1.1.3 milestone Sep 9, 2020
@simonjayhawkins
Copy link
Member

i've added the 1.1.3 milestone back while the change note is still in 1.1.3.rst

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this patch needs to go in 1.2 as its a core change.

@@ -8835,7 +8834,7 @@ def _where(
# GH#21947 we have an empty DataFrame/Series, could be object-dtype
cond = cond.astype(bool)

cond = -cond if inplace else cond
cond = ~cond if inplace else cond
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are goin to change this then let's clean starting on L834.

@jreback jreback modified the milestones: 1.1.3, 1.2 Sep 11, 2020

rs = s2.where(~cond[:3], -s2)
rs2 = s2.mask(cond[:3], -s2)
tm.assert_series_equal(rs, rs2)
# tm.assert_series_equal(rs, rs2)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporarily commented this and the test above which are failing. I think the expected output of these are incorrect, since when we have missing values in the mask (e.g. through misalignment), we can't expect the mask and where methods to be perfect inverses of each other.

That is, if we take the complement of the mask then we aren't doing anything to the NA values, so if we pass this to either method and both treat this as False, then we'll see different behavior. To me this is expected:

>>> import numpy as np
>>> from pandas import Series
>>> 
>>> s = Series(np.random.randn(5))
>>> cond = Series([True, False, False, True, False], index=s.index)
>>> s2 = -(s.abs())
>>> mask = cond[:3]
>>> 
>>> print(s2)
0   -1.150047
1   -1.152933
2   -0.442946
3   -0.705565
4   -0.484288
dtype: float64
>>> print(mask)
0     True
1    False
2    False
dtype: bool
>>> print(s2.where(~mask))
0         NaN
1   -1.152933
2   -0.442946
3         NaN
4         NaN
dtype: float64
>>> print(s2.mask(mask))
0         NaN
1   -1.152933
2   -0.442946
3   -0.705565
4   -0.484288
dtype: float64

What do you think @jreback ?

Copy link
Member Author

@dsaxton dsaxton Sep 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I switch the expected to use invert=True the tests pass so doing that instead

@jbrockmendel
Copy link
Member

seems like this is a workaround because BooleanArray doesn't have a __inv__. would defining that solve this issue?

@jreback
Copy link
Contributor

jreback commented Sep 16, 2020

this is a good point @jbrockmendel

@dsaxton
Copy link
Member Author

dsaxton commented Sep 16, 2020

seems like this is a workaround because BooleanArray doesn't have a __inv__. would defining that solve this issue?

Do you mean the issue with - vs ~? I might be misunderstanding but inverting a BooleanArray seems to work okay:

[ins] In [3]: arr
Out[3]: 
<BooleanArray>
[True, False, <NA>]
Length: 3, dtype: boolean

[ins] In [4]: ~arr
Out[4]: 
<BooleanArray>
[False, True, <NA>]
Length: 3, dtype: boolean

This PR mostly became complicated because of the existing logic surrounding inplace inside _where, and when the inversion should happen in the presence of missing values.

@@ -3104,7 +3104,7 @@ def _setitem_frame(self, key, value):

self._check_inplace_setting(value)
self._check_setitem_copy()
self._where(-key, value, inplace=True)
self._where(key, value, inplace=True, invert=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I would remove the invert keyword and just fix panda/core/series as well (and have _where always invert)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think it could be confusing if the default behavior of _where is to do essentially the opposite of what it's done previously.

This keyword is only to handle the special case of inverting after filling when dealing with missing values / misalignment, and while it's not pretty it's at least not user-facing. I'm not really aware of an elegant way to do this given that realignment is already happening inside _where.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_where is private so this doesn't matter, we want as consistent AND minimal api internally as possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try to get this to work, but changing _where globally in this way breaks a large number of tests

@@ -8857,13 +8857,18 @@ def _where(
level=None,
errors="raise",
try_cast=False,
invert=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments above

@simonjayhawkins
Copy link
Member

this is an intricate change and not appropriate for 1.1.2

As this PR evolves, can we review the milestone when this PR is ready but before this is merged.

@@ -305,6 +305,7 @@ Indexing
Missing
^^^^^^^

- Bug in :meth:`DataFrame.mask` and :meth:`DataFrame.where` raising an ``AssertionError`` when using a nullable boolean mask (:issue:`35429`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is taking the approach of treating NA as False within NDFrame.mask and NDFrame.where when dealing with a nullable boolean mask.

hmm, thinking some more...

from https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.mask.html

Where cond is False, keep the original value. Where True, replace with corresponding value from other.

and

>>> bool(pd.NA)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas\_libs\missing.pyx", line 360, in pandas._libs.missing.NAType.__bool__
    raise TypeError("boolean value of NA is ambiguous")
TypeError: boolean value of NA is ambiguous
>>>

maybe this isn't a bug and this should raise, but not AssertionError, see #35429 (comment)

otherwise I think the docs should to be updated to be explicit about the behaviour.

We could be in danger here of introducing a behaviour that we would want to deprecate in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i tend to agree with this

Copy link
Member Author

@dsaxton dsaxton Sep 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because mask / where are used in other places that would mean those other cases are left broken as well, e.g., #36395 (referring specifically to the DataFrame indexing example)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example in #36395 looks like its trying to use np.array([[True, pd.NA]], dtype=object) as a boolean mask, which i think should raise (thought not AssertionError)

(with 2D EAs we can make this a little more nicely behaved and have BooleanArray([[True, pd.NA]])) and get a nicer exception message, but it still isn't clear that should be allowed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think allowing indexing with nullable booleans is more useful than raising personally. This already works for single 1D arrays (the same way it would when filtering with a nullable boolean column in SQL):

[ins] In [11]: s
Out[11]: 
0    1
1    2
2    3
dtype: Int64

[ins] In [12]: mask
Out[12]: 
0     True
1    False
2     <NA>
dtype: boolean

[ins] In [13]: s[mask]
Out[13]: 
0    1
dtype: Int64

The argument for allowing it for mask / where as well for me seems to be that these are very much like indexing operations, for which it's useful to treat as above.

result = ser.copy()
result.where(mask, 999, inplace=True)
else:
result = ser.where(mask, 999, inplace=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

following on from previous comment.

>>> ser = pd.Series([1, 2, 3], dtype="Int64")
>>> mask = pd.Series([True, False, None], dtype="boolean")
>>> ser.where(mask, 999)
0      1
1    999
2    999
dtype: Int64

using a nullable mask with a nullable type gives the wrong output?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This output is correct IMO, but the output changes if the operation is inplace:

[ins] In [4]: s.where(mask, 999)
Out[4]: 
0      1
1    999
2    999
dtype: Int64

[ins] In [5]: s.where(mask, 999, inplace=True)

[ins] In [6]: s
Out[6]: 
0      1
1    999
2      3
dtype: Int64

@@ -3104,7 +3104,7 @@ def _setitem_frame(self, key, value):

self._check_inplace_setting(value)
self._check_setitem_copy()
self._where(-key, value, inplace=True)
self._where(key, value, inplace=True, invert=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_where is private so this doesn't matter, we want as consistent AND minimal api internally as possible.

@dsaxton
Copy link
Member Author

dsaxton commented Nov 6, 2020

Closing for now since it seems it isn't clear if / how we want to do this

@dsaxton dsaxton closed this Nov 6, 2020
@dsaxton dsaxton deleted the mask-with-nullable-boolean branch March 30, 2021 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants