Skip to content

BUG: NA value doesn't match mask condition, still masked #52955

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks done
shobsi opened this issue Apr 27, 2023 · 6 comments
Open
3 tasks done

BUG: NA value doesn't match mask condition, still masked #52955

shobsi opened this issue Apr 27, 2023 · 6 comments
Assignees
Labels
Bug Conditionals E.g. where, mask, case_when NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@shobsi
Copy link

shobsi commented Apr 27, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> s = pd.Series([123456789, -987654321, 314159, pd.NA, -234892, 55555], name='int64_col', dtype=pd.Int64Dtype())
>>> s
0     123456789
1    -987654321
2        314159
3          <NA>
4       -234892
5         55555
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1)
0       <NA>
1       <NA>
2       <NA>
3       <NA>
4    -234892
5       <NA>
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1, -1)
0         -1
1         -1
2         -1
3         -1
4    -234892
5         -1
Name: int64_col, dtype: Int64
>>> pd.__version__
'2.0.1'
>>>

Issue Description

Series.mask API is masking NA which does not match the mask condition. See the repro example.

Expected Behavior

>>> import pandas as pd
>>> s = pd.Series([123456789, -987654321, 314159, pd.NA, -234892, 55555], name='int64_col', dtype=pd.Int64Dtype())
>>> s
0     123456789
1    -987654321
2        314159
3          <NA>
4       -234892
5         55555
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1)
0       <NA>
1       <NA>
2       <NA>
3       <NA>
4    -234892
5       <NA>
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1, -1)
0         -1
1         -1
2         -1
3         <NA>
4    -234892
5         -1
Name: int64_col, dtype: Int64

Installed Versions

pd.__version__
'2.0.1'

@shobsi shobsi added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 27, 2023
@topper-123 topper-123 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff NA - MaskedArrays Related to pd.NA and nullable extension arrays Filters e.g. head, tail, nth and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 4, 2023
@topper-123
Copy link
Contributor

topper-123 commented May 4, 2023

Thanks for the bug report, @shobsi .

Yes, this is clearly not the intended behavior. A PR would be welcome.

@Charlie-XIAO
Copy link
Contributor

Hi, I'm interested in this issue and looked into the code.

pandas/pandas/core/generic.py

Lines 9888 to 9890 in 607316c

# make sure we are boolean
fill_value = bool(inplace)
cond = cond.fillna(fill_value)

I thinks here is the suspicious line: it is filling NA values with inplace, which I don't understand why. Wouldn't this result in potentially different behaviors when using inplace=True and inplace=False?

As for this specific issue, I'm thinking that we can call fillna on cond before we pass it to _where. The problem is, cond could be any array-like, and only NDFrame has the fillna method. Nevertheless, I think we are not allowed to use objects that cannot be evaluated as a boolean when using in cond when cond is not an NDFrame. Therefore, can we do the following?

    def where(self, cond, other=np.nan, *, inplace=False, axis=None, level=None):
        other = common.apply_if_callable(other, self)

        if isinstance(cond, NDFrame):
            cond = cond.fillna(True)

        return self._where(cond, other, inplace, axis, level)

Not sure if this is a neat approach. Though it passed all test cases that either calls where or mask, I'm not sure if it can cause other potential problems.

@topper-123
Copy link
Contributor

topper-123 commented May 6, 2023

I think that sound great that you want to take this! I agree that fill_value = bool(inplace) looks very strange.

_where may be called from internal functions directly, so I think doing the fillna in _where would be better than in where IMO.

@Charlie-XIAO
Copy link
Contributor

Ah true, thanks for your reminder! I'll open a PR to add those lines in _where. As for fill_value = bool(inplace), I may try to construct some example to see if it can cause inconsistencies when using inplace=True and inplace=False and open another issue if so.

@Charlie-XIAO
Copy link
Contributor

take

@Charlie-XIAO
Copy link
Contributor

Charlie-XIAO commented May 7, 2023

@topper-123 I think I may understand now what fill_value = bool(inplace) is trying to do. For instance, _setitem_frame is calling _where with inplace=True, so that pd.NA will propagate if one does something like the following:

>>> df = pd.DataFrame(np.random.random((3, 3)), dtype=pd.Float64Dtype())
>>> df[0][0] = pd.NA
>>> df
          0         1         2
0      <NA>  0.609241  0.419094
1  0.274784  0.342904  0.026101
2  0.670259  0.218889  0.177126
>>> df[df >= 0.5] = 0
>>> df
          0         1         2
0      <NA>       0.0  0.419094
1  0.274784  0.342904  0.026101
2       0.0  0.218889  0.177126

But still I doubt this will cause some problems.

@mroeschke mroeschke added Conditionals E.g. where, mask, case_when and removed Filters e.g. head, tail, nth Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Aug 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Conditionals E.g. where, mask, case_when NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants