BUG: Add fillna at the beginning of _where not to fill NA. #60729 #60772

Open
wants to merge 34 commits into
base: main
Changes from 13 commits
Commits
bc9a942
BUG: Add fillna so that cond doesnt contain NA at the beginning of _w…
sanggon6107 Jan 23, 2025
558569f
TST: Add tests for mask with NA. (#60729)
sanggon6107 Jan 23, 2025
bbbc720
BUG: Fix _where to make np.ndarray mutable. (#60729)
sanggon6107 Jan 23, 2025
e2f32cb
DOC: Add documentation regarding the bug (#60729)
sanggon6107 Jan 23, 2025
6fd8986
Merge branch 'main' into add-mask-fillna
sanggon6107 Jan 25, 2025
d2d5f62
ENH: Optimze test_mask_na()
sanggon6107 Jan 25, 2025
475f2d1
BUG: Fix a bug in test_mask_na() (#60729)
sanggon6107 Jan 25, 2025
db30b58
Update doc/source/whatsnew/v3.0.0.rst
sanggon6107 Feb 9, 2025
55fe420
Merge branch 'main' into add-mask-fillna
sanggon6107 Mar 3, 2025
cb94cf7
Add test arguments for test_mask_na
sanggon6107 Mar 3, 2025
71e442e
Fix whatsnew
sanggon6107 Mar 3, 2025
b6bd3af
Fix test failures by adding importorskip
sanggon6107 Mar 3, 2025
8bac997
Fill True when tuple or list cond has np.nan/pd.NA
sanggon6107 Mar 3, 2025
89bc1b4
Merge branch 'main' into add-mask-fillna
sanggon6107 Mar 4, 2025
f154cf5
Optimize _where
sanggon6107 Mar 4, 2025
eed6121
Optimize test_mask_na
sanggon6107 Mar 4, 2025
9ac81f0
Add np.array for read-only ndarray
sanggon6107 Mar 5, 2025
7e3fd3a
Merge branch 'main' into add-mask-fillna
sanggon6107 Mar 5, 2025
8c5ffff
Update generic.py
sanggon6107 Mar 5, 2025
5516517
Revert generic.py
sanggon6107 Mar 6, 2025
2437ce2
Merge branch 'main' into add-mask-fillna
sanggon6107 Mar 7, 2025
9556aa4
Replace np.array with fillna
sanggon6107 Mar 7, 2025
b64b8a7
Correct the unintended deletion
sanggon6107 Mar 7, 2025
9574746
Merge branch 'main' into add-mask-fillna
sanggon6107 Mar 20, 2025
c073c0b
Add test for list and ndarray
sanggon6107 Mar 24, 2025
4eea08e
Handle list with NA
sanggon6107 Mar 27, 2025
bbc5612
Fix code checks
sanggon6107 Mar 27, 2025
0851593
Fix type
sanggon6107 Mar 27, 2025
98fb602
Optimize operation
sanggon6107 Mar 28, 2025
915b8a7
Prevent a list with np.nan converting to float
sanggon6107 Apr 6, 2025
7611f59
Add test for a list with np.nan
sanggon6107 Apr 6, 2025
88b0530
Merge branch 'main' into add-mask-fillna
sanggon6107 Apr 20, 2025
044d0a9
fillna after constructor
sanggon6107 Apr 20, 2025
601f6c9
Parametrize cond type
sanggon6107 Apr 20, 2025
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -695,6 +695,7 @@ Indexing
- Bug in :meth:`DataFrame.from_records` throwing a ``ValueError`` when passed an empty list in ``index`` (:issue:`58594`)
- Bug in :meth:`DataFrame.loc` with inconsistent behavior of loc-set with 2 given indexes to Series (:issue:`59933`)
- Bug in :meth:`MultiIndex.insert` when a new value inserted to a datetime-like level gets cast to ``NaT`` and fails indexing (:issue:`60388`)
- Bug in :meth:`Series.mask` unexpectedly filling ``pd.NA`` (:issue:`60729`)
- Bug in printing :attr:`Index.names` and :attr:`MultiIndex.levels` would not escape single quotes (:issue:`60190`)

Missing
13 changes: 12 additions & 1 deletion pandas/core/generic.py
@@ -9698,8 +9698,19 @@ def _where(
if axis is not None:
    axis = self._get_axis_number(axis)

# align the cond to same shape as myself
cond = common.apply_if_callable(cond, self)

# We should not be filling NA. See GH#60729
Member

Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either; the missing values should propagate for all types. We may just be missing coverage for the NaN case (which should be added to the test).

Author

@sanggon6107 sanggon6107 Jan 25, 2025

Thanks for the feedback, @WillAyd.
I thought we could make the values propagate by filling cond with True, since _where() ultimately keeps the values in self wherever cond is True.
Even if I don't fill those values here, _where() would call fillna() with bool(inplace) as the fill value in the code below. That's also why the result varies depending on whether inplace=True or not.

pandas/pandas/core/generic.py

Lines 9695 to 9698 in e3b2de8

# make sure we are boolean
fill_value = bool(inplace)
cond = cond.fillna(fill_value)
cond = cond.infer_objects()

Could you explain in more detail what you mean by "propagate for all types"? Do you mean we need to keep NA as it is even after this line?
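
To make the argument in this comment concrete, here is a minimal, library-free sketch of the reasoning. It is a model, not pandas code: `NA`, `where_model`, and `fill_cond` are hypothetical names, and plain lists stand in for a nullable Series.

```python
NA = None  # stand-in for pd.NA in this model (assumption, not pandas itself)


def where_model(values, keep, other):
    """Keep values[i] where keep[i] is True, otherwise use other."""
    return [v if k else other for v, k in zip(values, keep)]


def fill_cond(keep, fill_value):
    """Replace missing entries of the boolean condition with fill_value."""
    return [fill_value if k is NA else k for k in keep]


values = [NA, 1, 2, NA, 3, 4, NA]
# mask(values <= 2, -99) behaves like where(~(values <= 2), -99);
# comparing NA with a number yields NA, so the condition has gaps.
keep = [NA if v is NA else not (v <= 2) for v in values]

# Gaps filled with True: the caller's NA entries are kept.
filled_true = where_model(values, fill_cond(keep, True), -99)
# Gaps filled with False: the caller's NA entries are overwritten.
filled_false = where_model(values, fill_cond(keep, False), -99)

print(filled_true)   # [None, -99, -99, None, 3, 4, None]
print(filled_false)  # [-99, -99, -99, -99, 3, 4, -99]
```

In this model, only the True fill reproduces the expected result from the new test (NA positions stay NA).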

Author

Hi @WillAyd,

I've done some further investigation on this, but I still believe the current code is the simplest way to make the missing values propagate.
If we want to let NA propagate without calling fillna() here, too many code changes might be needed. See the code below:

  1. We need to change the code below so that we don't fill the missing values when the caller is where() or mask(). If we don't, fillna() will fill them with bool(inplace).

pandas/pandas/core/generic.py

Lines 9695 to 9698 in f1441b2

# make sure we are boolean
fill_value = bool(inplace)
cond = cond.fillna(fill_value)
cond = cond.infer_objects()

  2. We need to change the code below as well, since to_numpy() fills the missing values with fill_value (that is, bool(inplace)) when cond is a DataFrame.

pandas/pandas/core/generic.py

Lines 9703 to 9716 in f1441b2

if not isinstance(cond, ABCDataFrame):
    # This is a single-dimensional object.
    if not is_bool_dtype(cond):
        raise TypeError(msg.format(dtype=cond.dtype))
else:
    for _dt in cond.dtypes:
        if not is_bool_dtype(_dt):
            raise TypeError(msg.format(dtype=_dt))
    if cond._mgr.any_extension_types:
        # GH51574: avoid object ndarray conversion later on
        cond = cond._constructor(
            cond.to_numpy(dtype=bool, na_value=fill_value),
            **cond._construct_axes_dict(),
        )

  3. Since extract_bool_array() fills the missing values with na_value=False in EABackedBlock.where(), we might need to find every NA position in cond before calling this function (using isna(), for example) and then implement additional behaviour in ExtensionArray._where() to make those values propagate.

def where(self, other, cond) -> list[Block]:
    arr = self.values.T
    cond = extract_bool_array(cond)

def extract_bool_array(mask: ArrayLike) -> npt.NDArray[np.bool_]:
    """
    If we have a SparseArray or BooleanArray, convert it to ndarray[bool].
    """
    if isinstance(mask, ExtensionArray):
        # We could have BooleanArray, Sparse[bool], ...
        # Except for BooleanArray, this is equivalent to just
        # np.asarray(mask, dtype=bool)
        mask = mask.to_numpy(dtype=bool, na_value=False)
    mask = np.asarray(mask, dtype=bool)
    return mask
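
The alternative described in point 3 could look roughly like the following library-free sketch. Everything here is hypothetical: `NA`, `to_bool_model`, and `where_keep_na` are illustration names only, and a real change would have to live in pandas' extension-array code rather than in list operations.

```python
NA = None  # stand-in for pd.NA in this model (assumption, not pandas itself)


def to_bool_model(cond):
    """Mimic a boolean conversion that turns missing entries into False."""
    return [False if c is NA else bool(c) for c in cond]


def where_keep_na(values, cond, other):
    """where() that remembers NA positions in cond and leaves them untouched."""
    # Record the NA positions before the boolean conversion erases them.
    na_positions = {i for i, c in enumerate(cond) if c is NA}
    bool_cond = to_bool_model(cond)
    result = [v if c else other for v, c in zip(values, bool_cond)]
    # Extra step: restore the caller's value wherever cond was missing.
    for i in na_positions:
        result[i] = values[i]
    return result


values = [NA, 1, 2, NA, 3, 4, NA]
cond = [NA, False, False, NA, True, True, NA]
result = where_keep_na(values, cond, -99)
print(result)  # [None, -99, -99, None, 3, 4, None]
```

The extra bookkeeping is the cost this comment is pointing at: in the model, filling cond with True up front gives the same result without tracking NA positions separately.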

If _where() fills the missing values in cond anyway, I don't think we need to reject the current code change. Could you give me some feedback?

Member

@rhshadrach rhshadrach Mar 1, 2025

@WillAyd

Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either - the missing values should propagate for all types.

Filling the missing values in cond with True lets the missing value in the caller propagate. Leaving them unfilled is what fails to propagate the caller's missing value.

if isinstance(cond, np.ndarray):
    cond = np.array(cond)
    cond[isna(cond)] = True
elif isinstance(cond, NDFrame):
    cond = cond.fillna(True)
Member

@rhshadrach rhshadrach Apr 12, 2025

Below, we call fillna using the value of inplace. Using True, as done here, looks correct; using inplace seems to be a bug. On main:

s = Series([1.0, 2.0, 3.0, 4.0])
cond = Series([True, False])
print(s.mask(cond))
# 0    NaN
# 1    2.0
# 2    NaN
# 3    NaN
# dtype: float64
s.mask(cond, inplace=True)
print(s)
# 0    NaN
# 1    2.0
# 2    3.0
# 3    4.0
# dtype: float64
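
The discrepancy above can be reproduced with a small, library-free model of the steps involved. This is a sketch under assumptions, not pandas code: `mask_model` is a hypothetical name, lists stand in for Series, None stands in for the gaps left by alignment, and the function returns a new list instead of modifying the caller.

```python
import math


def mask_model(values, cond, inplace):
    """Model of mask() when cond is shorter than the caller."""
    # mask(cond) is evaluated as where(~cond).
    keep = [not c for c in cond]
    # Alignment to the caller's index leaves gaps (None) for missing labels.
    keep += [None] * (len(values) - len(keep))
    # The gaps are filled with bool(inplace), as in the snippet quoted earlier.
    fill_value = bool(inplace)
    keep = [fill_value if k is None else k for k in keep]
    return [v if k else math.nan for v, k in zip(values, keep)]


values = [1.0, 2.0, 3.0, 4.0]
cond = [True, False]

not_inplace = mask_model(values, cond, inplace=False)
in_place = mask_model(values, cond, inplace=True)
print(not_inplace)  # [nan, 2.0, nan, nan]
print(in_place)     # [nan, 2.0, 3.0, 4.0]
```

The only difference between the two calls is the fill value used for the unaligned positions, which is why the result depends on inplace.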

Author

Thanks @rhshadrach,
I think align at L9714 sometimes returns a DataFrame with NA in it, and those values need to be filled with either True or False depending on the caller. (__setitem__ requires the former, which is why it calls _where() with inplace=True.)

My suggestion is to add an argument to _where() so that the caller can decide whether to fill the missing values with True after align. Would this change be acceptable?

elif isinstance(cond, (list, tuple)):
Member

@rhshadrach rhshadrach Mar 4, 2025

The docs for where state that cond can be "list-like", so we should be using is_list_like instead of this condition. However, can you instead move this section so that it's combined with the if block on L9714 immediately below?
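
As an illustration of why a list-like check is broader than `isinstance(cond, (list, tuple))`, here is a simplified stand-in. `looks_list_like` is a hypothetical helper that only approximates pandas' `is_list_like`; the real function handles more cases (zero-dimensional arrays, for example), so treat this as a sketch.

```python
from collections import deque
from collections.abc import Iterable


def looks_list_like(obj):
    """Rough approximation: iterable, but not a string or bytes."""
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))


candidates = {
    "list": [True, False],
    "tuple": (True, False),
    "deque": deque([True, False]),
    "range": range(2),
    "str": "ab",
}

for name, obj in candidates.items():
    narrow = isinstance(obj, (list, tuple))
    broad = looks_list_like(obj)
    print(f"{name}: isinstance check={narrow}, list-like check={broad}")
```

Containers such as deque or range pass the broader check but would be skipped by the isinstance test.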

Author

@sanggon6107 sanggon6107 Mar 4, 2025

I've changed the code, but it seems this causes test failures. I'll convert the PR to draft and revise the code.

Author

Just confirmed that all the tests pass. Thanks!

    cond = np.array(cond)
    cond[isna(cond)] = True

# align the cond to same shape as myself
if isinstance(cond, NDFrame):
    # CoW: Make sure reference is not kept alive
    if cond.ndim == 1 and self.ndim == 2:
16 changes: 15 additions & 1 deletion pandas/tests/series/indexing/test_mask.py
@@ -1,7 +1,9 @@
import numpy as np
import pytest

from pandas import Series
from pandas import (
    Series,
)
Member

Can you revert this change?

Author

@sanggon6107 sanggon6107 Mar 4, 2025

Code reverted. Thanks!

import pandas._testing as tm


@@ -67,3 +69,15 @@ def test_mask_inplace():
rs = s.copy()
rs.mask(cond, -s, inplace=True)
tm.assert_series_equal(rs, s.mask(cond, -s))


@pytest.mark.parametrize("dtype", ["Int64", "int64[pyarrow]"])
def test_mask_na(dtype):
    if dtype == "int64[pyarrow]":
        pytest.importorskip("pyarrow")
Member

Instead, can you change the parametrization to be:

pytest.param("int64[pyarrow]", marks=td.skip_if_no("pyarrow"))

Author

Thanks for the suggestion. I just changed the code.

    # We should not be filling pd.NA. See GH#60729
    series = Series([None, 1, 2, None, 3, 4, None], dtype=dtype)
    result = series.mask(series <= 2, -99)
Member

Can you also test the cases where the condition is an ndarray and a list?

Author

Thanks @rhshadrach, I added tests for a list and an ndarray. Could you review the code change?

    expected = Series([None, -99, -99, None, 3, 4, None], dtype=dtype)

    tm.assert_series_equal(result, expected)