TST: Test for Dataframe.replace when column contains pd.NA (#47480) #49783
vsbits
commented
Nov 19, 2022
- closes BUG: DataFrame.replace fails to replace value when column contains pd.NA #47480
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
        df.at[0, "A"] = pd.NA
        expected = df.copy()
        df["A"].replace(to_replace=1, value=100, inplace=True)
        expected.at[1, "A"] = 100
Can you please construct expected explicitly?
No problem. Just pushed the edit.
@@ -1503,6 +1503,13 @@ def test_replace_value_none_dtype_numeric(self, val):
        result = df.replace({val: None})
        tm.assert_frame_equal(result, expected)

    def test_replace_in_col_containing_na(self):
        # GH#47480
        df = DataFrame({"A": [pd.NA, 1, 2]}, dtype="Int64")
We will need two tests if you want to construct df like this. The DataFrame in the issue had dtype object.
If you set pd.NA into a column with dtype="int64", it gets cast to object.
Edit: You could parametrize with dtype in [object, "Int64"]. I think it is worth having both tests.
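For contrast with the plain int64 case mentioned above, here is a minimal sketch (not part of the PR) showing that the nullable "Int64" extension dtype stores pd.NA without changing dtype. The variable name `s` is illustrative. What a plain int64 column is cast to is deliberately not asserted here, since that may depend on the pandas version.

```python
import pandas as pd

# Nullable extension dtype ("Int64", capital I): pd.NA is a native missing value.
s = pd.Series([0, 1, 2], dtype="Int64")
s.iloc[0] = pd.NA

# The dtype is unchanged and the first element is reported as missing.
print(s.dtype)
print(s.isna().tolist())
```

This is why the object and "Int64" cases are worth testing separately: they take different storage paths for the same missing value.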
Question: Is df = pd.DataFrame({"A": [pd.NA, 1, 2]}, dtype=dtype) a valid way of constructing a DataFrame or should it be avoided? Because when dtype is Int64 or Float64 it works just fine, but if it is None or explicitly object, the replace method raises an error:
dtype = <class 'object'>

    @pytest.mark.parametrize("dtype", [object, "Int64"])
    def test_replace_in_col_containing_na(self, dtype):
        # GH#47480
        df = DataFrame({"A": [pd.NA, 1, 2]}, dtype=dtype)
>       df["A"].replace(to_replace=1, value=100, inplace=True)

tests/frame/methods/test_replace.py:1511:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
core/series.py:5104: in replace
    return super().replace(
core/generic.py:7211: in replace
    new_data = self._mgr.replace(
core/internals/managers.py:464: in replace
    return self.apply(
core/internals/managers.py:350: in apply
    applied = getattr(b, f)(**kwargs)
core/internals/blocks.py:561: in replace
    mask = missing.mask_missing(values, to_replace)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arr = array([<NA>, 1, 2], dtype=object), values_to_mask = array(1)

    def mask_missing(arr: ArrayLike, values_to_mask) -> npt.NDArray[np.bool_]:
        """
        Return a masking array of same size/shape as arr
        with entries equaling any member of values_to_mask set to True

        Parameters
        ----------
        arr : ArrayLike
        values_to_mask: list, tuple, or scalar

        Returns
        -------
        np.ndarray[bool]
        """
        # When called from Block.replace/replace_list, values_to_mask is a scalar
        #  known to be holdable by arr.
        # When called from Series._single_replace, values_to_mask is tuple or list
        dtype, values_to_mask = infer_dtype_from(values_to_mask)
        # error: Argument "dtype" to "array" has incompatible type "Union[dtype[Any],
        # ExtensionDtype]"; expected "Union[dtype[Any], None, type, _SupportsDType, str,
        # Union[Tuple[Any, int], Tuple[Any, Union[int, Sequence[int]]], List[Any],
        # _DTypeDict, Tuple[Any, Any]]]"
        values_to_mask = np.array(values_to_mask, dtype=dtype)  # type: ignore[arg-type]

        na_mask = isna(values_to_mask)
        nonna = values_to_mask[~na_mask]

        # GH 21977
        mask = np.zeros(arr.shape, dtype=bool)
        for x in nonna:
            if is_numeric_v_string_like(arr, x):
                # GH#29553 prevent numpy deprecation warnings
                pass
            else:
                new_mask = arr == x
                if not isinstance(new_mask, np.ndarray):
                    # usually BooleanArray
>                   new_mask = new_mask.to_numpy(dtype=bool, na_value=False)
E                   AttributeError: 'bool' object has no attribute 'to_numpy'
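The traceback shows arr == x yielding a plain bool rather than an array. A likely contributing factor, offered here as an assumption and not confirmed in the thread, is that comparing pd.NA with a scalar propagates NA, and NA cannot be coerced to a bool. The sketch below demonstrates that behavior and a hypothetical NA-safe element-wise mask. `na_safe_mask` is an illustrative helper, not a pandas function and not the fix applied upstream.

```python
import pandas as pd

# Comparing pd.NA with a scalar does not give True/False; it propagates NA.
cmp = pd.NA == 1
print(cmp is pd.NA)

# Coercing that result to bool is ambiguous, so pandas raises TypeError.
try:
    bool(cmp)
    coercible = True
except TypeError:
    coercible = False
print(coercible)


def na_safe_mask(values, target):
    """Hypothetical helper: True where an element equals target, False for pd.NA."""
    return [(v is not pd.NA) and bool(v == target) for v in values]


mask = na_safe_mask([pd.NA, 1, 2], 1)
print(mask)
```

Checking for pd.NA before comparing avoids ever asking NA for a truth value, which is the step that an element-wise comparison over an object array cannot skip.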
> We will need two tests if you want to construct df like this. The DataFrame in the issue had dtype object.
> If you set pd.NA into a column with dtype="int64" it gets cast to object.

Trying it here, it looks like it is cast to float64 instead. Is that the expected behavior?
>>> df = pd.DataFrame({'A': [0, 1, 2]})
>>> df['A'].dtypes
dtype('int64')
>>> df.at[0, 'A'] = pd.NA
>>> df['A'].dtypes
dtype('float64')
The test passes when parametrizing with dtype in ["Float64", "Int64"].
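A minimal sketch of what such a check could look like with expected constructed explicitly, as requested earlier in the review. Several things here are assumptions made for the sketch and are not taken from the PR: the helper name `check_replace_with_na`, the use of the public `pd.testing.assert_frame_equal` in place of the internal `tm` module, a plain loop in place of `pytest.mark.parametrize`, and calling `DataFrame.replace` on the frame instead of an in-place replace on the column.

```python
import pandas as pd


def check_replace_with_na(dtype):
    # GH#47480: the column holds pd.NA next to the value being replaced.
    df = pd.DataFrame({"A": [pd.NA, 1, 2]}, dtype=dtype)
    result = df.replace(to_replace=1, value=100)

    # expected is built explicitly, not derived from a copy of df.
    expected = pd.DataFrame({"A": [pd.NA, 100, 2]}, dtype=dtype)
    pd.testing.assert_frame_equal(result, expected)
    return result


# Only the nullable dtypes reported as passing in this thread are exercised.
for dtype in ["Int64", "Float64"]:
    check_replace_with_na(dtype)
```

The object dtype case is left out on purpose, because its behavior is the open question in this conversation.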
Interesting, I would have expected this to be cast to object. I will have a closer look tomorrow.