TST: Test for Dataframe.replace when column contains pd.NA (#47480) #49783

Closed
wants to merge 2 commits into from
7 changes: 7 additions & 0 deletions pandas/tests/frame/methods/test_replace.py
@@ -1503,6 +1503,13 @@ def test_replace_value_none_dtype_numeric(self, val):
        result = df.replace({val: None})
        tm.assert_frame_equal(result, expected)

    def test_replace_in_col_containing_na(self):
        # GH#47480
        df = DataFrame({"A": [pd.NA, 1, 2]}, dtype="Int64")
Member

@phofl phofl Nov 19, 2022

We will need two tests if you want to construct df like this. The DataFrame in the issue had dtype object.

If you set pd.NA into a column with dtype="int64" it gets cast to object.

Edit: You could parametrize with dtype in [object, "Int64"]. I think it is worth having both tests.
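
For reference, a minimal sketch of how the dtype argument affects the column that ends up holding pd.NA. The `column_dtype` helper is hypothetical and only here for illustration; the result for `dtype=None` depends on pandas' inference and is printed rather than asserted.

```python
import pandas as pd


def column_dtype(dtype=None) -> str:
    """Hypothetical helper: build the frame from this PR and report column A's dtype."""
    df = pd.DataFrame({"A": [pd.NA, 1, 2]}, dtype=dtype)
    return str(df["A"].dtype)


# Explicit object keeps pd.NA as a plain Python object inside a NumPy array,
# while "Int64" stores it in a nullable (masked) extension array.
for dtype in (None, object, "Int64"):
    print(dtype, "->", column_dtype(dtype))
```

The two explicit cases exercise different code paths in replace, which is why it seems worth covering both.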

Author

Question: Is df = pd.DataFrame({"A": [pd.NA, 1, 2]}, dtype=dtype) a valid way of constructing a DataFrame, or should it be avoided? When dtype is Int64 or Float64 it works fine, but when it is None or explicitly object, the replace method raises an error:

dtype = <class 'object'>

    @pytest.mark.parametrize("dtype", [object, "Int64"])
    def test_replace_in_col_containing_na(self, dtype):
        # GH#47480
        df = DataFrame({"A": [pd.NA, 1, 2]}, dtype=dtype)
>       df["A"].replace(to_replace=1, value=100, inplace=True)

tests/frame/methods/test_replace.py:1511:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _core/series.py:5104: in replace
    return super().replace(
core/generic.py:7211: in replace
    new_data = self._mgr.replace(
core/internals/managers.py:464: in replace
    return self.apply(
core/internals/managers.py:350: in apply
    applied = getattr(b, f)(**kwargs)
core/internals/blocks.py:561: in replace
    mask = missing.mask_missing(values, to_replace)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
arr = array([<NA>, 1, 2], dtype=object), values_to_mask = array(1)       

    def mask_missing(arr: ArrayLike, values_to_mask) -> npt.NDArray[np.bool_]:
        """
        Return a masking array of same size/shape as arr
        with entries equaling any member of values_to_mask set to True   

        Parameters
        ----------
        arr : ArrayLike
        values_to_mask: list, tuple, or scalar

        Returns
        -------
        np.ndarray[bool]
        """
        # When called from Block.replace/replace_list, values_to_mask is a scalar
        #  known to be holdable by arr.
        # When called from Series._single_replace, values_to_mask is tuple or list
        dtype, values_to_mask = infer_dtype_from(values_to_mask)
        # error: Argument "dtype" to "array" has incompatible type "Union[dtype[Any],
        # ExtensionDtype]"; expected "Union[dtype[Any], None, type, _SupportsDType, str,
        # Union[Tuple[Any, int], Tuple[Any, Union[int, Sequence[int]]], List[Any],
        # _DTypeDict, Tuple[Any, Any]]]"
        values_to_mask = np.array(values_to_mask, dtype=dtype)  # type: ignore[arg-type]

        na_mask = isna(values_to_mask)
        nonna = values_to_mask[~na_mask]

        # GH 21977
        mask = np.zeros(arr.shape, dtype=bool)
        for x in nonna:
            if is_numeric_v_string_like(arr, x):
                # GH#29553 prevent numpy deprecation warnings
                pass
            else:
                new_mask = arr == x
                if not isinstance(new_mask, np.ndarray):
                    # usually BooleanArray
>                   new_mask = new_mask.to_numpy(dtype=bool, na_value=False)
E                   AttributeError: 'bool' object has no attribute 'to_numpy'
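
The failing step in the traceback is `arr == x`: with an object array that holds pd.NA, the comparison came back as a plain bool rather than an ndarray, so the `.to_numpy` call fails. As a minimal sketch of one way to avoid this (an assumption for illustration, not the fix adopted in pandas; `mask_equal_skip_na` is a hypothetical helper), the mask could be computed over the non-missing entries only:

```python
import numpy as np
import pandas as pd


def mask_equal_skip_na(arr: np.ndarray, x) -> np.ndarray:
    """Hypothetical helper: True where arr equals x, never at missing entries."""
    mask = np.zeros(arr.shape, dtype=bool)
    valid = ~pd.isna(arr)  # missing entries (pd.NA, None, NaN) are excluded up front
    # Compare only the non-missing entries, so pd.NA never takes part in ==.
    mask[valid] = arr[valid] == x
    return mask


arr = np.array([pd.NA, 1, 2], dtype=object)
mask = mask_equal_skip_na(arr, 1)
print(mask)
```

Because pd.NA is filtered out before the comparison, the result is always a boolean ndarray of the same shape as arr.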

Author

We will need two tests if you want to construct df like this. The DataFrame in the issue had dtype object.

If you set pd.NA into a column with dtype="int64" it gets cast to object.

When I try it here, the column appears to be cast to float64 rather than object. Is that the expected behavior?

>>> df = pd.DataFrame({'A': [0, 1, 2]})
>>> df['A'].dtypes
dtype('int64')
>>> df.at[0, 'A'] = pd.NA
>>> df['A'].dtypes
dtype('float64')

The test passes when parametrizing with dtype in ["Float64", "Int64"].
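
A minimal standalone sketch of that passing parametrization, written without pytest so it can be run directly. The `replace_in_na_column` helper is hypothetical, and it uses the non-inplace `DataFrame.replace` instead of the inplace call on `df["A"]` from the diff, which is an assumption made to keep the example simple.

```python
import pandas as pd


def replace_in_na_column(dtype: str) -> pd.DataFrame:
    """Hypothetical helper: replace 1 with 100 in a nullable column that holds pd.NA."""
    df = pd.DataFrame({"A": [pd.NA, 1, 2]}, dtype=dtype)
    return df.replace(to_replace=1, value=100)


for dtype in ["Int64", "Float64"]:
    result = replace_in_na_column(dtype)
    expected = pd.DataFrame({"A": [pd.NA, 100, 2]}, dtype=dtype)
    # The missing value stays in place and the dtype is preserved.
    pd.testing.assert_frame_equal(result, expected)
```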

Member

Interesting, I would have expected this to be cast to object. I will have a closer look tomorrow.

        df["A"].replace(to_replace=1, value=100, inplace=True)
        expected = DataFrame({"A": [pd.NA, 100, 2]}, dtype="Int64")
        tm.assert_frame_equal(df, expected)


class TestDataFrameReplaceRegex:
    @pytest.mark.parametrize(