TST: Test for Dataframe.replace when column contains pd.NA (#47480) #49783

Closed
wants to merge 2 commits into from

@vsbits vsbits commented Nov 19, 2022

df.at[0, "A"] = pd.NA
expected = df.copy()
df["A"].replace(to_replace=1, value=100, inplace=True)
expected.at[1, "A"] = 100
Member

Can you please construct expected explicitly?

Author

No problem. Just pushed the edit.
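
To illustrate the request above, here is a minimal sketch of a test body that builds `expected` literally rather than copying and mutating `df`. It assumes the nullable `"Int64"` dtype from the diff, and it swaps in the public `pd.testing.assert_frame_equal` for the internal `tm` helper so that it runs outside the pandas test suite. It is not the code that was pushed to the PR.

```python
import pandas as pd

# Column with a missing value, stored with the nullable Int64 extension dtype.
df = pd.DataFrame({"A": [pd.NA, 1, 2]}, dtype="Int64")

# Replace without inplace=True so the input frame stays untouched.
result = df.replace(to_replace=1, value=100)

# The expected frame is written out in full instead of being derived from df,
# so a bug that corrupts df cannot silently leak into the expectation.
expected = pd.DataFrame({"A": [pd.NA, 100, 2]}, dtype="Int64")

pd.testing.assert_frame_equal(result, expected)
```

Writing the expectation out in full keeps the test readable on its own: a reviewer can see the intended result without tracing earlier mutations.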

@@ -1503,6 +1503,13 @@ def test_replace_value_none_dtype_numeric(self, val):
        result = df.replace({val: None})
        tm.assert_frame_equal(result, expected)

    def test_replace_in_col_containing_na(self):
        # GH#47480
        df = DataFrame({"A": [pd.NA, 1, 2]}, dtype="Int64")
Member

@phofl phofl Nov 19, 2022

We will need two tests if you want to construct df like this. The DataFrame in the issue had dtype object.

If you set pd.NA into a column with dtype="int64" it gets cast to object.

Edit: You could parametrize with dtype in [object, "Int64"]. I think it is worth having both tests.
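
A small sketch of why the two dtypes are worth testing separately: the same values end up in different storage depending on the `dtype` argument, so `replace` presumably takes a different path for each. The variable names are illustrative only, and the snippet does not call `replace` on the object column, since that behavior depends on the pandas version.

```python
import pandas as pd

values = [pd.NA, 1, 2]

# Object column: generic Python objects, with pd.NA stored as one of them.
as_object = pd.DataFrame({"A": values}, dtype=object)

# Nullable column: a masked integer extension array with a separate NA mask.
as_nullable = pd.DataFrame({"A": values}, dtype="Int64")

print(as_object["A"].dtype)
print(as_nullable["A"].dtype)

# Both columns report the first entry as missing, despite the different storage.
print(as_object["A"].isna().tolist())
print(as_nullable["A"].isna().tolist())
```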

Author

Question: Is df = pd.DataFrame({"A": [pd.NA, 1, 2]}, dtype=dtype) a valid way of constructing a DataFrame, or should it be avoided? When dtype is Int64 or Float64 it works just fine, but when it is None or explicitly object, the replace method raises an error:

dtype = <class 'object'>

    @pytest.mark.parametrize("dtype", [object, "Int64"])
    def test_replace_in_col_containing_na(self, dtype):
        # GH#47480
        df = DataFrame({"A": [pd.NA, 1, 2]}, dtype=dtype)
>       df["A"].replace(to_replace=1, value=100, inplace=True)

tests/frame/methods/test_replace.py:1511:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _core/series.py:5104: in replace
    return super().replace(
core/generic.py:7211: in replace
    new_data = self._mgr.replace(
core/internals/managers.py:464: in replace
    return self.apply(
core/internals/managers.py:350: in apply
    applied = getattr(b, f)(**kwargs)
core/internals/blocks.py:561: in replace
    mask = missing.mask_missing(values, to_replace)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
arr = array([<NA>, 1, 2], dtype=object), values_to_mask = array(1)       

    def mask_missing(arr: ArrayLike, values_to_mask) -> npt.NDArray[np.bool_]:
        """
        Return a masking array of same size/shape as arr
        with entries equaling any member of values_to_mask set to True   

        Parameters
        ----------
        arr : ArrayLike
        values_to_mask: list, tuple, or scalar

        Returns
        -------
        np.ndarray[bool]
        """
        # When called from Block.replace/replace_list, values_to_mask is a scalar
        #  known to be holdable by arr.
        # When called from Series._single_replace, values_to_mask is tuple or list
        dtype, values_to_mask = infer_dtype_from(values_to_mask)
        # error: Argument "dtype" to "array" has incompatible type "Union[dtype[Any],
        # ExtensionDtype]"; expected "Union[dtype[Any], None, type, _SupportsDType, str,
        # Union[Tuple[Any, int], Tuple[Any, Union[int, Sequence[int]]], List[Any],
        # _DTypeDict, Tuple[Any, Any]]]"
        values_to_mask = np.array(values_to_mask, dtype=dtype)  # type: ignore[arg-type]

        na_mask = isna(values_to_mask)
        nonna = values_to_mask[~na_mask]

        # GH 21977
        mask = np.zeros(arr.shape, dtype=bool)
        for x in nonna:
            if is_numeric_v_string_like(arr, x):
                # GH#29553 prevent numpy deprecation warnings
                pass
            else:
                new_mask = arr == x
                if not isinstance(new_mask, np.ndarray):
                    # usually BooleanArray
>                   new_mask = new_mask.to_numpy(dtype=bool, na_value=False)
E                   AttributeError: 'bool' object has no attribute 'to_numpy'
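
The traceback shows `arr == x` returning a plain `bool` instead of an array when `arr` is an object array containing `pd.NA`, which is why the `.to_numpy` call fails. One way to avoid comparing against `pd.NA` at all is to mask out missing entries first. The helper below, `na_safe_equal_mask`, is a hypothetical name used for illustration. It is a sketch of the idea, not the fix that pandas adopted.

```python
import numpy as np
import pandas as pd


def na_safe_equal_mask(arr: np.ndarray, x) -> np.ndarray:
    """Return a boolean mask of entries equal to x, treating NA as no match.

    Hypothetical helper for illustration; not part of the pandas API.
    """
    # pd.isna works elementwise on object arrays and flags pd.NA, None and NaN.
    na = pd.isna(arr)
    mask = np.zeros(arr.shape, dtype=bool)
    # Only the non-missing entries are compared, so pd.NA never takes part
    # in an equality check whose truth value would be ambiguous.
    mask[~na] = arr[~na] == x
    return mask


arr = np.array([pd.NA, 1, 2], dtype=object)
mask = na_safe_equal_mask(arr, 1)
print(mask.tolist())  # [False, True, False]
```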

Author

We will need two tests if you want to construct df like this. The DataFrame in the issue had dtype object.

If you set pd.NA into a column with dtype="int64" it gets cast to object.

Trying it here, it looks like it is cast to float64. Is that the expected behavior?

>>> df = pd.DataFrame({'A': [0, 1, 2]})
>>> df['A'].dtypes
dtype('int64')
>>> df.at[0, 'A'] = pd.NA
>>> df['A'].dtypes
dtype('float64')

The test passes when parametrizing with dtype in ["Float64", "Int64"].

Member

Interesting. I would have expected this to be cast to object. I will have a closer look tomorrow.

@vsbits vsbits closed this Nov 19, 2022
@vsbits vsbits deleted the add-test branch November 20, 2022 19:27
Successfully merging this pull request may close these issues.

BUG: DataFrame.replace fails to replace value when column contains pd.NA