Skip to content

TST: Tests for replace method when column contains pd.NA (#47480) #49805

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from
Closed

TST: Tests for replace method when column contains pd.NA (#47480) #49805

wants to merge 11 commits into from

Conversation

vsbits
Copy link

@vsbits vsbits commented Nov 20, 2022

Added tests to show BUG from issue (#47480).
Original issue was about pd.Series method (df["A"].replace), but pushing tests for both DataFrame and Series.

Test failing when DataFrame/Series has dtype object (infered when dtype=None on constructor).

Vítor Araujo added 2 commits November 20, 2022 19:53
@mroeschke
Copy link
Member

Looks like the added tests are failing in the CI. Are you interested in contributing a fix?

@mroeschke mroeschke added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate replace replace method labels Nov 21, 2022
@@ -94,6 +94,11 @@ def mask_missing(arr: ArrayLike, values_to_mask) -> npt.NDArray[np.bool_]:
pass
else:
new_mask = arr == x
# GH#47480
if isinstance(new_mask, bool):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix this without trying the failing conversion first? You can check for dtype object and if arr has any nans and use a different logic in that case.

I would suggest using another mask that applies the equality comparison only for values that are not nan

@@ -89,15 +88,11 @@ def mask_missing(arr: ArrayLike, values_to_mask) -> npt.NDArray[np.bool_]:
# GH 21977
mask = np.zeros(arr.shape, dtype=bool)
for x in nonna:
if is_numeric_v_string_like(arr, x):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls keep this check. it'll avoid some warnings and will be a fastpath

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if you pre-compute a mask in case we have object dtype before entering the loop.
Then you can add another if clause to only update the values that are not na

new_mask[~object_na_mask] = arr[~object_na_mask] == x

@vsbits
Copy link
Author

vsbits commented Dec 2, 2022

Last commit seems to pass all tests.
Should I make any other changes to the code?

@@ -88,12 +88,15 @@ def mask_missing(arr: ArrayLike, values_to_mask) -> npt.NDArray[np.bool_]:

# GH 21977
mask = np.zeros(arr.shape, dtype=bool)
arr_na_mask = ~isna(arr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid doing this when not necessary? I think we only need this with object dtype

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vs-araujo can you do this

@vsbits
Copy link
Author

vsbits commented Dec 13, 2022

@phofl Now it only creates the NA mask if arr is a numpy.ndarray (happens when Series dtype is object. Otherwise arr is a pandas.array e.g. IntegerArray or FloatingArray. There is no ObjectArray class?).

Saw a problem after commiting:

    else:
        arr_nas = mask  # Avoid possibility of NameError on new_mask[~arr_nas]

Should change to mask.copy() or delete it altogether?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 15, 2023
@vsbits vsbits closed this Jan 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate replace replace method Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.replace fails to replace value when column contains pd.NA
4 participants