REGR: RecursionError when attempting to replace np.nan values #46443

Closed
5 changes: 3 additions & 2 deletions pandas/core/dtypes/cast.py
@@ -1454,7 +1454,7 @@ def _ensure_nanosecond_dtype(dtype: DtypeObj) -> DtypeObj:

# TODO: other value-dependent functions to standardize here include
# dtypes.concat.cast_to_common_type and Index._find_common_type_compat
-def find_result_type(left: ArrayLike, right: Any) -> DtypeObj:
+def find_result_type(left: ArrayLike, right: Any, strict_na: bool = False) -> DtypeObj:
Member:

I'm wary of introducing this here. Is there a hurry on this, or can I take some time to catch up on the recent PRs that got us here?

Member Author:

The addition here is where the regression was introduced while fixing an Interval issue. Are you thinking of pushing the fix further down into is_valid_na_for_dtype or ensure_dtype_can_hold_na, or adding more special casing/code duplication in Block.replace?

As I see it, whichever way it's done there is always a risk of changing behavior, whereas the fix here keeps the fix for the Interval issue and also preserves the existing behavior for replace.

This is on master, and I'm quite happy with follow-ups to this that refactor, as it's easier to revert a refactor if there is an issue down the line, whereas if we need to revert a regression fix we end up re-introducing the regression.
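For concreteness, the replace behavior this fix preserves (and which previously raised RecursionError, per gh-45725) can be demonstrated on a pandas build that includes the fix:

```python
import numpy as np
import pandas as pd

# Replacing np.nan with pd.NA in a float column: pd.NA is not
# representable in float64, so the result upcasts to object dtype
# instead of recursing indefinitely.
df = pd.DataFrame([np.nan, np.nan], dtype=float)
result = df.replace({np.nan: pd.NA})

print(result[0].dtype)    # object
print(result.iloc[0, 0])  # <NA>
```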

Member Author:

> is there a hurry on this or can i take some time to catch up on the recent PRs that got us here?

no hurry (unless there are more refactors to this area of the code which make it more difficult to fix later)

"""
Find the type/dtype for the result of an operation between these objects.

@@ -1466,6 +1466,7 @@ def find_result_type(left: ArrayLike, right: Any) -> DtypeObj:
----------
left : np.ndarray or ExtensionArray
right : Any
strict_na : bool

Returns
-------
@@ -1491,7 +1492,7 @@ def find_result_type(left: ArrayLike, right: Any) -> DtypeObj:

new_dtype = np.result_type(left, right)

-    elif is_valid_na_for_dtype(right, left.dtype):
+    elif not strict_na and is_valid_na_for_dtype(right, left.dtype):
# e.g. IntervalDtype[int] and None/np.nan
new_dtype = ensure_dtype_can_hold_na(left.dtype)

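As context for the unchanged np.result_type branch above, that call simply applies NumPy's promotion rules to the operands, for example:

```python
import numpy as np

# A float scalar promoted against an int64 array yields float64;
# this is the fallback find_result_type uses for plain numpy operands.
left = np.array([1, 2], dtype=np.int64)
print(np.result_type(left, 2.5))  # float64
```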
6 changes: 3 additions & 3 deletions pandas/core/internals/blocks.py
@@ -436,15 +436,15 @@ def split_and_operate(self, func, *args, **kwargs) -> list[Block]:
# Up/Down-casting

@final
-    def coerce_to_target_dtype(self, other) -> Block:
+    def coerce_to_target_dtype(self, other, strict_na: bool = False) -> Block:
"""
coerce the current block to a dtype compat for other
we will return a block, possibly object, and not raise

we can also safely try to coerce to the same dtype
and will receive the same block
"""
-        new_dtype = find_result_type(self.values, other)
+        new_dtype = find_result_type(self.values, other, strict_na=strict_na)

return self.astype(new_dtype, copy=False)

Expand Down Expand Up @@ -601,7 +601,7 @@ def replace(
return blocks

elif self.ndim == 1 or self.shape[0] == 1:
-            blk = self.coerce_to_target_dtype(value)
+            blk = self.coerce_to_target_dtype(value, strict_na=True)
Member:

ATM I think, rather than adding special cases to the lower-level functions, we should keep it isolated here, similar to what you did for #45836 (and ideally, down the line, combine both fixes into something more elegant).

Member Author:

sure.

from #45725 (comment)

> The alternative is to intercept this particular situation somewhere. I think @simonjayhawkins has a branch that intercepts it at the level of find_common_type,

I think the issue arises basically from the difference between _can_hold_element and is_valid_na_for_dtype not being strict (i.e. None and pd.NA are considered valid for a float array).

I have added a strict flag to pass through to the lower-level library casting functions (I've put this in find_result_type for now, since that's where the change that gave rise to the regression reported here was made).

But it probably does make more sense that the flag is passed further down through to is_valid_na_for_dtype and then maybe use _can_hold_element in is_valid_na_for_dtype when the strict flag is True.
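The mismatch described above can be sketched with a toy model (hypothetical names and simplified logic, not pandas internals): a strict element check rejects the NA sentinel, but the non-strict cast logic keeps the dtype unchanged, so the replace-style recursion never makes progress.

```python
NA_SENTINEL = object()  # stands in for pd.NA in this sketch

def can_hold_element(dtype: str, value: object) -> bool:
    # Strict check, in the spirit of _can_hold_element:
    # only a real float fits a float dtype.
    return dtype == "object" or isinstance(value, float)

def find_result_type(dtype: str, value: object, strict_na: bool = False) -> str:
    na_like = value is None or value is NA_SENTINEL
    if not strict_na and na_like and dtype == "float":
        # Lenient, in the spirit of is_valid_na_for_dtype:
        # the NA-like value "fits", so the dtype is left unchanged.
        return "float"
    return "object"

def replace(dtype: str, value: object, strict_na: bool, depth: int = 0) -> str:
    if depth > 50:
        raise RecursionError("coercion never changed the dtype")
    if can_hold_element(dtype, value):
        return dtype
    new_dtype = find_result_type(dtype, value, strict_na=strict_na)
    return replace(new_dtype, value, strict_na, depth + 1)

print(replace("float", NA_SENTINEL, strict_na=True))  # object
# replace("float", NA_SENTINEL, strict_na=False) never terminates:
# the dtype stays "float" on every pass, mirroring the reported bug.
```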

also from #45725 (comment)

but unless there are other use cases I'm inclined to handle it in Block.replace

Seems reasonable. I'm not sure if there are any other use cases. I suspect we need some logic that uses is_valid_na_for_dtype in Block.replace for this to work, but also suspect that there will be some duplication with the lower-level library functions to pull this off.

As mentioned above, I'd also be quite happy doing this as a refactor after the recursion issue is resolved by the fix suggested here.

> rather than add special cases to the lower-level functions

I've not yet looked, but off the top of your head, is the code added in #45568 to the lower-level functions a special case for Interval? The reason I ask is that removing it would also fix the recursion error.

Member:

> I've not yet looked but off the top of your head, is the code added in #45568 to the lower level functions a special case for Interval? The reason I ask is removing that would also fix the recursion error.

Most of the edits in #45568 should be correct for The General Case. The only special case for IntervalDtype is in ensure_dtype_can_hold_na, which I don't think is what you're referring to.

return blk.replace(
to_replace=to_replace,
value=value,
7 changes: 7 additions & 0 deletions pandas/tests/frame/methods/test_replace.py
@@ -675,6 +675,13 @@ def test_replace_NAT_with_None(self):
expected = DataFrame([None, None])
tm.assert_frame_equal(result, expected)

def test_replace_float_nan_with_pd_NA(self):
# gh-45725
df = DataFrame([np.nan, np.nan], dtype=float)
result = df.replace({np.nan: pd.NA})
expected = DataFrame([pd.NA, pd.NA], dtype=object)
tm.assert_frame_equal(result, expected)

def test_replace_value_is_none(self, datetime_frame):
orig_value = datetime_frame.iloc[0, 0]
orig2 = datetime_frame.iloc[1, 0]