Skip to content

REGR: Fix regression RecursionError when replacing numeric scalar with None #48234

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Oct 31, 2022
9 changes: 9 additions & 0 deletions pandas/core/internals/blocks.py
Original file line number Diff line number Diff line change
Expand Up @@ -608,6 +608,15 @@ def replace(
blocks = [blk]
return blocks

elif value is None and not self.is_object:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it maybe that we do need to add additional logic for now but there has been some prior discussion about this issue cc @jbrockmendel

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is not a permanent fix, but this happens for every numeric column no matter what value you replace (not only nan), so we should provide a fix before releasing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we should provide a fix before releasing

sure. (unfortunate that #45725 was not labelled as a regression (from main) and that the milestone was removed)

This had a simple fix, which I reverted because of inconsistencies in handling None in a list-like and a scalar. Long term we need to address those inconsistencies.

any temp fix needs a code comment explaining when/how the code can/should be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quickly looking back, I did open a PR (#46443) that attempted to address the fact that sometimes we want the missing value to be strict.

we have 2 weeks to fix, so happy to discuss all options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, would be great if we can come up with a permanent fix

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has there been a discussion about the desired behavior? i could imagine having Series([1, 2]).replace(2, None) returning Series([1, np.nan]). I think that might be the right thing to do internal-consistency-wise.

To get that I think we'd use _standardize_fill_value at the top of Block.replace.

Copy link
Member Author

@phofl phofl Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do ser.loc[0] = None this upcasts to object, so imo this would make sense for replace too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ser = pd.Series([1, 2])
ser.loc[0] = None
>>> ser.dtype
dtype('float64')

Copy link
Member Author

@phofl phofl Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that's on me, sorry, I was expanding, e.g. doing ser.loc[2] = None which casted to object on 1.4.3. This is obviously inconsistent too. Would be ok with float64 and casting to nan then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get that I think we'd use _standardize_fill_value at the top of Block.replace.

IIUC from previous discussions that is the preferred fix. We then need to either special case (or handle differently) the dispatch from the list-like replace, where for backwards compatibility, the None replace value is explicit.

The reason I reverted the original fix for the RecursionError (on main) is to facilitate the backport fix, where we have less flexibility for changes in behavior. But the intention was that the issue would be fixed before 1.5.

The proper fix for 1.5 may also involve discussion on deprecating the current list-like replace behavior to be consistent with a scalar replace.

blk = self.astype(np.dtype(object))
return blk.replace(
to_replace=to_replace,
value=value,
inplace=True,
mask=mask,
)

elif self.ndim == 1 or self.shape[0] == 1:
blk = self.coerce_to_target_dtype(value)
return blk.replace(
Expand Down
11 changes: 11 additions & 0 deletions pandas/tests/frame/methods/test_replace.py
Original file line number Diff line number Diff line change
Expand Up @@ -1496,6 +1496,17 @@ def test_replace_list_with_mixed_type(
result = obj.replace(box(to_replace), value)
tm.assert_equal(result, expected)

@pytest.mark.parametrize("val", [2, np.nan, 2.0])
def test_replace_value_none_dtype_numeric(self, val):
# GH#48231
ser = DataFrame({"a": [1, val]})
result = ser.replace(val, None)
expected = DataFrame({"a": [1, None]}, dtype="object")
tm.assert_frame_equal(result, expected)

result = ser.replace({val: None})
tm.assert_frame_equal(result, expected)


class TestDataFrameReplaceRegex:
@pytest.mark.parametrize(
Expand Down
8 changes: 8 additions & 0 deletions pandas/tests/series/methods/test_replace.py
Original file line number Diff line number Diff line change
Expand Up @@ -667,3 +667,11 @@ def test_replace_different_int_types(self, any_int_numpy_dtype):
result = labs.replace(map_dict)
expected = labs.replace({0: 0, 2: 1, 1: 2})
tm.assert_series_equal(result, expected)

@pytest.mark.parametrize("val", [2, np.nan, 2.0])
def test_replace_value_none_dtype_numeric(self, val):
# GH#48231
ser = pd.Series([1, val])
result = ser.replace(val, None)
expected = pd.Series([1, None], dtype="object")
tm.assert_series_equal(result, expected)