Skip to content

BUG: Fix replacing in string series with NA (pandas-dev#32621) #32890

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Apr 10, 2020
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
20 commits
Select commit Hold shift + click to select a range
47f6676
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 21, 2020
2b53200
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 21, 2020
7678495
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 21, 2020
719369d
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 21, 2020
e98c7c9
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 22, 2020
fb8d143
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 22, 2020
0405f6a
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Mar 22, 2020
23da71c
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Mar 29, 2020
ca81cb0
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 29, 2020
8b6d224
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Apr 8, 2020
c32a2cc
BUG: Fix replacing in `string` series with NA (#32621)
chrispe Apr 8, 2020
0a76844
BUG: Fix replacing in `string` series with NA (#32621)
chrispe Apr 8, 2020
b62ad89
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Apr 8, 2020
f5b994a
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Apr 8, 2020
a73e2eb
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Apr 8, 2020
949accc
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Apr 8, 2020
e7b37a0
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Apr 8, 2020
37da655
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Apr 9, 2020
df5bc39
Added description to the 1.1 bug fixes section
chrispe Apr 9, 2020
606aeb8
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Apr 10, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 36 additions & 14 deletions pandas/core/internals/managers.py
Original file line number Diff line number Diff line change
Expand Up @@ -1941,6 +1941,24 @@ def _compare_or_regex_search(a, b, regex=False):
-------
mask : array_like of bool
"""

def _check(result, a, b):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give this a more informative name, maybe _check_comparision_types

and type the input args as much as possible

if is_scalar(result) and (
isinstance(a, np.ndarray) or isinstance(b, np.ndarray)
):
type_names = [type(a).__name__, type(b).__name__]

if is_a_array:
type_names[0] = f"ndarray(dtype={a.dtype})"

if is_b_array:
type_names[1] = f"ndarray(dtype={b.dtype})"

raise TypeError(
f"Cannot compare types {repr(type_names[0])} and {repr(type_names[1])}"
)
return result

if not regex:
op = lambda x: operator.eq(x, b)
else:
Expand All @@ -1951,25 +1969,29 @@ def _compare_or_regex_search(a, b, regex=False):
is_a_array = isinstance(a, np.ndarray)
is_b_array = isinstance(b, np.ndarray)

# GH#32621 use mask to avoid comparing to NAs
if is_a_array and not is_b_array:
mask = np.reshape(~(isna(a)), a.shape)
elif is_b_array and not is_a_array:
mask = np.reshape(~(isna(b)), b.shape)
elif is_a_array and is_b_array:
mask = ~(isna(a) | isna(b))
if is_a_array:
a = a[mask]
if is_b_array:
b = b[mask]

if is_datetimelike_v_numeric(a, b) or is_numeric_v_string_like(a, b):
# GH#29553 avoid deprecation warnings from numpy
result = False
return _check(False, a, b)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't need to have an else here any longer

result = op(a)
if isinstance(result, np.ndarray):
tmp = np.zeros(mask.shape, dtype=np.bool)
tmp[mask] = result
result = tmp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not just
result[mask] = False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the shape of the mask can differ to that of the result, since we may select to compare only a subset of a's or b's elements (i.e. some of them may be NAs).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok can you add a comment on what this is doing (simnilar to what you just said)


if is_scalar(result) and (is_a_array or is_b_array):
type_names = [type(a).__name__, type(b).__name__]

if is_a_array:
type_names[0] = f"ndarray(dtype={a.dtype})"

if is_b_array:
type_names[1] = f"ndarray(dtype={b.dtype})"

raise TypeError(
f"Cannot compare types {repr(type_names[0])} and {repr(type_names[1])}"
)
return result
return _check(result, a, b)


def _fast_count_smallints(arr):
Expand Down
7 changes: 7 additions & 0 deletions pandas/tests/series/methods/test_replace.py
Original file line number Diff line number Diff line change
Expand Up @@ -241,6 +241,13 @@ def test_replace2(self):
assert (ser[6:10] == -1).all()
assert (ser[20:30] == -1).all()

def test_replace_with_dictlike_and_string_dtype(self):
# GH 32621
s = pd.Series(["one", "two", np.nan], dtype="string")
expected = pd.Series(["1", "2", np.nan])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected result here should also be dtype="string" I think

Copy link
Contributor Author

@chrispe chrispe Apr 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently that's not the case (even at the current stable version):

>>> import pandas as pd
>>> pd.__version__
'1.0.3'
>>> s = pd.Series(["one", "two"], dtype="string")
>>> expected = pd.Series(["1", "2"], dtype="string")
>>> result = s.replace(to_replace={"one": "1", "two": "2"})
>>> expected
0    1
1    2
dtype: string
>>> result
0    1
1    2
dtype: object

I'm not sure if that behaviour is to be expected or it should be tackled within a new issue. What do you think? This is not related to containing NA values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that it is certainly expected for replace to again return a string dtype.

Although of course, there are not guarantees that your replacement value is a string ...
And given that it is also not working on master, not sure it needs to be fixed here.

For some cases of replace it does work fine:

In [33]: s = pd.Series(["one", "two"], dtype="string")  

In [34]: s.replace("one", "1")   
Out[34]: 
0      1
1    two
dtype: string

Copy link
Contributor Author

@chrispe chrispe Apr 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation. When not using the argument to_replace (i.e. s.replace("one", "1")) then the trace is different. It doesn't use the same function to apply the replacement in comparison to s.replace(to_replace={"one": "1"}).

For more details check here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L6462

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrispe92 can you open an issue for this

result = s.replace({"one": "1", "two": "2"})
tm.assert_series_equal(expected, result)

def test_replace_with_empty_dictlike(self):
# GH 15289
s = pd.Series(list("abcd"))
Expand Down