Skip to content

BUG: Fix replacing in string series with NA (pandas-dev#32621) #32890

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Apr 10, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
20 commits
Select commit Hold shift + click to select a range
47f6676
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 21, 2020
2b53200
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 21, 2020
7678495
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 21, 2020
719369d
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 21, 2020
e98c7c9
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 22, 2020
fb8d143
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 22, 2020
0405f6a
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Mar 22, 2020
23da71c
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Mar 29, 2020
ca81cb0
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Mar 29, 2020
8b6d224
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Apr 8, 2020
c32a2cc
BUG: Fix replacing in `string` series with NA (#32621)
chrispe Apr 8, 2020
0a76844
BUG: Fix replacing in `string` series with NA (#32621)
chrispe Apr 8, 2020
b62ad89
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Apr 8, 2020
f5b994a
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Apr 8, 2020
a73e2eb
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Apr 8, 2020
949accc
BUG: Fix replacing in `string` series with NA (pandas-dev#32621)
chrispe Apr 8, 2020
e7b37a0
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Apr 8, 2020
37da655
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Apr 9, 2020
df5bc39
Added description to the 1.1 bug fixes section
chrispe Apr 9, 2020
606aeb8
Merge remote-tracking branch 'upstream/master' into fix-issue-32621
chrispe Apr 10, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -452,8 +452,8 @@ Indexing

Missing
^^^^^^^

- Calling :meth:`fillna` on an empty Series now correctly returns a shallow copied object. The behaviour is now consistent with :class:`Index`, :class:`DataFrame` and a non-empty :class:`Series` (:issue:`32543`).
- Bug in :meth:`replace` when argument ``to_replace`` is of type dict/list and is used on a :class:`Series` containing ``<NA>`` was raising a ``TypeError``. The method now handles this by ignoring ``<NA>`` values when doing the comparison for the replacement (:issue:`32621`)
- Bug in :meth:`~Series.any` and :meth:`~Series.all` incorrectly returning ``<NA>`` for all ``False`` or all ``True`` values using the nulllable boolean dtype and with ``skipna=False`` (:issue:`33253`)

MultiIndex
Expand Down
72 changes: 53 additions & 19 deletions pandas/core/internals/managers.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
import numpy as np

from pandas._libs import Timedelta, Timestamp, internals as libinternals, lib
from pandas._typing import ArrayLike, DtypeObj, Label
from pandas._typing import ArrayLike, DtypeObj, Label, Scalar
from pandas.util._validators import validate_bool_kwarg

from pandas.core.dtypes.cast import (
Expand Down Expand Up @@ -1905,7 +1905,9 @@ def _merge_blocks(
return blocks


def _compare_or_regex_search(a, b, regex=False):
def _compare_or_regex_search(
a: Union[ArrayLike, Scalar], b: Union[ArrayLike, Scalar], regex: bool = False
) -> Union[ArrayLike, bool]:
"""
Compare two array_like inputs of the same shape or two scalar values

Expand All @@ -1922,35 +1924,67 @@ def _compare_or_regex_search(a, b, regex=False):
-------
mask : array_like of bool
"""

def _check_comparison_types(
result: Union[ArrayLike, bool],
a: Union[ArrayLike, Scalar],
b: Union[ArrayLike, Scalar],
) -> Union[ArrayLike, bool]:
"""
Raises an error if the two arrays (a,b) cannot be compared.
Otherwise, returns the comparison result as expected.
"""
if is_scalar(result) and (
isinstance(a, np.ndarray) or isinstance(b, np.ndarray)
):
type_names = [type(a).__name__, type(b).__name__]

if isinstance(a, np.ndarray):
type_names[0] = f"ndarray(dtype={a.dtype})"

if isinstance(b, np.ndarray):
type_names[1] = f"ndarray(dtype={b.dtype})"

raise TypeError(
f"Cannot compare types {repr(type_names[0])} and {repr(type_names[1])}"
)
return result

if not regex:
op = lambda x: operator.eq(x, b)
else:
op = np.vectorize(
lambda x: bool(re.search(b, x)) if isinstance(x, str) else False
lambda x: bool(re.search(b, x))
if isinstance(x, str) and isinstance(b, str)
else False
)

is_a_array = isinstance(a, np.ndarray)
is_b_array = isinstance(b, np.ndarray)
# GH#32621 use mask to avoid comparing to NAs
if isinstance(a, np.ndarray) and not isinstance(b, np.ndarray):
mask = np.reshape(~(isna(a)), a.shape)
elif isinstance(b, np.ndarray) and not isinstance(a, np.ndarray):
mask = np.reshape(~(isna(b)), b.shape)
elif isinstance(a, np.ndarray) and isinstance(b, np.ndarray):
mask = ~(isna(a) | isna(b))
if isinstance(a, np.ndarray):
a = a[mask]
if isinstance(b, np.ndarray):
b = b[mask]

if is_datetimelike_v_numeric(a, b) or is_numeric_v_string_like(a, b):
# GH#29553 avoid deprecation warnings from numpy
result = False
else:
result = op(a)

if is_scalar(result) and (is_a_array or is_b_array):
type_names = [type(a).__name__, type(b).__name__]
return _check_comparison_types(False, a, b)

if is_a_array:
type_names[0] = f"ndarray(dtype={a.dtype})"
result = op(a)

if is_b_array:
type_names[1] = f"ndarray(dtype={b.dtype})"
if isinstance(result, np.ndarray):
# The shape of the mask can differ to that of the result
# since we may compare only a subset of a's or b's elements
tmp = np.zeros(mask.shape, dtype=np.bool)
tmp[mask] = result
result = tmp

raise TypeError(
f"Cannot compare types {repr(type_names[0])} and {repr(type_names[1])}"
)
return result
return _check_comparison_types(result, a, b)


def _fast_count_smallints(arr: np.ndarray) -> np.ndarray:
Expand Down
7 changes: 7 additions & 0 deletions pandas/tests/series/methods/test_replace.py
Original file line number Diff line number Diff line change
Expand Up @@ -241,6 +241,13 @@ def test_replace2(self):
assert (ser[6:10] == -1).all()
assert (ser[20:30] == -1).all()

def test_replace_with_dictlike_and_string_dtype(self):
# GH 32621
s = pd.Series(["one", "two", np.nan], dtype="string")
expected = pd.Series(["1", "2", np.nan])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected result here should also be dtype="string" I think

Copy link
Contributor Author

@chrispe chrispe Apr 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently that's not the case (even at the current stable version):

>>> import pandas as pd
>>> pd.__version__
'1.0.3'
>>> s = pd.Series(["one", "two"], dtype="string")
>>> expected = pd.Series(["1", "2"], dtype="string")
>>> result = s.replace(to_replace={"one": "1", "two": "2"})
>>> expected
0    1
1    2
dtype: string
>>> result
0    1
1    2
dtype: object

I'm not sure if that behaviour is to be expected or it should be tackled within a new issue. What do you think? This is not related to containing NA values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that it is certainly expected for replace to again return a string dtype.

Although of course, there are not guarantees that your replacement value is a string ...
And given that it is also not working on master, not sure it needs to be fixed here.

For some cases of replace it does work fine:

In [33]: s = pd.Series(["one", "two"], dtype="string")  

In [34]: s.replace("one", "1")   
Out[34]: 
0      1
1    two
dtype: string

Copy link
Contributor Author

@chrispe chrispe Apr 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation. When not using the argument to_replace (i.e. s.replace("one", "1")) then the trace is different. It doesn't use the same function to apply the replacement in comparison to s.replace(to_replace={"one": "1"}).

For more details check here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L6462

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrispe92 can you open an issue for this

result = s.replace({"one": "1", "two": "2"})
tm.assert_series_equal(expected, result)

def test_replace_with_empty_dictlike(self):
# GH 15289
s = pd.Series(list("abcd"))
Expand Down