BUG: Fix replacing in `string` series with NA (pandas-dev#32621) #32890

chrispe · 2020-03-21T17:21:51Z

The pd.NA values are replaced with np.nan before comparing the arrays/scalars

closes BUG: Replace in string series with NA #32621
tests passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff

Made improvements based on the tests which failed

Added change to resolve linting check

Added test for the reported bug

WillAyd · 2020-03-21T23:17:03Z

I think the problem here is actually that the element-wise comparison reduces to a scalar:

>>> np.array(["one", "two", np.nan]) == "one"
array([ True, False, False])
>>> np.array(["one", "two", pd.NA]) == "one"
__main__:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
False

Which then gets caught in some of the functions here

@TomAugspurger or @jorisvandenbossche do you know the backstory to that behavior?
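One plausible reading of the warning above, offered as a sketch rather than a confirmed explanation: comparing pd.NA with a string yields pd.NA again, and pd.NA refuses to be coerced to a boolean, so NumPy cannot fill a boolean result array element by element. The variable names below are illustrative only.

```python
import pandas as pd

# Comparing pd.NA with any string propagates the missing value.
comparison = pd.NA == "one"
propagated = comparison is pd.NA

# Coercing that result to a plain boolean is ambiguous and raises.
try:
    bool(comparison)
    coercion_failed = False
except TypeError:
    coercion_failed = True

print(propagated, coercion_failed)
```

This is why masking out the NA entries before comparing, as the PR does, sidesteps the scalar fallback.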

jreback

see comment

pandas/core/internals/managers.py

chrispe · 2020-03-22T10:38:52Z

Hi @WillAyd, please check my comment #32890 (comment), which further explains why that warning is probably raised.

pep8speaks · 2020-03-22T22:53:39Z

Hello @chrispe92! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-10 11:09:11 UTC

pandas/core/internals/managers.py

chrispe · 2020-03-29T12:01:19Z

Hi @jreback,

Can you have a look at my latest changes? I'm curious to know what you think.

jreback · 2020-03-29T16:14:53Z

pandas/core/internals/managers.py

+        if isinstance(result, np.ndarray):
+            tmp = np.zeros(mask.shape, dtype=np.bool)
+            tmp[mask] = result
+            result = tmp


why is this not just
result[mask] = False

Because the shape of the mask can differ from that of the result, since we may choose to compare only a subset of a's or b's elements (i.e. some of them may be NAs).

ok can you add a comment on what this is doing (similar to what you just said)
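The shape mismatch described above can be illustrated with a minimal sketch. It assumes a single object-dtype array compared against a scalar; the function name compare_ignoring_na is hypothetical and is not the code merged in this PR.

```python
import numpy as np
import pandas as pd


def compare_ignoring_na(values: np.ndarray, target: str) -> np.ndarray:
    """Compare only the non-NA entries, then expand back to full length."""
    mask = ~pd.isna(values)            # True where a real value is present
    partial = values[mask] == target   # shorter than `values` if any NA exists
    full = np.zeros(mask.shape, dtype=bool)
    full[mask] = partial               # NA positions stay False
    return full


a = np.array(["one", "two", pd.NA, "one"], dtype=object)
out = compare_ignoring_na(a, "one")
print(out)
```

Because partial has one element fewer than mask here, writing result[mask] = False would not be enough; the comparison result has to be scattered into a full-length array.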

jreback · 2020-03-29T16:17:13Z

pandas/core/internals/managers.py

@@ -1952,8 +1952,25 @@ def _compare_or_regex_search(a, b, regex=False):
        # GH#29553 avoid deprecation warnings from numpy
        result = False


let's just return here (you will need to move the check on line 194 into a function), e.g.

if is_datetimelike......: return check(False) .....

can you update for this

Hi @jreback, I'm not sure if I completely understood the change you requested to make here. But I assume you wanted to place the entire block that checks the result's value into a function instead? If so, I did that and made some additional changes (c32a2cc) in order to have it working. Can you please confirm? Thanks

jreback · 2020-04-07T00:11:19Z

pandas/core/internals/managers.py

@@ -1952,8 +1952,25 @@ def _compare_or_regex_search(a, b, regex=False):
        # GH#29553 avoid deprecation warnings from numpy
        result = False


can you update for this

jreback · 2020-04-07T00:12:56Z

pandas/core/internals/managers.py

+        elif is_a_array and is_b_array:
+            mask = ~(isna(a) | isna(b))
+
+        if is_a_array:


can you not do this in the above logic?

Hi @jreback, I moved that code block above (c32a2cc). Can you please confirm that this is what you were expecting? Thanks.

jreback · 2020-04-08T15:22:50Z

pandas/core/internals/managers.py

@@ -1941,6 +1941,24 @@ def _compare_or_regex_search(a, b, regex=False):
    -------
    mask : array_like of bool
    """
+
+    def _check(result, a, b):


can you give this a more informative name, maybe _check_comparison_types

and type the input args as much as possible

jreback · 2020-04-08T15:23:28Z

pandas/core/internals/managers.py

    if is_datetimelike_v_numeric(a, b) or is_numeric_v_string_like(a, b):
        # GH#29553 avoid deprecation warnings from numpy
-        result = False
+        return _check(False, a, b)
    else:


doesn't need to have an else here any longer
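The early-return shape requested in these review comments might look roughly like the sketch below. It is not the merged pandas code: check_comparison_types and compare are hypothetical names, and the dtype test is a simplified stand-in for the is_numeric_v_string_like check shown in the diff.

```python
from typing import Any, Union

import numpy as np


def check_comparison_types(
    result: Union[np.ndarray, bool], a: np.ndarray, b: Any
) -> Union[np.ndarray, bool]:
    """Raise if comparing an array operand collapsed to a bare scalar."""
    if isinstance(result, (bool, np.bool_)):
        raise TypeError(
            f"Cannot compare types {type(a).__name__!r} and {type(b).__name__!r}"
        )
    return result


def compare(a: np.ndarray, b: Any) -> Union[np.ndarray, bool]:
    # Assumed stand-in for the numeric-vs-string guard in the diff.
    if a.dtype.kind in "iuf" and isinstance(b, str):
        return check_comparison_types(False, a, b)
    # No `else` needed: the branch above always returns or raises.
    return check_comparison_types(a == b, a, b)


matches = compare(np.array(["one", "two"], dtype=object), "one")
print(matches)
```

Routing both paths through one typed helper keeps the scalar-result check in a single place.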

jreback · 2020-04-08T15:24:08Z

pandas/core/internals/managers.py

+        if isinstance(result, np.ndarray):
+            tmp = np.zeros(mask.shape, dtype=np.bool)
+            tmp[mask] = result
+            result = tmp


ok can you add a comment on what this is doing (similar to what you just said)

chrispe · 2020-04-09T07:33:47Z

Hi @jreback, I've made all the changes that you suggested.
Other than those, I've also made the following changes:

Added expected types in both of the functions
Removed the is_array variables in order to pass the mypy tests.

Can you please check them and let me know what you think? Did I miss anything? Thanks.

jreback

looks good. can you add a whatsnew note in 1.1 bug fixes missing section. ping on green.

pandas/core/internals/managers.py

jorisvandenbossche · 2020-04-09T18:26:24Z

pandas/tests/series/methods/test_replace.py

+    def test_replace_with_dictlike_and_string_dtype(self):
+        # GH 32621
+        s = pd.Series(["one", "two", np.nan], dtype="string")
+        expected = pd.Series(["1", "2", np.nan])


The expected result here should also be dtype="string" I think

Apparently that's not the case (even at the current stable version):

>>> import pandas as pd
>>> pd.__version__
'1.0.3'
>>> s = pd.Series(["one", "two"], dtype="string")
>>> expected = pd.Series(["1", "2"], dtype="string")
>>> result = s.replace(to_replace={"one": "1", "two": "2"})
>>> expected
0    1
1    2
dtype: string
>>> result
0    1
1    2
dtype: object

I'm not sure if that behaviour is to be expected or it should be tackled within a new issue. What do you think? This is not related to containing NA values.

I would say that it is certainly expected for replace to again return a string dtype.

Although of course, there are no guarantees that your replacement value is a string ...
And given that it is also not working on master, not sure it needs to be fixed here.

For some cases of replace it does work fine:

In [33]: s = pd.Series(["one", "two"], dtype="string")
In [34]: s.replace("one", "1")
Out[34]:
0      1
1    two
dtype: string

Good observation. When not using the argument to_replace (i.e. s.replace("one", "1")), the code path is different: it does not use the same function to apply the replacement as s.replace(to_replace={"one": "1"}) does.

For more details check here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L6462

@chrispe92 can you open an issue for this

chrispe · 2020-04-10T06:43:15Z

Hi @jreback, a note for the bug is now included in the whatsnew 1.1 bug fixes missing section. Can you have a look? Thanks.

jreback · 2020-04-10T17:41:27Z

thanks @chrispe92

if you could open a new issue for the dtype preservation on .replace, this PR solves the current issue so can address the other later.

chrispe · 2020-04-11T16:52:31Z

Hi @jreback, a related issue has now been created (#33484)

chrispe added 4 commits March 21, 2020 18:14

BUG: Fix replacing in string series with NA (pandas-dev#32621)

47f6676

The pd.NA values are replaced with np.nan before comparing the arrays/scalars

BUG: Fix replacing in string series with NA (pandas-dev#32621)

2b53200

Made improvements based on the tests which failed

BUG: Fix replacing in string series with NA (pandas-dev#32621)

7678495

Added change to resolve linting check

BUG: Fix replacing in string series with NA (pandas-dev#32621)

719369d

Added test for the reported bug

WillAyd added the ExtensionArray label Mar 21, 2020

jreback requested changes Mar 21, 2020

BUG: Fix replacing in string series with NA (pandas-dev#32621)

e98c7c9

chrispe added 2 commits March 22, 2020 23:57

BUG: Fix replacing in string series with NA (pandas-dev#32621)

fb8d143

Merge remote-tracking branch 'upstream/master' into fix-issue-32621

0405f6a

TomAugspurger reviewed Mar 23, 2020

simonjayhawkins added the Bug and Missing-data labels Mar 23, 2020

chrispe added 2 commits March 29, 2020 12:31

Merge remote-tracking branch 'upstream/master' into fix-issue-32621

23da71c

BUG: Fix replacing in string series with NA (pandas-dev#32621)

ca81cb0

chrispe requested a review from jreback March 29, 2020 13:23

jreback requested changes Mar 29, 2020

jreback requested changes Apr 7, 2020

chrispe added 3 commits April 8, 2020 13:45

Merge remote-tracking branch 'upstream/master' into fix-issue-32621

8b6d224

BUG: Fix replacing in string series with NA (pandas-dev#32621)

c32a2cc

BUG: Fix replacing in string series with NA (pandas-dev#32621)

0a76844

jreback requested changes Apr 8, 2020

chrispe added 5 commits April 8, 2020 21:37

BUG: Fix replacing in string series with NA (pandas-dev#32621)

b62ad89

Merge remote-tracking branch 'upstream/master' into fix-issue-32621

f5b994a

BUG: Fix replacing in string series with NA (pandas-dev#32621)

a73e2eb

BUG: Fix replacing in string series with NA (pandas-dev#32621)

949accc

Merge remote-tracking branch 'upstream/master' into fix-issue-32621

e7b37a0

jreback requested changes Apr 9, 2020

jreback added this to the 1.1 milestone Apr 9, 2020

chrispe added 2 commits April 9, 2020 19:48

Merge remote-tracking branch 'upstream/master' into fix-issue-32621

37da655

Added description to the 1.1 bug fixes section

df5bc39

jorisvandenbossche reviewed Apr 9, 2020

Merge remote-tracking branch 'upstream/master' into fix-issue-32621

606aeb8

jreback approved these changes Apr 10, 2020

jreback merged commit 3cca07c into pandas-dev:master Apr 10, 2020

chrispe deleted the fix-issue-32621 branch April 12, 2020 08:04

TomAugspurger mentioned this pull request May 1, 2020

Performance regression in replace.ReplaceDict.time_replace_series #33920

Closed

chrispe mentioned this pull request Jul 12, 2020

Place the calculation of mask prior to the calls of comp in replace_list to improve performance #35229

Merged

5 tasks

simonjayhawkins mentioned this pull request Aug 16, 2020

REGR: Don't ignore compiled patterns in replace #35697

Merged

5 tasks
