Skip to content

BUG: Fix replacing in string series with NA (pandas-dev#32621) #32890

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Apr 10, 2020

Conversation

chrispe
Copy link
Contributor

@chrispe chrispe commented Mar 21, 2020

The pd.NA values are replaced with np.nan before comparing the arrays/scalars

chrispe added 4 commits March 21, 2020 18:14
The pd.NA values are replaced with np.nan before comparing the arrays/scalars
Made improvements based on the tests which failed
Added change to resolve linting check
@WillAyd
Copy link
Member

WillAyd commented Mar 21, 2020

I think the problem here is actually that the element wise comparison reduces to a scalar:

>>> np.array(["one", "two", np.nan]) == "one"
array([ True, False, False])
>>> np.array(["one", "two", pd.NA]) == "one"
__main__:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
False

Which then gets caught in some of the functions here

@TomAugspurger or @jorisvandenbossche do you know the backstory to that behavior?

@WillAyd WillAyd added the ExtensionArray Extending pandas with custom dtypes or arrays. label Mar 21, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment

@chrispe
Copy link
Contributor Author

chrispe commented Mar 22, 2020

I think the problem here is actually that the element wise comparison reduces to a scalar:

>>> np.array(["one", "two", np.nan]) == "one"
array([ True, False, False])
>>> np.array(["one", "two", pd.NA]) == "one"
__main__:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
False

Which then gets caught in some of the functions here

@TomAugspurger or @jorisvandenbossche do you know the backstory to that behavior?

Hi @WillAyd, please check my comment #32890 (comment) which further explains why probably that warning is raised.

@pep8speaks
Copy link

pep8speaks commented Mar 22, 2020

Hello @chrispe92! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-10 11:09:11 UTC

@simonjayhawkins simonjayhawkins added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Mar 23, 2020
@chrispe
Copy link
Contributor Author

chrispe commented Mar 29, 2020

Hi @jreback,

Can you have a look my latest changes? I'm curious to know what you think.

@chrispe chrispe requested a review from jreback March 29, 2020 13:23
if isinstance(result, np.ndarray):
tmp = np.zeros(mask.shape, dtype=np.bool)
tmp[mask] = result
result = tmp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not just
result[mask] = False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the shape of the mask can differ to that of the result, since we may select to compare only a subset of a's or b's elements (i.e. some of them may be NAs).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok can you add a comment on what this is doing (simnilar to what you just said)

@@ -1952,8 +1952,25 @@ def _compare_or_regex_search(a, b, regex=False):
# GH#29553 avoid deprecation warnings from numpy
result = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's just return here (you will need move the check on like 194 into a function ,e.g.

if is_datetimeliek......:
     return check(False)

.....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update for this

Copy link
Contributor Author

@chrispe chrispe Apr 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jreback, I'm not sure if I completely understood the change you requested to make here. But I assume you wanted to place the entire block that checks the result's value into a function instead? If so, I did that and made some additional changes (c32a2cc) in order to have it working. Can you please confirm? Thanks

@@ -1952,8 +1952,25 @@ def _compare_or_regex_search(a, b, regex=False):
# GH#29553 avoid deprecation warnings from numpy
result = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update for this

elif is_a_array and is_b_array:
mask = ~(isna(a) | isna(b))

if is_a_array:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you not do this in the above logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jreback, I moved that code block above (c32a2cc). Can you please confirm that this is what you were expecting? Thanks.

@@ -1941,6 +1941,24 @@ def _compare_or_regex_search(a, b, regex=False):
-------
mask : array_like of bool
"""

def _check(result, a, b):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give this a more informative name, maybe _check_comparision_types

and type the input args as much as possible

if is_datetimelike_v_numeric(a, b) or is_numeric_v_string_like(a, b):
# GH#29553 avoid deprecation warnings from numpy
result = False
return _check(False, a, b)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't need to have an else here any longer

if isinstance(result, np.ndarray):
tmp = np.zeros(mask.shape, dtype=np.bool)
tmp[mask] = result
result = tmp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok can you add a comment on what this is doing (simnilar to what you just said)

@chrispe
Copy link
Contributor Author

chrispe commented Apr 9, 2020

Hi @jreback, I've made all the changes that you suggested.
Other than those, I've also made the following changes:

  • Added expected types in both of the functions
  • Removed the is_array variables in order to pass the mypy tests.

Can you please check them and let me know what you think? Did I miss anything? Thanks.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. can you add a whatsnew note in 1.1 bug fixes missing section. ping on green.

@jreback jreback added this to the 1.1 milestone Apr 9, 2020
def test_replace_with_dictlike_and_string_dtype(self):
# GH 32621
s = pd.Series(["one", "two", np.nan], dtype="string")
expected = pd.Series(["1", "2", np.nan])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected result here should also be dtype="string" I think

Copy link
Contributor Author

@chrispe chrispe Apr 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently that's not the case (even at the current stable version):

>>> import pandas as pd
>>> pd.__version__
'1.0.3'
>>> s = pd.Series(["one", "two"], dtype="string")
>>> expected = pd.Series(["1", "2"], dtype="string")
>>> result = s.replace(to_replace={"one": "1", "two": "2"})
>>> expected
0    1
1    2
dtype: string
>>> result
0    1
1    2
dtype: object

I'm not sure if that behaviour is to be expected or it should be tackled within a new issue. What do you think? This is not related to containing NA values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that it is certainly expected for replace to again return a string dtype.

Although of course, there are not guarantees that your replacement value is a string ...
And given that it is also not working on master, not sure it needs to be fixed here.

For some cases of replace it does work fine:

In [33]: s = pd.Series(["one", "two"], dtype="string")  

In [34]: s.replace("one", "1")   
Out[34]: 
0      1
1    two
dtype: string

Copy link
Contributor Author

@chrispe chrispe Apr 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation. When not using the argument to_replace (i.e. s.replace("one", "1")) then the trace is different. It doesn't use the same function to apply the replacement in comparison to s.replace(to_replace={"one": "1"}).

For more details check here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L6462

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrispe92 can you open an issue for this

@chrispe
Copy link
Contributor Author

chrispe commented Apr 10, 2020

Hi @jreback, a note for the bug is now included to the whatsnew 1.1 bug fixes missing section. Can you have a look? Thanks.

@jreback jreback merged commit 3cca07c into pandas-dev:master Apr 10, 2020
@jreback
Copy link
Contributor

jreback commented Apr 10, 2020

thanks @chrispe92

if you could open a new issue for the dtype preservation on .replace, this PR solves the current issue so can address the other later.

@chrispe
Copy link
Contributor Author

chrispe commented Apr 11, 2020

thanks @chrispe92

if you could open a new issue for the dtype preservation on .replace, this PR solves the current issue so can address the other later.

Hi @jreback, related issue is now created (#33484)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Replace in string series with NA
7 participants