-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix replacing in string
series with NA (pandas-dev#32621)
#32890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The pd.NA values are replaced with np.nan before comparing the arrays/scalars
Made improvements based on the tests which failed
Added change to resolve linting check
Added test for the reported bug
I think the problem here is actually that the element wise comparison reduces to a scalar: >>> np.array(["one", "two", np.nan]) == "one"
array([ True, False, False])
>>> np.array(["one", "two", pd.NA]) == "one"
__main__:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
False Which then gets caught in some of the functions here @TomAugspurger or @jorisvandenbossche do you know the backstory to that behavior? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment
Hi @WillAyd, please check my comment #32890 (comment) which further explains why probably that warning is raised. |
Hello @chrispe92! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-04-10 11:09:11 UTC |
Hi @jreback, Can you have a look my latest changes? I'm curious to know what you think. |
pandas/core/internals/managers.py
Outdated
if isinstance(result, np.ndarray): | ||
tmp = np.zeros(mask.shape, dtype=np.bool) | ||
tmp[mask] = result | ||
result = tmp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not just
result[mask] = False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the shape of the mask can differ to that of the result, since we may select to compare only a subset of a's or b's elements (i.e. some of them may be NAs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok can you add a comment on what this is doing (simnilar to what you just said)
pandas/core/internals/managers.py
Outdated
@@ -1952,8 +1952,25 @@ def _compare_or_regex_search(a, b, regex=False): | |||
# GH#29553 avoid deprecation warnings from numpy | |||
result = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's just return here (you will need move the check on like 194 into a function ,e.g.
if is_datetimeliek......:
return check(False)
.....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jreback, I'm not sure if I completely understood the change you requested to make here. But I assume you wanted to place the entire block that checks the result's value into a function instead? If so, I did that and made some additional changes (c32a2cc) in order to have it working. Can you please confirm? Thanks
pandas/core/internals/managers.py
Outdated
@@ -1952,8 +1952,25 @@ def _compare_or_regex_search(a, b, regex=False): | |||
# GH#29553 avoid deprecation warnings from numpy | |||
result = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update for this
pandas/core/internals/managers.py
Outdated
elif is_a_array and is_b_array: | ||
mask = ~(isna(a) | isna(b)) | ||
|
||
if is_a_array: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you not do this in the above logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/core/internals/managers.py
Outdated
@@ -1941,6 +1941,24 @@ def _compare_or_regex_search(a, b, regex=False): | |||
------- | |||
mask : array_like of bool | |||
""" | |||
|
|||
def _check(result, a, b): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you give this a more informative name, maybe _check_comparision_types
and type the input args as much as possible
pandas/core/internals/managers.py
Outdated
if is_datetimelike_v_numeric(a, b) or is_numeric_v_string_like(a, b): | ||
# GH#29553 avoid deprecation warnings from numpy | ||
result = False | ||
return _check(False, a, b) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't need to have an else here any longer
pandas/core/internals/managers.py
Outdated
if isinstance(result, np.ndarray): | ||
tmp = np.zeros(mask.shape, dtype=np.bool) | ||
tmp[mask] = result | ||
result = tmp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok can you add a comment on what this is doing (simnilar to what you just said)
Hi @jreback, I've made all the changes that you suggested.
Can you please check them and let me know what you think? Did I miss anything? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. can you add a whatsnew note in 1.1 bug fixes missing section. ping on green.
def test_replace_with_dictlike_and_string_dtype(self): | ||
# GH 32621 | ||
s = pd.Series(["one", "two", np.nan], dtype="string") | ||
expected = pd.Series(["1", "2", np.nan]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expected result here should also be dtype="string"
I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently that's not the case (even at the current stable version):
>>> import pandas as pd
>>> pd.__version__
'1.0.3'
>>> s = pd.Series(["one", "two"], dtype="string")
>>> expected = pd.Series(["1", "2"], dtype="string")
>>> result = s.replace(to_replace={"one": "1", "two": "2"})
>>> expected
0 1
1 2
dtype: string
>>> result
0 1
1 2
dtype: object
I'm not sure if that behaviour is to be expected or it should be tackled within a new issue. What do you think? This is not related to containing NA values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that it is certainly expected for replace
to again return a string dtype.
Although of course, there are not guarantees that your replacement value is a string ...
And given that it is also not working on master, not sure it needs to be fixed here.
For some cases of replace
it does work fine:
In [33]: s = pd.Series(["one", "two"], dtype="string")
In [34]: s.replace("one", "1")
Out[34]:
0 1
1 two
dtype: string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good observation. When not using the argument to_replace
(i.e. s.replace("one", "1")
) then the trace is different. It doesn't use the same function to apply the replacement in comparison to s.replace(to_replace={"one": "1"})
.
For more details check here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L6462
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chrispe92 can you open an issue for this
Hi @jreback, a note for the bug is now included to the whatsnew 1.1 bug fixes missing section. Can you have a look? Thanks. |
thanks @chrispe92 if you could open a new issue for the dtype preservation on |
|
The pd.NA values are replaced with np.nan before comparing the arrays/scalars
string
series with NA #32621black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff