Skip to content

str.replace('.','') should replace every character? (fix) #24809

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from
25 changes: 25 additions & 0 deletions doc/source/whatsnew/v0.24.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -432,6 +432,30 @@ Backwards incompatible API changes

Pandas 0.24.0 includes a number of API breaking changes.

Replacing strings using Pattern
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Be sure to perform a replace of literal strings by passing the
regex=False parameter to func:`str.replace`. Mainly when the
pattern is 1 size string (:issue:`24809`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed, this is a simple bug fix


Before:

.. ipython:: python

s = pd.Series(['A|B|C'])

result = s.str.replace('|', ' ')
result

After:

.. ipython:: python

s = pd.Series(['A|B|C'])

result = s.str.replace('|', ' ', regex=False)
result

.. _whatsnew_0240.api_breaking.deps:

Expand Down Expand Up @@ -1645,6 +1669,7 @@ Strings
- Bug in :meth:`Index.str.split` was not nan-safe (:issue:`23677`).
- Bug :func:`Series.str.contains` not respecting the ``na`` argument for a ``Categorical`` dtype ``Series`` (:issue:`22158`)
- Bug in :meth:`Index.str.cat` when the result contained only ``NaN`` (:issue:`24044`)
- Bug in :func:`Series.str.replace` not applying regex in patterns of length 1 (:issue:`24809`)

Interval
^^^^^^^^
Expand Down
9 changes: 6 additions & 3 deletions pandas/core/strings.py
Original file line number Diff line number Diff line change
Expand Up @@ -425,7 +425,7 @@ def str_endswith(arr, pat, na=np.nan):
return _na_map(f, arr, na, dtype=bool)


def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True):
def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=None):
r"""
Replace occurrences of pattern/regex in the Series/Index with
some other string. Equivalent to :meth:`str.replace` or
Expand Down Expand Up @@ -564,7 +564,7 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True):
# add case flag, if provided
if case is False:
flags |= re.IGNORECASE
if is_compiled_re or len(pat) > 1 or flags or callable(repl):
if is_compiled_re or pat or flags or callable(repl):
n = n if n >= 0 else 0
compiled = re.compile(pat, flags=flags)
f = lambda x: compiled.sub(repl=repl, string=x, count=n)
Expand All @@ -577,6 +577,9 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True):
if callable(repl):
raise ValueError("Cannot use a callable replacement when "
"regex=False")
if regex==None:
warnings.warn("Warning: Interpreting '%s' as a literal, not a regex... " % pat +
"The default will change in the future.", FutureWarning, stacklevel=3)
f = lambda x: x.replace(pat, repl, n)

return _na_map(f, arr)
Expand Down Expand Up @@ -2529,7 +2532,7 @@ def match(self, pat, case=True, flags=0, na=np.nan):
return self._wrap_result(result, fill_value=na)

@copy(str_replace)
def replace(self, pat, repl, n=-1, case=None, flags=0, regex=True):
def replace(self, pat, repl, n=-1, case=None, flags=0, regex=None):
result = str_replace(self._parent, pat, repl, n=n, case=case,
flags=flags, regex=regex)
return self._wrap_result(result)
Expand Down
17 changes: 16 additions & 1 deletion pandas/tests/test_strings.py
Original file line number Diff line number Diff line change
Expand Up @@ -1008,6 +1008,21 @@ def test_replace(self):
values = klass(data)
pytest.raises(TypeError, values.str.replace, 'a', repl)

# GH 24804
def test_replace_single_pattern(self):
values = Series(['abc', '123'])

result = values.str.replace('.', 'foo', regex=True)
expected = Series(['foofoofoo', 'foofoofoo'])
tm.assert_series_equal(result, expected)

def test_replace_without_specifying_regex_parameter(self):
values = Series(['a.c'])

result = values.str.replace('.', 'b')
expected = Series(['abc'])
tm.assert_series_equal(result, expected)

def test_replace_callable(self):
# GH 15055
values = Series(['fooBAD__barBAD', NA])
Expand Down Expand Up @@ -2924,7 +2939,7 @@ def test_pipe_failures(self):

tm.assert_series_equal(result, exp)

result = s.str.replace('|', ' ')
result = s.str.replace('|', ' ', regex=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm Ok. I think this is correct but arguably an API breaking change so make sure we make note of that in the whatsnew

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this an api change? regex=True is the default

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was just thinking of this particular instance and any where a user was passing in a single character that may have special meaning with a regex. This previously would directly replace a pipe but now requires regex=False in user code, so it could cause some breakage.

Being extra conservative but not tied to the request then if you feel its over communicating.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, but if you think this documentation is not necessary, let me know and I can change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. This is a potentially disruptive change...

Can we:

  1. Change the default regex=None for .str.replace
  2. Detect when a length-1 character is a regex symbol
  3. Warn that it'll change in the future to interpret that character as a regex, not a literal
  4. set regex=False for now to preserve the old (buggy) behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I agree - that's probably the best go-forward path, save the first point which I don't understand.

@alarangeiras can you raise a FutureWarning here instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

save the first point which I don't understand.

Changing the default regex=None? That's so we can detect if we need a warning or not.

If the user passes .str.replace('.', 'b', regex=True), we know to interpret the . as re.compile('.'), so the output would be 'bbb'.

If the user passes .str.replace('.', 'b', regex=False), we know that they want a literal ., so the output is 'abc'.

We'll use regex=None to see if the user is explicit or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just warn when the length of the pattern is 1 and regex=True? Whether or not the user explicitly typed that or relied on the default argument they'd hit the same bug at the end of the day. Don't see value in introducing a None value into a True/False field currently

Copy link
Contributor

@TomAugspurger TomAugspurger Jan 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just warn when the length of the pattern is 1 and regex=True?

Then I think there would be no way to have

In [5]: pd.Series(['a.c']).str.replace('.', 'b')
# Warning: Interpreting '.' as a literal, not a regex... The default will change in the future.
Out[5]:
0    abc
dtype: object
# no warning
In [5]: pd.Series(['a.c']).str.replace('.', 'b', regex=True)
Out[5]:
0    bbb
dtype: object

unless I"m missing something.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following this line of reasoning, from what I understand, every bug found should issue a warning of future adjustment?

exp = Series(['A B C'])

tm.assert_series_equal(result, exp)
Expand Down