-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: str.replace accepts a compiled expression #15456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #15456 +/- ##
==========================================
- Coverage 91.05% 91.03% -0.03%
==========================================
Files 137 137
Lines 49285 49290 +5
==========================================
- Hits 44878 44872 -6
- Misses 4407 4418 +11
Continue to review full report at Codecov.
|
needs tests |
Tests and documentation added. Tests passing. |
pandas/core/strings.py
Outdated
Character sequence or regular expression | ||
pat : string or compiled regex | ||
String can be a character sequence or regular expression. Also, a | ||
compiled regex may be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say starting in 0.20., a compiled regex may be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you use the versionadded then you don't need this
pandas/core/strings.py
Outdated
repl : string or callable | ||
Replacement string or a callable. The callable is passed the regex | ||
match object and must return a replacement string to be used. | ||
See :func:`re.sub`. | ||
|
||
.. versionadded:: 0.20.0 | ||
`pat` also accepts a compiled regex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not in the right place. you can either create another versionadded for pat
pandas/core/strings.py
Outdated
|
||
if use_re: | ||
# Check whether pat is a compiled regex or should be compiled | ||
is_re = isinstance(pat, type(re.compile(''))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a python way to see the re.compile(..) type.
pandas/core/strings.py
Outdated
|
||
if is_re: | ||
if not case or flags: | ||
warnings.warn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a problem. If you want to ignore flags/case, then you need to change those defaults to None
. Then if you have a compilable you can check this (and raise, not warn) if the are not None. If you have a string, then these can be None, at which point you set the default. This should be back-compat.
pandas/core/strings.py
Outdated
else: | ||
f = lambda x: x.replace(pat, repl, n) | ||
|
||
return _na_map(f, arr) | ||
|
||
|
||
def _str_replace_regex_func(regex, repl, n=-1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of doing this, use partial
pandas/tests/test_strings.py
Outdated
@@ -469,6 +469,66 @@ def test_replace_callable(self): | |||
exp = Series(['bAR', NA]) | |||
tm.assert_series_equal(result, exp) | |||
|
|||
def test_replace_regex(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_replace_compiled_regex
Thanks for the feedback. I like your idea of using a partial; and I've implemented it in a revision that simplifies the pattern/compiled-regex case to one Regarding testing for a compiled regex, _pattern_type = type(sre_compile.compile("", 0))
...
def _compile(pattern, flags):
...
if isinstance(pattern, _pattern_type): I also made the document changes. Thanks for catching that. |
0aa1ede
to
a7ea439
Compare
Actually, I just realized that my use of |
Tests/checks passing. @jreback - please review when you have time. Key items to review:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good. need to fix the signature
pandas/core/strings.py
Outdated
""" | ||
|
||
# Check whether repl is valid (GH 13438, GH 15055) | ||
if not (is_string_like(repl) or callable(repl)): | ||
raise TypeError("repl must be a string or callable") | ||
use_re = not case or len(pat) > 1 or flags or callable(repl) | ||
# Check whether pat is a compiled regex or should be compiled | ||
is_compiled_re = isinstance(pat, type(re.compile(""))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use pandas.types.common.is_re
here instead
pandas/core/strings.py
Outdated
# Check whether pat is a compiled regex or should be compiled | ||
is_compiled_re = isinstance(pat, type(re.compile(""))) | ||
use_re = (is_compiled_re or not case or | ||
len(pat) > 1 or flags or callable(repl)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have an immediate check after is_compiled_re
that sees if case/flags
are passed (and raise)
@@ -311,8 +311,12 @@ def str_replace(arr, pat, repl, n=-1, case=True, flags=0): | |||
|
|||
Parameters | |||
---------- | |||
pat : string | |||
Character sequence or regular expression | |||
pat : string or compiled regex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the signature needs to change for case=None
and flags=None
(and then set them if not a is_compiled_re (and they are None)
Thanks for your review. Code is updated, will let it test overnight. FYI - |
pat = re.compile(r'BAD[_]*') | ||
result = values.str.replace(pat, '') | ||
exp = Series(['foobar', NA]) | ||
tm.assert_series_equal(result, exp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so looks like this is failing of the builds: https://circleci.com/gh/pandas-dev/pandas/188?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
This is a build where we override LOCALE='C' You can tests locally by adding
import locale
locale.setlocale(locale.LC_ALL, 'C')
at the top of pandas/pandas/__init__.py
and running with 3.5
not entirely sure what is happening
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tracing it out. I believe the issue is related to re
in python 3.4 requiring an integer for flags
(similar to the same requirement for 2.7). I'll look to fix it today.
pandas/core/strings.py
Outdated
If True, case sensitive | ||
flags : int, default 0 (no flags) | ||
re module flags, e.g. re.IGNORECASE | ||
case : boolean, default None (case sensitive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case : boolean, optional
- if False, case insensistive
- Must be None if `pat` is a compliled regex
similar for flags
pandas/core/strings.py
Outdated
@@ -372,21 +378,47 @@ def str_replace(arr, pat, repl, n=-1, case=True, flags=0): | |||
0 tWO | |||
1 bAR | |||
dtype: object | |||
|
|||
When `pat` is a compiled regex, all flags should be included in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put this under Notes
, and then the example under Examples
pandas/core/strings.py
Outdated
if (case is not None) or (flags is not None): | ||
raise ValueError("case and flags must be None" | ||
" when pat is a compiled regex") | ||
# pre-PY3.5 re package requires that flags is an int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need the comment; you can just always set flags=0
pandas/core/strings.py
Outdated
case = True | ||
if flags is None: | ||
flags = 0 | ||
# add case flag, if provided |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -29,6 +29,7 @@ New features | |||
- Integration with the ``feather-format``, including a new top-level ``pd.read_feather()`` and ``DataFrame.to_feather()`` method, see :ref:`here <io.feather>`. | |||
- ``.str.replace`` now accepts a callable, as replacement, which is passed to ``re.sub`` (:issue:`15055`) | |||
- ``FrozenList`` has gained the ``.difference()`` setop method (:issue:`15475`) | |||
- ``.str.replace`` now accepts a compiled regular expression as pattern (:issue:`15446`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a pattern
just some very minor documentation comments. Also can you add an example in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice PR! Added few comments
doc/source/text.rst
Outdated
@@ -6,6 +6,7 @@ | |||
|
|||
import numpy as np | |||
import pandas as pd | |||
import re |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this import in the example itself?
pandas/core/strings.py
Outdated
- Must be None if `pat` is a compiled regex | ||
flags : int, optional | ||
- re module flags, e.g. re.IGNORECASE | ||
- Must be None if `pat` is a compiled regex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was it needed to change the default of flags
? Because it is always converted to 0 if not specified (None), so I would just keep it 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah agree with @jorisvandenbossche here, simply make flags=0
the default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. flags=None
as default is an artifact of the development process. I'll revert back to flags=0
.
pandas/core/strings.py
Outdated
re module flags, e.g. re.IGNORECASE | ||
case : boolean, optional | ||
- if False, case insensitive | ||
- Must be None if `pat` is a compiled regex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear anymore what the default is ('True' for non-compiled).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default argument is case=None
. Would it be appropriate to describe it as such?
case : boolean, default None
- If None or True, case sensitive
- Must be None if `pat` is a compiled regex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback would it make sense to revert back to case=True
here and test for case is True
when pat
is a compiled regex? That would keep it consistent across strings.py
.
However, I foresee this PR as a potential stepping stone to allow compiled regex in all search-pattern functions in strings.py
, and if case=None
is a better design pattern (versus case=True
) then let's start making that change with this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is then you would have to always set to None
if you are using a compiled regex. If we have it as True
, then you will either have it ignored or misleading (if it differs from the flags in the regex).
so easy enough to have flags=0
be the default.
@jorisvandenbossche thoughts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point on it being misleading when:
regex_pat = re.compile(..., flags=re.IGNORECASE)
str_replace(regex_pat, ..., case=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think we need to have case=None
. Then you say in the doc-string (you have 2 lines)
- if pat is compiled expression then
case
must beNone
- if pat is a string, then case defaults to
True
.
I don't necessarily like this API, but not sure what else we can do.
doc/source/text.rst
Outdated
|
||
.. ipython:: | ||
|
||
@verbatim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use code-block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback This was actually fine, as it ensures to have an ipython block without having it to paste in a code-block
I implemented changes from @jorisvandenbossche and @jreback. Thanks for the comments and discussion. Let me know if you have any other feedback. |
doc/source/text.rst
Outdated
|
||
.. code-block:: python | ||
|
||
>>> s3.str.replace(regex_pat, 'XX-XX ', flags=re.IGNORECASE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trivial, but this should look like an ipython output, literally run it and copy-paste it here.
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -29,6 +29,7 @@ New features | |||
- Integration with the ``feather-format``, including a new top-level ``pd.read_feather()`` and ``DataFrame.to_feather()`` method, see :ref:`here <io.feather>`. | |||
- ``.str.replace`` now accepts a callable, as replacement, which is passed to ``re.sub`` (:issue:`15055`) | |||
- ``FrozenList`` has gained the ``.difference()`` setop method (:issue:`15475`) | |||
- ``.str.replace`` now accepts a compiled regular expression as a pattern (:issue:`15446`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Series.str.replace()
(and fix the one 2 lines above)
@@ -323,15 +328,24 @@ def str_replace(arr, pat, repl, n=-1, case=True, flags=0): | |||
|
|||
n : int, default -1 (all) | |||
Number of replacements to make from start | |||
case : boolean, default True | |||
If True, case sensitive | |||
case : boolean, default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another possiblity here is to make case=object
the default, this allows you to detect if anything is passed, with the nice propery that object
is True
(while None
is False
). Its the same idea, but less error prone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea, especially because of the object
is True
property. This would make it more portable to other search-pattern functions.
The code is much simpler:
if not case:
flags |= re.IGNORECASE
is_compiled_re = is_re(pat)
if is_compiled_re and flags:
raise ValueError("case and flags must be default values"
" when pat is a compiled regex")
use_re = is_compiled_re or len(pat) > 1 or flags or callable(repl)
Here's the documentation I'd imagine works well with that. @jreback, can you please let me know what you think before I run the whole thing through again with the other changes above?
case : boolean, default True (case sensitive)
Must be True if `pat` is a compiled regex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll let joris comment but i think it should work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know using such a sentinel is a way to go when you want to be able to distinguish None (a passed None as a value for the arg you want to catch), but this is not the case here. We can use None to distinguish a passed True or the default None as True. We don't need to be able to catch an explicitly passed None. So I don't understand why using object
is needed here?
Further it has the disadvantage that you get an ugly value in the signature.
As far as I see it, we just need to choose between those options:
- keep default as True, and say that the argument is ignored if compiled regex is passed
- default of None, which is set to True if no compiled regex is passed, and we can raise if something else as None is passed in case of compiled regex.
Code-wise the first is the simplest.
The second we can give more information to the user.
Interested in what @jorisvandenbossche thinks about @jreback's suggestion #15456 (comment), which is implemented in the current version; i.e., use default argument |
doc/source/text.rst
Outdated
|
||
In [1]: s3.str.replace(regex_pat, 'XX-XX ', flags=re.IGNORECASE) | ||
--------------------------------------------------------------------------- | ||
ValueError: case and flags must be default values when pat is a compiled regex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you had before was actually better IMO, as that ensures that the numbering of ipython code blocks is more consistent.
(We don't use the verbatim option of ipython blocks not much, but for such cases this is actually nicer compared to using code-blocks I think)
pandas/core/strings.py
Outdated
case : boolean, default True | ||
If True, case sensitive | ||
case : boolean, default True (case sensitive) | ||
Must be True if `pat` is a compiled regex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this now more confusing, as "must be True for compiled regex" is strange as it in practice not used for a compiled regex
@rouzazari ok, why don;t you revert to the previous logic where |
All sounds good to me. Pushed approved changes and reverting back to I'm also adding an additional test to |
lgtm. ping on green. |
flags |= re.IGNORECASE | ||
regex = re.compile(pat, flags=flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want to keep this compilation of the regex pattern. Now you pass the uncompiled pattern in the lambda (if an uncompiled regex was passed), which will have a slight performance hit I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're definitely onto something that I overlooked: re.sub
is slower than using <compiled_regex>.sub
. This is because re.sub
calls re._compile
, which performs a cache lookup of already-cached regex or compiles the regex (see python/cpython/Lib/re.py), whereas the <compiled_regex>.sub
does not.
I will change the code so that we avoid using re.sub
and avoid the performance hit of the checking the cache each time.
pandas/core/strings.py
Outdated
case : boolean, default None | ||
- If True, case sensitive | ||
- Defaults to True if `pat` is a string | ||
- Must be None if `pat` is a compiled regex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like you have to set it to None yourself if passing a compiled regex, while you actually just shouldn't use it. So I would make this something like "Can not be set if pat
is a compiled regex."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's better. I'll make the change for flags
as well.
- Series.str.replace now accepts a compiled regular expression for `pat`. - Signature for .str.replace changed, but remains backwards compatible. See pandas-dev#15446
can u add an asv for that case with re.sub? |
I'm not too familiar with asv yet but it seems straight forward and worth learning. Just to make sure I understand, are you suggesting that I test Looks like there is a file to step from ( |
No, the idea is to have a test that uses
There is indeed already a test, so no need to add one I think. |
@rouzazari Thanks a lot! |
Great! I enjoyed working on this with both of you. Thanks for your help along the way. Is it worth it to add a similar enhancement to other pattern-matching functions? |
@rouzazari you too! sure, why don't you open an issue with an example of what needs to change (then of course a PR is welcome) |
- Series.str.replace now accepts a compiled regular expression for `pat`. - Signature for .str.replace changed, but remains backwards compatible. See pandas-dev#15446
.str.replace now accepts a compiled regular expression.
See #15446
git diff upstream/master | flake8 --diff