Skip to content

ENH: str.replace accepts a compiled expression #15456

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 5, 2017

Conversation

rouzazari
Copy link
Contributor

@rouzazari rouzazari commented Feb 19, 2017

.str.replace now accepts a compiled regular expression.

See #15446

@codecov-io
Copy link

codecov-io commented Feb 19, 2017

Codecov Report

Merging #15456 into master will decrease coverage by -0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15456      +/-   ##
==========================================
- Coverage   91.05%   91.03%   -0.03%     
==========================================
  Files         137      137              
  Lines       49285    49290       +5     
==========================================
- Hits        44878    44872       -6     
- Misses       4407     4418      +11
Impacted Files Coverage Δ
pandas/core/strings.py 98.49% <100%> (+0.01%)
pandas/io/gbq.py 25% <0%> (-58.34%)
pandas/tools/merge.py 91.78% <0%> (-0.35%)
pandas/core/frame.py 97.83% <0%> (-0.1%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b77680...624637f. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Feb 20, 2017

needs tests

@jreback jreback added Enhancement Strings String extension data type and string data labels Feb 20, 2017
@rouzazari
Copy link
Contributor Author

Tests and documentation added. Tests passing.

Character sequence or regular expression
pat : string or compiled regex
String can be a character sequence or regular expression. Also, a
compiled regex may be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say starting in 0.20., a compiled regex may be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you use the versionadded then you don't need this

repl : string or callable
Replacement string or a callable. The callable is passed the regex
match object and must return a replacement string to be used.
See :func:`re.sub`.

.. versionadded:: 0.20.0
`pat` also accepts a compiled regex.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not in the right place. you can either create another versionadded for pat


if use_re:
# Check whether pat is a compiled regex or should be compiled
is_re = isinstance(pat, type(re.compile('')))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a python way to see the re.compile(..) type.


if is_re:
if not case or flags:
warnings.warn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a problem. If you want to ignore flags/case, then you need to change those defaults to None. Then if you have a compilable you can check this (and raise, not warn) if the are not None. If you have a string, then these can be None, at which point you set the default. This should be back-compat.

else:
f = lambda x: x.replace(pat, repl, n)

return _na_map(f, arr)


def _str_replace_regex_func(regex, repl, n=-1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of doing this, use partial

@@ -469,6 +469,66 @@ def test_replace_callable(self):
exp = Series(['bAR', NA])
tm.assert_series_equal(result, exp)

def test_replace_regex(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_replace_compiled_regex

@rouzazari
Copy link
Contributor Author

Thanks for the feedback.

I like your idea of using a partial; and I've implemented it in a revision that simplifies the pattern/compiled-regex case to one lambda using a partial. The re package will raise an error if case or flags are provided. Please let me know what you think.

Regarding testing for a compiled regex, python/cpython/re package tests in a similar way (see code below). Let me know if you think we should test differently.

_pattern_type = type(sre_compile.compile("", 0))
...
def _compile(pattern, flags):
...
    if isinstance(pattern, _pattern_type):

I also made the document changes. Thanks for catching that.

@rouzazari rouzazari force-pushed the GH15446 branch 2 times, most recently from 0aa1ede to a7ea439 Compare February 24, 2017 19:35
@rouzazari
Copy link
Contributor Author

Actually, I just realized that my use of partial is not necessary. I revised to use a simple lambda with re.sub. I'll let the tests finish before pinging back.

@rouzazari
Copy link
Contributor Author

rouzazari commented Feb 25, 2017

Tests/checks passing.

@jreback - please review when you have time. Key items to review:

  • are we comfortable with checking if pat is a compiled regex using type / isinstance? This is how re checks to see if a pattern is a compiled regex.
  • can we let re package raise errors when flags is provided with a compiled pattern (as is the default behavior for re.sub), or should pandas raise its own error?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. need to fix the signature

"""

# Check whether repl is valid (GH 13438, GH 15055)
if not (is_string_like(repl) or callable(repl)):
raise TypeError("repl must be a string or callable")
use_re = not case or len(pat) > 1 or flags or callable(repl)
# Check whether pat is a compiled regex or should be compiled
is_compiled_re = isinstance(pat, type(re.compile("")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use pandas.types.common.is_re here instead

# Check whether pat is a compiled regex or should be compiled
is_compiled_re = isinstance(pat, type(re.compile("")))
use_re = (is_compiled_re or not case or
len(pat) > 1 or flags or callable(repl))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have an immediate check after is_compiled_re that sees if case/flags are passed (and raise)

@@ -311,8 +311,12 @@ def str_replace(arr, pat, repl, n=-1, case=True, flags=0):

Parameters
----------
pat : string
Character sequence or regular expression
pat : string or compiled regex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the signature needs to change for case=None and flags=None (and then set them if not a is_compiled_re (and they are None)

@rouzazari
Copy link
Contributor Author

Thanks for your review. Code is updated, will let it test overnight.

FYI - flags must be an int for Python 2.7's re.sub (i.e. flags=None will raise an error). I added a compat.PY2 test to set flags=0 when pat is a compiled regex. Not sure if there is a better pattern for that.

pat = re.compile(r'BAD[_]*')
result = values.str.replace(pat, '')
exp = Series(['foobar', NA])
tm.assert_series_equal(result, exp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so looks like this is failing of the builds: https://circleci.com/gh/pandas-dev/pandas/188?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

This is a build where we override LOCALE='C' You can tests locally by adding

import locale
locale.setlocale(locale.LC_ALL, 'C')

at the top of pandas/pandas/__init__.py

and running with 3.5

not entirely sure what is happening

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tracing it out. I believe the issue is related to re in python 3.4 requiring an integer for flags (similar to the same requirement for 2.7). I'll look to fix it today.

If True, case sensitive
flags : int, default 0 (no flags)
re module flags, e.g. re.IGNORECASE
case : boolean, default None (case sensitive)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case : boolean, optional
    - if False, case insensistive
    - Must be None if `pat` is a compliled regex

similar for flags

@@ -372,21 +378,47 @@ def str_replace(arr, pat, repl, n=-1, case=True, flags=0):
0 tWO
1 bAR
dtype: object

When `pat` is a compiled regex, all flags should be included in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this under Notes, and then the example under Examples

if (case is not None) or (flags is not None):
raise ValueError("case and flags must be None"
" when pat is a compiled regex")
# pre-PY3.5 re package requires that flags is an int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the comment; you can just always set flags=0

case = True
if flags is None:
flags = 0
# add case flag, if provided
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line

@@ -29,6 +29,7 @@ New features
- Integration with the ``feather-format``, including a new top-level ``pd.read_feather()`` and ``DataFrame.to_feather()`` method, see :ref:`here <io.feather>`.
- ``.str.replace`` now accepts a callable, as replacement, which is passed to ``re.sub`` (:issue:`15055`)
- ``FrozenList`` has gained the ``.difference()`` setop method (:issue:`15475`)
- ``.str.replace`` now accepts a compiled regular expression as pattern (:issue:`15446`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a pattern

@jreback jreback added this to the 0.20.0 milestone Mar 2, 2017
@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

just some very minor documentation comments. Also can you add an example in test.rst (and have it say versionadded 0.20.0 as well.

@jorisvandenbossche

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR! Added few comments

@@ -6,6 +6,7 @@

import numpy as np
import pandas as pd
import re
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this import in the example itself?

- Must be None if `pat` is a compiled regex
flags : int, optional
- re module flags, e.g. re.IGNORECASE
- Must be None if `pat` is a compiled regex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was it needed to change the default of flags ? Because it is always converted to 0 if not specified (None), so I would just keep it 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah agree with @jorisvandenbossche here, simply make flags=0 the default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. flags=None as default is an artifact of the development process. I'll revert back to flags=0.

re module flags, e.g. re.IGNORECASE
case : boolean, optional
- if False, case insensitive
- Must be None if `pat` is a compiled regex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear anymore what the default is ('True' for non-compiled).

Copy link
Contributor Author

@rouzazari rouzazari Mar 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default argument is case=None. Would it be appropriate to describe it as such?

    case : boolean, default None
        - If None or True, case sensitive
        - Must be None if `pat` is a compiled regex

Copy link
Contributor Author

@rouzazari rouzazari Mar 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback would it make sense to revert back to case=True here and test for case is True when pat is a compiled regex? That would keep it consistent across strings.py.

However, I foresee this PR as a potential stepping stone to allow compiled regex in all search-pattern functions in strings.py, and if case=None is a better design pattern (versus case=True) then let's start making that change with this function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is then you would have to always set to None if you are using a compiled regex. If we have it as True, then you will either have it ignored or misleading (if it differs from the flags in the regex).

so easy enough to have flags=0 be the default.

@jorisvandenbossche thoughts here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on it being misleading when:

regex_pat = re.compile(..., flags=re.IGNORECASE)
str_replace(regex_pat, ..., case=True)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think we need to have case=None. Then you say in the doc-string (you have 2 lines)

  • if pat is compiled expression then case must be None
  • if pat is a string, then case defaults to True.

I don't necessarily like this API, but not sure what else we can do.


.. ipython::

@verbatim
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use code-block

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback This was actually fine, as it ensures to have an ipython block without having it to paste in a code-block

@rouzazari
Copy link
Contributor Author

I implemented changes from @jorisvandenbossche and @jreback. Thanks for the comments and discussion. Let me know if you have any other feedback.


.. code-block:: python

>>> s3.str.replace(regex_pat, 'XX-XX ', flags=re.IGNORECASE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trivial, but this should look like an ipython output, literally run it and copy-paste it here.

@@ -29,6 +29,7 @@ New features
- Integration with the ``feather-format``, including a new top-level ``pd.read_feather()`` and ``DataFrame.to_feather()`` method, see :ref:`here <io.feather>`.
- ``.str.replace`` now accepts a callable, as replacement, which is passed to ``re.sub`` (:issue:`15055`)
- ``FrozenList`` has gained the ``.difference()`` setop method (:issue:`15475`)
- ``.str.replace`` now accepts a compiled regular expression as a pattern (:issue:`15446`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Series.str.replace() (and fix the one 2 lines above)

@@ -323,15 +328,24 @@ def str_replace(arr, pat, repl, n=-1, case=True, flags=0):

n : int, default -1 (all)
Number of replacements to make from start
case : boolean, default True
If True, case sensitive
case : boolean, default None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another possiblity here is to make case=object the default, this allows you to detect if anything is passed, with the nice propery that object is True (while None is False). Its the same idea, but less error prone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, especially because of the object is True property. This would make it more portable to other search-pattern functions.

The code is much simpler:

    if not case:
        flags |= re.IGNORECASE

    is_compiled_re = is_re(pat)
    if is_compiled_re and flags:
        raise ValueError("case and flags must be default values"
                         " when pat is a compiled regex")

    use_re = is_compiled_re or len(pat) > 1 or flags or callable(repl)

Here's the documentation I'd imagine works well with that. @jreback, can you please let me know what you think before I run the whole thing through again with the other changes above?

    case : boolean, default True (case sensitive)
        Must be True if `pat` is a compiled regex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll let joris comment but i think it should work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know using such a sentinel is a way to go when you want to be able to distinguish None (a passed None as a value for the arg you want to catch), but this is not the case here. We can use None to distinguish a passed True or the default None as True. We don't need to be able to catch an explicitly passed None. So I don't understand why using object is needed here?
Further it has the disadvantage that you get an ugly value in the signature.

As far as I see it, we just need to choose between those options:

  • keep default as True, and say that the argument is ignored if compiled regex is passed
  • default of None, which is set to True if no compiled regex is passed, and we can raise if something else as None is passed in case of compiled regex.

Code-wise the first is the simplest.
The second we can give more information to the user.

@rouzazari
Copy link
Contributor Author

Interested in what @jorisvandenbossche thinks about @jreback's suggestion #15456 (comment), which is implemented in the current version; i.e., use default argument case=object to make it truthy.


In [1]: s3.str.replace(regex_pat, 'XX-XX ', flags=re.IGNORECASE)
---------------------------------------------------------------------------
ValueError: case and flags must be default values when pat is a compiled regex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you had before was actually better IMO, as that ensures that the numbering of ipython code blocks is more consistent.
(We don't use the verbatim option of ipython blocks not much, but for such cases this is actually nicer compared to using code-blocks I think)

case : boolean, default True
If True, case sensitive
case : boolean, default True (case sensitive)
Must be True if `pat` is a compiled regex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this now more confusing, as "must be True for compiled regex" is strange as it in practice not used for a compiled regex

@jreback
Copy link
Contributor

jreback commented Mar 4, 2017

@rouzazari ok, why don;t you revert to the previous logic where case=None, flags=0 and consistent doc-strings.

@rouzazari
Copy link
Contributor Author

All sounds good to me. Pushed approved changes and reverting back to case=None, flags=0.

I'm also adding an additional test to test_strings.py (case=True with compiled regex should throw an error). I was not testing for this before, but should have been.

@jreback
Copy link
Contributor

jreback commented Mar 4, 2017

lgtm. ping on green.

flags |= re.IGNORECASE
regex = re.compile(pat, flags=flags)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to keep this compilation of the regex pattern. Now you pass the uncompiled pattern in the lambda (if an uncompiled regex was passed), which will have a slight performance hit I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're definitely onto something that I overlooked: re.sub is slower than using <compiled_regex>.sub. This is because re.sub calls re._compile, which performs a cache lookup of already-cached regex or compiles the regex (see python/cpython/Lib/re.py), whereas the <compiled_regex>.sub does not.

I will change the code so that we avoid using re.sub and avoid the performance hit of the checking the cache each time.

case : boolean, default None
- If True, case sensitive
- Defaults to True if `pat` is a string
- Must be None if `pat` is a compiled regex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like you have to set it to None yourself if passing a compiled regex, while you actually just shouldn't use it. So I would make this something like "Can not be set if pat is a compiled regex."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's better. I'll make the change for flags as well.

- Series.str.replace now accepts a compiled regular expression for `pat`.
- Signature for .str.replace changed, but remains backwards compatible.

See pandas-dev#15446
@jreback
Copy link
Contributor

jreback commented Mar 4, 2017

can u add an asv for that case with re.sub?

@rouzazari
Copy link
Contributor Author

I'm not too familiar with asv yet but it seems straight forward and worth learning.

Just to make sure I understand, are you suggesting that I test re.sub versus pattern.sub (where pattern is a compiled regex) using asv?

Looks like there is a file to step from (asv_bench/benchmarks/strings.py) with a test_replace benchmark.

@jorisvandenbossche
Copy link
Member

Just to make sure I understand, are you suggesting that I test re.sub versus pattern.sub (where pattern is a compiled regex) using asv?

No, the idea is to have a test that uses str.replace itself that you can benchmark, and eg in this case could detect a slowdown if you change the implementation from using pattern.sub to re.sub(patters, ..).

Looks like there is a file to step from (asv_bench/benchmarks/strings.py) with a test_replace benchmark.

There is indeed already a test, so no need to add one I think.
You can try to run this test to see of there is any performance impact of this PR (http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-performance-test-suite), although I don't expect that for this PR (now that you used the compiling).

@jorisvandenbossche jorisvandenbossche merged commit a00ad37 into pandas-dev:master Mar 5, 2017
@jorisvandenbossche
Copy link
Member

@rouzazari Thanks a lot!

@rouzazari
Copy link
Contributor Author

Great! I enjoyed working on this with both of you. Thanks for your help along the way.

Is it worth it to add a similar enhancement to other pattern-matching functions?

@jreback
Copy link
Contributor

jreback commented Mar 5, 2017

@rouzazari you too!

sure, why don't you open an issue with an example of what needs to change (then of course a PR is welcome)

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
- Series.str.replace now accepts a compiled regular expression for `pat`.
- Signature for .str.replace changed, but remains backwards compatible.

See pandas-dev#15446
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should str.replace accept a compiled expression?
4 participants