Skip to content

Should str.replace accept a compiled expression? #15446

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gwerbin opened this issue Feb 17, 2017 · 4 comments · Fixed by #15456
Closed

Should str.replace accept a compiled expression? #15446

gwerbin opened this issue Feb 17, 2017 · 4 comments · Fixed by #15456
Labels
API Design Enhancement Strings String extension data type and string data
Milestone

Comments

@gwerbin
Copy link

gwerbin commented Feb 17, 2017

The current solution is to call str.replace(<compiled_re>.pattern, flags=<compiled_re>.flags) which is relatively ugly and verbose in my opnion.

Here's a contrived example of removing stopwords and normalizing whitespace afterwards:

import pandas as pd
import re

some_names = pd.Series(["three weddings and a funeral", "the big lebowski", "florence and the machine"])

stopwords = ["the", "a", "and"]
stopwords_re = re.compile(r"(\s+)?\b({})\b(\s+)?".format("|".join(stopwords), re.IGNORECASE)
whitespace_re = re.compile(r"\s+")

# desired code:
# some_names.str.replace(stopwords_re, " ").str.strip().str.replace(whitespace_re, " ")

# actual code:
some_names.\
    str.replace(stopwords_re.pattern, " ", flags=stopwords_re.flags).\
    str.strip().str.replace(whitespace_re.pattern, " ", flags=whitespace_re.flags)

Why do I think this is better?

  1. It's nice to have commonly used regular expressions compiled and to carry their flags around with them (and also allows the use of "verbose" regular expressions)
  2. It's not that compiled regular expressions should quack like strings... it's that in this case we're making strings quack like compiled regular expressions, but at the same time not letting those compiled regular expressions quack their own quack.

Is there a good reason not to implement this?

@TomAugspurger
Copy link
Contributor

Is there a good reason not to implement this?

I can't think of one. Feel free to give it a shot!

@TomAugspurger TomAugspurger added API Design Difficulty Novice Enhancement Strings String extension data type and string data labels Feb 17, 2017
@TomAugspurger TomAugspurger added this to the 0.20.0 milestone Feb 17, 2017
@jreback
Copy link
Contributor

jreback commented Feb 17, 2017

yup these reasonable (and straightforward) to do.

@jreback jreback changed the title Should str.replace accept a compiled expression from the re module? Should str.replace accept a compiled expression? Feb 18, 2017
rouzazari added a commit to rouzazari/pandas that referenced this issue Feb 19, 2017
.str.replace now accepts a compiled regular expression.

See pandas-dev#15446
rouzazari added a commit to rouzazari/pandas that referenced this issue Feb 19, 2017
.str.replace now accepts a compiled regular expression.

See pandas-dev#15446

TODO before merging:

- Consider moving check for compiled regex to `pandas.types` and using a dummy compiled regex for type checking.
- Consider what to do with `case` and `flags` parameters, which are noop
  when `pat` is a compiled regex.
- Add compiled regex tests for `str_replace`
- Add compiled regex documentation for `str_replace`
rouzazari added a commit to rouzazari/pandas that referenced this issue Feb 19, 2017
.str.replace now accepts a compiled regular expression.

See pandas-dev#15446

TODO before merging:

- Consider moving check for compiled regex to `pandas.types` and using a dummy compiled regex for type checking.
- Consider what to do with `case` and `flags` parameters, which are noop
  when `pat` is a compiled regex.
- Add compiled regex tests for `str_replace`
- Add compiled regex documentation for `str_replace`
@rouzazari
Copy link
Contributor

I just submitted a PR after looking into this (apologies in advance for the unexpected triple-commit message above).

Very simply, I added a check to pandas.core.strings.str_replace to test whether pat is a compiled regular expression.

A few questions that will help drive this to completion:

  1. Testing whether a variable is a compiled regular expression could be moved to pandas.types. Is this preferred?
  2. I'm not testing if case == False or flags not None, so a user may supply a compiled regular expression along with case and flags and receive output that does not incorporated case and flags. Is it preferred to handle these parameters as noop when provided with a compiled regular expression or to raise an error or warning?
  3. I can't find tests for str_replace. Where would I add test code for adding compiled regular expression handling?

Note:

  • I'm using a protected variable _re._pattern_type to test if pat is a compiled regular expression. Alternatively this can be done by compiling a dummy regular expression just to get the type, but I'll wait until question 1 above is clear.
  • Documentation in str_replace will depend on question 2 above.

rouzazari added a commit to rouzazari/pandas that referenced this issue Feb 21, 2017
.str.replace now accepts a compiled regular expression.

See pandas-dev#15446
rouzazari added a commit to rouzazari/pandas that referenced this issue Feb 24, 2017
.str.replace now accepts a compiled regular expression for `pat`.

See pandas-dev#15446
rouzazari added a commit to rouzazari/pandas that referenced this issue Feb 24, 2017
.str.replace now accepts a compiled regular expression for `pat`.

See pandas-dev#15446
rouzazari added a commit to rouzazari/pandas that referenced this issue Feb 24, 2017
.str.replace now accepts a compiled regular expression for `pat`.

See pandas-dev#15446
rouzazari added a commit to rouzazari/pandas that referenced this issue Mar 2, 2017
- .str.replace now accepts a compiled regular expression for `pat`.
- Signature for .str.replace changed, but remains backwards compatible.

See pandas-dev#15446
rouzazari added a commit to rouzazari/pandas that referenced this issue Mar 2, 2017
- .str.replace now accepts a compiled regular expression for `pat`.
- Signature for .str.replace changed, but remains backwards compatible.

See pandas-dev#15446
rouzazari added a commit to rouzazari/pandas that referenced this issue Mar 3, 2017
- .str.replace now accepts a compiled regular expression for `pat`.
- Signature for .str.replace changed, but remains backwards compatible.

See pandas-dev#15446
rouzazari added a commit to rouzazari/pandas that referenced this issue Mar 3, 2017
- .str.replace now accepts a compiled regular expression for `pat`.
- Signature for .str.replace changed, but remains backwards compatible.

See pandas-dev#15446
rouzazari added a commit to rouzazari/pandas that referenced this issue Mar 4, 2017
- Series.str.replace now accepts a compiled regular expression for `pat`.
- Signature for .str.replace changed, but remains backwards compatible.

See pandas-dev#15446
rouzazari added a commit to rouzazari/pandas that referenced this issue Mar 4, 2017
- Series.str.replace now accepts a compiled regular expression for `pat`.
- Signature for .str.replace changed, but remains backwards compatible.

See pandas-dev#15446
rouzazari added a commit to rouzazari/pandas that referenced this issue Mar 4, 2017
- Series.str.replace now accepts a compiled regular expression for `pat`.
- Signature for .str.replace changed, but remains backwards compatible.

See pandas-dev#15446
jorisvandenbossche pushed a commit that referenced this issue Mar 5, 2017
- Series.str.replace now accepts a compiled regular expression for `pat`.
- Signature for .str.replace changed, but remains backwards compatible.

See #15446
@gwerbin
Copy link
Author

gwerbin commented Mar 6, 2017

Wow, thank you!

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
- Series.str.replace now accepts a compiled regular expression for `pat`.
- Signature for .str.replace changed, but remains backwards compatible.

See pandas-dev#15446
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement Strings String extension data type and string data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants