
ENH: Allow regex matching in fullmatch mode #32806


Closed
frreiss opened this issue Mar 18, 2020 · 0 comments · Fixed by #32807
Labels: API - Consistency (Internal Consistency of API/Behavior), Strings (String extension data type and string data)

frreiss commented Mar 18, 2020

Problem description

Series.str has methods for every regular expression matching mode in the re package except re.fullmatch(). fullmatch only returns a match when the pattern covers the entire input string. match, by contrast, also accepts a match that starts at the beginning of the string but stops before its end.
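To make the difference concrete, here is a minimal standard-library sketch (no pandas involved) comparing the two modes on the same pattern. The variable names are illustrative only.

```python
import re

pattern = re.compile("foo")

# re.match anchors only at the start, so matching a prefix is enough.
prefix_hit = pattern.match("foobar") is not None

# re.fullmatch requires the pattern to consume the whole string.
full_hit_partial = pattern.fullmatch("foobar") is not None
full_hit_exact = pattern.fullmatch("foo") is not None

print(prefix_hit, full_hit_partial, full_hit_exact)  # True False True
```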

One can work around the lack of fullmatch by round-tripping to/from numpy arrays and using np.vectorize, for example:

>>> import re
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series(["foo", "bar", "foobar"])
>>> my_regex = "foo"
>>> compiled_regex = re.compile(my_regex)
>>> # Wrap the per-string check so it can be applied to a whole array
>>> regex_f = np.vectorize(lambda x: compiled_regex.fullmatch(x) is not None)
>>> matches_array = regex_f(s.values)
>>> matches_series = pd.Series(matches_array)
>>> matches_series
0     True
1    False
2    False
dtype: bool

but it would be more convenient for users if fullmatch were built in.

The fullmatch function was added to the re package in Python 3.4. I think this method was missing from earlier versions of pandas because older versions of Python did not have re.fullmatch. As of pandas 1.0, every supported version of Python has fullmatch.

I have a pull request ready that adds this functionality. After my changes, the Series.str namespace gets a new method fullmatch that evaluates re.fullmatch over the series. For example:

>>> s = pd.Series(["foo", "bar", "foobar"])
>>> s.str.fullmatch("foo")
0     True
1    False
2    False
dtype: bool
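Conceptually, the proposed method applies re.fullmatch to each element. The following is a rough pure-Python sketch of that idea over a plain list rather than a Series. The helper name fullmatch_each is hypothetical and not part of pandas, and passing None through unchanged is an assumption about how missing values might be treated, not a description of the pull request.

```python
import re
from typing import List, Optional


def fullmatch_each(values: List[Optional[str]], pattern: str) -> List[Optional[bool]]:
    """Hypothetical helper: apply re.fullmatch to each element of a list."""
    compiled = re.compile(pattern)
    return [
        # Assumption: missing values (None) are passed through untouched.
        None if value is None else compiled.fullmatch(value) is not None
        for value in values
    ]


result = fullmatch_each(["foo", "bar", "foobar", None], "foo")
print(result)  # [True, False, False, None]
```

Compiling the pattern once outside the loop mirrors the compiled_regex step in the workaround above.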

[Edit: Simplified the workaround]

jreback added the API Design, Strings (String extension data type and string data), and API - Consistency (Internal Consistency of API/Behavior) labels and removed the API Design label on Mar 19, 2020
jreback added this to the 1.1 milestone on Mar 19, 2020