
Commit 3ebd769

Merge pull request pandas-dev#5224 from daniel-ballan/redefine-match
2 parents 30dcacf + 266a2e3 commit 3ebd769

4 files changed: +142 -31 lines changed

doc/source/basics.rst

+28-8
@@ -960,6 +960,9 @@ importantly, these methods exclude missing/NA values automatically. These are
 accessed via the Series's ``str`` attribute and generally have names matching
 the equivalent (scalar) built-in string methods:

+Splitting and Replacing Strings
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 .. ipython:: python

    s = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
@@ -990,11 +993,12 @@ Methods like ``replace`` and ``findall`` take regular expressions, too:
    s3
    s3.str.replace('^.a|dog', 'XX-XX ', case=False)

-The method ``match`` returns the groups in a regular expression in one tuple.
-Starting in pandas version 0.13.0, the method ``extract`` is available to
-accomplish this more conveniently.
+Extracting Substrings
+~~~~~~~~~~~~~~~~~~~~~

-Extracting a regular expression with one group returns a Series of strings.
+The method ``extract`` (introduced in version 0.13) accepts regular expressions
+with match groups. Extracting a regular expression with one group returns
+a Series of strings.

 .. ipython:: python

@@ -1016,18 +1020,34 @@ Named groups like

 .. ipython:: python

-   Series(['a1', 'b2', 'c3']).str.match('(?P<letter>[ab])(?P<digit>\d)')
+   Series(['a1', 'b2', 'c3']).str.extract('(?P<letter>[ab])(?P<digit>\d)')

 and optional groups like

 .. ipython:: python

-   Series(['a1', 'b2', '3']).str.match('(?P<letter>[ab])?(?P<digit>\d)')
+   Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')

 can also be used.

-Methods like ``contains``, ``startswith``, and ``endswith`` takes an extra
-``na`` arguement so missing values can be considered True or False:
+Testing for Strings that Match or Contain a Pattern
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In previous versions, *extracting* match groups was accomplished by ``match``,
+which returned a not-so-convenient Series of tuples. Starting in version 0.14,
+the default behavior of ``match`` will change. It will return a boolean
+indexer, analogous to the method ``contains``.
+
+The distinction between
+``match`` and ``contains`` is strictness: ``match`` relies on
+strict ``re.match`` while ``contains`` relies on ``re.search``.
+
+In version 0.13, ``match`` performs its old, deprecated behavior by default,
+but the new behavior is available through the keyword argument
+``as_indexer=True``.
+
+Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
+an extra ``na`` argument so missing values can be considered True or False:

 .. ipython:: python

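The strictness distinction described in the documentation hunk above can be seen with the standard library alone. The following is a minimal sketch using plain `re` on a list of strings rather than a Series; the sample strings are made up for illustration.

```python
import re

pattern = re.compile(r"dog")

# re.match only succeeds if the pattern matches at the start of the string.
anchored = [bool(pattern.match(s)) for s in ["dog", "hotdog"]]

# re.search succeeds if the pattern matches anywhere in the string.
anywhere = [bool(pattern.search(s)) for s in ["dog", "hotdog"]]

print(anchored)  # [True, False]
print(anywhere)  # [True, True]
```

This is the difference the new ``match`` (anchored) and ``contains`` (anywhere) are documented to inherit.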

doc/source/v0.13.0.txt

+8
@@ -102,6 +102,14 @@ Deprecated in 0.13.0
 - deprecated ``iterkv``, which will be removed in a future release (this was
   an alias of iteritems used to bypass ``2to3``'s changes).
   (:issue:`4384`, :issue:`4375`, :issue:`4372`)
+- deprecated the string method ``match``, whose role is now performed more
+  idiomatically by ``extract``. In a future release, the default behavior
+  of ``match`` will change to become analogous to ``contains``, which returns
+  a boolean indexer. (Their
+  distinction is strictness: ``match`` relies on ``re.match`` while
+  ``contains`` relies on ``re.search``.) In this release, the deprecated
+  behavior is the default, but the new behavior is available through the
+  keyword argument ``as_indexer=True``.

 Indexing API Changes
 ~~~~~~~~~~~~~~~~~~~~

pandas/core/strings.py

+56-17
@@ -7,7 +7,7 @@
 import pandas.compat as compat
 import re
 import pandas.lib as lib
-
+import warnings

 def _get_array_list(arr, others):
     if isinstance(others[0], (list, np.ndarray)):
@@ -169,6 +169,10 @@ def str_contains(arr, pat, case=True, flags=0, na=np.nan):

     regex = re.compile(pat, flags=flags)

+    if regex.groups > 0:
+        warnings.warn("This pattern has match groups. To actually get the"
+                      " groups, use str.extract.", UserWarning)
+
     f = lambda x: bool(regex.search(x))
     return _na_map(f, arr, na)

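The warning added to ``str_contains`` is keyed on the compiled pattern's ``groups`` attribute, which counts capturing groups only. As a small illustration (the patterns here are examples, not taken from the library), a pattern written with non-capturing groups reports zero groups and so would not meet the ``regex.groups > 0`` condition:

```python
import re

# Capturing groups are counted by the compiled pattern's ``groups`` attribute.
with_groups = re.compile(r"(BAD[_]+).*(BAD)")

# Non-capturing groups, written (?:...), are not counted.
without_groups = re.compile(r"(?:BAD[_]+).*(?:BAD)")

print(with_groups.groups)     # 2
print(without_groups.groups)  # 0
```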

@@ -303,35 +307,70 @@ def rep(x, r):
         return result


-def str_match(arr, pat, flags=0):
+def str_match(arr, pat, case=True, flags=0, na=np.nan, as_indexer=False):
     """
-    Find groups in each string (from beginning) using passed regular expression
+    Deprecated: Find groups in each string using passed regular expression.
+    If as_indexer=True, determine if each string matches a regular expression.

     Parameters
     ----------
     pat : string
-        Pattern or regular expression
+        Character sequence or regular expression
+    case : boolean, default True
+        If True, case sensitive
     flags : int, default 0 (no flags)
         re module flags, e.g. re.IGNORECASE
+    na : default NaN, fill value for missing values.
+    as_indexer : False, by default, gives deprecated behavior better achieved
+        using str_extract. True return boolean indexer.
+

     Returns
     -------
-    matches : array
+    matches : boolean array (if as_indexer=True)
+    matches : array of tuples (if as_indexer=False, default but deprecated)
+
+    Note
+    ----
+    To extract matched groups, which is the deprecated behavior of match, use
+    str.extract.
     """
+
+    if not case:
+        flags |= re.IGNORECASE
+
     regex = re.compile(pat, flags=flags)

-    def f(x):
-        m = regex.match(x)
-        if m:
-            return m.groups()
-        else:
-            return []
+    if (not as_indexer) and regex.groups > 0:
+        # Do this first, to make sure it happens even if the re.compile
+        # raises below.
+        warnings.warn("In future versions of pandas, match will change to"
+                      " always return a bool indexer.""", UserWarning)
+
+    if as_indexer and regex.groups > 0:
+        warnings.warn("This pattern has match groups. To actually get the"
+                      " groups, use str.extract.""", UserWarning)
+
+    # If not as_indexer and regex.groups == 0, this returns empty lists
+    # and is basically useless, so we will not warn.
+
+    if (not as_indexer) and regex.groups > 0:
+        def f(x):
+            m = regex.match(x)
+            if m:
+                return m.groups()
+            else:
+                return []
+    else:
+        # This is the new behavior of str_match.
+        f = lambda x: bool(regex.match(x))

     return _na_map(f, arr)

+
 def str_extract(arr, pat, flags=0):
     """
-    Find groups in each string (from beginning) using passed regular expression
+    Find groups in each string using passed regular expression

     Parameters
     ----------
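The branching in ``str_match`` can be sketched on a plain Python list. ``match_sketch`` below is a hypothetical stand-in, not the pandas function: it treats any non-string entry as missing and fills it with ``na``. That fill is an assumption about the intended behavior, since the hunk as shown calls ``_na_map(f, arr)`` without passing ``na``.

```python
import re
import warnings


def match_sketch(values, pat, as_indexer=False, na=None):
    """Hypothetical stand-in for str_match operating on a plain list."""
    regex = re.compile(pat)
    if regex.groups > 0:
        if as_indexer:
            message = "This pattern has match groups; use extract to get them."
        else:
            message = "match will change to always return a bool indexer."
        warnings.warn(message, UserWarning)

    if not as_indexer and regex.groups > 0:
        # Deprecated behavior: a tuple of groups, or [] when nothing matches.
        def f(x):
            m = regex.match(x)
            return m.groups() if m else []
    else:
        # New behavior: True/False per string.
        def f(x):
            return bool(regex.match(x))

    # Assumption: non-string entries count as missing and are filled with na.
    return [f(x) if isinstance(x, str) else na for x in values]


data = ["fooBAD__barBAD", None, "foo"]
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    old = match_sketch(data, r".*(BAD[_]+).*(BAD)")
    new = match_sketch(data, r".*(BAD[_]+).*(BAD)", as_indexer=True)
    no_groups = match_sketch(data, r".*BAD[_]+.*BAD")

print(old)        # [('BAD__', 'BAD'), None, []]
print(new)        # [True, None, False]
print(no_groups)  # [True, None, False]
```

The third call shows the case the code comment describes: with no groups in the pattern, the boolean result is returned even though ``as_indexer`` is left at its default.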
@@ -358,7 +397,7 @@ def str_extract(arr, pat, flags=0):
         def f(x):
             if not isinstance(x, compat.string_types):
                 return None
-            m = regex.match(x)
+            m = regex.search(x)
             if m:
                 return m.groups()[0]  # may be None
             else:
@@ -368,7 +407,7 @@ def f(x):
         def f(x):
             if not isinstance(x, compat.string_types):
                 return empty_row
-            m = regex.match(x)
+            m = regex.search(x)
             if m:
                 return Series(list(m.groups()))  # may contain None
             else:
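Switching ``str_extract`` from ``regex.match`` to ``regex.search`` means groups can be found after a non-matching prefix. A brief sketch with plain `re`, reusing the named, optional-group pattern from the documentation hunk (the input ``'xb2'`` is an invented example):

```python
import re

pattern = re.compile(r"(?P<letter>[ab])?(?P<digit>\d)")

# The optional group is None when it does not participate in the match.
print(pattern.search("a1").groupdict())  # {'letter': 'a', 'digit': '1'}
print(pattern.search("3").groupdict())   # {'letter': None, 'digit': '3'}

# With search, a group can be found after a non-matching prefix;
# with match, the same pattern fails because 'x' cannot start a match.
print(pattern.search("xb2").groups())    # ('b', '2')
print(pattern.match("xb2"))              # None
```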
@@ -668,13 +707,13 @@ def wrapper(self):
     return wrapper


-def _pat_wrapper(f, flags=False, na=False):
+def _pat_wrapper(f, flags=False, na=False, **kwargs):
     def wrapper1(self, pat):
         result = f(self.series, pat)
         return self._wrap_result(result)

-    def wrapper2(self, pat, flags=0):
-        result = f(self.series, pat, flags=flags)
+    def wrapper2(self, pat, flags=0, **kwargs):
+        result = f(self.series, pat, flags=flags, **kwargs)
         return self._wrap_result(result)

     def wrapper3(self, pat, na=np.nan):
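The ``**kwargs`` added to ``wrapper2`` lets keyword arguments such as ``as_indexer`` reach the wrapped function. The sketch below shows only the forwarding pattern; ``pat_wrapper_sketch`` and ``fake_match`` are hypothetical names, and the real wrapper also deals with ``self.series`` and result wrapping.

```python
def pat_wrapper_sketch(f):
    """Hypothetical, simplified analogue of the flags-aware wrapper."""
    def wrapper(values, pat, flags=0, **kwargs):
        # Extra keyword arguments (e.g. as_indexer) pass straight through to f.
        return f(values, pat, flags=flags, **kwargs)
    return wrapper


def fake_match(values, pat, flags=0, as_indexer=False):
    # Stand-in that just reports which arguments it received.
    return {"pat": pat, "flags": flags, "as_indexer": as_indexer}


wrapped = pat_wrapper_sketch(fake_match)
result = wrapped(["a1"], r"\d", as_indexer=True)
print(result["as_indexer"])  # True
```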

pandas/tests/test_strings.py

+50-6
@@ -5,6 +5,7 @@
 import operator
 import re
 import unittest
+import warnings

 import nose

@@ -392,29 +393,66 @@ def test_repeat(self):
                       u('dddddd')])
         tm.assert_series_equal(result, exp)

-    def test_match(self):
+    def test_deprecated_match(self):
+        # Old match behavior, deprecated (but still default) in 0.13
         values = Series(['fooBAD__barBAD', NA, 'foo'])

-        result = values.str.match('.*(BAD[_]+).*(BAD)')
+        with tm.assert_produces_warning():
+            result = values.str.match('.*(BAD[_]+).*(BAD)')
         exp = Series([('BAD__', 'BAD'), NA, []])
         tm.assert_series_equal(result, exp)

         # mixed
         mixed = Series(['aBAD_BAD', NA, 'BAD_b_BAD', True, datetime.today(),
                         'foo', None, 1, 2.])

-        rs = Series(mixed).str.match('.*(BAD[_]+).*(BAD)')
+        with tm.assert_produces_warning():
+            rs = Series(mixed).str.match('.*(BAD[_]+).*(BAD)')
         xp = [('BAD_', 'BAD'), NA, ('BAD_', 'BAD'), NA, NA, [], NA, NA, NA]
         tm.assert_isinstance(rs, Series)
         tm.assert_almost_equal(rs, xp)

         # unicode
         values = Series([u('fooBAD__barBAD'), NA, u('foo')])

-        result = values.str.match('.*(BAD[_]+).*(BAD)')
+        with tm.assert_produces_warning():
+            result = values.str.match('.*(BAD[_]+).*(BAD)')
         exp = Series([(u('BAD__'), u('BAD')), NA, []])
         tm.assert_series_equal(result, exp)

+    def test_match(self):
+        # New match behavior introduced in 0.13
+        values = Series(['fooBAD__barBAD', NA, 'foo'])
+        with tm.assert_produces_warning():
+            result = values.str.match('.*(BAD[_]+).*(BAD)', as_indexer=True)
+        exp = Series([True, NA, False])
+        tm.assert_series_equal(result, exp)
+
+        # If no groups, use new behavior even when as_indexer is False.
+        # (Old behavior is pretty much useless in this case.)
+        values = Series(['fooBAD__barBAD', NA, 'foo'])
+        result = values.str.match('.*BAD[_]+.*BAD', as_indexer=False)
+        exp = Series([True, NA, False])
+        tm.assert_series_equal(result, exp)
+
+        # mixed
+        mixed = Series(['aBAD_BAD', NA, 'BAD_b_BAD', True, datetime.today(),
+                        'foo', None, 1, 2.])
+
+        with tm.assert_produces_warning():
+            rs = Series(mixed).str.match('.*(BAD[_]+).*(BAD)', as_indexer=True)
+        xp = [True, NA, True, NA, NA, False, NA, NA, NA]
+        tm.assert_isinstance(rs, Series)
+        tm.assert_almost_equal(rs, xp)
+
+        # unicode
+        values = Series([u('fooBAD__barBAD'), NA, u('foo')])
+
+        with tm.assert_produces_warning():
+            result = values.str.match('.*(BAD[_]+).*(BAD)', as_indexer=True)
+        exp = Series([True, NA, False])
+        tm.assert_series_equal(result, exp)
+
     def test_extract(self):
         # Contains tests like those in test_match and some others.


@@ -966,7 +1004,10 @@ def test_match_findall_flags(self):

         pat = pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

-        result = data.str.match(pat, flags=re.IGNORECASE)
+        with warnings.catch_warnings(record=True) as w:
+            warnings.simplefilter('always')
+            result = data.str.match(pat, flags=re.IGNORECASE)
+            assert issubclass(w[-1].category, UserWarning)
         self.assertEquals(result[0], ('dave', 'google', 'com'))

         result = data.str.findall(pat, flags=re.IGNORECASE)
@@ -975,7 +1016,10 @@ def test_match_findall_flags(self):
         result = data.str.count(pat, flags=re.IGNORECASE)
         self.assertEquals(result[0], 1)

-        result = data.str.contains(pat, flags=re.IGNORECASE)
+        with warnings.catch_warnings(record=True) as w:
+            warnings.simplefilter('always')
+            result = data.str.contains(pat, flags=re.IGNORECASE)
+            assert issubclass(w[-1].category, UserWarning)
         self.assertEquals(result[0], True)

     def test_encode_decode(self):
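The updated tests capture warnings with ``warnings.catch_warnings(record=True)`` and ``simplefilter('always')``. The following standalone sketch shows that pattern; ``noisy`` is a hypothetical stand-in for a call such as ``str.match`` with a grouped pattern.

```python
import warnings


def noisy():
    # Stand-in for a call that emits a UserWarning as a side effect.
    warnings.warn("pattern has match groups", UserWarning)
    return True


with warnings.catch_warnings(record=True) as caught:
    # 'always' prevents the once-per-location rule from hiding repeats.
    warnings.simplefilter("always")
    first = noisy()
    second = noisy()

print(len(caught))  # 2
print(issubclass(caught[-1].category, UserWarning))  # True
```

Without the ``'always'`` filter, a warning already raised from the same location earlier in the test run might not be recorded again, which is presumably why the tests set it explicitly.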
