Commit 1dda3fb

Merge pull request #5826 from danielballan/extract-docstring
DOC: Add example to extract docstring, and re-explain change to match.
2 parents 77f5232 + b23563a commit 1dda3fb

2 files changed: +66 -16 lines changed

doc/source/basics.rst (+28 -11)
@@ -1029,7 +1029,7 @@ with more than one group returns a DataFrame with one column per group.
 
 Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')
 
-Elements that do not match return a row of ``NaN``s.
+Elements that do not match return a row filled with ``NaN``.
 Thus, a Series of messy strings can be "converted" into a
 like-indexed Series or DataFrame of cleaned-up or more useful strings,
 without necessitating ``get()`` to access tuples or ``re.match`` objects.
@@ -1051,18 +1051,35 @@ can also be used.
 Testing for Strings that Match or Contain a Pattern
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-In previous versions, *extracting* match groups was accomplished by ``match``,
-which returned a not-so-convenient Series of tuples. Starting in version 0.14,
-the default behavior of match will change. It will return a boolean
-indexer, analagous to the method ``contains``.
 
-The distinction between
-``match`` and ``contains`` is strictness: ``match`` relies on
-strict ``re.match`` while ``contains`` relies on ``re.search``.
+You can check whether elements contain a pattern:
 
-In version 0.13, ``match`` performs its old, deprecated behavior by default,
-but the new behavior is availabe through the keyword argument
-``as_indexer=True``.
+.. ipython:: python
+
+pattern = r'[a-z][0-9]'
+Series(['1', '2', '3a', '3b', '03c']).contains(pattern)
+
+or match a pattern:
+
+
+.. ipython:: python
+
+Series(['1', '2', '3a', '3b', '03c']).match(pattern, as_indexer=True)
+
+The distinction between ``match`` and ``contains`` is strictness: ``match``
+relies on strict ``re.match``, while ``contains`` relies on ``re.search``.
+
+.. warning::
+
+In previous versions, ``match`` was for *extracting* groups,
+returning a not-so-convenient Series of tuples. The new method ``extract``
+(described in the previous section) is now preferred.
+
+This old, deprecated behavior of ``match`` is still the default. As
+demonstrated above, use the new behavior by setting ``as_indexer=True``.
+In this mode, ``match`` is analagous to ``contains``, returning a boolean
+Series. The new behavior will become the default behavior in a future
+release.
 
 Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
 an extra ``na`` arguement so missing values can be considered True or False:
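
The ``match``/``contains`` distinction described in this hunk can be illustrated with the standard library alone. This is a rough sketch, not pandas code: `contains` and `match` are hypothetical list-based stand-ins for the Series string methods, and the `na` handling is a simplified assumption about how missing values are filled. The digit-then-letter pattern is chosen for illustration, because the letter-then-digit pattern in the diff matches none of the sample strings.

```python
import re


def contains(values, pat, na=None):
    """Hypothetical stand-in for ``contains``: true if the pattern occurs anywhere."""
    regex = re.compile(pat)
    return [na if v is None else bool(regex.search(v)) for v in values]


def match(values, pat, na=None):
    """Hypothetical stand-in for ``match(..., as_indexer=True)``: anchored at the start."""
    regex = re.compile(pat)
    return [na if v is None else bool(regex.match(v)) for v in values]


# Sample strings from the diff, plus one missing value to exercise ``na``.
values = ['1', '2', '3a', '3b', '03c', None]
pattern = r'[0-9][a-z]'  # illustrative pattern, not the one used in the diff

contains_result = contains(values, pattern, na=False)
match_result = match(values, pattern, na=False)

print(contains_result)  # [False, False, True, True, True, False]
print(match_result)     # [False, False, True, True, False, False]
```

The two results differ only for ``'03c'``: ``re.search`` finds ``3c`` in the middle of the string, while ``re.match`` fails because the string does not start with a digit followed by a letter.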

pandas/core/strings.py (+38 -5)
@@ -164,6 +164,11 @@ def str_contains(arr, pat, case=True, flags=0, na=np.nan):
 
 Returns
 -------
+Series of boolean values
+
+See Also
+--------
+match : analagous, but stricter, relying on re.match instead of re.search
 
 """
 if not case:
@@ -326,11 +331,22 @@ def str_match(arr, pat, case=True, flags=0, na=np.nan, as_indexer=False):
 as_indexer : False, by default, gives deprecated behavior better achieved
 using str_extract. True return boolean indexer.
 
+Returns
+-------
+boolean Series
+if as_indexer=True
+Series of tuples
+if as_indexer=False, default but deprecated
 
 Returns
 -------
-matches : boolean array (if as_indexer=True)
-matches : array of tuples (if as_indexer=False, default but deprecated)
+Series of boolean values
+
+See Also
+--------
+contains : analagous, but less strict, relying on re.search instead of
+re.match
+extract : now preferred to the deprecated usage of match (as_indexer=False)
 
 Notes
 -----
@@ -385,10 +401,27 @@ def str_extract(arr, pat, flags=0):
 -------
 extracted groups : Series (one group) or DataFrame (multiple groups)
 
+Examples
+--------
+A pattern with one group will return a Series. Non-matches will be NaN.
 
-Notes
------
-Compare to the string method match, which returns re.match objects.
+>>> Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
+0 1
+1 2
+2 NaN
+dtype: object
+
+A pattern with more than one group will return a DataFrame.
+
+>>> Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')
+
+A pattern may contain optional groups.
+
+>>> Series(['a1', 'b2', 'c3']).str.extract('([ab])?(\d)')
+
+Named groups will become column names in the result.
+
+>>> Series(['a1', 'b2', 'c3']).str.extract('(?P<letter>[ab])(?P<digit>\d)')
 """
 regex = re.compile(pat, flags=flags)
 
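
The new ``Examples`` section lists three behaviors: one column per group, missing values for non-matches and unmatched optional groups, and named groups used as column names. These can be sketched without pandas. This is a minimal illustration under stated assumptions, not the library's implementation: `extract_rows` is a hypothetical helper, it assumes the compiled regex is applied with ``re.search``, and it returns plain dicts with ``None`` for missing values rather than a Series or DataFrame with ``NaN``.

```python
import re


def extract_rows(values, pat, flags=0):
    """Hypothetical stand-in for ``extract``: one dict of groups per input string."""
    regex = re.compile(pat, flags=flags)
    # Named groups are keyed by name; unnamed groups by their 0-based position.
    names = {index: name for name, index in regex.groupindex.items()}
    columns = [names.get(i + 1, i) for i in range(regex.groups)]
    rows = []
    for value in values:
        m = regex.search(value)  # assumption: search semantics, not match
        # A non-matching element yields a row filled with missing values.
        groups = m.groups() if m else (None,) * regex.groups
        rows.append(dict(zip(columns, groups)))
    return rows


data = ['a1', 'b2', 'c3']
one_group = extract_rows(data, r'[ab](\d)')
optional = extract_rows(data, r'([ab])?(\d)')
named = extract_rows(data, r'(?P<letter>[ab])(?P<digit>\d)')

print(one_group)
print(optional)
print(named)
```

With the optional group, ``'c3'`` still matches: the letter group is empty and only the digit is captured, whereas with the required group the whole row is missing.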
