Commit e048150

tdhock authored and jreback committed
DOC: extract/extractall clarifications
This PR clarifies the new documentation for extract and extractall. It was requested by @jreback in #11386 (comment).

Author: Toby Dylan Hocking <[email protected]>

Closes #12281 from tdhock/extract-docs and squashes the following commits:

2019d1b [Toby Dylan Hocking] DOC: extract/extractall clarifications
1 parent e82d093 commit e048150

2 files changed, +28 -28 lines changed

doc/source/text.rst

+11 -11
@@ -196,9 +196,9 @@ DataFrame with one column per group.
 Elements that do not match return a row filled with ``NaN``. Thus, a
 Series of messy strings can be "converted" into a like-indexed Series
 or DataFrame of cleaned-up or more useful strings, without
-necessitating ``get()`` to access tuples or ``re.match`` objects. The
-results dtype always is object, even if no match is found and the
-result only contains ``NaN``.
+necessitating ``get()`` to access tuples or ``re.match`` objects. The
+dtype of the result is always object, even if no match is found and
+the result only contains ``NaN``.
 
 Named groups like
 
@@ -275,15 +275,16 @@ Extract all matches in each subject (extractall)
 
 .. _text.extractall:
 
+.. versionadded:: 0.18.0
+
 Unlike ``extract`` (which returns only the first match),
 
 .. ipython:: python
 
    s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
    s
-   s.str.extract("[ab](?P<digit>\d)", expand=False)
-
-.. versionadded:: 0.18.0
+   two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
+   s.str.extract(two_groups, expand=True)
 
 the ``extractall`` method returns every match. The result of
 ``extractall`` is always a ``DataFrame`` with a ``MultiIndex`` on its
@@ -292,30 +293,29 @@ indicates the order in the subject.
 
 .. ipython:: python
 
-   s.str.extractall("[ab](?P<digit>\d)")
+   s.str.extractall(two_groups)
 
 When each subject string in the Series has exactly one match,
 
 .. ipython:: python
 
    s = pd.Series(['a3', 'b3', 'c2'])
    s
-   two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
 
 then ``extractall(pat).xs(0, level='match')`` gives the same result as
 ``extract(pat)``.
 
 .. ipython:: python
 
-   extract_result = s.str.extract(two_groups, expand=False)
+   extract_result = s.str.extract(two_groups, expand=True)
    extract_result
    extractall_result = s.str.extractall(two_groups)
    extractall_result
    extractall_result.xs(0, level="match")
 
 
 Testing for Strings that Match or Contain a Pattern
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+---------------------------------------------------
 
 You can check whether elements contain a pattern:
 
@@ -355,7 +355,7 @@ Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
    s4.str.contains('A', na=False)
 
 Creating Indicator Variables
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+----------------------------
 
 You can extract dummy variables from string columns.
 For example if they are separated by a ``'|'``:
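
The relationship between ``extract`` and ``extractall`` described in the ``text.rst`` hunks can be tried outside the docs build with a short script. This is only a sketch, written against a recent pandas release rather than 0.18.0; the variable names ``first_only``, ``every_match``, ``t``, and ``same`` are illustrative and do not appear in the commit, and the pattern is written as a raw string.

```python
import pandas as pd

# Same pattern as ``two_groups`` in the diff, written as a raw string.
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

# "a1a2" contains two matches: extract keeps only the first one,
# extractall returns one row per match under a MultiIndex.
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
first_only = s.str.extract(two_groups, expand=True)
every_match = s.str.extractall(two_groups)
print(first_only)
print(every_match)

# With exactly one match per subject, selecting match 0 from the
# extractall result should hold the same values as the extract result.
t = pd.Series(["a3", "b3", "c2"])
extract_result = t.str.extract(two_groups, expand=True)
extractall_result = t.str.extractall(two_groups)
first_matches = extractall_result.xs(0, level="match")
same = first_matches.values.tolist() == extract_result.values.tolist()
print(same)
```

The second level of the ``extractall`` index is named ``match``, which is why ``xs(0, level="match")`` can select the first match of every subject.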

doc/source/whatsnew/v0.18.0.txt

+17 -17
@@ -157,50 +157,50 @@ Currently the default is ``expand=None`` which gives a ``FutureWarning`` and use
 
 .. ipython:: python
 
-   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
+   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=None)
 
-Extracting a regular expression with one group returns a ``DataFrame``
-with one column if ``expand=True``.
+Extracting a regular expression with one group returns a Series if
+``expand=False``.
 
 .. ipython:: python
 
-   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
+   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
 
-It returns a Series if ``expand=False``.
+It returns a ``DataFrame`` with one column if ``expand=True``.
 
 .. ipython:: python
 
-   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
+   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
 
 Calling on an ``Index`` with a regex with exactly one capture group
-returns a ``DataFrame`` with one column if ``expand=True``,
+returns an ``Index`` if ``expand=False``.
 
 .. ipython:: python
 
    s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
    s
-   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
-
-It returns an ``Index`` if ``expand=False``.
-
-.. ipython:: python
-
    s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
 
-Calling on an ``Index`` with a regex with more than one capture group
-returns a ``DataFrame`` if ``expand=True``.
+It returns a ``DataFrame`` with one column if ``expand=True``.
 
 .. ipython:: python
 
-   s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
+   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
 
-It raises ``ValueError`` if ``expand=False``.
+Calling on an ``Index`` with a regex with more than one capture group
+raises ``ValueError`` if ``expand=False``.
 
 .. code-block:: python
 
    >>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
    ValueError: only one regex group is supported with Index
 
+It returns a ``DataFrame`` if ``expand=True``.
+
+.. ipython:: python
+
+   s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
+
 In summary, ``extract(expand=True)`` always returns a ``DataFrame``
 with a row for every subject string, and a column for every capture
 group.
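
The return types listed in this hunk can be checked with a small script. It is a sketch under assumptions: it targets a recent pandas release (where ``expand=None`` is no longer the default), it uses raw strings for the patterns, and names such as ``as_series``, ``as_frame``, ``letters``, ``raised``, and ``index_frame`` are illustrative only.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], index=["A11", "B22", "C33"])

# One capture group on a Series: expand=False gives a Series,
# expand=True gives a DataFrame with one column.
as_series = s.str.extract(r"[ab](\d)", expand=False)
as_frame = s.str.extract(r"[ab](\d)", expand=True)
print(type(as_series).__name__, type(as_frame).__name__)

# One capture group on an Index: expand=False gives an Index.
letters = s.index.str.extract(r"(?P<letter>[a-zA-Z])", expand=False)
print(type(letters).__name__)

# More than one capture group on an Index: expand=False raises
# ValueError, while expand=True returns a DataFrame with one column
# per capture group.
two = r"(?P<letter>[a-zA-Z])([0-9]+)"
raised = False
try:
    s.index.str.extract(two, expand=False)
except ValueError as err:
    raised = True
    print("ValueError:", err)

index_frame = s.index.str.extract(two, expand=True)
print(index_frame.shape)
```

Passing ``expand`` explicitly in every call keeps the script independent of whichever default the installed pandas version uses.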
