DOC: extract/extractall clarifications #12281

Closed
wants to merge 1 commit into from
22 changes: 11 additions & 11 deletions doc/source/text.rst
@@ -196,9 +196,9 @@ DataFrame with one column per group.
Elements that do not match return a row filled with ``NaN``. Thus, a
Series of messy strings can be "converted" into a like-indexed Series
or DataFrame of cleaned-up or more useful strings, without
necessitating ``get()`` to access tuples or ``re.match`` objects. The
results dtype always is object, even if no match is found and the
result only contains ``NaN``.
necessitating ``get()`` to access tuples or ``re.match`` objects. The
dtype of the result is always object, even if no match is found and
the result only contains ``NaN``.
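
A minimal sketch of the non-matching case described above, assuming a pandas release in which ``str.extract`` accepts the ``expand`` argument. The Series ``messy`` and the pattern are illustrative only and are not part of the documentation being changed.

```python
import pandas as pd

# Illustrative data: the last element does not match the pattern.
messy = pd.Series(["a1", "b2", "zz"])

# Two named groups give one column per group; the result keeps the
# index of the original Series.
cleaned = messy.str.extract(r"(?P<letter>[ab])(?P<digit>\d)", expand=True)
print(cleaned)
```

The row for ``"zz"`` is kept, with ``NaN`` in every column, so the result stays aligned with the input.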

Named groups like

@@ -275,15 +275,16 @@ Extract all matches in each subject (extractall)

.. _text.extractall:

.. versionadded:: 0.18.0

Unlike ``extract`` (which returns only the first match),

.. ipython:: python

   s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
   s
   s.str.extract("[ab](?P<digit>\d)", expand=False)

.. versionadded:: 0.18.0
   two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
   s.str.extract(two_groups, expand=True)

the ``extractall`` method returns every match. The result of
``extractall`` is always a ``DataFrame`` with a ``MultiIndex`` on its
@@ -292,30 +293,29 @@ indicates the order in the subject.

.. ipython:: python

   s.str.extractall("[ab](?P<digit>\d)")
   s.str.extractall(two_groups)
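
The contrast between the two methods can be sketched as follows. This is a hedged illustration that reuses the sample Series from this section; the names ``first_only`` and ``every_match`` are introduced here and do not appear in the documentation itself.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
pattern = r"[ab](?P<digit>\d)"

# extract: one row per subject string, first match only;
# the non-matching subject "C" is kept as a NaN row.
first_only = s.str.extract(pattern, expand=True)

# extractall: one row per match, indexed by (original label, match number);
# subjects without any match contribute no rows.
every_match = s.str.extractall(pattern)

print(first_only)
print(every_match)
print(list(every_match.index))
```

Subject ``A`` contributes two rows to ``every_match`` (match numbers 0 and 1), while subject ``C`` is absent because the pattern never matches it.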

When each subject string in the Series has exactly one match,

.. ipython:: python

   s = pd.Series(['a3', 'b3', 'c2'])
   s
   two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

then ``extractall(pat).xs(0, level='match')`` gives the same result as
``extract(pat)``.

.. ipython:: python

   extract_result = s.str.extract(two_groups, expand=False)
   extract_result = s.str.extract(two_groups, expand=True)
   extract_result
   extractall_result = s.str.extractall(two_groups)
   extractall_result
   extractall_result.xs(0, level="match")


Testing for Strings that Match or Contain a Pattern
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
---------------------------------------------------

You can check whether elements contain a pattern:

@@ -355,7 +355,7 @@ Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
   s4.str.contains('A', na=False)

Creating Indicator Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----------------------------

You can extract dummy variables from string columns.
For example if they are separated by a ``'|'``:
34 changes: 17 additions & 17 deletions doc/source/whatsnew/v0.18.0.txt
@@ -157,50 +157,50 @@ Currently the default is ``expand=None`` which gives a ``FutureWarning`` and use

.. ipython:: python

   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=None)

Extracting a regular expression with one group returns a ``DataFrame``
with one column if ``expand=True``.
Extracting a regular expression with one group returns a Series if
``expand=False``.

.. ipython:: python

   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)

It returns a Series if ``expand=False``.
It returns a ``DataFrame`` with one column if ``expand=True``.

.. ipython:: python

   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)

Calling on an ``Index`` with a regex with exactly one capture group
returns a ``DataFrame`` with one column if ``expand=True``,
returns an ``Index`` if ``expand=False``.

.. ipython:: python

   s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
   s
   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)

It returns an ``Index`` if ``expand=False``.

.. ipython:: python

   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)

Calling on an ``Index`` with a regex with more than one capture group
returns a ``DataFrame`` if ``expand=True``.
It returns a ``DataFrame`` with one column if ``expand=True``.

.. ipython:: python

   s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)

It raises ``ValueError`` if ``expand=False``.
Calling on an ``Index`` with a regex with more than one capture group
raises ``ValueError`` if ``expand=False``.

.. code-block:: python

   >>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
   ValueError: only one regex group is supported with Index

It returns a ``DataFrame`` if ``expand=True``.

.. ipython:: python

   s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)

In summary, ``extract(expand=True)`` always returns a ``DataFrame``
with a row for every subject string, and a column for every capture
group.
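
The summary can be checked with a short sketch. It assumes a pandas release that supports ``expand=True`` on both ``Series.str`` and ``Index.str``; the variable names ``subjects``, ``one_group`` and ``two_groups`` are chosen here for illustration.

```python
import pandas as pd

subjects = pd.Series(["a1", "b2", "c3"], index=["A11", "B22", "C33"])

# One capture group on a Series: one column, one row per subject string
# (the non-matching "c3" is still present, as a NaN row).
one_group = subjects.str.extract(r"[ab](\d)", expand=True)

# Two capture groups on an Index: two columns, one row per label.
two_groups = subjects.index.str.extract(
    r"(?P<letter>[a-zA-Z])([0-9]+)", expand=True
)

for result in (one_group, two_groups):
    print(type(result).__name__, result.shape)
```

In both cases the number of rows equals the number of subject strings and the number of columns equals the number of capture groups.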