DOC: Improved the docstring of str.extract() (Delhi) #20141

Merged
merged 7 commits into from
Jul 7, 2018
Changes from 3 commits
102 changes: 15 additions & 87 deletions pandas/core/strings.py
@@ -633,21 +633,21 @@ def _str_extract_frame(arr, pat, flags=0):

def str_extract(arr, pat, flags=0, expand=True):
r"""
+ Return the match object corresponding to regex `pat`.
Contributor

Is the `r` prefix before the docstring right?

Member

Yes, it ensures that the backslashes (`\`) used in the examples render correctly.
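
A small sketch of the difference, using `\n` because it is a valid escape in both forms. The two function names are hypothetical and are not part of the PR:

```python
def with_raw_prefix():
    r"""Split on '\n'."""


def without_raw_prefix():
    """Split on '\n'."""


# Raw docstring: the backslash and the letter n survive as two characters.
print(repr(with_raw_prefix.__doc__))
# Plain docstring: the escape is interpreted and becomes a real newline.
print(repr(without_raw_prefix.__doc__))
```

A sequence such as `\d` is not a recognised escape, so a plain string keeps the backslash, but recent Python versions warn about it. The raw prefix sidesteps both cases.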

Contributor

This description isn't quite right, is it? We aren't returning a Match object.

Maybe

Extract capture groups in the regex `pat` as columns in a DataFrame.
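
To illustrate the distinction raised in this comment, here is a rough standard-library sketch of what "extract groups from the first match" means per string. It is an approximation for discussion, not the pandas implementation, and `first_match_groups` is a hypothetical helper:

```python
import re


def first_match_groups(strings, pat):
    """Return one tuple of captured groups per string (None where no match)."""
    regex = re.compile(pat)
    rows = []
    for text in strings:
        match = regex.search(text)  # a Match object, or None
        if match is None:
            # One placeholder per capture group, mirroring the NaN row.
            rows.append((None,) * regex.groups)
        else:
            # The captured groups, not the Match object, are what is kept.
            rows.append(match.groups())
    return rows


rows = first_match_groups(['a1', 'b2', 'c3'], r'([ab])(\d)')
print(rows)
```

Each tuple corresponds to one row of the result and each position in the tuple to one column, which is why the suggested summary talks about capture groups and columns instead of match objects.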


For each subject string in the Series, extract groups from the
- first match of regular expression pat.
+ first match of regular expression `pat`.

Parameters
----------
pat : string
- Regular expression pattern with capturing groups
+ Regular expression pattern with capturing groups.
flags : int, default 0 (no flags)
- re module flags, e.g. re.IGNORECASE
-
+ Re module flags, e.g. re.IGNORECASE.
Member

In this case it is fine to not capitalize the first word, as it refers to a python module. Therefore, would you actually quote it like ``re`` ? (and the same for re.IGNORECASE)

expand : bool, default True
- * If True, return DataFrame.
- * If False, return Series/Index/DataFrame.
+ If True, return DataFrame, else return Series/Index/DataFrame.
Contributor

When is a DataFrame returned if expand=False?

Member

It still returns a DataFrame when the pattern has multiple groups; it gives a Series only when there is exactly one group. It is probably good to explain this a bit more.

(I find this behaviour a bit strange, but that is how it currently works. In other cases we return a Series of lists, I think.)
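
The behaviour described in this thread can be checked with a short sketch. It assumes a pandas version where `expand` defaults to `True`, as in the signature shown in this diff:

```python
import pandas as pd

s = pd.Series(['a1', 'b2', 'c3'])

# expand=True always gives a DataFrame, even with a single capture group.
one_group_expanded = s.str.extract(r'[ab](\d)', expand=True)

# expand=False with exactly one group gives a Series.
one_group_flat = s.str.extract(r'[ab](\d)', expand=False)

# expand=False with several groups still gives a DataFrame.
two_groups_flat = s.str.extract(r'([ab])(\d)', expand=False)

print(type(one_group_expanded).__name__)
print(type(one_group_flat).__name__)
print(type(two_groups_flat).__name__)
```

So `expand=False` only changes the return type in the single-group case, which is the detail the docstring should spell out.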


- .. versionadded:: 0.18.0
+ .. versionadded:: 0.18.0.

Returns
-------
@@ -668,7 +668,7 @@ def str_extract(arr, pat, flags=0, expand=True):
A pattern with two groups will return a DataFrame with two columns.
Non-matches will be NaN.

- >>> s = Series(['a1', 'b2', 'c3'])
+ >>> s = pd.Series(['a1', 'b2', 'c3'])
>>> s.str.extract(r'([ab])(\d)')
0 1
0 a 1
@@ -707,7 +707,6 @@ def str_extract(arr, pat, flags=0, expand=True):
1 2
2 NaN
dtype: object

"""
if not isinstance(expand, bool):
raise ValueError("expand must be True or False")
@@ -898,94 +897,23 @@ def str_join(arr, sep):

def str_findall(arr, pat, flags=0):
"""
- Find all occurrences of pattern or regular expression in the Series/Index.
-
- Equivalent to applying :func:`re.findall` to all the elements in the
- Series/Index.
+ Find all occurrences of pattern or regular expression in the
+ Series/Index. Equivalent to :func:`re.findall`.
Member

The summary line should only be one line. Can you try to condense it or split it up?


Parameters
----------
pat : string
- Pattern or regular expression.
- flags : int, default 0
- ``re`` module flags, e.g. `re.IGNORECASE` (default is 0, which means
- no flags).
+ Pattern or regular expression
+ flags : int, default 0 (no flags)
+ re module flags, e.g. re.IGNORECASE
Member

similar comment here as above


Returns
-------
- Series/Index of lists of strings
- All non-overlapping matches of pattern or regular expression in each
- string of this Series/Index.
+ matches : Series/Index of lists

See Also
--------
- count : Count occurrences of pattern or regular expression in each string
- of the Series/Index.
- extractall : For each string in the Series, extract groups from all matches
- of regular expression and return a DataFrame with one row for each
- match and one column for each group.
- re.findall : The equivalent ``re`` function to all non-overlapping matches
- of pattern or regular expression in string, as a list of strings.
-
- Examples
- --------
-
- >>> s = pd.Series(['Lion', 'Monkey', 'Rabbit'])
-
- The search for the pattern 'Monkey' returns one match:
-
- >>> s.str.findall('Monkey')
Member

Is there a reason you removed those examples?

- 0 []
- 1 [Monkey]
- 2 []
- dtype: object
-
- On the other hand, the search for the pattern 'MONKEY' doesn't return any
- match:
-
- >>> s.str.findall('MONKEY')
- 0 []
- 1 []
- 2 []
- dtype: object
-
- Flags can be added to the pattern or regular expression. For instance,
- to find the pattern 'MONKEY' ignoring the case:
-
- >>> import re
- >>> s.str.findall('MONKEY', flags=re.IGNORECASE)
- 0 []
- 1 [Monkey]
- 2 []
- dtype: object
-
- When the pattern matches more than one string in the Series, all matches
- are returned:
-
- >>> s.str.findall('on')
- 0 [on]
- 1 [on]
- 2 []
- dtype: object
-
- Regular expressions are supported too. For instance, the search for all the
- strings ending with the word 'on' is shown next:
-
- >>> s.str.findall('on$')
- 0 [on]
- 1 []
- 2 []
- dtype: object
-
- If the pattern is found more than once in the same string, then a list of
- multiple strings is returned:
-
- >>> s.str.findall('b')
- 0 []
- 1 []
- 2 [b, b]
- dtype: object
-
+ extractall : returns DataFrame with one column per capture group
"""
regex = re.compile(pat, flags=flags)
return _na_map(regex.findall, arr)
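
For readers without the pandas internals at hand, the two lines above can be approximated with the standard library alone. This is a sketch under assumptions: `findall_each` is a hypothetical name, and the list comprehension stands in for `_na_map`, which is assumed here to pass missing values through untouched:

```python
import re


def findall_each(strings, pat, flags=0):
    """Apply re.findall to every string, leaving non-string entries as they are."""
    regex = re.compile(pat, flags=flags)
    return [regex.findall(item) if isinstance(item, str) else item
            for item in strings]


animals = ['Lion', 'Monkey', 'Rabbit']

# Matching is case-sensitive unless a flag says otherwise.
print(findall_each(animals, 'MONKEY'))
print(findall_each(animals, 'MONKEY', flags=re.IGNORECASE))

# Every non-overlapping match in a string is returned.
print(findall_each(animals, 'b'))
```

The inputs mirror the examples in the docstring that this diff removes, so the printed lists can be compared against those outputs.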