Skip to content

DOC: Improved the docstring of str.extract() (Delhi) #20141

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 7, 2018

Conversation

arjunsharma97
Copy link
Contributor

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single <your-function-or-method>
  • It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
#################### Docstring (pandas.Series.str.extract)  ####################
################################################################################

Return the match object corresponding to regex `pat`.

For each subject string in the Series, extract groups from the
first match of regular expression `pat`.

Parameters
----------
pat : string
    Regular expression pattern with capturing groups.
flags : int, default 0 (no flags)
    Re module flags, e.g. re.IGNORECASE.
expand : bool, default True
    If True, return DataFrame, else return Series/Index/DataFrame.

    .. versionadded:: 0.18.0.

Returns
-------
DataFrame with one row for each subject string, and one column for
each group. Any capture group names in regular expression pat will
be used for column names; otherwise capture group numbers will be
used. The dtype of each result column is always object, even when
no match is found. If expand=False and pat has only one capture group,
then return a Series (if subject is a Series) or Index (if subject
is an Index).

See Also
--------
extractall : returns all matches (not just the first match)

Examples
--------
A pattern with two groups will return a DataFrame with two columns.
Non-matches will be NaN.

>>> s = pd.Series(['a1', 'b2', 'c3'])
>>> s.str.extract(r'([ab])(\d)')
     0    1
0    a    1
1    b    2
2  NaN  NaN

A pattern may contain optional groups.

>>> s.str.extract(r'([ab])?(\d)')
     0  1
0    a  1
1    b  2
2  NaN  3

Named groups will become column names in the result.

>>> s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
  letter digit
0      a     1
1      b     2
2    NaN   NaN

A pattern with one group will return a DataFrame with one column
if expand=True.

>>> s.str.extract(r'[ab](\d)', expand=True)
     0
0    1
1    2
2  NaN

A pattern with one group will return a Series if expand=False.

>>> s.str.extract(r'[ab](\d)', expand=False)
0      1
1      2
2    NaN
dtype: object

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.Series.str.extract" correct. :)

If the validation script still gives errors, but you think there is a good reason
to deviate in this case (and there are certainly such cases), please state this
explicitly.

@@ -633,21 +633,21 @@ def _str_extract_frame(arr, pat, flags=0):

def str_extract(arr, pat, flags=0, expand=True):
r"""
Return the match object corresponding to regex `pat`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this r before the doc string right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is to ensure correct rendering for used \ in the examples

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description isn't quite right, is it? We aren't returning a Match object.

Maybe

Extract capture groups in the regex `pat` as columns in a DataFrame.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Added some comments

@@ -633,21 +633,21 @@ def _str_extract_frame(arr, pat, flags=0):

def str_extract(arr, pat, flags=0, expand=True):
r"""
Return the match object corresponding to regex `pat`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is to ensure correct rendering for used \ in the examples

flags : int, default 0 (no flags)
re module flags, e.g. re.IGNORECASE

Re module flags, e.g. re.IGNORECASE.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it is fine to not capitalize the first word, as it refers to a python module. Therefore, would you actually quote it like ``re`` ? (and the same for re.IGNORECASE)

Equivalent to applying :func:`re.findall` to all the elements in the
Series/Index.
Find all occurrences of pattern or regular expression in the
Series/Index. Equivalent to :func:`re.findall`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The summary line should only be one line. Can you try to condense it or split it up?

no flags).
Pattern or regular expression
flags : int, default 0 (no flags)
re module flags, e.g. re.IGNORECASE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment here as above


The search for the pattern 'Monkey' returns one match:

>>> s.str.findall('Monkey')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you removed those examples ?

@codecov
Copy link

codecov bot commented Mar 15, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@32ee973). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #20141   +/-   ##
=========================================
  Coverage          ?   91.72%           
=========================================
  Files             ?      150           
  Lines             ?    49152           
  Branches          ?        0           
=========================================
  Hits              ?    45086           
  Misses            ?     4066           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.11% <ø> (?)
#single 41.84% <ø> (?)
Impacted Files Coverage Δ
pandas/core/strings.py 98.32% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 32ee973...ec8bd44. Read the comment docs.

@@ -633,21 +633,21 @@ def _str_extract_frame(arr, pat, flags=0):

def str_extract(arr, pat, flags=0, expand=True):
r"""
Return the match object corresponding to regex `pat`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description isn't quite right, is it? We aren't returning a Match object.

Maybe

Extract capture groups in the regex `pat` as columns in a DataFrame.

flags : int, default 0 (no flags)
re module flags, e.g. re.IGNORECASE

``re`` module flags, e.g. ``re.IGNORECASE``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be done by using :mod:`re`

expand : bool, default True
* If True, return DataFrame.
* If False, return Series/Index/DataFrame.
If True, return DataFrame, else return Series/Index/DataFrame.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is a DataFrame returned if expand=False?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still returns a DataFrame when you have multiple groups (only if there is only one group, it gives a Series). Probably good to explain this a bit more.

(I find this a bit strange behaviour, but that's what it currently is. In other case we then return a Series with lists I think)

@jreback jreback added this to the 0.24.0 milestone Jul 7, 2018
@jreback jreback merged commit b12b7ae into pandas-dev:master Jul 7, 2018
@jreback
Copy link
Contributor

jreback commented Jul 7, 2018

thanks @arjunsharma97 and @mroeschke for fixups!

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants