DOC: update the pandas.Series.str.split docstring #20307

Merged

mananpal1997
Contributor

@mananpal1997 mananpal1997 commented Mar 12, 2018

Following up from discussion on #20282

  • PR title is "DOC: update the pandas.Series.str.split docstring"
  • The validation script passes: scripts/validate_docstrings.py pandas.Series.str.split docstring
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single pandas.Series.str.split docstring
################################################################################
##################### Docstring (pandas.Series.str.split)  #####################
################################################################################

Split strings around given separator/delimiter.

Split each string in the caller's values by given
pattern, propagating NaN values. Equivalent to :meth:`str.split`.

Parameters
----------
pat : str, optional
    String or regular expression to split on.
    If not specified, split on whitespace.
n : int, default -1 (all)
    Limit number of splits in output.
    ``None``, 0 and -1 will be interpreted as return all splits.
expand : bool, default False
    Expand the split strings into separate columns.

    * If ``True``, return DataFrame/MultiIndex expanding dimensionality.
    * If ``False``, return Series/Index, containing lists of strings.

Returns
-------
split : Series/Index or DataFrame/MultiIndex of objects
    Type matches caller unless ``expand=True`` (return type is DataFrame or
MultiIndex)

Notes
-----
The handling of the `n` keyword depends on the number of found splits:

- If found splits > `n`,  make first `n` splits only
- If found splits <= `n`, make all splits
- If for a certain row the number of found splits < `n`,
  append `None` for padding up to `n` if ``expand=True``

Examples
--------
>>> s = pd.Series(["this is good text", "but this is even better"])

By default, split will return an object of the same size
containing lists of the split elements.

>>> s.str.split()
0           [this, is, good, text]
1    [but, this, is, even, better]
dtype: object
>>> s.str.split("random")
0          [this is good text]
1    [but this is even better]
dtype: object

When using ``expand=True``, the split elements will expand out into
separate columns.

For Series object, output return type is DataFrame.

>>> s.str.split(expand=True)
      0     1     2     3       4
0  this    is  good  text    None
1   but  this    is  even  better
>>> s.str.split(" is ", expand=True)
          0            1
0      this    good text
1  but this  even better

For Index object, output return type is MultiIndex.

>>> i = pd.Index(["ba 100 001", "ba 101 002", "ba 102 003"])
>>> i.str.split(expand=True)
MultiIndex(levels=[['ba'], ['100', '101', '102'], ['001', '002', '003']],
       labels=[[0, 0, 0], [0, 1, 2], [0, 1, 2]])

Parameter `n` can be used to limit the number of splits in the output.

>>> s.str.split("is", n=1)
0          [th,  is good text]
1    [but th,  is even better]
dtype: object
>>> s.str.split("is", n=1, expand=True)
        0                1
0      th     is good text
1  but th   is even better

If NaN is present, it is propagated throughout the columns
during the split.

>>> s = pd.Series(["this is good text", "but this is even better", np.nan])
>>> s.str.split(n=3, expand=True)
      0     1     2            3
0  this    is  good         text
1   but  this    is  even better
2   NaN   NaN   NaN          NaN

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	See Also section not found

@jorisvandenbossche @WillAyd
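
For readers who want to check the `n` and `expand` behaviour described in the Notes section of the docstring above, here is a minimal sketch. It is not part of the PR; the sample strings and the variable names `limited`, `unlimited` and `expanded` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one row with three separators, one with a single separator.
s = pd.Series(["a b c d", "e f"])

# n=1: at most one split per row, leaving the remainder of the string intact.
limited = s.str.split(" ", n=1)

# n=-1 (the default): split on every occurrence of the separator.
unlimited = s.str.split(" ", n=-1)

# expand=True: one column per piece; the shorter row is padded with
# missing values so that every row has the same number of columns.
expanded = s.str.split(" ", expand=True)

print(limited.tolist())
print(unlimited.tolist())
print(expanded)
```

In this sketch the second row has fewer pieces than the first, so its trailing columns in `expanded` hold missing values.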

@jorisvandenbossche
Member

@mananpal1997 Thanks!

Additional comment: on line 1121, the indentation should be the same as the line above, can you fix that as well? (cannot comment on that line :-))

@WillAyd did you have a proposal for writing the Returns section more clearly?

@mananpal1997 mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 9184934 to 52fe248 Compare March 12, 2018 16:52
@WillAyd
Member

WillAyd commented Mar 12, 2018

Nice job @mananpal1997. I suggest the Returns type line just say "Series, DataFrame, Index or MultiIndex", and the description say "Type matches caller unless ``expand=True`` (see Notes)". Then in the Notes, say something like "If using ``expand=True``, Series and Index callers return DataFrame and MultiIndex objects, respectively."
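
As an illustration of the return types discussed in this comment, the following sketch (not part of the PR, with made-up sample values and variable names) calls `str.split` on a Series and on an Index, with and without `expand=True`, and prints the resulting type names.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
s = pd.Series(["a b", "c d"])
idx = pd.Index(["a b", "c d"])

# Without expand, the result type matches the caller.
series_result = s.str.split()
index_result = idx.str.split()

# With expand=True, the dimensionality grows:
# a Series caller gives a DataFrame, an Index caller gives a MultiIndex.
frame_result = s.str.split(expand=True)
multi_result = idx.str.split(expand=True)

for name, obj in [
    ("Series", series_result),
    ("Index", index_result),
    ("Series, expand=True", frame_result),
    ("Index, expand=True", multi_result),
]:
    print(f"{name}: {type(obj).__name__}")
```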

split : Series/Index or DataFrame/MultiIndex of objects
Type matches caller unless ``expand=True`` (return type is DataFrame or
MultiIndex)
split : Series, Index, DataFrame or MultiIndex of objects
Member

Since you are only returning one item you don't need to list the variable name, so get rid of "split : ". Also chop " of objects" off the end

Contributor Author

done 👍

Contributor Author

@WillAyd I was trying to resolve the coverage check and I think I messed up the PR. Does everything look good to you? 😅

@mananpal1997 mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from b57f570 to c9bf3e8 Compare March 12, 2018 18:01
@codecov

codecov bot commented Mar 12, 2018

Codecov Report

Merging #20307 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #20307   +/-   ##
=======================================
  Coverage    91.7%    91.7%           
=======================================
  Files         150      150           
  Lines       49165    49165           
=======================================
  Hits        45087    45087           
  Misses       4078     4078

Flag         Coverage Δ
#multiple    90.09% <ø> (ø) ⬆️
#single      41.86% <ø> (ø) ⬆️

Impacted Files            Coverage Δ
pandas/core/strings.py    98.32% <ø> (ø) ⬆️
pandas/core/series.py     93.84% <0%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7169830...27548e8.

@mananpal1997 mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from c9bf3e8 to 1a2efb0 Compare March 12, 2018 18:04
@mananpal1997 mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 1a2efb0 to 7169830 Compare March 12, 2018 18:18
@mananpal1997 mananpal1997 reopened this Mar 12, 2018
@TomAugspurger
Contributor

Added a see also. Thanks @mananpal1997.

@TomAugspurger TomAugspurger merged commit edaa112 into pandas-dev:master Mar 13, 2018
4 participants