DOC: updated the Series.str.rsplit and Series.str.split docstrings #21026

Merged Jun 22, 2018 (15 commits)
241 changes: 119 additions & 122 deletions pandas/core/strings.py
@@ -1343,108 +1343,7 @@ def str_pad(arr, width, side='left', fillchar=' '):


def str_split(arr, pat=None, n=None):
"""
Split strings around given separator/delimiter.

Split each string in the caller's values by given
pattern, propagating NaN values. Equivalent to :meth:`str.split`.

Parameters
----------
pat : str, optional
String or regular expression to split on.
If not specified, split on whitespace.
n : int, default -1 (all)
Limit number of splits in output.
``None``, 0 and -1 will be interpreted as return all splits.
expand : bool, default False
Expand the split strings into separate columns.

* If ``True``, return DataFrame/MultiIndex expanding dimensionality.
* If ``False``, return Series/Index, containing lists of strings.

Returns
-------
Series, Index, DataFrame or MultiIndex
Type matches caller unless ``expand=True`` (see Notes).

Notes
-----
The handling of the `n` keyword depends on the number of found splits:

- If found splits > `n`, make first `n` splits only
- If found splits <= `n`, make all splits
- If for a certain row the number of found splits < `n`,
append `None` for padding up to `n` if ``expand=True``

If using ``expand=True``, Series and Index callers return DataFrame and
MultiIndex objects, respectively.

See Also
--------
str.split : Standard library version of this method.
Series.str.get_dummies : Split each string into dummy variables.
Series.str.partition : Split string on a separator, returning
the before, separator, and after components.

Examples
--------
>>> s = pd.Series(["this is good text", "but this is even better"])

By default, split will return an object of the same size
having lists containing the split elements

>>> s.str.split()
0 [this, is, good, text]
1 [but, this, is, even, better]
dtype: object
>>> s.str.split("random")
0 [this is good text]
1 [but this is even better]
dtype: object

When using ``expand=True``, the split elements will expand out into
separate columns.

For Series object, output return type is DataFrame.

>>> s.str.split(expand=True)
0 1 2 3 4
0 this is good text None
1 but this is even better
>>> s.str.split(" is ", expand=True)
0 1
0 this good text
1 but this even better

For Index object, output return type is MultiIndex.

>>> i = pd.Index(["ba 100 001", "ba 101 002", "ba 102 003"])
>>> i.str.split(expand=True)
MultiIndex(levels=[['ba'], ['100', '101', '102'], ['001', '002', '003']],
labels=[[0, 0, 0], [0, 1, 2], [0, 1, 2]])

Parameter `n` can be used to limit the number of splits in the output.

>>> s.str.split("is", n=1)
0 [th, is good text]
1 [but th, is even better]
dtype: object
>>> s.str.split("is", n=1, expand=True)
0 1
0 th is good text
1 but th is even better

If NaN is present, it is propagated throughout the columns
during the split.

>>> s = pd.Series(["this is good text", "but this is even better", np.nan])
>>> s.str.split(n=3, expand=True)
0 1 2 3
0 this is good text
1 but this is even better
2 NaN NaN NaN NaN
"""
if pat is None:
if n is None or n == 0:
n = -1
@@ -1464,25 +1363,7 @@ def str_split(arr, pat=None, n=None):


def str_rsplit(arr, pat=None, n=None):
"""
Split each string in the Series/Index by the given delimiter
string, starting at the end of the string and working to the front.
Equivalent to :meth:`str.rsplit`.

Parameters
----------
pat : string, default None
Separator to split on. If None, splits on whitespace
n : int, default -1 (all)
None, 0 and -1 will be interpreted as return all splits
expand : bool, default False
* If True, return DataFrame/MultiIndex expanding dimensionality.
* If False, return Series/Index.

Returns
-------
split : Series/Index or DataFrame/MultiIndex of objects
"""

if n is None or n == 0:
n = -1
f = lambda x: x.rsplit(pat, n)
@@ -2325,12 +2206,128 @@ def cat(self, others=None, sep=None, na_rep=None, join=None):
res = Series(res, index=data.index, name=self._orig.name)
return res

@copy(str_split)
_shared_docs['str_split'] = ("""
Split strings around given separator/delimiter.

Splits the string in the Series/Index from the %(side)s,
at the specified delimiter string. Equivalent to :meth:`str.%(method)s`.

Parameters
----------
pat : str, optional
String or regular expression to split on.
If not specified, split on whitespace.
n : int, default -1 (all)
Limit number of splits in output.
``None``, 0 and -1 will be interpreted as return all splits.
expand : bool, default False
Expand the splitted strings into separate columns.

* If ``True``, return DataFrame/MultiIndex expanding dimensionality.
* If ``False``, return Series/Index, containing lists of strings.

Returns
-------
Series, Index, DataFrame or MultiIndex
Type matches caller unless ``expand=True`` (see Notes).

See Also
--------
Series.str.split : Split strings around given separator/delimiter.
Series.str.rsplit : Splits string around given separator/delimiter,
starting from the right.
Series.str.join : Join lists contained as elements in the Series/Index
with passed delimiter.
str.split : Standard library version for split.
str.rsplit : Standard library version for rsplit.

Notes
-----
Member

Can you move the Notes after See Also? That's the ordering of the numpydoc standard.

The handling of the `n` keyword depends on the number of found splits:

- If found splits > `n`, make first `n` splits only
- If found splits <= `n`, make all splits
- If for a certain row the number of found splits < `n`,
append `None` for padding up to `n` if ``expand=True``

If using ``expand=True``, Series and Index callers return DataFrame and
MultiIndex objects, respectively.

Examples
--------
>>> s = pd.Series(["this is a regular sentence", "https://docs.python.org/3/tutorial/index.html", np.nan])

In the default setting, the string is split by whitespace.

>>> s.str.split()
0 [this, is, a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object

Without the `n` parameter, the outputs of `rsplit` and `split` are identical.

>>> s.str.rsplit()
0 [this, is, a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object

The `n` parameter can be used to limit the number of splits on the
delimiter. The outputs of `split` and `rsplit` are different.

>>> s.str.split(n=2)
0 [this, is, a regular sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object

>>> s.str.rsplit(n=2)
0 [this is a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object

The `pat` parameter can be used to split by other characters.

>>> s.str.split(pat = "/")
0 [this is a regular sentence]
1 [https:, , docs.python.org, 3, tutorial, index...
2 NaN
dtype: object
Member

Looks good to me. Maybe it would be a bit clearer if we introduced one parameter at a time, each with its own explanation. That would turn the two previous examples into these three:

  • s.str.split(pat='/') - with a sentence explaining that pat can be used to split by other characters
  • s.str.split(n=2) - that's the one you're explaining, but I'd use the default pat to keep it slightly simpler
  • s.str.rsplit(n=2) - here we can explain how rsplit is different from split.

That does not fully address your concern that it can feel strange that the rsplit page starts with a split example, but IMO it will make it easier to understand when rsplit should be used.

An idea that just came to mind: instead of /path/to/python/file we could use, for example, http://pandas.pydata.org/pandas-docs/stable/api.html (or a shorter URL). Then rsplit(n=1) can be used to split the HTML document from the rest of the URL.

One last idea to consider is adding an s.str.rsplit() example after the first one, with a comment like "Without the n parameter, rsplit behaves like split" (I think this is the case, but you may want to double-check, for example when using expand). If you think this would make users on the rsplit page less confused, it could be a good idea.

The docstring looks good to me; feel free to incorporate the changes that make sense to you.
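
One quick way to double-check the "rsplit behaves like split without n" claim is to compare the two accessors directly. This is only a minimal sketch, assuming a pandas version where `Series.str.split` and `Series.str.rsplit` accept `n` and `expand`; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([
    "this is a regular sentence",
    "https://docs.python.org/3/tutorial/index.html",
])

# Without n, both methods split on every run of whitespace.
same_lists = s.str.split().tolist() == s.str.rsplit().tolist()

# The same comparison with expand=True, where the output is a DataFrame.
same_frames = s.str.split(expand=True).equals(s.str.rsplit(expand=True))

# With a limit, the direction matters, so the results should differ.
differs_with_n = s.str.split(n=2).tolist() != s.str.rsplit(n=2).tolist()

print(same_lists, same_frames, differs_with_n)
```

If all three flags come out true, the sentence "Without the n parameter, the outputs of rsplit and split are identical" holds for this input, including the `expand=True` case.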

Contributor Author

@ryankarlos, Jun 21, 2018

@datapythonista I like the idea of using http://pandas.pydata.org/pandas-docs/stable/api.html, but then we would need rsplit(pat='/', n=1), since rsplit(n=1) won't work in this instance. Should this be a separate example, given that it uses both pat and n? The ordering and sectioning I have in mind are listed below.
I've seen other docstrings explain the examples only at the beginning of a section (where a new parameter is introduced). Here I need to explain how rsplit differs from split: does that have to come at the start of the section, before both the split and rsplit examples for n=2, or can the sentence come just before the rsplit example? Is there a specific convention for this?

Default example

  • s.str.split()
  • s.str.rsplit() - sentence explaining how rsplit produces the same output as split in this case

Parameter n

  • s.str.split(n=2)
  • s.str.rsplit(n=2) - explain how rsplit is different than split

Parameter pat

  • s.str.split(pat='/') - sentence explaining how pat can be used to split by other characters like /

Real world example using both pat and n

  • s.str.rsplit(pat='/', n=1) - example of how pat and n can be used together to split the HTML document in "http://pandas.pydata.org/pandas-docs/stable/visualization.html" from the rest of the URL.

Member

That sounds good to me. I think the docstring is already good. Those changes would make it better, but whatever you think is most useful (and concise) for users will be great.
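
The example ordering agreed in this thread can be sketched with the standard-library `str.split` and `str.rsplit`, which the docstring describes the accessors as equivalent to. This is an illustration only: the variable names are made up here, and the inputs are the two strings used in the merged docstring.

```python
sentence = "this is a regular sentence"
url = "https://docs.python.org/3/tutorial/index.html"

# Default example: without n, split and rsplit agree.
default_split = sentence.split()
default_rsplit = sentence.rsplit()

# Parameter n: at most two splits, counted from the left or from the right.
left_limited = sentence.split(None, 2)
right_limited = sentence.rsplit(None, 2)

# Parameter pat: split on "/" instead of whitespace.
url_parts = url.split("/")

# pat and n together: peel the document name off the end of the URL.
base, document = url.rsplit("/", 1)

print(left_limited)
print(right_limited)
print(base, document)
```

The last call is the "real world" case: a single split from the right on `/` leaves everything before the final slash in one piece.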


When using ``expand=True``, the split elements will expand out into
separate columns. If NaN is present, it is propagated throughout
the columns during the split.

>>> s.str.split(expand=True)
0 1 2 3 4
0 this is a regular sentence
1 https://docs.python.org/3/tutorial/index.html None None None None
2 NaN NaN NaN NaN NaN

For slightly more complex use cases like splitting the html document name
from a url, a combination of parameter settings can be used.

>>> s.str.rsplit("/", n=1, expand=True)
0 1
0 this is a regular sentence None
1 https://docs.python.org/3/tutorial index.html
2 NaN NaN
""")

@Appender(_shared_docs['str_split'] % {
'side': 'beginning',
'method': 'split'
})
def split(self, pat=None, n=-1, expand=False):
result = str_split(self._data, pat, n=n)
return self._wrap_result(result, expand=expand)

@copy(str_rsplit)
@Appender(_shared_docs['str_split'] % {
'side': 'end',
'method': 'rsplit'
})
def rsplit(self, pat=None, n=-1, expand=False):
result = str_rsplit(self._data, pat, n=n)
return self._wrap_result(result, expand=expand)
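
The diff above replaces two separate docstrings with one shared template that is filled in through `%(side)s` and `%(method)s`. The pattern can be sketched in plain Python as follows. Note that `append_doc` is a hypothetical stand-in written for this illustration, not pandas' `Appender`, and the two functions operate on a single string rather than on a Series.

```python
SHARED_DOC = """
Split strings around given separator/delimiter.

Splits the string from the %(side)s, at the specified delimiter string.
Equivalent to :meth:`str.%(method)s`.
"""


def append_doc(text):
    """Hypothetical decorator: append ``text`` to the function's docstring."""
    def decorator(func):
        func.__doc__ = (func.__doc__ or "") + text
        return func
    return decorator


@append_doc(SHARED_DOC % {"side": "beginning", "method": "split"})
def split(value, pat=None, n=-1):
    # Delegates to the standard-library method named in the template.
    return value.split(pat, n)


@append_doc(SHARED_DOC % {"side": "end", "method": "rsplit"})
def rsplit(value, pat=None, n=-1):
    return value.rsplit(pat, n)


print(split.__doc__)
print(rsplit.__doc__)
```

Keeping the text in one template means the Parameters, Notes, and Examples sections only need to be maintained once, at the cost of both pages showing examples for both methods.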