Skip to content

DOC: update the pandas.Series.str.split docstring #20282

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
77 changes: 69 additions & 8 deletions pandas/core/strings.py
Original file line number Diff line number Diff line change
Expand Up @@ -1095,24 +1095,85 @@ def str_pad(arr, width, side='left', fillchar=' '):

def str_split(arr, pat=None, n=None):
"""
Split each string (a la re.split) in the Series/Index by given
pattern, propagating NA values. Equivalent to :meth:`str.split`.
Split strings around given separator/delimiter.

Split each string in the caller's values by given
pattern, propagating NaN values. Equivalent to :meth:`str.split`.

Parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment - rather than putting sub-bullets under the parameters would it be clearer to move those into a dedicated Notes section? They do explain some of the implementation details so wonder if they are better served there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be better. I'll update it.

----------
pat : string, default None
String or regular expression to split on. If None, splits on whitespace
pat : str, optional
String or regular expression to split on.
If `None`, split on whitespace.
n : int, default -1 (all)
None, 0 and -1 will be interpreted as return all splits
Limit number of splits in output.
`None`, 0 and -1 will be interpreted as return all splits.
expand : bool, default False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this in more detail, I see why you have this documented here in spite of the fact that str_split doesn't actually accept this parameter (it's docstring is copied to split). Ref some of the work in #10085

@jreback is there any reason why we wouldn't want to deprecate items like str_split and move to private internal methods like _str_split that the front-end facing API uses instead? If so may be a logical follow up to this docstring update

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd the str_split methods are not considered public, it's just how it is implemented right now. I agree this one is a bit confusing as some keyword are used in str_split and some only in the method, but generally in the file, the docstrings are located at the functions (so str_split), although they are written as if they were documenting the method.

Feel free to open an issue if you have a proposal how this could be improved.

* If True, return DataFrame/MultiIndex expanding dimensionality.
* If False, return Series/Index.
Expand the splitted strings into separate columns.

return_type : deprecated, use `expand`
* If `True`, return DataFrame/MultiIndex expanding dimensionality.
* If `False`, return Series/Index, containing lists of strings.

Returns
-------
Type matches caller unless `expand=True` (return type is `DataFrame` or
`MultiIndex`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this one below the line below?
Like

split : Series/Index or DataFrame/MultiIndex
    Type matches caller unless `expand=True` (return type is `DataFrame` or
    `MultiIndex`)

split : Series/Index or DataFrame/MultiIndex of objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this return type rather confusing - how does the Index play a part? I think this could be clarified better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was confused about that as well. Should I keep it in the doc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For returns make sure the first line is just the type (unless returning more than one, which isn't the case here). I think that would best be Series or Index with a description that says something like "Return type matches caller unless expand=True (always returns a DataFrame)"


Notes
-----
- If n >= default splits, makes all splits
- If n < default splits, makes first n splits only
- Appends `None` for padding if `expand=True`

Examples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The examples are good but I think the section can use a short sentence or two to introduce the examples and clue the reader in on what they should be looking at. Also, is it possible to show an example that deals with missing values?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might have missed mentioning this but while I think the examples are good you should add a sentence (or a few) to call out what users should be looking at with the examples. It would also be nice to show one or two with missing data to illustrate how NaN gets propagated

Copy link
Contributor Author

@mananpal1997 mananpal1997 Mar 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like this?

>>> s.str.split("is", n=1, expand=True)
        0                1
0      th     is good text
1  but th   is even better
see notes about expand=True

Also, I wrote an example to show NaN propagation but None is being propagated instead of NaN.

>>> s = pd.Series(["this is good text", "but this is even better", np.nan])
>>> s.str.split(expand=True)
      0     1     2     3       4
0  this    is  good  text    None
1   but  this    is  even  better
2   NaN  None  None  None    None

Shouldn't output be?

      0     1     2     3       4
0  this    is  good  text    None
1   but  this    is  even  better
2   NaN   NaN   NaN   NaN     NaN

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put your comments before the example and just highlight what the user should look at. So for your first example say something like "By default, split will return an object of the same size containing lists to hold the split elements" and then introduce the second with something like "By contrast, when using expand=True the split elements will expand out into separate columns." Doesn't need to be exactly those words but something along those lines - make sense?

As far as your example is concerned, make sure you run everything on the master branch. My guess is you are using an older version of pandas as the fix to propagate NaN was released in v0.21.1 (see #18462)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, right.
thanks!

--------
>>> s = pd.Series(["this is good text", "but this is even better"])

By default, split will return an object of the same size
having lists containing the split elements

>>> s.str.split()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref my other comment - kind of strange that the method here is actually str_split but we are showing how to use str.split. While technically the same, from an API perspective I feel like this docstring belongs under the split method and we should make this method private internal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same is the case for str.rsplit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep thanks - might be a few other instances in the module as well. I don't want it to hold up what you've done here but could be a good follow up for you to clean up the actual functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I work on it in this same pr, or should I make a separate issue for this and work on that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate issue

0 [this, is, good, text]
1 [but, this, is, even, better]
dtype: object
>>> s.str.split("random")
0 [this is good text]
1 [but this is even better]
dtype: object

When using `expand=True`, the split elements will
expand out into separate columns.

>>> s.str.split(expand=True)
0 1 2 3 4
0 this is good text None
1 but this is even better
>>> s.str.split(" is ", expand=True)
0 1
0 this good text
1 but this even better

Parameter `n` can be used to limit the number of splits in the output.

>>> s.str.split("is", n=1)
0 [th, is good text]
1 [but th, is even better]
dtype: object
>>> s.str.split("is", n=1, expand=True)
0 1
0 th is good text
1 but th is even better

If NaN is present, it is propagated throughout the columns
during the split.

>>> s = pd.Series(["this is good text", "but this is even better", np.nan])
>>> s.str.split(n=3, expand=True)
0 1 2 3
0 this is good text
1 but this is even better
2 NaN NaN NaN NaN
"""
if pat is None:
if n is None or n == 0:
Expand Down