DOC: update the pandas.Series.str.split docstring #20282
Changes from 3 commits
9126c82
2e13424
0a1da96
da27e5f
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1095,24 +1095,85 @@ def str_pad(arr, width, side='left', fillchar=' '): | |
|
||
def str_split(arr, pat=None, n=None): | ||
""" | ||
Split each string (a la re.split) in the Series/Index by given | ||
pattern, propagating NA values. Equivalent to :meth:`str.split`. | ||
Split strings around given separator/delimiter. | ||
|
||
Split each string in the caller's values by given | ||
pattern, propagating NaN values. Equivalent to :meth:`str.split`. | ||
|
||
Parameters | ||
---------- | ||
pat : string, default None | ||
String or regular expression to split on. If None, splits on whitespace | ||
pat : str, optional | ||
String or regular expression to split on. | ||
If not specified, split on whitespace. | ||
n : int, default -1 (all) | ||
None, 0 and -1 will be interpreted as return all splits | ||
Limit number of splits in output. | ||
``None``, 0 and -1 will be interpreted as return all splits. | ||
expand : bool, default False | ||
Looking at this in more detail, I see why you have this documented here in spite of the fact that @jreback is there any reason why we wouldn't want to deprecate items like
@WillAyd the Feel free to open an issue if you have a proposal how this could be improved. |
||
* If True, return DataFrame/MultiIndex expanding dimensionality. | ||
* If False, return Series/Index. | ||
Expand the split strings into separate columns. | ||
|
||
return_type : deprecated, use `expand` | ||
* If ``True``, return DataFrame/MultiIndex expanding dimensionality. | ||
* If ``False``, return Series/Index, containing lists of strings. | ||
|
||
Returns | ||
------- | ||
split : Series/Index or DataFrame/MultiIndex of objects | ||
I find this return type rather confusing - how does the Index play a part? I think this could be clarified better
Yes, I was confused about that as well. Should I keep it in the doc?
For returns make sure the first line is just the type (unless returning more than one, which isn't the case here). I think that would best be |
||
Type matches caller unless ``expand=True`` (return type is DataFrame or | ||
MultiIndex) | ||
|
||
Notes | ||
----- | ||
- If n >= default splits, makes all splits | ||
- If n < default splits, makes first n splits only | ||
- Appends `None` for padding if ``expand=True`` | ||
One final comment: I find this list not fully clear. What is 'n' and what is 'default splits'? I suppose it details how Proposal (but not fully sure this is what you meant):
right! @jorisvandenbossche wanted to ask about an issue I faced while setting up pandas.
Yes, nbsphinx is missing from the dev requirements, they are only included in the optional ones. Will open an issue about that. |
||
|
||
Examples | ||
The examples are good but I think the section can use a short sentence or two to introduce the examples and clue the reader in on what they should be looking at. Also, is it possible to show an example that deals with missing values?
Might have missed mentioning this but while I think the examples are good you should add a sentence (or a few) to call out what users should be looking at with the examples. It would also be nice to show one or two with missing data to illustrate how
Like this?
Also, I wrote an example to show NaN propagation but None is being propagated instead of NaN.
Shouldn't output be?
Put your comments before the example and just highlight what the user should look at. So for your first example say something like "By default, split will return an object of the same size containing lists to hold the split elements" and then introduce the second with something like "By contrast, when using As far as your example is concerned, make sure you run everything on the master branch. My guess is you are using an older version of pandas as the fix to propagate
ah, right. |
||
-------- | ||
>>> s = pd.Series(["this is good text", "but this is even better"]) | ||
|
||
By default, split will return an object of the same size | ||
having lists containing the split elements | ||
|
||
>>> s.str.split() | ||
xref my other comment - kind of strange that the method here is actually
same is the case for
Yep thanks - might be a few other instances in the module as well. I don't want it to hold up what you've done here but could be a good follow up for you to clean up the actual functions
Should I work on it in this same pr, or should I make a separate issue for this and work on that?
Separate issue |
||
0 [this, is, good, text] | ||
1 [but, this, is, even, better] | ||
dtype: object | ||
>>> s.str.split("random") | ||
0 [this is good text] | ||
1 [but this is even better] | ||
dtype: object | ||
|
||
When using ``expand=True``, the split elements will | ||
expand out into separate columns. | ||
|
||
>>> s.str.split(expand=True) | ||
0 1 2 3 4 | ||
0 this is good text None | ||
1 but this is even better | ||
>>> s.str.split(" is ", expand=True) | ||
0 1 | ||
0 this good text | ||
1 but this even better | ||
|
||
Parameter `n` can be used to limit the number of splits in the output. | ||
|
||
>>> s.str.split("is", n=1) | ||
0 [th, is good text] | ||
1 [but th, is even better] | ||
dtype: object | ||
>>> s.str.split("is", n=1, expand=True) | ||
0 1 | ||
0 th is good text | ||
1 but th is even better | ||
|
||
If NaN is present, it is propagated throughout the columns | ||
during the split. | ||
|
||
>>> s = pd.Series(["this is good text", "but this is even better", np.nan]) | ||
>>> s.str.split(n=3, expand=True) | ||
0 1 2 3 | ||
0 this is good text | ||
1 but this is even better | ||
2 NaN NaN NaN NaN | ||
""" | ||
if pat is None: | ||
if n is None or n == 0: | ||
|
General comment - rather than putting sub-bullets under the parameters would it be clearer to move those into a dedicated Notes section? They do explain some of the implementation details so wonder if they are better served there
That would be better. I'll update it.
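For readers following the discussion about what `n`, padding, and missing values mean, the rules stated in the docstring can be sketched in plain Python. This is only an illustration under stated assumptions, not the pandas implementation: `split_one` and `split_all` are hypothetical helper names, `pat` is treated as a literal separator (no regular expressions), and a plain list stands in for a Series.

```python
import math


def split_one(value, pat=None, n=-1):
    """Split one value; hypothetical helper, not part of pandas."""
    # Missing values (None or float NaN) are passed through unchanged.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return value
    # None, 0 and -1 all mean "return all splits", as the docstring states.
    maxsplit = -1 if n in (None, 0, -1) else n
    # Assumption: pat is a literal separator here; None splits on whitespace.
    return value.split(pat, maxsplit)


def split_all(values, pat=None, n=-1, expand=False):
    """Apply split_one to each value; pad rows when expand is True."""
    rows = [split_one(v, pat, n) for v in values]
    if not expand:
        return rows
    # The widest split result determines the number of "columns".
    width = max((len(r) for r in rows if isinstance(r, list)), default=0)
    expanded = []
    for r in rows:
        if isinstance(r, list):
            # Shorter rows are padded with None.
            expanded.append(r + [None] * (width - len(r)))
        else:
            # A missing value fills every column of its row.
            expanded.append([r] * width)
    return expanded


data = ["this is good text", "but this is even better"]
print(split_all(data, "is", n=1))
# [['th', ' is good text'], ['but th', ' is even better']]
print(split_all(data, expand=True)[0])
# ['this', 'is', 'good', 'text', None]
```

The sketch keeps the three behaviours separate (limit handling, missing-value pass-through, padding) so each can be compared against the corresponding sentence in the docstring.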