DOC: updated the Series.str.rsplit and Series.str.split docstrings #21026

ryankarlos · 2018-05-14T00:32:12Z

closes #xxxx
tests passed validate_docstrings.py
passes git diff upstream/master -u -- "*.py" | flake8 --diff
rendered document looks ok 'python make.py --single pandas.Series.str.rsplit' and 'python make.py --single pandas.Series.str.split'

ryankarlos · 2018-05-14T00:39:38Z

@datapythonista Finally got round to completing this after the PyData Pandas Sprint - let me know what you think ?

codecov · 2018-05-14T02:03:13Z

Codecov Report

Merging #21026 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21026      +/-   ##
==========================================
+ Coverage    91.9%    91.9%   +<.01%     
==========================================
  Files         153      153              
  Lines       49549    49550       +1     
==========================================
+ Hits        45539    45540       +1     
  Misses       4010     4010

Flag	Coverage Δ
#multiple	`90.3% <100%> (ø)`	⬆️
#single	`41.78% <100%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/strings.py	`98.63% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 028c9c0...b11873f. Read the comment docs.

datapythonista · 2018-05-14T13:12:05Z

I was taking a look, and I don't see a reason why split and rsplit should have different docstrings. I guess the docstrings are independent just because historical reasons. If you don't see a reason either, could you merge what you did with the existing split docstring please? You can take a look at partition or rstrip to see how to reuse a docstring in two methods.

ryankarlos · 2018-05-19T00:54:20Z

@datapythonista Thanks. No I didn't find any reason for different docstrings, except the output of the examples being different. There were a few typos and whitespace issues that were corrected in the exisiting rstrip docstring - do you suggest I revert back to the format it was in before ?

datapythonista · 2018-05-19T20:27:20Z

No, it's not about the format. In pandas, there are "normal" docstrings that are independent from the others, and are defined at the beginning of the function, like it was this case.

But then, there are methods or functions that are very similar, and the docstring is created once (usually at the beginning of the file), and then reused (by injecting it into the function) in more than one case.

So, originally, before you did any change, split and rsplit had different/independent docstrings. And you kept it this way, as expected. But what I'm proposing, is that instead of simply writing the docstring for rsplit, that you create a shared docstring with the best of what you did and the existing docstring of split, and then you reuse it for both functions. It may sound complex, but if you take a look at how this is done in partition and rstrip you'll see that it's very easy. And I think the final result will be much better, not having two separate docstrings that are (or should be) almost the same.

Let me know what do you think, and if you face any problem. Thanks!

pep8speaks · 2018-06-03T14:57:57Z

Hello @ryankarlos! Thanks for updating the PR.

In the file pandas/core/strings.py, following are the PEP8 issues :

Line 2258:80: E501 line too long (110 > 79 characters)
Line 2268:80: E501 line too long (81 > 79 characters)
Line 2304:80: E501 line too long (83 > 79 characters)
Line 2305:80: E501 line too long (83 > 79 characters)
Line 2306:80: E501 line too long (83 > 79 characters)
Line 2307:80: E501 line too long (83 > 79 characters)

Comment last updated on June 22, 2018 at 10:20 Hours UTC

ryankarlos · 2018-06-03T15:10:06Z

@datapythonista Sorry for the delay. Oh ok I understand now, yes that makes sense. I have now updated the docstring to include a resuable docstring for rsplit and split like rpartition as you suggested. I have also adapted the original independent docstring so it only includes minimal information like partition and strip (excluding examples etc as I have added them in later). Let me know if it looks ok.

datapythonista · 2018-06-03T15:17:34Z

Did you generate the docstrings to see if they are rendered ok? I may be wrong, but I think some stuff will appear duplicated.

Also, can you take a look at the PEP-8 errors in you code please? Seems like you have some tailing spaces and a line too long.

ryankarlos · 2018-06-08T21:10:47Z

@datapythonista I have now corrected all whitespace issues and other PEP-8 errors. I rendered it and it looks ok - the examples were duplicated in both (since I had combined the rsplit and split examples for both functions. However, I have split them up so that rsplit examples will only show up in the rsplit doc and split examples in split doc. I have attached screenshots of what the sections of the docs look like.

datapythonista · 2018-06-08T22:04:37Z

Looks quite good, but if I'm not mistaken, the docstrings under str_split and str_rsplit are not being used, are they? If that's right, I'd get rid of them.

Then, while your html looks great, creating separate examples and See Also options makes the code more complicated. What we've been doing in the other strings is in the See Also add both docstrings, even if they will self reference themselves. And for the examples, just show both methods in the same examples (so, a single set of examples that explain both split and rsplit).

It's not a big deal, but while the documentation is better the way you did it, we should try to keep the code simple too. I'd personally do it in the same way as in the other cases (you can take a look at the docstrings of the methods I mentioned before).

ryankarlos · 2018-06-09T23:06:21Z

@datapythonista Yes, the docstrings under str_split and str_rsplit are not being used so i have got rid of them in the most recent commit. Im a bit confused with what you meant by adding in both docstrings in the 'see also' bit. The docstring of the other methods like partition docstring just seems to be referencing the rpartition docstring in 'see also' and not self referencing partition as well.
Ive also now created a single set of examples that explain both split and rsplit. Let me know.
Also, Im not too sure why one of the continuous integration check is failing....it was passing before ?

datapythonista

It looks much better now.

Added some comments with formatting and some ideas that IMO would make this doc better, but we're almost there.

datapythonista · 2018-06-11T09:21:47Z

pandas/core/strings.py

+    Split strings around given separator/delimiter.
+
+    Splits the string in the Series/Index from the %(side)s,
+    at the specified delimiter string.Equivalent to :meth:`str.%(method)s`.


Missing space after the period.

datapythonista · 2018-06-11T09:30:25Z

pandas/core/strings.py

+    If NaN is present, it is propagated throughout the columns
+    during the split.
+
+    >>> s = pd.Series(["this is good text", "but this is even better", np.nan])


Personally, I'd use a single example at the beginning that contains different data that can illustrate all examples.

So, instead of adding the np.nan here, I'd include it from the beginning, and I'd avoid creating more data at the middle of the example.

Also, I find that this is good text and but this is even better kind of illustrate the same behaviour in every example. Something like This is a regular sentence and this,is,comma,separated,text for example could be used to show what is returned when splitting by space and not spaces are found, and then how to use another delimiter like comma (IMO in a more realistic way that splitting by a word).

Feel free to use the examples that make sense for you, I was just providing them to explain what I mean.

datapythonista · 2018-06-11T09:38:51Z

pandas/core/strings.py

+
+    See Also
+    --------
+    %(also)s


For the See Also, we can leave it like you implemented it. But what we've been doing in other docstrings is to just add here all the methods that use this docstring. They will be self-referencing, besides referencing the similar methods, but that's not a being deal.

So, in this case, instead of %(also)s we would have something like:

split : Splits string at the first occurrence of delimiter. rsplit : Splits string at the last occurrence of delimiter.

Note that the description should end in a period (you don't have it). Also, I find these descriptions somehow misleading, as it seems that n is always 1.

But also, there are other methods that I think it's worth adding here.At least: Series.str.join, str.split and str.rsplit.

ryankarlos · 2018-06-19T01:51:00Z

@datapythonista Sorry for the delay in gettting back. Thanks for the feedback - Ive updated the examples and 'See Also' section now.

datapythonista

just couple of small things, but looks really great.

datapythonista · 2018-06-19T09:38:50Z

pandas/core/strings.py

+        Type matches caller unless ``expand=True`` (see Notes).
+
+    Notes
+    -----


Can you move the Notes after See Also? That's the ordering of the numpydoc standard.

datapythonista · 2018-06-19T09:41:02Z

pandas/core/strings.py

+
+    >>> s.str.split(n=2)
+    0    [this, is, a regular sentence]
+    1    [this,is,comma,separated,text]


I didn't anticipate this before, but if we use comma separated values in the string, it's difficult to tell here which one has been splitted. Can be use something else like semicolon, or something like /path/to/some/file that looks like a real-world example and makes it obvious if it has been splitted or not?

Yes I think the issues with commas is that it is difficult to notice the separation space, thats what i noticed - further complicated by the fact that more commas are added in where the split happens. I think the path or semi colon is a good idea - ill see which one looks best

datapythonista · 2018-06-19T09:43:15Z

pandas/core/strings.py

+    --------
+    >>> s = pd.Series(["this is a regular sentence", "this,is,comma,separated,text", np.nan])
+
+    By default, split and rsplit will return an object of the same size


I'd start by the simplest possible example, with all the arguments with default values: >>> s.str.split(). I think for many users that's the only example they'll care about when checking this doc.

I was thinking about that - but then there would be a bit redundancy further down. I could rearrange it look something like this - what do you think ?

Default arguments

s.str.split()
s.str.rsplit()

semicolon separator or path separator

s.str.split(";")
s.str.rsplit(";")

number of splits

s.str.split(n=2)
s.str.rsplit(n=2)

with expand argument

s.str.split(n=2, expand = True)
s.str.rsplit(n= 2, expand = True)

I'd try to avoid this redundancy, I don't think it adds a lot of value to the user. I'd think it as if rsplit was split(side='reverse'). Meaning that I'd write every use case for split only, and a single user case for rsplit. Or may be rsplit(n=2), as the behaviour is diffirent, but I don't think it adds value to show how rsplit works with different separators, as it's exactly the same as split.

But feel free to do it in a way that it makes sense to you.

ryankarlos · 2018-06-19T11:31:01Z

@datapythonista thanks for the comments, makes sense - ill add those in. do you know why one of the jobs in the continuous-integration travis-ci checks may be failing - im thinking its due to some linting issues but can't figure out exactly what it is to make it pass ? I can't find any similar issues raised by others

datapythonista · 2018-06-19T12:37:50Z

I'm not sure about the problem with travis. If it fails again after updating the PR, may be you can create an environment with Python 3.5 and run lint.sh manually? I don't see much information in travis logs, so hopefully it can be reproduced locally.

ryankarlos · 2018-06-21T02:02:41Z

@datapythonista just updated the PR. So what I did in the examples was to only include rsplit in one example (changing n with path separator) and the rest with just split. Also included the default split as the first example. At the moment it will just show up as s.str.split() for both docs - i don't know if it would look better to have it as s.str.rsplit() for the rsplit documentation - of course the output doesn't make a difference but i was just thinking it may look odd if someone sees split as the default first example under rsplit doc. If its not a big deal we can leave it like it is.

datapythonista · 2018-06-21T07:36:54Z

pandas/core/strings.py

+    0    [this is a regular sentence]
+    1         [path/to, python, file]
+    2                             NaN
+    dtype: object


Looks good to me. May be it would be a bit more clear if we introduce one parameter at a time, each with its explanation. Meaning that these two previous examples will be these 3:

s.str.split(pat='/') - with a sentence explaining that pat can be used to split by other characters

s.str.split(n=2) - that's the one you're explaining, but I'd use the default pat to make it slightly simple

s.str.rsplit(n=2) - here we can explain how rsplitis different thansplit`.

That does not fully address your concern that if can feel strange that rsplit page starts by a split example, but IMO will make it easier to understand when rsplit should be used.

An idea that just came to my mind is that instead of /path/to/python/file we could use for example http://pandas.pydata.org/pandas-docs/stable/api.html (or a shorter url). Then, rsplit(n=1) can be used to split the html document from the rest of the url.

One last idea that you can consider is having an example s.str.rsplit() after the first one. With a comment like "Without the n parameter, rsplit behaves like split" (I think this is the case, may be you want to double check, for example when using expand). If you think this can make users in the rsplit page be less confused, it could be a good idea.

The docstring looks good to me, feel free to incorporate the changes that makes sense to you.

@datapythonista I like the idea of using http://pandas.pydata.org/pandas-docs/stable/api.html but then the issue is that we will need to rsplit(pat ='/', n=1) as rsplit(n=1) won't work in this instance -should this be another example then as its utilising both parameters pat and n ? So how i see the different examples ordered and sectioned are as below -
Ive seen other docstrings only explain the examples at the beginning of sections (where a different parameter is introduced) - so in this case where i will explain how rsplit is different to split - does this have to be introduced at the beginning when introducing both split and rsplit examples for n=2 or can the sentence come just before the rsplit example - is there a specific convention for this ?

Default example

s.str.split()

s.str.rsplit() - sentence explaining how rsplit produces the same output as split in this case

Parameter n

s.str.split(n=2)

s.str.rsplit(n=2) - explain how rsplit is different than split

Parameter pat

s.str.split(pat=' / ') - sentence explaining how pat can be used to split by other characters like /

Real world example using both pat and n

s.str.rsplit(pat = ' / ', n=1) - example of how pat and n can be used together to split the html document in "http://pandas.pydata.org/pandas-docs/stable/visualization.html" from rest of url.

That sounds good to me. I think the docstring is already good. I think those changes will make it better, but whatever you think will be more useful (and consice) for users, will be great.

ryankarlos · 2018-06-22T00:20:11Z

@datapythonista Updated PR with changes to examples added in. Here's screenshots of what the examples look like in the rendered html - it think its looking a lot more complete now.

datapythonista

Looks really good to me. Great PR, thanks @ryankarlos

jreback · 2018-06-22T10:19:43Z

@ryankarlos have some linting issues:

(pandas) bash-3.2$ git diff master --name-only -- "*.py" | grep "pandas" | xargs flake8
git diff master --name-only -- "*.py" | grep "pandas" | xargs flake8
pandas/core/strings.py:1366:1: W293 blank line contains whitespace
pandas/core/strings.py:2237:73: W291 trailing whitespace
pandas/core/strings.py:2239:76: W291 trailing whitespace
pandas/core/strings.py:2240:28: W291 trailing whitespace
pandas/core/strings.py:2258:80: E501 line too long (110 > 79 characters)
pandas/core/strings.py:2258:111: W291 trailing whitespace
pandas/core/strings.py:2268:80: E501 line too long (81 > 79 characters)
pandas/core/strings.py:2279:25: W291 trailing whitespace
pandas/core/strings.py:2291:66: W291 trailing whitespace
pandas/core/strings.py:2300:69: W291 trailing whitespace
pandas/core/strings.py:2304:80: E501 line too long (83 > 79 characters)
pandas/core/strings.py:2305:80: E501 line too long (83 > 79 characters)
pandas/core/strings.py:2305:84: W291 trailing whitespace
pandas/core/strings.py:2306:80: E501 line too long (83 > 79 characters)
pandas/core/strings.py:2307:80: E501 line too long (83 > 79 characters)
pandas/core/strings.py:2322:9: E123 closing bracket does not match indentation of opening bracket's line
pandas/core/strings.py:2330:9: E123 closing bracket does not match indentation of opening bracket's line
pandas/core/strings.py:2330:11: W291 trailing whitespace

jreback · 2018-06-22T10:20:23Z

i fixed some of those.

ryankarlos · 2018-06-22T12:12:25Z

@jreback ahh sorry - I fixed these now but Im getting another weird error I haven't seen before when running validate_docstrings.py - it was working ok previously. Seems to be some class inheritance issue in one of the other recently updated files but not exactly sure ? Also, same errors when trying to render the html.

Ryans-MacBook-Air:doc ryannazareth$ python ../scripts/validate_docstrings.py pandas.Series.str.rsplit
Traceback (most recent call last):
File "../scripts/validate_docstrings.py", line 37, in
import pandas
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/init.py", line 42, in
from pandas.core.api import *
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/api.py", line 10, in
from pandas.core.groupby.groupby import Grouper
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/groupby/init.py", line 2, in
from pandas.core.groupby.groupby import (
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/groupby/groupby.py", line 46, in
from pandas.core.index import (Index, MultiIndex,
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/index.py", line 2, in
from pandas.core.indexes.api import *
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/indexes/api.py", line 4, in
from pandas.core.indexes.base import (Index,
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/indexes/base.py", line 7, in
from pandas._libs import (lib, index as libindex, tslib as libts,
File "pandas/_libs/index.pyx", line 28, in init pandas._libs.index
File "pandas/_libs/tslibs/period.pyx", line 59, in init pandas._libs.tslibs.period
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/tseries/offsets.py", line 2331, in
offset=BDay(), time_rule=None):
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/tseries/offsets.py", line 485, in init
BaseOffset.init(self, n, normalize)
TypeError: object.init() takes no parameters

datapythonista · 2018-06-22T12:37:55Z

@ryankarlos I think the problem is with the C extensions. I'd get the last version from master, and then build the extenstions again:

git fetch upstream
git merge upstream/master
python setup.py build_ext --inplace -j4

ryankarlos · 2018-06-22T17:46:56Z

@datapythonista Thanks that seemed to do the trick !
@jreback PR updated with fixed linting issues

jreback · 2018-06-22T23:36:48Z

thanks @ryankarlos

…andas-dev#21026)

ryankarlos changed the title ~~Docstring rsplit~~ DOC: update the Series.str.rsplit docstring May 14, 2018

ryankarlos changed the title ~~DOC: update the Series.str.rsplit docstring~~ DOC: updated the Series.str.rsplit docstring May 14, 2018

gfyoung added the Docs label May 21, 2018

ryankarlos added 4 commits June 3, 2018 15:41

Updated docstring for str.rsplit

14d6cb3

Update docstring following validation test

e79511e

Remove trailing whitespace line 1470

77d8dad

DOC: Created reusable docstring for split and rsplit

2a1ea74

ryankarlos force-pushed the docstring_rsplit branch from 8aed679 to 2a1ea74 Compare June 3, 2018 14:57

ryankarlos added 2 commits June 3, 2018 16:37

DOC: Created reusable docstring for split and rsplit

522ca66

Remove trailing whitespace

25bf40b

ryankarlos changed the title ~~DOC: updated the Series.str.rsplit docstring~~ DOC: updated the Series.str.rsplit and Series.str.split docstrings Jun 8, 2018

Modified see also

4d7d71b

ryankarlos closed this Jun 9, 2018

ryankarlos deleted the docstring_rsplit branch June 9, 2018 23:03

ryankarlos restored the docstring_rsplit branch June 9, 2018 23:05

ryankarlos reopened this Jun 9, 2018

ryankarlos force-pushed the docstring_rsplit branch 2 times, most recently from f96f989 to 4d7d71b Compare June 10, 2018 00:51

datapythonista reviewed Jun 11, 2018

View reviewed changes

Modified examples and See Also

773cfbf

ryankarlos force-pushed the docstring_rsplit branch from 9773ec7 to 773cfbf Compare June 19, 2018 01:46

datapythonista reviewed Jun 19, 2018

View reviewed changes

ryankarlos added 2 commits June 21, 2018 02:24

Updated examples 21 June 2018

60aaaed

Update Examples 21 June 2018

e9857e1

ryankarlos force-pushed the docstring_rsplit branch from 0d986fc to e9857e1 Compare June 21, 2018 01:57

datapythonista reviewed Jun 21, 2018

View reviewed changes

Examples - final version

190f0cb

datapythonista approved these changes Jun 22, 2018

View reviewed changes

Merge branch 'master' into PR_TOOL_MERGE_PR_21026

cceb362

some lint

390d351

jreback added this to the 0.24.0 milestone Jun 22, 2018

ryankarlos added 2 commits June 22, 2018 16:20

Merge remote-tracking branch 'upstream/master' into docstring_rsplit

9b81132

Fixed lint issues

b11873f

jreback merged commit 7bee353 into pandas-dev:master Jun 22, 2018

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

DOC: updated the Series.str.rsplit and Series.str.split docstrings (p…

912651e

…andas-dev#21026)

DOC: updated the Series.str.rsplit and Series.str.split docstrings #21026

DOC: updated the Series.str.rsplit and Series.str.split docstrings #21026

Conversation

ryankarlos commented May 14, 2018 • edited Loading

ryankarlos commented May 14, 2018

codecov bot commented May 14, 2018 • edited Loading

Codecov Report

datapythonista commented May 14, 2018

ryankarlos commented May 19, 2018

datapythonista commented May 19, 2018

pep8speaks commented Jun 3, 2018 • edited Loading

Comment last updated on June 22, 2018 at 10:20 Hours UTC

ryankarlos commented Jun 3, 2018 • edited Loading

datapythonista commented Jun 3, 2018

ryankarlos commented Jun 8, 2018 • edited Loading

datapythonista commented Jun 8, 2018

ryankarlos commented Jun 9, 2018 • edited Loading

datapythonista left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ryankarlos commented Jun 19, 2018

datapythonista left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ryankarlos Jun 19, 2018 • edited Loading

Choose a reason for hiding this comment

Default arguments

semicolon separator or path separator

number of splits

with expand argument

Choose a reason for hiding this comment

ryankarlos commented Jun 19, 2018

datapythonista commented Jun 19, 2018

ryankarlos commented Jun 21, 2018 • edited Loading

Choose a reason for hiding this comment

ryankarlos Jun 21, 2018 • edited Loading

Choose a reason for hiding this comment

Default example

Parameter n

Parameter pat

Real world example using both pat and n

Choose a reason for hiding this comment

ryankarlos commented Jun 22, 2018 • edited Loading

datapythonista left a comment

Choose a reason for hiding this comment

jreback commented Jun 22, 2018

jreback commented Jun 22, 2018

ryankarlos commented Jun 22, 2018 • edited Loading

datapythonista commented Jun 22, 2018

ryankarlos commented Jun 22, 2018 • edited Loading

jreback commented Jun 22, 2018

ryankarlos commented May 14, 2018 •

edited

Loading

codecov bot commented May 14, 2018 •

edited

Loading

pep8speaks commented Jun 3, 2018 •

edited

Loading

ryankarlos commented Jun 3, 2018 •

edited

Loading

ryankarlos commented Jun 8, 2018 •

edited

Loading

ryankarlos commented Jun 9, 2018 •

edited

Loading

ryankarlos Jun 19, 2018 •

edited

Loading

ryankarlos commented Jun 21, 2018 •

edited

Loading

ryankarlos Jun 21, 2018 •

edited

Loading

ryankarlos commented Jun 22, 2018 •

edited

Loading

ryankarlos commented Jun 22, 2018 •

edited

Loading

ryankarlos commented Jun 22, 2018 •

edited

Loading