Skip to content

DOC: updated the Series.str.rsplit and Series.str.split docstrings #21026

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jun 22, 2018

Conversation

ryankarlos
Copy link
Contributor

@ryankarlos ryankarlos commented May 14, 2018

  • closes #xxxx
  • tests passed validate_docstrings.py
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • rendered document looks ok 'python make.py --single pandas.Series.str.rsplit' and 'python make.py --single pandas.Series.str.split'

@ryankarlos ryankarlos changed the title Docstring rsplit DOC: update the Series.str.rsplit docstring May 14, 2018
@ryankarlos ryankarlos changed the title DOC: update the Series.str.rsplit docstring DOC: updated the Series.str.rsplit docstring May 14, 2018
@ryankarlos
Copy link
Contributor Author

@datapythonista Finally got round to completing this after the PyData Pandas Sprint - let me know what you think ?

@codecov
Copy link

codecov bot commented May 14, 2018

Codecov Report

Merging #21026 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21026      +/-   ##
==========================================
+ Coverage    91.9%    91.9%   +<.01%     
==========================================
  Files         153      153              
  Lines       49549    49550       +1     
==========================================
+ Hits        45539    45540       +1     
  Misses       4010     4010
Flag Coverage Δ
#multiple 90.3% <100%> (ø) ⬆️
#single 41.78% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/strings.py 98.63% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 028c9c0...b11873f. Read the comment docs.

@datapythonista
Copy link
Member

I was taking a look, and I don't see a reason why split and rsplit should have different docstrings. I guess the docstrings are independent just because historical reasons. If you don't see a reason either, could you merge what you did with the existing split docstring please? You can take a look at partition or rstrip to see how to reuse a docstring in two methods.

@ryankarlos
Copy link
Contributor Author

@datapythonista Thanks. No I didn't find any reason for different docstrings, except the output of the examples being different. There were a few typos and whitespace issues that were corrected in the exisiting rstrip docstring - do you suggest I revert back to the format it was in before ?

@datapythonista
Copy link
Member

No, it's not about the format. In pandas, there are "normal" docstrings that are independent from the others, and are defined at the beginning of the function, like it was this case.

But then, there are methods or functions that are very similar, and the docstring is created once (usually at the beginning of the file), and then reused (by injecting it into the function) in more than one case.

So, originally, before you did any change, split and rsplit had different/independent docstrings. And you kept it this way, as expected. But what I'm proposing, is that instead of simply writing the docstring for rsplit, that you create a shared docstring with the best of what you did and the existing docstring of split, and then you reuse it for both functions. It may sound complex, but if you take a look at how this is done in partition and rstrip you'll see that it's very easy. And I think the final result will be much better, not having two separate docstrings that are (or should be) almost the same.

Let me know what do you think, and if you face any problem. Thanks!

@gfyoung gfyoung added the Docs label May 21, 2018
@pep8speaks
Copy link

pep8speaks commented Jun 3, 2018

Hello @ryankarlos! Thanks for updating the PR.

Line 2258:80: E501 line too long (110 > 79 characters)
Line 2268:80: E501 line too long (81 > 79 characters)
Line 2304:80: E501 line too long (83 > 79 characters)
Line 2305:80: E501 line too long (83 > 79 characters)
Line 2306:80: E501 line too long (83 > 79 characters)
Line 2307:80: E501 line too long (83 > 79 characters)

Comment last updated on June 22, 2018 at 10:20 Hours UTC

@ryankarlos
Copy link
Contributor Author

ryankarlos commented Jun 3, 2018

@datapythonista Sorry for the delay. Oh ok I understand now, yes that makes sense. I have now updated the docstring to include a resuable docstring for rsplit and split like rpartition as you suggested. I have also adapted the original independent docstring so it only includes minimal information like partition and strip (excluding examples etc as I have added them in later). Let me know if it looks ok.

@datapythonista
Copy link
Member

Did you generate the docstrings to see if they are rendered ok? I may be wrong, but I think some stuff will appear duplicated.

Also, can you take a look at the PEP-8 errors in you code please? Seems like you have some tailing spaces and a line too long.

@ryankarlos
Copy link
Contributor Author

ryankarlos commented Jun 8, 2018

@datapythonista I have now corrected all whitespace issues and other PEP-8 errors. I rendered it and it looks ok - the examples were duplicated in both (since I had combined the rsplit and split examples for both functions. However, I have split them up so that rsplit examples will only show up in the rsplit doc and split examples in split doc. I have attached screenshots of what the sections of the docs look like.

str rsplit_params
str rsplit_examples
str split_params
str split_examples

@ryankarlos ryankarlos changed the title DOC: updated the Series.str.rsplit docstring DOC: updated the Series.str.rsplit and Series.str.split docstrings Jun 8, 2018
@datapythonista
Copy link
Member

Looks quite good, but if I'm not mistaken, the docstrings under str_split and str_rsplit are not being used, are they? If that's right, I'd get rid of them.

Then, while your html looks great, creating separate examples and See Also options makes the code more complicated. What we've been doing in the other strings is in the See Also add both docstrings, even if they will self reference themselves. And for the examples, just show both methods in the same examples (so, a single set of examples that explain both split and rsplit).

It's not a big deal, but while the documentation is better the way you did it, we should try to keep the code simple too. I'd personally do it in the same way as in the other cases (you can take a look at the docstrings of the methods I mentioned before).

@ryankarlos ryankarlos closed this Jun 9, 2018
@ryankarlos ryankarlos deleted the docstring_rsplit branch June 9, 2018 23:03
@ryankarlos ryankarlos restored the docstring_rsplit branch June 9, 2018 23:05
@ryankarlos
Copy link
Contributor Author

ryankarlos commented Jun 9, 2018

@datapythonista Yes, the docstrings under str_split and str_rsplit are not being used so i have got rid of them in the most recent commit. Im a bit confused with what you meant by adding in both docstrings in the 'see also' bit. The docstring of the other methods like partition docstring just seems to be referencing the rpartition docstring in 'see also' and not self referencing partition as well.
Ive also now created a single set of examples that explain both split and rsplit. Let me know.
Also, Im not too sure why one of the continuous integration check is failing....it was passing before ?

str split1
str split2
str rsplit1
str rsplit2

@ryankarlos ryankarlos reopened this Jun 9, 2018
@ryankarlos ryankarlos force-pushed the docstring_rsplit branch 2 times, most recently from f96f989 to 4d7d71b Compare June 10, 2018 00:51
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks much better now.

Added some comments with formatting and some ideas that IMO would make this doc better, but we're almost there.

Split strings around given separator/delimiter.

Splits the string in the Series/Index from the %(side)s,
at the specified delimiter string.Equivalent to :meth:`str.%(method)s`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after the period.

If NaN is present, it is propagated throughout the columns
during the split.

>>> s = pd.Series(["this is good text", "but this is even better", np.nan])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'd use a single example at the beginning that contains different data that can illustrate all examples.

So, instead of adding the np.nan here, I'd include it from the beginning, and I'd avoid creating more data at the middle of the example.

Also, I find that this is good text and but this is even better kind of illustrate the same behaviour in every example. Something like This is a regular sentence and this,is,comma,separated,text for example could be used to show what is returned when splitting by space and not spaces are found, and then how to use another delimiter like comma (IMO in a more realistic way that splitting by a word).

Feel free to use the examples that make sense for you, I was just providing them to explain what I mean.


See Also
--------
%(also)s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the See Also, we can leave it like you implemented it. But what we've been doing in other docstrings is to just add here all the methods that use this docstring. They will be self-referencing, besides referencing the similar methods, but that's not a being deal.

So, in this case, instead of %(also)s we would have something like:

split : Splits string at the first occurrence of delimiter.
rsplit : Splits string at the last occurrence of delimiter.

Note that the description should end in a period (you don't have it). Also, I find these descriptions somehow misleading, as it seems that n is always 1.

But also, there are other methods that I think it's worth adding here.At least: Series.str.join, str.split and str.rsplit.

@ryankarlos
Copy link
Contributor Author

@datapythonista Sorry for the delay in gettting back. Thanks for the feedback - Ive updated the examples and 'See Also' section now.

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just couple of small things, but looks really great.

Type matches caller unless ``expand=True`` (see Notes).

Notes
-----
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the Notes after See Also? That's the ordering of the numpydoc standard.


>>> s.str.split(n=2)
0 [this, is, a regular sentence]
1 [this,is,comma,separated,text]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't anticipate this before, but if we use comma separated values in the string, it's difficult to tell here which one has been splitted. Can be use something else like semicolon, or something like /path/to/some/file that looks like a real-world example and makes it obvious if it has been splitted or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think the issues with commas is that it is difficult to notice the separation space, thats what i noticed - further complicated by the fact that more commas are added in where the split happens. I think the path or semi colon is a good idea - ill see which one looks best

--------
>>> s = pd.Series(["this is a regular sentence", "this,is,comma,separated,text", np.nan])

By default, split and rsplit will return an object of the same size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd start by the simplest possible example, with all the arguments with default values: >>> s.str.split(). I think for many users that's the only example they'll care about when checking this doc.

Copy link
Contributor Author

@ryankarlos ryankarlos Jun 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about that - but then there would be a bit redundancy further down. I could rearrange it look something like this - what do you think ?

Default arguments

s.str.split()
s.str.rsplit()

semicolon separator or path separator

s.str.split(";")
s.str.rsplit(";")

number of splits

s.str.split(n=2)
s.str.rsplit(n=2)

with expand argument

s.str.split(n=2, expand = True)
s.str.rsplit(n= 2, expand = True)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd try to avoid this redundancy, I don't think it adds a lot of value to the user. I'd think it as if rsplit was split(side='reverse'). Meaning that I'd write every use case for split only, and a single user case for rsplit. Or may be rsplit(n=2), as the behaviour is diffirent, but I don't think it adds value to show how rsplit works with different separators, as it's exactly the same as split.

But feel free to do it in a way that it makes sense to you.

@ryankarlos
Copy link
Contributor Author

@datapythonista thanks for the comments, makes sense - ill add those in. do you know why one of the jobs in the continuous-integration travis-ci checks may be failing - im thinking its due to some linting issues but can't figure out exactly what it is to make it pass ? I can't find any similar issues raised by others

@datapythonista
Copy link
Member

I'm not sure about the problem with travis. If it fails again after updating the PR, may be you can create an environment with Python 3.5 and run lint.sh manually? I don't see much information in travis logs, so hopefully it can be reproduced locally.

@ryankarlos
Copy link
Contributor Author

ryankarlos commented Jun 21, 2018

@datapythonista just updated the PR. So what I did in the examples was to only include rsplit in one example (changing n with path separator) and the rest with just split. Also included the default split as the first example. At the moment it will just show up as s.str.split() for both docs - i don't know if it would look better to have it as s.str.rsplit() for the rsplit documentation - of course the output doesn't make a difference but i was just thinking it may look odd if someone sees split as the default first example under rsplit doc. If its not a big deal we can leave it like it is.

0 [this is a regular sentence]
1 [path/to, python, file]
2 NaN
dtype: object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. May be it would be a bit more clear if we introduce one parameter at a time, each with its explanation. Meaning that these two previous examples will be these 3:

  • s.str.split(pat='/') - with a sentence explaining that pat can be used to split by other characters
  • s.str.split(n=2) - that's the one you're explaining, but I'd use the default pat to make it slightly simple
  • s.str.rsplit(n=2) - here we can explain how rsplitis different thansplit`.

That does not fully address your concern that if can feel strange that rsplit page starts by a split example, but IMO will make it easier to understand when rsplit should be used.

An idea that just came to my mind is that instead of /path/to/python/file we could use for example http://pandas.pydata.org/pandas-docs/stable/api.html (or a shorter url). Then, rsplit(n=1) can be used to split the html document from the rest of the url.

One last idea that you can consider is having an example s.str.rsplit() after the first one. With a comment like "Without the n parameter, rsplit behaves like split" (I think this is the case, may be you want to double check, for example when using expand). If you think this can make users in the rsplit page be less confused, it could be a good idea.

The docstring looks good to me, feel free to incorporate the changes that makes sense to you.

Copy link
Contributor Author

@ryankarlos ryankarlos Jun 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@datapythonista I like the idea of using http://pandas.pydata.org/pandas-docs/stable/api.html but then the issue is that we will need to rsplit(pat ='/', n=1) as rsplit(n=1) won't work in this instance -should this be another example then as its utilising both parameters pat and n ? So how i see the different examples ordered and sectioned are as below -
Ive seen other docstrings only explain the examples at the beginning of sections (where a different parameter is introduced) - so in this case where i will explain how rsplit is different to split - does this have to be introduced at the beginning when introducing both split and rsplit examples for n=2 or can the sentence come just before the rsplit example - is there a specific convention for this ?

Default example

  • s.str.split()
  • s.str.rsplit() - sentence explaining how rsplit produces the same output as split in this case

Parameter n

  • s.str.split(n=2)
  • s.str.rsplit(n=2) - explain how rsplit is different than split

Parameter pat

  • s.str.split(pat=' / ') - sentence explaining how pat can be used to split by other characters like /

Real world example using both pat and n

  • s.str.rsplit(pat = ' / ', n=1) - example of how pat and n can be used together to split the html document in "http://pandas.pydata.org/pandas-docs/stable/visualization.html" from rest of url.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me. I think the docstring is already good. I think those changes will make it better, but whatever you think will be more useful (and consice) for users, will be great.

@ryankarlos
Copy link
Contributor Author

ryankarlos commented Jun 22, 2018

@datapythonista Updated PR with changes to examples added in. Here's screenshots of what the examples look like in the rendered html - it think its looking a lot more complete now.

examples_1

examples_2

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good to me. Great PR, thanks @ryankarlos

@jreback
Copy link
Contributor

jreback commented Jun 22, 2018

@ryankarlos have some linting issues:

(pandas) bash-3.2$ git diff master --name-only -- "*.py" | grep "pandas" | xargs flake8
git diff master --name-only -- "*.py" | grep "pandas" | xargs flake8
pandas/core/strings.py:1366:1: W293 blank line contains whitespace
pandas/core/strings.py:2237:73: W291 trailing whitespace
pandas/core/strings.py:2239:76: W291 trailing whitespace
pandas/core/strings.py:2240:28: W291 trailing whitespace
pandas/core/strings.py:2258:80: E501 line too long (110 > 79 characters)
pandas/core/strings.py:2258:111: W291 trailing whitespace
pandas/core/strings.py:2268:80: E501 line too long (81 > 79 characters)
pandas/core/strings.py:2279:25: W291 trailing whitespace
pandas/core/strings.py:2291:66: W291 trailing whitespace
pandas/core/strings.py:2300:69: W291 trailing whitespace
pandas/core/strings.py:2304:80: E501 line too long (83 > 79 characters)
pandas/core/strings.py:2305:80: E501 line too long (83 > 79 characters)
pandas/core/strings.py:2305:84: W291 trailing whitespace
pandas/core/strings.py:2306:80: E501 line too long (83 > 79 characters)
pandas/core/strings.py:2307:80: E501 line too long (83 > 79 characters)
pandas/core/strings.py:2322:9: E123 closing bracket does not match indentation of opening bracket's line
pandas/core/strings.py:2330:9: E123 closing bracket does not match indentation of opening bracket's line
pandas/core/strings.py:2330:11: W291 trailing whitespace

@jreback
Copy link
Contributor

jreback commented Jun 22, 2018

i fixed some of those.

@jreback jreback added this to the 0.24.0 milestone Jun 22, 2018
@ryankarlos
Copy link
Contributor Author

ryankarlos commented Jun 22, 2018

@jreback ahh sorry - I fixed these now but Im getting another weird error I haven't seen before when running validate_docstrings.py - it was working ok previously. Seems to be some class inheritance issue in one of the other recently updated files but not exactly sure ? Also, same errors when trying to render the html.

Ryans-MacBook-Air:doc ryannazareth$ python ../scripts/validate_docstrings.py pandas.Series.str.rsplit
Traceback (most recent call last):
File "../scripts/validate_docstrings.py", line 37, in
import pandas
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/init.py", line 42, in
from pandas.core.api import *
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/api.py", line 10, in
from pandas.core.groupby.groupby import Grouper
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/groupby/init.py", line 2, in
from pandas.core.groupby.groupby import (
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/groupby/groupby.py", line 46, in
from pandas.core.index import (Index, MultiIndex,
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/index.py", line 2, in
from pandas.core.indexes.api import *
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/indexes/api.py", line 4, in
from pandas.core.indexes.base import (Index,
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/core/indexes/base.py", line 7, in
from pandas._libs import (lib, index as libindex, tslib as libts,
File "pandas/_libs/index.pyx", line 28, in init pandas._libs.index
File "pandas/_libs/tslibs/period.pyx", line 59, in init pandas._libs.tslibs.period
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/tseries/offsets.py", line 2331, in
offset=BDay(), time_rule=None):
File "/Users/ryannazareth/Documents/PyData_Conference_2018/Pandas_Sprint/pandas/pandas/tseries/offsets.py", line 485, in init
BaseOffset.init(self, n, normalize)
TypeError: object.init() takes no parameters

@datapythonista
Copy link
Member

@ryankarlos I think the problem is with the C extensions. I'd get the last version from master, and then build the extenstions again:

git fetch upstream
git merge upstream/master
python setup.py build_ext --inplace -j4

@ryankarlos
Copy link
Contributor Author

ryankarlos commented Jun 22, 2018

@datapythonista Thanks that seemed to do the trick !
@jreback PR updated with fixed linting issues

@jreback jreback merged commit 7bee353 into pandas-dev:master Jun 22, 2018
@jreback
Copy link
Contributor

jreback commented Jun 22, 2018

thanks @ryankarlos

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants