-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC: updated the Series.str.rsplit and Series.str.split docstrings #21026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@datapythonista Finally got round to completing this after the PyData Pandas Sprint - let me know what you think ? |
Codecov Report
@@ Coverage Diff @@
## master #21026 +/- ##
==========================================
+ Coverage 91.9% 91.9% +<.01%
==========================================
Files 153 153
Lines 49549 49550 +1
==========================================
+ Hits 45539 45540 +1
Misses 4010 4010
Continue to review full report at Codecov.
|
I was taking a look, and I don't see a reason why |
@datapythonista Thanks. No I didn't find any reason for different docstrings, except the output of the examples being different. There were a few typos and whitespace issues that were corrected in the exisiting rstrip docstring - do you suggest I revert back to the format it was in before ? |
No, it's not about the format. In pandas, there are "normal" docstrings that are independent from the others, and are defined at the beginning of the function, like it was this case. But then, there are methods or functions that are very similar, and the docstring is created once (usually at the beginning of the file), and then reused (by injecting it into the function) in more than one case. So, originally, before you did any change, Let me know what do you think, and if you face any problem. Thanks! |
8aed679
to
2a1ea74
Compare
Hello @ryankarlos! Thanks for updating the PR.
Comment last updated on June 22, 2018 at 10:20 Hours UTC |
@datapythonista Sorry for the delay. Oh ok I understand now, yes that makes sense. I have now updated the docstring to include a resuable docstring for rsplit and split like rpartition as you suggested. I have also adapted the original independent docstring so it only includes minimal information like partition and strip (excluding examples etc as I have added them in later). Let me know if it looks ok. |
Did you generate the docstrings to see if they are rendered ok? I may be wrong, but I think some stuff will appear duplicated. Also, can you take a look at the PEP-8 errors in you code please? Seems like you have some tailing spaces and a line too long. |
@datapythonista I have now corrected all whitespace issues and other PEP-8 errors. I rendered it and it looks ok - the examples were duplicated in both (since I had combined the rsplit and split examples for both functions. However, I have split them up so that rsplit examples will only show up in the rsplit doc and split examples in split doc. I have attached screenshots of what the sections of the docs look like. |
Looks quite good, but if I'm not mistaken, the docstrings under Then, while your html looks great, creating separate examples and It's not a big deal, but while the documentation is better the way you did it, we should try to keep the code simple too. I'd personally do it in the same way as in the other cases (you can take a look at the docstrings of the methods I mentioned before). |
@datapythonista Yes, the docstrings under str_split and str_rsplit are not being used so i have got rid of them in the most recent commit. Im a bit confused with what you meant by adding in both docstrings in the 'see also' bit. The docstring of the other methods like partition docstring just seems to be referencing the rpartition docstring in 'see also' and not self referencing partition as well. |
f96f989
to
4d7d71b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks much better now.
Added some comments with formatting and some ideas that IMO would make this doc better, but we're almost there.
pandas/core/strings.py
Outdated
Split strings around given separator/delimiter. | ||
|
||
Splits the string in the Series/Index from the %(side)s, | ||
at the specified delimiter string.Equivalent to :meth:`str.%(method)s`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space after the period.
pandas/core/strings.py
Outdated
If NaN is present, it is propagated throughout the columns | ||
during the split. | ||
|
||
>>> s = pd.Series(["this is good text", "but this is even better", np.nan]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I'd use a single example at the beginning that contains different data that can illustrate all examples.
So, instead of adding the np.nan
here, I'd include it from the beginning, and I'd avoid creating more data at the middle of the example.
Also, I find that this is good text
and but this is even better
kind of illustrate the same behaviour in every example. Something like This is a regular sentence
and this,is,comma,separated,text
for example could be used to show what is returned when splitting by space and not spaces are found, and then how to use another delimiter like comma (IMO in a more realistic way that splitting by a word).
Feel free to use the examples that make sense for you, I was just providing them to explain what I mean.
pandas/core/strings.py
Outdated
|
||
See Also | ||
-------- | ||
%(also)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the See Also
, we can leave it like you implemented it. But what we've been doing in other docstrings is to just add here all the methods that use this docstring. They will be self-referencing, besides referencing the similar methods, but that's not a being deal.
So, in this case, instead of %(also)s
we would have something like:
split : Splits string at the first occurrence of delimiter.
rsplit : Splits string at the last occurrence of delimiter.
Note that the description should end in a period (you don't have it). Also, I find these descriptions somehow misleading, as it seems that n
is always 1
.
But also, there are other methods that I think it's worth adding here.At least: Series.str.join
, str.split
and str.rsplit
.
9773ec7
to
773cfbf
Compare
@datapythonista Sorry for the delay in gettting back. Thanks for the feedback - Ive updated the examples and 'See Also' section now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just couple of small things, but looks really great.
pandas/core/strings.py
Outdated
Type matches caller unless ``expand=True`` (see Notes). | ||
|
||
Notes | ||
----- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move the Notes
after See Also
? That's the ordering of the numpydoc standard.
pandas/core/strings.py
Outdated
|
||
>>> s.str.split(n=2) | ||
0 [this, is, a regular sentence] | ||
1 [this,is,comma,separated,text] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't anticipate this before, but if we use comma separated values in the string, it's difficult to tell here which one has been splitted. Can be use something else like semicolon, or something like /path/to/some/file
that looks like a real-world example and makes it obvious if it has been splitted or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think the issues with commas is that it is difficult to notice the separation space, thats what i noticed - further complicated by the fact that more commas are added in where the split happens. I think the path or semi colon is a good idea - ill see which one looks best
pandas/core/strings.py
Outdated
-------- | ||
>>> s = pd.Series(["this is a regular sentence", "this,is,comma,separated,text", np.nan]) | ||
|
||
By default, split and rsplit will return an object of the same size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd start by the simplest possible example, with all the arguments with default values: >>> s.str.split()
. I think for many users that's the only example they'll care about when checking this doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about that - but then there would be a bit redundancy further down. I could rearrange it look something like this - what do you think ?
Default arguments
s.str.split()
s.str.rsplit()
semicolon separator or path separator
s.str.split(";")
s.str.rsplit(";")
number of splits
s.str.split(n=2)
s.str.rsplit(n=2)
with expand argument
s.str.split(n=2, expand = True)
s.str.rsplit(n= 2, expand = True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd try to avoid this redundancy, I don't think it adds a lot of value to the user. I'd think it as if rsplit
was split(side='reverse')
. Meaning that I'd write every use case for split
only, and a single user case for rsplit
. Or may be rsplit(n=2)
, as the behaviour is diffirent, but I don't think it adds value to show how rsplit
works with different separators, as it's exactly the same as split
.
But feel free to do it in a way that it makes sense to you.
@datapythonista thanks for the comments, makes sense - ill add those in. do you know why one of the jobs in the continuous-integration travis-ci checks may be failing - im thinking its due to some linting issues but can't figure out exactly what it is to make it pass ? I can't find any similar issues raised by others |
I'm not sure about the problem with travis. If it fails again after updating the PR, may be you can create an environment with Python 3.5 and run |
0d986fc
to
e9857e1
Compare
@datapythonista just updated the PR. So what I did in the examples was to only include rsplit in one example (changing n with path separator) and the rest with just split. Also included the default split as the first example. At the moment it will just show up as s.str.split() for both docs - i don't know if it would look better to have it as s.str.rsplit() for the rsplit documentation - of course the output doesn't make a difference but i was just thinking it may look odd if someone sees split as the default first example under rsplit doc. If its not a big deal we can leave it like it is. |
pandas/core/strings.py
Outdated
0 [this is a regular sentence] | ||
1 [path/to, python, file] | ||
2 NaN | ||
dtype: object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. May be it would be a bit more clear if we introduce one parameter at a time, each with its explanation. Meaning that these two previous examples will be these 3:
s.str.split(pat='/')
- with a sentence explaining thatpat
can be used to split by other characterss.str.split(n=2)
- that's the one you're explaining, but I'd use the defaultpat
to make it slightly simples.str.rsplit(n=2) - here we can explain how
rsplitis different than
split`.
That does not fully address your concern that if can feel strange that rsplit
page starts by a split
example, but IMO will make it easier to understand when rsplit
should be used.
An idea that just came to my mind is that instead of /path/to/python/file
we could use for example http://pandas.pydata.org/pandas-docs/stable/api.html
(or a shorter url). Then, rsplit(n=1)
can be used to split the html document from the rest of the url.
One last idea that you can consider is having an example s.str.rsplit()
after the first one. With a comment like "Without the n
parameter, rsplit
behaves like split
" (I think this is the case, may be you want to double check, for example when using expand
). If you think this can make users in the rsplit
page be less confused, it could be a good idea.
The docstring looks good to me, feel free to incorporate the changes that makes sense to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@datapythonista I like the idea of using http://pandas.pydata.org/pandas-docs/stable/api.html but then the issue is that we will need to rsplit(pat ='/', n=1) as rsplit(n=1) won't work in this instance -should this be another example then as its utilising both parameters pat and n ? So how i see the different examples ordered and sectioned are as below -
Ive seen other docstrings only explain the examples at the beginning of sections (where a different parameter is introduced) - so in this case where i will explain how rsplit
is different to split
- does this have to be introduced at the beginning when introducing both split
and rsplit
examples for n=2
or can the sentence come just before the rsplit
example - is there a specific convention for this ?
Default example
s.str.split()
s.str.rsplit()
- sentence explaining howrsplit
produces the same output assplit
in this case
Parameter n
s.str.split(n=2)
s.str.rsplit(n=2)
- explain howrsplit
is different thansplit
Parameter pat
s.str.split(pat=' / ')
- sentence explaining howpat
can be used to split by other characters like/
Real world example using both pat and n
s.str.rsplit(pat = ' / ', n=1)
- example of how pat and n can be used together to split the html document in"http://pandas.pydata.org/pandas-docs/stable/visualization.html"
from rest of url.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me. I think the docstring is already good. I think those changes will make it better, but whatever you think will be more useful (and consice) for users, will be great.
@datapythonista Updated PR with changes to examples added in. Here's screenshots of what the examples look like in the rendered html - it think its looking a lot more complete now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good to me. Great PR, thanks @ryankarlos
@ryankarlos have some linting issues:
|
i fixed some of those. |
@jreback ahh sorry - I fixed these now but Im getting another weird error I haven't seen before when running validate_docstrings.py - it was working ok previously. Seems to be some class inheritance issue in one of the other recently updated files but not exactly sure ? Also, same errors when trying to render the html. Ryans-MacBook-Air:doc ryannazareth$ python ../scripts/validate_docstrings.py pandas.Series.str.rsplit |
@ryankarlos I think the problem is with the C extensions. I'd get the last version from master, and then build the extenstions again:
|
@datapythonista Thanks that seemed to do the trick ! |
thanks @ryankarlos |
git diff upstream/master -u -- "*.py" | flake8 --diff