-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC: update the pandas.DataFrame.clip docstring #20212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goo contribution, added some comments
pandas/core/generic.py
Outdated
@@ -5601,53 +5601,99 @@ def clip(self, lower=None, upper=None, axis=None, inplace=False, | |||
""" | |||
Trim values at input threshold(s). | |||
|
|||
Elements above the upper threshold will be changed to upper threshold. | |||
Elements below the lower threshold will be changed to lower threshold. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems duplicated.
pandas/core/generic.py
Outdated
@@ -5601,53 +5601,99 @@ def clip(self, lower=None, upper=None, axis=None, inplace=False, | |||
""" | |||
Trim values at input threshold(s). | |||
|
|||
Elements above the upper threshold will be changed to upper threshold. | |||
Elements below the lower threshold will be changed to lower threshold. | |||
|
|||
Parameters | |||
---------- | |||
lower : float or array_like, default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if it's better or worse, but I think the standard we defined is to use float, array-like or None, default None
(so the None
is duplicated). Also note the hyphen and not underscore in array-like
pandas/core/generic.py
Outdated
Parameters | ||
---------- | ||
lower : float or array_like, default None | ||
Lower threshold for clipping. Values smaller than upper will be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For naming parameters it's better to have them in backticks. In this case I think it adds value, making it clearer that upper
is a parameter (and lower
).
pandas/core/generic.py
Outdated
upper : float or array_like, default None | ||
Upper threshold for clipping. Values larger than upper will be | ||
converted to upper. | ||
axis : int or string axis name, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We added a standard for axis to the documentation, at the end of the Paramters
section. Basically it's axis : {0 or 'index', 1 or 'columns', None}, default None
pandas/core/generic.py
Outdated
.. versionadded:: 0.21.0 | ||
.. versionadded:: 0.21.0. | ||
args : dictionary of arguments arguments passed to pandas.compat.numpy | ||
kwargs : dictionary of keyword arguments passed to pandas.compat.numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*args
and **kwargs
with the stars (even if the validate script complaints), and without the type (the description part goes in the nest line)
pandas/core/generic.py
Outdated
|
||
Returns | ||
------- | ||
clipped : Series | ||
clipped : DataFrame/Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Series
or DataFrame
pandas/core/generic.py
Outdated
|
||
Examples | ||
-------- | ||
>>> df=pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6], 'c':[7, 8, 9001]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This have several PEP-8 issues because of missing spaces
pandas/core/generic.py
Outdated
a ``Series`` to the lower/upper argument. | ||
|
||
>>> some_data={'A':[-19, 12, -5],'B':[1, 100, -5]} | ||
>>> df=pd.DataFrame(data=some_data, index=['foo', 'bar', 'bizz']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more PEP-8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
args
and kwargs
are passed to validate_clip_with_axis
, that is not trivial. It means that args
and kwargs
actually make sense. These should be clear in the docstring. Check here: https://github.com/pandas-dev/pandas/blob/master/pandas/compat/numpy/function.py#L137
pandas/core/generic.py
Outdated
|
||
Notes | ||
----- | ||
Clipping data is a method for dealing with dubious elements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefer, out-of-range to dubious
pandas/core/generic.py
Outdated
See Also | ||
-------- | ||
pandas.DataFrame.clip_upper : Return copy of input with values | ||
above given value(s) truncated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add Series.clip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is in generic, isn't it being reused by Series.clip?
pandas/core/generic.py
Outdated
of removing outliers from data. Columns of a DataFrame can be | ||
winsorized by using clip. | ||
|
||
>>> import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need the numpy import
pandas/core/generic.py
Outdated
>>> import numpy as np | ||
>>> x=np.random.normal(size=(1000,3)) | ||
>>> df=pd.DataFrame(x, columns=['a','b','c']) | ||
>>> #Winsorize columns at 5% and 95% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can add this as part of the text above
pandas/core/generic.py
Outdated
>>> x=np.random.normal(size=(1000,3)) | ||
>>> df=pd.DataFrame(x, columns=['a','b','c']) | ||
>>> #Winsorize columns at 5% and 95% | ||
>>> U=df.quantile(0.95) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spaces around equals
@cmdelatorre |
Possible documentation for args and kwargs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this contribution. Made some comments.
pandas/core/generic.py
Outdated
Align object with lower and upper along the given axis. | ||
inplace : boolean, default False | ||
Whether to perform the operation in place on the data | ||
.. versionadded:: 0.21.0 | ||
.. versionadded:: 0.21.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the description of axis a bit complex.
pandas/core/generic.py
Outdated
.. versionadded:: 0.21.0 | ||
.. versionadded:: 0.21.0. | ||
*args : arguments passed to pandas.compat.numpy | ||
**kwargs : keyword arguments passed to pandas.compat.numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not in the documentation yet, but after some discussion some minutes ago, we'll have *args, **kwargs
in a single line (no colon), and a single description in the next, as it's rarely different between them.
pandas/core/generic.py
Outdated
clipped : Series | ||
clipped : `Series` or `DataFrame`. | ||
Elements above or below the upper and lower thresholds converted to | ||
threshold values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just have the type in the first row of Returns, providing a name doesn't add much value.
The description sounds a bit like if we could be returning only part of the original values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing.
pandas/core/generic.py
Outdated
----- | ||
Clipping data is a method for dealing with out-of-range elements. | ||
If some elements are too large or too small, clipping is one way to | ||
transform the data into a reasonable range. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds more like part of the extended summary to me, than Notes, which is usually left for details on the implementaiton (e.g. calling this function makes a copy of the data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I just remove it then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's an interesting comment. May be you can also mention about the outlier thing you show in the example. But I'd have it in the extended summary. After moving it, make sure the whole summary makes sense and doesn't sound repetitive. Usually happens when you move blocks.
pandas/core/generic.py
Outdated
See Also | ||
-------- | ||
pandas.DataFrame.clip_upper : Return copy of input with values | ||
above given value(s) truncated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is in generic, isn't it being reused by Series.clip?
pandas/core/generic.py
Outdated
a ``Series`` to the lower/upper argument. | ||
|
||
>>> some_data = {'A': [-19, 12, -5], 'B': [1, 100, -5]} | ||
>>> df = pd.DataFrame(data=some_data, index=['foo', 'bar', 'bizz']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would create df
just once at the beginning (with 2 cols), and reuse it all the time.
I've usually seen foo
, bar
and foobar
, not important, but I'd use that unless there is another convention that uses bizz
pandas/core/generic.py
Outdated
bizz -5 -5 | ||
|
||
Use the axis argument to clip by column or rows. Clip column A with | ||
lower threshold of -10 and column B has lower threshold of 10. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two spaces after the dot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, bad habit.
pandas/core/generic.py
Outdated
>>> x = np.random.normal(size=(1000,3)) | ||
>>> U = df.quantile(0.95) | ||
>>> L = df.quantile(0.5) | ||
>>> winsorized_df = df.clip(lower=L, upper=U, axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having this here. But couple of thoughts (feel free to disagree in both, they are open questions):
- Should we add
quantile
to See Also, if we consider this a common case? - I'd probably like more to use the previous
df
even if this makes more sense with normally distributed data
Regardless of these two points, this example has some typos:
- Missing space after the comma in the
size
tuple x
is defined but later you're usingdf
- I don't see a reason to use capital letters for
L
andU
, I don't see them as constants even if they are not modified again in the example - I would use descriptive names, not a single letter
- Makes more sense to define lower first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll work on these tomorrow and update the PR then. Thanks for the feedback.
Hello @Dpananos! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on March 13, 2018 at 15:29 Hours UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great, I few more comments.
pandas/core/generic.py
Outdated
*args : Additional keywords have no effect but might be accepted | ||
for compatibility with numpy. | ||
**kwargs : Additional keywords have no effect but might be accepted | ||
for compatibility with numpy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was some change to the convention during the sprint, and args and kwargs should be in the same line *args, **kwargs
and a common description in the next line. We didn't find any case where args and kwargs have different description, and it's repetitive the way it was originally defined.
Also, note that *args
are positional arguments, I don't think it's correct to name they keywords.
pandas/core/generic.py
Outdated
clipped : Series | ||
`Series` or `DataFrame`. | ||
DataFrame is returned with those values above/below the | ||
`upper`/`'lower` thresholds set to the threshold values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If return can be Series
, the description is not accurate, saying the DataFrame
is returned. May be something like "Original input with values above/below..."?
pandas/core/generic.py
Outdated
|
||
See Also | ||
-------- | ||
pandas.Series.clip : Trim values at input threshold(s). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the prefix pandas
is not needed, just Series.clip
. Also, not sure if that was already discussed, but don't we want clip_lower
and clip_upper
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you previously mentioned that those were generic. Did we want them included still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I think I wasn't clear enough. You should check it, but as this docstring is in generic.py
I assume it'll be used by both Series
and DataFrame
. That's why I said it was (the docstring) was generic.
So, as your assigned docstring was DataFrame.clip
(and it's what it's in the title of the PR) but you're actually working also in Series.clip
, I don't think it makes sense to only add Series.clip
as @jreback suggested. I'd say that we probably want also DataFrame.clip
. Even if it'll be a bit weird to have in the See Also
the same method which is being documented, but I think it's all right. In a separate PR we could use a template and just have the right one.
So, summarizing, the see also should contain Series.clip
, DataFrame.clip
, and the clip_lower
and clip_upper
for each of them (I assume both classes have it). And assuming that clip
is commonly used with quantile
as in your example (I don't know), I would also add it.
@jreback do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes I see, I was mistaken. Sure, I can add all those.
pandas/core/generic.py
Outdated
|
||
Examples | ||
-------- | ||
>>> some_data = {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9001]} | ||
>>> df = pd.DataFrame(some_data, index = ['foo', 'bar', 'foobar']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spaces around =
are not PEP-8 compliant. Sorry the validation script doesn't check it yet.
pandas/core/generic.py
Outdated
a b c | ||
foo 1 4 7 | ||
bar 2 5 8 | ||
foobar 3 6 9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this example could be more concise, and help people understand the concept faster.
I think 2 columns is enough, have 3 doesn't add much value, but takes some extra time to check what's going on with the example.
Also, the lower trimming by 1 not having effect, makes me a bit confused, on whether I'm understanding it or not. Also it's somethimg really minor, but I found the 9001 unnecessarily big and arbitrary (using 100
wouldn't make me thing there is any special on it..
Feel free to use the example you want, but using something like -1, -2, -100, 1, 2, 100 and trimming at -10 and 10 would make it more clear and straight to the point.
pandas/core/generic.py
Outdated
a ``Series`` to the lower/upper argument. Use the axis argument to clip | ||
by column or rows. | ||
|
||
>>> col_thresh = pd.Series({'a':4, 'b':5, 'c':6}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spaces after colons for PEP-8
pandas/core/generic.py
Outdated
|
||
>>> lwr_thresh = df.quantile(0.05) | ||
>>> upr_thresh = df.quantile(0.95) | ||
>>> dfw = df.clip(lower=lwr_thresh, upper=upr_thresh, axis='columns') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as you're using df now, I think you should show the output, instead of saving it to a variable.
Also, it's more a personal taste, feel free to ignore the commend, but these abbreviations feel a bit cryptic. lower_threshold
would be better, but I think just lower
it's probably even better.
And just an idea, in case you like it (it's how I'd write it and find more readable, but feel free to leave it as it):
>>> lower, upper = df.quantile(0.05), df.quantile(0.95)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just a small PEP-8 issue, and I think it's perfect.
pandas/core/generic.py
Outdated
|
||
Examples | ||
-------- | ||
>>> some_data = {'a': [-1, -2, -100], 'b': [1, 2, 100]} | ||
>>> df = pd.DataFrame(some_data, index = ['foo', 'bar', 'foobar']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP-8 issue with spaces around =
.
Also, as you need to change this, I'd directly have the dictionary in the constructor, and avoid creating some_data
. Just for the sake of consistency, as all examples I've seen use this way. With one key of the dictionary in one line, and the index in a third one, I think line width should be a problem.
pandas/core/generic.py
Outdated
the 5th and 95th percentiles. | ||
|
||
>>> lower, upper = df.quantile(0.05), df.quantile(0.95) | ||
>>> df.clip(lower=lower, upper=upper, axis='columns') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it's just because I'm too tired, but is axis
useful here? If the threshold is a scalar, doesn't seem to have an effect, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Axis is required here since df.quantile
returns a series with the quantiles for each column. Running without the axis argument results in an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd find useful having this comment in the explanations before the test. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm.
Really nice documentation, I find it very useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edited
DataFrame.clip : Trim values at input threshold(s). | ||
Series.clip : Trim values at input threshold(s). | ||
Series.clip_lower : Return copy of the input with values below given | ||
value(s) truncated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would just mention Series.clip here (not upper/lower)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you find clip_upper
and clip_lower
related? I suggested them, but actually I was wondering if I was missing something, or if it was being considered deprecating them. It seems to me that clip(upper=X)
could be used for clip_upper
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to deprecate lower/upper but that’s not the reason
when showing a DataFrame method only like to show 1 Series method and and any related DataFrame method
just a preference to not have too many things in See Also
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me.
But if I'm not wrong, the title is misleading but this docstring is for the Series.clip method too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh ok in that case this is ok
pandas/core/generic.py
Outdated
3 0.230930 0.000000 | ||
4 1.100000 0.570967 | ||
a b | ||
foo -1 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alignment doesn’t look right here
pandas/core/generic.py
Outdated
foobar 10 100 | ||
|
||
`Winsorizing <https://en.wikipedia.org/wiki/Winsorizing>`__ is a | ||
related method, whereby the data are clipped at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put this reference in the Notes section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want me to move the entire part about winsorization or just the link?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the link, you can see the format in the 13. References
section of this document: http://numpydoc.readthedocs.io/en/latest/format.html
Codecov Report
@@ Coverage Diff @@
## master #20212 +/- ##
=========================================
Coverage ? 91.73%
=========================================
Files ? 150
Lines ? 49168
Branches ? 0
=========================================
Hits ? 45102
Misses ? 4066
Partials ? 0
Continue to review full report at Codecov.
|
pandas/core/generic.py
Outdated
@@ -5629,6 +5629,12 @@ def clip(self, lower=None, upper=None, axis=None, inplace=False, | |||
Original input with those values above/below the | |||
`upper`/`lower` thresholds set to the threshold values. | |||
|
|||
Notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be in References
and not in Notes
. It wasn't in the documentation for the sprint, for simplicity, but you can check this document: http://numpydoc.readthedocs.io/en/latest/format.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I agree. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@Dpananos I merged the other PR that also edited the same docstring (#20256). This will cause conflicts for this one, so you will need to resolve them (sorry for that trouble, but wanted to acknowledge both PRs, and yours adds some extra than the other). If you need any help with updating the PR (merge upstream/master), let me know! |
@jorisvandenbossche I'm not really certain what happened. So you merged another docstring and now I have a conflict? How should I go about updating the PR? |
So to update the PR with the latest changes on master, you can do:
This will give conflicts in the Solving these conflicts will mean merging somehow the two docstring (you can just use your judgement on which of the two versions you find best for each element of the docstring, certainly your examples were more elaborate I think).
and push again to the branch on origin. I would say: try it out, and if you have troubles with, I can also do it. |
Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):
scripts/validate_docstrings.py <your-function-or-method>
git diff upstream/master -u -- "*.py" | flake8 --diff
python doc/make.py --single <your-function-or-method>
Please include the output of the validation script below between the "```" ticks:
Changes:
*args
and**kwargs
description taken from DOC: update the docstring of pandas.Series.clip #20256 in the name of consistency