-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC: Improve the docstring of DataFrame.nlargest #20255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: Igor C. A. de Lima <[email protected]> Signed-off-by: Carlos Eduardo Moreira dos Santos <[email protected]> Signed-off-by: Igor C. A. de Lima <[email protected]>
pandas/core/frame.py
Outdated
keep : {'first', 'last'}, default 'first' | ||
Where there are duplicate values: | ||
- ``first`` : take the first occurrence. | ||
- ``last`` : take the last occurrence. | ||
- `first` : take the first occurrence; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't need punctuation at the end of either bullet point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, I removed them, but the validator complained about the missing full stop. Because of that, I kept this way. If I should remove them and ignore the validation error, let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK now I know why I've seen this a few times. @datapythonista @jorisvandenbossche I would suggest that we ignore errors for a lack of punctuation at the end of bulleted lists - any objection to that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can certainly ignore errors in such cases of bullet points
pandas/core/frame.py
Outdated
See Also | ||
-------- | ||
DataFrame.nsmallest : Return the `n` smallest rows sorted by given | ||
columns. | ||
|
||
Examples | ||
-------- | ||
>>> df = pd.DataFrame({'a': [1, 10, 8, 11, -1], | ||
... 'b': list('abdce'), | ||
... 'c': [1.0, 2.0, np.nan, 3.0, 4.0]}) | ||
>>> df.nlargest(3, 'a') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the DataFrame constructor here is not trivial can we add a line to print the DataFrame as is, giving visual contrast to the examples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. Done.
pandas/core/frame.py
Outdated
See Also | ||
-------- | ||
DataFrame.nsmallest : Return the `n` smallest rows sorted by given | ||
columns. | ||
|
||
Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an example to differentiate between "first" and "last" keep
parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice suggestion. Done.
pandas/core/frame.py
Outdated
""" | ||
Return the `n` largest rows sorted by `columns`. | ||
|
||
Sort the DataFrame by `columns` in descending order and return the top |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't actually sort, can you re-word, we just pick the largest n values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. I was not aware of the internals. Done.
pandas/core/frame.py
Outdated
columns : list or str | ||
Column name or names to order by | ||
Column name or names to retrieve values from. | ||
keep : {'first', 'last'}, default 'first' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default value is implied for first element in the possible values, so no need to explicitly say default 'first'
again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd I think we changed the guide on this (some time last week) .. :)
For explicitness, I think it is good to keep the " default 'first' ". A new reader cannot know the rule of "default is fist value in the set of options"
pandas/core/frame.py
Outdated
keep : {'first', 'last'}, default 'first' | ||
Where there are duplicate values: | ||
- ``first`` : take the first occurrence. | ||
- ``last`` : take the last occurrence. | ||
- `first` : take the first occurrence; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK now I know why I've seen this a few times. @datapythonista @jorisvandenbossche I would suggest that we ignore errors for a lack of punctuation at the end of bulleted lists - any objection to that?
pandas/core/frame.py
Outdated
keep : {'first', 'last'}, default 'first' | ||
Where there are duplicate values: | ||
- ``first`` : take the first occurrence. | ||
- ``last`` : take the last occurrence. | ||
- `first` : take the first occurrence; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking at your example do you think it makes more sense to word this as "prioritize the first occurrence(s)"? I assume it's more or less doing that as opposed to a full take right?
Put in other words, if n
is 2 and you have three duplicate values it wouldn't return all three but rather prioritize the 2 that come first, right?
pandas/core/frame.py
Outdated
3 10 c 3.0 | ||
2 8 d NaN | ||
|
||
>>> df.nlargest(3, 'a', keep='last') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great thanks! I would just add a short blurb to direct the users attention to what's important here and in the example above. So maybe before the previous example say "In the following example, we will use nlargest
to select the three rows having the largest values in column 'a'". Then for the next example something like "When using keep='last'
ties are resolved in reverse order". Doesn't need to be exactly those words
a b c | ||
3 10 c 3.0 | ||
1 10 b 2.0 | ||
2 8 d NaN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add one more example selecting multiple columns?
pandas/core/frame.py
Outdated
""" | ||
Return the `n` largest rows ordered by `columns`. | ||
|
||
Return the `n` largest rows of `columns` in descending order. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find "largest rows" a strange wording, as a row itself cannot be "large", only the values in it. Here I would use "n first rows" based on sorting 'columns' in descending order
pandas/core/frame.py
Outdated
columns : list or str | ||
Column name or names to order by | ||
Column name or names to retrieve values from. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "to order by" is more correct, as otherwise it seems that you only return the values from those columns
pandas/core/frame.py
Outdated
columns : list or str | ||
Column name or names to order by | ||
Column name or names to retrieve values from. | ||
keep : {'first', 'last'}, default 'first' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd I think we changed the guide on this (some time last week) .. :)
For explicitness, I think it is good to keep the " default 'first' ". A new reader cannot know the rule of "default is fist value in the set of options"
pandas/core/frame.py
Outdated
keep : {'first', 'last'}, default 'first' | ||
Where there are duplicate values: | ||
- ``first`` : take the first occurrence. | ||
- ``last`` : take the last occurrence. | ||
- `first` : take the first occurrence; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can certainly ignore errors in such cases of bullet points
Hello @cemsbr! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on March 17, 2018 at 10:28 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #20255 +/- ##
=========================================
Coverage ? 91.72%
=========================================
Files ? 150
Lines ? 49165
Branches ? 0
=========================================
Hits ? 45099
Misses ? 4066
Partials ? 0
Continue to review full report at Codecov.
|
As suggested, the changes result in the expected validation error below.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good - couple more minor things
pandas/core/frame.py
Outdated
|
||
Returns | ||
------- | ||
DataFrame | ||
The `n` largest rows in the DataFrame, ordered by the given columns | ||
in descending order. | ||
The `n` first rows ordered by the given columns in descending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor stylistic comment but "first n
" sounds more natural than "n
first"
pandas/core/frame.py
Outdated
>>> df.nlargest(3, 'a', keep='last') | ||
a b c | ||
3 10 c 3.0 | ||
1 10 b 2.0 | ||
2 8 d NaN | ||
|
||
To order by the largest values in column "a" and then "c", we can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great example. I would also add something to Notes touching on this behavior - something like "When columns
is iterable the order of the iterable's elements dictates the order in which columns in the DataFrame are prioritized"
pandas/core/frame.py
Outdated
1 10 b 2.0 | ||
2 8 d NaN | ||
|
||
The dtype of column "b" is `object` and attempting to get its largest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to generalize could instead say "Attempting to use nlargest
on an non-numeric dtypes will raise a TypeError
"
pandas/core/frame.py
Outdated
columns : list or str | ||
Column name or names to retrieve values from. | ||
Number of rows to return. | ||
columns : iterable or single value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str or iterable
pandas/core/frame.py
Outdated
|
||
Return the `n` first rows with the largest values in `columns`, in | ||
descending order. The columns that are not specified are returned as | ||
well, but not used for ordering. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm right, this is equivalent to using df.sort_values(columns, ascending=False).head(n)
, isn't it? If that's the case, I'd explicitly say it in the extended summary, and I'd add both methods to the See Also
section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it’s an equivalent result, but this is much more performany
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edited for the last minor updates.
@cemsbr Thanks a lot 👍 |
Co-authored-by: Igor C. A. de Lima [email protected]
Signed-off-by: Carlos Eduardo Moreira dos Santos [email protected]
Signed-off-by: Igor C. A. de Lima [email protected]
Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):
scripts/validate_docstrings.py <your-function-or-method>
git diff upstream/master -u -- "*.py" | flake8 --diff
python doc/make.py --single <your-function-or-method>
Please include the output of the validation script below between the "```" ticks: