Skip to content

DOC: Improve the docstring of DataFrame.nlargest #20255

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 17, 2018

Conversation

cemsbr
Copy link
Contributor

@cemsbr cemsbr commented Mar 10, 2018

Co-authored-by: Igor C. A. de Lima [email protected]
Signed-off-by: Carlos Eduardo Moreira dos Santos [email protected]
Signed-off-by: Igor C. A. de Lima [email protected]

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single <your-function-or-method>
  • It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.core.frame.DataFrame.nlargest" correct. :)

Co-authored-by: Igor C. A. de Lima <[email protected]>
Signed-off-by: Carlos Eduardo Moreira dos Santos <[email protected]>
Signed-off-by: Igor C. A. de Lima <[email protected]>
keep : {'first', 'last'}, default 'first'
Where there are duplicate values:
- ``first`` : take the first occurrence.
- ``last`` : take the last occurrence.
- `first` : take the first occurrence;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't need punctuation at the end of either bullet point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, I removed them, but the validator complained about the missing full stop. Because of that, I kept this way. If I should remove them and ignore the validation error, let me know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK now I know why I've seen this a few times. @datapythonista @jorisvandenbossche I would suggest that we ignore errors for a lack of punctuation at the end of bulleted lists - any objection to that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can certainly ignore errors in such cases of bullet points

See Also
--------
DataFrame.nsmallest : Return the `n` smallest rows sorted by given
columns.

Examples
--------
>>> df = pd.DataFrame({'a': [1, 10, 8, 11, -1],
... 'b': list('abdce'),
... 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
>>> df.nlargest(3, 'a')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the DataFrame constructor here is not trivial can we add a line to print the DataFrame as is, giving visual contrast to the examples?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. Done.

See Also
--------
DataFrame.nsmallest : Return the `n` smallest rows sorted by given
columns.

Examples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an example to differentiate between "first" and "last" keep parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice suggestion. Done.

"""
Return the `n` largest rows sorted by `columns`.

Sort the DataFrame by `columns` in descending order and return the top
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't actually sort, can you re-word, we just pick the largest n values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I was not aware of the internals. Done.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Mar 11, 2018
columns : list or str
Column name or names to order by
Column name or names to retrieve values from.
keep : {'first', 'last'}, default 'first'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default value is implied for first element in the possible values, so no need to explicitly say default 'first' again

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd I think we changed the guide on this (some time last week) .. :)
For explicitness, I think it is good to keep the " default 'first' ". A new reader cannot know the rule of "default is fist value in the set of options"

keep : {'first', 'last'}, default 'first'
Where there are duplicate values:
- ``first`` : take the first occurrence.
- ``last`` : take the last occurrence.
- `first` : take the first occurrence;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK now I know why I've seen this a few times. @datapythonista @jorisvandenbossche I would suggest that we ignore errors for a lack of punctuation at the end of bulleted lists - any objection to that?

keep : {'first', 'last'}, default 'first'
Where there are duplicate values:
- ``first`` : take the first occurrence.
- ``last`` : take the last occurrence.
- `first` : take the first occurrence;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at your example do you think it makes more sense to word this as "prioritize the first occurrence(s)"? I assume it's more or less doing that as opposed to a full take right?

Put in other words, if n is 2 and you have three duplicate values it wouldn't return all three but rather prioritize the 2 that come first, right?

3 10 c 3.0
2 8 d NaN

>>> df.nlargest(3, 'a', keep='last')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great thanks! I would just add a short blurb to direct the users attention to what's important here and in the example above. So maybe before the previous example say "In the following example, we will use nlargest to select the three rows having the largest values in column 'a'". Then for the next example something like "When using keep='last' ties are resolved in reverse order". Doesn't need to be exactly those words

a b c
3 10 c 3.0
1 10 b 2.0
2 8 d NaN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add one more example selecting multiple columns?

"""
Return the `n` largest rows ordered by `columns`.

Return the `n` largest rows of `columns` in descending order. The
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find "largest rows" a strange wording, as a row itself cannot be "large", only the values in it. Here I would use "n first rows" based on sorting 'columns' in descending order

columns : list or str
Column name or names to order by
Column name or names to retrieve values from.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "to order by" is more correct, as otherwise it seems that you only return the values from those columns

columns : list or str
Column name or names to order by
Column name or names to retrieve values from.
keep : {'first', 'last'}, default 'first'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd I think we changed the guide on this (some time last week) .. :)
For explicitness, I think it is good to keep the " default 'first' ". A new reader cannot know the rule of "default is fist value in the set of options"

keep : {'first', 'last'}, default 'first'
Where there are duplicate values:
- ``first`` : take the first occurrence.
- ``last`` : take the last occurrence.
- `first` : take the first occurrence;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can certainly ignore errors in such cases of bullet points

@pep8speaks
Copy link

pep8speaks commented Mar 12, 2018

Hello @cemsbr! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 17, 2018 at 10:28 Hours UTC

@codecov
Copy link

codecov bot commented Mar 12, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@fb556ed). Click here to learn what that means.
The diff coverage is 96.77%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #20255   +/-   ##
=========================================
  Coverage          ?   91.72%           
=========================================
  Files             ?      150           
  Lines             ?    49165           
  Branches          ?        0           
=========================================
  Hits              ?    45099           
  Misses            ?     4066           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.11% <96.77%> (?)
#single 41.86% <9.67%> (?)
Impacted Files Coverage Δ
pandas/core/frame.py 97.18% <96.77%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb556ed...7dd4f02. Read the comment docs.

@cemsbr
Copy link
Contributor Author

cemsbr commented Mar 13, 2018

As suggested, the changes result in the expected validation error below.

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
        Errors in parameters section
                Parameter "keep" description should finish with "."

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good - couple more minor things


Returns
-------
DataFrame
The `n` largest rows in the DataFrame, ordered by the given columns
in descending order.
The `n` first rows ordered by the given columns in descending
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor stylistic comment but "first n" sounds more natural than "n first"

>>> df.nlargest(3, 'a', keep='last')
a b c
3 10 c 3.0
1 10 b 2.0
2 8 d NaN

To order by the largest values in column "a" and then "c", we can
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great example. I would also add something to Notes touching on this behavior - something like "When columns is iterable the order of the iterable's elements dictates the order in which columns in the DataFrame are prioritized"

1 10 b 2.0
2 8 d NaN

The dtype of column "b" is `object` and attempting to get its largest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to generalize could instead say "Attempting to use nlargest on an non-numeric dtypes will raise a TypeError"

columns : list or str
Column name or names to retrieve values from.
Number of rows to return.
columns : iterable or single value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str or iterable


Return the `n` first rows with the largest values in `columns`, in
descending order. The columns that are not specified are returned as
well, but not used for ordering.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm right, this is equivalent to using df.sort_values(columns, ascending=False).head(n), isn't it? If that's the case, I'd explicitly say it in the extended summary, and I'd add both methods to the See Also section.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it’s an equivalent result, but this is much more performany

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edited for the last minor updates.

@jorisvandenbossche jorisvandenbossche merged commit 48e680e into pandas-dev:master Mar 17, 2018
@jorisvandenbossche
Copy link
Member

@cemsbr Thanks a lot 👍

@jorisvandenbossche jorisvandenbossche added this to the 0.23.0 milestone Mar 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants