DOC: Improve the docstring of DataFrame.nlargest #20255

Merged
merged 4 commits into from
Mar 17, 2018
Changes from 3 commits
77 changes: 65 additions & 12 deletions pandas/core/frame.py
@@ -3835,34 +3835,87 @@ def sortlevel(self, level=0, axis=0, ascending=True, inplace=False,
inplace=inplace, sort_remaining=sort_remaining)

def nlargest(self, n, columns, keep='first'):
"""Get the rows of a DataFrame sorted by the `n` largest
values of `columns`.
"""
Return the `n` first rows ordered by `columns` in descending order.

Return the `n` first rows with the largest values in `columns`, in
descending order. The columns that are not specified are returned as
well, but not used for ordering.
Member

If I'm right, this is equivalent to using df.sort_values(columns, ascending=False).head(n), isn't it? If that's the case, I'd explicitly say it in the extended summary, and I'd add both methods to the See Also section.

Contributor

It’s an equivalent result, but this is much more performant.
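
To make the equivalence raised in this thread concrete, here is a minimal sketch, assuming pandas is installed. The frame is invented for illustration and deliberately has no tied values in column `a`, since with ties the two approaches are not guaranteed to order rows identically.

```python
import pandas as pd

# Hypothetical tie-free data, chosen only to illustrate the comparison.
df = pd.DataFrame({'a': [1, 10, 8, 11, -1],
                   'b': list('abdce')})

# Select the three rows with the largest values in column 'a'.
top_nlargest = df.nlargest(3, 'a')

# The same selection expressed as a full sort followed by head().
top_sorted = df.sort_values('a', ascending=False).head(3)

# Compare values, index labels, and dtypes of the two results.
same_result = top_nlargest.equals(top_sorted)
print(same_result)
```

This only checks that the results match on this one frame; it does not measure the performance difference mentioned in the reply.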


Parameters
----------
n : int
Number of items to retrieve
columns : list or str
Column name or names to order by
Number of rows to return.
columns : iterable or single value
Member

str or iterable

Column label(s) to order by.
keep : {'first', 'last'}, default 'first'
Member

The default value is implied by the first element in the set of possible values, so there is no need to say default 'first' again.

Member

@WillAyd I think we changed the guide on this (some time last week) .. :)
For explicitness, I think it is good to keep the " default 'first' ". A new reader cannot know the rule that "the default is the first value in the set of options".

Where there are duplicate values:
- ``first`` : take the first occurrence.
- ``last`` : take the last occurrence.

- `first` : prioritize the first occurrence(s)
- `last` : prioritize the last occurrence(s)

Returns
-------
DataFrame
The `n` first rows ordered by the given columns in descending
Member

Minor stylistic comment, but "first n" sounds more natural than "n first".

order.

See Also
--------
DataFrame.nsmallest : Return the `n` first rows ordered by `columns` in
ascending order.

Notes
-----
This function cannot be used with all column types. For example, when
specifying columns with `object` or `category` dtypes, ``TypeError`` is
raised.

Examples
Member

Can you add an example to differentiate between "first" and "last" keep parameters?

Contributor Author

Nice suggestion. Done.
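
As a rough illustration of what distinguishes the two `keep` options, the sketch below reuses the values of column `a` from the docstring example and asks for a single row, so that only one of the two tied rows (index labels 1 and 3, both with value 10) can be returned. It assumes pandas is installed; the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 10, 8, 10, -1],
                   'b': list('abdce')})

# The largest value, 10, occurs twice; n=1 forces a choice between the ties.
first_pick = df.nlargest(1, 'a', keep='first')
last_pick = df.nlargest(1, 'a', keep='last')

# Show which index label each option selected.
print(list(first_pick.index))
print(list(last_pick.index))
```

With `keep='first'` the earlier tied row (label 1) is selected, and with `keep='last'` the later one (label 3).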

--------
>>> df = pd.DataFrame({'a': [1, 10, 8, 11, -1],
>>> df = pd.DataFrame({'a': [1, 10, 8, 10, -1],
... 'b': list('abdce'),
... 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
>>> df
a b c
0 1 a 1.0
1 10 b 2.0
2 8 d NaN
3 10 c 3.0
4 -1 e 4.0

In the following example, we will use ``nlargest`` to select the three
rows having the largest values in column "a".

>>> df.nlargest(3, 'a')
Member

Since the DataFrame constructor here is not trivial, can we add a line to print the DataFrame as is, giving visual contrast to the examples?

Contributor Author

Thanks for the review. Done.

a b c
3 11 c 3
1 10 b 2
2 8 d NaN
a b c
1 10 b 2.0
3 10 c 3.0
2 8 d NaN

When using ``keep='last'``, ties are resolved in reverse order:

>>> df.nlargest(3, 'a', keep='last')
Member

Great, thanks! I would just add a short blurb to direct the user's attention to what's important here and in the example above. So maybe before the previous example say "In the following example, we will use nlargest to select the three rows having the largest values in column 'a'". Then for the next example, something like "When using keep='last', ties are resolved in reverse order". It doesn't need to be exactly those words.

a b c
3 10 c 3.0
1 10 b 2.0
2 8 d NaN
Member

Can you add one more example selecting multiple columns?


To order by the largest values in column "a" and then "c", we can
Member

Great example. I would also add something to Notes touching on this behavior - something like "When columns is iterable the order of the iterable's elements dictates the order in which columns in the DataFrame are prioritized"
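
The prioritization described in this comment can be sketched as follows, assuming pandas is installed. The data is invented for illustration (it is not the frame from the docstring) and avoids missing values so that only the column order differs between the two calls.

```python
import pandas as pd

# Hypothetical data: column 'a' has a tie at 10, column 'c' has no ties.
df = pd.DataFrame({'a': [10, 10, 8, 1],
                   'c': [2.0, 3.0, 9.0, 4.0]})

# 'a' is compared first; 'c' only breaks the tie between the two 10s.
by_a_then_c = df.nlargest(3, ['a', 'c'])

# 'c' is compared first, so the rows with the largest 'c' values win.
by_c_then_a = df.nlargest(3, ['c', 'a'])

print(list(by_a_then_c.index))
print(list(by_c_then_a.index))
```

Swapping the order of the labels changes both which rows are selected and how they are ordered, which is the behavior the suggested note would document.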

specify multiple columns like in the next example.

>>> df.nlargest(3, ['a', 'c'])
a b c
3 10 c 3.0
1 10 b 2.0
2 8 d NaN

The dtype of column "b" is `object` and attempting to get its largest
Member

Just to generalize, this could instead say "Attempting to use nlargest on a non-numeric dtype will raise a TypeError".
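
For callers who want to guard against this case, a minimal sketch is shown below, assuming pandas is installed. The frame and the `raised` flag are hypothetical, and the exact error message may vary between pandas versions, so the code only relies on the exception type.

```python
import pandas as pd

# Column 'b' holds strings, which nlargest does not accept for ordering.
df = pd.DataFrame({'a': [1, 10, 8],
                   'b': list('abd')})

try:
    df.nlargest(2, 'b')
    raised = False
except TypeError as exc:
    # Report the problem instead of letting the exception propagate.
    raised = True
    print(f"nlargest rejected column 'b': {exc}")
```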

values raises a ``TypeError`` exception:

>>> df.nlargest(3, 'b')
Traceback (most recent call last):
TypeError: Column 'b' has dtype object, cannot use method 'nlargest' with this dtype
"""
return algorithms.SelectNFrame(self,
n=n,