DOC: Improve the docstring of DataFrame.nlargest #20255

cemsbr · 2018-03-10T22:17:08Z

Co-authored-by: Igor C. A. de Lima [email protected]
Signed-off-by: Carlos Eduardo Moreira dos Santos [email protected]
Signed-off-by: Igor C. A. de Lima [email protected]

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

PR title is "DOC: update the docstring"
The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
The html version looks good: python doc/make.py --single <your-function-or-method>
It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.core.frame.DataFrame.nlargest" correct. :)

WillAyd · 2018-03-11T01:07:06Z

pandas/core/frame.py

        keep : {'first', 'last'}, default 'first'
            Where there are duplicate values:
-            - ``first`` : take the first occurrence.
-            - ``last`` : take the last occurrence.
+            - `first` : take the first occurrence;


Shouldn't need punctuation at the end of either bullet point

At first I removed them, but the validator complained about the missing full stop, so I kept it this way. If I should remove them and ignore the validation error, let me know.

Hmm OK now I know why I've seen this a few times. @datapythonista @jorisvandenbossche I would suggest that we ignore errors for a lack of punctuation at the end of bulleted lists - any objection to that?

Yes, we can certainly ignore errors in such cases of bullet points

WillAyd · 2018-03-11T01:08:56Z

pandas/core/frame.py

+        See Also
+        --------
+        DataFrame.nsmallest : Return the `n` smallest rows sorted by given
+            columns.

        Examples
        --------
        >>> df = pd.DataFrame({'a': [1, 10, 8, 11, -1],
        ...                    'b': list('abdce'),
        ...                    'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
        >>> df.nlargest(3, 'a')


Since the DataFrame constructor here is not trivial can we add a line to print the DataFrame as is, giving visual contrast to the examples?

Thanks for the review. Done.
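For reference, a minimal sketch of the kind of example being requested, using the constructor quoted in the hunk above (with column `a` as it stood at that point in the review). The variable name `top` is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 10, 8, 11, -1],
                   'b': list('abdce'),
                   'c': [1.0, 2.0, np.nan, 3.0, 4.0]})

# Show the full frame first so the selection below has something to contrast with.
print(df)

# Rows holding the three largest values of column 'a', largest first.
top = df.nlargest(3, 'a')
print(top)
```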

WillAyd · 2018-03-11T01:09:31Z

pandas/core/frame.py

+        See Also
+        --------
+        DataFrame.nsmallest : Return the `n` smallest rows sorted by given
+            columns.

        Examples


Can you add an example to differentiate between "first" and "last" keep parameters?

Nice suggestion. Done.

jreback · 2018-03-11T13:37:11Z

pandas/core/frame.py

+        """
+        Return the `n` largest rows sorted by `columns`.
+
+        Sort the DataFrame by `columns` in descending order and return the top


We don't actually sort; can you reword? We just pick the largest n values.

Thanks for the review. I was not aware of the internals. Done.
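As a rough analogy only (this uses the standard-library `heapq` module and is not the code pandas runs internally), selecting the top `n` values does not require ordering the whole input:

```python
import heapq

values = [1, 10, 8, 11, -1]

# Pick the 3 largest values without sorting the full list.
top3 = heapq.nlargest(3, values)

# A full sort followed by a slice gives the same answer, but orders
# every element even though only three are needed.
top3_via_sort = sorted(values, reverse=True)[:3]

print(top3)
```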

WillAyd · 2018-03-11T23:56:28Z

pandas/core/frame.py

        columns : list or str
-            Column name or names to order by
+            Column name or names to retrieve values from.
        keep : {'first', 'last'}, default 'first'


Default value is implied for first element in the possible values, so no need to explicitly say default 'first' again

@WillAyd I think we changed the guide on this (some time last week) .. :)
For explicitness, I think it is good to keep the "default 'first'". A new reader cannot know the rule that "default is the first value in the set of options".

WillAyd · 2018-03-11T23:57:39Z

pandas/core/frame.py

        keep : {'first', 'last'}, default 'first'
            Where there are duplicate values:
-            - ``first`` : take the first occurrence.
-            - ``last`` : take the last occurrence.
+            - `first` : take the first occurrence;


Hmm OK now I know why I've seen this a few times. @datapythonista @jorisvandenbossche I would suggest that we ignore errors for a lack of punctuation at the end of bulleted lists - any objection to that?

WillAyd · 2018-03-11T23:59:43Z

pandas/core/frame.py

        keep : {'first', 'last'}, default 'first'
            Where there are duplicate values:
-            - ``first`` : take the first occurrence.
-            - ``last`` : take the last occurrence.
+            - `first` : take the first occurrence;


After looking at your example do you think it makes more sense to word this as "prioritize the first occurrence(s)"? I assume it's more or less doing that as opposed to a full take right?

Put in other words, if n is 2 and you have three duplicate values it wouldn't return all three but rather prioritize the 2 that come first, right?
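The scenario in this question can be checked with a small sketch. The frame below is made up for illustration: three rows tie on the largest value and only two rows are requested.

```python
import pandas as pd

# Three rows share the largest value of 'a'; we only ask for two of them.
df = pd.DataFrame({'a': [10, 10, 10, 1],
                   'b': ['w', 'x', 'y', 'z']})

# keep='first' prioritizes the tied rows that appear first in the frame.
first_two = df.nlargest(2, 'a', keep='first')

# keep='last' prioritizes the tied rows that appear last instead.
last_two = df.nlargest(2, 'a', keep='last')

print(first_two)
print(last_two)
```

In both calls only two rows come back, so "prioritize" describes the behavior more accurately than "take".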

WillAyd · 2018-03-12T00:03:56Z

pandas/core/frame.py

+        3  10  c  3.0
+        2   8  d  NaN
+
+        >>> df.nlargest(3, 'a', keep='last')


Great, thanks! I would just add a short blurb to direct the user's attention to what's important here and in the example above. So maybe before the previous example say "In the following example, we will use nlargest to select the three rows having the largest values in column 'a'". Then for the next example something like "When using keep='last' ties are resolved in reverse order". Doesn't need to be exactly those words.

WillAyd · 2018-03-12T00:04:15Z

pandas/core/frame.py

+            a  b    c
+        3  10  c  3.0
+        1  10  b  2.0
+        2   8  d  NaN


Can you add one more example selecting multiple columns?

jorisvandenbossche · 2018-03-12T07:29:25Z

pandas/core/frame.py

+        """
+        Return the `n` largest rows ordered by `columns`.
+
+        Return the `n` largest rows of `columns` in descending order. The


I find "largest rows" a strange wording, as a row itself cannot be "large", only the values in it. Here I would use "n first rows" based on sorting 'columns' in descending order

jorisvandenbossche · 2018-03-12T07:30:00Z

pandas/core/frame.py

        columns : list or str
-            Column name or names to order by
+            Column name or names to retrieve values from.


I think "to order by" is more correct, as otherwise it seems that you only return the values from those columns

jorisvandenbossche · 2018-03-12T07:31:29Z

pandas/core/frame.py

        columns : list or str
-            Column name or names to order by
+            Column name or names to retrieve values from.
        keep : {'first', 'last'}, default 'first'


@WillAyd I think we changed the guide on this (some time last week) .. :)
For explicitness, I think it is good to keep the "default 'first'". A new reader cannot know the rule that "default is the first value in the set of options".

jorisvandenbossche · 2018-03-12T07:32:07Z

pandas/core/frame.py

        keep : {'first', 'last'}, default 'first'
            Where there are duplicate values:
-            - ``first`` : take the first occurrence.
-            - ``last`` : take the last occurrence.
+            - `first` : take the first occurrence;


Yes, we can certainly ignore errors in such cases of bullet points

pep8speaks · 2018-03-12T23:59:16Z

Hello @cemsbr! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 17, 2018 at 10:28 Hours UTC

codecov · 2018-03-12T23:59:20Z

Codecov Report

❗ No coverage uploaded for pull request base (master@fb556ed).
The diff coverage is 96.77%.

@@            Coverage Diff            @@
##             master   #20255   +/-   ##
=========================================
  Coverage          ?   91.72%           
=========================================
  Files             ?      150           
  Lines             ?    49165           
  Branches          ?        0           
=========================================
  Hits              ?    45099           
  Misses            ?     4066           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`90.11% <96.77%> (?)`
#single	`41.86% <9.67%> (?)`

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.18% <96.77%> (ø)`

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb556ed...7dd4f02. Read the comment docs.

cemsbr · 2018-03-13T00:00:02Z

As suggested, the changes result in the expected validation error below.

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
        Errors in parameters section
                Parameter "keep" description should finish with "."

WillAyd

Looking good - couple more minor things

WillAyd · 2018-03-13T00:12:40Z

pandas/core/frame.py


        Returns
        -------
        DataFrame
-            The `n` largest rows in the DataFrame, ordered by the given columns
-            in descending order.
+            The `n` first rows ordered by the given columns in descending


Minor stylistic comment but "first n" sounds more natural than "n first"

WillAyd · 2018-03-13T00:18:53Z

pandas/core/frame.py

        >>> df.nlargest(3, 'a', keep='last')
            a  b    c
        3  10  c  3.0
        1  10  b  2.0
        2   8  d  NaN
+
+        To order by the largest values in column "a" and then "c", we can


Great example. I would also add something to Notes touching on this behavior - something like "When columns is iterable the order of the iterable's elements dictates the order in which columns in the DataFrame are prioritized"
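A small sketch of the behavior this note would describe, using a made-up frame: the first listed column decides the ranking, and later columns only break ties.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 10, 8, 1],
                   'c': [2.0, 3.0, 9.0, 4.0]})

# 'a' is prioritized; rows 0 and 1 tie on a == 10, so 'c' breaks the tie.
by_a_then_c = df.nlargest(1, ['a', 'c'])

# Reversing the list prioritizes 'c', which picks a different row.
by_c_then_a = df.nlargest(1, ['c', 'a'])

print(by_a_then_c)
print(by_c_then_a)
```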

WillAyd · 2018-03-13T00:20:40Z

pandas/core/frame.py

+        1  10  b  2.0
+        2   8  d  NaN
+
+        The dtype of column "b" is `object` and attempting to get its largest


Just to generalize, this could instead say "Attempting to use nlargest on non-numeric dtypes will raise a TypeError"
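A minimal sketch of the error case under discussion, assuming a column that is explicitly stored with `object` dtype (the frame and the `raised` flag are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 10, 8],
                   'b': pd.Series(['x', 'y', 'z'], dtype=object)})

raised = False
try:
    # Column 'b' holds Python strings with object dtype, so there is
    # no numeric ordering for nlargest to use.
    df.nlargest(2, 'b')
except TypeError as exc:
    raised = True
    print(exc)
```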

WillAyd · 2018-03-13T00:23:42Z

pandas/core/frame.py

-        columns : list or str
-            Column name or names to retrieve values from.
+            Number of rows to return.
+        columns : iterable or single value


str or iterable

datapythonista · 2018-03-13T00:25:08Z

pandas/core/frame.py

+
+        Return the `n` first rows with the largest values in `columns`, in
+        descending order. The columns that are not specified are returned as
+        well, but not used for ordering.


If I'm right, this is equivalent to using df.sort_values(columns, ascending=False).head(n), isn't it? If that's the case, I'd explicitly say it in the extended summary, and I'd add both methods to the See Also section.

It's an equivalent result, but this is much more performant.
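The equivalence mentioned here can be checked with a short sketch. It deliberately uses distinct values in the ordering column; how ties are resolved is governed by `keep` in `nlargest` and is not covered by this comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 10, 8, 11, -1],
                   'b': list('abdce'),
                   'c': [1.0, 2.0, np.nan, 3.0, 4.0]})

# Select the three rows with the largest values in 'a'.
via_nlargest = df.nlargest(3, 'a')

# Same result by sorting the whole frame and slicing the head.
via_sort = df.sort_values('a', ascending=False).head(3)

print(via_nlargest.equals(via_sort))
```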

jorisvandenbossche

Edited for the last minor updates.

jorisvandenbossche · 2018-03-17T10:29:53Z

@cemsbr Thanks a lot 👍

DOC: Improve the docstring of DataFrame.nlargest

7f15d7a

Co-authored-by: Igor C. A. de Lima <[email protected]> Signed-off-by: Carlos Eduardo Moreira dos Santos <[email protected]> Signed-off-by: Igor C. A. de Lima <[email protected]>

jorisvandenbossche added the Docs label Mar 10, 2018

WillAyd requested changes Mar 11, 2018

jreback requested changes Mar 11, 2018

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Mar 11, 2018

DOC: DataFrame.nlargest - apply review suggestions

873efa2

WillAyd requested changes Mar 12, 2018

jorisvandenbossche reviewed Mar 12, 2018

DOC: DataFrame.nlargest - apply review suggestions

de08075

WillAyd requested changes Mar 13, 2018

datapythonista reviewed Mar 13, 2018

updates

7dd4f02

jorisvandenbossche approved these changes Mar 17, 2018

jorisvandenbossche merged commit 48e680e into pandas-dev:master Mar 17, 2018

jorisvandenbossche added this to the 0.23.0 milestone Mar 17, 2018
