ENH: add pandas.core.groupby.DataFrameGroupBy.nlargest / nsmallest #33601

Closed
h-vetinari opened this issue Apr 17, 2020 · 2 comments
Labels
API - Consistency, Enhancement, Groupby

Comments

@h-vetinari
Contributor

Is your feature request related to a problem?

It's common to want the n largest/smallest rows within each group of a groupby, ranked by whichever column(s) determine the sort order. However, the DataFrameGroupBy API has no such method, and it's very cumbersome to first select the largest values of a Series and then reduce the DataFrame to the matching rows (which is essentially impossible when the index is not unique per record).
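To make the workaround concrete, here is a minimal sketch of the two-step approach using the existing SeriesGroupBy.nlargest. The frame and its column names (`team`, `score`) are made up for illustration only.

```python
import pandas as pd

# Toy data; the column names are assumptions for this example.
df = pd.DataFrame(
    {"team": ["a", "a", "a", "b", "b"], "score": [3, 9, 5, 7, 1]}
)

# Step 1: largest values per group. The result is a Series whose
# MultiIndex combines the group key with the original row label.
top_scores = df.groupby("team")["score"].nlargest(2)

# Step 2: recover the full rows through the original row labels.
row_labels = top_scores.index.get_level_values(-1)
top_rows = df.loc[row_labels]
print(top_rows)
```

The second step relies on row labels identifying rows uniquely. With duplicate labels, `df.loc` returns every row that shares a label, so the selection is no longer reliable.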

Another consideration is speed: for Series, the docs note:

Notes
Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.

Hopefully this can be leveraged for DataFrames as well.

Describe the solution you'd like

The docstrings for nlargest and nsmallest would be adapted accordingly:

def nlargest(self, sort_by, n=5, keep="first") -> "DataFrame":
    """
    Return the first `n` rows when sorting by `sort_by` in descending order.

    Parameters
    ----------
    sort_by : list
        Column(s) of the DataFrame describing the precedence according to which
        the DataFrame shall be sorted. Equivalent to `DataFrame.sort_values(sort_by)`
    n : int, default 5
        Return this many descending sorted values.
    keep : {'first', 'last', 'all'}, default 'first'
        When the sorting results in a tie extending beyond the `n`-th record:
        - ``first`` : return the first `n` occurrences in order
            of appearance.
        - ``last`` : return the last `n` occurrences in reverse
            order of appearance.
        - ``all`` : keep all occurrences. This can result in a DataFrame
            with more than `n` rows.

    Returns
    -------
    DataFrame
        The first `n` records of the DataFrame, sorted in decreasing order
        according to `sort_by`

    See Also
    --------
    DataFrame.nsmallest: Get the `n` smallest elements.
    DataFrame.sort_values: Sort DataFrame by values.
    DataFrame.head: Return the first `n` rows.

    Notes
    -----
    Faster than ``.sort_values(ascending=False).head(n)`` for small `n`
    relative to the size of the ``DataFrame`` object.

    [...]
    """

API breaking implications

No breaking changes, function doesn't exist yet.

Initially, this could directly wrap .sort_values(sort_by, ascending=False).head(n), but hopefully similar speed ups can be gained as in the Series case.
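As a rough sketch of that initial wrapper: the free function `groupby_nlargest` and its `by` argument are hypothetical stand-ins for the proposed method, and the toy data is invented. It only uses the existing `DataFrame.sort_values` and `GroupBy.head`, and it does not implement the `keep` parameter.

```python
import pandas as pd


def groupby_nlargest(df, by, sort_by, n=5):
    """Hypothetical helper: the top `n` rows per group, ordered by `sort_by`."""
    # Sort the whole frame once, then take the first n rows of each group.
    ordered = df.sort_values(sort_by, ascending=False)
    return ordered.groupby(by, sort=False).head(n)


df = pd.DataFrame(
    {"team": ["a", "a", "a", "b", "b"], "score": [3, 9, 5, 7, 1]}
)
top = groupby_nlargest(df, by="team", sort_by="score", n=2)
print(top)
```

Because `GroupBy.head` selects rows by position within each group, this sketch does not depend on the index being unique.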

Alternatives

One could allow an argument ascending if the columns should be allowed to toggle this per column (same as for DataFrame.sort_values). In this case, it might be better to call the methods nfirst / nlast.
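A sketch of what such an `nfirst` could look like, written as a hypothetical free function rather than a method. The name, the signature, and the toy columns are assumptions; `ascending` is passed straight through to `DataFrame.sort_values`.

```python
import pandas as pd


def groupby_nfirst(df, by, sort_by, n=5, ascending=True):
    """Hypothetical helper: first `n` rows per group under a mixed sort order."""
    # `ascending` may be a bool or one bool per column in `sort_by`.
    ordered = df.sort_values(sort_by, ascending=ascending)
    return ordered.groupby(by, sort=False).head(n)


df = pd.DataFrame(
    {
        "team": ["a", "a", "a", "b"],
        "score": [9, 9, 5, 7],
        "age": [30, 25, 40, 20],
    }
)
# Highest score first; ties on score are broken by the lowest age.
first = groupby_nfirst(
    df, by="team", sort_by=["score", "age"], n=1, ascending=[False, True]
)
print(first)
```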

Alternatively, one could just implement a minimal version where nlargest/nsmallest is only determined based on one column (and not several). In this case, a kwarg dropna=True should be added too (like for Series, cf. #28984).
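The minimal single-column variant might be sketched as follows. `groupby_nlargest_single` is a hypothetical name, and the `dropna` handling shown here (filtering missing values before selecting) is only one possible interpretation of the proposed kwarg. It builds on the existing `DataFrame.nlargest`.

```python
import pandas as pd


def groupby_nlargest_single(df, by, column, n=5, dropna=True):
    """Hypothetical helper: top `n` rows per group, ranked by one column."""
    if dropna:
        # Assumed semantics: rows with a missing sort value are never selected.
        df = df[df[column].notna()]
    # Apply the existing DataFrame.nlargest to each group and stitch the parts.
    parts = [group.nlargest(n, column) for _, group in df.groupby(by)]
    return pd.concat(parts)


df = pd.DataFrame(
    {
        "team": ["a", "a", "a", "b", "b"],
        "score": [3.0, float("nan"), 5.0, 7.0, 1.0],
    }
)
result = groupby_nlargest_single(df, by="team", column="score", n=2)
print(result)
```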

h-vetinari added the Enhancement and Needs Triage labels on Apr 17, 2020
dsaxton added the API - Consistency and Groupby labels on Apr 17, 2020
mroeschke removed the Needs Triage label on Apr 17, 2020

@simonjayhawkins
Member

xref #23993 for prior discussion on this topic.

@mroeschke
Member

I think given the similar discussion in #23993, there was not much support for this functionality due to variability of output shape.
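One possible reading of "variability of output shape" (an interpretation, not something spelled out in this thread) is that the number of rows returned per group is not fixed. A small illustration with made-up data:

```python
import pandas as pd

# Toy data: group "b" has fewer rows than the requested n.
df = pd.DataFrame({"team": ["a", "a", "a", "b"], "score": [3, 9, 5, 7]})

top = df.sort_values("score", ascending=False).groupby("team").head(2)

# Rows returned per group: not every group yields n rows.
rows_per_group = top.groupby("team").size()
print(rows_per_group)
```

Under the proposed keep="all", a group could also return more than `n` rows, so the result is neither one row per group nor the same shape as the input.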

Going to close based on the same conclusion, but happy to reopen if the community and core devs are in support of adding this.

4 participants