ENH: add pandas.core.groupby.DataFrameGroupBy.nlargest / nsmallest #33601

h-vetinari · 2020-04-17T00:56:04Z

Is your feature request related to a problem?

It's common to want the nlargest/nsmallest rows within each group of a groupby, ranked by whichever column(s) determine the sort order. However, the DataFrameGroupBy API has no such method. The workaround is cumbersome: first select the largest records of a Series, then somehow reduce the DataFrame based on that result. This is essentially impossible when there's no unique index per record.
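To make the index problem concrete, here is a small sketch. The frame, its column names and its deliberately non-unique index are made up for illustration. It takes the top value per group from the Series and then tries to map the result back to full rows by index label:

```python
import pandas as pd

# Illustrative data; the index is deliberately non-unique.
df = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b"],
        "value": [1, 5, 3, 4, 2],
        "label": ["p", "q", "r", "s", "t"],
    },
    index=[0, 0, 1, 1, 2],
)

# SeriesGroupBy.nlargest returns a Series indexed by (group key, original label).
top = df.groupby("group")["value"].nlargest(1)

# Mapping back by label is ambiguous: labels 0 and 1 each match two rows.
labels = top.index.get_level_values(-1)
selected = df.loc[labels]

print(len(top))       # one entry per group
print(len(selected))  # more rows than were actually selected
```

With a unique index the lookup would return exactly the selected rows. With duplicated labels it silently returns extra ones.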

Another consideration is speed: for Series, the docs note:

Notes
Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.

Hopefully this can be leveraged for DataFrames as well.
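For reference, the equivalence that the Series note relies on can be checked on a toy Series (values chosen arbitrarily, with no ties among the top entries). This only illustrates that both spellings give the same result; it is not a benchmark:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])

# Both spellings return the three largest values in descending order.
via_nlargest = s.nlargest(3)
via_sort = s.sort_values(ascending=False).head(3)

print(via_nlargest.equals(via_sort))
```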

Describe the solution you'd like

Docstring for nlargest, nsmallest adapted correspondingly

def nlargest(self, sort_by, n=5, keep="first") -> "DataFrame":
    """
    Return the first `n` rows when sorting by `sort_by` in descending order.

    Parameters
    ----------
    sort_by : list
        Column(s) of the DataFrame describing the precedence according to which
        the DataFrame shall be sorted. Equivalent to `DataFrame.sort_values(sort_by)`.
    n : int, default 5
        Return this many descending sorted values.
    keep : {'first', 'last', 'all'}, default 'first'
        When the sorting results in a tie extending beyond the `n`-th record:
        - ``first`` : return the first `n` occurrences in order
            of appearance.
        - ``last`` : return the last `n` occurrences in reverse
            order of appearance.
        - ``all`` : keep all occurrences. This can result in a DataFrame of
            size larger than `n`.

    Returns
    -------
    DataFrame
        The first `n` records of the DataFrame, sorted in decreasing order
        according to `sort_by`

    See Also
    --------
    DataFrame.nsmallest: Get the `n` smallest elements.
    DataFrame.sort_values: Sort DataFrame by values.
    DataFrame.head: Return the first `n` rows.

    Notes
    -----
    Faster than ``.sort_values(ascending=False).head(n)`` for small `n`
    relative to the size of the ``DataFrame`` object.

    [...]
    """

API breaking implications

No breaking changes, function doesn't exist yet.

Initially, this could directly wrap .sort_values(sort_by, ascending=False).head(n), but hopefully similar speed ups can be gained as in the Series case.
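A minimal sketch of that initial wrapper, written as a free function because the method itself does not exist. The name `groupby_nlargest` and its `by` parameter are hypothetical, and the `keep` option from the docstring is not modelled here:

```python
import pandas as pd


def groupby_nlargest(df, by, sort_by, n=5):
    """Hypothetical helper: top `n` rows per group, ranked by `sort_by` descending."""
    # Sort the whole frame once. pandas applies `kind` only when sorting on a
    # single column, where mergesort keeps tied rows in their original order.
    ordered = df.sort_values(sort_by, ascending=False, kind="mergesort")
    # GroupBy.head is positional, so it also works with a non-unique index.
    return ordered.groupby(by, sort=False).head(n)


# Illustrative data with a non-unique index.
df = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b"],
        "value": [1, 5, 3, 4, 2],
        "label": ["p", "q", "r", "s", "t"],
    },
    index=[0, 0, 1, 1, 2],
)

result = groupby_nlargest(df, by="group", sort_by=["value"], n=2)
print(result)
```

The rows come back in globally sorted order rather than grouped together; a real implementation would have to decide which ordering to guarantee.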

Alternatives

One could allow an argument ascending if the columns should be allowed to toggle this per column (same as for DataFrame.sort_values). In this case, it might be better to call the methods nfirst / nlast.
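As a rough sketch of that variant, assuming the name `nfirst` were adopted, the sort direction could be forwarded per column to `DataFrame.sort_values`. The free function `groupby_nfirst` and the sample columns are hypothetical:

```python
import pandas as pd


def groupby_nfirst(df, by, sort_by, ascending=True, n=5):
    """Hypothetical helper: first `n` rows per group after a mixed-direction sort."""
    # `ascending` may be a single bool or one bool per column in `sort_by`,
    # exactly as accepted by DataFrame.sort_values.
    ordered = df.sort_values(sort_by, ascending=ascending)
    return ordered.groupby(by, sort=False).head(n)


# Illustrative data: highest score first, youngest first among equal scores.
players = pd.DataFrame(
    {
        "team": ["x", "x", "x", "y", "y"],
        "score": [10, 10, 7, 8, 8],
        "age": [30, 25, 40, 22, 35],
    }
)

first = groupby_nfirst(
    players, by="team", sort_by=["score", "age"], ascending=[False, True], n=1
)
print(first)
```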

Alternatively, one could just implement a minimal version where nlargest/nsmallest is only determined based on one column (and not several). In this case, a kwarg dropna=True should be added too (like for Series, cf. #28984).
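One way such a single-column version could be approximated today is to run the existing `SeriesGroupBy.nlargest` on a copy with a positional index and then select rows by position, which sidesteps the non-unique-index problem. The helper name `groupby_nlargest_single` is hypothetical, and the suggested `dropna` kwarg is not modelled in this sketch:

```python
import pandas as pd


def groupby_nlargest_single(df, by, column, n=5, keep="first"):
    """Hypothetical helper: top `n` rows per group based on a single column."""
    # Replace the index with 0..len-1 so every row has a unique position label.
    positional = df.reset_index(drop=True)
    top = positional.groupby(by)[column].nlargest(n, keep=keep)
    # The last index level of the result holds the row positions.
    positions = top.index.get_level_values(-1).to_numpy()
    # Select from the original frame so its index is preserved.
    return df.iloc[positions]


# Illustrative data with a non-unique index.
df = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b"],
        "value": [1, 5, 3, 4, 2],
        "label": ["p", "q", "r", "s", "t"],
    },
    index=[0, 0, 1, 1, 2],
)

best = groupby_nlargest_single(df, by="group", column="value", n=1)
print(best)
```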

simonjayhawkins · 2020-04-23T15:49:54Z

xref #23993 for prior discussion on this topic.

mroeschke · 2021-07-31T04:24:01Z

I think given the similar discussion in #23993, there was not much support for this functionality due to variability of output shape.

Going to close based on the same conclusion, but happy to reopen if the community and core devs are in support of adding this.

h-vetinari added the Enhancement and Needs Triage labels Apr 17, 2020

dsaxton added the API - Consistency and Groupby labels Apr 17, 2020

mroeschke removed the Needs Triage label Apr 17, 2020

mroeschke closed this as completed Jul 31, 2021

Shadimrad mentioned this issue Aug 11, 2022

ENH: implement nlargest and nsmallest for DataFrameGroupBy like SeriesGroupBy #46924

Open
