Is your feature request related to a problem?
It's common to want the n largest or n smallest rows within each group of a groupby, ranked by whichever column(s) determine the sort order. However, the DataFrameGroupBy API has no such method. Selecting the largest records of a single Series first and then reducing the DataFrame to match is cumbersome, and essentially impossible when the index is not unique per record.
Another consideration is speed: for Series, the docs note:
Notes
Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.
Hopefully this can be leveraged for DataFrames as well.
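To illustrate the workaround described above, here is a rough sketch using the existing `SeriesGroupBy.nlargest` and then mapping the result back onto the DataFrame. The column names and data are made up for the example. Note that the final `.loc` lookup is only correct because the row labels are unique.

```python
import pandas as pd

# Illustrative data; "team" and "score" are hypothetical column names.
df = pd.DataFrame(
    {"team": ["a", "a", "a", "b", "b"], "score": [3, 9, 5, 7, 1]},
    index=[10, 11, 12, 13, 14],
)

# Step 1: top-2 scores per team, computed on the Series only.
top = df.groupby("team")["score"].nlargest(2)

# Step 2: recover the original row labels (last index level) and
# look the full rows up again. This relies on unique row labels.
row_labels = top.index.get_level_values(-1)
result = df.loc[row_labels]
```

With a non-unique index, the lookup in step 2 can no longer identify the selected rows, which is the limitation the request is about.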
Describe the solution you'd like
Proposed signature and docstring for `nlargest`; `nsmallest` would be adapted correspondingly:

    def nlargest(self, sort_by, n=5, keep="first") -> "DataFrame":
        """
        Return the first `n` rows when sorting by `sort_by` in descending order.

        Parameters
        ----------
        sort_by : list
            Column(s) of the DataFrame describing the precedence according
            to which the DataFrame shall be sorted. Equivalent to
            `DataFrame.sort_values(sort_by)`.
        n : int, default 5
            Return this many descending sorted values.
        keep : {'first', 'last', 'all'}, default 'first'
            When the sorting results in a tie extending beyond the `n`-th
            record:

            - ``first`` : return the first `n` occurrences in order
              of appearance.
            - ``last`` : return the last `n` occurrences in reverse
              order of appearance.
            - ``all`` : keep all occurrences. This can result in a DataFrame
              of size larger than `n`.

        Returns
        -------
        DataFrame
            The first `n` records of the DataFrame, sorted in decreasing
            order according to `sort_by`.

        See Also
        --------
        DataFrame.nsmallest : Get the `n` smallest elements.
        DataFrame.sort_values : Sort DataFrame by values.
        DataFrame.head : Return the first `n` rows.

        Notes
        -----
        Faster than ``.sort_values(ascending=False).head(n)`` for small `n`
        relative to the size of the ``DataFrame`` object.

        [...]
        """
API breaking implications
No breaking changes; the function doesn't exist yet.
Initially, this could directly wrap `.sort_values(sort_by, ascending=False).head(n)`, but hopefully speedups similar to the Series case can be gained.
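A minimal sketch of what such an initial wrapper could compute, emulated from outside pandas. `groupby_nlargest` is a hypothetical helper name, not an existing pandas API, and the `keep` argument from the proposed docstring is not handled here.

```python
import pandas as pd

def groupby_nlargest(df, by, sort_by, n=5):
    """Hypothetical helper: the top-n rows per group, ranked by sort_by."""
    # Sort the whole frame once in descending order, then take the first
    # n rows of every group. GroupBy.head keeps the sorted row order.
    ordered = df.sort_values(sort_by, ascending=False, kind="mergesort")
    return ordered.groupby(by, sort=False).head(n)

# Illustrative data with made-up column names.
df = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b"],
    "score": [3, 9, 5, 7, 1],
    "player": ["p1", "p2", "p3", "p4", "p5"],
})
top2 = groupby_nlargest(df, by="team", sort_by=["score"], n=2)
```

Unlike the Series-based workaround, this does not depend on the index being unique, but it pays for a full sort, which is the cost the request hopes to avoid.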
Alternatives
One could add an `ascending` argument so that the sort direction can be toggled per column (same as for `DataFrame.sort_values`). In this case, it might be better to call the methods `nfirst` / `nlast`.
Alternatively, one could implement a minimal version where `nlargest` / `nsmallest` is determined by a single column only (and not several). In this case, a kwarg `dropna=True` should be added too (like for Series, cf. #28984).
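As a rough illustration of the per-column `ascending` variant, the sketch below ranks by one column descending and breaks ties by another column ascending. `groupby_nfirst` is a hypothetical function borrowing the name suggested above; it is not a pandas method.

```python
import pandas as pd

def groupby_nfirst(df, by, sort_by, ascending, n=5):
    """Hypothetical helper: first n rows per group under a mixed sort order."""
    # `ascending` takes one flag per entry of `sort_by`,
    # exactly as DataFrame.sort_values does.
    ordered = df.sort_values(sort_by, ascending=ascending)
    return ordered.groupby(by, sort=False).head(n)

# Illustrative data with made-up column names.
df = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b"],
    "score": [9, 9, 5, 7, 7],
    "age": [30, 25, 20, 40, 35],
    "player": ["p1", "p2", "p3", "p4", "p5"],
})
# Highest score first; among equal scores, the youngest player first.
best = groupby_nfirst(
    df, by="team", sort_by=["score", "age"], ascending=[False, True], n=1
)
```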
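The single-column variant could look roughly like the sketch below. The helper name and the exact meaning of `dropna` are assumptions made for illustration: here `dropna=True` removes rows whose ranking column is missing before selecting, while `dropna=False` lets them fill up a group that has fewer than `n` non-missing values.

```python
import numpy as np
import pandas as pd

def groupby_nlargest_single(df, by, column, n=5, dropna=True):
    """Hypothetical helper: top-n rows per group, ranked by one column."""
    # Assumed semantics: optionally discard rows with a missing ranking value.
    rows = df.dropna(subset=[column]) if dropna else df
    # Missing values sort last by default, so they are only picked
    # when a group runs out of non-missing values.
    ordered = rows.sort_values(column, ascending=False, kind="mergesort")
    return ordered.groupby(by, sort=False).head(n)

# Illustrative data with made-up column names.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "score": [3.0, np.nan, 7.0, 1.0],
    "player": ["p1", "p2", "p3", "p4"],
})
without_nan = groupby_nlargest_single(df, "team", "score", n=2)
with_nan = groupby_nlargest_single(df, "team", "score", n=2, dropna=False)
```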