Make parameter keep=False keep duplicates for nlargest/nsmallest #16818

tdpetrou · 2017-07-03T17:30:36Z

Code Sample, a copy-pastable example if possible

>>> s = pd.Series([10,9,8,7,7,7,6])
>>> s.nlargest(4)
0    10
1     9
2     8
3     7
dtype: int64

Problem description

The docstrings list False as one of the possible argument values for keep. pandas raises a ValueError when attempting to use this parameter.

Expected Output

It would be nice to have nlargest work like this.

>>> s.nlargest(4, keep=False)
0    10
1     9
2     8
3     7
4     7
5     7

Output of `pd.show_versions()`

# Paste the output here pd.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 3.6.1.final.0 python-bits: 64 OS: Darwin OS-release: 15.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.13.0
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.3.0.post

The text was updated successfully, but these errors were encountered:

gfyoung · 2017-12-06T04:42:11Z

@tdpetrou : Thanks for reporting this! Unfortunately, we have already decided to go and remove this option from the docs. (#18559)

If you have strong feelings about including this as an option, please respond, and we can discuss.

tdpetrou · 2017-12-06T14:59:57Z

@gfyoung This doesn't make sense to me. The keep option is only important whenever there are ties for the last value. Arguably, the most useful thing to do is to keep all of the ties which would be keep=False. This option still exists in the docstrings in 0.21.

Also, the duplicated method has same three options so it matches that. Additionally, it was just recently removed without discussion.

edit: A better value for the parameter would be keep='all'

gfyoung · 2017-12-06T17:09:25Z

@tdpetrou : keep=False in my mind would mean to drop them (the opposite of what you want), but yes, keep='all' would make more sense. The reason for dropping the option in the docs was because the implementation has only supported the first two options (the docs look like copy + paste at some point).

True that there was no discussion to this, but a simple doc-change like #18559 went in without objections from anyone. Everyone (including yourself) could have objected if need be, though to be fair, unless you're tracking pandas changes regularly, this would have been hard to catch.

The good news is that we haven't released this yet, so we do have time to implement a "third" option for keep if need be. @jreback : what are your thoughts?

tdpetrou · 2017-12-06T17:17:27Z

@gfyoung Thanks for the response. I proposed a simple solution to this in #18656. It was very easy and basically already done. I think the original reason for using False was to match duplicated (and possibly others that I can't remember right now)?

Regardless, it just makes a lot of sense (to me) that you would want to keep the nlargest/nsmallest plus ties and the fix is easy and already complete.

gfyoung · 2017-12-06T17:20:25Z

Your case for using False is a little shaky IMO because it doesn't much sense (as you yourself even have acknowledged) in the context of this function. Semantics trump this sort of API consistency.

Your implementation is indeed simple but the surfacing of it will require discussion, should other maintainers believe that we should actually have such an option.

tdpetrou · 2017-12-06T17:23:12Z

I'm not in favor of False and would prefer 'all'. The only thing I am pushing for is to make the functionality for keeping all ties possible.

Closes pandas-devgh-16818. Closes pandas-devgh-18656.

summerela · 2019-02-21T02:04:37Z

The option to keep all ties returned from nlargest would be wonderful. Right now I have to use this function in a groupby to find the top two counts for each group, then go back and find all rows that match the top two scores. It would be faster and more efficient to just have the function return all rows that match the top n largest/smallest values and I'm not sure why this feature was removed. Pretty please can you put it back? :)

WillAyd · 2019-02-21T02:24:50Z

@summerela the feature as described here was implemented in 0.24. If that's not what you are looking for better to open a new issue at this point

summerela · 2019-02-21T02:28:10Z

I'm running Dask v 1.1.1 and the only allowed arguments for keep are "first" and "last".

WillAyd · 2019-02-21T02:31:56Z

OK if you can open up a new issue with code sample including output of pd.show_versions someone can take a look

summerela · 2019-02-21T02:33:02Z

Can do. Thank you!

artinthetrees · 2021-01-28T17:36:54Z

it would be useful also to have a keep option of "none"!

s = pd.Series([10,9,8,7,7,7,6])
s.nlargest(4)
0 10
1 9
2 8
dtype: int64

TomAugspurger added API Design Difficulty Intermediate labels Jul 12, 2017

TomAugspurger added this to the Next Major Release milestone Jul 12, 2017

TomAugspurger added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jul 12, 2017

tdpetrou mentioned this issue Dec 6, 2017

added option keep=False to nlargests/nsmallest #18656

Closed

4 tasks

gfyoung closed this as completed Dec 6, 2017

gfyoung added Duplicate Report Duplicate issue or pull request and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API Design Difficulty Intermediate labels Dec 6, 2017

gfyoung modified the milestones: Next Major Release, No action Dec 6, 2017

gfyoung added a commit to forking-repos/pandas that referenced this issue Jun 27, 2018

ENH: Allow keep='all' for nlargest/nsmallest

8d3b883

Closes pandas-devgh-16818. Closes pandas-devgh-18656.

gfyoung mentioned this issue Jun 27, 2018

ENH: Allow keep='all' for nlargest/nsmallest #21650

Merged

gfyoung added a commit to forking-repos/pandas that referenced this issue Jun 27, 2018

ENH: Allow keep='all' for nlargest/nsmallest

4b7f199

Closes pandas-devgh-16818. Closes pandas-devgh-18656.

gfyoung added a commit to forking-repos/pandas that referenced this issue Jun 28, 2018

ENH: Allow keep='all' for nlargest/nsmallest

a07c686

Closes pandas-devgh-16818. Closes pandas-devgh-18656.

gfyoung added a commit to forking-repos/pandas that referenced this issue Jun 28, 2018

ENH: Allow keep='all' for nlargest/nsmallest

4998050

Closes pandas-devgh-16818. Closes pandas-devgh-18656.

gfyoung added a commit to forking-repos/pandas that referenced this issue Jun 28, 2018

ENH: Allow keep='all' for nlargest/nsmallest (pandas-dev#21650)

0801b8c

Closes pandas-devgh-16818. Closes pandas-devgh-18656.

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this issue Oct 1, 2018

ENH: Allow keep='all' for nlargest/nsmallest (pandas-dev#21650)

db18fd6

Closes pandas-devgh-16818. Closes pandas-devgh-18656.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make parameter keep=False keep duplicates for nlargest/nsmallest #16818

Make parameter keep=False keep duplicates for nlargest/nsmallest #16818

tdpetrou commented Jul 3, 2017

gfyoung commented Dec 6, 2017

tdpetrou commented Dec 6, 2017 •

edited

Loading

gfyoung commented Dec 6, 2017

tdpetrou commented Dec 6, 2017

gfyoung commented Dec 6, 2017

tdpetrou commented Dec 6, 2017

summerela commented Feb 21, 2019

WillAyd commented Feb 21, 2019

summerela commented Feb 21, 2019

WillAyd commented Feb 21, 2019

summerela commented Feb 21, 2019

artinthetrees commented Jan 28, 2021

Make parameter keep=False keep duplicates for nlargest/nsmallest #16818

Make parameter keep=False keep duplicates for nlargest/nsmallest #16818

Comments

tdpetrou commented Jul 3, 2017

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

gfyoung commented Dec 6, 2017

tdpetrou commented Dec 6, 2017 • edited Loading

gfyoung commented Dec 6, 2017

tdpetrou commented Dec 6, 2017

gfyoung commented Dec 6, 2017

tdpetrou commented Dec 6, 2017

summerela commented Feb 21, 2019

WillAyd commented Feb 21, 2019

summerela commented Feb 21, 2019

WillAyd commented Feb 21, 2019

summerela commented Feb 21, 2019

artinthetrees commented Jan 28, 2021

Output of `pd.show_versions()`

tdpetrou commented Dec 6, 2017 •

edited

Loading