BUG: DataFrame.nlargest() returns incorrect result when DataFrame has non-unique index #14846

Dr-Irv · 2016-12-09T21:01:30Z

Code Sample, a copy-pastable example if possible

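import pandas as pd

# Six rows, but the index only contains the labels 1 and 2 (each repeated three times)
df = pd.DataFrame({'variable': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'value': [1000, 2000, 10, 20, 100, 200]},
                  index=[1, 2] * 3)
df.nlargest(3, 'value')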

Problem description

The result should have only 3 rows. When the index has unique values, the output is correct. When the index contains duplicate values, `nlargest` returns extra rows (9 in this example instead of 3).

The example produces the following output:

   value variable
2   2000        a
2   2000        a
1   1000        a
2    200        c
2    200        c
1    100        c
2     20        b
2     20        b
1     10        b
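The nine rows above have the same shape as a label-based lookup on a non-unique index. The sketch below is only an illustration, under the assumption that the labels of the top three values are used to reselect rows by label; it is not taken from the pandas source, and the names `top_labels` and `by_label` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"variable": ["a", "a", "b", "b", "c", "c"],
     "value": [1000, 2000, 10, 20, 100, 200]},
    index=[1, 2] * 3,
)

# Series.nlargest selects the three largest values; their index labels are 2, 1, 2.
top_labels = df["value"].nlargest(3).index

# Assumption for illustration: rows are reselected by label.
# Every label matches three rows of df, so three labels return nine rows.
by_label = df.loc[top_labels]
print(len(by_label))
```

If this is what happens internally, any fix has to select rows by position rather than by label.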

Expected Output

   value variable
2   2000        a
1   1000        a
2    200        c

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.0
nose: None
pip: 8.1.2
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

chris-b1 · 2016-12-09T21:17:22Z

This looks like a duplicate of #13412, which is fixed on master.
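A quick way to confirm whether an installed version contains the fix is to rerun the original example and check the row count. This is a minimal sketch using the data from the report; it assumes a pandas build that includes the fix and would fail on 0.19.0.

```python
import pandas as pd

df = pd.DataFrame(
    {"variable": ["a", "a", "b", "b", "c", "c"],
     "value": [1000, 2000, 10, 20, 100, 200]},
    index=[1, 2] * 3,
)

result = df.nlargest(3, "value")

# On a fixed version, exactly three rows come back, in descending order.
assert len(result) == 3
assert result["value"].tolist() == [2000, 1000, 200]
```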

gordonda · 2017-02-21T13:02:02Z

I have a similar problem with a slight change to the example above:

import pandas as pd

# Same frame as above, but the value 1000 now appears twice
df = pd.DataFrame({'variable': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'value': [1000, 2000, 1000, 20, 100, 200]},
                  index=[1, 2] * 3)
df.nlargest(3, 'value')

The expected output would be:

   value variable
2   2000        a
1   1000        a
1   1000        b

Instead, it is:

   value variable
2   2000        a
1   1000        a
1   1000        b
1   1000        a
1   1000        b

It seems that duplicate values in the column passed to `nlargest` lead to this behaviour.
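Until a fixed version is available, one possible workaround is to avoid `nlargest` and select by position instead. The sketch below is an assumption about what works for this data, not a fix proposed in the thread; `top3` is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame(
    {"variable": ["a", "a", "b", "b", "c", "c"],
     "value": [1000, 2000, 1000, 20, 100, 200]},
    index=[1, 2] * 3,
)

# Sort by the column, then take the first n rows.
# head() slices by position, so duplicate index labels and tied values
# cannot multiply the rows.
top3 = df.sort_values("value", ascending=False, kind="mergesort").head(3)
print(top3)
```

The stable `mergesort` kind is used so that tied values are not reordered arbitrarily.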

jreback · 2017-02-21T14:44:18Z

@gordonda see #15297, which is the same issue.

jreback closed this as completed Dec 9, 2016

This was referenced Feb 21, 2017

Fix nsmallest/nlargest With Identical Values #15299

Closed

Weird behavior using nlargest/nsmallest when there are the n smallest/largest values are identical #15297

Closed
