Weird behavior using nlargest/nsmallest when the n smallest/largest values are identical #15297

RogerThomas · 2017-02-03T11:22:54Z

Code Sample, a copy-pastable example if possible

python -c "import pandas as pd; df = pd.DataFrame(dict(a=[1, 1, 2, 3], b=[1, 2, 3, 4])); print(df.nsmallest(2, 'a'))"

Problem description

When using nlargest/nsmallest and the n largest / smallest values are identical, the method seems to return the dataframe concatenated with the filtered version of itself.
Furthermore, if all values are identical, you get the full dataframe concatenated with itself, regardless of the choice of n.

Expected Output

Not really sure. I guess in the example above you should simply get a dataframe that looks like this:
pd.DataFrame(dict(a=[1, 1], b=[1, 2]))
However, if you were to have
df = pd.DataFrame(dict(a=[1, 1, 1, 1], b=[1, 2, 3, 4]))
and asked for
df.nlargest(2, 'a')
you should again get
pd.DataFrame(dict(a=[1, 1], b=[1, 2]))
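
The expected results described above can be written as a small check. This is only a sketch: it assumes a pandas release that already contains the fix for this issue, it passes `keep="first"` explicitly so that ties resolve to the first occurrences, and the variable names (`smallest`, `df_same`, `largest`) are illustrative.

```python
import pandas as pd

# Frame from the report: the two smallest values in "a" are tied.
df = pd.DataFrame(dict(a=[1, 1, 2, 3], b=[1, 2, 3, 4]))
smallest = df.nsmallest(2, "a", keep="first")

# Frame where every value in "a" is identical.
df_same = pd.DataFrame(dict(a=[1, 1, 1, 1], b=[1, 2, 3, 4]))
largest = df_same.nlargest(2, "a", keep="first")

# Each result should hold exactly n rows, not the frame
# concatenated with (part of) itself.
assert len(smallest) == 2
assert len(largest) == 2

print(smallest)
print(largest)
```

On an affected version such as 0.19.2, the length assertions would be expected to fail, which makes the snippet usable as a quick regression check.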

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 28.3.0
Cython: 0.23.4
numpy: 1.12.0
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: None
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.2.0
numexpr: 2.4.6
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.38.0
pandas_datareader: None


jreback · 2017-02-03T13:47:06Z

So this was fixed for duplicates in the index in 6e514da (for 0.19.2).

Yeah, this does look a bit odd. So it looks like the 'dups' are getting duplicated. Want to have a look and see if you can find where?

In [5]: df = pd.DataFrame(dict(a=[1, 1, 2, 3], b=[1, 2, 3, 4]))

In [7]: df.nsmallest(3, 'a')
Out[7]: 
   a  b
0  1  1
1  1  2
0  1  1
1  1  2
2  2  3
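
One way to spot the symptom in this output is to check the row count and the index of the result. A minimal sketch follows; `has_duplicated_rows` is a hypothetical helper (not part of pandas), it assumes the input frame has a unique index, and the `buggy` frame is rebuilt by hand from the output above rather than produced by pandas.

```python
import pandas as pd


def has_duplicated_rows(result, n):
    """Return True if an nsmallest/nlargest result shows the symptom:
    more than n rows, or index labels that repeat."""
    return len(result) > n or bool(result.index.duplicated().any())


df = pd.DataFrame(dict(a=[1, 1, 2, 3], b=[1, 2, 3, 4]))

# Hand-built copy of the reported output: the tied rows (0, 1)
# followed by the first three rows (0, 1, 2).
buggy = pd.concat([df.iloc[[0, 1]], df.iloc[[0, 1, 2]]])
print(has_duplicated_rows(buggy, 3))

# On a pandas release that includes the fix, this call should not
# trigger the check.
print(has_duplicated_rows(df.nsmallest(3, "a"), 3))
```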

RogerThomas · 2017-02-03T14:03:08Z

Sure, I'll take a look!

jreback · 2017-02-03T14:06:14Z

great!

jreback · 2017-02-21T14:44:03Z

xref this: #14846 (comment)

jreback added the Bug, Difficulty Intermediate, and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Feb 3, 2017

jreback added this to the 0.20.0 milestone Feb 3, 2017

RogerThomas mentioned this issue Feb 3, 2017

Fix nsmallest/nlargest With Identical Values #15299

Closed

4 tasks

jreback mentioned this issue Feb 21, 2017

BUG: DataFrame.nlargest() returns incorrect result when DataFrame has non-unique index #14846

Closed

jreback mentioned this issue Mar 13, 2017

BUG: in _nsorted for frame with duplicated values index #13412

Closed

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

jreback modified the milestones: 0.20.0, Next Major Release Mar 31, 2017

jreback closed this as completed in c112252 Apr 6, 2017

jorisvandenbossche mentioned this issue May 12, 2017

DataFrame.nlargest result error #16314

Closed
