
DataFrame.nlargest result error #16314


Closed
flystarhe opened this issue May 10, 2017 · 12 comments
Labels
Needs Info (Clarification about behavior needed to assess issue), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), Usage Question

Comments


flystarhe commented May 10, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 10, 8, 11, 8],
    'b': list('abdce'),
    'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
print('_________')
print(df.nlargest(10,['a','b']))

Problem description

When DataFrame.nlargest encounters rows that tie in rank, the result is wrong. As shown below, the second and fourth rows appear repeatedly.

Expected Output

    a  b    c
3  11  c  3.0
1  10  b  2.0
2   8  d  NaN
4   8  e  4.0
2   8  d  NaN
4   8  e  4.0
0   1  a  1.0
[Finished in 0.6s]

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

jreback commented May 10, 2017

what exactly is the problem? show the full pd.show_versions() as well.


jreback commented May 10, 2017

This is exactly what should be reported.

In [9]: df.nlargest(10, columns=['a', 'b'])
TypeError: Column 'b' has dtype object, cannot use method 'nlargest' with this dtype

jreback added the Reshaping and Usage Question labels on May 10, 2017

oldclesleycode commented May 12, 2017

Yeah, so it appears that nlargest doesn't handle "object" dtypes, since it wasn't built to sort strings.

>>> df.dtypes
a      int64
b     object
c    float64
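
One way around the error on current pandas is to rank only on the numeric columns; a minimal sketch based on the example above (the object column 'b' is simply left out of the ranking):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 10, 8, 11, 8],
                   'b': list('abdce'),
                   'c': [1.0, 2.0, np.nan, 3.0, 4.0]})

# Ranking on the numeric column 'a' alone avoids the object-dtype TypeError.
print(df.nlargest(3, 'a'))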

flystarhe (Author) commented

@jreback @lesley2958
no, the group-by key list contains duplicates:

tmp = df.nlargest(10,['a','b']).index.unique()
print(df.loc[tmp])

output:

    a  b    c
3  11  c  3.0
1  10  b  2.0
2   8  d  NaN
4   8  e  4.0
0   1  a  1.0
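
On 0.19.2, where the duplicated rows appear, an equivalent way to keep only the first occurrence of each index label without the .loc round trip is a mask over Index.duplicated(); a sketch under that assumption:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 10, 8, 11, 8],
                   'b': list('abdce'),
                   'c': [1.0, 2.0, np.nan, 3.0, 4.0]})

# 0.19.2 allows object columns in nlargest but may repeat rows;
# drop the repeats by masking on duplicated index labels.
result = df.nlargest(10, ['a', 'b'])
print(result[~result.index.duplicated(keep='first')])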

jorisvandenbossche added the Needs Info label on May 12, 2017

jorisvandenbossche commented May 12, 2017

What might be a reason for confusion here (depending on the version @flystarhe is using) is that there is a difference between 0.19.2 and 0.20:

In [6]: df.nlargest(10,['a','b'])
Out[6]: 
    a  b    c
3  11  c  3.0
1  10  b  2.0
2   8  d  NaN
4   8  e  4.0
2   8  d  NaN
4   8  e  4.0
0   1  a  1.0

In [7]: pd.__version__
Out[7]: '0.19.2'
In [57]: df.nlargest(10,['a','b'])
...
TypeError: Column 'b' has dtype object, cannot use method 'nlargest' with this dtype

In [58]: pd.__version__
Out[58]: '0.21.0.dev+19.g69a5d6f.dirty'

That said, the output of 0.19.2 also seems wrong (even if object columns were allowed), but that seems to have been fixed: #15297

jorisvandenbossche (Member) commented

Given that the following methods that rely on order work for object dtype:

In [62]: pd.Series(['a', 'b', 'd', 'c']).sort_values()
Out[62]: 
0    a
1    b
3    c
2    d
dtype: object

In [63]: pd.Series(['a', 'b', 'd', 'c']).max()
Out[63]: 'd'

you could also say nlargest should work for object dtype.
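
If a top-n on a string column is really what is wanted, a sort_values-based equivalent (which does accept object dtype) is a workable substitute, at the cost of a full sort; a small sketch:

import pandas as pd

s = pd.Series(['a', 'b', 'd', 'c'])

# nlargest raises for object dtype, but a descending sort plus head(n)
# yields the same top-n elements (full sort instead of partial selection).
print(s.sort_values(ascending=False).head(2))
# 2    d
# 3    c
# dtype: object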

flystarhe (Author) commented

@jorisvandenbossche But it doesn't solve the problem, because nlargest has a different efficiency and computational approach.
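
For context on the efficiency point: pandas documents nlargest as equivalent to a full descending sort followed by head(n), only faster when n is small relative to the data. A rough sketch of the two forms (synthetic data assumed):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 10**6, size=10**6)})

# nlargest only needs to select the top n rows...
top_partial = df.nlargest(5, 'a')

# ...while sort_values orders the entire column before head() slices it.
top_sorted = df.sort_values('a', ascending=False).head(5)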


jreback commented May 12, 2017

This already raised for Series in 0.19.2, but not for DataFrame. The behavior was unified to disallow object columns generally and take on the Series behavior (and of course to fix the actual duplication issue).
xref #15299

In [3]: Series(list('abc')).nlargest(1)
TypeError: Cannot use method 'nlargest' with dtype object


jreback commented May 12, 2017

@flystarhe your question is not clear

These might be what you want

In [5]: df.groupby(['a', 'b']).nth([0, 1, 2])
Out[5]: 
        c
a  b     
1  a  1.0
8  d  NaN
   e  4.0
10 b  2.0
11 c  3.0

In [6]: df.sort_values(['a', 'b']).groupby(['a', 'b']).head(10)
Out[6]: 
    a  b    c
0   1  a  1.0
2   8  d  NaN
4   8  e  4.0
1  10  b  2.0
3  11  c  3.0
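
For the tie in the original example (two rows with a == 8), later pandas versions (0.24 and up) added keep='all' to nlargest, which returns every row tied at the cutoff instead of silently picking one; a sketch assuming a recent pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 10, 8, 11, 8],
                   'b': list('abdce'),
                   'c': [1.0, 2.0, np.nan, 3.0, 4.0]})

# keep='all' keeps every row tied at the boundary value, so asking for the
# top 3 by 'a' returns 4 rows here (both rows with a == 8).
print(df.nlargest(3, 'a', keep='all'))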

jorisvandenbossche (Member) commented

But it doesn't solve the problem, because nlargest has a different efficiency and computational approach.

It was not meant to solve your problem (which you should try to explain better). I was just giving a possible reason to allow nlargest on object columns. But since we also raise for series, I don't think we are going to change this.

mroeschke (Member) commented

Closing as the current behavior is intended and correct.

nasirudeenraheem commented

Better still, since it works with string data, use df[''].value_counts().nlargest().
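
For reference, value_counts() returns integer counts indexed by the string values, so chaining nlargest onto it does work for string columns; a small sketch with made-up data:

import pandas as pd

s = pd.Series(['x', 'y', 'x', 'z', 'x', 'y'])

# The counts are int64 and indexed by the strings, so nlargest ranks the
# counts rather than the strings themselves.
print(s.value_counts().nlargest(2))
# x    3
# y    2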
