
Groupby('colname').nth(0) results in index with duplicate entries if 'colname' has a np.NaN value #26011


Closed
joseortiz3 opened this issue Apr 6, 2019 · 4 comments · Fixed by #26152
Labels: Groupby, Missing-data

joseortiz3 (Contributor) commented Apr 6, 2019

Code Sample

>>> import pandas as pd
>>> import numpy as np
>>> test = pd.DataFrame(data = [[np.NaN, 1, np.NaN], [2, 3, 4]], index = [0, 1], columns = ['one','two','three'])
>>> test
   one  two  three
0  NaN    1    NaN
1  2.0    3    4.0
>>> test.groupby('one').nth(0)
     two  three
one            
2.0    1    NaN
2.0    3    4.0

Problem description

A df.groupby('colname').nth(0) operation should return a new DataFrame whose index contains no duplicates; each unique value of colname becomes a single entry in the resulting index.
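
To illustrate the expected contract, nth(0) on a frame whose key column has no missing values yields exactly one row per unique key (illustrative frame, not from the original report):

>>> clean = pd.DataFrame({'one': [2.0, 2.0, 5.0], 'two': [1, 3, 6]})
>>> clean.groupby('one').nth(0)
     two
one     
2.0    1
5.0    6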

However, if the column colname contains an np.NaN value, that value neither becomes its own entry in the index nor is simply dropped; instead it produces a duplicate index entry, as seen above. This is highly unexpected behavior (probably a bug).
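
As a sanity check (not part of the original report; output is what I would expect on pandas 0.24.x), the grouping itself already excludes the NaN-keyed row, which suggests the duplicate comes from how nth selects rows rather than from the grouping:

>>> test.groupby('one').groups
{2.0: Int64Index([1], dtype='int64')}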

Note that only nth is affected; first and last never produce duplicate indices. However, first and last have performance problems on categorical data: #25397.
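
For instance, last on the same frame also collapses to a single index entry (output expected to match the first example below):

>>> test.groupby('one').last()
     two  three
one            
2.0    3    4.0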

Expected Output

# first actually "just works"
>>> test.groupby('one').first()
     two  three
one            
2.0    3    4.0

# so probably this is the correct behavior
>>> test.groupby('one').nth(0)
     two  three
one            
2.0    3    4.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: 7.3.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

aditya1702 commented

If this is indeed a potential bug, I would like to work on it.

aditya1702 commented

@gfyoung Is this a bug or something done intentionally?

gfyoung added the Groupby and Missing-data labels Apr 8, 2019
gfyoung (Member) commented Apr 8, 2019

cc @jreback

joseortiz3 (Contributor, Author) commented Apr 14, 2019

This is definitely a bug. I'm using the below code as a workaround:

def groupby_first_fast(df, groupby):
    """Pandas groupby.first() is slow due to a conversion,
    while groupby.nth(0) has a bug with missing values.
    This takes care of both problems (issues #25397 and #26011).
    """
    # Drop rows whose group key is missing; these break nth.
    df = df[~df[groupby].isnull()]

    # Use nth(0) instead of first() for speed.
    return df.groupby(groupby).nth(0)

If getting rid of rows with missing/NaN groupby values before calling nth is an acceptable solution, that's a pretty easy PR.
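
For example, on the test frame from the report, the workaround produces the expected single-entry index (usage sketch for the function above):

>>> groupby_first_fast(test, 'one')
     two  three
one            
2.0    3    4.0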
