
Groupby('colname').nth(0) results in index with duplicate entries if 'colname' has a np.NaN value #26011


Closed
joseortiz3 opened this issue Apr 6, 2019 · 4 comments · Fixed by #26152
Labels: Groupby, Missing-data

joseortiz3 (Contributor) commented Apr 6, 2019

Code Sample

>>> import pandas as pd
>>> import numpy as np
>>> test = pd.DataFrame(data = [[np.NaN, 1, np.NaN], [2, 3, 4]], index = [0, 1], columns = ['one','two','three'])
>>> test
   one  two  three
0  NaN    1    NaN
1  2.0    3    4.0
>>> test.groupby('one').nth(0)
     two  three
one            
2.0    1    NaN
2.0    3    4.0

Problem description

A df.groupby('colname').nth(0) operation should return a new DataFrame whose index contains no duplicates; each unique value of colname becomes a single entry in the resulting index.
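
To illustrate the expected contract, nth(0) on a frame whose key column has no missing values yields exactly one row per unique key (illustrative frame, not from the original report):

>>> clean = pd.DataFrame({'one': [2.0, 2.0, 5.0], 'two': [1, 3, 6]})
>>> clean.groupby('one').nth(0)
     two
one     
2.0    1
5.0    6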

However, if the column colname contains an np.NaN value, that value neither becomes its own entry in the index nor is simply dropped; instead it produces a duplicate index entry, as seen above. This is highly unexpected behavior (probably a bug).
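
As a sanity check (not part of the original report; output is what I would expect on pandas 0.24.x), the grouping itself already excludes the NaN-keyed row, which suggests the duplicate comes from how nth selects rows rather than from the grouping:

>>> test.groupby('one').groups
{2.0: Int64Index([1], dtype='int64')}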

Note that only nth is affected; first and last never produce duplicate indices. However, first and last have performance problems on categorical data: #25397.
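
For instance, last on the same frame also collapses to a single index entry (output expected to match the first example below):

>>> test.groupby('one').last()
     two  three
one            
2.0    3    4.0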

Expected Output

# first actually "just works"
>>> test.groupby('one').first()
     two  three
one            
2.0    3    4.0

# so probably this is the correct behavior
>>> test.groupby('one').nth(0)
     two  three
one            
2.0    3    4.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: 7.3.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

aditya1702 commented

If this is indeed a potential bug, I would like to work on it.

aditya1702 commented

@gfyoung Is this a bug or something done intentionally?

gfyoung added the Groupby and Missing-data labels Apr 8, 2019
gfyoung (Member) commented Apr 8, 2019

cc @jreback

joseortiz3 (Contributor, Author) commented Apr 14, 2019

This is definitely a bug. I'm using the below code as a workaround:

def groupby_first_fast(df, groupby):
    """Pandas groupby.first() is slow due to a conversion,
    while groupby.nth(0) has a bug with missing values.
    This takes care of both problems (issues #25397 and #26011).
    """
    # Drop rows whose group key is missing; these break nth.
    df = df[~df[groupby].isnull()]

    # Use nth(0) instead of first() for speed.
    return df.groupby(groupby).nth(0)

If getting rid of rows with missing/NaN groupby values before calling nth is an acceptable solution, that's a pretty easy PR.
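
For example, on the test frame from the report, the workaround produces the expected single-entry index (usage sketch for the function above):

>>> groupby_first_fast(test, 'one')
     two  three
one            
2.0    3    4.0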
