A `df.groupby('colname').nth(0)` operation should return a new DataFrame whose index contains no duplicates: each unique value of `colname` should become a single entry in the resulting index.
However, if the column labeled `colname` contains an `np.NaN` value, that value neither becomes its own unique entry in the index nor is ignored. Instead it produces duplicate values in the index, which is very unexpected behavior (probably a bug).
Note that only `nth` is affected: `first` and `last` never produce duplicate indices. However, `first` and `last` have performance problems on categorical data (#25397).
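A minimal sketch of how the difference could be checked. The `test` frame here is hypothetical (the issue's original data is not reproduced in this text); it is chosen so that `first()` yields the row shown under Expected Output. Whether `nth(0)` shows duplicates depends on the pandas version, so no output is claimed for it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows have a missing key in column 'one'.
test = pd.DataFrame({
    "one": [np.nan, 2.0, np.nan, 2.0],
    "two": [1, 3, 5, 7],
    "three": [2.0, 4.0, 6.0, 8.0],
})

first_result = test.groupby("one").first()
nth_result = test.groupby("one").nth(0)

# first() drops the NaN key, so each remaining key appears exactly once.
print(first_result.index.has_duplicates)

# The issue reports duplicates here on pandas 0.24.2; inspect rather than assume.
print(nth_result.index.has_duplicates)
```

`Index.has_duplicates` gives a quick yes/no answer without having to read the printed frame.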
Expected Output
    # first actually "just works"
    >>> test.groupby('one').first()
         two  three
    one
    2.0    3    4.0
    # so probably this is the correct behavior
    >>> test.groupby('one').nth(0)
         two  three
    one
    2.0    3    4.0
This is definitely a bug. I'm using the below code as a workaround:
    def groupby_first_fast(df, groupby):
        """Pandas groupby.first() is slow due to a conversion.
        Meanwhile groupby.nth(0) has a bug with missing values.
        This takes care of both problems. Issues #25397 and #26011"""
        # Get rid of missing values that break nth.
        df = df[~(df[groupby].isnull())]
        # Use nth instead of first for speed.
        return df.groupby(groupby).nth(0)
If getting rid of rows with missing/NaN groupby values before calling nth is an acceptable solution, that's a pretty easy PR.
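For comparison, the same "drop missing keys, then take the first row per key" idea can be sketched without `nth`, using `dropna` and `drop_duplicates` instead. This is a different technique from the workaround above, not a proposed fix; the helper name `first_row_per_group` and the example data are hypothetical, and the sketch assumes a single key column.

```python
import numpy as np
import pandas as pd


def first_row_per_group(df, key):
    """Return the first row for each non-missing value of `key`.

    Hypothetical helper; not part of pandas or of the workaround above.
    """
    # Drop rows whose key is missing, mirroring the filtering step above.
    kept = df.dropna(subset=[key])
    # Keep only the first row seen for each key value.
    kept = kept.drop_duplicates(subset=[key], keep="first")
    # Index by key and sort, as groupby does by default.
    return kept.set_index(key).sort_index()


# Hypothetical example data with missing keys.
example = pd.DataFrame({
    "one": [np.nan, 2.0, np.nan, 2.0, 1.0],
    "two": [1, 3, 5, 7, 9],
    "three": [2.0, 4.0, 6.0, 8.0, 10.0],
})
result = first_row_per_group(example, "one")
print(result)
```

Like `nth(0)`, this keeps whole rows, so it can differ from `first()`, which picks the first non-null value in each column separately.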
Output of `pd.show_versions()`
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: 7.3.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None