groupby nunique makes inplace replacement of NaN values to -9.223372036854776e18 #32632

nicholasyli · 2020-03-11T16:59:21Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2], 'y': [1, 2, np.nan, np.nan, np.nan, 6]})
print(df)

# Does not modify df: the NaNs are preserved
df.groupby('x').sum()
print(df)

# Modifies df in place: the NaNs are replaced
df.groupby('x').nunique()
print(df)

Problem description

Calling DataFrameGroupBy.nunique on a DataFrame replaces the NaN values of the original DataFrame in place. Simply calling the method, without assigning the result to anything, replaces the existing NaNs with -9223372036854775808.

There are two distinct issues:

1. The comparison of NaN with -9223372036854775808, which is connected to #16674 (API / BUG: How do we differentiate between -9223372036854775808 and iNaT?). Thanks @mroeschke.
2. The inplace replacement, which seems to be the more problematic of the two.
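
To make the second point easy to check on any installed version, here is a minimal sketch of a mutation check. The helper name `mutates_input` is hypothetical (it is not part of pandas); it only relies on `DataFrame.copy` and `DataFrame.equals`, and the result for `nunique` depends on the pandas version in use.

```python
import numpy as np
import pandas as pd


def mutates_input(frame, operation):
    """Return True if calling operation(frame) changes frame in place.

    Hypothetical helper for illustration only, not a pandas API.
    """
    snapshot = frame.copy(deep=True)
    operation(frame)
    # DataFrame.equals treats NaNs in the same location as equal,
    # so an untouched frame with NaNs still compares equal to its snapshot.
    return not frame.equals(snapshot)


df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2], 'y': [1, 2, np.nan, np.nan, np.nan, 6]})

# sum() is reported above to leave df alone.
print(mutates_input(df, lambda f: f.groupby('x').sum()))

# nunique() is the call reported to overwrite the NaNs on pandas 1.0.1.
print(mutates_input(df, lambda f: f.groupby('x').nunique()))
```

A deep copy is taken before the call so the comparison does not share memory with the frame under test.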

Expected Output

   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0
   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0
   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0

My Output

   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0
   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0
   x             y
0  1  1.000000e+00
1  1  2.000000e+00
2  1 -9.223372e+18
3  2 -9.223372e+18
4  2 -9.223372e+18
5  2  6.000000e+00

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 2.6.32-754.27.1.el6.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : None
pytest : None
hypothesis : None
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : 0.3.2
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0


mroeschke · 2020-03-11T18:14:38Z

Thanks for the report. I think this is the same issue as #16674

nicholasyli · 2020-03-11T18:55:36Z

Apologies for not being clearer. I think the issues are related, but not quite the same. There are two issues here:

1. Comparison of a NaN with -9223372036854775808. This part is a duplicate of #16674.
2. A groupby operation making a replacement in place.

The second issue here is not a duplicate.
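
For context on where the number comes from, here is a small sketch, assuming only NumPy and pandas: -9223372036854775808 is the minimum int64 value, which is also the integer stored for NaT in datetime64 data (the iNaT discussed in #16674). Whether this is exactly how nunique arrives at the value is not established in this thread.

```python
import numpy as np
import pandas as pd

INT64_MIN = np.iinfo(np.int64).min

# Integer representation of NaT in a datetime64[ns] array.
nat_as_int = np.array(['NaT'], dtype='datetime64[ns]').view('int64')[0]

print(INT64_MIN)                # -9223372036854775808
print(float(INT64_MIN))         # -9.223372036854776e+18
print(nat_as_int == INT64_MIN)  # True
```

The float form matches the -9.223372e+18 shown in the output of the bug report.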

mroeschke · 2020-03-11T19:07:26Z

Thanks for clarifying. @nicholasyli, could you update the issue title and description to highlight the second issue?

nicholasyli · 2020-03-11T19:13:59Z

Thanks @mroeschke, hopefully that's clearer.

mroeschke · 2020-03-11T19:15:33Z

Perfect, thank you.

dsaxton · 2020-03-11T22:22:31Z

I think this was fixed on master by #32175

mroeschke · 2020-03-11T22:31:58Z

Thanks for the catch, @dsaxton. It does look fixed on master:

In [7]: df
Out[7]:
   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0

In [8]: df.groupby('x').sum()
Out[8]:
     y
x
1  3.0
2  6.0

In [9]: df.groupby('x').nunique()
Out[9]:
   x  y
x
1  1  2
2  1  1

In [10]: df
Out[10]:
   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0

In [11]: pd.__version__
Out[11]: '1.1.0.dev0+739.g1951c8ea6.dirty'

nicholasyli · 2020-03-12T13:37:12Z

Thanks all, sorry for missing the previous issue.

mroeschke closed this as completed Mar 11, 2020

mroeschke added the Duplicate Report label Mar 11, 2020

mroeschke reopened this Mar 11, 2020

nicholasyli changed the title ~~groupby nunique replaces NaN with -9.223372036854776e18~~ groupby nunique makes inplace replacement of NaN values to -9.223372036854776e18 Mar 11, 2020

mroeschke added the Bug and Groupby labels and removed the Duplicate Report label Mar 11, 2020

mroeschke closed this as completed Mar 11, 2020
