groupby nunique makes inplace replacement of NaN values to -9.223372036854776e18 #32632

Closed
nicholasyli opened this issue Mar 11, 2020 · 8 comments

nicholasyli commented Mar 11, 2020

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2], 'y': [1, 2, np.nan, np.nan, np.nan, 6]})
print(df)

# Doesn't replace NaN's
df.groupby('x').sum()
print(df)

# Replaces NaN's
df.groupby('x').nunique()
print(df)

Problem description

Calling DataFrameGroupBy.nunique on a DataFrame replaces its NaN values in place. Simply calling the method, without assigning the result to anything, replaces the existing NaNs in the original DataFrame with -9223372036854775808.

There are two distinct issues:

  1. There is some connection to issue #16674 ("API / BUG: How do we differentiate between -9223372036854775808 and iNaT?") in how NaN is compared with -9223372036854775808 (thanks @mroeschke).
  2. The in-place replacement, which seems to be the more problematic of the two.
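For context on the first point, here is a minimal sketch of where the number in the report comes from. It uses only NumPy and makes no claim about the pandas code path involved; the variable names are illustrative. The value is the smallest signed 64-bit integer, which is also the bit pattern NumPy uses to store NaT, and casting it to float64 gives the value shown in the issue title.

```python
import numpy as np

# Smallest value a signed 64-bit integer can hold.
INT64_MIN = np.iinfo(np.int64).min

# NumPy stores a datetime64 NaT as this same 64-bit pattern.
nat_as_int = np.array(["NaT"], dtype="datetime64[ns]").view("int64")[0]

# Casting the integer to a float gives the value seen in the 'y' column.
as_float = float(INT64_MIN)

print(INT64_MIN)
print(nat_as_int == INT64_MIN)
print(as_float)
```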

Expected Output

   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0
   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0
   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0

My Output

   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0
   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0
   x             y
0  1  1.000000e+00
1  1  2.000000e+00
2  1 -9.223372e+18
3  2 -9.223372e+18
4  2 -9.223372e+18
5  2  6.000000e+00

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 2.6.32-754.27.1.el6.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : None
pytest : None
hypothesis : None
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : 0.3.2
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0

@mroeschke
Member

Thanks for the report. I think this is the same issue as #16674

@mroeschke added the "Duplicate Report" label Mar 11, 2020
@nicholasyli
Author

Apologies for not being clearer. I think the issues are related, but not quite the same. There are two issues here:

  1. Comparison of NaN with -9223372036854775808. This part is a duplicate.
  2. A groupby operation modifying the DataFrame in place.

The second issue here is not a duplicate.
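As a stopgap on an affected version, one option is to group a copy so that any in-place write lands on the copy rather than on the original. This is only a sketch: it assumes the mutation is confined to the data of the frame passed to groupby, which this thread does not confirm, and the names `result` and `nan_count` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 1, 2, 2, 2],
                   "y": [1, 2, np.nan, np.nan, np.nan, 6]})

# Assumption: grouping a deep copy shields the original frame from
# any in-place modification made during the groupby operation.
result = df.copy().groupby("x").nunique()

# The original frame should still hold its three NaN values.
nan_count = int(df["y"].isna().sum())

print(result["y"].tolist())
print(nan_count)
```

The copy costs extra memory, so this only makes sense until an upgrade to a fixed release is possible.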

@mroeschke
Member

Thanks for clarifying. @nicholasyli, could you update the issue title and description to highlight the second issue?

@mroeschke reopened this Mar 11, 2020
@nicholasyli changed the title from "groupby nunique replaces NaN with -9.223372036854776e18" to "groupby nunique makes inplace replacement of NaN values to -9.223372036854776e18" Mar 11, 2020
@mroeschke added the "Bug" and "Groupby" labels and removed the "Duplicate Report" label Mar 11, 2020
@nicholasyli
Author

Thanks @mroeschke, hopefully that's clearer.

@mroeschke
Member

Perfect, thank you.

@dsaxton
Member

dsaxton commented Mar 11, 2020

I think this was fixed on master by #32175

@mroeschke
Member

Thanks for the catch, @dsaxton. It does look fixed on master:

In [7]: df
Out[7]:
   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0

In [8]: df.groupby('x').sum()
Out[8]:
     y
x
1  3.0
2  6.0

In [9]: df.groupby('x').nunique()
Out[9]:
   x  y
x
1  1  2
2  1  1

In [10]: df
Out[10]:
   x    y
0  1  1.0
1  1  2.0
2  1  NaN
3  2  NaN
4  2  NaN
5  2  6.0

In [11]: pd.__version__
Out[11]: '1.1.0.dev0+739.g1951c8ea6.dirty'
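The manual check in the session above could also be scripted. The following is a rough sketch, not the test that accompanied the fix: it snapshots the frame, calls nunique, and compares the frame against the snapshot. The names `snapshot` and `unchanged` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 1, 2, 2, 2],
                   "y": [1, 2, np.nan, np.nan, np.nan, 6]})

# Keep an independent copy to compare against after the call.
snapshot = df.copy(deep=True)

# Result deliberately discarded: only the side effect on df matters here.
df.groupby("x").nunique()

# DataFrame.equals treats NaNs in the same positions as equal.
unchanged = df.equals(snapshot)
print(unchanged)
```

On a version with the fix, `unchanged` should be True; on an affected version such as the 1.0.1 install reported above, it would be expected to come out False.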

@nicholasyli
Author

Thanks all, sorry for missing the previous issue.
