BUG: df.groupby().count() returns NaN instead of Zero #35028

Closed
3 tasks done
smithto1 opened this issue Jun 27, 2020 · 6 comments · Fixed by #35280
Labels: Bug; Categorical (Categorical Data Type); Groupby; Numeric Operations (Arithmetic, Comparison, and Logical operations)

@smithto1
Member

smithto1 commented Jun 27, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)


# SeriesGroupBy on one pd.Categorical: unobserved categories have a count of 0
srs_grp = df.groupby(['cat_1'], observed=False)['value']
print(srs_grp.count())

# SeriesGroupBy on two pd.Categorical: unobserved categories have a count of 0
srs_grp = df.groupby(['cat_1', 'cat_2'], observed=False)['value']
print(srs_grp.count())

# DataFrameGroupBy on one pd.Categorical: unobserved categories have a count of 0
df_grp = df.groupby(['cat_1'], observed=False)
print(df_grp.count())

# DataFrameGroupBy on two pd.Categorical: unobserved categories have a count of NaN
df_grp = df.groupby(['cat_1', 'cat_2'], observed=False)
print(df_grp.count())

Problem description

When grouping by multiple pd.Categorical columns with observed=False, DataFrameGroupBy.count() returns NaN for unobserved category combinations, and the resulting dtype is float instead of int.

A similar problem is reported for .sum() in #31422.

Expected Output

.count() should return zero for missing categories with a dtype of int.
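
A possible workaround on affected versions, offered only as a sketch and not something proposed in this thread: fill the NaN counts with zero and cast back to an integer dtype. The name `normalized` is purely illustrative. On versions where the bug is fixed, the extra calls should be harmless no-ops.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)

# On affected versions this frame holds NaN (float dtype) for unobserved combinations.
counts = df.groupby(["cat_1", "cat_2"], observed=False).count()

# Workaround sketch: a count is never legitimately missing, so NaN can be read as 0.
normalized = counts.fillna(0).astype("int64")
print(normalized)
```

Because a count cannot be genuinely missing, replacing NaN with 0 here does not hide any real information.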

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 248c191
python : 3.8.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 1.1.0.dev0+1973.g248c19147.dirty
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1.post20200616
Cython : 0.29.20
pytest : 5.4.3
hypothesis : 5.18.0
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : 0.4.0
gcsfs : 0.6.2
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.3.2
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0

@Demetrio92

@smithto1 have you read the description of the observed parameter to the DataFrame.groupby method? This is not a bug, it's a feature.

And it does not matter if you use count, sum or any other aggregation method.

@smithto1
Member Author

smithto1 commented Jun 27, 2020

Hey @Demetrio92, I think there may be a misunderstanding: the issue is not with the observed parameter, which works fine.

The issue I am raising only applies when you set observed=False, expect the groupby to return the missing categories, and group by more than one pd.Categorical (I've added observed=False to all the examples above to make this explicit).
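
To make the role of observed concrete, here is a small sketch using the same frame as the report. It uses the SeriesGroupBy path, which the report describes as working correctly; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)

keys = ["cat_1", "cat_2"]

# observed=True keeps only the combinations that occur in the data: (A, 1) and (B, 1).
observed_only = df.groupby(keys, observed=True)["value"].count()

# observed=False returns every combination of the declared categories (3 x 2).
all_combos = df.groupby(keys, observed=False)["value"].count()

print(len(observed_only))  # 2
print(len(all_combos))     # 6
```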

The specific bug is that .count() returns NaN for the missing categories when it should return 0. If you group by just one pd.Categorical, .count() returns 0 for the missing categories, but when you group by two pd.Categoricals, it returns a count of NaN.

Also, this only applies to the DataFrameGroupBy. The SeriesGroupBy does return 0 when you group by two pd.Categoricals. The inconsistency suggests this is a bug.
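
One way to surface the inconsistency directly is to compare the two code paths row by row. This is a sketch with illustrative names (`series_counts`, `frame_counts`, `differs`); on a version with the fix, the selection at the end should come back empty.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)

keys = ["cat_1", "cat_2"]

# SeriesGroupBy path: reported to give 0 for unobserved combinations.
series_counts = df.groupby(keys, observed=False)["value"].count()

# DataFrameGroupBy path: reported to give NaN for unobserved combinations.
frame_counts = df.groupby(keys, observed=False).count()["value"]

# NaN never compares equal, so affected rows are flagged as differing.
differs = ~series_counts.eq(frame_counts)
print(series_counts[differs])
```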

@dsaxton added the Categorical, Groupby, and Numeric Operations labels and removed the Needs Triage label on Jun 27, 2020
@biddwan09
Contributor

Hi, I would like to start contributing to this project. Can I look into this issue?

@smithto1
Member Author

@biddwan09 have a read here on how to assign an issue to yourself.

https://pandas.pydata.org/docs/development/contributing.html#where-to-start

@biddwan09
Contributor

take

@biddwan09
Contributor

> @biddwan09 have a read here on how to assign an issue to yourself.
>
> https://pandas.pydata.org/docs/development/contributing.html#where-to-start

@smithto1 thanks
