BUG: GroupBy.count() and GroupBy.sum() incorrectly return NaN instead of 0 for missing categories #35201

smithto1 · 2020-07-09T23:23:32Z

closes Inconsistent behavior when groupby pandas Categorical variables #31422
closes BUG: df.groupby().count() returns NaN instead of Zero #35028
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Behavioural Changes
This PR fixes two related bugs: when grouping on multiple categoricals, .sum() and .count() returned NaN for the missing categories, where 0 is the expected result. Both methods now return 0 for the missing categories.
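For reference, a minimal sketch of the scenario, not code from this PR. The column names (cat_1, cat_2, value) are made up for illustration, and the zeros assume a pandas release in which both linked issues are fixed; before the fix, the unobserved combinations came back as NaN.

```python
import pandas as pd

# Hypothetical frame: category "b" is declared for cat_1 but never occurs.
df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(["a", "a"], categories=["a", "b"]),
        "cat_2": pd.Categorical(["x", "y"], categories=["x", "y"]),
        "value": [1, 2],
    }
)

# observed=False keeps every combination of categories in the result,
# including ("b", "x") and ("b", "y"), which have no rows.
grouped = df.groupby(["cat_1", "cat_2"], observed=False)["value"]
sums = grouped.sum()
counts = grouped.count()

print(sums)
print(counts)
```

The rows of interest are ("b", "x") and ("b", "y"): with the fix they hold 0 in both results.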

Tests
Tests were added in PR #35022 when these bugs were discovered, and they were marked with an xfail. In this PR the xfails are removed and the tests pass normally. In addition, a few existing tests expected sum() to return NaN; these have been updated to expect 0 (which is the desired behaviour).

Pivot
The change to .sum() also affects df.pivot_table() when it is called with aggfunc=sum and pivoted on a Categorical column with observed=False. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, not NaN). One test in test_pivot.py was updated to reflect this change.
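A small sketch of the pivot case, again with made-up column names and assuming a pandas release that includes a fix for the .sum() bug. It is an illustration only, not the updated test from test_pivot.py.

```python
import pandas as pd

# Hypothetical frame: category "b" of cat_1 has no rows.
df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(["a", "a"], categories=["a", "b"]),
        "cat_2": pd.Categorical(["x", "y"], categories=["x", "y"]),
        "value": [1, 2],
    }
)

# Pivot on two categoricals; observed=False is passed explicitly so the
# unobserved category "b" is kept as a row in the table.
table = df.pivot_table(
    values="value",
    index="cat_1",
    columns="cat_2",
    aggfunc="sum",
    observed=False,
)

print(table)
```

With the fix, the "b" row is filled with 0 rather than NaN, which matches what groupby().sum() returns for the same data.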

Default Behaviour
Because df.groupby() and df.pivot_table() have observed=False as the default, the default behaviour will change for a user calling df.groupby().sum() or df.pivot_table(..., aggfunc='sum') if they are grouping/pivoting on a categorical with missing categories. Previously the default gave them NaN for the missing categories; now the default gives them 0.
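One way for a caller to avoid depending on the default is to pass observed explicitly. The sketch below uses made-up column names and only compares which groups appear under each setting; it makes no claim about the default in any particular pandas release.

```python
import pandas as pd

# Hypothetical frame: category "b" of cat_1 has no rows.
df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(["a", "a"], categories=["a", "b"]),
        "cat_2": pd.Categorical(["x", "y"], categories=["x", "y"]),
        "value": [1, 2],
    }
)

keys = ["cat_1", "cat_2"]

# observed=False: every combination of categories appears in the result.
with_unobserved = df.groupby(keys, observed=False)["value"].sum()

# observed=True: only combinations that actually occur in the data appear.
only_observed = df.groupby(keys, observed=True)["value"].sum()

print(len(with_unobserved), len(only_observed))  # prints "4 2"
```

Only the observed=False result contains the missing categories, so it is the only one affected by the NaN-to-0 change described above.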

What is the appropriate way to highlight/document this change to the default behaviour?


jreback

@smithto1 I am not thrilled with threading this fill arg thru everything. any way to isolate this to just sum itself (e.g. pass in sem thru _agg_general)?

smithto1 · 2020-07-11T22:45:45Z

@jreback I've made another attempt using a different approach. Check out #35241

(I'm more thrilled with #35241 right now, so will mark this one as draft, expecting we'll decline it later.)

smithto1 added 7 commits July 8, 2020 23:05

pandas-dev#35028 DFGroupBy.count() now returns zero for missing categories when groupby by multiple categories

b65467a

pandas-dev#31422 GroupBy.sum() returns 0 for missing categories when grouping by multiple Categoricals. Updates to tests to reflect this expected output

4de1d6b

Merge remote-tracking branch 'upstream/master' into issue31422

eaff0ef

whatsnew

78ae5c7

addressing Type Validation check errors

1ec3389

addressing second Type Validation error

1efb6e1

test_pivot updates to reflect that .count() and .sum() now return 0 instead of NaN

b534394

smithto1 marked this pull request as draft July 10, 2020 09:00

fixed pivot test to use observed so the output is different depending whether observed True/False

1e2fec8

smithto1 marked this pull request as ready for review July 10, 2020 09:26

jreback added the Categorical (Categorical Data Type), Groupby, and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Jul 10, 2020

jreback requested changes Jul 10, 2020


smithto1 marked this pull request as draft July 11, 2020 22:46

smithto1 closed this Jul 15, 2020

smithto1 deleted the issue31422 branch July 15, 2020 01:33
