BUG: GroupBy.count() and GroupBy.sum() incorrectly return NaN instead of 0 for missing categories #35201
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Behavioural Changes
Fixes two related bugs: when grouping on multiple categoricals, `.sum()` and `.count()` returned `NaN` for the missing categories, but they are expected to return `0`. Both bugs are fixed.

Tests
Tests were added in PR #35022 when these bugs were discovered, and they were marked with an `xfail`. In this PR the `xfail`s are removed and the tests pass normally. In addition, a few other existing tests were expecting `sum()` to return `NaN`; these have been updated so that they now expect `0` (which is the desired behaviour).

Pivot
The change to `.sum()` also impacts `df.pivot_table()` if it is called with `aggfunc=sum` and is pivoted on a Categorical column with `observed=False`. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, not `NaN`). One test in test_pivot.py was updated to reflect this change.

Default Behaviour
Because `df.groupby()` and `df.pivot_table()` have `observed=False` as the default, the default behaviour will change for a user calling `df.groupby().sum()` or `df.pivot_table(..., aggfunc='sum')` if they are grouping/pivoting on a categorical with missing categories. Previously the default would give them `NaN` for the missing categories; now the default will give them `0`.

What is the appropriate way to highlight/document this change to the default behaviour?
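
For reference, a minimal sketch of the behaviour described above, which could serve as a starting point for a release-note example. The frame and its column names (`cat_1`, `cat_2`, `value`) are made up for illustration and are not taken from the test suite. `observed=False` is passed explicitly so the snippet does not depend on the default.

```python
import pandas as pd

# Hypothetical data: category "c" of cat_1 never occurs, and the
# combination ("b", "y") is also unobserved.
df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "cat_2": pd.Categorical(["x", "y", "x"], categories=["x", "y"]),
        "value": [1, 2, 3],
    }
)

# Grouping on multiple categoricals with observed=False yields one row per
# combination of categories (3 x 2 = 6), including the unobserved ones.
grouped = df.groupby(["cat_1", "cat_2"], observed=False)["value"]
sums = grouped.sum()
counts = grouped.count()

# With this change, unobserved combinations are expected to be 0, not NaN.
print(sums)
print(counts)

# The same applies when pivoting on categoricals with aggfunc="sum".
table = pd.pivot_table(
    df,
    values="value",
    index="cat_1",
    columns="cat_2",
    aggfunc="sum",
    observed=False,
)
print(table)
```

On a pandas version without this fix, the entries for the unobserved categories in `sums`, `counts`, and `table` would be `NaN` instead.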