-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: GroupBy.count() and GroupBy.sum() incorreclty return NaN instead of 0 for missing categories #35280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jreback Apologies for the PR spam, but because my new work was on a new branch I couldn't push to the old PR. Please have a reivew of this one. |
GroupBy.apply(sum)
Calling This change necessitated an update to @jreback , what are your thoughts on this? Is it fine to leave I am not inclined to address it in this PR, but if we do want it fixed, I'll raise an issue for it. |
can u show the output of #35280 (comment) here (and code as welll) |
|
@jreback I think all issues are addressed here. Can you do a second review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. doc updates.
lgtm. pls merge master and ping on green. |
cc @TomAugspurger if you can give a glance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @smithto1. Looks good.
thanks @smithto1 very nice |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Behavioural Changes
Fixing two related bugs: when grouping on multiple categoricals,
.sum()
and.count()
would returnNaN
for the missing categories, but they are expected to return0
for the missing categories. Both these bugs are fixed.Tests
Tests were added in PR #35022 when these bugs were discovered and the tests were marked with an
xfail
. For this PR thexfails
are removed and the tests are passing normally. As well, a few other existing tests were expectingsum()
to returnNaN
; these have been updated so that the tests now expect to get0
(which is the desired behaviour).Pivot
The change to
.sum()
also impacts thedf.pivot_table()
if it is called withaggfunc=sum
and is pivoted on a Categorical column withobserved=False
. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, notNaN
). One test on test_pivot.py was updated to reflect this change.