Skip to content

BUG: GroupBy.count() and GroupBy.sum() incorreclty return NaN instead of 0 for missing categories #35201

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

smithto1
Copy link
Member

@smithto1 smithto1 commented Jul 9, 2020

Behavioural Changes
Fixing two related bugs: when grouping on multiple categoricals, .sum() and .count() would return NaN for the missing categories, but they are expected to return 0 for the missing categories. Both these bugs are fixed.

Tests
Tests were added in PR #35022 when these bugs were discovered and the tests were marked with an xfail. For this PR the xfails are removed and the tests are passing normally. As well, a few other existing tests were expecting sum() to return NaN; these have been updated so that the tests now expect to get 0 (which is the desired behaviour).

Pivot
The change to .sum() also impacts the df.pivot_table() if it is called with aggfunc=sum and is pivoted on a Categorical column with observed=False. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, not NaN). One test on test_pivot.py was updated to reflect this change.

Default Behaviour
Because df.groupby() and df.pivot_table() have observed=False as the default, the default behaviour will change for a user calling df.groupby().sum() or df.pivot_table(..., aggfunc='sum') if they are grouping/pivoting on a categorical with missing categories. Previously the default would give them NaN for the missing categories, now the default will give them 0.

What is the appropriate to highlight/document this change to the default behaviour?

@smithto1 smithto1 marked this pull request as draft July 10, 2020 09:00
@smithto1 smithto1 marked this pull request as ready for review July 10, 2020 09:26
@jreback jreback added Categorical Categorical Data Type Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jul 10, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smithto1 I am not thrilled with threading this fill arg thru everything. any way to isolate this to just sum itself (e.g. pass in sem thru _agg_general)?

@smithto1
Copy link
Member Author

@jreback I've made another attempt using a different approach. Check out #35241

(I'm more thrilled with #35241 right now, so will mark this one as draft, expecting we'll decline it later.)

@smithto1 smithto1 marked this pull request as draft July 11, 2020 22:46
@smithto1 smithto1 closed this Jul 15, 2020
@smithto1 smithto1 deleted the issue31422 branch July 15, 2020 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
2 participants