TST: add test to ensure that df.groupby() returns the missing categories when grouping on 2 pd.Categoricals #35022
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
New Tests Added
Issues #23865 (Inconsistant behaviour of empty groups when grouping with one vs. many) and #27075 (Groupby ignores unobserved combinations when passing more than one categorical column even if observed=True) both reported that when df.groupby was called with by set to more than one pd.Categorical column, missing categories were not returned, even when observed=False. This was fixed in #29690 (BUG: Series groupby does not include nan counts for all categorical labels (#17605)). This pull request adds tests to ensure that the correct behaviour stays enforced.
New Bug Found
Testing revealed one further issue: DataFrameGroupBy.count() returns NaN for missing categories when it should return a count of 0. SeriesGroupBy.count() does return 0, which is the expected behaviour. I have raised an issue for this bug (BUG: df.groupby().count() returns NaN instead of Zero #35028) and marked the test with an xfail. Once the bug is fixed, the xfail will cause the test to fail, at which point the xfail can be removed.
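A minimal sketch of the behaviour under test, assuming a recent pandas version. The column names cat_1, cat_2 and value and the tiny frame are hypothetical and are not taken from the test suite. Only two of the four category combinations are observed, so with observed=False the result should still contain all four, with a count of 0 for the missing ones.

```python
import pandas as pd

# Hypothetical data: only the pairs (a, x) and (b, y) actually occur.
df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(["a", "b"], categories=["a", "b"]),
        "cat_2": pd.Categorical(["x", "y"], categories=["x", "y"]),
        "value": [1, 2],
    }
)

# Group on two categorical columns and keep unobserved combinations.
counts = df.groupby(["cat_1", "cat_2"], observed=False)["value"].count()

# All 2 x 2 combinations should be present; the unobserved pairs
# (a, y) and (b, x) are expected to have a count of 0, not NaN.
print(counts)
```

This uses SeriesGroupBy.count(), which the description above reports as already returning 0. The DataFrameGroupBy.count() case is the one tracked in #35028.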
Existing Test Changed as it had the Wrong Expected Result
A similar issue was reported for .sum() in #31422 (Inconsistent behavior when groupby pandas Categorical variables): missing categories return a sum of NaN when they should return a sum of 0. The existing test for SeriesGroupBy.sum() was wrong, as it gave the expected output as NaN (see below) when it should have been 0.
pandas/pandas/tests/groupby/test_categorical.py, line 1315 in 0159cba
I have changed this so that the expected output is 0 (this is in line with the comment here: #31422 (comment)) and marked the tests for .sum() with xfail. Once the bug is addressed, the xfail will cause the tests to fail, at which point the xfails can be removed.