TST: add test to ensure that df.groupby() returns the missing categories when grouping on 2 pd.Categoricals #35022

smithto1 · 2020-06-26T21:24:55Z

closes Inconsistant behaviour of empty groups when grouping with one vs. many #23865
closes Groupby ignores unobserved combinations when passing more than one categorical column even if observed=True #27075
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff

New Tests Added
Issues #23865 (Inconsistant behaviour of empty groups when grouping with one vs. many) and #27075 (Groupby ignores unobserved combinations when passing more than one categorical column even if observed=True) both reported that when df.groupby was called with by set to more than one pd.Categorical column, missing categories were not returned, even when observed=False. This was fixed in #29690 (BUG: Series groupby does not include nan counts for all categorical labels (#17605)). This pull request adds tests to enforce the correct behaviour.
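A minimal sketch of the behaviour these tests cover, for readers who want to reproduce it. The column names and data are made up for illustration and are not taken from the PR's test file. With observed=False, the result index should contain every combination of the two columns' categories, including combinations that never occur in the data.

```python
import pandas as pd

# Illustrative data: "c" is a declared category of cat_1 that never appears,
# and several (cat_1, cat_2) combinations are absent from the rows.
df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "cat_2": pd.Categorical(["x", "y", "x"], categories=["x", "y"]),
        "value": [1, 2, 3],
    }
)

# Group on both categorical columns and keep unobserved combinations.
result = df.groupby(["cat_1", "cat_2"], observed=False).sum()

# The expected index is the Cartesian product of the two category sets.
expected_index = pd.MultiIndex.from_product(
    [df["cat_1"].cat.categories, df["cat_2"].cat.categories],
    names=["cat_1", "cat_2"],
)

print(result)
```

The values reported for the unobserved combinations depend on the aggregation and on the pandas version, which is what the rest of this description is about.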
New Bug Found
Testing revealed one further issue: DataFrameGroupBy.count() returns NaN for missing categories when it should return a count of 0. SeriesGroupBy.count() does return 0, which is the expected behaviour. I have raised an issue for this bug (BUG: df.groupby().count() returns NaN instead of Zero #35028) and marked the test with an xfail. When the bug is fixed, the xfail will cause the tests to fail, and the xfail can then be removed.
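A small sketch for comparing the two code paths side by side. The data is illustrative only. As described above, the SeriesGroupBy count is 0 for an unobserved combination; what the DataFrameGroupBy count prints for the same combination depends on whether the pandas version in use includes a fix for #35028.

```python
import pandas as pd

# Illustrative data: the combinations ("a", "y") and ("b", "x") never occur.
df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(["a", "b"], categories=["a", "b"]),
        "cat_2": pd.Categorical(["x", "y"], categories=["x", "y"]),
        "value": [10, 20],
    }
)

grouped = df.groupby(["cat_1", "cat_2"], observed=False)

# SeriesGroupBy path: count of a single selected column.
series_count = grouped["value"].count()

# DataFrameGroupBy path: count of every non-grouping column.
frame_count = grouped.count()

print(series_count)
print(frame_count)
```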
Existing Test Changed as it had the Wrong Expected Result
A similar issue was reported for .sum() in #31422 (Inconsistent behavior when groupby pandas Categorical variables): missing categories return a sum of NaN when they should return a sum of 0. The existing test for SeriesGroupBy.sum() was wrong: it gave the expected output as NaN (see below) when it should have been 0.

pandas/pandas/tests/groupby/test_categorical.py

Line 1315 in 0159cba

("sum", np.NaN),

I have changed this so that the expected output is 0 (this is in line with the comment here: #31422 (comment)) and marked the tests for .sum() with xfail. When the bug is addressed, the xfail will cause the tests to fail, and the xfails can be removed.
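As background for why 0 is a reasonable expected value, here is a short illustration using a plain empty Series rather than a groupby. It is not part of the PR's tests. pandas sums an empty Series to 0 by default, and only returns NaN when min_count asks for at least one valid value.

```python
import pandas as pd

# An empty group behaves like an empty Series: there is nothing to add up.
empty = pd.Series([], dtype="float64")

# Default min_count=0: the sum of no values is 0.
default_sum = empty.sum()

# min_count=1 requires at least one valid value, so the result is NaN.
strict_sum = empty.sum(min_count=1)

print(default_sum, strict_sum)
```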

smithto1 · 2020-07-08T15:02:21Z

@jreback Can I get another review/approval on this? I'd like to close it off and continue addressing the related bugs.

Addressed all your previous comments and merged in the latest master with all checks passing.

jreback · 2020-07-08T15:35:44Z

thanks @smithto1 very nice, very happy to have follow-ups to fix things :->

smithto1 added 2 commits June 26, 2020 21:37

tests for dataframe.groupby with 2 Categoricals

f717a7e

black

248c191

smithto1 changed the title ~~Issue27075~~ TST: DataFrame.groupby on multiple pd.Categoricals returns missing categories Jun 26, 2020

smithto1 added 2 commits June 27, 2020 13:07

add issue number to test comments

8621e75

expected output for .sum() changed from NaN to 0. tests marked with x…

d1dcc61

…fail and reference to GH issues.

smithto1 changed the title ~~TST: DataFrame.groupby on multiple pd.Categoricals returns missing categories~~ TST: add test to ensure that df.groupby() returns the missing categories when grouping on 2 pd.Categoricals Jun 27, 2020

jreback requested changes Jun 30, 2020

jreback added the Categorical, Groupby, and Testing labels Jun 30, 2020

responding to PR comments

bc4a3b8

smithto1 requested a review from jreback July 3, 2020 15:15

Merge remote-tracking branch 'upstream/master' into issue27075

7c312d4

jreback added this to the 1.1 milestone Jul 8, 2020

jreback approved these changes Jul 8, 2020

jreback merged commit 2ed252a into pandas-dev:master Jul 8, 2020
