TST: add test to ensure that df.groupby() returns the missing categories when grouping on 2 pd.Categoricals #35022

Merged
jreback merged 6 commits into pandas-dev:master on Jul 8, 2020

Conversation

smithto1
Member

@smithto1 smithto1 commented Jun 26, 2020

  1. New Tests Added
    Issues #23865 (Inconsistant behaviour of empty groups when grouping with one vs. many) and #27075 (Groupby ignores unobserved combinations when passing more than one categorical column even if observed=True) both reported that when df.groupby was called with by set to more than one pd.Categorical column, the missing categories were not returned, even when observed=False. This was fixed in #29690 (BUG: Series groupby does not include nan counts for all categorical labels (#17605)). This pull request adds tests to make sure the correct behaviour stays enforced.

  2. New Bug Found
    Testing did reveal one further issue: DataFrameGroupBy.count() returns NaN for missing Categories, when it should return a count of 0. SeriesGroupBy.count() does return 0, which is the expected behaviour. I have raised an issue for this bug (BUG: df.groupby().count() returns NaN instead of Zero #35028 ) and marked the test with an xfail. When the bug is fixed, the xfail will cause the tests to fail, and the xfail can be removed.

  3. Existing Test Changed as it had the Wrong Expected Result
    A similar issue was reported for .sum() in #31422 (Inconsistent behavior when groupby pandas Categorical variables): missing categories return a sum of NaN when they should return a sum of 0. The existing test for SeriesGroupBy.sum() contained a mistake: it gave the expected output as NaN (see below) when it should have been 0.

I have changed this so that the expected output is 0 (this is in line with the comment here: #31422 (comment) ) and marked the tests for .sum() with xfail. When the bug is addressed, the xfail will cause the tests to fail, and the xfails can be removed.
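
For anyone who wants to see the behaviour described above, here is a minimal sketch. It is not the test code from this PR: the frame, the column names (cat_1, cat_2, value) and the variable names are made up for illustration. It groups on two pd.Categorical columns with observed=False and checks that every combination of categories appears in the result index, including combinations that never occur in the data. The values printed for the missing combinations (0 or NaN) depend on the pandas version, which is exactly what #35028 and #31422 are about, so no expected values are shown.

```python
import itertools

import pandas as pd

# Hypothetical toy data: category "C" (cat_1) and category "Y" (cat_2)
# are declared but never appear in any row.
df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(["A", "B"], categories=["A", "B", "C"]),
        "cat_2": pd.Categorical(["X", "X"], categories=["X", "Y"]),
        "value": [1, 2],
    }
)

# observed=False asks for every combination of categories, observed or not.
grouped = df.groupby(["cat_1", "cat_2"], observed=False)

series_count = grouped["value"].count()  # SeriesGroupBy.count()
frame_count = grouped.count()            # DataFrameGroupBy.count()
series_sum = grouped["value"].sum()      # SeriesGroupBy.sum()

# All 3 x 2 combinations of the declared categories.
expected_keys = set(itertools.product(["A", "B", "C"], ["X", "Y"]))
returned_keys = set(series_count.index)

print(returned_keys == expected_keys)
print(series_count)
print(frame_count)
print(series_sum)
```

Comparing the printed rows for the unobserved combinations, for example ("C", "Y"), across the three results shows whether a given pandas version still has the NaN-versus-0 discrepancy.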

@smithto1 smithto1 changed the title Issue27075 TST: DataFrame.groupby on multiple pd.Categoricals returns missing categories Jun 26, 2020
@smithto1 smithto1 changed the title TST: DataFrame.groupby on multiple pd.Categoricals returns missing categories TST: add test to ensure that df.groupby() returns the missing categories when grouping on 2 pd.Categoricals Jun 27, 2020
@jreback jreback added Categorical Categorical Data Type Groupby Testing pandas testing functions or related to the test suite labels Jun 30, 2020
@smithto1 smithto1 requested a review from jreback July 3, 2020 15:15
@smithto1
Member Author

smithto1 commented Jul 8, 2020

@jreback Can I get another review/approval on this? I'd like to close it off and continue addressing the related bugs.

Addressed all your previous comments and merged in the latest master with all checks passing.

@jreback jreback added this to the 1.1 milestone Jul 8, 2020
@jreback jreback merged commit 2ed252a into pandas-dev:master Jul 8, 2020
@jreback
Contributor

jreback commented Jul 8, 2020

thanks @smithto1 very nice, very happy to have follows to fix things :->
