BUG: GroupBy.count() and GroupBy.sum() incorrectly return NaN instead of 0 for missing categories #35280

smithto1 · 2020-07-15T01:40:16Z

closes Inconsistent behavior when groupby pandas Categorical variables #31422
closes BUG: df.groupby().count() returns NaN instead of Zero #35028
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Behavioural Changes
This PR fixes two related bugs: when grouping on multiple categoricals, .sum() and .count() returned NaN for the missing categories, where 0 is the expected result.
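The commit messages on this PR describe the approach as creating an inner groupby with observed=True and then reindexing with 0. The snippet below is a minimal user-level sketch of that idea, not the code in pandas/core/groupby; the names `observed_sum`, `full_index`, and `filled` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)

# Aggregate only the category combinations that actually occur in the data.
observed_sum = df.groupby(["cat_1", "cat_2"], observed=True)["value"].sum()

# Build the full Cartesian product of the declared categories
# (hypothetical helper index, assumed to mirror the missing groups).
full_index = pd.MultiIndex.from_product(
    [df["cat_1"].cat.categories, df["cat_2"].cat.categories],
    names=["cat_1", "cat_2"],
)

# Reindex onto the full product, filling the missing combinations with 0.
filled = observed_sum.reindex(full_index, fill_value=0)
print(filled)
```

Filling at the reindex step means the aggregation itself never sees the empty groups, so the fill value can be chosen per reduction (0 makes sense for a sum or a count, but would not for a mean).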

Tests
Tests were added in PR #35022 when these bugs were discovered and the tests were marked with an xfail. For this PR the xfails are removed and the tests are passing normally. As well, a few other existing tests were expecting sum() to return NaN; these have been updated so that the tests now expect to get 0 (which is the desired behaviour).

Pivot
The change to .sum() also affects df.pivot_table() when it is called with aggfunc=sum and pivots on a Categorical column with observed=False. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, not NaN). One test in test_pivot.py was updated to reflect this change.
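A small sketch of the pivot case, assuming the same toy frame used elsewhere in this PR and the string alias "sum" for aggfunc. No output is shown because the values for the unobserved combinations depend on whether the installed pandas includes this change.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)

# Pivot on two categorical columns while keeping unobserved categories in play.
table = df.pivot_table(
    values="value",
    index="cat_1",
    columns="cat_2",
    aggfunc="sum",
    observed=False,
)

# With this PR, unobserved combinations such as ("C", "2") are expected to be 0.
# On versions without the change they may show up as NaN or be dropped.
print(table)
```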

smithto1 · 2020-07-15T01:41:41Z

@jreback Apologies for the PR spam, but because my new work was on a new branch I couldn't push to the old PR.

Please review this one.

smithto1 · 2020-07-15T01:42:57Z

GroupBy.apply(sum)
One issue to note is that GroupBy.agg(sum) and GroupBy.apply(sum) now produce different outputs. .agg(sum) produces the desired behaviour of returning 0 for missing categories, while .apply(sum) still has the old behaviour of returning NaN.

import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)
df_grp = df.groupby(["cat_1", "cat_2"], observed=False)

# Using branch from this PR:
# .sum() returns zeros
print(df_grp.sum())

# agg returns zeros
print(df_grp.agg(sum))

# apply still returns NaN
print(df_grp.apply(sum))

Calling .apply(sum) never actually calls the GroupBy.sum() method; it just applies the passed-in function. .agg(sum) does call the GroupBy.sum() method (see https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L253). This means that .agg(sum) gets the special behaviour of returning zero for missing categories, while .apply(sum) misses out on it and still returns NaN.

This change necessitated an update to groupby/test_categorical.py::test_seriesgroupby_observed_false_or_none.

@jreback, what are your thoughts on this? Is it fine to leave .apply(sum) as it is (returning NaN for missing categories)?

I am not inclined to address it in this PR, but if we do want it fixed, I'll raise an issue for it.

jreback · 2020-07-16T11:55:53Z

can u show the output of #35280 (comment) here (and code as well)

smithto1 · 2020-07-16T15:36:18Z

can u show the output of #35280 (comment) here (and code as well)

@jreback

import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)
df_grp = df.groupby(["cat_1", "cat_2"], observed=False)

# Using branch from this PR:
# .sum() returns zeros
print(df_grp.sum())

#              value
# cat_1 cat_2
# A     1        0.2
#       2        0.0
# B     1        0.2
#       2        0.0
# C     1        0.0
#       2        0.0

# agg returns zeros
print(df_grp.agg(sum))

#              value
# cat_1 cat_2
# A     1        0.2
#       2        0.0
# B     1        0.2
#       2        0.0
# C     1        0.0
#       2        0.0

# apply still returns NaN
print(df_grp.apply(sum))

#              value
# cat_1 cat_2
# A     1        0.2
#       2        NaN
# B     1        0.2
#       2        NaN
# C     1        NaN
#       2        NaN

smithto1 · 2020-07-20T19:00:51Z

@jreback I think all issues are addressed here. Can you do a second review?

jreback

lgtm. doc updates.

jreback · 2020-08-06T23:56:49Z

lgtm. pls merge master and ping on green.

jreback · 2020-08-07T12:32:56Z

cc @TomAugspurger if you can give a glance.

TomAugspurger

Thanks @smithto1. Looks good.

jreback · 2020-08-07T15:21:18Z

thanks @smithto1 very nice

smithto1 added 11 commits July 11, 2020 23:08

set self.observed=True for .sum() and .count() so that reindex can be called with 0

ec577d7

wrote test for .sum() and .count() exception handling

586b1bd

black

ec4bd28

updated pivot tests

eed48cb

whatsnew

88d8a14

comments

940b32e

black

3e39f1a

.sum() and .count() create inner groupby with observed=True, and then reindex with 0

bd4c4fe

modify self.observed with a context manager

fcbb58b

added comments

5f89c17

merging master

7ed0982

Fixed import order that caused Linting error

9da5b18

jreback requested changes Jul 16, 2020

jreback added the Groupby and Missing-data labels Jul 16, 2020

smithto1 added 4 commits July 16, 2020 09:27

addressing comments

534d60f

merging master

b323c9c

fixing whatsnew comment

d2abd4d

ammend comment

361098a

jreback requested changes Jul 16, 2020

ammend comment

9879199

addressing PR comments

39ee6d0

smithto1 requested a review from jreback July 16, 2020 20:08

smithto1 added 3 commits July 17, 2020 00:32

ammend comment

fc1a846

amend comment

f774288

amend comment (just to rerun tests

1986f51

smithto1 added 4 commits July 17, 2020 15:50

ammend comment to start new tests (tests failing due to some flakey tests that fail occassionally)

c4c1a92

merging master

7093c91

amend comment to kick off tests

5f8ba0e

ammend test to kick off tests

49b5fc8

smithto1 added 5 commits July 26, 2020 23:57

merging master

0454e3a

fixed merge

28794e3

merge fix

f616d5a

udpate comment to kick off tests

80d4b7d

amend comment to restart checks

442ad1d

jreback requested changes Aug 4, 2020

jreback requested a review from TomAugspurger August 4, 2020 00:22

jreback added this to the 1.2 milestone Aug 4, 2020

smithto1 added 4 commits August 5, 2020 15:15

Merge remote-tracking branch 'upstream/master' into issue31422-v7

be8b78c

doc updates

ed99b60

restart tests

528ff7e

Merge remote-tracking branch 'upstream/master' into issue31422-v7

b2a331f

Merge branch 'master' into issue31422-v7

cb010fa

TomAugspurger approved these changes Aug 7, 2020

smithto1 added 2 commits August 7, 2020 15:16

merge

d6ebe10

merge

f575687

jreback approved these changes Aug 7, 2020

jreback merged commit a21ce87 into pandas-dev:master Aug 7, 2020
