Skip to content

BUG: GroupBy.count() and GroupBy.sum() incorreclty return NaN instead of 0 for missing categories #35280

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 37 commits into from
Aug 7, 2020

Conversation

smithto1
Copy link
Member

Behavioural Changes
Fixing two related bugs: when grouping on multiple categoricals, .sum() and .count() would return NaN for the missing categories, but they are expected to return 0 for the missing categories. Both these bugs are fixed.

Tests
Tests were added in PR #35022 when these bugs were discovered and the tests were marked with an xfail. For this PR the xfails are removed and the tests are passing normally. As well, a few other existing tests were expecting sum() to return NaN; these have been updated so that the tests now expect to get 0 (which is the desired behaviour).

Pivot
The change to .sum() also impacts the df.pivot_table() if it is called with aggfunc=sum and is pivoted on a Categorical column with observed=False. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, not NaN). One test on test_pivot.py was updated to reflect this change.

@smithto1
Copy link
Member Author

@jreback Apologies for the PR spam, but because my new work was on a new branch I couldn't push to the old PR.

Please have a reivew of this one.

@smithto1
Copy link
Member Author

GroupBy.apply(sum)
One issue to note is that GroupBy.agg(sum) and GroupBy.apply(sum) now produce different outputs. .agg(sum) produces the desired behaviour of returning 0 for missing categories, while .apply(sum) still has the old behaviour of returning NaN.

import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)
df_grp = df.groupby(["cat_1", "cat_2"], observed=False)

# Using branch from this PR:
# .sum() returns zeros 
print( df_grp.sum() )

# agg returns zeros  
print( df_grp.agg(sum) )

# apply still returns NaN 
print( df_grp.apply(sum) )

Calling .apply(sum) never actually calls the GroupBy.sum() method, it just applies the passed in function. .agg(sum) does call the GroupBy.sum() method (happening here https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L253). Thes means that .agg(sum) does get the special behaviour to return zero for missing categories, while .apply(sum) misses out on the special behaviour and still returns NaN.

This change necessitated an update to groupby\test_categorical.py::test_seriesgroupby_observed_false_or_none.

@jreback , what are your thoughts on this? Is it fine to leave .apply(sum) as it is (returning NaN for missing categories).

I am not inclined to address it in this PR, but if we do want it fixed, I'll raise an issue for it.

@jreback jreback added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jul 16, 2020
@jreback
Copy link
Contributor

jreback commented Jul 16, 2020

can u show the output of #35280 (comment) here (and code as welll)

@smithto1
Copy link
Member Author

smithto1 commented Jul 16, 2020

can u show the output of #35280 (comment) here (and code as welll)

@jreback

import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)
df_grp = df.groupby(["cat_1", "cat_2"], observed=False)

# Using branch from this PR:
# .sum() returns zeros 
print( df_grp.sum() )

#              value
# cat_1 cat_2
# A     1        0.2
#       2        0.0
# B     1        0.2
#       2        0.0
# C     1        0.0
#       2        0.0

# agg returns zeros  
print( df_grp.agg(sum) )

#              value
# cat_1 cat_2
# A     1        0.2
#       2        0.0
# B     1        0.2
#       2        0.0
# C     1        0.0
#       2        0.0

# apply still returns NaN 
print( df_grp.apply(sum) )

#              value
# cat_1 cat_2
# A     1        0.2
#       2        NaN
# B     1        0.2
#       2        NaN
# C     1        NaN
#       2        NaN

@smithto1 smithto1 requested a review from jreback July 16, 2020 20:08
@smithto1
Copy link
Member Author

@jreback I think all issues are addressed here. Can you do a second review?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. doc updates.

@jreback jreback requested a review from TomAugspurger August 4, 2020 00:22
@jreback jreback added this to the 1.2 milestone Aug 4, 2020
@jreback
Copy link
Contributor

jreback commented Aug 6, 2020

lgtm. pls merge master and ping on green.

@jreback
Copy link
Contributor

jreback commented Aug 7, 2020

cc @TomAugspurger if you can give a glance.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @smithto1. Looks good.

@jreback jreback merged commit a21ce87 into pandas-dev:master Aug 7, 2020
@jreback
Copy link
Contributor

jreback commented Aug 7, 2020

thanks @smithto1 very nice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
3 participants