Skip to content

BUG: GroupBy.count() and GroupBy.sum() incorreclty return NaN instead of 0 for missing categories #35280

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 37 commits into from
Aug 7, 2020
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
37 commits
Select commit Hold shift + click to select a range
ec577d7
set self.observed=True for .sum() and .count() so that reindex can be…
smithto1 Jul 11, 2020
586b1bd
wrote test for .sum() and .count() exception handling
smithto1 Jul 11, 2020
ec4bd28
black
smithto1 Jul 11, 2020
eed48cb
updated pivot tests
smithto1 Jul 11, 2020
88d8a14
whatsnew
smithto1 Jul 11, 2020
940b32e
comments
smithto1 Jul 11, 2020
3e39f1a
black
smithto1 Jul 11, 2020
bd4c4fe
.sum() and .count() create inner groupby with observed=True, and then…
smithto1 Jul 14, 2020
fcbb58b
modify self.observed with a context manager
smithto1 Jul 15, 2020
5f89c17
added comments
smithto1 Jul 15, 2020
7ed0982
merging master
smithto1 Jul 15, 2020
9da5b18
Fixed import order that caused Linting error
smithto1 Jul 15, 2020
534d60f
addressing comments
smithto1 Jul 16, 2020
b323c9c
merging master
smithto1 Jul 16, 2020
d2abd4d
fixing whatsnew comment
smithto1 Jul 16, 2020
361098a
ammend comment
smithto1 Jul 16, 2020
9879199
ammend comment
smithto1 Jul 16, 2020
39ee6d0
addressing PR comments
smithto1 Jul 16, 2020
fc1a846
ammend comment
smithto1 Jul 16, 2020
f774288
amend comment
smithto1 Jul 17, 2020
1986f51
amend comment (just to rerun tests
smithto1 Jul 17, 2020
c4c1a92
ammend comment to start new tests (tests failing due to some flakey t…
smithto1 Jul 17, 2020
7093c91
merging master
smithto1 Jul 18, 2020
5f8ba0e
amend comment to kick off tests
smithto1 Jul 18, 2020
49b5fc8
ammend test to kick off tests
smithto1 Jul 18, 2020
0454e3a
merging master
smithto1 Jul 26, 2020
28794e3
fixed merge
smithto1 Jul 26, 2020
f616d5a
merge fix
smithto1 Jul 26, 2020
80d4b7d
udpate comment to kick off tests
smithto1 Jul 27, 2020
442ad1d
amend comment to restart checks
smithto1 Aug 1, 2020
be8b78c
Merge remote-tracking branch 'upstream/master' into issue31422-v7
smithto1 Aug 5, 2020
ed99b60
doc updates
smithto1 Aug 5, 2020
528ff7e
restart tests
smithto1 Aug 5, 2020
b2a331f
Merge remote-tracking branch 'upstream/master' into issue31422-v7
smithto1 Aug 6, 2020
cb010fa
Merge branch 'master' into issue31422-v7
jreback Aug 7, 2020
d6ebe10
merge
smithto1 Aug 7, 2020
f575687
merge
smithto1 Aug 7, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1081,6 +1081,9 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrame.ewm.cov` was throwing ``AssertionError`` for :class:`MultiIndex` inputs (:issue:`34440`)
- Bug in :meth:`core.groupby.DataFrameGroupBy.transform` when ``func='nunique'`` and columns are of type ``datetime64``, the result would also be of type ``datetime64`` instead of ``int64`` (:issue:`35109`)
- Bug in :meth:'DataFrameGroupBy.first' and :meth:'DataFrameGroupBy.last' that would raise an unnecessary ``ValueError`` when grouping on multiple ``Categoricals`` (:issue:`34951`)
- Bug in :meth:`DataFrameGroupBy.count` was returning ``NaN`` for missing categories when grouped on multiple ``Categoricals``. Now returning ``0`` (:issue:`35028`)
- Bug in :meth:`DataFrameGroupBy.sum` and :meth:`SeriesGroupBy.sum` was reutrning ``NaN`` for missing categories when grouped on multiple ``Categorials``. Now returning ``0`` (:issue:`31422`)


Reshaping
^^^^^^^^^
Expand Down Expand Up @@ -1115,6 +1118,8 @@ Reshaping
- Bug in :meth:`Series.where` with an empty Series and empty ``cond`` having non-bool dtype (:issue:`34592`)
- Fixed regression where :meth:`DataFrame.apply` would raise ``ValueError`` for elements whth ``S`` dtype (:issue:`34529`)
- Bug in :meth:`DataFrame.append` leading to sorting columns even when ``sort=False`` is specified (:issue:`35092`)
- Bug in :meth:`DataFrame.pivot_table` with ``aggfunc='count'``, was returning ``NaN`` for missing categories when pivoted on a ``Categorical``. Now returning ``0`` (:issue:`35028`)
- Bug in :meth:`DataFrame.pivot_table` with ``aggfunc='sum'``, was reutrning ``NaN`` for missing categories when pivoted on a ``Categorical``. Now returning ``0`` (:issue:`31422`)

Sparse
^^^^^^
Expand Down
11 changes: 10 additions & 1 deletion pandas/core/groupby/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@
GroupBy,
_agg_template,
_apply_docs,
_observed_is_true,
_transform_template,
get_groupby,
)
Expand Down Expand Up @@ -1827,7 +1828,15 @@ def count(self):
)
blocks = [make_block(val, placement=loc) for val, loc in zip(counted, locs)]

return self._wrap_agged_blocks(blocks, items=data.items)
# If we are grouping on categoricals we want unobserved categories to
# return zero, rather than the default of NaN which the reindexing in
# _wrap_agged_blocks() returns. We set self.observed=True for the call to
# _wrap_agged_blocks() to avoid it reindexing. We then reindex below with
# fill_value=0. See GH 35028
with _observed_is_true(self):
result = self._wrap_agged_blocks(blocks, items=data.items)

return self._reindex_output(result, fill_value=0)

def nunique(self, dropna: bool = True):
"""
Expand Down
29 changes: 26 additions & 3 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -464,6 +464,17 @@ def _group_selection_context(groupby):
groupby._reset_group_selection()


@contextmanager
def _observed_is_true(groupby):
"""
Set GroupBy.observed to True, then restore to original value
"""
original_value = groupby.observed
groupby.observed = True
yield groupby
groupby.observed = original_value


_KeysArgType = Union[
Hashable,
List[Hashable],
Expand Down Expand Up @@ -1531,9 +1542,21 @@ def size(self) -> FrameOrSeriesUnion:

@doc(_groupby_agg_method_template, fname="sum", no=True, mc=0)
def sum(self, numeric_only: bool = True, min_count: int = 0):
return self._agg_general(
numeric_only=numeric_only, min_count=min_count, alias="add", npfunc=np.sum
)

# If we are grouping on categoricals we want unobserved categories to
# return zero, rather than the default of NaN which the reindexing in
# _agg_general() returns. We set self.observed=True for the call to
# _agg_general() to avoid it reindexing. We then reindex below with
# fill_value=0. See GH 31422
with _observed_is_true(self):
result = self._agg_general(
numeric_only=numeric_only,
min_count=min_count,
alias="add",
npfunc=np.sum,
)

return self._reindex_output(result, fill_value=0)

@doc(_groupby_agg_method_template, fname="prod", no=True, mc=0)
def prod(self, numeric_only: bool = True, min_count: int = 0):
Expand Down
46 changes: 13 additions & 33 deletions pandas/tests/groupby/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@
import pandas._testing as tm


def cartesian_product_for_groupers(result, args, names):
def cartesian_product_for_groupers(result, args, names, fill_value=np.NaN):
""" Reindex to a cartesian production for the groupers,
preserving the nature (Categorical) of each grouper
"""
Expand All @@ -33,7 +33,7 @@ def f(a):
return a

index = MultiIndex.from_product(map(f, args), names=names)
return result.reindex(index).sort_index()
return result.reindex(index, fill_value=fill_value).sort_index()


_results_for_groupbys_with_missing_categories = dict(
Expand Down Expand Up @@ -309,7 +309,7 @@ def test_observed(observed):
result = gb.sum()
if not observed:
expected = cartesian_product_for_groupers(
expected, [cat1, cat2, ["foo", "bar"]], list("ABC")
expected, [cat1, cat2, ["foo", "bar"]], list("ABC"), fill_value=0
)

tm.assert_frame_equal(result, expected)
Expand All @@ -319,7 +319,9 @@ def test_observed(observed):
expected = DataFrame({"values": [1, 2, 3, 4]}, index=exp_index)
result = gb.sum()
if not observed:
expected = cartesian_product_for_groupers(expected, [cat1, cat2], list("AB"))
expected = cartesian_product_for_groupers(
expected, [cat1, cat2], list("AB"), fill_value=0
)

tm.assert_frame_equal(result, expected)

Expand Down Expand Up @@ -1189,6 +1191,8 @@ def test_seriesgroupby_observed_false_or_none(df_cat, observed, operation):
).sortlevel()

expected = Series(data=[2, 4, np.nan, 1, np.nan, 3], index=index, name="C")
if operation == "agg":
expected = expected.fillna(0, downcast="infer")
grouped = df_cat.groupby(["A", "B"], observed=observed)["C"]
result = getattr(grouped, operation)(sum)
tm.assert_series_equal(result, expected)
Expand Down Expand Up @@ -1340,15 +1344,6 @@ def test_series_groupby_on_2_categoricals_unobserved_zeroes_or_nans(
)
request.node.add_marker(mark)

if reduction_func == "sum": # GH 31422
mark = pytest.mark.xfail(
reason=(
"sum should return 0 but currently returns NaN. "
"This is a known bug. See GH 31422."
)
)
request.node.add_marker(mark)

df = pd.DataFrame(
{
"cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
Expand All @@ -1369,8 +1364,11 @@ def test_series_groupby_on_2_categoricals_unobserved_zeroes_or_nans(
val = result.loc[idx]
assert (pd.isna(zero_or_nan) and pd.isna(val)) or (val == zero_or_nan)

# If we expect unobserved values to be zero, we also expect the dtype to be int
if zero_or_nan == 0:
# If we expect unobserved values to be zero, we also expect the dtype to be int.
# Except for .sum(). If the observed categories sum to dtype=float (i.e. their
# sums have decimals), then the zeros for the missing categories should also be
# floats.
if zero_or_nan == 0 and reduction_func != "sum":
assert np.issubdtype(result.dtype, np.integer)


Expand Down Expand Up @@ -1412,24 +1410,6 @@ def test_dataframe_groupby_on_2_categoricals_when_observed_is_false(
if reduction_func == "ngroup":
pytest.skip("ngroup does not return the Categories on the index")

if reduction_func == "count": # GH 35028
mark = pytest.mark.xfail(
reason=(
"DataFrameGroupBy.count returns np.NaN for missing "
"categories, when it should return 0. See GH 35028"
)
)
request.node.add_marker(mark)

if reduction_func == "sum": # GH 31422
mark = pytest.mark.xfail(
reason=(
"sum should return 0 but currently returns NaN. "
"This is a known bug. See GH 31422."
)
)
request.node.add_marker(mark)

df = pd.DataFrame(
{
"cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
Expand Down
13 changes: 7 additions & 6 deletions pandas/tests/reshape/test_pivot.py
Original file line number Diff line number Diff line change
Expand Up @@ -1817,7 +1817,7 @@ def test_categorical_aggfunc(self, observed):
["A", "B", "C"], categories=["A", "B", "C"], ordered=False, name="C1"
)
expected_columns = pd.Index(["a", "b"], name="C2")
expected_data = np.array([[1.0, np.nan], [1.0, np.nan], [np.nan, 2.0]])
expected_data = np.array([[1, 0], [1, 0], [0, 2]], dtype=np.int64)
expected = pd.DataFrame(
expected_data, index=expected_index, columns=expected_columns
)
Expand Down Expand Up @@ -1851,18 +1851,19 @@ def test_categorical_pivot_index_ordering(self, observed):
values="Sales",
index="Month",
columns="Year",
dropna=observed,
observed=observed,
aggfunc="sum",
)
expected_columns = pd.Int64Index([2013, 2014], name="Year")
expected_index = pd.CategoricalIndex(
["January"], categories=months, ordered=False, name="Month"
months, categories=months, ordered=False, name="Month"
)
expected_data = [[320, 120]] + [[0, 0]] * 11
expected = pd.DataFrame(
[[320, 120]], index=expected_index, columns=expected_columns
expected_data, index=expected_index, columns=expected_columns
)
if not observed:
result = result.dropna().astype(np.int64)
if observed:
expected = expected.loc[["January"]]

tm.assert_frame_equal(result, expected)

Expand Down