BUG: DataFrameGroupBy.value_counts includes non-observed categories of non-grouping columns #46798

Merged (6 commits) on Aug 7, 2022
doc/source/whatsnew/v1.5.0.rst (44 additions, 3 deletions)
@@ -393,6 +393,48 @@ upon serialization. (Related issue :issue:`12997`)
# Roundtripping now works
pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index


.. _whatsnew_150.notable_bug_fixes.groupby_value_counts_categorical:

DataFrameGroupBy.value_counts with non-grouping categorical columns and ``observed=True``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Calling :meth:`.DataFrameGroupBy.value_counts` with ``observed=True`` would incorrectly drop non-observed categories of non-grouping columns (:issue:`46357`).

.. code-block:: ipython

In [6]: df = pd.DataFrame(["a", "b", "c"], dtype="category").iloc[0:2]
In [7]: df
Out[7]:
0
0 a
1 b

*Old Behavior*

.. code-block:: ipython

In [8]: df.groupby(level=0, observed=True).value_counts()
Out[8]:
0 a 1
1 b 1
dtype: int64


*New Behavior*

.. code-block:: ipython

In [9]: df.groupby(level=0, observed=True).value_counts()
Out[9]:
0 a 1
1 a 0
b 1
0 b 0
c 0
1 c 0
dtype: int64
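
Editor's aside (not part of the diff): under the new behaviour the zeros carry through ``normalize=True`` as well. A minimal sketch, assuming pandas 1.5.0 or later, where this fix is included:

    import pandas as pd

    df = pd.DataFrame(["a", "b", "c"], dtype="category").iloc[0:2]
    result = df.groupby(level=0, observed=True).value_counts(normalize=True)

    # Every (group, category) pair is present; non-observed categories get a
    # proportion of 0.0 instead of being dropped.
    assert result.loc[(0, "a")] == 1.0
    assert result.loc[(1, "c")] == 0.0
    assert len(result) == 6  # 2 groups x 3 categories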

.. ---------------------------------------------------------------------------
.. _whatsnew_150.api_breaking:

@@ -820,9 +862,8 @@ Bug fixes

Categorical
^^^^^^^^^^^
- Bug in :meth:`.Categorical.view` not accepting integer dtypes (:issue:`25464`)
- Bug in :meth:`.CategoricalIndex.union` when the index's categories are integer-dtype and the index contains ``NaN`` values incorrectly raising instead of casting to ``float64`` (:issue:`45362`)
-
- Bug in :meth:`Categorical.view` not accepting integer dtypes (:issue:`25464`)
- Bug in :meth:`CategoricalIndex.union` when the index's categories are integer-dtype and the index contains ``NaN`` values incorrectly raising instead of casting to ``float64`` (:issue:`45362`)

Datetimelike
^^^^^^^^^^^^
pandas/core/groupby/generic.py (19 additions, 2 deletions)
@@ -69,6 +69,7 @@
reconstruct_func,
validate_func_kwargs,
)
from pandas.core.arrays.categorical import Categorical
import pandas.core.common as com
from pandas.core.construction import create_series_with_explicit_dtype
from pandas.core.frame import DataFrame
@@ -87,6 +88,7 @@
MultiIndex,
all_indexes_same,
)
from pandas.core.indexes.category import CategoricalIndex
from pandas.core.series import Series
from pandas.core.shared_docs import _shared_docs
from pandas.core.util.numba_ import maybe_use_numba
@@ -1821,6 +1823,7 @@ def value_counts(
key=key,
axis=self.axis,
sort=self.sort,
observed=False,
dropna=dropna,
)
groupings += list(grouper.groupings)
@@ -1834,6 +1837,19 @@
)
result_series = cast(Series, gb.size())

# GH-46357 Include non-observed categories
# of non-grouping columns regardless of `observed`
if any(
isinstance(grouping.grouping_vector, (Categorical, CategoricalIndex))
and not grouping._observed
for grouping in groupings
):
levels_list = [ping.result_index for ping in groupings]
multi_index, _ = MultiIndex.from_product(
levels_list, names=[ping.name for ping in groupings]
).sortlevel()
result_series = result_series.reindex(multi_index, fill_value=0)

if normalize:
# Normalize the results by dividing by the original group sizes.
# We are guaranteed to have the first N levels be the
@@ -1844,12 +1860,13 @@
indexed_group_size = result_series.groupby(
result_series.index.droplevel(levels),
sort=self.sort,
observed=self.observed,
dropna=self.dropna,
).transform("sum")

result_series /= indexed_group_size

# Handle groups of non-observed categories
result_series = result_series.fillna(0.0)

if sort:
# Sort the values and then resort by the main grouping
index_level = range(len(self.grouper.groupings))
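Editor's note: below is a standalone sketch (not the patch itself) of the technique the ``generic.py`` change applies: build the full cartesian product of every grouping's categories, reindex the raw counts onto it with ``fill_value=0``, and, on the normalize path, divide by per-group totals and zero out the NaN that a group with a zero total (e.g. a non-observed category of a grouping key) would produce. The frame and variable names are illustrative only; assumes pandas >= 1.4.

    import pandas as pd

    df = pd.DataFrame(
        {
            "key": ["x", "x", "y"],
            "val": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
        }
    )

    # Raw counts: with observed=True semantics the non-observed category "c"
    # is missing from the result.
    counts = df.groupby(["key", "val"], observed=True).size()

    # Step 1: full cartesian product of the levels, sorted, then reindex with 0s.
    levels_list = [
        pd.Index(["x", "y"], name="key"),
        pd.CategoricalIndex(["a", "b", "c"], name="val"),
    ]
    full_index, _ = pd.MultiIndex.from_product(levels_list).sortlevel()
    counts = counts.reindex(full_index, fill_value=0)

    # Step 2 (normalize path): divide by per-group totals; a group whose total
    # is zero would give 0/0 = NaN, which the patch zeroes out with fillna(0.0).
    group_sizes = counts.groupby(level="key").transform("sum")
    proportions = (counts / group_sizes).fillna(0.0)

    print(counts)       # includes (x, c) == 0 and (y, b) == (y, c) == 0
    print(proportions)  # x: a 0.5, b 0.5, c 0.0; y: a 1.0, b 0.0, c 0.0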