Deprecate groupby/pivot observed=False default #35967

Closed
wants to merge 28 commits into from
28 commits
7fed06d
Deprecate default observed=False for groupby
jseabold Aug 28, 2020
32397cf
Deprecate default for series
jseabold Aug 28, 2020
5b7507c
Test warns for series
jseabold Aug 28, 2020
6a45118
Test that pivot warns
jseabold Aug 28, 2020
dd02ced
Change default observed for pivot table
jseabold Aug 28, 2020
b1e7b50
Don't set stacklevel bc of multiple entrypoints
jseabold Aug 28, 2020
1e311d7
Don't check stacklevel on tests.
jseabold Aug 28, 2020
8b46e0f
Blacken
jseabold Aug 28, 2020
4137040
Add deprecation doc
jseabold Aug 28, 2020
29b3bf8
Don't use pytest.warns
jseabold Aug 31, 2020
5e30b0f
okwarning option
jseabold Aug 31, 2020
4b9c36e
Remove incorrect series note
jseabold Nov 30, 2020
2deaea4
Hoop jump
jseabold Nov 30, 2020
d06f70d
Blacken
jseabold Nov 30, 2020
e7a9975
Fix backticks
jseabold Nov 30, 2020
61fe1a3
okwarning
jseabold Nov 30, 2020
bacb61d
Move method to right class. Rebase mistake.
jseabold Nov 30, 2020
394fe5b
Expect FutureWarning
jseabold Dec 2, 2020
be6d3c1
crosstab doesn't have observed keyword
jseabold Dec 2, 2020
6cca8c0
Filter fewer warnings
jseabold Dec 3, 2020
0864317
Hard code observed behavior to silence warning. See #35967
jseabold Dec 3, 2020
029edd0
PR in comment
jseabold Dec 3, 2020
5d15dd1
Hardcode vs. filtering warning
jseabold Dec 3, 2020
fb6e4b0
Blacken
jseabold Dec 4, 2020
363865b
Silence deprecation warning.
jseabold Dec 4, 2020
d4d918b
Hard code default behavior. See #35967
jseabold Dec 4, 2020
57d99a7
Hardcode default categorical behavior. See #35967
jseabold Dec 4, 2020
8526064
Will raise in the future
jseabold Dec 7, 2020
4 changes: 2 additions & 2 deletions doc/source/user_guide/10min.rst
Original file line number Diff line number Diff line change
Expand Up @@ -702,11 +702,11 @@ Sorting is per order in the categories, not lexical order.

df.sort_values(by="grade")

Grouping by a categorical column also shows empty categories.
Grouping by a categorical column also shows empty categories when ``observed=False`` is passed.

.. ipython:: python

df.groupby("grade").size()
df.groupby("grade", observed=False).size()
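To make the effect of the keyword concrete, here is a minimal sketch that compares both settings. The frame and its ``grade`` categories are made up for this illustration and are not the ones built earlier in the guide.

```python
import pandas as pd

# Hypothetical data: "medium" is a declared category that never occurs.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "grade": pd.Categorical(
            ["good", "good", "bad"], categories=["bad", "medium", "good"]
        ),
    }
)

# observed=False keeps every declared category, including empty ones.
all_categories = df.groupby("grade", observed=False).size()

# observed=True keeps only the categories that occur in the data.
observed_only = df.groupby("grade", observed=True).size()

print(all_categories)
print(observed_only)
```

Passing the keyword explicitly in either form avoids the deprecation warning this PR introduces.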


Plotting
4 changes: 2 additions & 2 deletions doc/source/user_guide/advanced.rst
Expand Up @@ -809,8 +809,8 @@ Groupby operations on the index will preserve the index nature as well.

.. ipython:: python

df2.groupby(level=0).sum()
df2.groupby(level=0).sum().index
df2.groupby(level=0, observed=False).sum()
df2.groupby(level=0, observed=False).sum().index
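The claim that grouping on the index preserves its categorical nature can be checked with a small sketch. The frame below is an assumption made for the example; it only mimics the shape of ``df2`` in the guide.

```python
import pandas as pd

# Hypothetical frame with a CategoricalIndex; "c" is declared but unused.
df2 = pd.DataFrame(
    {
        "A": [0, 1, 2, 3],
        "B": pd.Categorical(list("aabb"), categories=list("cab")),
    }
).set_index("B")

# Group on the index level and keep all declared categories.
summed = df2.groupby(level=0, observed=False).sum()

print(type(summed.index).__name__)
print(summed)
```

The result is ordered by the category order, not lexically, and the unused category appears with a sum of zero.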

Reindexing operations will return a resulting index based on the type of the passed
indexer. Passing a list will return a plain-old ``Index``; indexing with
11 changes: 6 additions & 5 deletions doc/source/user_guide/categorical.rst
Expand Up @@ -622,7 +622,7 @@ even if some categories are not present in the data:
s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
s.value_counts()

``DataFrame`` methods like :meth:`DataFrame.sum` also show "unused" categories.
``DataFrame`` methods like :meth:`DataFrame.sum` also show "unused" categories:

.. ipython:: python

Expand All @@ -635,15 +635,16 @@ even if some categories are not present in the data:
)
df.sum(axis=1, level=1)

Groupby will also show "unused" categories:
Groupby also shows "unused" categories by default, though this default
is deprecated. In a future release, ``observed`` must be passed explicitly:

.. ipython:: python

cats = pd.Categorical(
["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
df.groupby("cats").mean()
df.groupby("cats", observed=False).mean()

cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df2 = pd.DataFrame(
Expand All @@ -653,7 +654,7 @@ Groupby will also show "unused" categories:
"values": [1, 2, 3, 4],
}
)
df2.groupby(["cats", "B"]).mean()
df2.groupby(["cats", "B"], observed=False).mean()
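With more than one key, the difference between the two settings is easiest to see in the number of result rows. The sketch below reuses the shape of the two-key example above; treat it as an illustration rather than part of the PR.

```python
import pandas as pd

cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df2 = pd.DataFrame(
    {"cats": cats2, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]}
)

# observed=False: every category of "cats" is paired with every value of "B",
# so combinations that never occur show up with missing values.
full = df2.groupby(["cats", "B"], observed=False).mean()

# observed=True: only the combinations present in the data are returned.
seen = df2.groupby(["cats", "B"], observed=True).mean()

print(len(full), len(seen))
```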


Pivot tables:
Expand All @@ -662,7 +663,7 @@ Pivot tables:

raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
pd.pivot_table(df, values="values", index=["A", "B"])
pd.pivot_table(df, values="values", index=["A", "B"], observed=False)
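For comparison, a minimal sketch of the same pivot with ``observed=True``, which restricts the index to combinations that occur in the data. Only this setting is shown because the rows kept under ``observed=False`` also depend on ``dropna``.

```python
import pandas as pd

raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame(
    {"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]}
)

# Only (A, B) pairs that appear in the data become rows of the pivot table.
table = pd.pivot_table(df, values="values", index=["A", "B"], observed=True)

print(table)
```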

Data munging
------------
2 changes: 1 addition & 1 deletion doc/source/user_guide/groupby.rst
Expand Up @@ -1269,7 +1269,7 @@ can be used as group keys. If so, the order of the levels will be preserved:

factor = pd.qcut(data, [0, 0.25, 0.5, 0.75, 1.0])

data.groupby(factor).mean()
data.groupby(factor, observed=True).mean()
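A self-contained version of this snippet might look like the sketch below. The random data and the seed are assumptions added so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical input: 100 standard-normal draws with a fixed seed.
rng = np.random.default_rng(0)
data = pd.Series(rng.standard_normal(100))

# qcut returns a categorical of quartile intervals, ordered from low to high.
factor = pd.qcut(data, [0, 0.25, 0.5, 0.75, 1.0])

# The result follows the order of the interval categories.
means = data.groupby(factor, observed=True).mean()

print(means)
```

Because every quartile bin is populated here, ``observed=True`` and ``observed=False`` would give the same groups; passing the keyword mainly avoids the new warning.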

.. _groupby.specify:

1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.19.0.rst
Expand Up @@ -1131,6 +1131,7 @@ An analogous change has been made to ``MultiIndex.from_product``.
As a consequence, ``groupby`` and ``set_index`` also preserve categorical dtypes in indexes

.. ipython:: python
:okwarning:

df = pd.DataFrame({"A": [0, 1], "B": [10, 11], "C": cat})
df_grouped = df.groupby(by=["A", "C"]).first()
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.20.0.rst
Expand Up @@ -291,6 +291,7 @@ In previous versions, ``.groupby(..., sort=False)`` would fail with a ``ValueErr
**New behavior**:

.. ipython:: python
:okwarning:

df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()

2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.22.0.rst
Expand Up @@ -118,6 +118,7 @@ instead of ``NaN``.
*pandas 0.22*

.. ipython:: python
:okwarning:

grouper = pd.Categorical(["a", "a"], categories=["a", "b"])
pd.Series([1, 2]).groupby(grouper).sum()
Expand All @@ -126,6 +127,7 @@ To restore the 0.21 behavior of returning ``NaN`` for unobserved groups,
use ``min_count>=1``.

.. ipython:: python
:okwarning:

pd.Series([1, 2]).groupby(grouper).sum(min_count=1)

1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.rst
Expand Up @@ -288,6 +288,7 @@ For pivoting operations, this behavior is *already* controlled by the ``dropna``
df

.. ipython:: python
:okwarning:

pd.pivot_table(df, values='values', index=['A', 'B'],
dropna=True)
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
Expand Up @@ -522,6 +522,7 @@ Deprecations
- Deprecated :meth:`Index.asi8` for :class:`Index` subclasses other than :class:`.DatetimeIndex`, :class:`.TimedeltaIndex`, and :class:`PeriodIndex` (:issue:`37877`)
- The ``inplace`` parameter of :meth:`Categorical.remove_unused_categories` is deprecated and will be removed in a future version (:issue:`37643`)
- The ``null_counts`` parameter of :meth:`DataFrame.info` is deprecated and replaced by ``show_counts``. It will be removed in a future version (:issue:`37999`)
- Deprecated the default value ``observed=False`` in :meth:`DataFrame.groupby` and :meth:`DataFrame.pivot_table` (:issue:`17594`)

.. ---------------------------------------------------------------------------

6 changes: 3 additions & 3 deletions pandas/core/frame.py
Expand Up @@ -5677,7 +5677,7 @@ def value_counts(
if subset is None:
subset = self.columns.tolist()

counts = self.groupby(subset).grouper.size()
counts = self.groupby(subset, observed=True).grouper.size()

if sort:
counts = counts.sort_values(ascending=ascending)
Expand Down Expand Up @@ -6698,7 +6698,7 @@ def groupby(
sort: bool = True,
group_keys: bool = True,
squeeze: bool = no_default,
observed: bool = False,
observed: Optional[bool] = None,
dropna: bool = True,
) -> DataFrameGroupBy:
from pandas.core.groupby.generic import DataFrameGroupBy
Expand Down Expand Up @@ -7029,7 +7029,7 @@ def pivot_table(
margins=False,
dropna=True,
margins_name="All",
observed=False,
observed=None,
) -> DataFrame:
from pandas.core.reshape.pivot import pivot_table

14 changes: 10 additions & 4 deletions pandas/core/generic.py
Expand Up @@ -87,10 +87,15 @@
from pandas.core.dtypes.missing import isna, notna

import pandas as pd
from pandas.core import arraylike, indexing, missing, nanops
import pandas.core.algorithms as algos
from pandas.core import (
algorithms as algos,
arraylike,
common as com,
indexing,
missing,
nanops,
)
from pandas.core.base import PandasObject, SelectionMixin
import pandas.core.common as com
from pandas.core.construction import create_series_with_explicit_dtype
from pandas.core.flags import Flags
from pandas.core.indexes import base as ibase
Expand Down Expand Up @@ -10545,7 +10550,8 @@ def pct_change(
def _agg_by_level(self, name, axis=0, level=0, skipna=True, **kwargs):
if axis is None:
raise ValueError("Must specify 'axis' when aggregating by level.")
grouped = self.groupby(level=level, axis=axis, sort=False)
# see pr-35967 for discussion about the observed keyword
grouped = self.groupby(level=level, axis=axis, sort=False, observed=False)
if hasattr(grouped, name) and skipna:
return getattr(grouped, name)(**kwargs)
axis = self._get_axis_number(axis)
4 changes: 2 additions & 2 deletions pandas/core/groupby/groupby.py
Expand Up @@ -526,7 +526,7 @@ def __init__(
sort: bool = True,
group_keys: bool = True,
squeeze: bool = False,
observed: bool = False,
observed: Optional[bool] = None,
mutated: bool = False,
dropna: bool = True,
):
Expand Down Expand Up @@ -3016,7 +3016,7 @@ def get_groupby(
sort: bool = True,
group_keys: bool = True,
squeeze: bool = False,
observed: bool = False,
observed: Optional[bool] = None,
mutated: bool = False,
dropna: bool = True,
) -> GroupBy:
21 changes: 19 additions & 2 deletions pandas/core/groupby/grouper.py
Expand Up @@ -2,6 +2,7 @@
Provide user facing operators for doing the split part of the
split-apply-combine paradigm.
"""
import textwrap
from typing import Dict, Hashable, List, Optional, Set, Tuple
import warnings

Expand Down Expand Up @@ -31,6 +32,18 @@

from pandas.io.formats.printing import pprint_thing

_observed_msg = textwrap.dedent(
"""\
Grouping by a categorical but 'observed' was not specified.
Using 'observed=False', but in a future version of pandas
not specifying 'observed' will raise an error. Pass
'observed=True' or 'observed=False' to silence this warning.

See the `groupby` documentation for more information on the
observed keyword.
"""
)


class Grouper:
"""
Expand Down Expand Up @@ -432,7 +445,7 @@ def __init__(
name=None,
level=None,
sort: bool = True,
observed: bool = False,
observed: Optional[bool] = None,
in_axis: bool = False,
dropna: bool = True,
):
Expand Down Expand Up @@ -495,6 +508,10 @@ def __init__(
# a passed Categorical
elif is_categorical_dtype(self.grouper):

if observed is None:
warnings.warn(_observed_msg, FutureWarning)
observed = False

self.grouper, self.all_grouper = recode_for_groupby(
self.grouper, self.sort, observed
)
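The hunk above follows a common deprecation pattern: the default becomes a ``None`` sentinel, and a ``FutureWarning`` fires only when the caller left it unset and the grouper is categorical. A standalone sketch of that pattern follows; ``resolve_observed`` and ``_MSG`` are hypothetical names for illustration, not pandas API.

```python
import warnings
from typing import Optional

# Hypothetical, shortened message; the PR's text lives in `_observed_msg`.
_MSG = "Grouping by a categorical but 'observed' was not specified."


def resolve_observed(observed: Optional[bool], grouper_is_categorical: bool) -> bool:
    """Return the effective ``observed`` value, warning on an implicit default."""
    if grouper_is_categorical and observed is None:
        # No stacklevel is passed, mirroring the PR, whose commit log notes
        # that several entry points reach this code.
        warnings.warn(_MSG, FutureWarning)
        return False
    return bool(observed)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = resolve_observed(None, grouper_is_categorical=True)
    explicit = resolve_observed(True, grouper_is_categorical=True)

print(implicit, explicit, [w.category.__name__ for w in caught])
```

Only the implicit call warns, so callers that already pass the keyword see no change in behavior.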
Expand Down Expand Up @@ -631,7 +648,7 @@ def get_grouper(
axis: int = 0,
level=None,
sort: bool = True,
observed: bool = False,
observed: Optional[bool] = None,
mutated: bool = False,
validate: bool = True,
dropna: bool = True,
7 changes: 6 additions & 1 deletion pandas/core/indexes/base.py
Expand Up @@ -493,7 +493,12 @@ def _format_duplicate_message(self):
duplicates = self[self.duplicated(keep="first")].unique()
assert len(duplicates)

out = Series(np.arange(len(self))).groupby(self).agg(list)[duplicates]
# see pr-35967 about the observed keyword
out = (
Series(np.arange(len(self)))
.groupby(self, observed=False)
.agg(list)[duplicates]
)
if self.nlevels == 1:
out = out.rename_axis("label")
return out.to_frame(name="positions")
6 changes: 4 additions & 2 deletions pandas/core/reshape/merge.py
Expand Up @@ -109,13 +109,15 @@ def _groupby_and_merge(by, on, left: "DataFrame", right: "DataFrame", merge_piec
if not isinstance(by, (list, tuple)):
by = [by]

lby = left.groupby(by, sort=False)
# see pr-35967 for discussion about observed=False
# this is the previous default behavior if the group is a categorical
lby = left.groupby(by, sort=False, observed=False)
rby: Optional[groupby.DataFrameGroupBy] = None

# if we can groupby the rhs
# then we can get vastly better perf
if all(item in right.columns for item in by):
rby = right.groupby(by, sort=False)
rby = right.groupby(by, sort=False, observed=False)

for key, lhs in lby:

4 changes: 3 additions & 1 deletion pandas/core/reshape/pivot.py
Expand Up @@ -46,7 +46,7 @@ def pivot_table(
margins=False,
dropna=True,
margins_name="All",
observed=False,
observed=None,
) -> "DataFrame":
index = _convert_by(index)
columns = _convert_by(columns)
Expand Down Expand Up @@ -612,6 +612,8 @@ def crosstab(
margins=margins,
margins_name=margins_name,
dropna=dropna,
# the below is only here to silence the FutureWarning
observed=False,
**kwargs,
)

2 changes: 1 addition & 1 deletion pandas/core/series.py
Expand Up @@ -1674,7 +1674,7 @@ def groupby(
sort: bool = True,
group_keys: bool = True,
squeeze: bool = no_default,
observed: bool = False,
observed: Optional[bool] = None,
dropna: bool = True,
) -> "SeriesGroupBy":
from pandas.core.groupby.generic import SeriesGroupBy
11 changes: 11 additions & 0 deletions pandas/core/shared_docs.py
Expand Up @@ -119,6 +119,17 @@
This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.

The current default of ``observed=False`` is deprecated. In
the future this will be a required keyword in the presence
of a categorical grouper and a failure to specify a value will
result in an error.

Explicitly pass ``observed=True`` to silence the warning and
show only observed values.
Explicitly pass ``observed=False`` to silence the warning and
show groups for all categories, including unobserved ones.

dropna : bool, default True
If True, and if group keys contain NA values, NA values together
with row/column will be dropped.
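The docstring's statement that ``observed`` only applies to categorical groupers can be illustrated with a short sketch. The frame is a made-up example with a plain string key.

```python
import pandas as pd

# Hypothetical frame whose grouping key is an ordinary object column.
df = pd.DataFrame({"key": ["x", "x", "y"], "value": [1, 2, 3]})

# With a non-categorical grouper both settings give the same result.
with_true = df.groupby("key", observed=True).sum()
with_false = df.groupby("key", observed=False).sum()

print(with_true.equals(with_false))
```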
2 changes: 1 addition & 1 deletion pandas/plotting/_matplotlib/boxplot.py
Expand Up @@ -195,7 +195,7 @@ def _grouped_plot_by_column(
return_type=None,
**kwargs,
):
grouped = data.groupby(by)
grouped = data.groupby(by, observed=False)
if columns is None:
if not isinstance(by, (list, tuple)):
by = [by]
7 changes: 3 additions & 4 deletions pandas/tests/groupby/aggregate/test_aggregate.py
Expand Up @@ -13,8 +13,7 @@
from pandas.core.dtypes.common import is_integer_dtype

import pandas as pd
from pandas import DataFrame, Index, MultiIndex, Series, concat
import pandas._testing as tm
from pandas import DataFrame, Index, MultiIndex, Series, _testing as tm, concat
from pandas.core.base import SpecificationError
from pandas.core.groupby.grouper import Grouping

Expand Down Expand Up @@ -1074,7 +1073,7 @@ def test_groupby_single_agg_cat_cols(grp_col_dict, exp_data):

input_df = input_df.astype({"cat": "category", "cat_ord": "category"})
input_df["cat_ord"] = input_df["cat_ord"].cat.as_ordered()
result_df = input_df.groupby("cat").agg(grp_col_dict)
result_df = input_df.groupby("cat", observed=False).agg(grp_col_dict)

# create expected dataframe
cat_index = pd.CategoricalIndex(
Expand Down Expand Up @@ -1108,7 +1107,7 @@ def test_groupby_combined_aggs_cat_cols(grp_col_dict, exp_data):

input_df = input_df.astype({"cat": "category", "cat_ord": "category"})
input_df["cat_ord"] = input_df["cat_ord"].cat.as_ordered()
result_df = input_df.groupby("cat").agg(grp_col_dict)
result_df = input_df.groupby("cat", observed=False).agg(grp_col_dict)

# create expected dataframe
cat_index = pd.CategoricalIndex(