Commit 0baa925

DEPR: observed=False default in groupby (#51811)
* DEPR: observed=False default in groupby
* Fixup docs
* DEPR: observed=False default in groupby
* fixup
* Mention defaulting to True
* fixup
1 parent 26999eb commit 0baa925

24 files changed, +84 -47 lines changed

doc/source/user_guide/10min.rst

+2 -2
@@ -702,11 +702,11 @@ Sorting is per order in the categories, not lexical order:

    df.sort_values(by="grade")

-Grouping by a categorical column also shows empty categories:
+Grouping by a categorical column with ``observed=False`` also shows empty categories:

 .. ipython:: python

-    df.groupby("grade").size()
+    df.groupby("grade", observed=False).size()


 Plotting
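
A minimal sketch of what the reworded 10min.rst sentence describes, assuming a hypothetical `grade` column with one unused category (the data below is illustrative and not taken from the guide):

```python
import pandas as pd

# Hypothetical data: "grade" declares three categories, but only two occur.
df = pd.DataFrame(
    {
        "grade": pd.Categorical(
            ["good", "bad", "good"], categories=["bad", "medium", "good"]
        )
    }
)

# observed=False keeps the unused "medium" category, with a size of 0.
all_categories = df.groupby("grade", observed=False).size()

# observed=True reports only the categories that occur in the data.
seen_only = df.groupby("grade", observed=True).size()

print(all_categories)
print(seen_only)
```

Passing `observed` explicitly, as the docs now do, should keep the result independent of whichever default a given pandas version uses.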

doc/source/user_guide/advanced.rst

+2 -2
@@ -800,8 +800,8 @@ Groupby operations on the index will preserve the index nature as well.

 .. ipython:: python

-    df2.groupby(level=0).sum()
-    df2.groupby(level=0).sum().index
+    df2.groupby(level=0, observed=True).sum()
+    df2.groupby(level=0, observed=True).sum().index

 Reindexing operations will return a resulting index based on the type of the passed
 indexer. Passing a list will return a plain-old ``Index``; indexing with

doc/source/user_guide/categorical.rst

+5 -5
@@ -607,7 +607,7 @@ even if some categories are not present in the data:
    s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
    s.value_counts()

-``DataFrame`` methods like :meth:`DataFrame.sum` also show "unused" categories.
+``DataFrame`` methods like :meth:`DataFrame.sum` also show "unused" categories when ``observed=False``.

 .. ipython:: python

@@ -618,17 +618,17 @@ even if some categories are not present in the data:
        data=[[1, 2, 3], [4, 5, 6]],
        columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
    ).T
-    df.groupby(level=1).sum()
+    df.groupby(level=1, observed=False).sum()

-Groupby will also show "unused" categories:
+Groupby will also show "unused" categories when ``observed=False``:

 .. ipython:: python

    cats = pd.Categorical(
        ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
    )
    df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
-    df.groupby("cats").mean()
+    df.groupby("cats", observed=False).mean()

    cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
    df2 = pd.DataFrame(
@@ -638,7 +638,7 @@ Groupby will also show "unused" categories:
            "values": [1, 2, 3, 4],
        }
    )
-    df2.groupby(["cats", "B"]).mean()
+    df2.groupby(["cats", "B"], observed=False).mean()


 Pivot tables:
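
For the multi-key `df2` example in the categorical.rst hunk above, a rough sketch of how `observed` changes the shape of the result. The contents of column `"B"` are elided between the hunks, so the values used here are an assumption:

```python
import pandas as pd

cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df2 = pd.DataFrame(
    {
        "cats": cats2,
        "B": ["c", "d", "c", "d"],  # assumed values; not shown in the diff
        "values": [1, 2, 3, 4],
    }
)

# observed=False: every category of "cats" is paired with every value of "B",
# so the unused category "c" contributes rows filled with NaN.
full = df2.groupby(["cats", "B"], observed=False).mean()

# observed=True: only key combinations that actually occur are kept.
seen = df2.groupby(["cats", "B"], observed=True).mean()

print(full)
print(seen)
```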

doc/source/user_guide/groupby.rst

+1 -1
@@ -1401,7 +1401,7 @@ can be used as group keys. If so, the order of the levels will be preserved:

    factor = pd.qcut(data, [0, 0.25, 0.5, 0.75, 1.0])

-    data.groupby(factor).mean()
+    data.groupby(factor, observed=False).mean()

 .. _groupby.specify:

doc/source/whatsnew/v0.15.0.rst

+1 -1
@@ -85,7 +85,7 @@ For full docs, see the :ref:`categorical introduction <categorical>` and the
        "medium", "good", "very good"])
    df["grade"]
    df.sort_values("grade")
-    df.groupby("grade").size()
+    df.groupby("grade", observed=False).size()

 - ``pandas.core.group_agg`` and ``pandas.core.factor_agg`` were removed. As an alternative, construct
   a dataframe and use ``df.groupby(<group>).agg(<func>)``.

doc/source/whatsnew/v0.19.0.rst

+1 -1
@@ -1134,7 +1134,7 @@ As a consequence, ``groupby`` and ``set_index`` also preserve categorical dtypes
 .. ipython:: python

    df = pd.DataFrame({"A": [0, 1], "B": [10, 11], "C": cat})
-    df_grouped = df.groupby(by=["A", "C"]).first()
+    df_grouped = df.groupby(by=["A", "C"], observed=False).first()
    df_set_idx = df.set_index(["A", "C"])

 **Previous behavior**:

doc/source/whatsnew/v0.20.0.rst

+2 -2
@@ -289,15 +289,15 @@ In previous versions, ``.groupby(..., sort=False)`` would fail with a ``ValueErr

 .. code-block:: ipython

-    In [3]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
+    In [3]: df[df.chromosomes != '1'].groupby('chromosomes', observed=False, sort=False).sum()
    ---------------------------------------------------------------------------
    ValueError: items in new_categories are not the same as in old categories

 **New behavior**:

 .. ipython:: python

-    df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
+    df[df.chromosomes != '1'].groupby('chromosomes', observed=False, sort=False).sum()

 .. _whatsnew_0200.enhancements.table_schema:

doc/source/whatsnew/v0.22.0.rst

+3 -3
@@ -109,7 +109,7 @@ instead of ``NaN``.

    In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

-    In [9]: pd.Series([1, 2]).groupby(grouper).sum()
+    In [9]: pd.Series([1, 2]).groupby(grouper, observed=False).sum()
    Out[9]:
    a    3.0
    b    NaN
@@ -120,14 +120,14 @@ instead of ``NaN``.
 .. ipython:: python

    grouper = pd.Categorical(["a", "a"], categories=["a", "b"])
-    pd.Series([1, 2]).groupby(grouper).sum()
+    pd.Series([1, 2]).groupby(grouper, observed=False).sum()

 To restore the 0.21 behavior of returning ``NaN`` for unobserved groups,
 use ``min_count>=1``.

 .. ipython:: python

-    pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
+    pd.Series([1, 2]).groupby(grouper, observed=False).sum(min_count=1)

 Resample
 ^^^^^^^^
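
The v0.22.0 snippets above can be run together as one short sketch. The variable names are illustrative, and the `observed=True` call is added only for contrast; it is not part of the whatsnew text:

```python
import pandas as pd

grouper = pd.Categorical(["a", "a"], categories=["a", "b"])
ser = pd.Series([1, 2])

# With observed=False the unobserved group "b" is kept; its sum is 0.
totals = ser.groupby(grouper, observed=False).sum()

# min_count=1 requires at least one value per group, so "b" becomes NaN.
totals_min1 = ser.groupby(grouper, observed=False).sum(min_count=1)

# With observed=True the unobserved group is not reported at all.
observed_only = ser.groupby(grouper, observed=True).sum()

print(totals)
print(totals_min1)
print(observed_only)
```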

doc/source/whatsnew/v2.1.0.rst

+1 -0
@@ -99,6 +99,7 @@ Deprecations
 - Deprecated silently dropping unrecognized timezones when parsing strings to datetimes (:issue:`18702`)
 - Deprecated :meth:`DataFrame._data` and :meth:`Series._data`, use public APIs instead (:issue:`33333`)
 - Deprecating pinning ``group.name`` to each group in :meth:`SeriesGroupBy.aggregate` aggregations; if your operation requires utilizing the groupby keys, iterate over the groupby object instead (:issue:`41090`)
+- Deprecated the default of ``observed=False`` in :meth:`DataFrame.groupby` and :meth:`Series.groupby`; this will default to ``True`` in a future version (:issue:`43999`)
 - Deprecated ``axis=1`` in :meth:`DataFrame.groupby` and in :class:`Grouper` constructor, do ``frame.T.groupby(...)`` instead (:issue:`51203`)
 - Deprecated passing a :class:`DataFrame` to :meth:`DataFrame.from_records`, use :meth:`DataFrame.set_index` or :meth:`DataFrame.drop` instead (:issue:`51353`)
 - Deprecated accepting slices in :meth:`DataFrame.take`, call ``obj[slicer]`` or pass a sequence of integers instead (:issue:`51539`)

pandas/core/frame.py

+1 -1
@@ -8681,7 +8681,7 @@ def groupby(
         as_index: bool = True,
         sort: bool = True,
         group_keys: bool = True,
-        observed: bool = False,
+        observed: bool | lib.NoDefault = lib.no_default,
         dropna: bool = True,
     ) -> DataFrameGroupBy:
         if axis is not lib.no_default:

pandas/core/groupby/groupby.py

+18 -3
@@ -68,6 +68,7 @@ class providing the base-class of operations.
     cache_readonly,
     doc,
 )
+from pandas.util._exceptions import find_stack_level

 from pandas.core.dtypes.cast import ensure_dtype_can_hold_na
 from pandas.core.dtypes.common import (
@@ -905,7 +906,7 @@ def __init__(
         as_index: bool = True,
         sort: bool = True,
         group_keys: bool | lib.NoDefault = True,
-        observed: bool = False,
+        observed: bool | lib.NoDefault = lib.no_default,
         dropna: bool = True,
     ) -> None:
         self._selection = selection
@@ -922,7 +923,6 @@ def __init__(
         self.keys = keys
         self.sort = sort
         self.group_keys = group_keys
-        self.observed = observed
         self.dropna = dropna

         if grouper is None:
@@ -932,10 +932,23 @@ def __init__(
                 axis=axis,
                 level=level,
                 sort=sort,
-                observed=observed,
+                observed=False if observed is lib.no_default else observed,
                 dropna=self.dropna,
             )

+        if observed is lib.no_default:
+            if any(ping._passed_categorical for ping in grouper.groupings):
+                warnings.warn(
+                    "The default of observed=False is deprecated and will be changed "
+                    "to True in a future version of pandas. Pass observed=False to "
+                    "retain current behavior or observed=True to adopt the future "
+                    "default and silence this warning.",
+                    FutureWarning,
+                    stacklevel=find_stack_level(),
+                )
+            observed = False
+        self.observed = observed
+
         self.obj = obj
         self.axis = obj._get_axis_number(axis)
         self.grouper = grouper
@@ -2125,6 +2138,8 @@ def _value_counts(
                     result_series.index.droplevel(levels),
                     sort=self.sort,
                     dropna=self.dropna,
+                    # GH#43999 - deprecation of observed=False
+                    observed=False,
                 ).transform("sum")
                 result_series /= indexed_group_size
21302145

pandas/core/indexes/base.py

+5 -1
@@ -726,7 +726,11 @@ def _format_duplicate_message(self) -> DataFrame:
         duplicates = self[self.duplicated(keep="first")].unique()
         assert len(duplicates)

-        out = Series(np.arange(len(self))).groupby(self).agg(list)[duplicates]
+        out = (
+            Series(np.arange(len(self)))
+            .groupby(self, observed=False)
+            .agg(list)[duplicates]
+        )
         if self._is_multi:
             # test_format_duplicate_labels_message_multi
             # error: "Type[Index]" has no attribute "from_tuples" [attr-defined]

pandas/core/series.py

+1 -1
@@ -2008,7 +2008,7 @@ def groupby(
         as_index: bool = True,
         sort: bool = True,
         group_keys: bool = True,
-        observed: bool = False,
+        observed: bool | lib.NoDefault = lib.no_default,
         dropna: bool = True,
     ) -> SeriesGroupBy:
         from pandas.core.groupby.generic import SeriesGroupBy

pandas/core/shared_docs.py

+5 -0
@@ -154,6 +154,11 @@
     This only applies if any of the groupers are Categoricals.
     If True: only show observed values for categorical groupers.
     If False: show all values for categorical groupers.
+
+    .. deprecated:: 2.1.0
+
+        The default value will change to True in a future version of pandas.
+
 dropna : bool, default True
     If True, and if group keys contain NA values, NA values together
     with row/column will be dropped.

pandas/plotting/_matplotlib/boxplot.py

+1 -1
@@ -254,7 +254,7 @@ def _grouped_plot_by_column(
     return_type=None,
     **kwargs,
 ):
-    grouped = data.groupby(by)
+    grouped = data.groupby(by, observed=False)
     if columns is None:
         if not isinstance(by, (list, tuple)):
             by = [by]

pandas/tests/groupby/aggregate/test_aggregate.py

+2 -2
@@ -1250,7 +1250,7 @@ def test_groupby_single_agg_cat_cols(grp_col_dict, exp_data):

     input_df = input_df.astype({"cat": "category", "cat_ord": "category"})
     input_df["cat_ord"] = input_df["cat_ord"].cat.as_ordered()
-    result_df = input_df.groupby("cat").agg(grp_col_dict)
+    result_df = input_df.groupby("cat", observed=False).agg(grp_col_dict)

     # create expected dataframe
     cat_index = pd.CategoricalIndex(
@@ -1289,7 +1289,7 @@ def test_groupby_combined_aggs_cat_cols(grp_col_dict, exp_data):

     input_df = input_df.astype({"cat": "category", "cat_ord": "category"})
     input_df["cat_ord"] = input_df["cat_ord"].cat.as_ordered()
-    result_df = input_df.groupby("cat").agg(grp_col_dict)
+    result_df = input_df.groupby("cat", observed=False).agg(grp_col_dict)

     # create expected dataframe
     cat_index = pd.CategoricalIndex(

pandas/tests/groupby/test_apply.py

+1 -1
@@ -883,7 +883,7 @@ def test_apply_multi_level_name(category):
     df = DataFrame(
         {"A": np.arange(10), "B": b, "C": list(range(10)), "D": list(range(10))}
     ).set_index(["A", "B"])
-    result = df.groupby("B").apply(lambda x: x.sum())
+    result = df.groupby("B", observed=False).apply(lambda x: x.sum())
     tm.assert_frame_equal(result, expected)
     assert df.index.names == ["A", "B"]

pandas/tests/groupby/test_categorical.py

+22 -10
@@ -739,7 +739,7 @@ def test_categorical_series(series, data):
     # Group the given series by a series with categorical data type such that group A
     # takes indices 0 and 3 and group B indices 1 and 2, obtaining the values mapped in
     # the given data.
-    groupby = series.groupby(Series(list("ABBA"), dtype="category"))
+    groupby = series.groupby(Series(list("ABBA"), dtype="category"), observed=False)
     result = groupby.aggregate(list)
     expected = Series(data, index=CategoricalIndex(data.keys()))
     tm.assert_series_equal(result, expected)
@@ -1115,7 +1115,7 @@ def test_groupby_multiindex_categorical_datetime():
             "values": np.arange(9),
         }
     )
-    result = df.groupby(["key1", "key2"]).mean()
+    result = df.groupby(["key1", "key2"], observed=False).mean()

     idx = MultiIndex.from_product(
         [
@@ -1291,8 +1291,8 @@ def test_seriesgroupby_observed_apply_dict(df_cat, observed, index, data):

 def test_groupby_categorical_series_dataframe_consistent(df_cat):
     # GH 20416
-    expected = df_cat.groupby(["A", "B"])["C"].mean()
-    result = df_cat.groupby(["A", "B"]).mean()["C"]
+    expected = df_cat.groupby(["A", "B"], observed=False)["C"].mean()
+    result = df_cat.groupby(["A", "B"], observed=False).mean()["C"]
     tm.assert_series_equal(result, expected)


@@ -1303,11 +1303,11 @@ def test_groupby_categorical_axis_1(code):
     cat = Categorical.from_codes(code, categories=list("abc"))
     msg = "DataFrame.groupby with axis=1 is deprecated"
     with tm.assert_produces_warning(FutureWarning, match=msg):
-        gb = df.groupby(cat, axis=1)
+        gb = df.groupby(cat, axis=1, observed=False)
     result = gb.mean()
     msg = "The 'axis' keyword in DataFrame.groupby is deprecated"
     with tm.assert_produces_warning(FutureWarning, match=msg):
-        gb2 = df.T.groupby(cat, axis=0)
+        gb2 = df.T.groupby(cat, axis=0, observed=False)
     expected = gb2.mean().T
     tm.assert_frame_equal(result, expected)

@@ -1478,7 +1478,7 @@ def test_series_groupby_categorical_aggregation_getitem():
     df = DataFrame(d)
     cat = pd.cut(df["foo"], np.linspace(0, 20, 5))
     df["range"] = cat
-    groups = df.groupby(["range", "baz"], as_index=True, sort=True)
+    groups = df.groupby(["range", "baz"], as_index=True, sort=True, observed=False)
     result = groups["foo"].agg("mean")
     expected = groups.agg("mean")["foo"]
     tm.assert_series_equal(result, expected)
@@ -1539,7 +1539,7 @@ def test_read_only_category_no_sort():
         {"a": [1, 3, 5, 7], "b": Categorical([1, 1, 2, 2], categories=Index(cats))}
     )
     expected = DataFrame(data={"a": [2.0, 6.0]}, index=CategoricalIndex(cats, name="b"))
-    result = df.groupby("b", sort=False).mean()
+    result = df.groupby("b", sort=False, observed=False).mean()
     tm.assert_frame_equal(result, expected)


@@ -1583,7 +1583,7 @@ def test_sorted_missing_category_values():
         dtype="category",
     )

-    result = df.groupby(["bar", "foo"]).size().unstack()
+    result = df.groupby(["bar", "foo"], observed=False).size().unstack()

     tm.assert_frame_equal(result, expected)

@@ -1748,7 +1748,7 @@ def test_groupby_categorical_indices_unused_categories():
             "col": range(3),
         }
     )
-    grouped = df.groupby("key", sort=False)
+    grouped = df.groupby("key", sort=False, observed=False)
     result = grouped.indices
     expected = {
         "b": np.array([0, 1], dtype="intp"),
@@ -2013,3 +2013,15 @@ def test_many_categories(as_index, sort, index_kind, ordered):
     expected = DataFrame({"a": Series(index), "b": data})

     tm.assert_frame_equal(result, expected)
+
+
+@pytest.mark.parametrize("cat_columns", ["a", "b", ["a", "b"]])
+@pytest.mark.parametrize("keys", ["a", "b", ["a", "b"]])
+def test_groupby_default_depr(cat_columns, keys):
+    # GH#43999
+    df = DataFrame({"a": [1, 1, 2, 3], "b": [4, 5, 6, 7]})
+    df[cat_columns] = df[cat_columns].astype("category")
+    msg = "The default of observed=False is deprecated"
+    klass = FutureWarning if set(cat_columns) & set(keys) else None
+    with tm.assert_produces_warning(klass, match=msg):
+        df.groupby(keys)
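
The new test expects a warning only when at least one grouping key is a categorical column. A small pandas-free sketch that enumerates the nine parametrized cases shows which ones that is; `OPTIONS` and `expects_warning` are names invented here for illustration:

```python
from itertools import product

# The same three options are used for both parametrized arguments.
OPTIONS = ["a", "b", ["a", "b"]]


def expects_warning(cat_columns, keys) -> bool:
    # set("a") == {"a"}: treating a string as a set of names works here only
    # because the column names are single characters.
    return bool(set(cat_columns) & set(keys))


cases = [(c, k, expects_warning(c, k)) for c, k in product(OPTIONS, OPTIONS)]

for cat_columns, keys, warns in cases:
    print(f"categorical={cat_columns!r:12} keys={keys!r:12} warns={warns}")
```

Only the two cases where the single categorical column and the single key differ expect no warning.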

pandas/tests/groupby/test_groupby.py

+2 -2
@@ -1926,7 +1926,7 @@ def test_empty_groupby(

     df = df.iloc[:0]

-    gb = df.groupby(keys, group_keys=False, dropna=dropna)[columns]
+    gb = df.groupby(keys, group_keys=False, dropna=dropna, observed=False)[columns]

     def get_result(**kwargs):
         if method == "attr":
@@ -2638,7 +2638,7 @@ def test_datetime_categorical_multikey_groupby_indices():
             "c": Categorical.from_codes([-1, 0, 1], categories=[0, 1]),
         }
     )
-    result = df.groupby(["a", "b"]).indices
+    result = df.groupby(["a", "b"], observed=False).indices
     expected = {
         ("a", Timestamp("2018-01-01 00:00:00")): np.array([0]),
         ("b", Timestamp("2018-02-01 00:00:00")): np.array([1]),
