Commit 9b20759

DEPR: DataFrameGroupBy.apply operating on the group keys (#52477)

* DEPR: DataFrameGroupBy.apply operating on the group keys
* Reorder whatsnew
* Remove warnings from pivot, minor refinements
* Handle warning in docs
* Improve warning message
* Add note to user guide
* Improve whatsnew
* Adjust docstrings

1 parent 7eeec0d · commit 9b20759

31 files changed: +705 -256 lines changed

doc/source/user_guide/cookbook.rst (+2 -2)

@@ -459,7 +459,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
        df

        # List the size of the animals with the highest weight.
-       df.groupby("animal").apply(lambda subf: subf["size"][subf["weight"].idxmax()])
+       df.groupby("animal")[["size", "weight"]].apply(lambda subf: subf["size"][subf["weight"].idxmax()])

 `Using get_group
 <https://stackoverflow.com/questions/14734533/how-to-access-pandas-groupby-dataframe-by-key>`__

@@ -482,7 +482,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
        return pd.Series(["L", avg_weight, True], index=["size", "weight", "adult"])

-    expected_df = gb.apply(GrowUp)
+    expected_df = gb[["size", "weight"]].apply(GrowUp)
     expected_df

 `Expanding apply
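The cookbook change above can be exercised end to end. Here is a minimal sketch (the sample frame below is hypothetical, not the cookbook's own data) showing that selecting the non-grouping columns before ``.apply`` reproduces the recipe's answer without touching the grouping column:

```python
import pandas as pd

# Hypothetical sample data in the spirit of the cookbook's animal recipe.
df = pd.DataFrame(
    {
        "animal": ["cat", "dog", "cat", "fish"],
        "size": ["S", "S", "M", "M"],
        "weight": [8, 10, 11, 1],
    }
)

# Selecting the non-grouping columns first keeps "animal" out of each
# sub-frame, so the deprecated grouping-column behavior is never hit.
sizes = df.groupby("animal")[["size", "weight"]].apply(
    lambda subf: subf["size"][subf["weight"].idxmax()]
)
print(sizes)
```

The result is a Series indexed by the group keys, one size per animal.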

doc/source/user_guide/groupby.rst (+10 -4)

@@ -430,6 +430,12 @@ This is mainly syntactic sugar for the alternative, which is much more verbose:
 Additionally, this method avoids recomputing the internal grouping information
 derived from the passed key.

+You can also include the grouping columns if you want to operate on them.
+
+.. ipython:: python
+
+   grouped[["A", "B"]].sum()
+
 .. _groupby.iterating-label:

 Iterating through groups

@@ -1067,7 +1073,7 @@ missing values with the ``ffill()`` method.
    ).set_index("date")
    df_re

-   df_re.groupby("group").resample("1D").ffill()
+   df_re.groupby("group")[["val"]].resample("1D").ffill()

 .. _groupby.filter:

@@ -1233,13 +1239,13 @@ the argument ``group_keys`` which defaults to ``True``. Compare

 .. ipython:: python

-   df.groupby("A", group_keys=True).apply(lambda x: x)
+   df.groupby("A", group_keys=True)[["B", "C", "D"]].apply(lambda x: x)

 with

 .. ipython:: python

-   df.groupby("A", group_keys=False).apply(lambda x: x)
+   df.groupby("A", group_keys=False)[["B", "C", "D"]].apply(lambda x: x)

 Numba Accelerated Routines

@@ -1722,7 +1728,7 @@ column index name will be used as the name of the inserted column:
       result = {"b_sum": x["b"].sum(), "c_mean": x["c"].mean()}
       return pd.Series(result, name="metrics")

-   result = df.groupby("a").apply(compute_metrics)
+   result = df.groupby("a")[["b", "c"]].apply(compute_metrics)

    result
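The ``group_keys`` comparison in the user-guide hunk above comes down to whether the group labels are prepended as an extra index level. A small sketch (the frame below is illustrative, not the user guide's own ``df``):

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "b"], "B": [1, 2, 3]})

# group_keys=True prepends the group label as an outer index level;
# group_keys=False leaves the original index untouched. Selecting
# [["B"]] first keeps the grouping column "A" out of the sub-frames.
kept = df.groupby("A", group_keys=True)[["B"]].apply(lambda x: x)
dropped = df.groupby("A", group_keys=False)[["B"]].apply(lambda x: x)
```

``kept`` carries a two-level index (group key, original label), while ``dropped`` is indistinguishable from the input's row labels.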

doc/source/whatsnew/v0.14.0.rst (+17 -5)

@@ -328,13 +328,25 @@ More consistent behavior for some groupby methods:

 - groupby ``head`` and ``tail`` now act more like ``filter`` rather than an aggregation:

-  .. ipython:: python
+  .. code-block:: ipython

-     df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
-     g = df.groupby('A')
-     g.head(1)  # filters DataFrame
+     In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
+
+     In [2]: g = df.groupby('A')
+
+     In [3]: g.head(1)  # filters DataFrame
+     Out[3]:
+        A  B
+     0  1  2
+     2  5  6
+
+     In [4]: g.apply(lambda x: x.head(1))  # used to simply fall-through
+     Out[4]:
+          A  B
+     A
+     1 0  1  2
+     5 2  5  6

-     g.apply(lambda x: x.head(1))  # used to simply fall-through

 - groupby head and tail respect column selection:
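The v0.14.0 entry above describes ``head``/``tail`` acting as filters. A runnable restatement of its example (checking only the filtering behavior, since the ``apply`` fall-through it contrasts against is exactly what this commit deprecates):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=["A", "B"])
g = df.groupby("A")

# head(1) filters: it returns the first row of each group while
# preserving the original (ungrouped) index, like DataFrame.head.
first_rows = g.head(1)
print(first_rows)
```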

doc/source/whatsnew/v0.18.1.rst (+87 -6)

@@ -77,9 +77,52 @@ Previously you would have to do this to get a rolling window mean per-group:
    df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8, "B": np.arange(40)})
    df

-.. ipython:: python
+.. code-block:: ipython

-   df.groupby("A").apply(lambda x: x.rolling(4).B.mean())
+   In [1]: df.groupby("A").apply(lambda x: x.rolling(4).B.mean())
+   Out[1]:
+   A
+   1  0      NaN
+      1      NaN
+      2      NaN
+      3      1.5
+      4      2.5
+      5      3.5
+      6      4.5
+      7      5.5
+      8      6.5
+      9      7.5
+      10     8.5
+      11     9.5
+      12    10.5
+      13    11.5
+      14    12.5
+      15    13.5
+      16    14.5
+      17    15.5
+      18    16.5
+      19    17.5
+   2  20     NaN
+      21     NaN
+      22     NaN
+      23    21.5
+      24    22.5
+      25    23.5
+      26    24.5
+      27    25.5
+      28    26.5
+      29    27.5
+      30    28.5
+      31    29.5
+   3  32     NaN
+      33     NaN
+      34     NaN
+      35    33.5
+      36    34.5
+      37    35.5
+      38    36.5
+      39    37.5
+   Name: B, dtype: float64

 Now you can do:

@@ -101,15 +144,53 @@ For ``.resample(..)`` type of operations, previously you would have to:

    df

-.. ipython:: python
+.. code-block:: ipython

-   df.groupby("group").apply(lambda x: x.resample("1D").ffill())
+   In [1]: df.groupby("group").apply(lambda x: x.resample("1D").ffill())
+   Out[1]:
+                     group  val
+   group date
+   1     2016-01-03      1    5
+         2016-01-04      1    5
+         2016-01-05      1    5
+         2016-01-06      1    5
+         2016-01-07      1    5
+         2016-01-08      1    5
+         2016-01-09      1    5
+         2016-01-10      1    6
+   2     2016-01-17      2    7
+         2016-01-18      2    7
+         2016-01-19      2    7
+         2016-01-20      2    7
+         2016-01-21      2    7
+         2016-01-22      2    7
+         2016-01-23      2    7
+         2016-01-24      2    8

 Now you can do:

-.. ipython:: python
+.. code-block:: ipython

-   df.groupby("group").resample("1D").ffill()
+   In [1]: df.groupby("group").resample("1D").ffill()
+   Out[1]:
+                     group  val
+   group date
+   1     2016-01-03      1    5
+         2016-01-04      1    5
+         2016-01-05      1    5
+         2016-01-06      1    5
+         2016-01-07      1    5
+         2016-01-08      1    5
+         2016-01-09      1    5
+         2016-01-10      1    6
+   2     2016-01-17      2    7
+         2016-01-18      2    7
+         2016-01-19      2    7
+         2016-01-20      2    7
+         2016-01-21      2    7
+         2016-01-22      2    7
+         2016-01-23      2    7
+         2016-01-24      2    8

 .. _whatsnew_0181.enhancements.method_chain:
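The resample examples above translate directly into the post-deprecation idiom: select the value columns before resampling. A self-contained sketch using the same dates and values as the whatsnew sample:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-01-03", "2016-01-10", "2016-01-17", "2016-01-24"]
        ),
        "group": [1, 1, 2, 2],
        "val": [5, 6, 7, 8],
    }
).set_index("date")

# Selecting [["val"]] first keeps the grouping column out of the
# result, which is the style the deprecation steers users toward.
out = df.groupby("group")[["val"]].resample("1D").ffill()
print(out)
```

Each group is forward-filled over eight consecutive days, giving a 16-row frame indexed by (group, date).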

doc/source/whatsnew/v2.1.0.rst (+1)

@@ -200,6 +200,7 @@ Other API changes

 Deprecations
 ~~~~~~~~~~~~
+- Deprecated :meth:`.DataFrameGroupBy.apply` and methods on the objects returned by :meth:`.DataFrameGroupBy.resample` operating on the grouping column(s); select the columns to operate on after groupby to either explicitly include or exclude the groupings and avoid the ``FutureWarning`` (:issue:`7155`)
 - Deprecated silently dropping unrecognized timezones when parsing strings to datetimes (:issue:`18702`)
 - Deprecated :meth:`DataFrame._data` and :meth:`Series._data`, use public APIs instead (:issue:`33333`)
 - Deprecated :meth:`.Groupby.all` and :meth:`.GroupBy.any` with datetime64 or :class:`PeriodDtype` values, matching the :class:`Series` and :class:`DataFrame` deprecations (:issue:`34479`)

pandas/core/frame.py (+13 -13)

@@ -8595,20 +8595,20 @@ def update(
        >>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
        ...                               'Parrot', 'Parrot'],
        ...                    'Max Speed': [380., 370., 24., 26.]})
-        >>> df.groupby("Animal", group_keys=True).apply(lambda x: x)
-                  Animal  Max Speed
+        >>> df.groupby("Animal", group_keys=True)[['Max Speed']].apply(lambda x: x)
+                  Max Speed
        Animal
-        Falcon 0  Falcon      380.0
-               1  Falcon      370.0
-        Parrot 2  Parrot       24.0
-               3  Parrot       26.0
+        Falcon 0      380.0
+               1      370.0
+        Parrot 2       24.0
+               3       26.0

-        >>> df.groupby("Animal", group_keys=False).apply(lambda x: x)
-           Animal  Max Speed
-        0  Falcon      380.0
-        1  Falcon      370.0
-        2  Parrot       24.0
-        3  Parrot       26.0
+        >>> df.groupby("Animal", group_keys=False)[['Max Speed']].apply(lambda x: x)
+           Max Speed
+        0      380.0
+        1      370.0
+        2       24.0
+        3       26.0
        """
    )
)

pandas/core/groupby/groupby.py (+50 -30)

@@ -260,7 +260,7 @@ class providing the base-class of operations.
    each group together into a Series, including setting the index as
    appropriate:

-    >>> g1.apply(lambda x: x.C.max() - x.B.min())
+    >>> g1[['B', 'C']].apply(lambda x: x.C.max() - x.B.min())
    A
    a    5
    b    2

@@ -1487,6 +1487,16 @@ def f(g):
        with option_context("mode.chained_assignment", None):
            try:
                result = self._python_apply_general(f, self._selected_obj)
+                if (
+                    not isinstance(self.obj, Series)
+                    and self._selection is None
+                    and self._selected_obj.shape != self._obj_with_exclusions.shape
+                ):
+                    warnings.warn(
+                        message=_apply_groupings_depr.format(type(self).__name__),
+                        category=FutureWarning,
+                        stacklevel=find_stack_level(),
+                    )
            except TypeError:
                # gh-20949
                # try again, with .apply acting as a filtering

@@ -2645,55 +2655,55 @@ def resample(self, rule, *args, **kwargs):
        Downsample the DataFrame into 3 minute bins and sum the values of
        the timestamps falling into a bin.

-        >>> df.groupby('a').resample('3T').sum()
-                                 a  b
+        >>> df.groupby('a')[['b']].resample('3T').sum()
+                                 b
        a
-        0   2000-01-01 00:00:00  0  2
-            2000-01-01 00:03:00  0  1
-        5   2000-01-01 00:00:00  5  1
+        0   2000-01-01 00:00:00  2
+            2000-01-01 00:03:00  1
+        5   2000-01-01 00:00:00  1

        Upsample the series into 30 second bins.

-        >>> df.groupby('a').resample('30S').sum()
-                                 a  b
+        >>> df.groupby('a')[['b']].resample('30S').sum()
+                                 b
        a
-        0   2000-01-01 00:00:00  0  1
-            2000-01-01 00:00:30  0  0
-            2000-01-01 00:01:00  0  1
-            2000-01-01 00:01:30  0  0
-            2000-01-01 00:02:00  0  0
-            2000-01-01 00:02:30  0  0
-            2000-01-01 00:03:00  0  1
-        5   2000-01-01 00:02:00  5  1
+        0   2000-01-01 00:00:00  1
+            2000-01-01 00:00:30  0
+            2000-01-01 00:01:00  1
+            2000-01-01 00:01:30  0
+            2000-01-01 00:02:00  0
+            2000-01-01 00:02:30  0
+            2000-01-01 00:03:00  1
+        5   2000-01-01 00:02:00  1

        Resample by month. Values are assigned to the month of the period.

-        >>> df.groupby('a').resample('M').sum()
-                    a  b
+        >>> df.groupby('a')[['b']].resample('M').sum()
+                    b
        a
-        0   2000-01-31  0  3
-        5   2000-01-31  5  1
+        0   2000-01-31  3
+        5   2000-01-31  1

        Downsample the series into 3 minute bins as above, but close the right
        side of the bin interval.

-        >>> df.groupby('a').resample('3T', closed='right').sum()
-                                 a  b
+        >>> df.groupby('a')[['b']].resample('3T', closed='right').sum()
+                                 b
        a
-        0   1999-12-31 23:57:00  0  1
-            2000-01-01 00:00:00  0  2
-        5   2000-01-01 00:00:00  5  1
+        0   1999-12-31 23:57:00  1
+            2000-01-01 00:00:00  2
+        5   2000-01-01 00:00:00  1

        Downsample the series into 3 minute bins and close the right side of
        the bin interval, but label each bin using the right edge instead of
        the left.

-        >>> df.groupby('a').resample('3T', closed='right', label='right').sum()
-                                 a  b
+        >>> df.groupby('a')[['b']].resample('3T', closed='right', label='right').sum()
+                                 b
        a
-        0   2000-01-01 00:00:00  0  1
-            2000-01-01 00:03:00  0  2
-        5   2000-01-01 00:03:00  5  1
+        0   2000-01-01 00:00:00  1
+            2000-01-01 00:03:00  2
+        5   2000-01-01 00:03:00  1
        """
        from pandas.core.resample import get_resampler_for_grouping

@@ -4309,3 +4319,13 @@ def _insert_quantile_level(idx: Index, qs: npt.NDArray[np.float64]) -> MultiIndex
    else:
        mi = MultiIndex.from_product([idx, qs])
    return mi
+
+
+# GH#7155
+_apply_groupings_depr = (
+    "{}.apply operated on the grouping columns. This behavior is deprecated, "
+    "and in a future version of pandas the grouping columns will be excluded "
+    "from the operation. Select the columns to operate on after groupby to "
+    "either explicitly include or exclude the groupings and silence "
+    "this warning."
+)
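The check added in the hunk above fires only when no column selection was made and the selected frame differs in shape from the frame with groupings excluded. A hedged sketch of what that means for user code (the warning is version-dependent: pandas 2.1/2.2 emit it, earlier and later versions may not, so only the warning category is inspected, never the text):

```python
import warnings

import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 2, 3]})

# No selection: on pandas 2.1/2.2 this emits the FutureWarning built
# from _apply_groupings_depr. We only record whether it appeared.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df.groupby("A").apply(lambda x: x.sum())
saw_depr = any(issubclass(w.category, FutureWarning) for w in caught)

# Explicit selection: the grouping column is excluded up front, so the
# deprecation check is skipped and no FutureWarning is raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = df.groupby("A")[["B"]].apply(lambda x: x.sum())
assert not any(issubclass(w.category, FutureWarning) for w in caught)
```

Selecting the columns is the silencing path the warning message recommends, and it produces the same sums either way.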
