DEPR: DataFrameGroupBy.apply operating on the group keys #52477

Merged
merged 10 commits into from
Apr 12, 2023
4 changes: 2 additions & 2 deletions doc/source/user_guide/cookbook.rst
@@ -459,7 +459,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
df
# List the size of the animals with the highest weight.
df.groupby("animal").apply(lambda subf: subf["size"][subf["weight"].idxmax()])
df.groupby("animal")[["size", "weight"]].apply(lambda subf: subf["size"][subf["weight"].idxmax()])
`Using get_group
<https://stackoverflow.com/questions/14734533/how-to-access-pandas-groupby-dataframe-by-key>`__
@@ -482,7 +482,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
return pd.Series(["L", avg_weight, True], index=["size", "weight", "adult"])
expected_df = gb.apply(GrowUp)
expected_df = gb[["size", "weight"]].apply(GrowUp)
expected_df
`Expanding apply
14 changes: 10 additions & 4 deletions doc/source/user_guide/groupby.rst
@@ -430,6 +430,12 @@ This is mainly syntactic sugar for the alternative, which is much more verbose:
Additionally, this method avoids recomputing the internal grouping information
derived from the passed key.

You can also include the grouping columns if you want to operate on them.

.. ipython:: python
grouped[["A", "B"]].sum()
.. _groupby.iterating-label:

Iterating through groups
@@ -1067,7 +1073,7 @@ missing values with the ``ffill()`` method.
).set_index("date")
df_re
df_re.groupby("group").resample("1D").ffill()
df_re.groupby("group")[["val"]].resample("1D").ffill()
.. _groupby.filter:

@@ -1233,13 +1239,13 @@ the argument ``group_keys`` which defaults to ``True``. Compare

.. ipython:: python
df.groupby("A", group_keys=True).apply(lambda x: x)
df.groupby("A", group_keys=True)[["B", "C", "D"]].apply(lambda x: x)
with

.. ipython:: python
df.groupby("A", group_keys=False).apply(lambda x: x)
df.groupby("A", group_keys=False)[["B", "C", "D"]].apply(lambda x: x)
Numba Accelerated Routines
@@ -1722,7 +1728,7 @@ column index name will be used as the name of the inserted column:
result = {"b_sum": x["b"].sum(), "c_mean": x["c"].mean()}
return pd.Series(result, name="metrics")
result = df.groupby("a").apply(compute_metrics)
result = df.groupby("a")[["b", "c"]].apply(compute_metrics)
result
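A minimal, self-contained sketch of the updated ``compute_metrics`` call may help here. The sample frame below is hypothetical (it is not the one from the user guide); the point is only that selecting ``["b", "c"]`` after ``groupby("a")`` keeps the grouping column out of the frame passed to the function.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 1, 0], "c": [1, 2, 3, 4]})


def compute_metrics(x):
    # x holds only the selected columns ("b" and "c") for one group.
    result = {"b_sum": x["b"].sum(), "c_mean": x["c"].mean()}
    return pd.Series(result, name="metrics")


# Selecting the columns after groupby excludes the grouping column "a"
# from the data handed to compute_metrics.
result = df.groupby("a")[["b", "c"]].apply(compute_metrics)
print(result)
```

Each group yields one row, indexed by the values of ``a``, with one column per key in the returned Series.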
22 changes: 17 additions & 5 deletions doc/source/whatsnew/v0.14.0.rst
@@ -328,13 +328,25 @@ More consistent behavior for some groupby methods:

- groupby ``head`` and ``tail`` now act more like ``filter`` rather than an aggregation:

.. ipython:: python
.. code-block:: ipython
df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
g = df.groupby('A')
g.head(1) # filters DataFrame
In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [2]: g = df.groupby('A')
In [3]: g.head(1) # filters DataFrame
Out[3]:
A B
0 1 2
2 5 6
In [4]: g.apply(lambda x: x.head(1)) # used to simply fall-through
Out[4]:
A B
A
1 0 1 2
5 2 5 6
g.apply(lambda x: x.head(1)) # used to simply fall-through
- groupby head and tail respect column selection:

93 changes: 87 additions & 6 deletions doc/source/whatsnew/v0.18.1.rst
@@ -77,9 +77,52 @@ Previously you would have to do this to get a rolling window mean per-group:
df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8, "B": np.arange(40)})
df
.. ipython:: python
.. code-block:: ipython
df.groupby("A").apply(lambda x: x.rolling(4).B.mean())
In [1]: df.groupby("A").apply(lambda x: x.rolling(4).B.mean())
Out[1]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
10 8.5
11 9.5
12 10.5
13 11.5
14 12.5
15 13.5
16 14.5
17 15.5
18 16.5
19 17.5
2 20 NaN
21 NaN
22 NaN
23 21.5
24 22.5
25 23.5
26 24.5
27 25.5
28 26.5
29 27.5
30 28.5
31 29.5
3 32 NaN
33 NaN
34 NaN
35 33.5
36 34.5
37 35.5
38 36.5
39 37.5
Name: B, dtype: float64
Now you can do:

@@ -101,15 +144,53 @@ For ``.resample(..)`` type of operations, previously you would have to:
df
.. ipython:: python
.. code-block:: ipython
df.groupby("group").apply(lambda x: x.resample("1D").ffill())
In [1]: df.groupby("group").apply(lambda x: x.resample("1D").ffill())
Out[1]:
group val
group date
1 2016-01-03 1 5
2016-01-04 1 5
2016-01-05 1 5
2016-01-06 1 5
2016-01-07 1 5
2016-01-08 1 5
2016-01-09 1 5
2016-01-10 1 6
2 2016-01-17 2 7
2016-01-18 2 7
2016-01-19 2 7
2016-01-20 2 7
2016-01-21 2 7
2016-01-22 2 7
2016-01-23 2 7
2016-01-24 2 8
Now you can do:

.. ipython:: python
.. code-block:: ipython
df.groupby("group").resample("1D").ffill()
In [1]: df.groupby("group").resample("1D").ffill()
Out[1]:
group val
group date
1 2016-01-03 1 5
2016-01-04 1 5
2016-01-05 1 5
2016-01-06 1 5
2016-01-07 1 5
2016-01-08 1 5
2016-01-09 1 5
2016-01-10 1 6
2 2016-01-17 2 7
2016-01-18 2 7
2016-01-19 2 7
2016-01-20 2 7
2016-01-21 2 7
2016-01-22 2 7
2016-01-23 2 7
2016-01-24 2 8
.. _whatsnew_0181.enhancements.method_chain:

1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.1.0.rst
@@ -199,6 +199,7 @@ Other API changes

Deprecations
~~~~~~~~~~~~
- Deprecated :meth:`.DataFrameGroupBy.apply` and methods on the objects returned by :meth:`.DataFrameGroupBy.resample` operating on the grouping column(s); select the columns to operate on after groupby to either explicitly include or exclude the groupings and avoid the ``FutureWarning`` (:issue:`7155`)
- Deprecated silently dropping unrecognized timezones when parsing strings to datetimes (:issue:`18702`)
- Deprecated :meth:`DataFrame._data` and :meth:`Series._data`, use public APIs instead (:issue:`33333`)
- Deprecated :meth:`.GroupBy.all` and :meth:`.GroupBy.any` with datetime64 or :class:`PeriodDtype` values, matching the :class:`Series` and :class:`DataFrame` deprecations (:issue:`34479`)
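The first deprecation entry above amounts to a small migration recipe: select columns after ``groupby`` so that including or excluding the grouping column is explicit. The sketch below uses a hypothetical frame and counts the columns each group receives; it is an illustration under those assumptions, not part of the PR.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"key": ["x", "x", "y"], "val": [1, 2, 3]})

# Explicitly exclude the grouping column: each group only has "val".
excluded = df.groupby("key")[["val"]].apply(lambda g: g.shape[1])

# Explicitly include the grouping column: each group has "key" and "val".
included = df.groupby("key")[["key", "val"]].apply(lambda g: g.shape[1])

print(excluded)
print(included)
```

Because a selection is present in both calls, neither one depends on the implicit behavior that the deprecation targets.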
26 changes: 13 additions & 13 deletions pandas/core/frame.py
@@ -8595,20 +8595,20 @@ def update(
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
... 'Parrot', 'Parrot'],
... 'Max Speed': [380., 370., 24., 26.]})
>>> df.groupby("Animal", group_keys=True).apply(lambda x: x)
Animal Max Speed
>>> df.groupby("Animal", group_keys=True)[['Max Speed']].apply(lambda x: x)
Max Speed
Animal
Falcon 0 Falcon 380.0
1 Falcon 370.0
Parrot 2 Parrot 24.0
3 Parrot 26.0
>>> df.groupby("Animal", group_keys=False).apply(lambda x: x)
Animal Max Speed
0 Falcon 380.0
1 Falcon 370.0
2 Parrot 24.0
3 Parrot 26.0
Falcon 0 380.0
1 370.0
Parrot 2 24.0
3 26.0
>>> df.groupby("Animal", group_keys=False)[['Max Speed']].apply(lambda x: x)
Max Speed
0 380.0
1 370.0
2 24.0
3 26.0
"""
)
)
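To see the effect of ``group_keys`` together with the explicit column selection, here is a short sketch built on the same ``Animal`` / ``Max Speed`` frame as the docstring. It uses ``head(1)`` rather than the identity function, which is a choice made for this illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
        "Max Speed": [380.0, 370.0, 24.0, 26.0],
    }
)

# group_keys=True: the group label becomes an outer index level.
with_keys = (
    df.groupby("Animal", group_keys=True)[["Max Speed"]].apply(lambda g: g.head(1))
)

# group_keys=False: only the original row labels are kept.
without_keys = (
    df.groupby("Animal", group_keys=False)[["Max Speed"]].apply(lambda g: g.head(1))
)

print(with_keys)
print(without_keys)
```

In both results the ``Animal`` column is absent from the values, since only ``Max Speed`` was selected.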
80 changes: 50 additions & 30 deletions pandas/core/groupby/groupby.py
@@ -260,7 +260,7 @@ class providing the base-class of operations.
each group together into a Series, including setting the index as
appropriate:
>>> g1.apply(lambda x: x.C.max() - x.B.min())
>>> g1[['B', 'C']].apply(lambda x: x.C.max() - x.B.min())
A
a 5
b 2
@@ -1487,6 +1487,16 @@ def f(g):
        with option_context("mode.chained_assignment", None):
            try:
                result = self._python_apply_general(f, self._selected_obj)
                if (
                    not isinstance(self.obj, Series)
                    and self._selection is None
                    and self._selected_obj.shape != self._obj_with_exclusions.shape
                ):
                    warnings.warn(
                        message=_apply_groupings_depr.format(type(self).__name__),
                        category=FutureWarning,
                        stacklevel=find_stack_level(),
                    )
            except TypeError:
                # gh-20949
                # try again, with .apply acting as a filtering
@@ -2645,55 +2655,55 @@ def resample(self, rule, *args, **kwargs):
Downsample the DataFrame into 3 minute bins and sum the values of
the timestamps falling into a bin.
>>> df.groupby('a').resample('3T').sum()
a b
>>> df.groupby('a')[['b']].resample('3T').sum()
b
a
0 2000-01-01 00:00:00 0 2
2000-01-01 00:03:00 0 1
5 2000-01-01 00:00:00 5 1
0 2000-01-01 00:00:00 2
2000-01-01 00:03:00 1
5 2000-01-01 00:00:00 1
Upsample the series into 30 second bins.
>>> df.groupby('a').resample('30S').sum()
a b
>>> df.groupby('a')[['b']].resample('30S').sum()
b
a
0 2000-01-01 00:00:00 0 1
2000-01-01 00:00:30 0 0
2000-01-01 00:01:00 0 1
2000-01-01 00:01:30 0 0
2000-01-01 00:02:00 0 0
2000-01-01 00:02:30 0 0
2000-01-01 00:03:00 0 1
5 2000-01-01 00:02:00 5 1
0 2000-01-01 00:00:00 1
2000-01-01 00:00:30 0
2000-01-01 00:01:00 1
2000-01-01 00:01:30 0
2000-01-01 00:02:00 0
2000-01-01 00:02:30 0
2000-01-01 00:03:00 1
5 2000-01-01 00:02:00 1
Resample by month. Values are assigned to the month of the period.
>>> df.groupby('a').resample('M').sum()
a b
>>> df.groupby('a')[['b']].resample('M').sum()
b
a
0 2000-01-31 0 3
5 2000-01-31 5 1
0 2000-01-31 3
5 2000-01-31 1
Downsample the series into 3 minute bins as above, but close the right
side of the bin interval.
>>> df.groupby('a').resample('3T', closed='right').sum()
a b
>>> df.groupby('a')[['b']].resample('3T', closed='right').sum()
b
a
0 1999-12-31 23:57:00 0 1
2000-01-01 00:00:00 0 2
5 2000-01-01 00:00:00 5 1
0 1999-12-31 23:57:00 1
2000-01-01 00:00:00 2
5 2000-01-01 00:00:00 1
Downsample the series into 3 minute bins and close the right side of
the bin interval, but label each bin using the right edge instead of
the left.
>>> df.groupby('a').resample('3T', closed='right', label='right').sum()
a b
>>> df.groupby('a')[['b']].resample('3T', closed='right', label='right').sum()
b
a
0 2000-01-01 00:00:00 0 1
2000-01-01 00:03:00 0 2
5 2000-01-01 00:03:00 5 1
0 2000-01-01 00:00:00 1
2000-01-01 00:03:00 2
5 2000-01-01 00:03:00 1
"""
from pandas.core.resample import get_resampler_for_grouping
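The updated ``resample`` docstring examples can be reproduced with a small standalone script. This sketch assumes a four-row, minute-frequency frame shaped like the one the docstring outputs imply, and spells the frequencies as ``"min"`` and ``"3min"`` instead of ``'T'`` and ``'3T'``; both of those are choices made here, not taken from the PR.

```python
import pandas as pd

# Assumed setup: four one-minute timestamps, with one row in group a=5.
idx = pd.date_range("2000-01-01", periods=4, freq="min")
df = pd.DataFrame({"a": [0, 0, 5, 0], "b": [1, 1, 1, 1]}, index=idx)

# Select "b" after groupby so the grouping column "a" is not resampled.
result = df.groupby("a")[["b"]].resample("3min").sum()
print(result)
```

The result has ``a`` and the bin timestamp as index levels and a single ``b`` column, matching the shape of the new docstring output.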

@@ -4309,3 +4319,13 @@ def _insert_quantile_level(idx: Index, qs: npt.NDArray[np.float64]) -> MultiInde
else:
mi = MultiIndex.from_product([idx, qs])
return mi


# GH#7155
_apply_groupings_depr = (
    "{}.apply operated on the grouping columns. This behavior is deprecated, "
    "and in a future version of pandas the grouping columns will be excluded "
    "from the operation. Select the columns to operate on after groupby to "
    "either explicitly include or exclude the groupings and silence "
    "this warning."
)
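For projects that want to catch this deprecation early, one option is to escalate the warning to an error by matching the start of its message. The sketch below uses only the standard library. It copies the message template into a local constant (``APPLY_GROUPINGS_DEPR`` is a name made up for this example, written with a space between "to" and "either") and emits the warning by hand as a stand-in for the pandas call, so it makes no claim about which pandas versions actually raise it.

```python
import re
import warnings

# Local copy of the message template, for illustration only.
APPLY_GROUPINGS_DEPR = (
    "{}.apply operated on the grouping columns. This behavior is deprecated, "
    "and in a future version of pandas the grouping columns will be excluded "
    "from the operation. Select the columns to operate on after groupby to "
    "either explicitly include or exclude the groupings and silence "
    "this warning."
)

# The template is filled with the class name, as in type(self).__name__.
message = APPLY_GROUPINGS_DEPR.format("DataFrameGroupBy")

# warnings matches the filter's regex against the start of the message.
pattern = re.escape("DataFrameGroupBy.apply operated on the grouping columns")

with warnings.catch_warnings():
    warnings.filterwarnings("error", message=pattern, category=FutureWarning)
    try:
        # Stand-in for a groupby(...).apply(...) call that triggers the warning.
        warnings.warn(message, FutureWarning)
        escalated = False
    except FutureWarning:
        escalated = True

print(escalated)
```

The same ``filterwarnings`` call with ``"ignore"`` in place of ``"error"`` would hide the message instead, although selecting columns explicitly is the fix the message itself recommends.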