Commit 6b29968

TomAugspurger authored and yehoshuadimarsky committed
API: User-control of result keys in GroupBy.apply (pandas-dev#34998)
1 parent 19ed0e2 commit 6b29968

27 files changed, +543 -185 lines changed

doc/source/user_guide/groupby.rst

+39-12
Original file line number | Diff line number | Diff line change
@@ -1052,7 +1052,14 @@ Some operations on the grouped data might not fit into either the aggregate or
10521052
transform categories. Or, you may simply want GroupBy to infer how to combine
10531053
the results. For these, use the ``apply`` function, which can be substituted
10541054
for both ``aggregate`` and ``transform`` in many standard use cases. However,
1055-
``apply`` can handle some exceptional use cases, for example:
1055+
``apply`` can handle some exceptional use cases.
1056+
1057+
.. note::
1058+
1059+
``apply`` can act as a reducer, transformer, *or* filter function, depending
1060+
on exactly what is passed to it: both the passed function and what you are
1061+
grouping matter. As a result, the grouped column(s) may appear in the output
1062+
and may also be used to set the index.
10561063

10571064
.. ipython:: python
10581065
@@ -1064,16 +1071,14 @@ for both ``aggregate`` and ``transform`` in many standard use cases. However,
10641071
10651072
The dimension of the returned result can also change:
10661073

1067-
.. ipython::
1068-
1069-
In [8]: grouped = df.groupby('A')['C']
1074+
.. ipython:: python
10701075
1071-
In [10]: def f(group):
1072-
....: return pd.DataFrame({'original': group,
1073-
....: 'demeaned': group - group.mean()})
1074-
....:
1076+
grouped = df.groupby('A')['C']
10751077
1076-
In [11]: grouped.apply(f)
1078+
def f(group):
1079+
return pd.DataFrame({'original': group,
1080+
'demeaned': group - group.mean()})
1081+
grouped.apply(f)
10771082
10781083
``apply`` on a Series can operate on a returned value from the applied function,
10791084
that is itself a series, and possibly upcast the result to a DataFrame:
@@ -1088,11 +1093,33 @@ that is itself a series, and possibly upcast the result to a DataFrame:
10881093
s
10891094
s.apply(f)
10901095
1096+
Control grouped column(s) placement with ``group_keys``
1097+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1098+
10911099
.. note::
10921100

1093-
``apply`` can act as a reducer, transformer, *or* filter function, depending on exactly what is passed to it.
1094-
So depending on the path taken, and exactly what you are grouping. Thus the grouped columns(s) may be included in
1095-
the output as well as set the indices.
1101+
If ``group_keys=True`` is specified when calling :meth:`~DataFrame.groupby`,
1102+
functions passed to ``apply`` that return like-indexed outputs will have the
1103+
group keys added to the result index. Previous versions of pandas would add
1104+
the group keys only when the result from the applied function had a different
1105+
index than the input. If ``group_keys`` is not specified, the group keys will
1106+
not be added for like-indexed outputs. In the future this behavior
1107+
will change to always respect ``group_keys``, which defaults to ``True``.
1108+
1109+
.. versionchanged:: 1.5.0
1110+
1111+
To control whether the grouped column(s) are included in the indices, you can use
1112+
the argument ``group_keys``. Compare
1113+
1114+
.. ipython:: python
1115+
1116+
df.groupby("A", group_keys=True).apply(lambda x: x)
1117+
1118+
with
1119+
1120+
.. ipython:: python
1121+
1122+
df.groupby("A", group_keys=False).apply(lambda x: x)
10961123
10971124
Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the
10981125
apply function. If the results from different groups have different dtypes, then
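
Reviewer note: the ``group_keys`` comparison added in the user-guide hunk above can be tried on a small frame. The sketch below is illustrative only. The frame, the ``demean`` helper, and the choice of a single selected column are assumptions, not taken from the commit. The check on the keyed result assumes pandas 1.5 or later, where an explicit ``group_keys=True`` is respected for like-indexed results.

```python
import pandas as pd

# Hypothetical data; only the column names echo the user guide.
df = pd.DataFrame({"A": ["x", "x", "y"], "C": [1.0, 2.0, 3.0]})


def demean(group):
    # Like-indexed output: same index as the group that was passed in.
    return group - group.mean()


# Explicit group_keys=True: the group key becomes an outer index level.
with_keys = df.groupby("A", group_keys=True)["C"].apply(demean)

# Explicit group_keys=False: the result keeps the original index.
without_keys = df.groupby("A", group_keys=False)["C"].apply(demean)

print(with_keys)
print(without_keys)
```

Selecting one column before ``apply`` keeps the example independent of how the grouping column itself is passed to the applied function, which is outside the scope of this commit.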

doc/source/whatsnew/v0.25.0.rst

+8-3
@@ -342,10 +342,15 @@ Now every group is evaluated only a single time.
342342
343343
*New behavior*:
344344

345-
.. ipython:: python
346-
347-
df.groupby("a").apply(func)
345+
.. code-block:: python
348346
347+
In [3]: df.groupby('a').apply(func)
348+
x
349+
y
350+
Out[3]:
351+
a b
352+
0 x 1
353+
1 y 2
349354
350355
Concatenating sparse values
351356
^^^^^^^^^^^^^^^^^^^^^^^^^^^

doc/source/whatsnew/v1.4.0.rst

+13-3
@@ -455,10 +455,20 @@ result's index is not the same as the input's.
455455

456456
*New behavior*:
457457

458-
.. ipython:: python
458+
.. code-block:: ipython
459459
460-
df.groupby(['a']).apply(func)
461-
df.set_index(['a', 'b']).groupby(['a']).apply(func)
460+
In [5]: df.groupby(['a']).apply(func)
461+
Out[5]:
462+
a b c
463+
0 1 3 5
464+
1 2 4 6
465+
466+
In [6]: df.set_index(['a', 'b']).groupby(['a']).apply(func)
467+
Out[6]:
468+
c
469+
a b
470+
1 3 5
471+
2 4 6
462472
463473
Now in both cases it is determined that ``func`` is a transform. In each case,
464474
the result has the same index as the input.

doc/source/whatsnew/v1.5.0.rst

+66-3
@@ -24,10 +24,56 @@ Styler
2424
- Added a new method :meth:`.Styler.concat` which allows adding customised footer rows to visualise additional calculations on the data, e.g. totals and counts etc. (:issue:`43875`, :issue:`46186`)
2525
- :meth:`.Styler.highlight_null` now accepts ``color`` consistently with other builtin methods and deprecates ``null_color`` although this remains backwards compatible (:issue:`45907`)
2626

27-
.. _whatsnew_150.enhancements.enhancement2:
27+
.. _whatsnew_150.enhancements.resample_group_keys:
2828

29-
enhancement2
30-
^^^^^^^^^^^^
29+
Control of index with ``group_keys`` in :meth:`DataFrame.resample`
30+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
31+
32+
The argument ``group_keys`` has been added to the method :meth:`DataFrame.resample`.
33+
As with :meth:`DataFrame.groupby`, this argument controls whether each group is added
34+
to the index in the resample when :meth:`.Resampler.apply` is used.
35+
36+
.. warning::
37+
Not specifying the ``group_keys`` argument will retain the
38+
previous behavior and emit a warning if the result will change
39+
by specifying ``group_keys=False``. In a future version
40+
of pandas, not specifying ``group_keys`` will default to
41+
the same behavior as ``group_keys=False``.
42+
43+
.. ipython:: python
44+
45+
df = pd.DataFrame(
46+
{'a': range(6)},
47+
index=pd.date_range("2021-01-01", periods=6, freq="8H")
48+
)
49+
df.resample("D", group_keys=True).apply(lambda x: x)
50+
df.resample("D", group_keys=False).apply(lambda x: x)
51+
52+
Previously, the resulting index would depend upon the values returned by ``apply``,
53+
as seen in the following example.
54+
55+
.. code-block:: ipython
56+
57+
In [1]: # pandas 1.3
58+
In [2]: df.resample("D").apply(lambda x: x)
59+
Out[2]:
60+
a
61+
2021-01-01 00:00:00 0
62+
2021-01-01 08:00:00 1
63+
2021-01-01 16:00:00 2
64+
2021-01-02 00:00:00 3
65+
2021-01-02 08:00:00 4
66+
2021-01-02 16:00:00 5
67+
68+
In [3]: df.resample("D").apply(lambda x: x.reset_index())
69+
Out[3]:
70+
index a
71+
2021-01-01 0 2021-01-01 00:00:00 0
72+
1 2021-01-01 08:00:00 1
73+
2 2021-01-01 16:00:00 2
74+
2021-01-02 0 2021-01-02 00:00:00 3
75+
1 2021-01-02 08:00:00 4
76+
2 2021-01-02 16:00:00 5
3177
3278
.. _whatsnew_150.enhancements.other:
3379

@@ -345,6 +391,23 @@ that their usage is considered unsafe, and can lead to unexpected results.
345391

346392
See the documentation of :class:`ExcelWriter` for further details.
347393

394+
.. _whatsnew_150.deprecations.group_keys_in_apply:
395+
396+
Using ``group_keys`` with transformers in :meth:`.GroupBy.apply`
397+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
398+
399+
In previous versions of pandas, if it was inferred that the function passed to
400+
:meth:`.GroupBy.apply` was a transformer (i.e. the resulting index was equal to
401+
the input index), the ``group_keys`` argument of :meth:`DataFrame.groupby` and
402+
:meth:`Series.groupby` was ignored and the group keys would never be added to
403+
the index of the result. In the future, the group keys will be added to the index
404+
when the user specifies ``group_keys=True``.
405+
406+
As ``group_keys=True`` is the default value of :meth:`DataFrame.groupby` and
407+
:meth:`Series.groupby`, not specifying ``group_keys`` with a transformer will
408+
raise a ``FutureWarning``. This can be silenced and the previous behavior
409+
retained by specifying ``group_keys=False``.
410+
348411
.. _whatsnew_150.deprecations.other:
349412

350413
Other Deprecations

pandas/core/frame.py

+24-1
@@ -7885,6 +7885,27 @@ def update(
78857885
a 13.0 13.0
78867886
b 12.3 123.0
78877887
NaN 12.3 33.0
7888+
7889+
When using ``.apply()``, use ``group_keys`` to include or exclude the group keys.
7890+
The ``group_keys`` argument defaults to ``True`` (include).
7891+
7892+
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
7893+
... 'Parrot', 'Parrot'],
7894+
... 'Max Speed': [380., 370., 24., 26.]})
7895+
>>> df.groupby("Animal", group_keys=True).apply(lambda x: x)
7896+
Animal Max Speed
7897+
Animal
7898+
Falcon 0 Falcon 380.0
7899+
1 Falcon 370.0
7900+
Parrot 2 Parrot 24.0
7901+
3 Parrot 26.0
7902+
7903+
>>> df.groupby("Animal", group_keys=False).apply(lambda x: x)
7904+
Animal Max Speed
7905+
0 Falcon 380.0
7906+
1 Falcon 370.0
7907+
2 Parrot 24.0
7908+
3 Parrot 26.0
78887909
"""
78897910
)
78907911
@Appender(_shared_docs["groupby"] % _shared_doc_kwargs)
@@ -7895,7 +7916,7 @@ def groupby(
78957916
level: Level | None = None,
78967917
as_index: bool = True,
78977918
sort: bool = True,
7898-
group_keys: bool = True,
7919+
group_keys: bool | lib.NoDefault = no_default,
78997920
squeeze: bool | lib.NoDefault = no_default,
79007921
observed: bool = False,
79017922
dropna: bool = True,
@@ -10840,6 +10861,7 @@ def resample(
1084010861
level=None,
1084110862
origin: str | TimestampConvertibleTypes = "start_day",
1084210863
offset: TimedeltaConvertibleTypes | None = None,
10864+
group_keys: bool | lib.NoDefault = no_default,
1084310865
) -> Resampler:
1084410866
return super().resample(
1084510867
rule=rule,
@@ -10854,6 +10876,7 @@ def resample(
1085410876
level=level,
1085510877
origin=origin,
1085610878
offset=offset,
10879+
group_keys=group_keys,
1085710880
)
1085810881

1085910882
def to_timestamp(
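
Reviewer note: changing the default from ``True`` to ``no_default`` lets the library tell "the caller passed ``True``" apart from "the caller passed nothing". The sketch below is a toy model of that pattern in plain Python, following the rules stated in the release notes. It is not the pandas implementation: ``NoDefault`` here is a hypothetical stand-in for the sentinel type, and ``resolve_group_keys`` and its ``like_indexed`` flag are invented for illustration.

```python
import enum
import warnings


class NoDefault(enum.Enum):
    # Hypothetical stand-in for the sentinel type named in the diff.
    no_default = "NO_DEFAULT"


no_default = NoDefault.no_default


def resolve_group_keys(group_keys=no_default, *, like_indexed):
    """Return True if group keys should be added to the result index.

    Toy model of the rules described in the release notes, not pandas code.
    """
    if group_keys is no_default:
        if like_indexed:
            # Unspecified + transformer: keep the old result, but warn.
            warnings.warn(
                "Not prepending group keys to the result index; "
                "pass group_keys explicitly to silence this warning.",
                FutureWarning,
                stacklevel=2,
            )
            return False
        # Unspecified + differently indexed result: old default of True.
        return True
    # An explicit True or False is always respected.
    return bool(group_keys)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = resolve_group_keys(like_indexed=True)
    explicit = resolve_group_keys(True, like_indexed=True)

print(implicit, explicit, [w.category.__name__ for w in caught])
```

A plain ``True`` default could not support this: the function would have no way to know whether to warn.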

pandas/core/generic.py

+13
@@ -8041,6 +8041,7 @@ def resample(
80418041
level=None,
80428042
origin: str | TimestampConvertibleTypes = "start_day",
80438043
offset: TimedeltaConvertibleTypes | None = None,
8044+
group_keys: bool_t | lib.NoDefault = lib.no_default,
80448045
) -> Resampler:
80458046
"""
80468047
Resample time-series data.
@@ -8115,6 +8116,17 @@ def resample(
81158116
81168117
.. versionadded:: 1.1.0
81178118
8119+
group_keys : bool, optional
8120+
Whether to include the group keys in the result index when using
8121+
``.apply()`` on the resampled object. Not specifying ``group_keys``
8122+
will retain values-dependent behavior from pandas 1.4
8123+
and earlier (see :ref:`pandas 1.5.0 Release notes
8124+
<whatsnew_150.enhancements.resample_group_keys>`
8125+
for examples). In a future version of pandas, the behavior will
8126+
default to the same as specifying ``group_keys=False``.
8127+
8128+
.. versionadded:: 1.5.0
8129+
81188130
Returns
81198131
-------
81208132
pandas.core.Resampler
@@ -8454,6 +8466,7 @@ def resample(
84548466
level=level,
84558467
origin=origin,
84568468
offset=offset,
8469+
group_keys=group_keys,
84578470
)
84588471

84598472
@final

pandas/core/groupby/generic.py

+32-6
@@ -357,6 +357,7 @@ def _wrap_applied_output(
357357
data: Series,
358358
values: list[Any],
359359
not_indexed_same: bool = False,
360+
override_group_keys: bool = False,
360361
) -> DataFrame | Series:
361362
"""
362363
Wrap the output of SeriesGroupBy.apply into the expected result.
@@ -395,7 +396,11 @@ def _wrap_applied_output(
395396
res_ser.name = self.obj.name
396397
return res_ser
397398
elif isinstance(values[0], (Series, DataFrame)):
398-
return self._concat_objects(values, not_indexed_same=not_indexed_same)
399+
return self._concat_objects(
400+
values,
401+
not_indexed_same=not_indexed_same,
402+
override_group_keys=override_group_keys,
403+
)
399404
else:
400405
# GH #6265 #24880
401406
result = self.obj._constructor(
@@ -983,7 +988,11 @@ def _aggregate_item_by_item(self, func, *args, **kwargs) -> DataFrame:
983988
return res_df
984989

985990
def _wrap_applied_output(
986-
self, data: DataFrame, values: list, not_indexed_same: bool = False
991+
self,
992+
data: DataFrame,
993+
values: list,
994+
not_indexed_same: bool = False,
995+
override_group_keys: bool = False,
987996
):
988997

989998
if len(values) == 0:
@@ -1000,7 +1009,11 @@ def _wrap_applied_output(
10001009
# GH9684 - All values are None, return an empty frame.
10011010
return self.obj._constructor()
10021011
elif isinstance(first_not_none, DataFrame):
1003-
return self._concat_objects(values, not_indexed_same=not_indexed_same)
1012+
return self._concat_objects(
1013+
values,
1014+
not_indexed_same=not_indexed_same,
1015+
override_group_keys=override_group_keys,
1016+
)
10041017

10051018
key_index = self.grouper.result_index if self.as_index else None
10061019

@@ -1026,7 +1039,11 @@ def _wrap_applied_output(
10261039
else:
10271040
# values are Series
10281041
return self._wrap_applied_output_series(
1029-
values, not_indexed_same, first_not_none, key_index
1042+
values,
1043+
not_indexed_same,
1044+
first_not_none,
1045+
key_index,
1046+
override_group_keys,
10301047
)
10311048

10321049
def _wrap_applied_output_series(
@@ -1035,6 +1052,7 @@ def _wrap_applied_output_series(
10351052
not_indexed_same: bool,
10361053
first_not_none,
10371054
key_index,
1055+
override_group_keys: bool,
10381056
) -> DataFrame | Series:
10391057
# this is to silence a DeprecationWarning
10401058
# TODO(2.0): Remove when default dtype of empty Series is object
@@ -1058,7 +1076,11 @@ def _wrap_applied_output_series(
10581076
# if any of the sub-series are not indexed the same
10591077
# OR we don't have a multi-index and we have only a
10601078
# single values
1061-
return self._concat_objects(values, not_indexed_same=not_indexed_same)
1079+
return self._concat_objects(
1080+
values,
1081+
not_indexed_same=not_indexed_same,
1082+
override_group_keys=override_group_keys,
1083+
)
10621084

10631085
# still a series
10641086
# path added as of GH 5545
@@ -1069,7 +1091,11 @@ def _wrap_applied_output_series(
10691091

10701092
if not all_indexed_same:
10711093
# GH 8467
1072-
return self._concat_objects(values, not_indexed_same=True)
1094+
return self._concat_objects(
1095+
values,
1096+
not_indexed_same=True,
1097+
override_group_keys=override_group_keys,
1098+
)
10731099

10741100
# Combine values
10751101
# vstack+constructor is faster than concat and handles MI-columns
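
Reviewer note: the hunks above only thread ``override_group_keys`` down to ``_concat_objects``; the body of that helper is not part of this excerpt. As a rough illustration of the effect being toggled, and assuming the helper ends up concatenating the per-group results, the public ``pd.concat`` shows the difference between prepending keys and leaving the index alone. The per-group Series here are made up.

```python
import pandas as pd

# Made-up per-group results, each keeping its slice of the original index.
group_x = pd.Series([-0.5, 0.5], index=[0, 1])
group_y = pd.Series([0.0], index=[2])

# Passing keys= adds an outer index level holding the group labels.
keyed = pd.concat([group_x, group_y], keys=["x", "y"])

# Without keys= the original index is carried through unchanged.
plain = pd.concat([group_x, group_y])

print(keyed.index.tolist())
print(plain.index.tolist())
```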
