Commit 072408e

TomAugspurger authored and jreback committed
ENH: Support nested renaming / selection (#26399)
1 parent 59df3e0 commit 072408e

10 files changed, +337 -31 lines changed

doc/source/user_guide/groupby.rst

+61-13
Original file line number | Diff line number | Diff line change
@@ -568,6 +568,67 @@ For a grouped ``DataFrame``, you can rename in a similar manner:
568568
'mean': 'bar',
569569
'std': 'baz'}))
570570
571+
.. _groupby.aggregate.named:
572+
573+
Named Aggregation
574+
~~~~~~~~~~~~~~~~~
575+
576+
.. versionadded:: 0.25.0
577+
578+
To support column-specific aggregation *with control over the output column names*, pandas
579+
accepts the special syntax in :meth:`GroupBy.agg`, known as "named aggregation", where
580+
581+
- The keywords are the *output* column names
582+
- The values are tuples whose first element is the column to select
583+
and the second element is the aggregation to apply to that column. Pandas
584+
provides the ``pandas.NamedAgg`` namedtuple with the fields ``['column', 'aggfunc']``
585+
to make it clearer what the arguments are. As usual, the aggregation can
586+
be a callable or a string alias.
587+
588+
.. ipython:: python
589+
590+
animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
591+
'height': [9.1, 6.0, 9.5, 34.0],
592+
'weight': [7.9, 7.5, 9.9, 198.0]})
593+
animals
594+
595+
animals.groupby("kind").agg(
596+
min_height=pd.NamedAgg(column='height', aggfunc='min'),
597+
max_height=pd.NamedAgg(column='height', aggfunc='max'),
598+
average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
599+
)
600+
601+
602+
``pandas.NamedAgg`` is just a ``namedtuple``. Plain tuples are allowed as well.
603+
604+
.. ipython:: python
605+
606+
animals.groupby("kind").agg(
607+
min_height=('height', 'min'),
608+
max_height=('height', 'max'),
609+
average_weight=('weight', np.mean),
610+
)
611+
612+
613+
If your desired output column names are not valid Python identifiers, construct a dictionary
614+
and unpack it as keyword arguments:
615+
616+
.. ipython:: python
617+
618+
animals.groupby("kind").agg(**{
619+
'total weight': pd.NamedAgg(column='weight', aggfunc=sum),
620+
})
621+
622+
Additional keyword arguments are not passed through to the aggregation functions. Only pairs
623+
of ``(column, aggfunc)`` should be passed as ``**kwargs``. If your aggregation function
624+
requires additional arguments, partially apply them with :func:`functools.partial`.
625+
626+
.. note::
627+
628+
For Python 3.5 and earlier, the order of ``**kwargs`` in a function was not
629+
preserved. This means that the output column ordering would not be
630+
consistent. To ensure consistent ordering, the keys (and so output columns)
631+
will always be sorted for Python 3.5.
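The ordering rule in this note can be sketched without pandas. The helper below is hypothetical (`collect_output_names` and its `preserves_kwargs_order` flag are not part of the commit); it only mimics the idea that keyword order is kept where the interpreter preserves it and is sorted otherwise.

```python
from collections import OrderedDict


def collect_output_names(preserves_kwargs_order=True, **named_aggs):
    """Return output column names in the order this sketch would emit them.

    Hypothetical helper for illustration only; pandas decides this
    internally based on the running Python version.
    """
    if not preserves_kwargs_order:
        # Fallback for interpreters that do not keep **kwargs order:
        # sort by keyword so the result is at least deterministic.
        named_aggs = OrderedDict(sorted(named_aggs.items()))
    return list(named_aggs)


as_written = collect_output_names(
    min_height=("height", "min"),
    average_weight=("weight", "mean"),
)
sorted_fallback = collect_output_names(
    preserves_kwargs_order=False,
    min_height=("height", "min"),
    average_weight=("weight", "mean"),
)
print(as_written)       # keyword order as typed
print(sorted_fallback)  # alphabetical order
```

With sorting, the output columns may not match the order in which the keywords were written, which is the trade-off the note describes.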
571632

572633
Applying different functions to DataFrame columns
573634
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -588,19 +649,6 @@ must be either implemented on GroupBy or available via :ref:`dispatching
588649
589650
grouped.agg({'C': 'sum', 'D': 'std'})
590651
591-
.. note::
592-
593-
If you pass a dict to ``aggregate``, the ordering of the output columns is
594-
non-deterministic. If you want to be sure the output columns will be in a specific
595-
order, you can use an ``OrderedDict``. Compare the output of the following two commands:
596-
597-
.. ipython:: python
598-
599-
from collections import OrderedDict
600-
601-
grouped.agg({'D': 'std', 'C': 'mean'})
602-
grouped.agg(OrderedDict([('D', 'std'), ('C', 'mean')]))
603-
604652
.. _groupby.aggregate.cython:
605653

606654
Cython-optimized aggregation functions
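
The named-aggregation section added to groupby.rst says that extra arguments should be bound with `functools.partial` and that names which are not valid identifiers go through dictionary unpacking. A minimal sketch combining both, assuming a pandas version with named aggregation (0.25.0 or later); `quantile_of` is a hypothetical helper, not something this commit adds:

```python
from functools import partial

import pandas as pd

animals = pd.DataFrame({
    "kind": ["cat", "dog", "cat", "dog"],
    "height": [9.1, 6.0, 9.5, 34.0],
    "weight": [7.9, 7.5, 9.9, 198.0],
})


def quantile_of(values, q):
    """Hypothetical helper: return the q-th quantile of one group's values."""
    return values.quantile(q)


result = animals.groupby("kind").agg(
    min_height=("height", "min"),
    average_weight=("weight", "mean"),
    # Names that are not valid identifiers are passed via dict unpacking.
    # The extra argument q is bound up front because .agg will not forward it.
    **{"q75 height": ("height", partial(quantile_of, q=0.75))},
)
print(result)
```

Binding `q` with `partial` keeps the `(column, aggfunc)` pair shape intact, which is what the relabeling path checks for.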

doc/source/whatsnew/v0.25.0.rst

+41
@@ -19,6 +19,47 @@ These are the changes in pandas 0.25.0. See :ref:`release` for a full changelog
1919
including other versions of pandas.
2020

2121

22+
Enhancements
23+
~~~~~~~~~~~~
24+
25+
.. _whatsnew_0250.enhancements.agg_relabel:
26+
27+
Groupby Aggregation with Relabeling
28+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29+
30+
Pandas has added special groupby behavior, known as "named aggregation", for naming the
31+
output columns when applying multiple aggregation functions to specific columns (:issue:`18366`).
32+
33+
.. ipython:: python
34+
35+
animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
36+
'height': [9.1, 6.0, 9.5, 34.0],
37+
'weight': [7.9, 7.5, 9.9, 198.0]})
38+
animals
39+
animals.groupby("kind").agg(
40+
min_height=pd.NamedAgg(column='height', aggfunc='min'),
41+
max_height=pd.NamedAgg(column='height', aggfunc='max'),
42+
average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
43+
)
44+
45+
Pass the desired column names as the ``**kwargs`` to ``.agg``. The values of ``**kwargs``
46+
should be tuples where the first element is the column selection, and the second element is the
47+
aggregation function to apply. Pandas provides the ``pandas.NamedAgg`` namedtuple to make it clearer
48+
what the arguments to the function are, but plain tuples are accepted as well.
49+
50+
.. ipython:: python
51+
52+
animals.groupby("kind").agg(
53+
min_height=('height', 'min'),
54+
max_height=('height', 'max'),
55+
average_weight=('weight', np.mean),
56+
)
57+
58+
Named aggregation is the recommended replacement for the deprecated "dict-of-dicts"
59+
approach to naming the output of column-specific aggregations (:ref:`whatsnew_0200.api_breaking.deprecate_group_agg_dict`).
60+
61+
See :ref:`groupby.aggregate.named` for more.
62+
2263
.. _whatsnew_0250.enhancements.other:
2364

2465
Other Enhancements

pandas/__init__.py

+1-1
@@ -65,7 +65,7 @@
6565
to_numeric, to_datetime, to_timedelta,
6666

6767
# misc
68-
np, Grouper, factorize, unique, value_counts,
68+
np, Grouper, factorize, unique, value_counts, NamedAgg,
6969
array, Categorical, set_eng_float_format, Series, DataFrame,
7070
Panel)
7171

pandas/core/api.py

+1-1
@@ -21,7 +21,7 @@
2121
DatetimeTZDtype,
2222
)
2323
from pandas.core.arrays import Categorical, array
24-
from pandas.core.groupby import Grouper
24+
from pandas.core.groupby import Grouper, NamedAgg
2525
from pandas.io.formats.format import set_eng_float_format
2626
from pandas.core.index import (Index, CategoricalIndex, Int64Index,
2727
UInt64Index, RangeIndex, Float64Index,

pandas/core/base.py

+9-5
@@ -340,11 +340,15 @@ def _aggregate(self, arg, *args, **kwargs):
340340
def nested_renaming_depr(level=4):
341341
# deprecation of nested renaming
342342
# GH 15931
343-
warnings.warn(
344-
("using a dict with renaming "
345-
"is deprecated and will be removed in a future "
346-
"version"),
347-
FutureWarning, stacklevel=level)
343+
msg = textwrap.dedent("""\
344+
using a dict with renaming is deprecated and will be removed
345+
in a future version.
346+
347+
For column-specific groupby renaming, use named aggregation
348+
349+
>>> df.groupby(...).agg(name=('column', aggfunc))
350+
""")
351+
warnings.warn(msg, FutureWarning, stacklevel=level)
348352

349353
# if we have a dict of any non-scalars
350354
# eg. {'A' : ['mean']}, normalize all to
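
The reworked deprecation message above relies on `textwrap.dedent` with a leading backslash so the multi-line text can be indented in the source yet emitted flush-left. A standalone sketch of that pattern follows; `warn_nested_renaming` is a hypothetical stand-in for the nested helper in `base.py`, and the indentation inside the string is an assumption, since the diff view does not preserve it.

```python
import textwrap
import warnings


def warn_nested_renaming(stacklevel=2):
    """Hypothetical stand-in that emits the multi-line deprecation message."""
    # The backslash after the opening quotes suppresses a leading blank line;
    # dedent then strips the common indentation from every line.
    msg = textwrap.dedent("""\
        using a dict with renaming is deprecated and will be removed
        in a future version.

        For column-specific groupby renaming, use named aggregation

            >>> df.groupby(...).agg(name=('column', aggfunc))
        """)
    warnings.warn(msg, FutureWarning, stacklevel=stacklevel)


# Record the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_nested_renaming()

first_line = str(caught[0].message).splitlines()[0]
print(first_line)
```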

pandas/core/groupby/__init__.py

+2-2
@@ -1,4 +1,4 @@
1-
from pandas.core.groupby.groupby import GroupBy # noqa: F401
21
from pandas.core.groupby.generic import ( # noqa: F401
3-
SeriesGroupBy, DataFrameGroupBy)
2+
DataFrameGroupBy, NamedAgg, SeriesGroupBy)
3+
from pandas.core.groupby.groupby import GroupBy # noqa: F401
44
from pandas.core.groupby.grouper import Grouper # noqa: F401

pandas/core/groupby/generic.py

+120-8
@@ -6,15 +6,18 @@
66
which here returns a DataFrameGroupBy object.
77
"""
88

9-
from collections import OrderedDict, abc
9+
from collections import OrderedDict, abc, namedtuple
1010
import copy
1111
from functools import partial
1212
from textwrap import dedent
13+
import typing
14+
from typing import Any, Callable, List, Union
1315
import warnings
1416

1517
import numpy as np
1618

1719
from pandas._libs import Timestamp, lib
20+
from pandas.compat import PY36
1821
from pandas.errors import AbstractMethodError
1922
from pandas.util._decorators import Appender, Substitution
2023

@@ -41,6 +44,10 @@
4144

4245
from pandas.plotting._core import boxplot_frame_groupby
4346

47+
NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"])
48+
# TODO(typing) the return value on this callable should be any *scalar*.
49+
AggScalar = Union[str, Callable[..., Any]]
50+
4451

4552
class NDFrameGroupBy(GroupBy):
4653

@@ -144,8 +151,18 @@ def _cython_agg_blocks(self, how, alt=None, numeric_only=True,
144151
return new_items, new_blocks
145152

146153
def aggregate(self, func, *args, **kwargs):
147-
148154
_level = kwargs.pop('_level', None)
155+
156+
relabeling = func is None and _is_multi_agg_with_relabel(**kwargs)
157+
if relabeling:
158+
func, columns, order = _normalize_keyword_aggregation(kwargs)
159+
160+
kwargs = {}
161+
elif func is None:
162+
# nicer error message
163+
raise TypeError("Must provide 'func' or tuples of "
164+
"'(column, aggfunc)'.")
165+
149166
result, how = self._aggregate(func, _level=_level, *args, **kwargs)
150167
if how is None:
151168
return result
@@ -179,6 +196,10 @@ def aggregate(self, func, *args, **kwargs):
179196
self._insert_inaxis_grouper_inplace(result)
180197
result.index = np.arange(len(result))
181198

199+
if relabeling:
200+
result = result[order]
201+
result.columns = columns
202+
182203
return result._convert(datetime=True)
183204

184205
agg = aggregate
@@ -791,11 +812,8 @@ def _aggregate_multiple_funcs(self, arg, _level):
791812
# list of functions / function names
792813
columns = []
793814
for f in arg:
794-
if isinstance(f, str):
795-
columns.append(f)
796-
else:
797-
# protect against callables without names
798-
columns.append(com.get_callable_name(f))
815+
columns.append(com.get_callable_name(f) or f)
816+
799817
arg = zip(columns, arg)
800818

801819
results = OrderedDict()
@@ -1296,6 +1314,26 @@ class DataFrameGroupBy(NDFrameGroupBy):
12961314
A
12971315
1 1 2 0.590716
12981316
2 3 4 0.704907
1317+
1318+
To control the output names with different aggregations per column,
1319+
pandas supports "named aggregation"
1320+
1321+
>>> df.groupby("A").agg(
1322+
... b_min=pd.NamedAgg(column="B", aggfunc="min"),
1323+
... c_sum=pd.NamedAgg(column="C", aggfunc="sum"))
1324+
b_min c_sum
1325+
A
1326+
1 1 -1.956929
1327+
2 3 -0.322183
1328+
1329+
- The keywords are the *output* column names
1330+
- The values are tuples whose first element is the column to select
1331+
and the second element is the aggregation to apply to that column.
1332+
Pandas provides the ``pandas.NamedAgg`` namedtuple with the fields
1333+
``['column', 'aggfunc']`` to make it clearer what the arguments are.
1334+
As usual, the aggregation can be a callable or a string alias.
1335+
1336+
See :ref:`groupby.aggregate.named` for more.
12991337
""")
13001338

13011339
@Substitution(see_also=_agg_see_also_doc,
@@ -1304,7 +1342,7 @@ class DataFrameGroupBy(NDFrameGroupBy):
13041342
klass='DataFrame',
13051343
axis='')
13061344
@Appender(_shared_docs['aggregate'])
1307-
def aggregate(self, arg, *args, **kwargs):
1345+
def aggregate(self, arg=None, *args, **kwargs):
13081346
return super().aggregate(arg, *args, **kwargs)
13091347

13101348
agg = aggregate
@@ -1577,3 +1615,77 @@ def groupby_series(obj, col=None):
15771615
return results
15781616

15791617
boxplot = boxplot_frame_groupby
1618+
1619+
1620+
def _is_multi_agg_with_relabel(**kwargs):
1621+
"""
1622+
Check whether the kwargs passed to .agg look like multi-agg with relabeling.
1623+
1624+
Parameters
1625+
----------
1626+
**kwargs : dict
1627+
1628+
Returns
1629+
-------
1630+
bool
1631+
1632+
Examples
1633+
--------
1634+
>>> _is_multi_agg_with_relabel(a='max')
1635+
False
1636+
>>> _is_multi_agg_with_relabel(a_max=('a', 'max'),
1637+
... a_min=('a', 'min'))
1638+
True
1639+
>>> _is_multi_agg_with_relabel()
1640+
False
1641+
"""
1642+
return all(
1643+
isinstance(v, tuple) and len(v) == 2
1644+
for v in kwargs.values()
1645+
) and len(kwargs) > 0
1646+
1647+
1648+
def _normalize_keyword_aggregation(kwargs):
1649+
"""
1650+
Normalize user-provided "named aggregation" kwargs.
1651+
1652+
Transforms from the new ``Dict[str, NamedAgg]`` style kwargs
1653+
to the old ``OrderedDict[str, List[scalar]]``.
1654+
1655+
Parameters
1656+
----------
1657+
kwargs : dict
1658+
1659+
Returns
1660+
-------
1661+
aggspec : dict
1662+
The transformed kwargs.
1663+
columns : List[str]
1664+
The user-provided keys.
1665+
order : List[Tuple[str, str]]
1666+
Pairs of the input and output column names.
1667+
1668+
Examples
1669+
--------
1670+
>>> _normalize_keyword_aggregation({'output': ('input', 'sum')})
1671+
(OrderedDict([('input', ['sum'])]), ('output',), [('input', 'sum')])
1672+
"""
1673+
if not PY36:
1674+
kwargs = OrderedDict(sorted(kwargs.items()))
1675+
1676+
# Normalize the aggregation functions as Dict[column, List[func]],
1677+
# process normally, then fixup the names.
1678+
# TODO(Py35): When we drop python 3.5, change this to
1679+
# defaultdict(list)
1680+
aggspec = OrderedDict() # type: typing.OrderedDict[str, List[AggScalar]]
1681+
order = []
1682+
columns, pairs = list(zip(*kwargs.items()))
1683+
1684+
for name, (column, aggfunc) in zip(columns, pairs):
1685+
if column in aggspec:
1686+
aggspec[column].append(aggfunc)
1687+
else:
1688+
aggspec[column] = [aggfunc]
1689+
order.append((column,
1690+
com.get_callable_name(aggfunc) or aggfunc))
1691+
return aggspec, columns, order
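
The two helpers added to `generic.py` above first detect named-aggregation kwargs and then regroup them by input column. The sketch below restates that logic in plain Python so it can be run without pandas. All three function names are hypothetical, and `callable_name` is only a simplified stand-in for pandas' internal `com.get_callable_name`.

```python
from collections import OrderedDict


def callable_name(func):
    """Simplified stand-in for the internal name lookup (assumption)."""
    return getattr(func, "__name__", None)


def looks_like_named_agg(**kwargs):
    """True only if there is at least one kwarg and every value is a 2-tuple."""
    return len(kwargs) > 0 and all(
        isinstance(v, tuple) and len(v) == 2 for v in kwargs.values()
    )


def normalize_named_aggs(named_aggs):
    """Regroup {output: (column, aggfunc)} into {column: [aggfunc, ...]}."""
    aggspec = OrderedDict()
    order = []
    columns = tuple(named_aggs)  # user-facing output names, in given order
    for column, aggfunc in named_aggs.values():
        aggspec.setdefault(column, []).append(aggfunc)
        # Remember which (input column, function name) pair feeds each output.
        order.append((column, callable_name(aggfunc) or aggfunc))
    return aggspec, columns, order


spec = {
    "min_height": ("height", "min"),
    "max_height": ("height", max),
    "total_weight": ("weight", "sum"),
}
aggspec, columns, order = normalize_named_aggs(spec)
print(looks_like_named_agg(**spec), columns, order)
```

The `order` list is what lets the caller select the intermediate columns in the requested sequence before assigning the user-provided names.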

pandas/tests/api/test_api.py

+1
@@ -47,6 +47,7 @@ class TestPDApi(Base):
4747
'DatetimeTZDtype',
4848
'Int8Dtype', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype',
4949
'UInt8Dtype', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype',
50+
'NamedAgg',
5051
]
5152

5253
# these are already deprecated; awaiting removal
