-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Implement Keyword Aggregation for DataFrame.agg and Series.agg #29116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 58 commits
7e461a1
1314059
8bcb313
7bc368d
cf5c6c3
5298331
4fb74b5
97209be
ca273ff
1d2ab15
3ca193c
c8f80ed
8c738e9
d4d9ea4
2a6de27
21e09f9
058a8e9
0da68d8
15e3659
438398d
d47b790
832b8d9
5a3b690
4fb86f0
a1369bf
ef981a3
82c8960
c610391
679ba59
2ee2628
31f7033
2acb244
dfbd67a
ff5e60f
7c6c891
3e55fcb
6d74b29
532337e
400ff3e
05af2de
6206fa4
34199ad
15d099c
c56f05f
d3f0620
20ecfda
89b8e6b
8aa1cc9
091ca75
c2d5104
50ebdaf
0484f5e
425c802
d5c2c6c
8bb9714
2607c5d
0545231
0a27889
a66053e
7311ef0
da2ff37
bcc5bc3
0825027
d3c35f5
cef2b50
b96a942
3123284
7bb3bd0
3da2e2a
3ce91fc
1426ee2
5893a0e
cc85db4
0f55073
90d52ba
381a697
238b4cc
f8e1891
66e9b38
f4d8a4f
c3e34a0
0c0dbad
61f6201
88c7751
e2b957a
99f75b2
30b7296
baea583
04bffe6
42091c3
fc13e19
1403426
3d9655e
d78c57c
0487928
7435ac5
f1cd16c
469691c
1bb35b5
7a6f496
3730f7d
cd8d00f
6dddd55
96dc3ed
075b85b
a44471c
0e2eae4
2fb4b27
5e04185
56d0f89
7f4839e
65d578b
3e6a06c
8651447
a7439fe
449d40f
9fd8ec5
736bea2
d20be20
35b2b17
74d6169
0546224
f5f0e68
54ff962
ac57023
484e42c
47e6598
f28b452
81b4186
1fd4b5b
47dc5fe
9190f7f
89de59e
8493383
9a9dd7f
c75c882
fa61db7
00a1ccf
26b380a
165ea83
d6923f2
f6a5cc1
a747ab6
44405e8
faea906
7e30a61
3d20524
05921af
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,225 @@ | ||
from collections import defaultdict | ||
from functools import partial | ||
from typing import Any, DefaultDict, Sequence | ||
|
||
from pandas.core.dtypes.common import is_dict_like, is_list_like | ||
|
||
from pandas.core.base import SpecificationError | ||
import pandas.core.common as com | ||
from pandas.core.indexes.api import Index | ||
|
||
|
||
def reconstruct_func(func, *args, **kwargs): | ||
""" | ||
This is the internal function to reconstruct func given if there is relabeling | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
or not. And also normalize the keyword to get new order of columns; | ||
|
||
If relabeling is True, will return relabeling, reconstructed func, column | ||
names, and the reconstructed order of columns. | ||
If relabeling is False, the columns and order will be None. | ||
""" | ||
relabeling = func is None and is_multi_agg_with_relabel(**kwargs) | ||
if relabeling: | ||
func, columns, order = normalize_keyword_aggregation(kwargs) | ||
|
||
elif isinstance(func, list) and len(func) > len(set(func)): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you do these conditions at the top? it makes the logic easier to grok There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah, moved it up |
||
|
||
# GH 28426 will raise error if duplicated function names are used and | ||
# there is no reassigned name | ||
raise SpecificationError( | ||
"Function names must be unique if there is no new column names assigned" | ||
) | ||
elif func is None: | ||
# nicer error message | ||
raise TypeError("Must provide 'func' or tuples of '(column, aggfunc).") | ||
|
||
func = maybe_mangle_lambdas(func) | ||
if not relabeling: | ||
columns = None | ||
order = None | ||
|
||
return relabeling, func, columns, order | ||
|
||
|
||
def is_multi_agg_with_relabel(**kwargs) -> bool: | ||
""" | ||
Check whether kwargs passed to .agg look like multi-agg with relabeling. | ||
|
||
Parameters | ||
---------- | ||
**kwargs : dict | ||
|
||
Returns | ||
------- | ||
bool | ||
|
||
Examples | ||
-------- | ||
>>> _is_multi_agg_with_relabel(a='max') | ||
False | ||
>>> _is_multi_agg_with_relabel(a_max=('a', 'max'), | ||
... a_min=('a', 'min')) | ||
True | ||
>>> _is_multi_agg_with_relabel() | ||
False | ||
""" | ||
return all(isinstance(v, tuple) and len(v) == 2 for v in kwargs.values()) and ( | ||
len(kwargs) > 0 | ||
) | ||
|
||
|
||
def normalize_keyword_aggregation(kwargs): | ||
""" | ||
Normalize user-provided "named aggregation" kwargs. | ||
|
||
Transforms from the new ``Mapping[str, NamedAgg]`` style kwargs | ||
to the old Dict[str, List[scalar]]]. | ||
|
||
Parameters | ||
---------- | ||
kwargs : dict | ||
|
||
Returns | ||
------- | ||
aggspec : dict | ||
The transformed kwargs. | ||
columns : List[str] | ||
The user-provided keys. | ||
col_idx_order : List[int] | ||
List of columns indices. | ||
|
||
Examples | ||
-------- | ||
>>> _normalize_keyword_aggregation({'output': ('input', 'sum')}) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. needs updating here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. changed this kind of mistakes for all in docstring #30856 |
||
({'input': ['sum']}, ('output',), [('input', 'sum')]) | ||
""" | ||
# Normalize the aggregation functions as Mapping[column, List[func]], | ||
# process normally, then fixup the names. | ||
# TODO: aggspec type: typing.Dict[str, List[AggScalar]] | ||
# May be hitting https://github.com/python/mypy/issues/5958 | ||
# saying it doesn't have an attribute __name__ | ||
aggspec: DefaultDict = defaultdict(list) | ||
order = [] | ||
columns, pairs = list(zip(*kwargs.items())) | ||
|
||
for name, (column, aggfunc) in zip(columns, pairs): | ||
aggspec[column].append(aggfunc) | ||
order.append((column, com.get_callable_name(aggfunc) or aggfunc)) | ||
|
||
# uniquify aggfunc name if duplicated in order list | ||
uniquified_order = _make_unique(order) | ||
|
||
# GH 25719, due to aggspec will change the order of assigned columns in aggregation | ||
# uniquified_aggspec will store uniquified order list and will compare it with order | ||
# based on index | ||
aggspec_order = [ | ||
(column, com.get_callable_name(aggfunc) or aggfunc) | ||
for column, aggfuncs in aggspec.items() | ||
for aggfunc in aggfuncs | ||
] | ||
uniquified_aggspec = _make_unique(aggspec_order) | ||
|
||
# get the new indice of columns by comparison | ||
col_idx_order = Index(uniquified_aggspec).get_indexer(uniquified_order) | ||
return aggspec, columns, col_idx_order | ||
|
||
|
||
def _make_unique(seq): | ||
"""Uniquify aggfunc name of the pairs in the order list | ||
|
||
Examples: | ||
-------- | ||
>>> _make_unique([('a', '<lambda>'), ('a', '<lambda>'), ('b', '<lambda>')]) | ||
[('a', '<lambda>_0'), ('a', '<lambda>_1'), ('b', '<lambda>')] | ||
""" | ||
return [ | ||
(pair[0], "_".join([pair[1], str(seq[:i].count(pair))])) | ||
if seq.count(pair) > 1 | ||
else pair | ||
for i, pair in enumerate(seq) | ||
] | ||
|
||
|
||
# TODO: Can't use, because mypy doesn't like us setting __name__ | ||
# error: "partial[Any]" has no attribute "__name__" | ||
# the type is: | ||
# typing.Sequence[Callable[..., ScalarResult]] | ||
# -> typing.Sequence[Callable[..., ScalarResult]]: | ||
|
||
|
||
def _managle_lambda_list(aggfuncs: Sequence[Any]) -> Sequence[Any]: | ||
""" | ||
Possibly mangle a list of aggfuncs. | ||
|
||
Parameters | ||
---------- | ||
aggfuncs : Sequence | ||
|
||
Returns | ||
------- | ||
mangled: list-like | ||
A new AggSpec sequence, where lambdas have been converted | ||
to have unique names. | ||
|
||
Notes | ||
----- | ||
If just one aggfunc is passed, the name will not be mangled. | ||
""" | ||
if len(aggfuncs) <= 1: | ||
# don't mangle for .agg([lambda x: .]) | ||
return aggfuncs | ||
i = 0 | ||
mangled_aggfuncs = [] | ||
for aggfunc in aggfuncs: | ||
if com.get_callable_name(aggfunc) == "<lambda>": | ||
aggfunc = partial(aggfunc) | ||
aggfunc.__name__ = f"<lambda_{i}>" | ||
i += 1 | ||
mangled_aggfuncs.append(aggfunc) | ||
|
||
return mangled_aggfuncs | ||
|
||
|
||
def maybe_mangle_lambdas(agg_spec: Any) -> Any: | ||
""" | ||
Make new lambdas with unique names. | ||
|
||
Parameters | ||
---------- | ||
agg_spec : Any | ||
An argument to GroupBy.agg. | ||
Non-dict-like `agg_spec` are pass through as is. | ||
For dict-like `agg_spec` a new spec is returned | ||
with name-mangled lambdas. | ||
|
||
Returns | ||
------- | ||
mangled : Any | ||
Same type as the input. | ||
|
||
Examples | ||
-------- | ||
>>> _maybe_mangle_lambdas('sum') | ||
'sum' | ||
|
||
>>> _maybe_mangle_lambdas([lambda: 1, lambda: 2]) # doctest: +SKIP | ||
[<function __main__.<lambda_0>, | ||
<function pandas...._make_lambda.<locals>.f(*args, **kwargs)>] | ||
""" | ||
is_dict = is_dict_like(agg_spec) | ||
if not (is_dict or is_list_like(agg_spec)): | ||
return agg_spec | ||
mangled_aggspec = type(agg_spec)() # dict or OrderdDict | ||
|
||
if is_dict: | ||
for key, aggfuncs in agg_spec.items(): | ||
if is_list_like(aggfuncs) and not is_dict_like(aggfuncs): | ||
mangled_aggfuncs = _managle_lambda_list(aggfuncs) | ||
else: | ||
mangled_aggfuncs = aggfuncs | ||
|
||
mangled_aggspec[key] = mangled_aggfuncs | ||
else: | ||
mangled_aggspec = _managle_lambda_list(agg_spec) | ||
|
||
return mangled_aggspec |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -97,6 +97,7 @@ | |
|
||
from pandas.core import algorithms, common as com, nanops, ops | ||
from pandas.core.accessor import CachedAccessor | ||
from pandas.core.aggregation import reconstruct_func | ||
from pandas.core.arrays import Categorical, ExtensionArray | ||
from pandas.core.arrays.datetimelike import DatetimeLikeArrayMixin as DatetimeLikeArray | ||
from pandas.core.arrays.sparse import SparseFrameAccessor | ||
|
@@ -6618,16 +6619,39 @@ def _gotitem( | |
**_shared_doc_kwargs, | ||
) | ||
@Appender(_shared_docs["aggregate"]) | ||
def aggregate(self, func, axis=0, *args, **kwargs): | ||
def aggregate(self, func=None, axis=0, *args, **kwargs): | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
axis = self._get_axis_number(axis) | ||
|
||
relabeling, func, columns, order = reconstruct_func(func, *args, **kwargs) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this function also used in groupby? did you already make the appropriate changes there? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah, this is exactly the same as part of agg in |
||
if relabeling: | ||
kwargs = {} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why do you need to do this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. right, removed this, not needed |
||
|
||
result = None | ||
try: | ||
result, how = self._aggregate(func, axis=axis, *args, **kwargs) | ||
except TypeError: | ||
pass | ||
if result is None: | ||
return self.apply(func, axis=axis, args=args, **kwargs) | ||
|
||
if relabeling: | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
reordered_indexes = [ | ||
pair[0] for pair in sorted(zip(columns, order), key=lambda t: t[1]) | ||
] | ||
|
||
# when there are more than one column being used in aggregate, the order | ||
# of result will be reversed, and in case the func is not used by other | ||
# columns, there might be NaN values, so separate these two cases | ||
reordered_result = DataFrame(index=columns) | ||
idx = 0 | ||
for col, funcs in func.items(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is extremely inefficient. instead try appending to a list and concat at the end. why do you need .values? that is not very idiomatic here, nor is the dropna There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you can also just reorder the indicies and do an indexing selection would be way better There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah, I reduced half of the code in this loop, i am pretty sure the previous version was due to some order issue. for instance in the example below, i think a correct behaviour should be: df = pd.DataFrame({"A": [1, 2, 1, 2], "B": [1, 2, 3, 4], "C": [3,4,5,6]})
df.agg(ab=("C", np.max), foo=("A", max), bar=("B", "mean"), cat=("A", "min"),
dat=("B", "max"), f=("A", "max"), g=("C", "min")) the column of new aggregated df is : Any advice to simply the code would be much appreciated! |
||
v = reordered_indexes[idx : idx + len(funcs)] | ||
if len(func) > 1: | ||
reordered_result.loc[v, col] = result[col][::-1].dropna().values | ||
else: | ||
reordered_result.loc[v, col] = result[col].values | ||
idx = idx + len(funcs) | ||
result = reordered_result | ||
return result | ||
|
||
def _aggregate(self, arg, axis=0, *args, **kwargs): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you annotate this signature? The return is pretty complicated so would be better to explicitly state what that is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I will try to make it!