ENH: Experimental Higher Order Methods API #45557
Changes from 6 commits
9a194fa
77b6f1e
f00e37a
2b98c4d
0d96b50
3557735
e377168
87fc57f
c3403d9
daf04b9
f996f5e
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
.. _homs: | ||
|
||
:orphan: | ||
|
||
{{ header }} | ||
|
||
*************************** | ||
pandas Higher Order Methods | ||
*************************** | ||
|
||
pandas is experimenting with improving the behavior of higher order methods (HOMs). These | ||
are methods that take a function as an argument, often a user-defined function (UDF). | ||
They include ``.apply``, ``.agg``, ``.transform``, and ``.filter``. The goal is to make | ||
these methods behave in a more predictable and consistent manner, reducing the complexity | ||
of their implementation, and improving performance where possible. This page details the | ||
differences between the old and new behaviors, as well as providing some context behind | ||
each change that is being made. | ||
|
||
Many changes are planned. To give users a reasonable transition path, all changes are | ||
behind an experimental ``api.use_hom`` option. When enabled, pandas HOMs are subject to | ||
breaking changes without notice. | ||
Users can opt into the new behavior and provide feedback. Once the improvements have | ||
been made, this option will be declared no longer experimental. At this point, any | ||
breaking changes will happen only when preceded by a ``FutureWarning`` and when | ||
pandas releases a major version. After a period of community feedback, and when the | ||
behavior is deemed ready for release, pandas will then raise a ``FutureWarning`` that | ||
the default value of this option will be set to ``True`` in a future version. Once the | ||
default is ``True``, users can still override it to ``False``. After a sufficient | ||
amount of time, pandas will remove this option altogether and only the new behavior | ||
will remain. | ||
|
||
``DataFrame.agg`` with list-likes | ||
--------------------------------- | ||
|
||
Previously, using ``DataFrame.agg`` with a list-like argument would transpose the result when | ||
compared with just providing a single aggregation function. | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}) | ||
|
||
df.agg("sum") | ||
df.agg(["sum"]) | ||
|
||
This transpose no longer occurs, making the result more consistent. | ||
|
||
.. ipython:: python | ||
|
||
with pd.option_context("api.use_hom", True): | ||
result = df.agg(["sum"]) | ||
result | ||
|
||
with pd.option_context("api.use_hom", True): | ||
result = df.agg(["sum", "mean"]) | ||
result | ||
|
||
``DataFrame.groupby(...).agg`` with list-likes | ||
---------------------------------------------- | ||
|
||
Previously, using ``DataFrame.groupby(...).agg`` with a list-like argument would put the | ||
columns as the first level of the resulting hierarchical columns. The result is | ||
that the columns for each aggregation function are separated, inconsistent with the result | ||
for a single aggregator. | ||
|
||
.. ipython:: python | ||
|
||
df.groupby("a").agg("sum") | ||
df.groupby("a").agg(["sum", "min"]) | ||
|
||
Now the levels are swapped, so that the columns for each aggregation are together. | ||
|
||
.. ipython:: python | ||
|
||
with pd.option_context("api.use_hom", True): | ||
result = df.groupby("a").agg(["sum", "min"]) | ||
result |
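
The two layout changes described in the documentation diff above can be approximated with released pandas. This is a minimal sketch, not the PR's implementation: it does not touch `api.use_hom` (assumed to exist only on this PR's branch), and the variable names are illustrative. It uses a transpose and `pd.concat` to emulate the layouts the new option is described as producing.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# DataFrame.agg: a single function gives a Series indexed by column name,
# while a one-element list gives a DataFrame with the function name as the index.
single = df.agg("sum")
listed = df.agg(["sum"])

# Emulation (assumption) of the layout described for the new behavior:
# column names stay on the index, function names become the columns.
emulated = listed.T

# groupby(...).agg with a list: the original column names form the first
# level of the resulting column MultiIndex.
legacy = df.groupby("a").agg(["sum", "min"])

# Emulation (assumption) of the swapped layout: the function name is the
# first level, so the columns for each aggregation sit together.
by_func = pd.concat(
    {name: df.groupby("a").agg(name) for name in ["sum", "min"]}, axis=1
)

print(list(legacy.columns))
print(list(by_func.columns))
```

Building `by_func` with `concat` rather than `swaplevel` keeps the aggregations in the order they were requested, which matches how `hom_list_like` in this PR concatenates per-function results with `keys`.
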
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,7 +22,10 @@ | |
|
||
import numpy as np | ||
|
||
- from pandas._config import option_context | ||
+ from pandas._config import ( | ||
+ get_option, | ||
+ option_context, | ||
+ ) | ||
|
||
from pandas._libs import lib | ||
from pandas._typing import ( | ||
|
@@ -168,7 +171,10 @@ def agg(self) -> DataFrame | Series | None: | |
return self.agg_dict_like() | ||
elif is_list_like(arg): | ||
# we require a list, but not a 'str' | ||
- return self.agg_list_like() | ||
+ if get_option("api.use_hom"): | ||
+ return self.hom_list_like("agg") | ||
+ else: | ||
+ return self.agg_list_like() | ||
|
||
if callable(arg): | ||
f = com.get_cython_func(arg) | ||
|
@@ -442,6 +448,80 @@ def agg_list_like(self) -> DataFrame | Series: | |
) | ||
return concatenated.reindex(full_ordered_index, copy=False) | ||
|
||
def hom_list_single_arg( | ||
self, method: str, a: AggFuncTypeBase, result_dim: int | None | ||
) -> tuple[int | None, AggFuncTypeBase | None, DataFrame | Series | None]: | ||
name = None | ||
result = None | ||
try: | ||
if isinstance(a, (tuple, list)): | ||
# Handle (name, value) pairs | ||
name, a = a | ||
else: | ||
name = com.get_callable_name(a) or a | ||
result = getattr(self.obj, method)(a) | ||
if result_dim is None: | ||
result_dim = getattr(result, "ndim", 0) | ||
elif getattr(result, "ndim", 0) != result_dim: | ||
raise ValueError("cannot combine transform and aggregation operations") | ||
except (TypeError, DataError): | ||
jreback marked this conversation as resolved.
|
||
warnings.warn( | ||
f"{name} did not aggregate successfully. If any error is " | ||
Review comment: why do we want to warn here? (e.g. new api can we just raise)
Reply: Right - new API will just raise. Starting out, I'd like to maintain as much consistency between |
||
"raised this will raise in a future version of pandas. " | ||
"Drop these columns/ops to avoid this warning.", | ||
FutureWarning, | ||
stacklevel=find_stack_level(), | ||
) | ||
|
||
return result_dim, name, result | ||
|
||
def hom_list_like(self, method: str) -> DataFrame | Series: | ||
""" | ||
Compute aggregation in the case of a list-like argument. | ||
|
||
Returns | ||
------- | ||
Result of aggregation. | ||
""" | ||
from pandas.core.reshape.concat import concat | ||
|
||
obj = self.obj | ||
arg = cast(List[AggFuncTypeBase], self.f) | ||
|
||
results = [] | ||
keys = [] | ||
result_dim = None | ||
|
||
for a in arg: | ||
result_dim, name, new_res = self.hom_list_single_arg(method, a, result_dim) | ||
if new_res is not None: | ||
results.append(new_res) | ||
keys.append(name) | ||
|
||
# if we are empty | ||
if not len(results): | ||
raise ValueError("no results") | ||
|
||
try: | ||
concatenated = concat(results, keys=keys, axis=1, sort=False) | ||
except TypeError: | ||
# we are concatting non-NDFrame objects, | ||
# e.g. a list of scalars | ||
from pandas import Series | ||
|
||
result = Series(results, index=keys, name=obj.name) | ||
return result | ||
else: | ||
# Concat uses the first index to determine the final indexing order. | ||
# The union of a shorter first index with the other indices causes | ||
# the index sorting to be different from the order of the aggregating | ||
# functions. Reindex if this is the case. | ||
index_size = concatenated.index.size | ||
full_ordered_index = next( | ||
result.index for result in results if result.index.size == index_size | ||
) | ||
return concatenated.reindex(full_ordered_index, copy=False) | ||
|
||
def agg_dict_like(self) -> DataFrame | Series: | ||
""" | ||
Compute aggregation in the case of a dict-like argument. | ||
|
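
The concatenation step at the end of `hom_list_like` can be sketched in isolation. The helper below, `combine_results`, is a hypothetical name and not part of pandas or of this PR. It mirrors the logic shown in the diff under the assumption that each per-function result is either a Series or a scalar: results are joined with `concat(..., keys=keys, axis=1)`, and a `TypeError` (raised when the results are not pandas objects) falls back to building a Series.

```python
import pandas as pd


def combine_results(results, keys):
    """Hypothetical helper mirroring the final step of hom_list_like."""
    try:
        # One column per aggregation function, labeled by its key.
        combined = pd.concat(results, keys=keys, axis=1, sort=False)
    except TypeError:
        # concat rejects non-pandas objects such as scalars, so a Series
        # keyed by function name is built instead.
        return pd.Series(results, index=keys)
    # Reindex to the first result whose index has the full length, so the
    # row order follows that result rather than the union order.
    full_index = next(
        r.index for r in results if r.index.size == combined.index.size
    )
    return combined.reindex(full_index)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Each DataFrame aggregation yields a Series, so concat succeeds.
frame_result = combine_results([df.agg("sum"), df.agg("mean")], ["sum", "mean"])

# Each Series aggregation yields a scalar, so the fallback path is taken.
col = df["a"]
series_result = combine_results([col.agg("sum"), col.agg("mean")], ["sum", "mean"])

print(frame_result)
print(series_result)
```

With `axis=1`, the frame case keeps the original column names on the index, which is the non-transposed layout the documentation diff describes.
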
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,6 +26,8 @@ | |
|
||
import numpy as np | ||
|
||
from pandas._config import get_option | ||
|
||
from pandas._libs import reduction as libreduction | ||
from pandas._typing import ( | ||
ArrayLike, | ||
|
@@ -876,6 +878,8 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs) | |
result.columns = columns | ||
|
||
if result is None: | ||
if get_option("api.use_hom"): | ||
return self._hom_agg(func, args, kwargs) | ||
|
||
# grouper specific aggregations | ||
if self.grouper.nkeys > 1: | ||
|
@@ -926,6 +930,28 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs) | |
|
||
return result | ||
|
||
def _hom_agg(self, func, args, kwargs): | ||
if args or kwargs: | ||
# test_pass_args_kwargs gets here (with and without as_index) | ||
# can't return early | ||
result = self._aggregate_frame(func, *args, **kwargs) | ||
|
||
elif self.axis == 1 and self.grouper.nkeys == 1: | ||
# _aggregate_multiple_funcs does not allow self.axis == 1 | ||
# Note: axis == 1 precludes 'not self.as_index', see __init__ | ||
result = self._aggregate_frame(func) | ||
return result | ||
else: | ||
# test_groupby_as_index_series_scalar gets here | ||
# with 'not self.as_index' | ||
return self._python_agg_general(func, *args, **kwargs) | ||
|
||
if not self.as_index: | ||
self._insert_inaxis_grouper_inplace(result) | ||
result.index = Index(range(len(result))) | ||
Review comment: some of this looks familiar. is it as de-duplicated as it can reasonably get?
Reply: Perhaps not, but will be changed when implementing "agg always aggs" (#35725). In other words, duplicating further will likely have to be undone in the future. |
||
|
||
return result | ||
|
||
agg = aggregate | ||
|
||
def _iterate_slices(self) -> Iterable[Series]: | ||
|
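
Both dispatch points in this PR check `get_option("api.use_hom")` at call time, which is why `pd.option_context` is enough to opt in for a single block. The sketch below illustrates that gating pattern only. Because `api.use_hom` is assumed to exist only on this PR's branch, it uses the existing `display.max_rows` option as a stand-in, and `dispatch` and `SENTINEL` are hypothetical names.

```python
import pandas as pd

# Stand-in (assumption): a released option is used in place of "api.use_hom".
OPTION = "display.max_rows"
SENTINEL = 7


def dispatch():
    """Hypothetical dispatcher: the option is read each time it is called."""
    if pd.get_option(OPTION) == SENTINEL:
        return "new path"
    return "legacy path"


before = pd.get_option(OPTION)

# Inside the context manager the option is overridden...
with pd.option_context(OPTION, SENTINEL):
    inside = dispatch()

# ...and on exit the previous value is restored.
after = pd.get_option(OPTION)

print(inside, before == after)
```

Reading the option inside the method, rather than at import time, is what lets the same installed pandas serve both behaviors during the experimental period.
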
Review comment: certainly could be done in followups
Review comment: `use_hom` is enabled.
Reply: Will link to doc strings here; going to hold off modifying the public doc-strings until this is ready.
Reply: Done. I only listed methods that are currently being modified when using `api.use_hom=True`. Will expand the list as modifications occur.