ENH: new .agg for list-likes #43736


Closed · wants to merge 47 commits
Changes from all commits (47 commits)
3dfb779
ENH: new .agg for list-likes
rhshadrach Sep 6, 2021
9ef1eb0
Refactor single arg computation, test fixup
rhshadrach Sep 24, 2021
1974e07
Revert change to GroupBy.agg
rhshadrach Sep 25, 2021
af18184
Merge branch 'master' of https://github.com/pandas-dev/pandas into ne…
rhshadrach Sep 25, 2021
d7b6c7f
Rename option and methods
rhshadrach Sep 25, 2021
af002fc
Merge branch 'master' of https://github.com/pandas-dev/pandas into ne…
rhshadrach Oct 9, 2021
d412b4f
Merge fixups
rhshadrach Oct 9, 2021
0cea15b
BUG/ERR: sparse array cmp methods mismatched len (#43863)
mzeitlin11 Oct 4, 2021
665b304
Add deprecation tag for passing a string for ewm(times=...) (#43873)
mroeschke Oct 4, 2021
214ba4a
Make components of Suffixes Optional (#42544)
scravy Oct 4, 2021
9d6da6d
BUG: Fix dtypes for read_json (#42819)
r-raymond Oct 4, 2021
005598c
TST: dropping of nuisance columns for groupby ops #38815 (#43674)
horaceklai Oct 5, 2021
7afb062
BUG: retain EA dtypes in DataFrame __pos__, __neg__ (#43883)
jbrockmendel Oct 5, 2021
195f9cf
TST: Test Series' settitem with Interval and NaN (#43844)
ElDeveloper Oct 5, 2021
6021c06
PERF: tighter cython declarations, faster __iter__ (#43872)
jbrockmendel Oct 5, 2021
aa0a1d6
PERF: read_csv with memory_map=True when file encoding is UTF-8 (#437…
michal-gh Oct 6, 2021
ef35a19
TYP: enable reportMissingImports (#43790)
twoertwein Oct 6, 2021
eefd0f0
Don't suppress exception chaining for optional dependencies (#43882)
takluyver Oct 6, 2021
d3f5a44
BUG: DataFrame arithmetic with subclass where constructor is not the …
jbrockmendel Oct 6, 2021
1146215
REF: remove _get_attributes_dict (#43895)
jbrockmendel Oct 6, 2021
58ff02d
Annotates `indexers/utils.py` functions that don't return anything wi…
sobolevn Oct 6, 2021
c9b0a6d
CI: Test Python 3.10 on MacOS and Windows too (#43772)
lithomas1 Oct 6, 2021
28c28c7
ENH: ExponentialMovingWindow.sum (#43871)
mroeschke Oct 6, 2021
f157d4d
TST: slow collection in test_algos.py (#43898)
mzeitlin11 Oct 6, 2021
cdc7b4a
ENH: implement ExtensionArray.__array_ufunc__ (#43899)
jbrockmendel Oct 6, 2021
2688ca8
[ENH] introducing IntpHashMap and making unique_label_indices use int…
realead Oct 7, 2021
5fe8d7d
ENH: implement Index.__array_ufunc__ (#43904)
jbrockmendel Oct 7, 2021
a49977c
TST/REF: share/split index tests (#43905)
jbrockmendel Oct 7, 2021
4779171
TST/REF: misplaced Index.putmask tests (#43906)
jbrockmendel Oct 7, 2021
d4ae657
Add clarifications to the docs regarding `to_feather` (#43866)
jmakov Oct 7, 2021
bde9b11
TST/REF: collect/de-dup index tests (#43914)
jbrockmendel Oct 7, 2021
ecab3a2
BENCH: indexing_engines (#43916)
jbrockmendel Oct 7, 2021
adef17c
TST: avoid re-running tests 14 times (#43922)
jbrockmendel Oct 8, 2021
d5716c7
CLN: unnecessary warning-catching (#43919)
jbrockmendel Oct 8, 2021
505ed3f
TST/REF: fixturize (#43918)
jbrockmendel Oct 8, 2021
9ee956b
BUG: NumericIndex.insert (#43933)
jbrockmendel Oct 9, 2021
1e370aa
TST: Skip leaky test on Python 3.10 (#43910)
lithomas1 Oct 9, 2021
acb7650
ENH: EA.tolist (#43920)
jbrockmendel Oct 9, 2021
e12643e
fixed rolling for a decreasing index, added a test for that (#43928)
rosagold Oct 9, 2021
8a454e0
Merge branch 'master' of https://github.com/pandas-dev/pandas into ne…
rhshadrach Oct 10, 2021
3bba371
Added docs
rhshadrach Oct 18, 2021
4a69adf
Merge branch 'master' of https://github.com/pandas-dev/pandas into ne…
rhshadrach Oct 18, 2021
7abdff9
Make quotes consistent
rhshadrach Oct 18, 2021
a72a5eb
Fixup docs
rhshadrach Oct 19, 2021
f8aa318
Merge branch 'master' of https://github.com/pandas-dev/pandas into ne…
rhshadrach Nov 7, 2021
afc27ba
Merge cleanup
rhshadrach Nov 7, 2021
f42eb00
Merge branch 'new_udfs_list_agg' of https://github.com/rhshadrach/pan…
rhshadrach Nov 7, 2021
72 changes: 72 additions & 0 deletions doc/source/user_guide/future_udf_behavior.rst
@@ -0,0 +1,72 @@
.. _future_udf_behavior:

:orphan:

{{ header }}

*******************
Future UDF Behavior
*******************

pandas is experimenting with improving the behavior of methods that take a
user-defined function (UDF). These methods include ``.apply``, ``.agg``, ``.transform``,
and ``.filter``. The goal is to make these methods behave in a more predictable
and consistent manner, to reduce the complexity of their implementation, and to improve
performance where possible. This page details the differences between the old and
new behaviors and provides some context for each change being made.

A great number of changes are planned. To make the transition manageable for users,
all of the changes sit behind an experimental ``future_udf_behavior`` option, which is
currently subject to breaking changes without notice. Users can opt into the new
behavior and provide feedback. Once the improvements have been made, the option will
no longer be considered experimental, and pandas will raise a ``FutureWarning``
announcing that the default value of this option will be changed to ``True`` in a
future version. Once the default is ``True``, users can still override it to ``False``.
After a sufficient amount of time, pandas will remove this option altogether and only
the future behavior will remain.
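
For example, the new behavior can be enabled globally with ``pd.set_option`` or, as in
the examples on this page, locally with ``pd.option_context`` so that the change does
not leak outside the block; a minimal sketch:

.. ipython:: python

    pd.set_option("future_udf_behavior", True)
    pd.get_option("future_udf_behavior")
    pd.set_option("future_udf_behavior", False)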

``DataFrame.agg`` with list-likes
---------------------------------

Previously, using ``DataFrame.agg`` with a list-like argument produced a result that was
transposed compared with providing a single aggregation function.

.. ipython:: python

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

df.agg("sum")
df.agg(["sum"])

Under the future behavior this transpose no longer occurs, making the result consistent
with the single-function case.

.. ipython:: python

with pd.option_context("future_udf_behavior", True):
result = df.agg(["sum"])
result

with pd.option_context("future_udf_behavior", True):
result = df.agg(["sum", "mean"])
result
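
If the previous orientation (aggregation functions as rows) is ever needed, it can be
recovered by transposing the new result; a minimal sketch:

.. ipython:: python

    result.T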

``DataFrame.groupby(...).agg`` with list-likes
----------------------------------------------

Previously, using ``DataFrame.groupby(...).agg`` with a list-like argument placed the
original columns in the first level of the resulting hierarchical columns. As a result,
the columns produced by each aggregation function are separated from one another, which
is inconsistent with the result for a single aggregator.

.. ipython:: python

df.groupby("a").agg("sum")
df.groupby("a").agg(["sum", "min"])

Under the future behavior the levels are swapped, so that the columns for each
aggregation function are grouped together.

.. ipython:: python

with pd.option_context("future_udf_behavior", True):
result = df.groupby("a").agg(["sum", "min"])
result
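
The previous column-major grouping can be recovered by swapping and sorting the column
levels; a minimal sketch (the ordering within each group may differ from the legacy
output):

.. ipython:: python

    result.swaplevel(axis=1).sort_index(axis=1)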
92 changes: 90 additions & 2 deletions pandas/core/apply.py
@@ -22,7 +22,10 @@

import numpy as np

from pandas._config import option_context
from pandas._config import (
get_option,
option_context,
)

from pandas._libs import lib
from pandas._typing import (
@@ -169,7 +172,10 @@ def agg(self) -> DataFrame | Series | None:
return self.agg_dict_like()
elif is_list_like(arg):
# we require a list, but not a 'str'
return self.agg_list_like()
if get_option("future_udf_behavior"):
return self.future_list_like("agg")
else:
return self.agg_list_like()

if callable(arg):
f = com.get_cython_func(arg)
@@ -443,6 +449,88 @@ def agg_list_like(self) -> DataFrame | Series:
)
return concatenated.reindex(full_ordered_index, copy=False)

Review thread on future_list_single_arg:

Member: Same request from the other PR re naming; "future" won't be very helpful for a reader a year from now.

rhshadrach (Member, Author), Sep 29, 2021: The intention is to have "future_" methods alongside the current methods, all with the same prefix so they are easy to identify. Any such method is behind the option "future_udf_behavior", meaning it will only be called when the option is set to True. Assuming we do end up going forward with this new (experimental) behavior, once it is in a good place we will deprecate the option and then remove it.

A year from now, we will still have the option "future_udf_behavior", and in my opinion the "future_" prefix is meaningful and helpful - namely in its connection to this option. It is also the (intended) future behavior of the methods. When the option is removed, the "old" methods are removed and the "future_" methods are renamed by removing the prefix (none are public).

Member: You've clearly given this more thought than I have, so I'm going to stop complaining about this and will instead grumble to myself.

rhshadrach (Member, Author): The name is still up for improvement and suggestions are most welcome, but I wanted to explain why I felt "future" was appropriate.

def future_list_single_arg(
self, method: str, a: AggFuncTypeBase, result_dim: int | None
) -> tuple[int | None, AggFuncTypeBase | None, DataFrame | Series | None]:
name = None
result = None
try:
if isinstance(a, (tuple, list)):
# Handle (name, value) pairs
name, a = a
result = getattr(self.obj, method)(a)
if result_dim is None:
result_dim = getattr(result, "ndim", 0)
elif getattr(result, "ndim", 0) != result_dim:
raise ValueError("cannot combine transform and aggregation operations")
except TypeError:
pass
# make sure we find a good name
if name is None:
name = com.get_callable_name(a) or a
return result_dim, name, result

def future_list_like(self, method: str) -> DataFrame | Series:
"""
Compute aggregation in the case of a list-like argument.

Returns
-------
Result of aggregation.
"""
from pandas.core.reshape.concat import concat

obj = self.obj
arg = cast(List[AggFuncTypeBase], self.f)

results = []
keys = []
result_dim = None
failed_names = []

for a in arg:
result_dim, name, new_res = self.future_list_single_arg(
method, a, result_dim
)
if new_res is not None:
results.append(new_res)
keys.append(name)
else:
failed_names.append(a)

# if we are empty
if not len(results):
raise ValueError("no results")

if len(failed_names) > 0:
warnings.warn(
f"{failed_names} did not aggregate successfully. If any error is "
"raised this will raise in a future version of pandas. "
"Drop these columns/ops to avoid this warning.",
FutureWarning,
stacklevel=find_stack_level(),
)

try:
concatenated = concat(results, keys=keys, axis=1, sort=False)
except TypeError:
# we are concatting non-NDFrame objects,
# e.g. a list of scalars
from pandas import Series

result = Series(results, index=keys, name=obj.name)
return result
else:
# Concat uses the first index to determine the final indexing order.
# The union of a shorter first index with the other indices causes
# the index sorting to be different from the order of the aggregating
# functions. Reindex if this is the case.
index_size = concatenated.index.size
full_ordered_index = next(
result.index for result in results if result.index.size == index_size
)
return concatenated.reindex(full_ordered_index, copy=False)
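
# Illustrative note (not part of the diff): for a DataFrame obj and
# arg == ["sum", "mean"], each getattr(obj, method)(a) call above returns a
# Series indexed by obj's columns, so the concat-with-keys step is roughly
#     concat([obj.agg("sum"), obj.agg("mean")], keys=["sum", "mean"], axis=1)
# which yields one column block per aggregation function, labelled by keys.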

def agg_dict_like(self) -> DataFrame | Series:
"""
Compute aggregation in the case of a dict-like argument.
17 changes: 17 additions & 0 deletions pandas/core/config_init.py
@@ -511,6 +511,23 @@ def use_inf_as_na_cb(key):
validator=is_one_of_factory(["block", "array"]),
)

future_udf_behavior = """
: boolean
Whether to use the future UDF method implementations. Currently experimental.
Defaults to False.
"""


with cf.config_prefix("mode"):
cf.register_option(
"future_udf_behavior",
# Get the default from an environment variable, if set, otherwise defaults
# to False. This environment variable can be set for testing.
os.environ.get("PANDAS_FUTURE_UDF_BEHAVIOR", "false").lower() == "true",
future_udf_behavior,
validator=is_bool,
)
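
# Illustrative note (not part of the diff): os.environ.get runs when pandas is
# imported, so PANDAS_FUTURE_UDF_BEHAVIOR only influences the default value; at
# runtime the option can still be flipped, e.g.
#     pd.set_option("future_udf_behavior", True)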


# user warnings
chained_assignment = """
5 changes: 3 additions & 2 deletions pandas/core/frame.py
@@ -83,6 +83,7 @@
doc,
rewrite_axis_style_signature,
)
from pandas.util._exceptions import find_stack_level
from pandas.util._validators import (
validate_ascending,
validate_axis_style_args,
@@ -10016,7 +10017,7 @@ def _get_data() -> DataFrame:
"version this will raise TypeError. Select only valid "
"columns before calling the reduction.",
FutureWarning,
stacklevel=5,
stacklevel=find_stack_level(),
)

return out
@@ -10049,7 +10050,7 @@ def _get_data() -> DataFrame:
"version this will raise TypeError. Select only valid "
"columns before calling the reduction.",
FutureWarning,
stacklevel=5,
stacklevel=find_stack_level(),
)

if hasattr(result, "dtype"):
26 changes: 26 additions & 0 deletions pandas/core/groupby/generic.py
@@ -25,6 +25,8 @@

import numpy as np

from pandas._config import get_option

from pandas._libs import reduction as libreduction
from pandas._typing import (
ArrayLike,
@@ -873,6 +875,8 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)
result.columns = columns

if result is None:
if get_option("future_udf_behavior"):
return self._future_agg(func, args, kwargs)

# grouper specific aggregations
if self.grouper.nkeys > 1:
@@ -923,6 +927,28 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)

return result

def _future_agg(self, func, args, kwargs):
if args or kwargs:
# test_pass_args_kwargs gets here (with and without as_index)
# can't return early
result = self._aggregate_frame(func, *args, **kwargs)

elif self.axis == 1 and self.grouper.nkeys == 1:
# _aggregate_multiple_funcs does not allow self.axis == 1
# Note: axis == 1 precludes 'not self.as_index', see __init__
result = self._aggregate_frame(func)
return result
else:
# test_groupby_as_index_series_scalar gets here
# with 'not self.as_index'
return self._python_agg_general(func, *args, **kwargs)

if not self.as_index:
self._insert_inaxis_grouper_inplace(result)
result.index = Index(range(len(result)))

return result

agg = aggregate

def _iterate_slices(self) -> Iterable[Series]: