PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data #54747

Closed
wants to merge 13 commits into from
158 changes: 158 additions & 0 deletions web/pandas/pdeps/0013-standardize-apply.md
@@ -0,0 +1,158 @@
# PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data

- Created: 24 August 2023
- Status: Under discussion
- Discussion: [#52140](https://github.com/pandas-dev/pandas/issues/52140)
- Author: [Terji Petersen](https://github.com/topper-123)
- Revision: 2

## Abstract

The `apply`, `transform` and `agg` methods have very complex behavior when given callables because in some cases they operate on the elements of a series, in some cases on the series as a whole, and sometimes they try one first and, if that fails, fall back to the other. There is no logical system behind how these behaviors are arranged, so it can be difficult for users to understand these methods.

It is proposed that `apply`, `transform` and `agg` in the future will work as follows:

1. the `agg` and `transform` methods of `Series`, `DataFrame` and `groupby` will always operate series-wise and never element-wise
2. `Series.apply` and `DataFrame.apply` will be deprecated.
3. The current behavior when supplying string to the methods will not be changed.
Contributor

Suggested change
3. The current behavior when supplying string to the methods will not be changed.
3. The current behaviors when supplying strings for the function arguments to the methods will not be changed.

4. `groupby.apply` will not be deprecated (because it behaves differently than `Series.apply` and `DataFrame.apply`)

The above changes mean that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows:

1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.
Member

@MarcoGorelli, Sep 28, 2023

Are we sure this is accurate? For the use cases I've seen of DataFrame.apply, I think DataFrame.agg would probably be more likely as a replacement

Currently, say I do

In [43]: df = pd.DataFrame({'a': ['quetzal', 'quetzal', 'baboon'], 'b': ['panda', 'chinchilla', 'elk']})

In [44]: df
Out[44]:
         a           b
0  quetzal       panda
1  quetzal  chinchilla
2   baboon         elk

In [45]: def func(row):
    ...:     return f"{row['a']} or {row['b']}?"
    ...:

In [46]: df.apply(func, axis=1)
Out[46]:
0         quetzal or panda?
1    quetzal or chinchilla?
2            baboon or elk?
dtype: object

Then if I use DataFrame.map, it'll operate over all elements of the dataframe 1-by-1. Whereas if I use .agg, then I'll preserve what I'm currently doing:

In [47]: df.agg(func, axis=1)
Out[47]:
0         quetzal or panda?
1    quetzal or chinchilla?
2            baboon or elk?
dtype: object

This feels very unnatural though - func here doesn't look like an aggregation at all, so I wouldn't have reached for .agg

Could DataFrame.map get an axis keyword as well, so that DataFrame.map(func, axis=1) preserves the current behaviour of DataFrame.apply(func, axis=1)?

Member

Isn't this a transform? But users don't get to control what gets passed to the UDF they supply - is it column by column, row by row, or the entire frame. Perhaps they should.

Member

@MarcoGorelli, Sep 28, 2023

transform would raise here:

In [53]: df.transform(func, axis=1)
---------------------------------------------------------------------------

ValueError: Function did not transform

And this underscores just how confusing this is

is it column by column, row by row, or the entire frame. Perhaps they should.

yes, exactly, it could be:

  • axis=None: element by element (current behaviour)
  • axis=1: row by row
  • axis=0: column by column

Then, df.apply(lambda row: ..., axis=1) can just become df.map(lambda row: ..., axis=1)

This is a very common way of using apply, e.g.

titanic_df['Person'] = titanic_df[['Age','Sex']].apply(get_person,axis=1)

from https://www.kaggle.com/code/omarelgabry/a-journey-through-titanic

Contributor Author

@topper-123, Sep 28, 2023

@MarcoGorelli, your example is an aggregation though, and func does operate on rows, not elements, so everything works as intended in your agg example.

However, I guess agg linguistically(?) is best understood as a numerical reduction, not a text combination operation, so using the name agg may be a bit unintuitive. My original proposal would have kept apply for situations like this, because the problem I wanted to solve was apply giving subtly different results for Series.apply and DataFrame.apply (including for axis=1).

Your example could be an argument for keeping apply but dropping the element-by-element behavior of Series.apply, i.e. reverting to the original proposal of this PDEP.

I'm -1 on making map operate on rows/columns. It's a strength of the method that it always gives same-shaped results, and it would be confusing if the result changed shape depending on whether an axis parameter is given.

Member

true, it's a horizontal aggregation: it combines multiple columns into 1

I'm just thinking about the upgrade process

I've only ever seen people use DataFrame.apply to create a single column, and if the upgrade process is going to be:

  • Series.apply(func) => Series.map(func)
  • DataFrame.apply(func, axis=1) => DataFrame.agg(func, axis=1)

then that could be confusing, and certainly harder than just grepping for .apply

Contributor

I agree with @MarcoGorelli about the upgrade process. I find changing DataFrame.apply(func, axis=1) to DataFrame.agg(func, axis=1) to be non-intuitive.

But maybe we should consider introducing a method DataFrame.project(func) that is the replacement for DataFrame.apply(func, axis=1). The idea being that you "project" a function onto each row of the DF.

Or should we make DataFrame.transform(func, axis=1) work by transforming each row?

Member

wait let's not add yet another method to the API 😅

Member

(just another voice here to support the notion that requiring agg for this use case feels very unnatural)

2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.
Member

@mroeschke, Sep 18, 2023

Could you add a 4th When users want to neither aggregate nor transform a ..., they should ...

Contributor

I think you need to have the case when users want to have an operation on a row of a DataFrame, they should use DataFrame.agg(func, axis=1). Today a common use case is using DataFrame.apply(func, axis=1). Since func can do anything, it isn't entirely clear that agg with axis=1 will supply each row of the DataFrame to the function, meaning that the corresponding function doesn't need to "aggregate".

Contributor Author

DataFrame.apply(func, axis=1) is just DataFrame.T.apply(func).T, so any issue with axis=1 will also be present with axis=0. So I don't see a special issue with axis=1, except maybe some idioms that are more common with axis=1 (which may be a valid concern, though).

Can you give some example code snippets showing some possible problems?

Member

@Dr-Irv I've added an example here https://github.com/pandas-dev/pandas/pull/54747/files#r1339884642

Inclined to agree that .agg feels really unnatural here

4. Functions that are not applicable to `map`, `agg` or `transform` are considered relatively rare, and in the future users should call these functions directly rather than use the `apply` method.

After the proposed change, uses of `Series.apply` and `DataFrame.apply` will in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where they cannot be substituted by `map`, `agg` or `transform`, it is proposed to accept that users will have to find alternative ways to apply the functions, typically by applying the functions manually and possibly concatenating the results.

## Motivation

The current behavior of `apply`, `agg` and `transform` is very complex and therefore difficult to understand for non-expert users. The main difficulty is that the methods sometimes apply callables to the elements of series/dataframes, sometimes to Series or to the columns/rows of DataFrames, and sometimes try element-wise operation and, if that fails, fall back to series-wise operation.
Member

@WillAyd, Oct 2, 2023

I find the motivation to be pretty strong from a developer perspective but I think it is lacking from an end user perspective. To play devil's advocate...why should a non-expert user care or try to understand this distinction?

While there are definitely warts here part of me thinks we've had them for X number of years and gotten by without too much end user complaint about it. I think this would be the largest deprecation we've done on the project as long as I've been involved, so I'm a little wary of causing churn instead of just trying to "softly" guide users towards the more explicit map / agg / transform options


Below is an overview of the current behavior in table form when giving callables to `agg`, `transform` and `apply`. As an example of how to read the tables, when a non-ufunc callable is given to `Series.agg`, `Series.agg` will first try to apply the callable to each element in the series, and if that fails, will fall back to calling the callable on the series as a whole.

(The description may not be 100 % accurate because of various special cases in the current implementation, but will give a good understanding of the current behavior).
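
The "try elements, fall back to series" behavior can be sketched in plain Python. This is only an illustration under stated assumptions, not the pandas implementation: `call_with_fallback` and `total` are hypothetical names, and a list stands in for a `Series`.

```python
def call_with_fallback(func, values):
    """Hypothetical sketch: try func on each element, else on the whole sequence."""
    try:
        # Element-wise attempt: one call per value.
        return [func(value) for value in values]
    except Exception:
        # Fallback: a single call with the whole sequence.
        return func(values)


def total(x):
    # Accepts both a single number and a list, much like np.sum does.
    return sum(x) if isinstance(x, list) else x


# sum(0) raises TypeError, so the fallback path runs and the data is reduced.
reduced = call_with_fallback(sum, [0, 1, 2])

# total(0) succeeds, so the element-wise path "wins" and nothing is aggregated.
not_reduced = call_with_fallback(total, [0, 1, 2])

print(reduced)      # 3
print(not_reduced)  # [0, 1, 2]
```

The second call shows why the fallback is hard to reason about: a callable that happens to accept scalars silently takes the element-wise path, even when the user intended an aggregation.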
Contributor

should you add the string representations of functions to the tables?

Contributor Author

The problems that I try to solve only concern callables and I try to keep the PDEP brief. I can comment somewhere that the proposal does not affect supplying strings to the methods.

Contributor

But even the behavior with strings is confusing. Here are some examples to illustrate:

>>> s = pd.Series([1,2,3])
>>> s.apply("sum")
6
>>> s.apply(np.sum)
0    1
1    2
2    3
dtype: int32
>>> s.apply("mean")
2.0
>>> s.apply(np.mean)
0    1.0
1    2.0
2    3.0
dtype: float64
>>> s.apply(abs)
0    1
1    2
2    3
dtype: int64
>>> s.apply(np.abs)
0    1
1    2
2    3
dtype: int64
>>> s.apply("abs")
0    1
1    2
2    3
dtype: int64


### agg

| | Series | DataFrame | groupby |
|:-----------------------------------|:---------------------------------|:---------------------------------|:----------|
| ufunc or list/dict of ufuncs | series | series | series |
Member

Below, it is mentioned that agg (and others) will always operate on a Series, which might indirectly imply that this line in the table is correct.
But a ufunc never aggregates, so I assume we don't want this, and the end goal should be that agg always reduces, and so never accepts a ufunc ?

| other callables (non ufunc) | Try elements, fallback to series | series | series |
Member

The "try elements" is already deprecated, right?

In [19]: df = pd.DataFrame({"A": range(3)})

In [21]: df["A"].agg(lambda x: np.sum(x))
<ipython-input-21-2e8408eae29f>:1: FutureWarning: using <function <lambda> at 0x7f26752e5ee0> in Series.agg cannot aggregate and has been deprecated. Use Series.transform to keep behavior unchanged.
  df["A"].agg(lambda x: np.sum(x))
Out[21]: 
0    0
1    1
2    2
Name: A, dtype: int64

| list/dict of callables (non-ufunc) | Try elements, fallback to series | Try elements, fallback to series | series |

### transform

| | Series | DataFrame | groupby |
|:-----------------------------------|:---------------------------------|:---------------------------------|:----------|
| ufunc or list/dict of ufuncs | series | series | series |
| other callables (non ufunc) | Try elements, fallback to series | series | series |
| list/dict of callables (non-ufunc) | Try elements, fallback to series | Try elements, fallback to series | series |

### apply

| | Series | DataFrame | groupby |
|:-----------------------------------|:---------|:------------|:----------|
| ufunc or list/dict of ufuncs | series | series | series |
| other callables (non ufunc) | elements | series | series |
Member

Suggested change
| other callables (non ufunc) | elements | series | series |
| other callables (non ufunc) | elements | series | dataframe |

?

(and this is actually one of the other inconsistencies in the whole groupby side of the story, that we don't have a clear way to choose whether you want to apply your function to each column in the group or on the group as a whole, i.e. as a dataframe)

| list/dict of callables (non-ufunc) | Try elements, fallback to series | series | series |
Member

Suggested change
| list/dict of callables (non-ufunc) | Try elements, fallback to series | series | series |
| list/dict of callables (non-ufunc) | Try elements, fallback to series | series | - |

I am not sure groupby.apply accepts a list or dict?

For example

In [42]: df = pd.DataFrame({"A": range(3)})

In [43]: df.groupby([0, 1, 0]).apply([lambda x: x])
...
TypeError: unhashable type: 'list'


The 3 tables show that:

1. when given numpy ufuncs, callables given to `agg`/`transform`/`apply` operate on series data
2. when used on groupby objects, callables given to `agg`/`transform`/`apply` operate on series data
3. otherwise, in some cases they will try element-wise operation and fall back to series-wise operation if that fails, in some cases they will operate on series data, and in some cases on element data.
Member

Sidenote: to be honest, this "in some cases on element data" is IMO by far the main use case of Series.apply, and in this case (being passed a single function), it always operates element by element, consistently and unambiguously.


The above differences result in some non-obvious differences in how the same callable given to `agg`/`transform`/`apply` will behave.

For example, calling `agg` using the same callable will give different results depending on context:

```python
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": range(3)})
>>>
>>> df.agg(lambda x: np.sum(x))  # ok
A    3
dtype: int64
>>> df.agg([lambda x: np.sum(x)])  # not ok
         A
  <lambda>
0        0
1        1
2        2
>>> df.A.agg(lambda x: np.sum(x))  # not ok
0    0
1    1
2    2
Name: A, dtype: int64
```

It can also have a great effect on performance, even when the result is correct. For example:

```python
>>> df = pd.DataFrame({"A": range(1_000_000)})
>>> %timeit df.transform(lambda x: x + 1) # fast
1.43 ms ± 3.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit df.transform([lambda x: x + 1]) # slow
163 ms ± 754 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.A.transform(lambda x: x + 1) # slow
162 ms ± 980 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

The reason for the great performance difference is that `df.transform(func)` operates on series data, which is fast, while `df.transform(func_list)` will attempt element-wise operation first, and if that works (which it does here), will be much slower than series operations.

In addition to the above effects of the current implementation of `agg`/`transform` and `apply`, see [#52140](https://github.com/pandas-dev/pandas/issues/52140) for more examples of the unexpected effects of how `apply` is implemented.

It can also be noted that `Series.apply` and `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` or `map`, if `agg` and `transform` were to always operate on series data. For some examples, see the table below for alternatives to using `apply(func)`:

| func | Series | DataFrame |
|:--------------------|:-----------|:------------|
| lambda x: str(x) | .map | .map |
| lambda x: x + 1 | .transform | .transform |
| [lambda x: x.sum()] | .agg | .agg |

So, for example, `ser.apply(lambda x: str(x))` can be replaced with `ser.map(lambda x: str(x))` while `df.apply([lambda x: x.sum()])` can be replaced with `df.agg([lambda x: x.sum()])`.
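
A minimal sketch of these replacements is shown here, including one row-wise (`axis=1`) case. The frame, the callables and the variable names are made up for illustration, and the sketch only uses calls whose behavior is the same today and under the proposal.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])
df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Element-wise: each value is passed to the callable on its own.
as_text = ser.map(lambda x: str(x))

# Series-wise aggregation: each column is passed as a Series and reduced to a scalar.
column_totals = df.agg(lambda col: col.sum())

# Series-wise transformation: each column is passed as a Series and the shape is kept.
shifted = df.transform(lambda col: col + 1)

# Row-wise aggregation: with axis=1 each row is passed as a Series.
row_totals = df.agg(lambda row: row["a"] + row["b"], axis=1)

print(as_text.tolist())         # ['1', '2', '3']
print(column_totals.to_dict())  # {'a': 3, 'b': 30}
print(row_totals.tolist())      # [11, 22]
```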
Contributor

Would be really worthwhile to include some axis=1 examples for DataFrame.apply()


Overall, because of their flexibility, `Series.apply` and `DataFrame.apply` are considered unnecessarily complex, and it would be clearer for users to use `.map`, `.agg` or `.transform`, as appropriate in the given situation.

## Proposal

With the above in mind, it is proposed that in the future `apply`, `transform` and `agg` will work as follows:

1. the `agg` and `transform` methods of `Series`, `DataFrame` and `groupby` will always operate series-wise and never element-wise
2. `Series.apply` and `DataFrame.apply` will be deprecated.
3. `groupby.apply` will not be deprecated (because it behaves differently than `Series.apply` and `DataFrame.apply`)

The above changes mean that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows:

1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.
2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.
4. Functions that are not applicable to `map`, `agg` or `transform` are considered relatively rare, and in the future users should call these functions directly rather than use the `apply` method.

After the proposed change, uses of `Series.apply` and `DataFrame.apply` will in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where they cannot be substituted by `map`, `agg` or `transform`, it is proposed to accept that users will have to find alternative ways to apply the functions, typically by applying the functions manually and possibly concatenating the results.
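
As a rough illustration of what applying a function manually and concatenating the results could look like, the sketch below uses a hypothetical helper, `top_two`, that is neither an aggregation (it returns two values) nor a transformation (it does not keep the input shape). The helper and the data are assumptions made for this example only.

```python
import pandas as pd


def top_two(col):
    # Hypothetical helper: the two largest values of a column, re-indexed from 0.
    return col.sort_values(ascending=False).head(2).reset_index(drop=True)


df = pd.DataFrame({"A": [3, 1, 2], "B": [10, 30, 20]})

# Call the function directly on each column, then stitch the pieces together.
result = pd.concat({name: top_two(df[name]) for name in df.columns}, axis=1)

print(result["A"].tolist())  # [3, 2]
print(result["B"].tolist())  # [30, 20]
```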

It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases.
Contributor

Suggested change
It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases.
It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` are not proposed to be changed in this PDEP, because `groupby.agg`, `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases.


## Deprecation process

To change the current behavior, it will have to be deprecated. However, `Series.apply` and `DataFrame.apply` are very widely used methods, so they will be deprecated very gradually.

This means that in v2.2:

1. Calls to `Series.apply` and `DataFrame.apply` will emit a `DeprecationWarning` with an appropriate deprecation message.
2. A `series_ops_only` parameter of type `bool | lib.NoDefault` will be added to the `agg` and `transform` methods of `Series` and `DataFrame`, with a default value of `lib.no_default`. When `series_ops_only` is set to `False`, `agg` and `transform` will behave as they do currently. When set to `True`, `agg` and `transform` will never operate on elements, but always on Series. When left at `lib.no_default`, `agg` and `transform` will behave as with `series_ops_only=False`, but will emit a `DeprecationWarning` stating that the current behavior will be removed in the future.
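
The sentinel-default pattern described in item 2 can be sketched in plain Python. This is a hedged illustration, not the proposed pandas code: `agg_sketch`, `total` and the `_NoDefault` enum are stand-ins written for this example (the enum plays the role of `lib.no_default`), and a list stands in for a `Series`.

```python
import warnings
from enum import Enum


class _NoDefault(Enum):
    # Stand-in for the pandas sentinel; lets "not passed" differ from True/False.
    no_default = "NO_DEFAULT"


no_default = _NoDefault.no_default


def agg_sketch(values, func, series_ops_only=no_default):
    """Hypothetical sketch of the v2.2 transition behavior."""
    if series_ops_only is no_default:
        # Not passed: keep the current behavior, but warn that it will go away.
        warnings.warn(
            "element-wise fallback is deprecated; pass series_ops_only=True",
            DeprecationWarning,
            stacklevel=2,
        )
        series_ops_only = False
    if series_ops_only:
        # Future behavior: always call func once on the whole sequence.
        return func(values)
    try:
        # Current behavior: try element-wise first.
        return [func(value) for value in values]
    except Exception:
        return func(values)


def total(x):
    # Accepts both a single number and a list.
    return sum(x) if isinstance(x, list) else x


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = agg_sketch([0, 1, 2], total)                          # warns
    explicit = agg_sketch([0, 1, 2], total, series_ops_only=True)  # no warning

print(legacy)       # [0, 1, 2]
print(explicit)     # 3
print(len(caught))  # 1
```

Using a sentinel rather than `None` or `False` as the default is what allows the warning to fire only for callers who have not yet made an explicit choice.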

In Pandas v3.0:
1. Calls to `Series.apply` and `DataFrame.apply` will emit a `FutureWarning` with an appropriate deprecation message.
2. `agg` and `transform` will always operate on series/columns/rows data, and the `series_ops_only` parameter will have no effect and be deprecated.

In Pandas v4.0:
1. `Series.apply` and `DataFrame.apply` will be removed from the code base.
2. The `series_ops_only` parameter of `agg` and `transform` will be removed from the code base.

## PDEP History

- 24 August 2023: Initial version (proposed to change `Series.apply` and `DataFrame.apply` to always operate on series/columns/rows)
- 17 September 2023: Version 2 (renamed and proposing to deprecate `Series.apply` and `DataFrame.apply` and make `agg`/`transform` always operate on series/columns/rows)