PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data #54747
# PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data

- Created: 24 August 2023
- Status: Under discussion
- Discussion: [#52140](https://github.com/pandas-dev/pandas/issues/52509)
- Author: [Terji Petersen](https://github.com/topper-123)
- Revision: 2

## Abstract

The `apply`, `transform` and `agg` methods have very complex behavior when given callables: in some cases they operate on the elements of a series, in some cases on the series as a whole, and in some cases they try one first and, if that fails, fall back to the other. There is no logical system to how these behaviors are arranged, which makes the methods difficult for users to understand.

It is proposed that `apply`, `transform` and `agg` will in the future work as follows:

1. The `agg` and `transform` methods of `Series`, `DataFrame` and `groupby` will always operate series-wise and never element-wise.
2. `Series.apply` and `DataFrame.apply` will be deprecated.
3. The current behavior when supplying strings to the methods will not be changed.
4. `groupby.apply` will not be deprecated (because it behaves differently than `Series.apply` and `DataFrame.apply`).

The above changes mean that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows:

1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.

Are we sure this is accurate? For the use cases I've seen of Currently, say I do
In [43]: df = pd.DataFrame({'a': ['quetzal', 'quetzal', 'baboon'], 'b': ['panda', 'chinchilla', 'elk']})
In [44]: df
Out[44]:
a b
0 quetzal panda
1 quetzal chinchilla
2 baboon elk
In [45]: def func(row):
...: return f"{row['a']} or {row['b']}?"
...:
In [46]: df.apply(func, axis=1)
Out[46]:
0 quetzal or panda?
1 quetzal or chinchilla?
2 baboon or elk?
dtype: object
Then if I use
In [47]: df.agg(func, axis=1)
Out[47]:
0 quetzal or panda?
1 quetzal or chinchilla?
2 baboon or elk?
dtype: object
This feels very unnatural though - Could

Isn't this a transform? But users don't get to control what gets passed to the UDF they supply - is it column by column, row by row, or the entire frame. Perhaps they should.

transform would raise here:
In [53]: df.transform(func, axis=1)
---------------------------------------------------------------------------
ValueError: Function did not transform
And this underscores just how confusing this is
yes, exactly, it could be:
Then, This is a very common way of using
titanic_df['Person'] = titanic_df[['Age','Sex']].apply(get_person,axis=1)
from https://www.kaggle.com/code/omarelgabry/a-journey-through-titanic

@MarcoGorelli , your example is an aggregation though and However, I guess Your example could be an argument for keeping I'm -1 on making

true, it's a horizontal aggregation: it combines multiple columns into 1 I'm just thinking about the upgrade process I've only ever seen people use
then that could be confusing, and certainly harder than just grepping for

I agree with @MarcoGorelli about the upgrade process. I find changing But maybe we should consider introducing a method Or should we make

wait let's not add yet another method to the API 😅

(just another voice here to support the notion that requiring

2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.

Could you add a 4th

I think you need to have the case when users want to have an operation on a row of a

Can you give some example code snippets showing some possible problems?

@Dr-Irv I've added an example here https://github.com/pandas-dev/pandas/pull/54747/files#r1339884642 Inclined to agree that

4. Functions that are not applicable for `map`, `agg` or `transform` are considered relatively rare, and in the future users should call these functions directly rather than use the `apply` method.

After the proposed change, uses of `Series.apply` and `DataFrame.apply` will in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` and `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed to accept that users will have to find alternative ways to apply the functions, i.e. typically by applying the functions manually and possibly concatenating the results.

## Motivation

The current behavior of `apply`, `agg` and `transform` is very complex and therefore difficult for non-expert users to understand. The main difficulty is that the methods sometimes apply callables to the elements of a Series/DataFrame, sometimes to a whole Series or to the columns/rows of a DataFrame, and sometimes try an element-wise operation and, if that fails, fall back to a series-wise operation.

I find the motivation to be pretty strong from a developer perspective but I think it is lacking from an end user perspective. To play devil's advocate...why should a non-expert user care or try to understand this distinction? While there are definitely warts here part of me thinks we've had them for X number of years and gotten by without too much end user complaint about it. I think this would be the largest deprecation we've done on the project as long as I've been involved, so I'm a little wary of causing churn instead of just trying to "softly" guide users towards the more explicit map / agg / transform options

Below is an overview, in table form, of the current behavior when giving callables to `agg`, `transform` and `apply`. As an example of how to read the tables: when a non-ufunc callable is given to `Series.agg`, `Series.agg` will first try to apply the callable to each element in the series and, if that fails, will fall back to calling the callable on the series as a whole.

(The description may not be 100% accurate because of various special cases in the current implementation, but it gives a good understanding of the current behavior.)
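
The "try elements, fall back to series" entries in the tables below can be pictured with a small, self-contained model. This is a sketch and not pandas code: `try_elements_then_series` is a hypothetical helper, a plain list stands in for a Series, and the set of caught exception types is an assumption made for illustration.

```python
from typing import Any, Callable, Sequence


def try_elements_then_series(values: Sequence[Any], func: Callable[[Any], Any]) -> Any:
    """Hypothetical model of the 'try elements, fall back to series' dispatch."""
    try:
        # First attempt: call func once per element.
        return [func(v) for v in values]
    except (TypeError, ValueError, AttributeError):
        # Fallback: call func once on the whole container.
        return func(values)


data = [0, 1, 2]

# len() raises TypeError on an int, so the fallback runs on the whole list.
fallback_result = try_elements_then_series(data, len)

# abs() works on each int, so the whole-container path is never tried.
elementwise_result = try_elements_then_series(data, abs)

# sum() raises TypeError on an int (not iterable), so this falls back as well.
sum_result = try_elements_then_series(data, sum)

print(fallback_result, elementwise_result, sum_result)
```

Which path runs depends only on whether the callable happens to succeed on a single element, which is why the same method call can behave as an aggregation for one callable and as an element-wise map for another.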

should you add the string representations of functions to the tables?

The problems that I try to solve only concern callables and I try to keep the PDEP brief. I can comment somewhere that the proposal does not affect supplying strings to the methods.

But even the behavior with strings is confusing. Here's some examples to illustrate:
>>> s = pd.Series([1,2,3])
>>> s.apply("sum")
6
>>> s.apply(np.sum)
0 1
1 2
2 3
dtype: int32
>>> s.apply("mean")
2.0
>>> s.apply(np.mean)
0 1.0
1 2.0
2 3.0
dtype: float64
>>> s.apply(abs)
0 1
1 2
2 3
dtype: int64
>>> s.apply(np.abs)
0 1
1 2
2 3
dtype: int64
>>> s.apply("abs")
0 1
1 2
2 3
dtype: int64

### agg

|                                    | Series                           | DataFrame                        | groupby   |
|:-----------------------------------|:---------------------------------|:---------------------------------|:----------|
| ufunc or list/dict of ufuncs       | series                           | series                           | series    |

Below, it is mentioned that

| other callables (non ufunc)        | Try elements, fallback to series | series                           | series    |

The "try elements" is already deprecated, right?

| list/dict of callables (non-ufunc) | Try elements, fallback to series | Try elements, fallback to series | series    |

### transform

|                                    | Series                           | DataFrame                        | groupby   |
|:-----------------------------------|:---------------------------------|:---------------------------------|:----------|
| ufunc or list/dict of ufuncs       | series                           | series                           | series    |
| other callables (non ufunc)        | Try elements, fallback to series | series                           | series    |
| list/dict of callables (non-ufunc) | Try elements, fallback to series | Try elements, fallback to series | series    |

### apply

|                                    | Series                           | DataFrame   | groupby   |
|:-----------------------------------|:---------------------------------|:------------|:----------|
| ufunc or list/dict of ufuncs       | series                           | series      | series    |
| other callables (non ufunc)        | elements                         | series      | series    |

Suggested change
? (and this is actually one of the other inconsistencies in the whole groupby side of the story, that we don't have a clear way to choose whether you want to apply your function to each column in the group or on the group as a whole, i.e. as a dataframe)

| list/dict of callables (non-ufunc) | Try elements, fallback to series | series      | series    |

Suggested change
I am not sure groupby.apply accepts a list or dict? For example

The three tables show that:

1. NumPy ufuncs given to `agg`/`transform`/`apply` operate on series data
2. when used on groupby objects, callables given to `agg`/`transform`/`apply` operate on series data
3. otherwise, in some cases the methods try an element-wise operation and fall back to a series-wise operation if that fails, in some cases they operate on series data, and in some cases on element data.

Sidenote: to be honest, this "in some cases on element data" is IMO by far the main use case of

The above differences result in some non-obvious differences in how the same callable given to `agg`/`transform`/`apply` will behave.

For example, calling `agg` with the same callable will give different results depending on context:

```python
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": range(3)})
>>>
>>> df.agg(lambda x: np.sum(x))  # ok
A    3
dtype: int64
>>> df.agg([lambda x: np.sum(x)])  # not ok
         A
  <lambda>
0        0
1        1
2        2
>>> df.A.agg(lambda x: np.sum(x))  # not ok
0    0
1    1
2    2
Name: A, dtype: int64
```

It can also have a great effect on performance, even when the result is correct. For example:

```python
>>> df = pd.DataFrame({"A": range(1_000_000)})
>>> %timeit df.transform(lambda x: x + 1)  # fast
1.43 ms ± 3.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit df.transform([lambda x: x + 1])  # slow
163 ms ± 754 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.A.transform(lambda x: x + 1)  # slow
162 ms ± 980 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

The reason for the large performance difference is that `df.transform(func)` operates on series data, which is fast, while `df.transform(func_list)` attempts an element-wise operation first and, if that works (which it does here), is much slower than a series-wise operation.
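
One way to see where the time goes is to count how often the callable is invoked. The following is a minimal sketch, assuming a small Series and a hypothetical counting wrapper `add_one_counted`; it uses `Series.map` as the element-wise path and a direct call as the series-wise path, and it does not reproduce the timings above.

```python
import pandas as pd

# Call counters for the two code paths (names are illustrative only).
calls = {"elementwise": 0, "series_wise": 0}


def add_one_counted(x, key):
    """Add one to x and record that the function was called."""
    calls[key] += 1
    return x + 1


ser = pd.Series(range(1_000))

# Element-wise: Series.map invokes the Python function once per element.
by_element = ser.map(lambda x: add_one_counted(x, "elementwise"))

# Series-wise: the function receives the whole Series in a single call,
# and the addition itself is carried out as one vectorized operation.
by_series = add_one_counted(ser, "series_wise")

print(calls)
```

Both paths produce the same values here, but the element-wise path pays the Python function-call overhead once per element.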

In addition to the above effects of the current implementation of `agg`/`transform` and `apply`, see [#52140](https://github.com/pandas-dev/pandas/issues/52140) for more examples of the unexpected effects of how `apply` is implemented.

It can also be noted that `Series.apply` and `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` or `map`, if `agg` and `transform` were to always operate on series data. For some examples, see the table below for alternatives to `apply(func)`:

| func                | Series     | DataFrame   |
|:--------------------|:-----------|:------------|
| lambda x: str(x)    | .map       | .map        |
| lambda x: x + 1     | .transform | .transform  |
| [lambda x: x.sum()] | .agg       | .agg        |

So, for example, `ser.apply(lambda x: str(x))` can be replaced with `ser.map(lambda x: str(x))` while `df.apply([lambda x: x.sum()])` can be replaced with `df.agg([lambda x: x.sum()])`.
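
The substitutions in the table can be sketched with a small example. The data and variable names below are made up for illustration, and the sketch only uses single callables on paths that, per the tables above, already operate series-wise on a `DataFrame`, so it does not depend on the proposed changes.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Element-wise work: Series.map instead of Series.apply.
as_text = ser.map(lambda x: str(x))

# Same-shaped result computed from whole columns: transform instead of apply.
plus_one = df.transform(lambda col: col + 1)

# Reduction of each column to a single value: agg instead of apply.
totals = df.agg(lambda col: col.sum())

print(as_text.tolist())
print(plus_one)
print(totals)
```

Choosing the method by intent (map, transform or aggregate) also documents what shape of result the caller expects.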

Would be really worthwhile to include some

Overall, because of their flexibility, `Series.apply` and `DataFrame.apply` are considered unnecessarily complex, and it would be clearer for users to use `.map`, `.agg` or `.transform`, as appropriate in the given situation.

## Proposal

With the above in mind, it is proposed that in the future `apply`, `transform` and `agg` will work as follows:

1. The `agg` and `transform` methods of `Series`, `DataFrame` and `groupby` will always operate series-wise and never element-wise.
2. `Series.apply` and `DataFrame.apply` will be deprecated.
3. `groupby.apply` will not be deprecated (because it behaves differently than `Series.apply` and `DataFrame.apply`).

The above changes mean that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows:

1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.
2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.
4. Functions that are not applicable for `map`, `agg` or `transform` are considered relatively rare, and in the future users should call these functions directly rather than use the `apply` method.

After the proposed change, uses of `Series.apply` and `DataFrame.apply` will in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where `Series.apply` and `DataFrame.apply` cannot be substituted by `map`, `agg` or `transform`, it is proposed to accept that users will have to find alternative ways to apply the functions, i.e. typically by applying the functions manually and possibly concatenating the results.
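
As an illustration of "applying the functions manually and possibly concatenating the results", here is a minimal sketch. The function `above_mean` and the data are hypothetical; the function is chosen because its output is neither a single value per column nor the same shape as its input.

```python
import pandas as pd


def above_mean(col: pd.Series) -> pd.Series:
    """Keep only the values above the column mean (neither an agg nor a transform)."""
    return col[col > col.mean()]


df = pd.DataFrame({"A": [1, 2, 6], "B": [10, 40, 20]})

# For a single Series, call the function directly.
a_only = above_mean(df["A"])

# For a DataFrame, call it once per column and concatenate the pieces;
# the dict keys become the column labels of the combined result.
combined = pd.concat({name: above_mean(df[name]) for name in df.columns}, axis=1)

print(a_only)
print(combined)
```

Rows that survive in only one column show up as missing values in the other, which is the usual alignment behavior of `pd.concat` along `axis=1`.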

It can be noted that the behavior of `groupby.agg`, `groupby.transform` and `groupby.apply` is not proposed to be changed in this PDEP, because `groupby.agg` and `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` and `DataFrame.apply`. Likewise, the behavior when given ufuncs (e.g. `np.sqrt`) and string input (e.g. `"sqrt"`) will remain unchanged, because the behavior is already as intended in all cases.

## Deprecation process

To change the current behavior, it will have to be deprecated. However, `Series.apply` and `DataFrame.apply` are very widely used methods, so they will be deprecated very gradually.

This means that in v2.2:

1. Calls to `Series.apply` and `DataFrame.apply` will emit a `DeprecationWarning` with an appropriate deprecation message.
2. A `series_ops_only` parameter of type `bool | lib.NoDefault` will be added to the `agg` and `transform` methods of `Series` and `DataFrame`, with a default value of `lib.no_default`. When `series_ops_only` is set to `False`, `agg` and `transform` will behave as they do currently. When set to `True`, `agg` and `transform` will never operate on elements, but always on Series. When left at `lib.no_default`, `agg` and `transform` will behave as with `series_ops_only=False`, but will emit a `DeprecationWarning` stating that the current behavior will be removed in the future.
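
The three-state `series_ops_only` switch described for v2.2 can be modelled in plain Python. This is only a sketch of the proposed semantics: `agg_sketch`, the `_NoDefault` sentinel and the warning text are hypothetical stand-ins, a list stands in for a Series, and nothing here is the pandas implementation.

```python
import warnings
from enum import Enum


class _NoDefault(Enum):
    """Stand-in sentinel type, a hypothetical analogue of pandas' NoDefault."""

    no_default = "NO_DEFAULT"


no_default = _NoDefault.no_default


def agg_sketch(values, func, series_ops_only=no_default):
    """Hypothetical model of the proposed ``series_ops_only`` parameter."""
    if series_ops_only is no_default:
        # Not passed explicitly: keep the current behavior, but warn.
        warnings.warn(
            "Element-wise fallback is deprecated; pass series_ops_only=True "
            "to opt in to the future behavior.",
            DeprecationWarning,
            stacklevel=2,
        )
        series_ops_only = False
    if series_ops_only:
        # Future behavior: func always receives the whole container.
        return func(values)
    # Current behavior (modelled): try per element, then fall back.
    try:
        return [func(v) for v in values]
    except (TypeError, ValueError, AttributeError):
        return func(values)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = agg_sketch([1, 2, 3], str)  # default: warns, element-wise

explicit = agg_sketch([1, 2, 3], str, series_ops_only=True)  # whole container

print(legacy, explicit, [w.category.__name__ for w in caught])
```

Using a sentinel rather than `None` or `False` as the default lets the function distinguish "the caller did not choose" from "the caller explicitly asked for the old behavior", so only the former triggers the warning.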

In pandas v3.0:

1. Calls to `Series.apply` and `DataFrame.apply` will emit a `FutureWarning` with an appropriate deprecation message.
2. `agg` and `transform` will always operate on series/columns/rows data, and the `series_ops_only` parameter will have no effect and be deprecated.

In pandas v4.0:

1. `Series.apply` and `DataFrame.apply` will be removed from the code base.
2. The `series_ops_only` parameter of `agg` and `transform` will be removed from the code base.

## PDEP History

- 24 August 2023: Initial version (proposed to change `Series.apply` and `DataFrame.apply` to always operate on series/columns/rows)
- 17 September 2023: Version 2 (renamed; proposes to deprecate `Series.apply` and `DataFrame.apply` and make `agg`/`transform` always operate on series/columns/rows)