PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data #54747
@@ -0,0 +1,151 @@

# PDEP-13: Deprecate the apply method on Series & DataFrame and make the agg and transform methods operate on series data

- Created: 24 August 2023
- Status: Under discussion
- Discussion: [#52140](https://github.com/pandas-dev/pandas/issues/52509)
- Author: [Terji Petersen](https://github.com/topper-123)
- Revision: 2

## Abstract

The `apply`, `transform` and `agg` methods have very complex behavior: in some cases they operate on the elements of a series, in some cases on whole series, and sometimes they try one first and, if that fails, fall back to the other. There is no logical system to how these behaviors are arranged, so the methods can be difficult for users to understand.

It is proposed that `apply`, `transform` and `agg` in the future will work as follows:

1. the `agg` & `transform` methods of `Series`, `DataFrame` & `groupby` will always operate series-wise and never element-wise
2. `Series.apply` & `DataFrame.apply` will be deprecated.
3. `groupby.apply` will not be deprecated (because it behaves differently than `Series.apply` & `DataFrame.apply`)

The above changes mean that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows:

1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.
2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.

> Review comment: Could you add a 4th

> Review comment: I think you need to have the case when users want to have an operation on a row of a

> Review comment: Can you give some example code snippets showing some possible problems?

> Review comment: @Dr-Irv I've added an example here https://github.com/pandas-dev/pandas/pull/54747/files#r1339884642 Inclined to agree that

> Dr-Irv marked this conversation as resolved.

After the proposed change, uses of `Series.apply` & `DataFrame.apply` can in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where they cannot, it is proposed to accept that users will have to find alternative ways to apply the functions, typically by applying the functions manually and possibly concatenating the results.

> Review comment: It may be useful to have some examples of those "very few cases"

## Motivation

The current behavior of `apply`, `agg` & `transform` is very complex and therefore difficult to understand for non-expert users. The main difficulty is that the methods sometimes apply callables to the elements of a Series or DataFrame, sometimes to a whole Series or to the columns/rows of a DataFrame, and sometimes try an element-wise operation first and, if that fails, fall back to a series-wise operation.

Below is an overview, in table form, of the current behavior when giving callables to `agg`, `transform` & `apply`. As an example of how to read the tables: when a non-ufunc callable is given to `Series.agg`, `Series.agg` will first try to apply the callable to each element in the series and, if that fails, will fall back to calling the callable on the series as a whole.

(The description may not be 100% accurate because of various special cases in the current implementation, but it gives a good understanding of the current behavior.)

> Review comment: should you add the string representations of functions to the tables?

> Review comment: The problems that I try to solve only concern callables and I try to keep the PDEP brief. I can comment somewhere that the proposal does not affect supplying strings to the methods.

> Review comment: But even the behavior with strings is confusing. Here's some examples to illustrate:
>
>     >>> s = pd.Series([1,2,3])
>     >>> s.apply("sum")
>     6
>     >>> s.apply(np.sum)
>     0    1
>     1    2
>     2    3
>     dtype: int32
>     >>> s.apply("mean")
>     2.0
>     >>> s.apply(np.mean)
>     0    1.0
>     1    2.0
>     2    3.0
>     dtype: float64
>     >>> s.apply(abs)
>     0    1
>     1    2
>     2    3
>     dtype: int64
>     >>> s.apply(np.abs)
>     0    1
>     1    2
>     2    3
>     dtype: int64
>     >>> s.apply("abs")
>     0    1
>     1    2
>     2    3
>     dtype: int64

### agg

|                                    | Series                           | DataFrame                        | groupby   |
|:-----------------------------------|:---------------------------------|:---------------------------------|:----------|
| ufunc or list/dict of ufuncs       | series                           | series                           | series    |
| other callables (non ufunc)        | Try elements, fallback to series | series                           | series    |
| list/dict of callables (non-ufunc) | Try elements, fallback to series | Try elements, fallback to series | series    |

> Review comment (on the "ufunc or list/dict of ufuncs" row): Below, it is mentioned that

> Review comment (on the "other callables (non ufunc)" row): The "try elements" is already deprecated, right?

### transform

|                                    | Series                           | DataFrame                        | groupby   |
|:-----------------------------------|:---------------------------------|:---------------------------------|:----------|
| ufunc or list/dict of ufuncs       | series                           | series                           | series    |
| other callables (non ufunc)        | Try elements, fallback to series | series                           | series    |
| list/dict of callables (non-ufunc) | Try elements, fallback to series | Try elements, fallback to series | series    |

### apply

|                                    | Series                           | DataFrame   | groupby   |
|:-----------------------------------|:---------------------------------|:------------|:----------|
| ufunc or list/dict of ufuncs       | series                           | series      | series    |
| other callables (non ufunc)        | elements                         | series      | series    |
| list/dict of callables (non-ufunc) | Try elements, fallback to series | series      | series    |

> Review comment (on the "other callables (non ufunc)" row): Suggested change ? (and this is actually one of the other inconsistencies in the whole groupby side of the story, that we don't have a clear way to choose whether you want to apply your function to each column in the group or on the group as a whole, i.e. as a dataframe)

> Review comment (on the "list/dict of callables (non-ufunc)" row): Suggested change. I am not sure groupby.apply accepts a list or dict? For example

The 3 tables show that:

1. when given numpy ufuncs, callables given to `agg`/`transform`/`apply` operate on series data
2. when used on groupby objects, callables given to `agg`/`transform`/`apply` operate on series data
3. otherwise, in some cases the methods will try an element-wise operation and fall back to a series-wise operation if that fails, in some cases they will operate on series data, and in some cases on element data.
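
The difference between the Series and DataFrame columns of the `apply` table can be observed directly. The sketch below is illustrative only: the frame `df` and the helper `got_series` are made up for this example and are not part of the PDEP. It checks whether a plain (non-ufunc) callable is handed a whole `Series` or a single element.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})


def got_series(x):
    # Hypothetical helper: True if pandas passed a whole Series,
    # False if it passed a single element.
    return isinstance(x, pd.Series)


# Series.apply hands a non-ufunc callable one element at a time.
per_element = df["A"].apply(got_series)

# DataFrame.apply hands the same callable one column (a Series) at a time.
per_column = df.apply(got_series)

print(per_element.tolist())
print(per_column.to_dict())
```

The same callable therefore sees different inputs depending only on whether it is called through a `Series` or a `DataFrame`.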

> Review comment: Sidenote: to be honest, this "in some cases on element data" is IMO by far the main use case of

The above differences result in some non-obvious differences in how the same callable given to `agg`/`transform`/`apply` will behave.

For example, calling `agg` using the same callable will give different results depending on context:

```python
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": range(3)})
>>>
>>> df.agg(lambda x: np.sum(x))  # ok: the callable receives the whole column
A    3
dtype: int64
>>> df.agg([lambda x: np.sum(x)])  # not ok: the callable receives each element
         A
  <lambda>
0        0
1        1
2        2
>>> df.A.agg(lambda x: np.sum(x))  # not ok: the callable receives each element
0    0
1    1
2    2
Name: A, dtype: int64
```

> Dr-Irv marked this conversation as resolved.

> Dr-Irv marked this conversation as resolved.

It can also have a great effect on performance, even when the result is correct. For example:

```python
>>> df = pd.DataFrame({"A": range(1_000_000)})
>>> %timeit df.transform(lambda x: x + 1)  # fast
1.43 ms ± 3.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit df.transform([lambda x: x + 1])  # slow
163 ms ± 754 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.A.transform(lambda x: x + 1)  # slow
162 ms ± 980 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

The reason for the great performance difference is that `df.transform(func)` operates on series data, which is fast, while `df.transform(func_list)` will attempt element-wise operation first and, if that works (which it does here), will be much slower than series operations.
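
Because the dispatch depends on the pandas version, it can help to probe what a callable actually receives rather than infer it from timings. The following is a minimal sketch, assuming a hypothetical helper `make_probe` (not a pandas API). It records whether the callable was handed a `Series`, a `DataFrame`, or a single element, and deliberately makes no claim about what the list case reports on any given version.

```python
import pandas as pd


def make_probe(seen):
    # Hypothetical helper: build an identity callable that records
    # what kind of object pandas handed to it.
    def probe(x):
        if isinstance(x, (pd.Series, pd.DataFrame)):
            seen.add(type(x).__name__)
        else:
            seen.add("element")
        return x

    return probe


df = pd.DataFrame({"A": range(5)})

seen_single = set()
df.transform(make_probe(seen_single))  # a single callable

seen_list = set()
df.transform([make_probe(seen_list)])  # the same callable wrapped in a list

print("single callable received:", seen_single)
print("list of callables received:", seen_list)
```

If the list case reports `element`, the callable ran once per value, which is the slow path described above.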

In addition to the above effects of the current implementation of `agg`/`transform` & `apply`, see [#52140](https://github.com/pandas-dev/pandas/issues/52140) for more examples of the unexpected effects of how `apply` is implemented.

It can also be noted that `Series.apply` & `DataFrame.apply` could almost always be replaced with calls to `agg`, `transform` & `map`, if `agg` & `transform` were to always operate on series data. For some examples, see the table below for alternatives to using `apply(func)`:

| func                | Series     | DataFrame   |
|:--------------------|:-----------|:------------|
| lambda x: str(x)    | .map       | .map        |
| lambda x: x + 1     | .transform | .transform  |
| [lambda x: x.sum()] | .agg       | .agg        |

So, for example, `ser.apply(lambda x: str(x))` can be replaced with `ser.map(lambda x: str(x))` while `df.apply([lambda x: x.sum()])` can be replaced with `df.agg([lambda x: x.sum()])`.
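
Two of these substitutions can be checked with pandas' own testing helpers. This is a small sketch on made-up data (`ser` and `df` are assumptions for the example), covering only the `.map` and `.transform` rows of the table.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], name="A")
df = ser.to_frame()

# Element-wise function: Series.apply and Series.map give the same result.
str_via_apply = ser.apply(lambda x: str(x))
str_via_map = ser.map(lambda x: str(x))
pd.testing.assert_series_equal(str_via_apply, str_via_map)

# Shape-preserving function on a Series: apply and transform agree.
pd.testing.assert_series_equal(
    ser.apply(lambda x: x + 1), ser.transform(lambda x: x + 1)
)

# Shape-preserving function on a DataFrame: apply and transform agree.
inc_via_apply = df.apply(lambda x: x + 1)
inc_via_transform = df.transform(lambda x: x + 1)
pd.testing.assert_frame_equal(inc_via_apply, inc_via_transform)
```

The assertions pass silently when the two spellings agree, so a check of this kind could be run before swapping one call for the other in existing code.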

> Review comment: Would be really worthwhile to include some

Overall, because of their flexibility, `Series.apply` & `DataFrame.apply` are considered unnecessarily complex, and it would be clearer for users to use `.map`, `.agg` or `.transform`, as appropriate in the given situation.

## Proposal

With the above in mind, it is proposed that in the future:

It is proposed that `apply`, `transform` and `agg` in the future will work as follows:

> Review comment: Suggested change (You have "in the future" in 2 consecutive sentences)

1. the `agg` & `transform` methods of `Series`, `DataFrame` & `groupby` will always operate series-wise and never element-wise
2. `Series.apply` & `DataFrame.apply` will be deprecated.
3. `groupby.apply` will not be deprecated (because it behaves differently than `Series.apply` & `DataFrame.apply`)

The above changes mean that the future behavior, when users want to apply arbitrary callables in pandas, can be described as follows:

1. When users want to operate on single elements in a `Series` or `DataFrame`, they should use `Series.map` and `DataFrame.map` respectively.
2. When users want to aggregate a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.agg`, `DataFrame.agg` and `groupby.agg` respectively.
3. When users want to transform a `Series`, columns/rows of a `DataFrame` or groups in `groupby` objects, they should use `Series.transform`, `DataFrame.transform` and `groupby.transform` respectively.
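
The three cases above can be sketched on a small made-up frame (the column names `key` and `val` are assumptions for the example). Each call uses a form that, per the tables in the Motivation section, already operates the way the proposal describes.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 5]})

# 1. Element-wise: the callable receives one value at a time.
labels = df["val"].map(lambda v: f"v={v}")

# 2. Aggregation: the callable receives a whole column and returns a scalar.
spread = df[["val"]].agg(lambda col: col.max() - col.min())

# 2b. Aggregation per group: the callable receives each group's values as a Series.
per_group = df.groupby("key")["val"].agg(lambda s: s.max() - s.min())

# 3. Transformation: the callable receives a whole column and returns
#    a result of the same length.
shifted = df[["val"]].transform(lambda col: col - col.min())
```

Here `map` and `transform` preserve the shape of their input, while `agg` reduces each column or group to a single value.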

After the proposed change, uses of `Series.apply` & `DataFrame.apply` can in almost all cases be replaced by `map`, `agg` or `transform`. In the very few cases where they cannot, it is proposed to accept that users will have to find alternative ways to apply the functions, typically by applying the functions manually and possibly concatenating the results.

> Review comment: this text seems to be repeated from the abstract. So shorten the abstract - leave the details here

It can be noted that `groupby.agg`, `groupby.transform` & `groupby.apply` are not proposed to be changed in this PDEP, because `groupby.agg` & `groupby.transform` already behave as desired and `groupby.apply` behaves differently than `Series.apply` & `DataFrame.apply`. Likewise, the behavior when given ufuncs will remain unchanged, because it is already as intended in all cases.

## Deprecation process

To change the current behavior, it will have to be deprecated. This will be done in v2.2 as follows:

1. Deprecate `Series.apply` & `DataFrame.apply`.
2. Add a `series_ops_only` parameter with type `bool | lib.NoDefault` to the `agg` & `transform` methods of `Series` & `DataFrame`. When `series_ops_only` is set to `False`, `agg` & `transform` will behave as they do currently. When set to `True`, `agg` & `transform` will never operate on elements, but always on Series. When set to `no_default`, `agg` & `transform` will behave as with `series_ops_only=False`, but will emit a `FutureWarning` that the current behavior will be removed in the future.
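
The three-state parameter described in step 2 follows a common sentinel-default pattern. The sketch below is a hypothetical standalone function, not the pandas implementation. `transform_sketch` and `_NO_DEFAULT` are invented for illustration, and the "legacy" branch only approximates the try-elements-then-fall-back behavior described in the Motivation section.

```python
import warnings

import pandas as pd

# Hypothetical stand-in for the `no_default` sentinel mentioned in the text.
_NO_DEFAULT = object()


def transform_sketch(ser, func, series_ops_only=_NO_DEFAULT):
    """Hypothetical illustration of the proposed `series_ops_only` switch."""
    if series_ops_only is _NO_DEFAULT:
        # Not set by the caller: keep the old behavior but warn about the change.
        warnings.warn(
            "element-wise fallback is deprecated; pass series_ops_only=True "
            "to opt in to the future behavior",
            FutureWarning,
            stacklevel=2,
        )
        series_ops_only = False
    if series_ops_only:
        # Future behavior: the callable always receives the whole Series.
        return func(ser)
    # Legacy-style behavior: try element by element, fall back to the Series.
    try:
        return ser.map(func)
    except Exception:
        return func(ser)


ser = pd.Series([1, 2, 3])
future = transform_sketch(ser, lambda s: s - s.min(), series_ops_only=True)
```

Passing `series_ops_only=True` or `False` explicitly silences the warning in this sketch, which is what would let users opt in to the new behavior ahead of the v3.0 switch.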

In pandas v3.0:

1. `Series.apply` & `DataFrame.apply` will be removed from the code base (question: or added to `_hidden_attrs`?).
2. The `agg` & `transform` methods will always operate on series/columns/rows data. The `series_ops_only` parameter will have no effect; it will be deprecated and then removed in v4.0 (it must be kept in v3.x in order to facilitate the switch from v2.x to v3.0).

## PDEP History

- 24 August 2023: Initial version (proposed to change `Series.apply` & `DataFrame.apply` to always operate on series/columns/rows)
- 17 September 2023: version 2 (renamed and proposing to deprecate `Series.apply` & `DataFrame.apply` and make `agg`/`transform` always operate on series/columns/rows)

> Review comment: Are we sure this is accurate? For the use cases I've seen of `DataFrame.apply`, I think `DataFrame.agg` would probably be more likely as a replacement
>
> Currently, say I do
>
> Then if I use `DataFrame.map`, it'll operate over all elements of the dataframe 1-by-1. Whereas if I use `.agg`, then I'll preserve what I'm currently doing:
>
> This feels very unnatural though - `func` here doesn't look like an aggregation at all, so I wouldn't have reached for `.agg`
>
> Could `DataFrame.map` get an `axis` keyword as well, so that `DataFrame.map(func, axis=1)` preserves the current behaviour of `DataFrame.apply(func, axis=1)`?

> Review comment: Isn't this a transform? But users don't get to control what gets passed to the UDF they supply - is it column by column, row by row, or the entire frame. Perhaps they should.

> Review comment: transform would raise here:
>
> And this underscores just how confusing this is
>
> yes, exactly, it could be:
>
> - `axis=None`: element by element (current behaviour)
> - `axis=1`: row by row
> - `axis=0`: element by element
>
> Then, `df.apply(lambda row: ..., axis=1)` can just become `df.map(lambda row: ..., axis=1)`
>
> This is a very common way of using `apply`, e.g. from https://www.kaggle.com/code/omarelgabry/a-journey-through-titanic

> Review comment: @MarcoGorelli , your example is an aggregation though and `func` does operate on rows, not elements, so everything works as intended in your `agg` example.
>
> However, I guess `agg` linguistically(?) is best understood as a numerical reduction, not a text combination operation, so using the name `agg` may be a bit unintuitive. My original proposal would have kept `apply` for situations like this, because the problem I wanted to solve was `apply` giving subtly different results for `Series.apply` and `DataFrame.apply` (including for `axis=1`).
>
> Your example could be an argument for keeping `apply`, but drop the element-by-element behavior of `Series.apply`, i.e. revert to the original proposal of this PDEP.
>
> I'm -1 on making `map` operate on rows/columns. It's a strength of the method that it always gives same-shaped results and it would be confusing for the result changing shape based on if an axis parameter is given or not.

> Review comment: true, it's a horizontal aggregation: it combines multiple columns into 1
>
> I'm just thinking about the upgrade process
>
> I've only ever seen people use `DataFrame.apply` to create a single column, and if the upgrade process is going to be:
>
> then that could be confusing, and certainly harder than just grepping for `.apply`

> Review comment: I agree with @MarcoGorelli about the upgrade process. I find changing `DataFrame.apply(func, axis=1)` to `DataFrame.agg(func, axis=1)` to be non-intuitive.
>
> But maybe we should consider introducing a method `DataFrame.project(func)` that is the replacement for `DataFrame.apply(func, axis=1)`. Idea being that you "project" a function onto each row of the DF.
>
> Or should we make `DataFrame.transform(func, axis=1)` work by transforming each row?

> Review comment: wait let's not add yet another method to the API 😅

> Review comment: (just another voice here to support the notion that requiring `agg` for this use case feels very unnatural)