Commit 4becf76

PDEP-13: Make the Series.apply method operate Series-wise
1 parent 7134f2c commit 4becf76

1 file changed

+333
-0
lines changed

@@ -0,0 +1,333 @@
# PDEP-13: Make the Series.apply method operate Series-wise

- Created: 24 August 2023
- Status: Under discussion
- Discussion: [#52140](https://github.com/pandas-dev/pandas/issues/52509)
- Author: [Terji Petersen](https://github.com/topper-123)
- Revision: 1

## Abstract

Currently, the input given to `Series.apply` is treated differently depending on its type:

* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`.
* if the input is a callable, but not a numpy `ufunc`, `series.apply(func)` is similar to `Series([func(val) for val in series], index=series.index)`, i.e. similar to `series.map(func)`.
* if the input is a list-like or dict-like, `series.apply(func)` is equivalent to `series.agg(func)` (which is subtly different from `series.apply`).

In contrast, `DataFrame.apply` has a consistent behavior:

* if the input is a callable, `df.apply(func)` always calls the callable on each column in the DataFrame, so it is similar to `func(col) for _, col in df.items()` + wrapping functionality.
* if the input is a list-like or dict-like, `df.apply` calls each item in the list/dict and wraps the result as needed. So for example if the input is a list, `df.apply(func_list)` is equivalent to `[df.apply(func) for func in func_list]` + wrapping functionality.

This PDEP proposes that:

- The current complex behavior of `Series.apply` will be deprecated in Pandas 2.2.
- Single callables given to the `.apply` method of `Series` will in Pandas 3.0 always be called on the whole `Series`, so `series.apply(func)` will become similar to `func(series)`.
- Lists or dicts of callables given to `Series.apply` will in Pandas 3.0 always call `Series.apply` on each element of the list/dict.

In short, this PDEP proposes changing `Series.apply` to be more similar to how `DataFrame.apply` works on single dataframe columns, i.e. operate on the whole series. If a user wants to map a callable to each element of a Series, they should be directed to use `Series.map` instead of using `Series.apply`.

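As a rough illustration of the distinction drawn above, the sketch below contrasts element-wise and Series-wise calls using `Series.map` and `Series.pipe`, whose semantics this proposal does not change. `Series.pipe` is used only as a stand-in for the proposed `Series.apply` behavior, and the variable names are illustrative.

```python
import pandas as pd

ser = pd.Series([1, 4, 9])

# Element-wise: the callable receives one scalar at a time.
elementwise = ser.map(lambda x: x + 1)

# Series-wise: the callable receives the whole Series exactly once.
# `pipe` is only a stand-in for the proposed `Series.apply` behavior.
serieswise = ser.pipe(lambda s: s + 1)

# For simple arithmetic the two routes agree on the values...
same_values = elementwise.tolist() == serieswise.tolist()

# ...but a reduction only makes sense Series-wise, because a scalar
# element has no `.sum()` method.
total = ser.pipe(lambda s: s.sum())
```

The reduction on the last line is the kind of callable that the proposal would let users pass directly to `Series.apply`.
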
## Motivation

`Series.apply` is currently a very complex method, whose behaviour differs depending on the nature of its input.

`Series.apply` and `Series.map` currently often behave very similarly, but differently enough that it is confusing when to use one over the other, and especially when `Series.apply` is a bad choice.

Also, calling `Series.apply` currently gives a different result than the per-column result from calling `DataFrame.apply`, which can be confusing for users who expect `Series.apply` to be the `Series` version of `DataFrame.apply`, similar to how `Series.agg` is the `Series` version of `DataFrame.agg`. For example, some functions currently work fine with `DataFrame.apply`, but when given to `Series.apply` they may fail, be very slow, or give a different result than the per-column result from `DataFrame.apply`.

### Similarities and differences between `Series.apply` and `Series.map`

The main similarity between the methods is that they both fall back to `Series._map_values`, which in turn uses `algorithms.map_array` or `ExtensionArray.map` as relevant.

The differences are many, but each one is relatively minor:

1. `Series.map` has a `na_action` parameter, which `Series.apply` doesn't.
2. `Series.apply` can take advantage of numpy ufuncs, which `Series.map` can't.
3. `Series.apply` can take `args` and `**kwargs`, which `Series.map` can't.
4. `Series.apply` is more general and can take a string, e.g. `"sum"`, or lists or dicts of inputs, which `Series.map` can't.
5. A numpy ufunc is called on the whole Series when given to `Series.apply`, but on each element of the Series when given to `Series.map`.

In addition, `Series.apply` has some functionality which `Series.map` does not, but which has already been deprecated:

6. `Series.apply` has a `convert_dtype` parameter (deprecated in pandas 2.1, see [GH52257](https://github.com/pandas-dev/pandas/pull/52257)).
7. `Series.apply` will return a `DataFrame` if its result is a list of Series (deprecated in pandas 2.1, see [GH52123](https://github.com/pandas-dev/pandas/pull/52123)).

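Two of the differences listed above (`na_action` on `Series.map` and `args` on `Series.apply`) can be sketched as follows. The helper `shift` and the sample data are hypothetical; `shift` is deliberately written so that it gives the same values whether it is called per element or on the whole Series.

```python
import pandas as pd

ser = pd.Series([1.0, None, 3.0])

# `Series.map` can skip missing values via `na_action="ignore"`:
# the callable is never called for the NaN entry.
labels = ser.map(lambda x: f"value={x:.0f}", na_action="ignore")

# `Series.apply` forwards extra positional arguments through `args`.
# `shift` (a hypothetical helper) works the same on a scalar or a Series,
# so its values do not depend on element-wise vs. Series-wise calling.
def shift(x, offset):
    return x + offset

shifted = ser.apply(shift, args=(10,))
```
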
### Similarities and differences between `Series.apply` and `DataFrame.apply`

`Series.apply` and `DataFrame.apply` are similar when given numpy ufuncs as inputs. When given non-ufuncs, they behave differently: `series.apply(func)` is similar to `series.map(func)`, while `DataFrame.apply(func)` calls the input on each column series and combines the results.

If given a list-like or dict-like, `Series.apply` behaves similarly to `Series.agg`, while `DataFrame.apply` calls each element in the list-like/dict-like on each column and combines the results.

Also, `DataFrame.apply` has some parameters (`raw` and `result_type`) which are relevant for a 2D DataFrame, but may not be relevant for `Series.apply`, because `Series` is a 1D structure.

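To make the per-column behavior of `DataFrame.apply` concrete, here is a minimal sketch. The frame, the helper `col_range`, and the `received_types` bookkeeping are all illustrative and not taken from the PDEP.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

received_types = set()

def col_range(col):
    # Record what kind of object `DataFrame.apply` hands to the callable.
    received_types.add(type(col).__name__)
    return col.max() - col.min()

# With the default `axis=0`, each column is passed in as a Series and
# the scalar results are combined into a Series indexed by column name.
ranges = df.apply(col_range)
```

Because the callable receives a whole column, reductions such as `col.max()` work here, whereas the same callable would currently fail in `Series.apply`.
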
## Examples of problems with the current way `Series.apply` works

The above similarities and many minor differences make for confusing and overly complex rules for when it's a good idea to use `Series.apply` over `Series.map`, and vice versa, and for when a callable will work well with `Series.apply` versus `DataFrame.apply`. Some examples are shown below.

First some setup:

```python
>>> import numpy as np
>>> import pandas as pd
>>>
>>> small_ser = pd.Series([1, 2, 3])
>>> large_ser = pd.Series(range(100_000))
```

### 1. string vs numpy funcs in `Series.apply`

```python
>>> small_ser.apply("sum")
6
>>> small_ser.apply(np.sum)
0    1
1    2
2    3
dtype: int64
```

It will surprise users that these two give different results. Also, anyone using the second pattern is probably making a mistake.

Note that giving `np.sum` to `DataFrame.apply` aggregates properly:

```python
>>> pd.DataFrame(small_ser).apply(np.sum)
0    6
dtype: int64
```

This PDEP proposes that callables will be applied to the whole `Series`, so in the future we will have:

```python
>>> small_ser.apply(np.sum)
6
```

### 2. Callables vs. list/dict of callables

Giving functions and lists/dicts of functions will give different results:

```python
>>> small_ser.apply(np.sum)
0    1
1    2
2    3
dtype: int64
>>> small_ser.apply([np.sum])
sum    6
dtype: int64
```

Also with non-numpy callables:

```python
>>> small_ser.apply(lambda x: x.sum())
AttributeError: 'int' object has no attribute 'sum'
>>> small_ser.apply([lambda x: x.sum()])
<lambda>    6
dtype: int64
```

In both cases above, the difference is that `Series.apply` operates element-wise if given a callable, but series-wise if given a list/dict of callables.

This PDEP proposes that callables will be applied to the whole `Series`, so in the future we will have:

```python
>>> small_ser.apply(lambda x: x.sum())
6
>>> small_ser.apply([lambda x: x.sum()])
<lambda>    6
dtype: int64
```

### 3. Functions in `Series.apply`

The `Series.apply` doc string has examples using lambdas, but using lambdas in `Series.apply` is often bad practice because of poor performance:

```python
>>> %timeit large_ser.apply(lambda x: x + 1)
24.1 ms ± 88.8 µs per loop
```

Currently, `Series` does not have a dedicated method that makes a callable operate on the whole of a series' data. Instead, users need to use `Series.pipe` for the operation to be efficient:

```python
>>> %timeit large_ser.pipe(lambda x: x + 1)
44 µs ± 363 ns per loop
```

(The reason for the above performance difference is that with `apply` the callable gets called on each single element, while `pipe` calls `x.__add__(1)`, which operates on the whole array.)

Note also that `.pipe` operates on the `Series` while `apply` currently operates on each element in the data, so there are some differences that may have consequences in some cases.

This PDEP proposes that callables will be applied to the whole `Series`, so in the future `Series.apply` will be as fast as `Series.pipe`.

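The `%timeit` lines above need IPython. A plain-Python sketch of the same comparison, using only the standard `timeit` module, might look like the following. It uses `Series.map` for the element-wise route and `Series.pipe` for the Series-wise route so that it does not depend on which `Series.apply` behavior is installed; the helper `add_one` is illustrative, and no particular timings are claimed.

```python
import timeit

import pandas as pd

large_ser = pd.Series(range(100_000))

def add_one(x):
    return x + 1

# Element-wise: `add_one` is called once per element (100,000 calls).
elementwise_time = timeit.timeit(lambda: large_ser.map(add_one), number=3)

# Series-wise: `add_one` is called once on the whole Series.
serieswise_time = timeit.timeit(lambda: large_ser.pipe(add_one), number=3)

# Both routes compute the same values; only the cost differs.
same_values = large_ser.map(add_one).tolist() == large_ser.pipe(add_one).tolist()

print(f"map:  {elementwise_time:.4f} s for 3 runs")
print(f"pipe: {serieswise_time:.4f} s for 3 runs")
```
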
### 4. ufuncs in `Series.apply` vs. normal functions

Performance-wise, ufuncs are fine in `Series.apply`, but non-ufunc functions are not:

```python
>>> %timeit large_ser.apply(np.sqrt)
71.6 µs ± 1.17 µs per loop
>>> %timeit large_ser.apply(lambda x: np.sqrt(x))
63.6 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

It is difficult to understand why ufuncs are fast in `apply`, while other callables are slow in `apply` (answer: it's because ufuncs operate on the whole Series, while other callables operate elementwise).

This PDEP proposes that callables will be applied to the whole `Series`, so in the future non-ufunc functions in `Series.apply` will be as fast as ufuncs.

### 5. callables in `Series.apply` are slow, callables in `DataFrame.apply` are fast

Above it was shown that using (non-ufunc) callables in `Series.apply` is bad performance-wise. On the other hand, using them in `DataFrame.apply` is fine:

```python
>>> %timeit large_ser.apply(lambda x: x + 1)
24.3 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pd.DataFrame(large_ser).apply(lambda x: x + 1)
160 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

Having callables be fast in `DataFrame.apply` but slow in `Series.apply` is confusing for users.

This PDEP proposes that callables will be applied to the whole `Series`, so in the future `Series.apply` will be as fast as `DataFrame.apply` already is.

### 6. callables in `Series.apply` may fail, while callables in `DataFrame.apply` do not and vice versa

```python
>>> small_ser.apply(lambda x: x.sum())
AttributeError: 'int' object has no attribute 'sum'
>>> pd.DataFrame(small_ser).apply(lambda x: x.sum())
0    6
dtype: int64
```

Having callables fail when used in `Series.apply`, but work in `DataFrame.apply` or vice versa, is confusing for users.

This PDEP proposes that callables will be applied to the whole `Series`, so callables given to `Series.apply` will work the same as when given to `DataFrame.apply`. In the future we will therefore have:

```python
>>> small_ser.apply(lambda x: x.sum())
6
>>> pd.DataFrame(small_ser).apply(lambda x: x.sum())
0    6
dtype: int64
```

### 7. `Series.apply` vs. `Series.agg`

The doc string for `Series.agg` says about the method's `func` parameter: "If a function, must ... work when passed ... to Series.apply". But compare these:

```python
>>> small_ser.agg(np.sum)
6
>>> small_ser.apply(np.sum)
0    1
1    2
2    3
dtype: int64
```

Users would expect these two to give the same result.

This PDEP proposes that callables will be applied to the whole `Series`, so in the future we will have:

```python
>>> small_ser.agg(np.sum)
6
>>> small_ser.apply(np.sum)
6
```

### 8. dictlikes vs. listlikes in `Series.apply`

Giving a *list* of transforming arguments to `Series.apply` returns a `DataFrame`:

```python
>>> small_ser.apply(["sqrt", np.abs])
       sqrt  absolute
0  1.000000         1
1  1.414214         2
2  1.732051         3
```

But giving a *dict* of transforming arguments returns a `Series` with a `MultiIndex`:

```python
>>> small_ser.apply({"sqrt": "sqrt", "abs": np.abs})
sqrt  0    1.000000
      1    1.414214
      2    1.732051
abs   0    1.000000
      1    2.000000
      2    3.000000
dtype: float64
```

These two should give same-shaped output for consistency. When `Series.transform` is used instead of `Series.apply`, a `DataFrame` is returned in both cases, and I think the dictlike example above should likewise return a `DataFrame`, similar to the listlike example.

Minor additional info: listlikes and dictlikes of aggregation arguments do behave the same, so this is only a problem with dictlikes of transforming arguments when using `apply`.

This PDEP proposes that the result from giving list-likes and dict-likes to `Series.apply` will have the same shape as when given list-likes currently:

```python
>>> small_ser.apply(["sqrt", np.abs])
       sqrt  absolute
0  1.000000         1
1  1.414214         2
2  1.732051         3
>>> small_ser.apply({"sqrt": "sqrt", "abs": np.abs})
       sqrt  absolute
0  1.000000         1
1  1.414214         2
2  1.732051         3
```

## Proposal

With the above in mind, it is proposed that:

1. When given a callable, `Series.apply` will always operate on the series, i.e. `series.apply(func)` will be similar to `func(series)` + the needed additional functionality.
2. When given a list-like or dict-like, `Series.apply` will apply each element of the list-like/dict-like to the series, i.e. `series.apply(func_list)` will be similar to `[series.apply(func) for func in func_list]` + the needed additional functionality.
3. The changes made to `Series.apply` will propagate to `Series.agg` and `Series.transform` as needed.

The difference between `Series.apply()` and `Series.map()` will then be that:

* `Series.apply()` makes the passed-in callable operate on the series, similarly to how `(DataFrame|SeriesGroupBy|DataFrameGroupBy).apply` operate on series. This is very fast and can do almost anything.
* `Series.map()` makes the passed-in callable operate on each of the series' data elements individually. This is very flexible, but can be very slow, so it should only be used if `Series.apply` can't do the job.

So, this API change will help make the Pandas `Series.(apply|map)` API clearer without losing functionality and let their functionality be explainable in a simple manner, which would be a win for Pandas.

The result of the above change will be that `Series.apply` operates similarly to how `DataFrame.apply` already works per column, just as `Series.map` operates similarly to how `DataFrame.map` works per column. This will give better coherence between same-named methods on `DataFrame` and `Series`.

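The two proposed rules can be emulated with a small helper. This is only a sketch under stated assumptions: `apply_series_wise` is a hypothetical function, not pandas code; it handles callables only (not strings such as `"sqrt"`); and its choice of wrapping (a `DataFrame` when every result is a `Series`, otherwise a `Series`) is a guess at the "needed additional functionality" rather than something the PDEP specifies.

```python
import numpy as np
import pandas as pd


def apply_series_wise(ser, func):
    """Hypothetical emulation of the proposed semantics (not pandas code)."""
    if callable(func):
        # Rule 1: a single callable sees the whole Series.
        return func(ser)
    if isinstance(func, dict):
        named = list(func.items())
    else:
        # Label list entries by function name; duplicate names would collide.
        named = [(f.__name__, f) for f in func]
    # Rule 2: call each entry on the whole Series.
    results = {name: f(ser) for name, f in named}
    if all(isinstance(res, pd.Series) for res in results.values()):
        # Transform-like results line up as DataFrame columns.
        return pd.DataFrame(results)
    # Otherwise collect the per-function results in a Series.
    return pd.Series(results)


small_ser = pd.Series([1, 2, 3])
total = apply_series_wise(small_ser, np.sum)
frame = apply_series_wise(small_ser, [np.sqrt, np.abs])
summary = apply_series_wise(
    small_ser, {"total": pd.Series.sum, "largest": pd.Series.max}
)
```

In this emulation, lists and dicts of transforming callables both produce a `DataFrame`, which is the same-shape outcome the examples section argues for.
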
## Deprecation process

To change the behavior, the current behavior will have to be deprecated. This can be done by adding a `by_row` parameter to `Series.apply`, which means, when `by_row=False`, that `Series.apply` will not operate elementwise but Series-wise.

So we will have in pandas v2.2:

```python
def apply(self, ..., by_row: bool | NoDefault = no_default, ...):
    if by_row is no_default:
        warn("The by_row parameter will be set to False in the future")
        by_row = True
    ...
```

In pandas v3.0 the signature will change to:

```python
def apply(self, ..., by_row: NoDefault = no_default, ...):
    if by_row is not no_default:
        warn("Do not use the by_row parameter, it will be removed in the future")
    ...
```

I.e. the `by_row` parameter will be needed in the signature in v3.0 in order to be backward compatible with v2.x, but will have no effect.

In Pandas v4.0, the `by_row` parameter will be removed.

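The signatures above are schematic. A runnable toy version of the v2.2 step, written against plain Python lists rather than pandas, could look like the sketch below. The names `apply_v22` and `_NO_DEFAULT` are hypothetical, and the use of `FutureWarning` is an assumption, since the PDEP text does not name a warning category.

```python
import warnings

# Hypothetical stand-in for the `no_default` sentinel named in the text.
_NO_DEFAULT = object()


def apply_v22(values, func, by_row=_NO_DEFAULT):
    """Toy model of the v2.2 step: warn when `by_row` is left unset."""
    if by_row is _NO_DEFAULT:
        warnings.warn(
            "The by_row parameter will be set to False in the future",
            FutureWarning,  # assumed category, not stated in the PDEP
            stacklevel=2,
        )
        by_row = True  # keep the old element-wise behavior for now
    if by_row:
        return [func(v) for v in values]
    return func(values)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = apply_v22([1, 2, 3], lambda v: v + 1)  # warns
    new_style = apply_v22([1, 2, 3], sum, by_row=False)  # silent
```

A sentinel default lets the function tell "argument omitted" apart from an explicit `True` or `False`, so only callers who rely on the old default see the warning.
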
## PDEP-13 History

- 24 August 2023: Initial version
