|
| 1 | +# PDEP-13: Make the Series.apply method operate Series-wise |
| 2 | + |
| 3 | +- Created: 24 August 2023 |
| 4 | +- Status: Under discussion |
| 5 | +- Discussion: [#52140](https://github.com/pandas-dev/pandas/issues/52509) |
| 6 | +- Author: [Terji Petersen](https://github.com/topper-123) |
| 7 | +- Revision: 1 |
| 8 | + |
| 9 | +## Abstract |
| 10 | + |
| 11 | +Currently, the input to `Series.apply` is treated differently depending on its type: |
| 12 | + |
| 13 | +* if the input is a numpy `ufunc`, `series.apply(func)` is equivalent to `func(series)`, i.e. similar to `series.pipe(func)`. |
| 14 | +* if the input is a callable, but not a numpy `ufunc`, `series.apply(func)` is similar to `Series([func(val) for val in series], index=series.index)`, i.e. similar to `series.map(func)` |
| 15 | +* if the input is a list-like or dict-like, `series.apply(func)` is equivalent to `series.agg(func)` (which is subtly different than `series.apply`) |
| 16 | + |
| 17 | +In contrast, `DataFrame.apply` has a consistent behavior: |
| 18 | + |
| 19 | +* if the input is a callable, `df.apply(func)` always calls `func` on each column in the DataFrame, so it is similar to `[func(col) for _, col in df.items()]` + wrapping functionality |
| 21 | +* if the input is a list-like or dict-like, `df.apply` calls each item in the list/dict and wraps the result as needed. So, for example, if the input is a list, `df.apply(func_list)` is equivalent to `[df.apply(func) for func in func_list]` + wrapping functionality |
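
As a rough illustration of the column-wise behavior described above, the sketch below compares `df.apply(func)` with calling `func` on each column by hand. The frame `df` and the helper `col_total` are made up for this example, and the "wrapping" step is only approximated with a plain `pd.Series` constructor.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def col_total(col: pd.Series) -> int:
    # Hypothetical helper: receives a whole column (a Series), not one element.
    return col.sum()


# What DataFrame.apply does for a single callable ...
applied = df.apply(col_total)

# ... is roughly a per-column call plus wrapping of the results.
by_hand = pd.Series({name: col_total(col) for name, col in df.items()})

print(applied.to_dict())
print(by_hand.to_dict())
```
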
| 22 | + |
| 23 | +This PDEP proposes that: |
| 24 | + |
| 25 | +- The current complex behavior of `Series.apply` will be deprecated in Pandas 2.2. |
| 26 | +- Single callables given to the `.apply` method of `Series` will in Pandas 3.0 always be called on the whole `Series`, so `series.apply(func)` will become similar to `func(series)`. |
| 27 | +- Lists or dicts of callables given to `Series.apply` will in Pandas 3.0 always result in `Series.apply` being called with each element of the list/dict. |
| 28 | + |
| 29 | +In short, this PDEP proposes changing `Series.apply` to be more similar to how `DataFrame.apply` works on single dataframe columns, i.e. operate on the whole series. If a user wants to map a callable to each element of a Series, they should be directed to use `Series.map` instead of using `Series.apply`. |
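
A minimal sketch of the distinction, using only `Series.map` and a direct function call, both of which exist today. The names `ser` and `series_total` are illustrative, and the comment about `Series.apply` describes what this PDEP proposes, not what released pandas versions necessarily do.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

# Element-wise: the callable sees one value at a time.
doubled = ser.map(lambda x: x * 2)


def series_total(s: pd.Series) -> int:
    # Hypothetical helper that needs the whole Series at once.
    return s.sum()


# Series-wise: the callable sees the whole Series.
# Under this proposal, ser.apply(series_total) would behave like this call.
total = series_total(ser)

print(doubled.tolist(), total)
```
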
| 30 | + |
| 31 | +## Motivation |
| 32 | + |
| 33 | +`Series.apply` is currently a very complex method, whose behaviour will differ depending on the nature of its input. |
| 34 | + |
| 35 | +`Series.apply` & `Series.map` currently often behave very similarly, but differ enough that it is confusing when to use one over the other, and especially when `Series.apply` is a bad choice. |
| 36 | + |
| 37 | +Also, calling `Series.apply` currently gives a different result than the per-column result from calling `DataFrame.apply`, which can be confusing for users who expect `Series.apply` to be the `Series` version of `DataFrame.apply`, similar to how `Series.agg` is the `Series` version of `DataFrame.agg`. For example, some functions that currently work fine with `DataFrame.apply` may fail or be very slow when given to `Series.apply`, or may give a different result than the per-column result from `DataFrame.apply`. |
| 38 | + |
| 39 | +### Similarities and differences between `Series.apply` and `Series.map` |
| 40 | + |
| 41 | +The main similarity between the methods is that they both fall back to `Series._map_values`, which in turn uses `algorithms.map_array` or `ExtensionArray.map` as relevant. |
| 42 | + |
| 43 | +The differences are many, but each one is relatively minor: |
| 44 | + |
| 45 | +1. `Series.map` has a `na_action` parameter, which `Series.apply` doesn't |
| 46 | +2. `Series.apply` can take advantage of numpy ufuncs, which `Series.map` can't |
| 47 | +3. `Series.apply` can take `args` and `**kwargs`, which `Series.map` can't |
| 48 | +4. `Series.apply` is more general and can take a string, e.g. `"sum"`, or lists or dicts of inputs which `Series.map` can't. |
| 49 | +5. A numpy ufunc is called on the whole Series when given to `Series.apply`, but on each element of the Series when given to `Series.map`. |
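
Difference 3 need not block a move to `Series.map`: extra arguments can be bound ahead of time with `functools.partial`. This is a generic Python workaround rather than anything proposed by this PDEP, and `add_offset` is a hypothetical helper.

```python
from functools import partial

import pandas as pd


def add_offset(value, offset=0):
    # Hypothetical helper taking an extra keyword argument.
    return value + offset


ser = pd.Series([1, 2, 3])

# Series.apply forwards extra keyword arguments to the callable ...
via_apply = ser.apply(add_offset, offset=10)

# ... while for Series.map the argument can be bound beforehand.
via_map = ser.map(partial(add_offset, offset=10))

print(via_apply.tolist(), via_map.tolist())
```
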
| 50 | + |
| 51 | +In addition, `Series.apply` has some functionality, which `Series.map` does not, but which has already been deprecated: |
| 52 | + |
| 53 | +6. `Series.apply` has a `convert_dtype` parameter (deprecated in pandas 2.1, see [GH52257](https://github.com/pandas-dev/pandas/pull/52257)) |
| 54 | +7. `Series.apply` will return a DataFrame if its result is a list of Series (deprecated in pandas 2.1, see [GH52123](https://github.com/pandas-dev/pandas/pull/52123)). |
| 55 | + |
| 56 | +### Similarities and differences between `Series.apply` and `DataFrame.apply` |
| 57 | + |
| 58 | +`Series.apply` and `DataFrame.apply` are similar when given numpy ufuncs as inputs, but when given non-ufuncs as inputs they behave differently: `series.apply(func)` is similar to `series.map(func)`, while `DataFrame.apply(func)` calls the input on each column series and combines the results. |
| 59 | + |
| 60 | +If given a list-like or dict-like, `Series.apply` will behave similarly to `Series.agg`, while `DataFrame.apply` will call each element in the list-like/dict-like on each column and combine the results. |
| 61 | + |
| 62 | +Also `DataFrame.apply` has some parameters (`raw` and `result_type`) which are relevant for a 2D DataFrame, but may not be relevant for `Series.apply`, because `Series` is a 1D structure. |
| 63 | + |
| 64 | +## Examples of problems with the current way `Series.apply` works |
| 65 | + |
| 66 | +The above similarities and many minor differences make for confusing and overly complex rules for when it's a good idea to use `Series.apply` over `Series.map`, and vice versa, and for when a callable will work well with `Series.apply` versus `DataFrame.apply`. Some examples are shown below. |
| 67 | + |
| 68 | +First some setup: |
| 69 | + |
| 70 | +```python |
| 71 | +>>> import numpy as np |
| 72 | +>>> import pandas as pd |
| 73 | +>>> |
| 74 | +>>> small_ser = pd.Series([1, 2, 3]) |
| 75 | +>>> large_ser = pd.Series(range(100_000)) |
| 76 | +``` |
| 77 | + |
| 78 | +### 1. string vs numpy funcs in `Series.apply` |
| 79 | + |
| 80 | +```python |
| 81 | +>>> small_ser.apply("sum") |
| 82 | +6 |
| 83 | +>>> small_ser.apply(np.sum) |
| 84 | +0 1 |
| 85 | +1 2 |
| 86 | +2 3 |
| 87 | +dtype: int64 |
| 88 | +``` |
| 89 | + |
| 90 | +It will surprise users that these two give different results. Also, anyone using the second pattern is probably making a mistake. |
| 91 | + |
| 92 | +Note that giving `np.sum` to `DataFrame.apply` aggregates properly: |
| 93 | + |
| 94 | +```python |
| 95 | +>>> pd.DataFrame(small_ser).apply(np.sum) |
| 96 | +0 6 |
| 97 | +dtype: int64 |
| 98 | +``` |
| 99 | + |
| 100 | +This PDEP proposes that callables will be applied to the whole `Series`, so in the future we will have: |
| 101 | + |
| 102 | +```python |
| 103 | +>>> small_ser.apply(np.sum) |
| 104 | +6 |
| 105 | +``` |
| 106 | + |
| 107 | +### 2. Callables vs. list/dict of callables |
| 108 | + |
| 109 | +Giving functions and lists/dicts of functions will give different results: |
| 110 | + |
| 111 | +```python |
| 112 | +>>> small_ser.apply(np.sum) |
| 113 | +0 1 |
| 114 | +1 2 |
| 115 | +2 3 |
| 116 | +dtype: int64 |
| 117 | +>>> small_ser.apply([np.sum]) |
| 118 | +sum 6 |
| 119 | +dtype: int64 |
| 120 | +``` |
| 121 | + |
| 122 | +Also with non-numpy callables: |
| 123 | + |
| 124 | +```python |
| 125 | +>>> small_ser.apply(lambda x: x.sum()) |
| 126 | +AttributeError: 'int' object has no attribute 'sum' |
| 127 | +>>> small_ser.apply([lambda x: x.sum()]) |
| 128 | +<lambda> 6 |
| 129 | +dtype: int64 |
| 130 | +``` |
| 131 | + |
| 132 | +In both cases above the difference is that `Series.apply` operates element-wise, if given a callable, but series-wise if given a list/dict of callables. |
| 133 | + |
| 134 | +This PDEP proposes that callables will be applied to the whole `Series`, so in the future we will have: |
| 135 | + |
| 136 | +```python |
| 137 | +>>> small_ser.apply(lambda x: x.sum()) |
| 138 | +6 |
| 139 | +>>> small_ser.apply([lambda x: x.sum()]) |
| 140 | +<lambda> 6 |
| 141 | +dtype: int64 |
| 142 | +``` |
| 143 | + |
| 144 | +### 3. Functions in `Series.apply` |
| 145 | + |
| 146 | +The `Series.apply` docstring has examples that use lambdas, but using lambdas in `Series.apply` is often bad practice because of poor performance: |
| 147 | + |
| 148 | +```python |
| 149 | +>>> %timeit large_ser.apply(lambda x: x + 1) |
| 150 | +24.1 ms ± 88.8 µs per loop |
| 151 | +``` |
| 152 | + |
| 153 | +Currently, `Series` does not have a method that makes a callable operate on a series' data. Instead users need to use `Series.pipe` for that operation in order for the operation to be efficient: |
| 154 | + |
| 155 | +```python |
| 156 | +>>> %timeit large_ser.pipe(lambda x: x + 1) |
| 157 | +44 µs ± 363 ns per loop |
| 158 | +``` |
| 159 | + |
| 160 | +(The reason for the above performance difference is that with `apply` the callable is called on each single element, while with `pipe` it is called once and evaluates `x.__add__(1)`, which operates on the whole array.) |
| 161 | + |
| 162 | +Note also that `.pipe` operates on the `Series`, while `apply` currently operates on each element in the data, so the two may give different results in some cases. |
| 163 | + |
| 164 | +This PDEP proposes that callables will be applied to the whole `Series`, so in the future `Series.apply` will be as fast as `Series.pipe`. |
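
The comparison above can also be reproduced outside IPython with the standard-library `timeit` module. This is only a sketch: absolute timings depend on the machine and the pandas version, so none are quoted here, and the repeat count of 5 is an arbitrary choice. `Series.map` stands in for the element-wise path.

```python
import timeit

import pandas as pd

large_ser = pd.Series(range(100_000))

# Element-wise: the lambda is called once per value.
elementwise = large_ser.map(lambda x: x + 1)
# Series-wise: the lambda is called once with the whole Series.
serieswise = large_ser.pipe(lambda s: s + 1)

# Both approaches produce the same values ...
same_values = elementwise.tolist() == serieswise.tolist()

# ... but typically at very different cost.
t_map = timeit.timeit(lambda: large_ser.map(lambda x: x + 1), number=5)
t_pipe = timeit.timeit(lambda: large_ser.pipe(lambda s: s + 1), number=5)
print(f"map: {t_map:.4f}s  pipe: {t_pipe:.4f}s  same values: {same_values}")
```
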
| 165 | + |
| 166 | +### 4. ufuncs in `Series.apply` vs. normal functions |
| 167 | + |
| 168 | +Performance-wise, ufuncs are fine in `Series.apply`, but non-ufunc functions are not: |
| 169 | + |
| 170 | +```python |
| 171 | +>>> %timeit large_ser.apply(np.sqrt) |
| 172 | +71.6 µs ± 1.17 µs per loop |
| 173 | +>>> %timeit large_ser.apply(lambda x:np.sqrt(x)) |
| 174 | +63.6 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) |
| 175 | +``` |
| 176 | + |
| 177 | +It is difficult to understand why ufuncs are fast in `apply`, while other callables are slow in `apply` (answer: it's because ufuncs operate on the whole Series, while other callables operate elementwise). |
| 178 | + |
| 179 | +This PDEP proposes that callables will be applied to the whole `Series`, so in the future non-ufunc functions in `Series.apply` will be as fast as ufuncs. |
| 180 | + |
| 181 | +### 5. callables in `Series.apply` are slow, callables in `DataFrame.apply` are fast |
| 182 | + |
| 183 | +Above it was shown that using (non-ufunc) callables in `Series.apply` is bad performance-wise. On the other hand, using them in `DataFrame.apply` is fine: |
| 184 | + |
| 185 | +```python |
| 186 | +>>> %timeit large_ser.apply(lambda x: x + 1) |
| 187 | +24.3 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) |
| 188 | +>>> %timeit pd.DataFrame(large_ser).apply(lambda x: x + 1) |
| 189 | +160 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each) |
| 190 | +``` |
| 191 | + |
| 192 | +Having callables being fast to use in the `DataFrame.apply` method, but slow in `Series.apply` is confusing for users. |
| 193 | + |
| 194 | +This PDEP proposes that callables will be applied to the whole `Series`, so in the future `Series.apply` will be as fast as `DataFrame.apply` already is. |
| 195 | + |
| 196 | +### 6. callables in `Series.apply` may fail, while callables in `DataFrame.apply` do not, and vice versa |
| 197 | + |
| 198 | +```python |
| 199 | +>>> small_ser.apply(lambda x: x.sum()) |
| 200 | +AttributeError: 'int' object has no attribute 'sum' |
| 201 | +>>> pd.DataFrame(small_ser).apply(lambda x: x.sum()) |
| 202 | +0 6 |
| 203 | +dtype: int64 |
| 204 | +``` |
| 205 | + |
| 206 | +Having callables fail when used in `Series.apply`, but work in `DataFrame.apply`, or vice versa, is confusing for users. |
| 207 | + |
| 208 | +This PDEP proposes that callables will be applied to the whole `Series`, so callables given to `Series.apply` will work the same as when given to `DataFrame.apply`. In the future we will therefore have: |
| 209 | + |
| 210 | +```python |
| 211 | +>>> small_ser.apply(lambda x: x.sum()) |
| 212 | +6 |
| 213 | +>>> pd.DataFrame(small_ser).apply(lambda x: x.sum()) |
| 214 | +0 6 |
| 215 | +dtype: int64 |
| 216 | +``` |
| 217 | + |
| 218 | +### 7. `Series.apply` vs. `Series.agg` |
| 219 | + |
| 220 | +The doc string for `Series.agg` says about the method's `func` parameter: "If a function, must ... work when passed ... to Series.apply". But compare these: |
| 221 | + |
| 222 | +```python |
| 223 | +>>> small_ser.agg(np.sum) |
| 224 | +6 |
| 225 | +>>> small_ser.apply(np.sum) |
| 226 | +0 1 |
| 227 | +1 2 |
| 228 | +2 3 |
| 229 | +dtype: int64 |
| 230 | +``` |
| 231 | + |
| 232 | +Users would expect these two to give the same result. |
| 233 | + |
| 234 | +This PDEP proposes that callables will be applied to the whole `Series`, so in the future we will have: |
| 235 | + |
| 236 | +```python |
| 237 | +>>> small_ser.agg(np.sum) |
| 238 | +6 |
| 239 | +>>> small_ser.apply(np.sum) |
| 240 | +6 |
| 241 | +``` |
| 242 | + |
| 243 | +### 8. dictlikes vs. listlikes in `Series.apply` |
| 244 | + |
| 245 | +Giving a *list* of transforming arguments to `Series.apply` returns a `DataFrame`: |
| 246 | + |
| 247 | +```python |
| 248 | +>>> small_ser.apply(["sqrt", np.abs]) |
| 249 | + sqrt absolute |
| 250 | +0 1.000000 1 |
| 251 | +1 1.414214 2 |
| 252 | +2 1.732051 3 |
| 253 | +``` |
| 254 | + |
| 255 | +But giving a *dict* of transforming arguments returns a `Series` with a `MultiIndex`: |
| 256 | + |
| 257 | +```python |
| 258 | +>>> small_ser.apply({"sqrt": "sqrt", "abs": np.abs}) |
| 259 | +sqrt 0 1.000000 |
| 260 | + 1 1.414214 |
| 261 | + 2 1.732051 |
| 262 | +abs 0 1.000000 |
| 263 | + 1 2.000000 |
| 264 | + 2 3.000000 |
| 265 | +dtype: float64 |
| 266 | +``` |
| 267 | + |
| 268 | +These two should give same-shaped output for consistency. `Series.transform` returns a `DataFrame` in both cases, and I think the dictlike example above should likewise return a `DataFrame`, similar to the listlike example. |
| 269 | + |
| 270 | +Minor additional info: listlikes and dictlikes of aggregation arguments do behave the same, so this is only a problem with dictlikes of transforming arguments when using `apply`. |
| 271 | + |
| 272 | +This PDEP proposes that the result from giving list-likes and dict-likes to `Series.apply` will have the same shape as when given list-likes currently: |
| 273 | + |
| 274 | +```python |
| 275 | +>>> small_ser.apply(["sqrt", np.abs]) |
| 276 | + sqrt absolute |
| 277 | +0 1.000000 1 |
| 278 | +1 1.414214 2 |
| 279 | +2 1.732051 3 |
| 280 | +>>> small_ser.apply({"sqrt": "sqrt", "abs": np.abs}) |
| 281 | + sqrt absolute |
| 282 | +0 1.000000 1 |
| 283 | +1 1.414214 2 |
| 284 | +2 1.732051 3 |
| 285 | +``` |
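
The `Series.transform` behavior mentioned in this section can be checked with a short sketch. It passes callables rather than the string `"sqrt"` to keep the example simple, and it prints only the result type and shape, since column labels and formatting may vary between versions.

```python
import numpy as np
import pandas as pd

small_ser = pd.Series([1, 2, 3])

# A list of transforming callables ...
from_list = small_ser.transform([np.sqrt, np.abs])
# ... and a dict of the same callables.
from_dict = small_ser.transform({"sqrt": np.sqrt, "abs": np.abs})

# Both results are two-dimensional, with one column per transformation.
print(type(from_list).__name__, from_list.shape)
print(type(from_dict).__name__, from_dict.shape)
```
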
| 286 | + |
| 287 | +## Proposal |
| 288 | + |
| 289 | +With the above in mind, it is proposed that: |
| 290 | + |
| 291 | +1. When given a callable, `Series.apply` will always operate on the whole series. I.e. let `series.apply(func)` be similar to `func(series)` + the needed additional functionality. |
| 292 | +2. When given a list-like or dict-like, `Series.apply` will apply each element of the list-like/dict-like to the series. I.e. `series.apply(func_list)` will be similar to `[series.apply(func) for func in func_list]` + the needed additional functionality. |
| 293 | +3. The changes made to `Series.apply` will propagate to `Series.agg` and `Series.transform` as needed. |
| 294 | + |
| 295 | +The difference between `Series.apply()` & `Series.map()` will then be that: |
| 296 | + |
| 297 | +* `Series.apply()` makes the passed-in callable operate on the series, similarly to how `(DataFrame|SeriesGroupBy|DataFrameGroupBy).apply` operate on series. This is very fast and can do almost anything. |
| 298 | +* `Series.map()` makes the passed-in callable operate on each element of the series individually. This is very flexible, but can be very slow, so it should only be used if `Series.apply` can't do it. |
| 299 | + |
| 300 | +So, this API change will help make the Pandas `Series.(apply|map)` API clearer without losing functionality and let their functionality be explained in a simple manner, which would be a win for Pandas. |
| 301 | + |
| 302 | +The result of the above change will be that `Series.apply` operates similarly to how `DataFrame.apply` already works per column, just as `Series.map` operates similarly to how `DataFrame.map` works per column. This will give better coherence between same-named methods on `DataFrame` and `Series`. |
| 303 | + |
| 304 | +## Deprecation process |
| 305 | + |
| 306 | +To change the behavior, the current behavior will have to be deprecated. This can be done by adding a `by_row` parameter to `Series.apply`, where `by_row=False` means that `Series.apply` will not operate elementwise but Series-wise. |
| 307 | + |
| 308 | +So we will have in pandas v2.2: |
| 309 | + |
| 310 | +```python |
| 311 | +def apply(self, ..., by_row: bool | NoDefault = no_default, ...): |
| 312 | +    if by_row is no_default: |
| 313 | +        warn("The by_row parameter will be set to False in the future") |
| 314 | +        by_row = True |
| 315 | +    ... |
| 316 | +``` |
| 317 | + |
| 318 | +In pandas v3.0 the signature will change to: |
| 319 | + |
| 320 | +```python |
| 321 | +def apply(self, ..., by_row: NoDefault = no_default, ...): |
| 322 | +    if by_row is not no_default: |
| 323 | +        warn("Do not use the by_row parameter, it will be removed in the future") |
| 324 | +    ... |
| 325 | +``` |
| 326 | + |
| 327 | +I.e. the `by_row` parameter will be needed in the signature in v3.0 in order to be backward compatible with v2.x, but will have no effect. |
| 328 | + |
| 329 | +In Pandas v4.0, the `by_row` parameter will be removed. |
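
The transition phase described above can be sketched in plain Python with a sentinel default, so that "argument not passed" can be told apart from `True` and `False`. Everything here is a toy model: `ToySeries`, the `_NoDefault` enum and the choice of `FutureWarning` are assumptions made for illustration, not pandas code.

```python
import warnings
from enum import Enum


class _NoDefault(Enum):
    # Sentinel type: lets apply() detect that by_row was not passed at all.
    no_default = "NO_DEFAULT"


no_default = _NoDefault.no_default


class ToySeries:
    """Hypothetical stand-in for a Series; not pandas code."""

    def __init__(self, values):
        self.values = list(values)

    def apply(self, func, by_row=no_default):
        if by_row is no_default:
            # Transition phase: keep the old behavior, but warn about the change.
            warnings.warn(
                "The by_row parameter will be set to False in the future",
                FutureWarning,
                stacklevel=2,
            )
            by_row = True
        if by_row:
            # Old behavior: call func on each element.
            return [func(v) for v in self.values]
        # New behavior: call func once on the whole container.
        return func(self.values)


ser = ToySeries([1, 2, 3])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = ser.apply(lambda x: x + 1)    # warns, element-wise
    new_style = ser.apply(sum, by_row=False)  # silent, whole-series

print(old_style, new_style, len(caught))
```

Passing `by_row=False` explicitly opts in to the new behavior and silences the warning, which is what lets users migrate before the default flips.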
| 330 | + |
| 331 | +## PDEP-13 History |
| 332 | + |
| 333 | +- 24 August 2023: Initial version |