| 1 | +Title: pandas extension arrays |
| 2 | +Date: 2019-01-04 |
| 3 | + |
| 4 | +# pandas extension arrays |
| 5 | + |
| 6 | +Extensibility was a major theme in pandas development over the last couple of |
| 7 | +releases. This post introduces the pandas extension array interface: the |
| 8 | +motivation behind it and how it might affect you as a pandas user. Finally, we |
| 9 | +look at how extension arrays may shape the future of pandas. |
| 10 | + |
| 11 | +Extension Arrays are just one of the changes in pandas 0.24.0. See the |
| 12 | +[whatsnew][whatsnew] for a full changelog. |
| 13 | + |
| 14 | +## The Motivation |
| 15 | + |
| 16 | +Pandas is built on top of NumPy. You could roughly define a Series as a wrapper |
| 17 | +around a NumPy array, and a DataFrame as a collection of Series with a shared |
| 18 | +index. That's not entirely correct for several reasons, but I want to focus on |
| 19 | +the "wrapper around a NumPy array" part. It'd be more correct to say "wrapper |
| 20 | +around an array-like object". |
| 21 | + |
| 22 | +Pandas mostly uses NumPy's builtin data representation; we've restricted it in |
| 23 | +places and extended it in others. For example, pandas' early users cared greatly |
| 24 | +about timezone-aware datetimes, which NumPy doesn't support. So pandas |
| 25 | +internally defined a `DatetimeTZ` dtype (which mimics a NumPy dtype), and |
| 26 | +allowed you to use that dtype in `Index`, `Series`, and as a column in a |
| 27 | +`DataFrame`. That dtype carried around the tzinfo, but wasn't itself a valid |
| 28 | +NumPy dtype. |
| 29 | + |
| 30 | +As another example, consider `Categorical`. This actually composes *two* arrays: |
| 31 | +one for the `categories` and one for the `codes`. But it can be stored in a |
| 32 | +`DataFrame` like any other column. |
| 33 | + |
| 34 | +Each of these extension types pandas added is useful on its own, but carries a |
| 35 | +high maintenance cost. Large sections of the codebase need to be aware of how to |
| 36 | +handle a NumPy array or one of these other kinds of special arrays. This made |
| 37 | +adding new extension types to pandas very difficult. |
| 38 | + |
| 39 | +Anaconda, Inc. had a client who regularly dealt with datasets with IP addresses. |
| 40 | +They wondered if it made sense to add an [IPArray][IPArray] to pandas. In the |
| 41 | +end, we didn't think it passed the cost-benefit test for inclusion in pandas |
| 42 | +*itself*, but we were interested in defining an interface for third-party |
| 43 | +extensions to pandas. Any object implementing this interface would be allowed in |
| 44 | +pandas. I was able to write [cyberpandas][cyberpandas] outside of pandas, but it |
| 45 | +feels like using any other dtype built into pandas. |
| 46 | + |
| 47 | +## The Current State |
| 48 | + |
| 49 | +As of pandas 0.24.0, all of pandas' internal extension arrays (Categorical, |
| 50 | +Datetime with Timezone, Period, Interval, and Sparse) are now built on top of |
| 51 | +the ExtensionArray interface. Users shouldn't notice many changes. The main |
| 52 | +thing you'll notice is that things are cast to `object` dtype in fewer places, |
| 53 | +meaning your code will run faster and your types will be more stable. This |
| 54 | +includes storing `Period` and `Interval` data in `Series` (which were previously |
| 55 | +cast to object dtype). |
| 56 | + |
| 57 | +Additionally, we'll be able to add *new* extension arrays with relative ease. |
| 58 | +For example, 0.24.0 (optionally) solved one of pandas' longest-standing pain |
| 59 | +points: missing values casting integer-dtype values to float. |
| 60 | + |
| 61 | + |
| 62 | +```python |
| 63 | +>>> int_ser = pd.Series([1, 2], index=[0, 2]) |
| 64 | +>>> int_ser |
| 65 | +0 1 |
| 66 | +2 2 |
| 67 | +dtype: int64 |
| 68 | + |
| 69 | +>>> int_ser.reindex([0, 1, 2]) |
| 70 | +0 1.0 |
| 71 | +1 NaN |
| 72 | +2 2.0 |
| 73 | +dtype: float64 |
| 74 | +``` |
| 75 | + |
| 76 | +With the new [IntegerArray][IntegerArray] and nullable integer dtypes, we can |
| 77 | +natively represent integer data with missing values. |
| 78 | + |
| 79 | +```python |
| 80 | +>>> int_ser = pd.Series([1, 2], index=[0, 2], dtype=pd.Int64Dtype()) |
| 81 | +>>> int_ser |
| 82 | +0 1 |
| 83 | +2 2 |
| 84 | +dtype: Int64 |
| 85 | + |
| 86 | +>>> int_ser.reindex([0, 1, 2]) |
| 87 | +0 1 |
| 88 | +1 NaN |
| 89 | +2 2 |
| 90 | +dtype: Int64 |
| 91 | +``` |
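As a small self-contained sketch of the same idea (assuming pandas 0.24 or later, and using the `"Int64"` string alias in place of `pd.Int64Dtype()`), the reindexed Series keeps its integer dtype, and the missing slot is still visible to the usual missing-data tools:

```python
import pandas as pd

# Nullable integer Series; "Int64" (capital I) is the string alias for pd.Int64Dtype().
int_ser = pd.Series([1, 2], index=[0, 2], dtype="Int64")

# Reindexing introduces a missing value at label 1 without casting to float.
reindexed = int_ser.reindex([0, 1, 2])

print(reindexed.dtype)            # Int64
print(reindexed.isna().tolist())  # [False, True, False]
print(reindexed.sum())            # 3 (the missing value is skipped by default)
```

The exact way the missing value is displayed differs between pandas versions, so the sketch checks it with `isna()` rather than relying on the printed output.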
| 92 | + |
| 93 | +One thing this does slightly change is how you should access the raw (unlabeled) |
| 94 | +arrays stored inside a Series or Index, which is occasionally useful. Perhaps |
| 95 | +the method you're calling only works with NumPy arrays, or perhaps you want to |
| 96 | +disable automatic alignment. |
| 97 | + |
| 98 | +In the past, you'd hear things like "Use `.values` to extract the NumPy array |
| 99 | +from a Series or DataFrame." A good resource would add that this is |
| 100 | +not *entirely* true, since there are some exceptions. I'd like to delve into |
| 101 | +those exceptions. |
| 102 | + |
| 103 | +The fundamental problem with `.values` is that it serves two purposes: |
| 104 | + |
| 105 | +1. Extracting the array backing a Series, Index, or DataFrame |
| 106 | +2. Converting the Series, Index, or DataFrame to a NumPy array |
| 107 | + |
| 108 | +As we saw above, the "array" backing a Series or Index might not be a NumPy |
| 109 | +array; it may instead be an extension array (from pandas or a third-party |
| 110 | +library). For example, consider `Categorical`: |
| 111 | + |
| 112 | +```python |
| 113 | +>>> cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']) |
| 114 | +>>> ser = pd.Series(cat) |
| 115 | +>>> ser |
| 116 | +0 a |
| 117 | +1 b |
| 118 | +2 a |
| 119 | +dtype: category |
| 120 | +Categories (3, object): [a, b, c] |
| 121 | + |
| 122 | +>>> ser.values |
| 123 | +[a, b, a] |
| 124 | +Categories (3, object): [a, b, c] |
| 125 | +``` |
| 126 | + |
| 127 | +In this case `.values` is a Categorical, not a NumPy array. For period-dtype |
| 128 | +data, `.values` returns a NumPy array of `Period` objects, which is expensive to |
| 129 | +create. For timezone-aware data, `.values` converts to UTC and *drops* the |
| 130 | +timezone info. These kind of surprises (different types, or expensive or lossy |
| 131 | +conversions) stem from trying to shoehorn these extension arrays into a NumPy |
| 132 | +array. But the entire point of an extension array is for representing data NumPy |
| 133 | +*can't* natively represent. |
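To make the "different types" surprise concrete, here is a minimal sketch that compares what `.values` returns for a plain integer Series and for a categorical one. The variable names are only for illustration:

```python
import pandas as pd

plain = pd.Series([1, 2, 3])
cat_ser = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))

# Same attribute, two different kinds of object depending on the dtype.
print(type(plain.values).__name__)    # ndarray
print(type(cat_ser.values).__name__)  # Categorical
```

Code that assumes `.values` is always an `ndarray` works on the first Series and can break on the second.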
| 134 | + |
| 135 | +To solve the `.values` problem, we've split its roles into two dedicated methods: |
| 136 | + |
| 137 | +1. Use `.array` to get a zero-copy reference to the underlying data |
| 138 | +2. Use `.to_numpy()` to get a (potentially expensive, lossy) NumPy array of the |
| 139 | + data. |
| 140 | + |
| 141 | +So with our Categorical example, |
| 142 | + |
| 143 | +```python |
| 144 | +>>> ser.array |
| 145 | +[a, b, a] |
| 146 | +Categories (3, object): [a, b, c] |
| 147 | + |
| 148 | +>>> ser.to_numpy() |
| 149 | +array(['a', 'b', 'a'], dtype=object) |
| 150 | +``` |
| 151 | + |
| 152 | +To summarize: |
| 153 | + |
| 154 | +- `.array` will *always* be an ExtensionArray, and is always a zero-copy |
| 155 | + reference back to the data. |
| 156 | +- `.to_numpy()` is *always* a NumPy array, so you can reliably call |
| 157 | + ndarray-specific methods on it. |
| 158 | + |
| 159 | +You shouldn't ever need `.values` anymore. |
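The same split applies to timezone-aware data. The sketch below assumes pandas 0.24 or later and an installed timezone database; the `America/Chicago` zone and the variable names are arbitrary choices for illustration. Passing `dtype` to `.to_numpy()` makes the lossy conversion an explicit decision rather than a side effect:

```python
import numpy as np
import pandas as pd

tz_ser = pd.Series(pd.date_range("2019-01-01", periods=2, tz="America/Chicago"))

# The backing extension array: keeps the timezone.
backing = tz_ser.array

# Lossless but slow: an object-dtype ndarray of tz-aware Timestamp objects.
as_objects = tz_ser.to_numpy(dtype=object)

# Fast but lossy: converted to UTC, timezone dropped.
as_utc = tz_ser.to_numpy(dtype="datetime64[ns]")

print(backing.dtype)      # timezone-aware datetime dtype
print(as_objects.dtype)   # object
print(as_utc.dtype)       # datetime64[ns]
```

Midnight on January 1 in Chicago is 06:00 UTC, so `as_utc[0]` equals `np.datetime64("2019-01-01T06:00:00")`, while `as_objects[0]` still carries its timezone.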
| 160 | + |
| 161 | +## Possible Future Paths |
| 162 | + |
| 163 | +Extension Arrays open up quite a few exciting opportunities. Currently, pandas |
| 164 | +represents string data using Python objects in a NumPy array, which is slow. |
| 165 | +Libraries like [Apache Arrow][arrow] provide native support for variable-length |
| 166 | +strings, and the [Fletcher][fletcher] library provides pandas extension arrays |
| 167 | +for Arrow arrays. Extension arrays will also allow [GeoPandas][geopandas] to store |
| 168 | +geometry data more efficiently. Pandas (or third-party libraries) will be able to |
| 169 | +support nested data, data with units, geo data, and GPU arrays. Keep an eye on the |
| 170 | +[pandas ecosystem][eco] page, which will keep track of third-party extension |
| 171 | +arrays. It's an exciting time for pandas development. |
| 172 | + |
| 173 | +## Other Thoughts |
| 174 | + |
| 175 | +I'd like to emphasize that this is an *interface*, and not a concrete array |
| 176 | +implementation. We are *not* reimplementing NumPy here in pandas. Rather, this |
| 177 | +is a way to take any array-like data structure (one or more NumPy arrays, an |
| 178 | +Apache Arrow array, a CuPy array) and place it inside a DataFrame. I think |
| 179 | +getting pandas out of the array business, and instead thinking about |
| 180 | +higher-level tabular data things, is a healthy development for the project. |
| 181 | + |
| 182 | +This works perfectly with NumPy's [`__array_ufunc__`][ufunc] protocol and |
| 183 | +[NEP-18][nep18]. You'll be able to use the familiar NumPy API on objects that |
| 184 | +aren't backed by NumPy memory. |
| 185 | + |
| 186 | +## Upgrade |
| 187 | + |
| 188 | +These new goodies are all available in the recently released pandas 0.24. |
| 189 | + |
| 190 | +conda: |
| 191 | + |
| 192 | + conda install -c conda-forge pandas |
| 193 | + |
| 194 | +pip: |
| 195 | + |
| 196 | + pip install --upgrade pandas |
| 197 | + |
| 198 | +As always, we're happy to hear feedback on the [mailing list][ml], |
| 199 | +[@pandas-dev][twitter], or [issue tracker][tracker]. |
| 200 | + |
| 201 | +Thanks to the many contributors, maintainers, and [institutional |
| 202 | +partners][partners] involved in the pandas community. |
| 203 | + |
| 204 | + |
| 205 | +[IPArray]: https://github.com/pandas-dev/pandas/issues/18767 |
| 206 | +[cyberpandas]: https://cyberpandas.readthedocs.io |
| 207 | +[IntegerArray]: http://pandas.pydata.org/pandas-docs/version/0.24/reference/api/pandas.arrays.IntegerArray.html |
| 208 | +[fletcher]: https://github.com/xhochy/fletcher |
| 209 | +[arrow]: https://arrow.apache.org |
| 210 | +[ufunc]: https://docs.scipy.org/doc/numpy-1.13.0/neps/ufunc-overrides.html |
| 211 | +[nep18]: https://www.numpy.org/neps/nep-0018-array-function-protocol.html |
| 212 | +[ml]: https://mail.python.org/mailman/listinfo/pandas-dev |
| 213 | +[twitter]: https://twitter.com/pandas_dev |
| 214 | +[tracker]: https://github.com/pandas-dev/pandas/issues |
| 215 | +[partners]: https://github.com/pandas-dev/pandas-governance/blob/master/people.md |
| 216 | +[eco]: http://pandas.pydata.org/pandas-docs/stable/ecosystem.html#extension-data-types |
| 217 | +[whatsnew]: http://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html |
| 218 | +[geopandas]: https://github.com/geopandas/geopandas |