
Commit e02f737

TomAugspurger authored and jorisvandenbossche committed
DOC: add doc on ExtensionArray and extending pandas (pandas-dev#19936)
1 parent 0ca77b3 commit e02f737

7 files changed: +312 -196 lines changed


doc/source/developer.rst

-43
@@ -140,46 +140,3 @@ As an example of fully-formed metadata:
'metadata': None}
],
'pandas_version': '0.20.0'}

.. _developer.register-accessors:

Registering Custom Accessors
----------------------------

Libraries can use the decorators
:func:`pandas.api.extensions.register_dataframe_accessor`,
:func:`pandas.api.extensions.register_series_accessor`, and
:func:`pandas.api.extensions.register_index_accessor`, to add additional "namespaces" to
pandas objects. All of these follow a similar convention: you decorate a class, providing the name of attribute to add. The
class's `__init__` method gets the object being decorated. For example:

.. code-block:: python

    @pd.api.extensions.register_dataframe_accessor("geo")
    class GeoAccessor(object):
        def __init__(self, pandas_obj):
            self._obj = pandas_obj

        @property
        def center(self):
            # return the geographic center point of this DataFarme
            lon = self._obj.latitude
            lat = self._obj.longitude
            return (float(lon.mean()), float(lat.mean()))

        def plot(self):
            # plot this array's data on a map, e.g., using Cartopy
            pass

Now users can access your methods using the `geo` namespace:

>>> ds = pd.DataFrame({'longitude': np.linspace(0, 10),
...                    'latitude': np.linspace(0, 20)})
>>> ds.geo.center
(5.0, 10.0)
>>> ds.geo.plot()
# plots data on a map

This can be a convenient way to extend pandas objects without subclassing them.
If you write a custom accessor, make a pull request adding it to our
:ref:`ecosystem` page.

doc/source/ecosystem.rst

+35
@@ -262,3 +262,38 @@ Data validation
Engarde is a lightweight library used to explicitly state your assumptions about your datasets
and check that they're *actually* true.

.. _ecosystem.extensions:

Extension Data Types
--------------------

Pandas provides an interface for defining
:ref:`extension types <extending.extension-types>` to extend NumPy's type
system. The following libraries implement that interface to provide types not
found in NumPy or pandas, which work well with pandas' data containers.

`cyberpandas`_
~~~~~~~~~~~~~~

Cyberpandas provides an extension type for storing arrays of IP Addresses. These
arrays can be stored inside pandas' Series and DataFrame.

.. _ecosystem.accessors:

Accessors
---------

A directory of projects providing
:ref:`extension accessors <extending.register-accessors>`. This is for users to
discover new accessors and for library authors to coordinate on the namespace.

============== ========== =========================
Library        Accessor   Classes
============== ========== =========================
`cyberpandas`_ ``ip``     ``Series``
`pdvega`_      ``vgplot`` ``Series``, ``DataFrame``
============== ========== =========================

.. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest
.. _pdvega: https://jakevdp.github.io/pdvega/

doc/source/extending.rst

+269
@@ -0,0 +1,269 @@
.. _extending:

****************
Extending Pandas
****************

While pandas provides a rich set of methods, containers, and data types, your
needs may not be fully satisfied. Pandas offers a few options for extending
pandas.

.. _extending.register-accessors:

Registering Custom Accessors
----------------------------

Libraries can use the decorators
:func:`pandas.api.extensions.register_dataframe_accessor`,
:func:`pandas.api.extensions.register_series_accessor`, and
:func:`pandas.api.extensions.register_index_accessor`, to add additional
"namespaces" to pandas objects. All of these follow a similar convention: you
decorate a class, providing the name of the attribute to add. The class's
``__init__`` method gets the object being decorated. For example:

.. code-block:: python

    @pd.api.extensions.register_dataframe_accessor("geo")
    class GeoAccessor(object):
        def __init__(self, pandas_obj):
            self._obj = pandas_obj

        @property
        def center(self):
            # return the geographic center point of this DataFrame
            lat = self._obj.latitude
            lon = self._obj.longitude
            return (float(lon.mean()), float(lat.mean()))

        def plot(self):
            # plot this array's data on a map, e.g., using Cartopy
            pass

Now users can access your methods using the ``geo`` namespace:

>>> ds = pd.DataFrame({'longitude': np.linspace(0, 10),
...                    'latitude': np.linspace(0, 20)})
>>> ds.geo.center
(5.0, 10.0)
>>> ds.geo.plot()
# plots data on a map

This can be a convenient way to extend pandas objects without subclassing them.
If you write a custom accessor, make a pull request adding it to our
:ref:`ecosystem` page.

.. _extending.extension-types:

Extension Types
---------------

Pandas defines an interface for implementing data types and arrays that *extend*
NumPy's type system. Pandas itself uses the extension system for some types
that aren't built into NumPy (categorical, period, interval, datetime with
timezone).

Libraries can define a custom array and data type. When pandas encounters these
objects, they will be handled properly (i.e. not converted to an ndarray of
objects). Many methods like :func:`pandas.isna` will dispatch to the extension
type's implementation.

If you're building a library that implements the interface, please publicize it
on :ref:`ecosystem.extensions`.

The interface consists of two classes.

``ExtensionDtype``
^^^^^^^^^^^^^^^^^^

An ``ExtensionDtype`` is similar to a ``numpy.dtype`` object. It describes the
data type. Implementors are responsible for a few unique items like the name.

One particularly important item is the ``type`` property. This should be the
class that is the scalar type for your data. For example, if you were writing an
extension array for IP Address data, this might be ``ipaddress.IPv4Address``.

See the `extension dtype source`_ for the interface definition.

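As a rough, hypothetical sketch only (the class name and attribute values below
are made up; the source file linked above defines the actual requirements), a
dtype for IP address data might start out like this:

.. code-block:: python

    import ipaddress

    from pandas.core.dtypes.base import ExtensionDtype


    class IPDtype(ExtensionDtype):
        """A hypothetical dtype describing arrays of IPv4 addresses."""

        # the scalar type returned when a single element is accessed
        type = ipaddress.IPv4Address
        # the string alias / display name for this dtype
        name = 'ip'

        @classmethod
        def construct_from_string(cls, string):
            # re-create the dtype from its string alias
            if string == cls.name:
                return cls()
            raise TypeError("Cannot construct an 'IPDtype' from "
                            "'{}'".format(string))
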
``ExtensionArray``
^^^^^^^^^^^^^^^^^^

This class provides all the array-like functionality. ExtensionArrays are
limited to 1 dimension. An ExtensionArray is linked to an ExtensionDtype via the
``dtype`` attribute.

Pandas makes no restrictions on how an extension array is created via its
``__new__`` or ``__init__``, and puts no restrictions on how you store your
data. We do require that your array be convertible to a NumPy array, even if
this is relatively expensive (as it is for ``Categorical``).

Extension arrays may be backed by none, one, or many NumPy arrays. For example,
``pandas.Categorical`` is an extension array backed by two arrays,
one for codes and one for categories. An array of IPv6 addresses may
be backed by a NumPy structured array with two fields, one for the
lower 64 bits and one for the upper 64 bits. Or they may be backed
by some other storage type, like Python lists.

See the `extension array source`_ for the interface definition. The docstrings
and comments contain guidance for properly implementing the interface.

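To give a feel for the shape of the interface, here is a deliberately partial,
hypothetical sketch of an IPv4 array backed by a single NumPy integer array,
pairing it with the ``IPDtype`` sketched above. The methods shown are
illustrative assumptions; the source file linked below is the authoritative
list of required methods and their signatures.

.. code-block:: python

    import ipaddress

    import numpy as np

    from pandas.core.arrays.base import ExtensionArray


    class IPArray(ExtensionArray):
        """A hypothetical 1-D array of IPv4 addresses stored as uint32."""

        def __init__(self, values):
            # store the data however you like; here, a single NumPy array
            self.data = np.asarray(values, dtype='uint32')

        @property
        def dtype(self):
            # link this array to its ExtensionDtype (sketched above)
            return IPDtype()

        def __len__(self):
            return len(self.data)

        def __getitem__(self, item):
            if isinstance(item, int):
                # scalar access returns the dtype's scalar ``type``
                return ipaddress.IPv4Address(int(self.data[item]))
            # slices and boolean masks return a new IPArray
            return type(self)(self.data[item])

        def isna(self):
            # this sketch arbitrarily treats 0 as the missing-value marker
            return self.data == 0

        def copy(self, deep=False):
            return type(self)(self.data.copy())

        # ...plus the remaining methods and properties required by the base
        # class (e.g. for taking, concatenating, and construction from
        # sequences), which are omitted here.

As the text above notes, once both pieces exist pandas can hold the array inside
a ``Series`` or ``DataFrame`` without converting it to an ndarray of objects,
and methods like :func:`pandas.isna` dispatch to ``IPArray.isna``.
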
.. _extension dtype source: https://github.com/pandas-dev/pandas/blob/master/pandas/core/dtypes/base.py
.. _extension array source: https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/base.py

.. _extending.subclassing-pandas:

Subclassing pandas Data Structures
----------------------------------

.. warning:: There are some easier alternatives to consider before subclassing ``pandas`` data structures.

   1. Extensible method chains with :ref:`pipe <basics.pipe>` (a short sketch follows this list)

   2. Use *composition*. See `here <http://en.wikipedia.org/wiki/Composition_over_inheritance>`_.

   3. Extending by :ref:`registering an accessor <extending.register-accessors>`

   4. Extending by :ref:`extension type <extending.extension-types>`

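For the first alternative, a minimal sketch of what a ``pipe``-based method
chain can look like (the functions and data here are made up purely for
illustration):

.. code-block:: python

    import pandas as pd

    def add_total(df):
        # plain function: add a 'total' column summing each row
        return df.assign(total=df.sum(axis=1))

    def flag_large(df, threshold):
        # plain function: mark rows whose total exceeds a threshold
        return df.assign(large=df['total'] > threshold)

    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
    result = df.pipe(add_total).pipe(flag_large, threshold=7)

This keeps the extra behaviour in ordinary functions instead of a subclass.
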
This section describes how to subclass ``pandas`` data structures to meet more specific needs. There are two points that need attention:

1. Override constructor properties.
2. Define original properties.

.. note::

   You can find a nice example in the `geopandas <https://github.com/geopandas/geopandas>`_ project.

Override Constructor Properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each data structure has several *constructor properties* for returning a new
data structure as the result of an operation. By overriding these properties,
you can retain subclasses through ``pandas`` data manipulations.

There are 3 constructor properties to be defined:

- ``_constructor``: Used when a manipulation result has the same dimensions as the original.
- ``_constructor_sliced``: Used when a manipulation result has one lower dimension than the original, such as slicing a single column from a ``DataFrame``.
- ``_constructor_expanddim``: Used when a manipulation result has one higher dimension than the original, such as ``Series.to_frame()`` and ``DataFrame.to_panel()``.

The following table shows how ``pandas`` data structures define constructor properties by default.

===========================  =======================  =============
Property Attributes          ``Series``               ``DataFrame``
===========================  =======================  =============
``_constructor``             ``Series``               ``DataFrame``
``_constructor_sliced``      ``NotImplementedError``  ``Series``
``_constructor_expanddim``   ``DataFrame``            ``Panel``
===========================  =======================  =============

The example below shows how to define ``SubclassedSeries`` and ``SubclassedDataFrame`` overriding constructor properties.

.. code-block:: python

    class SubclassedSeries(Series):

        @property
        def _constructor(self):
            return SubclassedSeries

        @property
        def _constructor_expanddim(self):
            return SubclassedDataFrame

    class SubclassedDataFrame(DataFrame):

        @property
        def _constructor(self):
            return SubclassedDataFrame

        @property
        def _constructor_sliced(self):
            return SubclassedSeries

.. code-block:: python

    >>> s = SubclassedSeries([1, 2, 3])
    >>> type(s)
    <class '__main__.SubclassedSeries'>

    >>> to_framed = s.to_frame()
    >>> type(to_framed)
    <class '__main__.SubclassedDataFrame'>

    >>> df = SubclassedDataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
    >>> df
       A  B  C
    0  1  4  7
    1  2  5  8
    2  3  6  9

    >>> type(df)
    <class '__main__.SubclassedDataFrame'>

    >>> sliced1 = df[['A', 'B']]
    >>> sliced1
       A  B
    0  1  4
    1  2  5
    2  3  6
    >>> type(sliced1)
    <class '__main__.SubclassedDataFrame'>

    >>> sliced2 = df['A']
    >>> sliced2
    0    1
    1    2
    2    3
    Name: A, dtype: int64
    >>> type(sliced2)
    <class '__main__.SubclassedSeries'>

Define Original Properties
^^^^^^^^^^^^^^^^^^^^^^^^^^

To let subclassed data structures have additional properties, you must let ``pandas`` know which properties are added. ``pandas`` maps unknown properties to data names by overriding ``__getattribute__``. Defining original properties can be done in one of two ways:

1. Define ``_internal_names`` and ``_internal_names_set`` for temporary properties which WILL NOT be passed to manipulation results.
2. Define ``_metadata`` for normal properties which will be passed to manipulation results.

Below is an example defining two original properties: ``internal_cache`` as a temporary property and ``added_property`` as a normal property.

.. code-block:: python

    class SubclassedDataFrame2(DataFrame):

        # temporary properties
        _internal_names = pd.DataFrame._internal_names + ['internal_cache']
        _internal_names_set = set(_internal_names)

        # normal properties
        _metadata = ['added_property']

        @property
        def _constructor(self):
            return SubclassedDataFrame2

.. code-block:: python

    >>> df = SubclassedDataFrame2({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
    >>> df
       A  B  C
    0  1  4  7
    1  2  5  8
    2  3  6  9

    >>> df.internal_cache = 'cached'
    >>> df.added_property = 'property'

    >>> df.internal_cache
    cached
    >>> df.added_property
    property

    # properties defined in _internal_names are reset after manipulation
    >>> df[['A', 'B']].internal_cache
    AttributeError: 'SubclassedDataFrame2' object has no attribute 'internal_cache'

    # properties defined in _metadata are retained
    >>> df[['A', 'B']].added_property
    property

doc/source/index.rst.template

+1
@@ -157,5 +157,6 @@ See the package overview for more detail about what's in the library.
{% if not single_doc -%}
developer
internals
extending
release
{% endif -%}
