
Commit e048dc0

mroeschke authored and noatamir committed
DOC/ENH: Add documentation and whatsnew entry for ArrowDtype and ArrowExtensionArray (pandas-dev#47854)
* Start arrow docs
* Add whatsnew example
* Add note about strings
* Address review
* modify whatsnew
* Add experimental note
* Address Joris' comments
* Fix repr, and add missing import
* Add extra sections
* Add quotes and example
1 parent dc1cd51 commit e048dc0

6 files changed: +174 -18 lines changed

doc/source/reference/arrays.rst

+52 -13
@@ -19,19 +19,20 @@ objects contained with a :class:`Index`, :class:`Series`, or
 For some data types, pandas extends NumPy's type system. String aliases for these types
 can be found at :ref:`basics.dtypes`.

-=================== ========================= ================== =============================
-Kind of Data        pandas Data Type          Scalar             Array
-=================== ========================= ================== =============================
-TZ-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp` :ref:`api.arrays.datetime`
-Timedeltas          (none)                    :class:`Timedelta` :ref:`api.arrays.timedelta`
-Period (time spans) :class:`PeriodDtype`      :class:`Period`    :ref:`api.arrays.period`
-Intervals           :class:`IntervalDtype`    :class:`Interval`  :ref:`api.arrays.interval`
-Nullable Integer    :class:`Int64Dtype`, ...  (none)             :ref:`api.arrays.integer_na`
-Categorical         :class:`CategoricalDtype` (none)             :ref:`api.arrays.categorical`
-Sparse              :class:`SparseDtype`      (none)             :ref:`api.arrays.sparse`
-Strings             :class:`StringDtype`      :class:`str`       :ref:`api.arrays.string`
-Boolean (with NA)   :class:`BooleanDtype`     :class:`bool`      :ref:`api.arrays.bool`
-=================== ========================= ================== =============================
+=================== ========================= ============================= =============================
+Kind of Data        pandas Data Type          Scalar                        Array
+=================== ========================= ============================= =============================
+TZ-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp`            :ref:`api.arrays.datetime`
+Timedeltas          (none)                    :class:`Timedelta`            :ref:`api.arrays.timedelta`
+Period (time spans) :class:`PeriodDtype`      :class:`Period`               :ref:`api.arrays.period`
+Intervals           :class:`IntervalDtype`    :class:`Interval`             :ref:`api.arrays.interval`
+Nullable Integer    :class:`Int64Dtype`, ...  (none)                        :ref:`api.arrays.integer_na`
+Categorical         :class:`CategoricalDtype` (none)                        :ref:`api.arrays.categorical`
+Sparse              :class:`SparseDtype`      (none)                        :ref:`api.arrays.sparse`
+Strings             :class:`StringDtype`      :class:`str`                  :ref:`api.arrays.string`
+Boolean (with NA)   :class:`BooleanDtype`     :class:`bool`                 :ref:`api.arrays.bool`
+PyArrow             :class:`ArrowDtype`       Python Scalars or :class:`NA` :ref:`api.arrays.arrow`
+=================== ========================= ============================= =============================

 pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`).
 The top-level :meth:`array` method can be used to create a new array, which may be
@@ -42,6 +43,44 @@ stored in a :class:`Series`, :class:`Index`, or as a column in a :class:`DataFra

    array

+.. _api.arrays.arrow:
+
+PyArrow
+-------
+
+.. warning::
+
+    This feature is experimental, and the API can change in a future release without warning.
+
+The :class:`arrays.ArrowExtensionArray` is backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray` with a
+:external+pyarrow:py:class:`pyarrow.DataType` instead of a NumPy array and data type. The ``.dtype`` of a :class:`arrays.ArrowExtensionArray`
+is an :class:`ArrowDtype`.
+
+`Pyarrow <https://arrow.apache.org/docs/python/index.html>`__ provides similar array and `data type <https://arrow.apache.org/docs/python/api/datatypes.html>`__
+support as NumPy including first-class nullability support for all data types, immutability and more.
+
+.. note::
+
+    For string types (``pyarrow.string()``, ``string[pyarrow]``), PyArrow support is still facilitated
+    by :class:`arrays.ArrowStringArray` and ``StringDtype("pyarrow")``. See the :ref:`string section <api.arrays.string>`
+    below.
+
+While individual values in an :class:`arrays.ArrowExtensionArray` are stored as a PyArrow objects, scalars are **returned**
+as Python scalars corresponding to the data type, e.g. a PyArrow int64 will be returned as Python int, or :class:`NA` for missing
+values.
+
+.. autosummary::
+   :toctree: api/
+   :template: autosummary/class_without_autosummary.rst
+
+   arrays.ArrowExtensionArray
+
+.. autosummary::
+   :toctree: api/
+   :template: autosummary/class_without_autosummary.rst
+
+   ArrowDtype
+
 .. _api.arrays.datetime:

 Datetimes

doc/source/whatsnew/v1.5.0.rst

+34
@@ -24,6 +24,40 @@ https://github.com/pandas-dev/pandas-stubs for more information.

 We thank VirtusLab and Microsoft for their initial, significant contributions to ``pandas-stubs``

+.. _whatsnew_150.enhancements.arrow:
+
+Native PyArrow-backed ExtensionArray
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+With `Pyarrow <https://arrow.apache.org/docs/python/index.html>`__ installed, users can now create pandas objects
+that are backed by a ``pyarrow.ChunkedArray`` and ``pyarrow.DataType``.
+
+The ``dtype`` argument can accept a string of a `pyarrow data type <https://arrow.apache.org/docs/python/api/datatypes.html>`__
+with ``pyarrow`` in brackets e.g. ``"int64[pyarrow]"`` or, for pyarrow data types that take parameters, a :class:`ArrowDtype`
+initialized with a ``pyarrow.DataType``.
+
+.. ipython:: python
+
+    import pyarrow as pa
+    ser_float = pd.Series([1.0, 2.0, None], dtype="float32[pyarrow]")
+    ser_float
+
+    list_of_int_type = pd.ArrowDtype(pa.list_(pa.int64()))
+    ser_list = pd.Series([[1, 2], [3, None]], dtype=list_of_int_type)
+    ser_list
+
+    ser_list.take([1, 0])
+    ser_float * 5
+    ser_float.mean()
+    ser_float.dropna()
+
+Most operations are supported and have been implemented using `pyarrow compute <https://arrow.apache.org/docs/python/api/compute.html>`__ functions.
+We recommend installing the latest version of PyArrow to access the most recently implemented compute functions.
+
+.. warning::
+
+    This feature is experimental, and the API can change in a future release without warning.
+
 .. _whatsnew_150.enhancements.dataframe_interchange:

 DataFrame interchange protocol implementation

pandas/arrays/__init__.py

+2
@@ -4,6 +4,7 @@
 See :ref:`extending.extension-types` for more.
 """
 from pandas.core.arrays import (
+    ArrowExtensionArray,
     ArrowStringArray,
     BooleanArray,
     Categorical,
@@ -19,6 +20,7 @@
 )

 __all__ = [
+    "ArrowExtensionArray",
     "ArrowStringArray",
     "BooleanArray",
     "Categorical",

pandas/core/arrays/arrow/array.py

+41 -2
@@ -159,8 +159,47 @@ def to_pyarrow_type(

 class ArrowExtensionArray(OpsMixin, ExtensionArray):
     """
-    Base class for ExtensionArray backed by Arrow ChunkedArray.
-    """
+    Pandas ExtensionArray backed by a PyArrow ChunkedArray.
+
+    .. warning::
+
+       ArrowExtensionArray is considered experimental. The implementation and
+       parts of the API may change without warning.
+
+    Parameters
+    ----------
+    values : pyarrow.Array or pyarrow.ChunkedArray
+
+    Attributes
+    ----------
+    None
+
+    Methods
+    -------
+    None
+
+    Returns
+    -------
+    ArrowExtensionArray
+
+    Notes
+    -----
+    Most methods are implemented using `pyarrow compute functions. <https://arrow.apache.org/docs/python/api/compute.html>`__
+    Some methods may either raise an exception or raise a ``PerformanceWarning`` if an
+    associated compute function is not available based on the installed version of PyArrow.
+
+    Please install the latest version of PyArrow to enable the best functionality and avoid
+    potential bugs in prior versions of PyArrow.
+
+    Examples
+    --------
+    Create an ArrowExtensionArray with :func:`pandas.array`:
+
+    >>> pd.array([1, 1, None], dtype="int64[pyarrow]")
+    <ArrowExtensionArray>
+    [1, 1, <NA>]
+    Length: 3, dtype: int64[pyarrow]
+    """  # noqa: E501 (http link too long)

     _data: pa.ChunkedArray
     _dtype: ArrowDtype

pandas/core/arrays/arrow/dtype.py

+44 -3
@@ -20,9 +20,47 @@
 @register_extension_dtype
 class ArrowDtype(StorageExtensionDtype):
     """
-    Base class for dtypes for ArrowExtensionArray.
-    Modeled after BaseMaskedDtype
-    """
+    An ExtensionDtype for PyArrow data types.
+
+    .. warning::
+
+       ArrowDtype is considered experimental. The implementation and
+       parts of the API may change without warning.
+
+    While most ``dtype`` arguments can accept the "string"
+    constructor, e.g. ``"int64[pyarrow]"``, ArrowDtype is useful
+    if the data type contains parameters like ``pyarrow.timestamp``.
+
+    Parameters
+    ----------
+    pyarrow_dtype : pa.DataType
+        An instance of a `pyarrow.DataType <https://arrow.apache.org/docs/python/api/datatypes.html#factory-functions>`__.
+
+    Attributes
+    ----------
+    pyarrow_dtype
+
+    Methods
+    -------
+    None
+
+    Returns
+    -------
+    ArrowDtype
+
+    Examples
+    --------
+    >>> import pyarrow as pa
+    >>> pd.ArrowDtype(pa.int64())
+    int64[pyarrow]
+
+    Types with parameters must be constructed with ArrowDtype.
+
+    >>> pd.ArrowDtype(pa.timestamp("s", tz="America/New_York"))
+    timestamp[s, tz=America/New_York][pyarrow]
+    >>> pd.ArrowDtype(pa.list_(pa.int64()))
+    list<item: int64>[pyarrow]
+    """  # noqa: E501

     _metadata = ("storage", "pyarrow_dtype")  # type: ignore[assignment]

@@ -37,6 +75,9 @@ def __init__(self, pyarrow_dtype: pa.DataType) -> None:
             )
         self.pyarrow_dtype = pyarrow_dtype

+    def __repr__(self) -> str:
+        return self.name
+
     @property
     def type(self):
         """

scripts/validate_rst_title_capitalization.py

+1
@@ -150,6 +150,7 @@
     "LZMA",
     "Numba",
     "Timestamp",
+    "PyArrow",
 }

 CAP_EXCEPTIONS_DICT = {word.lower(): word for word in CAPITALIZATION_EXCEPTIONS}

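The last hunk registers "PyArrow" as a capitalization exception, and the context line builds a lowercase-to-canonical lookup from the set. The diff does not show how the script consumes that lookup, so the following is only a sketch: the set is a hypothetical subset limited to the entries visible in the hunk, and ``canonical_form`` is a hypothetical helper, not a function from the script.

```python
# Hypothetical subset: only the entries visible in the diff hunk.
CAPITALIZATION_EXCEPTIONS = {"LZMA", "Numba", "Timestamp", "PyArrow"}

# Same comprehension as the context line in the diff:
# lowercase spelling -> canonical spelling.
CAP_EXCEPTIONS_DICT = {word.lower(): word for word in CAPITALIZATION_EXCEPTIONS}


def canonical_form(word: str) -> str:
    """Return the registered spelling of ``word``, or ``word`` unchanged.

    Hypothetical helper for illustration only.
    """
    return CAP_EXCEPTIONS_DICT.get(word.lower(), word)


print(canonical_form("pyarrow"))  # registered exception
print(canonical_form("heading"))  # ordinary word, returned as given
```

With the new entry in place, a case-insensitive lookup of "pyarrow" resolves to the spelling "PyArrow", which is consistent with the headings added in the documentation hunks of this commit.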