DOC/ENH: Add documentation and whatsnew entry for ArrowDtype and ArrowExtensionArray #47854
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,19 +19,20 @@ objects contained with a :class:`Index`, :class:`Series`, or | |
For some data types, pandas extends NumPy's type system. String aliases for these types | ||
can be found at :ref:`basics.dtypes`. | ||
|
||
=================== ========================= ================== ============================= | ||
Kind of Data pandas Data Type Scalar Array | ||
=================== ========================= ================== ============================= | ||
TZ-aware datetime :class:`DatetimeTZDtype` :class:`Timestamp` :ref:`api.arrays.datetime` | ||
Timedeltas (none) :class:`Timedelta` :ref:`api.arrays.timedelta` | ||
Period (time spans) :class:`PeriodDtype` :class:`Period` :ref:`api.arrays.period` | ||
Intervals :class:`IntervalDtype` :class:`Interval` :ref:`api.arrays.interval` | ||
Nullable Integer :class:`Int64Dtype`, ... (none) :ref:`api.arrays.integer_na` | ||
Categorical :class:`CategoricalDtype` (none) :ref:`api.arrays.categorical` | ||
Sparse :class:`SparseDtype` (none) :ref:`api.arrays.sparse` | ||
Strings :class:`StringDtype` :class:`str` :ref:`api.arrays.string` | ||
Boolean (with NA) :class:`BooleanDtype` :class:`bool` :ref:`api.arrays.bool` | ||
=================== ========================= ================== ============================= | ||
=================== ========================= ============================= ============================= | ||
Kind of Data pandas Data Type Scalar Array | ||
=================== ========================= ============================= ============================= | ||
PyArrow :class:`ArrowDtype` Python Scalars or :class:`NA` :ref:`api.arrays.arrow` | ||
TZ-aware datetime :class:`DatetimeTZDtype` :class:`Timestamp` :ref:`api.arrays.datetime` | ||
Timedeltas (none) :class:`Timedelta` :ref:`api.arrays.timedelta` | ||
Period (time spans) :class:`PeriodDtype` :class:`Period` :ref:`api.arrays.period` | ||
Intervals :class:`IntervalDtype` :class:`Interval` :ref:`api.arrays.interval` | ||
Nullable Integer :class:`Int64Dtype`, ... (none) :ref:`api.arrays.integer_na` | ||
Categorical :class:`CategoricalDtype` (none) :ref:`api.arrays.categorical` | ||
Sparse :class:`SparseDtype` (none) :ref:`api.arrays.sparse` | ||
Strings :class:`StringDtype` :class:`str` :ref:`api.arrays.string` | ||
Boolean (with NA) :class:`BooleanDtype` :class:`bool` :ref:`api.arrays.bool` | ||
=================== ========================= ============================= ============================= | ||
|
||
pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`). | ||
The top-level :meth:`array` method can be used to create a new array, which may be | ||
|
@@ -42,6 +43,40 @@ stored in a :class:`Series`, :class:`Index`, or as a column in a :class:`DataFra | |
|
||
array | ||
|
||
.. _api.arrays.arrow: | ||
|
||
PyArrow | ||
------- | ||
|
||
The :class:`arrays.ArrowExtensionArray` is backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray` with a | ||
:external+pyarrow:py:class:`pyarrow.DataType` instead of a NumPy array and data type. The ``.dtype`` of a :class:`arrays.ArrowExtensionArray` | ||
is an :class:`ArrowDtype`. | ||
|
||
`PyArrow <https://arrow.apache.org/docs/python/index.html>`__ provides array and `data type <https://arrow.apache.org/docs/python/api/datatypes.html>`__ | ||
support similar to NumPy's, including first-class nullability support for all data types, immutability, and more. | ||
|
||
.. note:: | ||
|
||
For string types (``pyarrow.string()``, ``string[pyarrow]``), PyArrow support is still facilitated | ||
by :class:`arrays.ArrowStringArray` and ``StringDtype("pyarrow")``. See the :ref:`string section <api.arrays.string>` | ||
below. | ||
|
||
While individual values in an :class:`arrays.ArrowExtensionArray` are stored as PyArrow objects, scalars are **returned** | ||
as Python scalars corresponding to the data type, e.g. a PyArrow int64 will be returned as a Python int, or :class:`NA` for missing | ||
values. | ||
|
||
.. autosummary:: | ||
:toctree: api/ | ||
:template: autosummary/class_without_autosummary.rst | ||
|
||
arrays.ArrowExtensionArray | ||
Should this be
Yeah, this PR depends on #47818, where these will be defined. |
||
|
||
.. autosummary:: | ||
:toctree: api/ | ||
:template: autosummary/class_without_autosummary.rst | ||
|
||
ArrowDtype | ||
Should this be
CI is red, I guess this is the cause: https://github.com/pandas-dev/pandas/runs/7616698322?check_suite_focus=true#step:6:153
Yeah, this PR depends on #47818, where these will be defined. |
||
|
||
.. _api.arrays.datetime: | ||
|
||
Datetimes | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,6 +14,37 @@ including other versions of pandas. | |
Enhancements | ||
~~~~~~~~~~~~ | ||
|
||
.. _whatsnew_150.enhancements.arrow: | ||
|
||
Native PyArrow-backed ExtensionArray | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
With `PyArrow <https://arrow.apache.org/docs/python/index.html>`__ installed, users can now create pandas objects | ||
that are backed by a ``pyarrow.ChunkedArray`` and ``pyarrow.DataType``. | ||
|
||
The ``dtype`` argument can accept a string of a `pyarrow data type <https://arrow.apache.org/docs/python/api/datatypes.html>`__ | ||
with ``pyarrow`` in brackets, e.g. ``int64[pyarrow]``, or, for pyarrow data types that take parameters, an :class:`ArrowDtype` | ||
initialized with a ``pyarrow.DataType``. | ||
|
||
.. ipython:: python | ||
|
||
import pyarrow as pa | ||
ser_float = pd.Series([1.0, 2.0, None], dtype="float32[pyarrow]") | ||
ser_float | ||
|
||
ser_list = pd.Series([[1, 2]], dtype=pd.ArrowDtype(pa.list_(pa.int64()))) | ||
I'd probably add a short comment here on what we're creating, and maybe add another row to the Series. Maybe when rendered it becomes clearer, but reading just this it feels like it may take some time for readers to understand what you're doing here. |
||
ser_list | ||
|
||
Most operations are supported and have been implemented using `pyarrow compute <https://arrow.apache.org/docs/python/api/compute.html>`__ functions. | ||
We recommend installing the latest version of PyArrow to access the most recently implemented compute functions. | ||
|
||
.. ipython:: python | ||
|
||
ser_list.take([0, 0]) | ||
ser_float * 5 | ||
ser_float.mean() | ||
ser_float.dropna() | ||
|
||
.. _whatsnew_150.enhancements.dataframe_exchange: | ||
|
||
DataFrame exchange protocol implementation | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -159,8 +159,25 @@ def to_pyarrow_type( | |||||
|
||||||
class ArrowExtensionArray(OpsMixin, ExtensionArray): | ||||||
""" | ||||||
Base class for ExtensionArray backed by Arrow ChunkedArray. | ||||||
""" | ||||||
Pandas ExtensionArray backed by a PyArrow ChunkedArray. | ||||||
|
||||||
Parameters | ||||||
---------- | ||||||
values: pyarrow.Array or pyarrow.ChunkedArray | ||||||
I think this space is needed for Sphinx to render this correctly |
||||||
|
||||||
Returns | ||||||
------- | ||||||
ArrowExtensionArray | ||||||
|
||||||
Notes | ||||||
----- | ||||||
Most methods are implemented using `pyarrow compute functions <https://arrow.apache.org/docs/python/api/compute.html>`__. | ||||||
Some methods may either raise an exception or emit a ``PerformanceWarning`` if an | ||||||
associated compute function is not available in the installed version of PyArrow. | ||||||
|
||||||
Please install the latest version of PyArrow to enable this functionality and avoid | ||||||
potential bugs in prior versions of PyArrow. | ||||||
""" # noqa: E501 | ||||||
Maybe a comment on what E501 is and why linting fails here, so readers don't need to look it up? |
||||||
|
||||||
_data: pa.ChunkedArray | ||||||
|
||||||
|
Small nit: since this list is not alphabetically ordered, can you move this to the last entry? This is an experimental new feature, while the first entries here are widely used data types, so it seems better to keep those listed first.
Fair point. Moved this entry to the bottom.