DOC/ENH: Add documentation and whatsnew entry for ArrowDtype and ArrowExtensionArray #47854
@@ -19,19 +19,20 @@ objects contained with a :class:`Index`, :class:`Series`, or
 For some data types, pandas extends NumPy's type system. String aliases for these types
 can be found at :ref:`basics.dtypes`.
 
-=================== ========================= ================== =============================
-Kind of Data        pandas Data Type          Scalar             Array
-=================== ========================= ================== =============================
-TZ-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp` :ref:`api.arrays.datetime`
-Timedeltas          (none)                    :class:`Timedelta` :ref:`api.arrays.timedelta`
-Period (time spans) :class:`PeriodDtype`      :class:`Period`    :ref:`api.arrays.period`
-Intervals           :class:`IntervalDtype`    :class:`Interval`  :ref:`api.arrays.interval`
-Nullable Integer    :class:`Int64Dtype`, ...  (none)             :ref:`api.arrays.integer_na`
-Categorical         :class:`CategoricalDtype` (none)             :ref:`api.arrays.categorical`
-Sparse              :class:`SparseDtype`      (none)             :ref:`api.arrays.sparse`
-Strings             :class:`StringDtype`      :class:`str`       :ref:`api.arrays.string`
-Boolean (with NA)   :class:`BooleanDtype`     :class:`bool`      :ref:`api.arrays.bool`
-=================== ========================= ================== =============================
+=================== ========================= ============================= =============================
+Kind of Data        pandas Data Type          Scalar                        Array
+=================== ========================= ============================= =============================
+PyArrow             :class:`ArrowDtype`       Python Scalars or :class:`NA` :ref:`api.arrays.arrow`
+TZ-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp`            :ref:`api.arrays.datetime`
+Timedeltas          (none)                    :class:`Timedelta`            :ref:`api.arrays.timedelta`
+Period (time spans) :class:`PeriodDtype`      :class:`Period`               :ref:`api.arrays.period`
+Intervals           :class:`IntervalDtype`    :class:`Interval`             :ref:`api.arrays.interval`
+Nullable Integer    :class:`Int64Dtype`, ...  (none)                        :ref:`api.arrays.integer_na`
+Categorical         :class:`CategoricalDtype` (none)                        :ref:`api.arrays.categorical`
+Sparse              :class:`SparseDtype`      (none)                        :ref:`api.arrays.sparse`
+Strings             :class:`StringDtype`      :class:`str`                  :ref:`api.arrays.string`
+Boolean (with NA)   :class:`BooleanDtype`     :class:`bool`                 :ref:`api.arrays.bool`
+=================== ========================= ============================= =============================
 
 pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`).
 The top-level :meth:`array` method can be used to create a new array, which may be
@@ -42,6 +43,40 @@ stored in a :class:`Series`, :class:`Index`, or as a column in a :class:`DataFra
 
    array
 
+.. _api.arrays.arrow:
+
+PyArrow
+-------
+
+The :class:`arrays.ArrowExtensionArray` is backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray` with a
+:external+pyarrow:py:class:`pyarrow.DataType` instead of a NumPy array and data type. The ``.dtype`` of an :class:`arrays.ArrowExtensionArray`
+is an :class:`ArrowDtype`.
+
+`PyArrow <https://arrow.apache.org/docs/python/index.html>`__ provides array and `data type <https://arrow.apache.org/docs/python/api/datatypes.html>`__
+support similar to NumPy's, including first-class nullability support for all data types, immutability, and more.
+
+.. note::
+
+   For string types (``pyarrow.string()``, ``string[pyarrow]``), PyArrow support is still facilitated
+   by :class:`arrays.ArrowStringArray` and ``StringDtype("pyarrow")``. See the :ref:`string section <api.arrays.string>`
+   below.
+
+While individual values in an :class:`arrays.ArrowExtensionArray` are stored as PyArrow objects, scalars are **returned**
+as Python scalars corresponding to the data type, e.g. a PyArrow int64 will be returned as a Python int, or :class:`NA` for missing
+values.
+
+.. autosummary::
+   :toctree: api/
+   :template: autosummary/class_without_autosummary.rst
+
+   arrays.ArrowExtensionArray
Review comment: Should this be
Reply: Yeah, this PR depends on #47818, where these will be defined.
+
+.. autosummary::
+   :toctree: api/
+   :template: autosummary/class_without_autosummary.rst
+
+   ArrowDtype
Review comment: Should this be
CI is red, I guess this is the cause: https://github.com/pandas-dev/pandas/runs/7616698322?check_suite_focus=true#step:6:153
Reply: Yeah, this PR depends on #47818, where these will be defined.
+
 .. _api.arrays.datetime:
 
 Datetimes
@@ -14,6 +14,38 @@ including other versions of pandas.
 Enhancements
 ~~~~~~~~~~~~
 
+.. _whatsnew_150.enhancements.arrow:
+
+Native PyArrow-backed ExtensionArray
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+With `PyArrow <https://arrow.apache.org/docs/python/index.html>`__ installed, users can now create pandas objects
+that are backed by a ``pyarrow.ChunkedArray`` and ``pyarrow.DataType``.
+
+The ``dtype`` argument can accept a string of a `pyarrow data type <https://arrow.apache.org/docs/python/api/datatypes.html>`__
+with ``pyarrow`` in brackets, e.g. ``int64[pyarrow]``, or, for pyarrow data types that take parameters, an :class:`ArrowDtype`
+initialized with a ``pyarrow.DataType``.
+
+.. ipython:: python
+
+    import pyarrow as pa
+    ser_float = pd.Series([1.0, 2.0, None], dtype="float32[pyarrow]")
+    ser_float
+
+    list_of_int_type = pd.ArrowDtype(pa.list_(pa.int64()))
+    ser_list = pd.Series([[1, 2], [3, None]], dtype=list_of_int_type)
+    ser_list
+
+Most operations are supported and have been implemented using `pyarrow compute <https://arrow.apache.org/docs/python/api/compute.html>`__ functions.
+We recommend installing the latest version of PyArrow to access the most recently implemented compute functions.
+
+.. ipython:: python
+
+    ser_list.take([1, 0])
Review comment: Looks like variables are cleaned from one block to the next, and sphinx is not happy here for
Reply: FYI, that's not the case, all variables are kept for (at least) a full file, so you can define a variable in one ipython code block, and use it in the next.
Reply: I ended up just combining both blocks anyways.
Reply: Yep, and that's perfect here, just wanted to clarify ;)
+    ser_float * 5
+    ser_float.mean()
+    ser_float.dropna()
+
 .. _whatsnew_150.enhancements.dataframe_exchange:
 
 DataFrame exchange protocol implementation
Review comment: Small nit: but since this list is not alphabetically ordered, can you move this to the last entry? Since this is an experimental new feature, while the first entries here are widely used data types, it seems better to leave the others as first listed data types.
Reply: Fair point. Moved this entry to the bottom