DOC: Clarify difference between StringDtype(pyarrow) and ArrowDtype(string) #52184

Merged
merged 2 commits into from
Mar 25, 2023
7 changes: 4 additions & 3 deletions doc/source/reference/arrays.rst
@@ -93,9 +93,10 @@ PyArrow type pandas extension type NumPy

.. note::

-For string types (``pyarrow.string()``, ``string[pyarrow]``), PyArrow support is still facilitated
-by :class:`arrays.ArrowStringArray` and ``StringDtype("pyarrow")``. See the :ref:`string section <api.arrays.string>`
-below.
+Pyarrow-backed string support is provided by both ``pd.StringDtype("pyarrow")`` and ``pd.ArrowDtype(pa.string())``.
+``pd.StringDtype("pyarrow")`` is described below in the :ref:`string section <api.arrays.string>`
+and will be returned if the string alias ``"string[pyarrow]"`` is specified. ``pd.ArrowDtype(pa.string())``
+generally has better interoperability with :class:`ArrowDtype` of different types.

While individual values in an :class:`arrays.ArrowExtensionArray` are stored as PyArrow objects, scalars are **returned**
as Python scalars corresponding to the data type, e.g. a PyArrow int64 will be returned as Python int, or :class:`NA` for missing
20 changes: 19 additions & 1 deletion doc/source/user_guide/pyarrow.rst
@@ -35,6 +35,23 @@ which is similar to a NumPy array. To construct these from the main pandas data
df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")
df

+.. note::

+The string alias ``"string[pyarrow]"`` maps to ``pd.StringDtype("pyarrow")`` which is not equivalent to
+specifying ``dtype=pd.ArrowDtype(pa.string())``. Generally, operations on the data will behave similarly
+except ``pd.StringDtype("pyarrow")`` can return NumPy-backed nullable types while ``pd.ArrowDtype(pa.string())``
+will return :class:`ArrowDtype`.

+.. ipython:: python

+import pyarrow as pa
+data = list("abc")
Review comment (Member): Needs a pyarrow import

+ser_sd = pd.Series(data, dtype="string[pyarrow]")
+ser_ad = pd.Series(data, dtype=pd.ArrowDtype(pa.string()))
+ser_ad.dtype == ser_sd.dtype
+ser_sd.str.contains("a")
+ser_ad.str.contains("a")

For PyArrow types that accept parameters, you can pass in a PyArrow type with those parameters
into :class:`ArrowDtype` to use in the ``dtype`` parameter.

@@ -106,6 +123,7 @@ The following are just some examples of operations that are accelerated by nativ

.. ipython:: python

+import pyarrow as pa
ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]")
ser.mean()
ser + ser
@@ -115,7 +133,7 @@ The following are just some examples of operations that are accelerated by nativ
ser.isna()
ser.fillna(0)

-ser_str = pd.Series(["a", "b", None], dtype="string[pyarrow]")
+ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))
Review comment (Member): same here probably

ser_str.str.startswith("a")

from datetime import datetime