Commit 9cfb8b5

TomAugspurger authored and jreback committed
API: Add string extension type (#27949)
1 parent a8538f2 commit 9cfb8b5

20 files changed, +908 -76 lines changed

ci/code_checks.sh

+4
@@ -266,6 +266,10 @@ if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then
         -k"-from_arrays -from_breaks -from_intervals -from_tuples -set_closed -to_tuples -interval_range"
     RET=$(($RET + $?)) ; echo $MSG "DONE"

+    MSG='Doctests arrays/string_.py' ; echo $MSG
+    pytest -q --doctest-modules pandas/core/arrays/string_.py
+    RET=$(($RET + $?)) ; echo $MSG "DONE"
+
 fi

 ### DOCSTRINGS ###

doc/source/getting_started/basics.rst

+16 -3
@@ -986,7 +986,7 @@ not noted for a particular column will be ``NaN``:

    tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})

-.. _basics.aggregation.mixed_dtypes:
+.. _basics.aggregation.mixed_string:

 Mixed dtypes
 ++++++++++++
@@ -1704,14 +1704,21 @@ built-in string methods. For example:

 .. ipython:: python

-   s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
+   s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
+                 dtype="string")
    s.str.lower()

 Powerful pattern-matching methods are provided as well, but note that
 pattern-matching generally uses `regular expressions
 <https://docs.python.org/3/library/re.html>`__ by default (and in some cases
 always uses them).

+.. note::
+
+   Prior to pandas 1.0, string methods were only available on ``object`` -dtype
+   ``Series``. Pandas 1.0 added the :class:`StringDtype` which is dedicated
+   to strings. See :ref:`text.types` for more.
+
 Please see :ref:`Vectorized String Methods <text.string_methods>` for a complete
 description.


@@ -1925,9 +1932,15 @@ period (time spans) :class:`PeriodDtype` :class:`Period` :class:`arrays.
 sparse :class:`SparseDtype` (none) :class:`arrays.SparseArray` :ref:`sparse`
 intervals :class:`IntervalDtype` :class:`Interval` :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
 nullable integer :class:`Int64Dtype`, ... (none) :class:`arrays.IntegerArray` :ref:`integer_na`
+Strings :class:`StringDtype` :class:`str` :class:`arrays.StringArray` :ref:`text`
 =================== ========================= ================== ============================= =============================

-Pandas uses the ``object`` dtype for storing strings.
+Pandas has two ways to store strings.
+
+1. ``object`` dtype, which can hold any Python object, including strings.
+2. :class:`StringDtype`, which is dedicated to strings.
+
+Generally, we recommend using :class:`StringDtype`. See :ref:`text.types` for more.

 Finally, arbitrary objects may be stored using the ``object`` dtype, but should
 be avoided to the extent possible (for performance and interoperability with
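
For readers of this diff, the two storage options described in the hunk above can be compared with a minimal sketch. It is not part of the commit. It assumes a pandas version that ships ``StringDtype`` (1.0 or later); the sample values and the variable names ``as_object``, ``as_string`` and ``lowered`` are illustrative only.

```python
import pandas as pd

values = ["A", "Aaba", None, "dog"]

# Option 1: object dtype, which can hold any Python object.
# dtype=object is passed explicitly so the result does not depend on
# version-specific string inference.
as_object = pd.Series(values, dtype=object)

# Option 2: the dedicated string extension type, requested by its alias.
as_string = pd.Series(values, dtype="string")

print(as_object.dtype)
print(as_string.dtype)

# The .str accessor works on the string-typed Series; the missing
# element stays missing in the result.
lowered = as_string.str.lower()
print(lowered)
```

The explicit ``dtype="string"`` is the opt-in: without it, this commit leaves the default inference for lists of strings unchanged.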

doc/source/reference/arrays.rst

+25 -1
@@ -24,6 +24,7 @@ Intervals :class:`IntervalDtype` :class:`Interval` :ref:`api.array
 Nullable Integer :class:`Int64Dtype`, ... (none) :ref:`api.arrays.integer_na`
 Categorical :class:`CategoricalDtype` (none) :ref:`api.arrays.categorical`
 Sparse :class:`SparseDtype` (none) :ref:`api.arrays.sparse`
+Strings :class:`StringDtype` :class:`str` :ref:`api.arrays.string`
 =================== ========================= ================== =============================

 Pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`).
@@ -460,6 +461,29 @@ and methods if the :class:`Series` contains sparse values. See
 :ref:`api.series.sparse` for more.


+.. _api.arrays.string:
+
+Text data
+---------
+
+When working with text data, where each valid element is a string or missing,
+we recommend using :class:`StringDtype` (with the alias ``"string"``).
+
+.. autosummary::
+   :toctree: api/
+   :template: autosummary/class_without_autosummary.rst
+
+   arrays.StringArray
+
+.. autosummary::
+   :toctree: api/
+   :template: autosummary/class_without_autosummary.rst
+
+   StringDtype
+
+The ``Series.str`` accessor is available for ``Series`` backed by a :class:`arrays.StringArray`.
+See :ref:`api.series.str` for more.
+

 .. Dtype attributes which are manually listed in their docstrings: including
 .. it here to make sure a docstring page is built for them
@@ -471,4 +495,4 @@ and methods if the :class:`Series` contains sparse values. See
    DatetimeTZDtype.unit
    DatetimeTZDtype.tz
    PeriodDtype.freq
-   IntervalDtype.subtype
+   IntervalDtype.subtype
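
As a hedged illustration of the "Text data" reference section added above (not part of the commit), the sketch below builds a string extension array with ``pd.array`` and uses the ``Series.str`` accessor on a ``Series`` backed by it. It assumes pandas 1.0 or later. The concrete array class printed may differ in later releases or with other storage backends, so it is shown but not relied on.

```python
import pandas as pd

# Build an extension array with the "string" alias; None marks a missing value.
arr = pd.array(["a", None, "c"], dtype="string")

# At the time of this commit the backing class is arrays.StringArray;
# printed here rather than assumed.
print(type(arr).__name__)

# A Series backed by the array exposes the .str accessor.
s = pd.Series(arr)
upper = s.str.upper()
print(upper)
```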
