Commit 336b8d6

TomAugspurger authored and jreback committed
NumPyBackedExtensionArray (#24227)
1 parent ab55d05 commit 336b8d6

23 files changed, +1079 -49 lines changed

doc/source/api.rst

+1
Original file line number | Diff line number | Diff line change
@@ -3997,6 +3997,7 @@ objects.
39973997
api.extensions.register_index_accessor
39983998
api.extensions.ExtensionDtype
39993999
api.extensions.ExtensionArray
4000+
arrays.PandasArray
40004001

40014002
.. This is to prevent warnings in the doc build. We don't want to encourage
40024003
.. these methods.

doc/source/basics.rst

+32-9
@@ -71,8 +71,10 @@ the **array** property
7171
s.array
7272
s.index.array
7373
74-
Depending on the data type (see :ref:`basics.dtypes`), :attr:`~Series.array`
75-
be either a NumPy array or an :ref:`ExtensionArray <extending.extension-type>`.
74+
:attr:`~Series.array` will always be an :class:`~pandas.api.extensions.ExtensionArray`.
75+
The exact details of what an ``ExtensionArray`` is and why pandas uses them is a bit
76+
beyond the scope of this introduction. See :ref:`basics.dtypes` for more.
77+
7678
If you know you need a NumPy array, use :meth:`~Series.to_numpy`
7779
or :meth:`numpy.asarray`.
7880

@@ -81,10 +83,30 @@ or :meth:`numpy.asarray`.
8183
s.to_numpy()
8284
np.asarray(s)
8385
84-
For Series and Indexes backed by NumPy arrays (like we have here), this will
85-
be the same as :attr:`~Series.array`. When the Series or Index is backed by
86-
a :class:`~pandas.api.extension.ExtensionArray`, :meth:`~Series.to_numpy`
87-
may involve copying data and coercing values.
86+
When the Series or Index is backed by
87+
an :class:`~pandas.api.extension.ExtensionArray`, :meth:`~Series.to_numpy`
88+
may involve copying data and coercing values. See :ref:`basics.dtypes` for more.
89+
90+
:meth:`~Series.to_numpy` gives some control over the ``dtype`` of the
91+
resulting :class:`ndarray`. For example, consider datetimes with timezones.
92+
NumPy doesn't have a dtype to represent timezone-aware datetimes, so there
93+
are two possibly useful representations:
94+
95+
1. An object-dtype :class:`ndarray` with :class:`Timestamp` objects, each
96+
with the correct ``tz``
97+
2. A ``datetime64[ns]`` -dtype :class:`ndarray`, where the values have
98+
been converted to UTC and the timezone discarded
99+
100+
Timezones may be preserved with ``dtype=object``
101+
102+
.. ipython:: python
103+
104+
ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))
105+
ser.to_numpy(dtype=object)
106+
107+
Or thrown away with ``dtype='datetime64[ns]'``
108+
109+
ser.to_numpy(dtype="datetime64[ns]")
88110

89111
:meth:`~Series.to_numpy` gives some control over the ``dtype`` of the
90112
resulting :class:`ndarray`. For example, consider datetimes with timezones.
@@ -109,7 +131,7 @@ Or thrown away with ``dtype='datetime64[ns]'``
109131

110132
Getting the "raw data" inside a :class:`DataFrame` is possibly a bit more
111133
complex. When your ``DataFrame`` only has a single data type for all the
112-
columns, :attr:`DataFrame.to_numpy` will return the underlying data:
134+
columns, :meth:`DataFrame.to_numpy` will return the underlying data:
113135

114136
.. ipython:: python
115137
@@ -136,8 +158,9 @@ drawbacks:
136158

137159
1. When your Series contains an :ref:`extension type <extending.extension-type>`, it's
138160
unclear whether :attr:`Series.values` returns a NumPy array or the extension array.
139-
:attr:`Series.array` will always return the actual array backing the Series,
140-
while :meth:`Series.to_numpy` will always return a NumPy array.
161+
:attr:`Series.array` will always return an ``ExtensionArray``, and will never
162+
copy data. :meth:`Series.to_numpy` will always return a NumPy array,
163+
potentially at the cost of copying / coercing values.
141164
2. When your DataFrame contains a mixture of data types, :attr:`DataFrame.values` may
142165
involve copying data and coercing values to a common dtype, a relatively expensive
143166
operation. :meth:`DataFrame.to_numpy`, being a method, makes it clearer that the

doc/source/dsintro.rst

+6-2
@@ -146,11 +146,15 @@ If you need the actual array backing a ``Series``, use :attr:`Series.array`.
146146
147147
s.array
148148
149-
Again, this is often a NumPy array, but may instead be a
150-
:class:`~pandas.api.extensions.ExtensionArray`. See :ref:`basics.dtypes` for more.
151149
Accessing the array can be useful when you need to do some operation without the
152150
index (to disable :ref:`automatic alignment <dsintro.alignment>`, for example).
153151

152+
:attr:`Series.array` will always be an :class:`~pandas.api.extensions.ExtensionArray`.
153+
Briefly, an ExtensionArray is a thin wrapper around one or more *concrete* arrays like a
154+
:class:`numpy.ndarray`. Pandas knows how to take an ``ExtensionArray`` and
155+
store it in a ``Series`` or a column of a ``DataFrame``.
156+
See :ref:`basics.dtypes` for more.
157+
154158
While Series is ndarray-like, if you need an *actual* ndarray, then use
155159
:meth:`Series.to_numpy`.
156160
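The note above about using the backing array to sidestep automatic alignment can be illustrated with a minimal sketch. The series `s1` and `s2` and their labels are made up for illustration, and the sketch assumes a pandas version that provides `Series.array`:

```python
import numpy as np
import pandas as pd

# Two hypothetical Series whose indexes only partly overlap.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Series arithmetic aligns on index labels first, so labels present
# in only one operand ("a" and "d") produce missing values.
aligned = s1 + s2

# The backing arrays carry no index, so the addition is positional.
positional = s1.array + s2.array

# Convert to a plain ndarray if NumPy semantics are needed downstream.
positional_np = np.asarray(positional)
```

Here `aligned` has four labels with missing values at the non-overlapping ones, while `positional_np` simply pairs the first element with the first, and so on.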

doc/source/whatsnew/v0.24.0.rst

+6-3
@@ -65,8 +65,11 @@ If you need an actual NumPy array, use :meth:`Series.to_numpy` or :meth:`Index.t
6565
idx.to_numpy()
6666
pd.Series(idx).to_numpy()
6767
68-
For Series and Indexes backed by normal NumPy arrays, this will be the same thing (and the same
69-
as ``.values``).
68+
For Series and Indexes backed by normal NumPy arrays, :attr:`Series.array` will return a
69+
new :class:`arrays.PandasArray`, which is a thin (no-copy) wrapper around a
70+
:class:`numpy.ndarray`. :class:`arrays.PandasArray` isn't especially useful on its own,
71+
but it does provide the same interface as any extension array defined in pandas or by
72+
a third-party library.
7073

7174
.. ipython:: python
7275
@@ -75,7 +78,7 @@ as ``.values``).
7578
ser.to_numpy()
7679
7780
We haven't removed or deprecated :attr:`Series.values` or :attr:`DataFrame.values`, but we
78-
recommend and using ``.array`` or ``.to_numpy()`` instead.
81+
highly recommend and using ``.array`` or ``.to_numpy()`` instead.
7982

8083
See :ref:`Dtypes <basics.dtypes>` and :ref:`Attributes and Underlying Data <basics.attrs>` for more.
8184
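As a rough illustration of the distinction this whatsnew entry draws, the sketch below compares `Series.array` and `Series.to_numpy` on a NumPy-backed Series. The variable names are illustrative only. Because the concrete wrapper class name may differ between pandas versions, the check is made against the `ExtensionArray` base class rather than `arrays.PandasArray` itself:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3])

# .array returns an ExtensionArray wrapping the underlying ndarray.
arr = ser.array

# .to_numpy() returns a plain NumPy ndarray.
nd = ser.to_numpy()

is_extension_array = isinstance(arr, pd.api.extensions.ExtensionArray)
is_ndarray = isinstance(nd, np.ndarray)

# For a NumPy-backed Series the wrapper is described as no-copy;
# np.shares_memory is one way to inspect that on a given installation.
shares_buffer = np.shares_memory(np.asarray(arr), nd)
```

Checking against the base class keeps the example independent of how the wrapper is named in a particular release.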

pandas/__init__.py

+1
@@ -49,6 +49,7 @@
4949
from pandas.io.api import *
5050
from pandas.util._tester import test
5151
import pandas.testing
52+
import pandas.arrays
5253

5354
# use the closest tagged version if possible
5455
from ._version import get_versions

pandas/arrays/__init__.py

+11
@@ -0,0 +1,11 @@
1+
"""
2+
All of pandas' ExtensionArrays.
3+
4+
See :ref:`extending.extension-types` for more.
5+
"""
6+
from pandas.core.arrays import PandasArray
7+
8+
9+
__all__ = [
10+
'PandasArray'
11+
]

pandas/core/arrays/__init__.py

+1
@@ -9,3 +9,4 @@
99
from .integer import ( # noqa
1010
IntegerArray, integer_array)
1111
from .sparse import SparseArray # noqa
12+
from .numpy_ import PandasArray, PandasDtype # noqa

pandas/core/arrays/categorical.py

+2-2
@@ -2091,9 +2091,9 @@ def __setitem__(self, key, value):
20912091
If (one or more) Value is not in categories or if a assigned
20922092
`Categorical` does not have the same categories
20932093
"""
2094+
from pandas.core.internals.arrays import extract_array
20942095

2095-
if isinstance(value, (ABCIndexClass, ABCSeries)):
2096-
value = value.array
2096+
value = extract_array(value, extract_numpy=True)
20972097

20982098
# require identical categories set
20992099
if isinstance(value, Categorical):
