
Commit 0a4f40c

TomAugspurger authored and jreback committed
API: Public data for Series and Index: .array and .to_numpy() (#23623)
1 parent 8ed347a commit 0a4f40c

22 files changed: +501 −55 lines

doc/source/10min.rst

+29-2
@@ -113,13 +113,40 @@ Here is how to view the top and bottom rows of the frame:

     df.head()
     df.tail(3)

-Display the index, columns, and the underlying NumPy data:
+Display the index, columns:

 .. ipython:: python

     df.index
     df.columns
-    df.values
+
+:meth:`DataFrame.to_numpy` gives a NumPy representation of the underlying data.
+Note that this can be an expensive operation when your :class:`DataFrame` has
+columns with different data types, which comes down to a fundamental difference
+between pandas and NumPy: **NumPy arrays have one dtype for the entire array,
+while pandas DataFrames have one dtype per column**. When you call
+:meth:`DataFrame.to_numpy`, pandas will find the NumPy dtype that can hold *all*
+of the dtypes in the DataFrame. This may end up being ``object``, which requires
+casting every value to a Python object.
+
+For ``df``, our :class:`DataFrame` of all floating-point values,
+:meth:`DataFrame.to_numpy` is fast and doesn't require copying data.
+
+.. ipython:: python
+
+    df.to_numpy()
+
+For ``df2``, the :class:`DataFrame` with multiple dtypes,
+:meth:`DataFrame.to_numpy` is relatively expensive.
+
+.. ipython:: python
+
+    df2.to_numpy()
+
+.. note::
+
+    :meth:`DataFrame.to_numpy` does *not* include the index or column
+    labels in the output.

 :func:`~DataFrame.describe` shows a quick statistic summary of your data:
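Not part of the commit itself — a small standalone sketch (with made-up column names) of the dtype-finding behavior the new ``10min.rst`` text describes:

```python
import numpy as np
import pandas as pd

# Homogeneous float DataFrame: to_numpy() keeps the float64 dtype.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
print(df.to_numpy().dtype)  # float64

# Mixed dtypes: pandas must find one NumPy dtype that can hold every
# column, which here falls back to ``object``.
df2 = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})
print(df2.to_numpy().dtype)  # object

# The index and column labels are not part of the result.
print(df.to_numpy().shape)  # (2, 2)
```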
doc/source/advanced.rst

+1-1
@@ -188,7 +188,7 @@ highly performant. If you want to see only the used levels, you can use the

 .. ipython:: python

-    df[['foo', 'qux']].columns.values
+    df[['foo', 'qux']].columns.to_numpy()

     # for a specific level
     df[['foo', 'qux']].columns.get_level_values(0)

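A minimal, self-contained illustration (with hypothetical level values, not the docs' actual frame) of the difference between ``columns.to_numpy()`` and ``get_level_values``:

```python
import numpy as np
import pandas as pd

# Two-level columns similar in shape to the docs' ``foo``/``qux`` example.
columns = pd.MultiIndex.from_tuples(
    [("foo", "one"), ("foo", "two"), ("qux", "one")])
df = pd.DataFrame(np.zeros((2, 3)), columns=columns)

# to_numpy() materializes the MultiIndex labels as an ndarray of tuples...
print(df.columns.to_numpy())

# ...while get_level_values extracts a single level.
print(df.columns.get_level_values(0).to_numpy())
```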
doc/source/basics.rst

+80-24
@@ -46,8 +46,8 @@ of elements to display is five, but you may pass a custom number.

 .. _basics.attrs:

-Attributes and the raw ndarray(s)
----------------------------------
+Attributes and Underlying Data
+------------------------------

 pandas objects have a number of attributes enabling you to access the metadata
@@ -65,14 +65,43 @@ Note, **these attributes can be safely assigned to**!

     df.columns = [x.lower() for x in df.columns]
     df

-To get the actual data inside a data structure, one need only access the
-**values** property:
+Pandas objects (:class:`Index`, :class:`Series`, :class:`DataFrame`) can be
+thought of as containers for arrays, which hold the actual data and do the
+actual computation. For many types, the underlying array is a
+:class:`numpy.ndarray`. However, pandas and 3rd party libraries may *extend*
+NumPy's type system to add support for custom arrays
+(see :ref:`basics.dtypes`).
+
+To get the actual data inside a :class:`Index` or :class:`Series`, use
+the **array** property:
+
+.. ipython:: python
+
+    s.array
+    s.index.array
+
+Depending on the data type (see :ref:`basics.dtypes`), :attr:`~Series.array`
+will be either a NumPy array or an :ref:`ExtensionArray <extending.extension-type>`.
+If you know you need a NumPy array, use :meth:`~Series.to_numpy`
+or :meth:`numpy.asarray`.

 .. ipython:: python

-    s.values
-    df.values
-    wp.values
+    s.to_numpy()
+    np.asarray(s)
+
+For Series and Indexes backed by NumPy arrays (like we have here), this will
+be the same as :attr:`~Series.array`. When the Series or Index is backed by
+a :class:`~pandas.api.extensions.ExtensionArray`, :meth:`~Series.to_numpy`
+may involve copying data and coercing values.
+
+Getting the "raw data" inside a :class:`DataFrame` is possibly a bit more
+complex. When your ``DataFrame`` only has a single data type for all the
+columns, :meth:`DataFrame.to_numpy` will return the underlying data:
+
+.. ipython:: python
+
+    df.to_numpy()

 If a DataFrame or Panel contains homogeneously-typed data, the ndarray can
 actually be modified in-place, and the changes will be reflected in the data
@@ -87,6 +116,21 @@ unlike the axis labels, cannot be assigned to.

     strings are involved, the result will be of object dtype. If there are only
     floats and integers, the resulting array will be of float dtype.

+In the past, pandas recommended :attr:`Series.values` or :attr:`DataFrame.values`
+for extracting the data from a Series or DataFrame. You'll still find references
+to these in old code bases and online. Going forward, we recommend avoiding
+``.values`` and using ``.array`` or ``.to_numpy()``. ``.values`` has the following
+drawbacks:
+
+1. When your Series contains an :ref:`extension type <extending.extension-type>`, it's
+   unclear whether :attr:`Series.values` returns a NumPy array or the extension array.
+   :attr:`Series.array` will always return the actual array backing the Series,
+   while :meth:`Series.to_numpy` will always return a NumPy array.
+2. When your DataFrame contains a mixture of data types, :attr:`DataFrame.values` may
+   involve copying data and coercing values to a common dtype, a relatively expensive
+   operation. :meth:`DataFrame.to_numpy`, being a method, makes it clearer that the
+   returned NumPy array may not be a view on the same data in the DataFrame.

 .. _basics.accelerate:

 Accelerated operations
@@ -541,7 +585,7 @@ will exclude NAs on Series input by default:

 .. ipython:: python

     np.mean(df['one'])
-    np.mean(df['one'].values)
+    np.mean(df['one'].to_numpy())

 :meth:`Series.nunique` will return the number of unique non-NA values in a
 Series:

@@ -839,7 +883,7 @@ Series operation on each column or row:

     tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
                         index=pd.date_range('1/1/2000', periods=10))
-    tsdf.values[3:7] = np.nan
+    tsdf.iloc[3:7] = np.nan

 .. ipython:: python
@@ -1875,17 +1919,29 @@ dtypes
 ------

 For the most part, pandas uses NumPy arrays and dtypes for Series or individual
-columns of a DataFrame. The main types allowed in pandas objects are ``float``,
-``int``, ``bool``, and ``datetime64[ns]`` (note that NumPy does not support
-timezone-aware datetimes).
-
-In addition to NumPy's types, pandas :ref:`extends <extending.extension-types>`
-NumPy's type-system for a few cases.
-
-* :ref:`Categorical <categorical>`
-* :ref:`Datetime with Timezone <timeseries.timezone_series>`
-* :ref:`Period <timeseries.periods>`
-* :ref:`Interval <indexing.intervallindex>`
+columns of a DataFrame. NumPy provides support for ``float``,
+``int``, ``bool``, ``timedelta64[ns]`` and ``datetime64[ns]`` (note that NumPy
+does not support timezone-aware datetimes).
+
+Pandas and third-party libraries *extend* NumPy's type system in a few places.
+This section describes the extensions pandas has made internally.
+See :ref:`extending.extension-types` for how to write your own extension that
+works with pandas. See :ref:`ecosystem.extensions` for a list of third-party
+libraries that have implemented an extension.
+
+The following table lists all of pandas extension types. See the respective
+documentation sections for more on each type.
+
+=================== ========================= ================== ============================= =============================
+Kind of Data        Data Type                 Scalar             Array                         Documentation
+=================== ========================= ================== ============================= =============================
+tz-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp` :class:`arrays.DatetimeArray` :ref:`timeseries.timezone`
+Categorical         :class:`CategoricalDtype` (none)             :class:`Categorical`          :ref:`categorical`
+period (time spans) :class:`PeriodDtype`      :class:`Period`    :class:`arrays.PeriodArray`   :ref:`timeseries.periods`
+sparse              :class:`SparseDtype`      (none)             :class:`arrays.SparseArray`   :ref:`sparse`
+intervals           :class:`IntervalDtype`    :class:`Interval`  :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
+nullable integer    :class:`Int64Dtype`, ...  (none)             :class:`arrays.IntegerArray`  :ref:`integer_na`
+=================== ========================= ================== ============================= =============================

 Pandas uses the ``object`` dtype for storing strings.

@@ -1983,13 +2039,13 @@ from the current type (e.g. ``int`` to ``float``).

     df3
     df3.dtypes

-The ``values`` attribute on a DataFrame return the *lower-common-denominator* of the dtypes, meaning
+:meth:`DataFrame.to_numpy` will return the *lower-common-denominator* of the dtypes, meaning
 the dtype that can accommodate **ALL** of the types in the resulting homogeneous dtyped NumPy array. This can
 force some *upcasting*.

 .. ipython:: python

-    df3.values.dtype
+    df3.to_numpy().dtype

 astype
 ~~~~~~
@@ -2211,11 +2267,11 @@ dtypes:

                         'float64': np.arange(4.0, 7.0),
                         'bool1': [True, False, True],
                         'bool2': [False, True, False],
-                        'dates': pd.date_range('now', periods=3).values,
+                        'dates': pd.date_range('now', periods=3),
                         'category': pd.Series(list("ABC")).astype('category')})
     df['tdeltas'] = df.dates.diff()
     df['uint64'] = np.arange(3, 6).astype('u8')
-    df['other_dates'] = pd.date_range('20130101', periods=3).values
+    df['other_dates'] = pd.date_range('20130101', periods=3)
     df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')
     df
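Outside the diff, the ``.array``/``.to_numpy()`` split that the new ``basics.rst`` text describes can be sketched with a small categorical Series:

```python
import numpy as np
import pandas as pd

# A Series backed by an extension type (categorical).
s = pd.Series(["a", "b", "a"], dtype="category")

# .array returns the actual extension array backing the Series...
arr = s.array
print(type(arr).__name__)  # Categorical

# ...while .to_numpy() always coerces to a NumPy ndarray
# (here, an object-dtype array of the category values).
npy = s.to_numpy()
print(type(npy).__name__, npy.dtype)
```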

doc/source/categorical.rst

+2-2
@@ -178,7 +178,7 @@ are consistent among all columns.

 To perform table-wise conversion, where all labels in the entire ``DataFrame`` are used as
 categories for each column, the ``categories`` parameter can be determined programmatically by
-``categories = pd.unique(df.values.ravel())``.
+``categories = pd.unique(df.to_numpy().ravel())``.

 If you already have ``codes`` and ``categories``, you can use the
 :func:`~pandas.Categorical.from_codes` constructor to save the factorize step

@@ -955,7 +955,7 @@ Use ``.astype`` or ``union_categoricals`` to get ``category`` result.

     pd.concat([s1, s3])

     pd.concat([s1, s3]).astype('category')
-    union_categoricals([s1.values, s3.values])
+    union_categoricals([s1.array, s3.array])

 Following table summarizes the results of ``Categoricals`` related concatenations.

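A runnable sketch of the ``union_categoricals`` change above, with small example series standing in for the docs' ``s1``/``s3``:

```python
import pandas as pd
from pandas.api.types import union_categoricals

s1 = pd.Series(["a", "b"], dtype="category")
s3 = pd.Series(["b", "c"], dtype="category")

# .array hands union_categoricals the underlying Categorical objects,
# rather than whatever NumPy array .values might coerce to.
result = union_categoricals([s1.array, s3.array])
print(result.categories.tolist())  # ['a', 'b', 'c']
```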
doc/source/dsintro.rst

+39-1
@@ -137,7 +137,43 @@ However, operations such as slicing will also slice the index.

     s[[4, 3, 1]]
     np.exp(s)

-We will address array-based indexing in a separate :ref:`section <indexing>`.
+.. note::
+
+    We will address array-based indexing like ``s[[4, 3, 1]]``
+    in :ref:`section <indexing>`.
+
+Like a NumPy array, a pandas Series has a :attr:`~Series.dtype`.
+
+.. ipython:: python
+
+    s.dtype
+
+This is often a NumPy dtype. However, pandas and 3rd-party libraries
+extend NumPy's type system in a few places, in which case the dtype would
+be a :class:`~pandas.api.extensions.ExtensionDtype`. Some examples within
+pandas are :ref:`categorical` and :ref:`integer_na`. See :ref:`basics.dtypes`
+for more.
+
+If you need the actual array backing a ``Series``, use :attr:`Series.array`.
+
+.. ipython:: python
+
+    s.array
+
+Again, this is often a NumPy array, but may instead be a
+:class:`~pandas.api.extensions.ExtensionArray`. See :ref:`basics.dtypes` for more.
+Accessing the array can be useful when you need to do some operation without the
+index (to disable :ref:`automatic alignment <dsintro.alignment>`, for example).
+
+While Series is ndarray-like, if you need an *actual* ndarray, then use
+:meth:`Series.to_numpy`.
+
+.. ipython:: python
+
+    s.to_numpy()
+
+Even if the Series is backed by a :class:`~pandas.api.extensions.ExtensionArray`,
+:meth:`Series.to_numpy` will return a NumPy ndarray.

 Series is dict-like
 ~~~~~~~~~~~~~~~~~~~

@@ -617,6 +653,8 @@ slicing, see the :ref:`section on indexing <indexing>`. We will address the
 fundamentals of reindexing / conforming to new sets of labels in the
 :ref:`section on reindexing <basics.reindexing>`.

+.. _dsintro.alignment:
+
 Data alignment and arithmetic
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

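The "operation without the index" point in the new ``dsintro.rst`` text can be sketched as follows (toy series, not from the diff):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Series arithmetic aligns on the index, introducing NaN for labels
# present on only one side ("a" and "d" here).
aligned = s1 + s2
print(aligned.isna().sum())  # 2

# Dropping to the raw arrays disables alignment: the operation is
# purely positional.
positional = s1.to_numpy() + s2.to_numpy()
print(positional)  # [11 22 33]
```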
doc/source/enhancingperf.rst

+5-3
@@ -221,7 +221,7 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros arra

 You can **not pass** a ``Series`` directly as a ``ndarray`` typed parameter
 to a Cython function. Instead pass the actual ``ndarray`` using the
-``.values`` attribute of the ``Series``. The reason is that the Cython
+:meth:`Series.to_numpy`. The reason is that the Cython
 definition is specific to an ndarray and not the passed ``Series``.

 So, do not do this:

@@ -230,11 +230,13 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros arra

     apply_integrate_f(df['a'], df['b'], df['N'])

-But rather, use ``.values`` to get the underlying ``ndarray``:
+But rather, use :meth:`Series.to_numpy` to get the underlying ``ndarray``:

 .. code-block:: python

-    apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
+    apply_integrate_f(df['a'].to_numpy(),
+                      df['b'].to_numpy(),
+                      df['N'].to_numpy())

 .. note::

doc/source/extending.rst

+1-1
@@ -186,7 +186,7 @@ Instead, you should detect these cases and return ``NotImplemented``.

 When pandas encounters an operation like ``op(Series, ExtensionArray)``, pandas
 will

-1. unbox the array from the ``Series`` (roughly ``Series.values``)
+1. unbox the array from the ``Series`` (``Series.array``)
 2. call ``result = op(values, ExtensionArray)``
 3. re-box the result in a ``Series``

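The three unbox/op/re-box steps above can be sketched schematically (this is an illustration, not pandas' actual dispatch code):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
other = np.array([10, 20, 30])

# 1. unbox the array from the Series
values = s.array

# 2. call the op on the unboxed values (plain addition here)
result = values + other

# 3. re-box the result in a Series, reusing the original index
reboxed = pd.Series(result, index=s.index)
print(reboxed.tolist())  # [11, 22, 33]
```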
doc/source/indexing.rst

+1-1
@@ -190,7 +190,7 @@ columns.

 .. ipython:: python

-    df.loc[:,['B', 'A']] = df[['A', 'B']].values
+    df.loc[:,['B', 'A']] = df[['A', 'B']].to_numpy()
     df[['A', 'B']]

doc/source/missing_data.rst

+1-1
@@ -678,7 +678,7 @@ Replacing more than one value is possible by passing a list.

 .. ipython:: python

-    df00 = df.values[0, 0]
+    df00 = df.iloc[0, 0]
     df.replace([1.5, df00], [np.nan, 'a'])
     df[1].dtype

doc/source/reshaping.rst

+7-7
@@ -27,12 +27,12 @@ Reshaping by pivoting DataFrame objects

     tm.N = 3

     def unpivot(frame):
-        N, K = frame.shape
-        data = {'value': frame.values.ravel('F'),
-                'variable': np.asarray(frame.columns).repeat(N),
-                'date': np.tile(np.asarray(frame.index), K)}
-        columns = ['date', 'variable', 'value']
-        return pd.DataFrame(data, columns=columns)
+        N, K = frame.shape
+        data = {'value': frame.to_numpy().ravel('F'),
+                'variable': np.asarray(frame.columns).repeat(N),
+                'date': np.tile(np.asarray(frame.index), K)}
+        columns = ['date', 'variable', 'value']
+        return pd.DataFrame(data, columns=columns)

     df = unpivot(tm.makeTimeDataFrame())

@@ -54,7 +54,7 @@ For the curious here is how the above ``DataFrame`` was created:

     def unpivot(frame):
         N, K = frame.shape
-        data = {'value': frame.values.ravel('F'),
+        data = {'value': frame.to_numpy().ravel('F'),
                 'variable': np.asarray(frame.columns).repeat(N),
                 'date': np.tile(np.asarray(frame.index), K)}
         return pd.DataFrame(data, columns=['date', 'variable', 'value'])

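The ``unpivot`` helper from the hunk above can be exercised without ``pandas.util.testing`` — a self-contained sketch using a tiny hand-built frame:

```python
import numpy as np
import pandas as pd

def unpivot(frame):
    N, K = frame.shape
    # ravel('F') flattens column-major: all of column A, then column B.
    data = {'value': frame.to_numpy().ravel('F'),
            'variable': np.asarray(frame.columns).repeat(N),
            'date': np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])

frame = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]},
                     index=pd.date_range('2000-01-01', periods=2))
long = unpivot(frame)
print(long)
```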
doc/source/text.rst

+2-2
@@ -317,8 +317,8 @@ All one-dimensional list-likes can be combined in a list-like container (includi

     s
     u
-    s.str.cat([u.values,
-               u.index.astype(str).values], na_rep='-')
+    s.str.cat([u.array,
+               u.index.astype(str).array], na_rep='-')

 All elements must match in length to the calling ``Series`` (or ``Index``), except those having an index if ``join`` is not None:
