API: Public data for Series and Index: .array and .to_numpy() #23623

Merged: 28 commits, Nov 29, 2018
Showing changes from 5 commits

Commits (28)
7959eb6
API: Public data attributes for EA-backed containers
TomAugspurger Oct 30, 2018
5b15894
update
TomAugspurger Nov 6, 2018
4781a36
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 11, 2018
15cc0b7
more notes
TomAugspurger Nov 11, 2018
888853f
update
TomAugspurger Nov 11, 2018
2cfca30
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 11, 2018
3e76f02
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 13, 2018
7e43cf0
Squashed commit of the following:
TomAugspurger Nov 13, 2018
bceb612
DOC: updated docs
TomAugspurger Nov 13, 2018
c19c9bb
Added DataFrame.to_numpy
TomAugspurger Nov 17, 2018
fe813ff
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 17, 2018
8619790
clean
TomAugspurger Nov 17, 2018
639b6fb
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 21, 2018
95f19bc
doc update
TomAugspurger Nov 21, 2018
3292e43
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 21, 2018
5a905ab
update
TomAugspurger Nov 21, 2018
1e6eed4
fixed doctest
TomAugspurger Nov 21, 2018
4545d93
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 26, 2018
2d7abb4
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 27, 2018
a7a13a0
Fixed array error in docs
TomAugspurger Nov 27, 2018
c0a63c0
update docs
TomAugspurger Nov 27, 2018
661b9eb
Fixup for feedback
TomAugspurger Nov 28, 2018
52f5407
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 28, 2018
566a027
skip only on index box
TomAugspurger Nov 28, 2018
062c49f
Series.values
TomAugspurger Nov 28, 2018
78e5824
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 28, 2018
e805c26
remove stale todo
TomAugspurger Nov 28, 2018
f9eee65
Merge remote-tracking branch 'upstream/master' into public-data
TomAugspurger Nov 29, 2018
39 changes: 38 additions & 1 deletion doc/source/dsintro.rst
@@ -137,7 +137,42 @@ However, operations such as slicing will also slice the index.
s[[4, 3, 1]]
np.exp(s)

-We will address array-based indexing in a separate :ref:`section <indexing>`.
+.. note::
+
+   We will address array-based indexing like ``s[[4, 3, 1]]``
+   in :ref:`section <indexing>`.

Like a NumPy array, a pandas Series has a :attr:`Series.dtype`.

.. ipython:: python

   s.dtype

This is often a NumPy dtype. However, pandas and 3rd-party libraries
extend NumPy's type system in a few places, in which case the dtype would
be a :class:`~pandas.api.extensions.ExtensionDtype`. Some examples within
pandas are :ref:`categorical` and :ref:`integer_na`. See :ref:`dsintro.data_type` for more.

If you need the actual array backing a ``Series``, use :attr:`Series.array`.

.. ipython:: python

   s.array

Again, this is often a NumPy array, but may instead be a
:class:`~pandas.api.extensions.ExtensionArray`. See :ref:`dsintro.data_type` for more.
Accessing the array can be useful when you need to do some operation without the
index (to disable :ref:`automatic alignment <dsintro.alignment>`, for example).

While Series is ndarray-like, if you need an *actual* ndarray, then use
:meth:`Series.to_numpy`.

.. ipython:: python

   s.to_numpy()

Even if the Series is backed by a :class:`~pandas.api.extensions.ExtensionArray`,
:meth:`Series.to_numpy` will return a NumPy ndarray.

Series is dict-like
~~~~~~~~~~~~~~~~~~~
@@ -617,6 +652,8 @@ slicing, see the :ref:`section on indexing <indexing>`. We will address the
fundamentals of reindexing / conforming to new sets of labels in the
:ref:`section on reindexing <basics.reindexing>`.

.. _dsintro.alignment:

Data alignment and arithmetic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

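The dsintro.rst text in this hunk mentions working without the index to sidestep automatic alignment. A minimal sketch of that difference, assuming two hypothetical Series `a` and `b` whose labels are in opposite order (the names and values are illustrative and not part of the PR):

```python
import pandas as pd

# Hypothetical data: same labels, opposite order.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Series arithmetic aligns on labels before adding.
aligned = a + b

# Converting to ndarrays first drops the index, so addition is positional.
positional = a.to_numpy() + b.to_numpy()

print(aligned)
print(positional)
```

Here `to_numpy()` is used for the positional case because it is documented to return a plain ndarray; the same idea should apply to `.array`, whose concrete return type depends on the dtype.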
40 changes: 40 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -21,6 +21,46 @@ the user to override the engine's default behavior to include or omit the
dataframe's indexes from the resulting Parquet file. (:issue:`20768`)
- :meth:`DataFrame.corr` and :meth:`Series.corr` now accept a callable for generic calculation methods of correlation, e.g. histogram intersection (:issue:`22684`)

.. _whatsnew_0240.values_api:

Accessing the values in a Series or Index
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:attr:`Series.array` and :attr:`Index.array` have been added for extracting the array backing a
``Series`` or ``Index``.

.. ipython:: python

   idx = pd.period_range('2000', periods=4)
   idx.array
   pd.Series(idx).array

Historically, this would have been done with ``series.values``, but with
``.values`` it was unclear whether the returned value would be the actual array,
some transformation of it, or one of pandas' custom arrays (like
``Categorical``). For example, with :class:`PeriodIndex`, ``.values`` generates
a new ndarray of period objects each time.

.. ipython:: python

   id(idx.values)
   id(idx.values)

If you need an actual NumPy array, use :meth:`Series.to_numpy` or :meth:`Index.to_numpy`.

.. ipython:: python

   idx.to_numpy()
   pd.Series(idx).to_numpy()

For Series and Indexes backed by normal NumPy arrays, ``.array`` and ``.to_numpy()``
return the same ndarray (which is also what ``.values`` returns).

.. ipython:: python

   ser = pd.Series([1, 2, 3])
   ser.array
   ser.to_numpy()

.. _whatsnew_0240.enhancements.extension_array_operators:

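To make the whatsnew distinction concrete, here is a small sketch contrasting `.array` with `.to_numpy()` on a period-backed index. It assumes a pandas version that exposes `pd.arrays.PeriodArray`, and the explicit `freq="D"` is a choice made for this example rather than something taken from the PR:

```python
import numpy as np
import pandas as pd

idx = pd.period_range("2000-01-01", periods=4, freq="D")

# .array hands back the extension array stored in the index.
backing = idx.array

# .to_numpy() always hands back a NumPy ndarray; for periods the
# elements are Period objects held in an object-dtype array.
as_ndarray = idx.to_numpy()

print(type(backing).__name__)
print(type(as_ndarray).__name__, as_ndarray.dtype)
```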
91 changes: 91 additions & 0 deletions pandas/core/base.py
@@ -765,6 +765,97 @@ def base(self):
                      FutureWarning, stacklevel=2)
        return self.values.base

    @property
    def array(self):
        # type: () -> Union[np.ndarray, ExtensionArray]
        """The actual array backing this Series or Index.

        Returns
        -------
        Union[ndarray, ExtensionArray]
            This is the actual array stored within this object.

        Notes
        -----
        This table lays out the different array types for each extension
        dtype within pandas.

        ================== =============================
        dtype              array type
        ================== =============================
        category           Categorical
        period             PeriodArray
        interval           IntervalArray
        IntegerNA          IntegerArray
        datetime64[ns, tz] datetime64[ns]? DatetimeArray
        ================== =============================

        For any 3rd-party extension types, the array type will be an
        ExtensionArray.

        For all remaining dtypes, ``.array`` will be the ndarray
        stored within.

        See Also
        --------
        to_numpy : Similar method that always returns a NumPy array.

        Examples
        --------
        >>> ser = pd.Series(pd.Categorical(['a', 'b', 'a']))
        >>> ser.array
        [a, b, a]
        Categories (2, object): [a, b]
        """
        return self._values

    def to_numpy(self):
        """A NumPy array representing the values in this Series or Index.

        The returned array will be the same up to equality (values equal
        in `self` will be equal in the returned array; likewise for values
        that are not equal).
[Review comment, Member]
When reading this, I then wonder: in what sense is it not the same? (Does it lose some type information, e.g. in the case of categoricals?)

[Reply, Contributor Author]
Right. I'll add that.

        Returns
        -------
        numpy.ndarray
            An ndarray with

        Notes
        -----
        For NumPy arrays, this will be a reference to the actual data stored
        in this Series or Index.

        For extension types, this may involve copying data and coercing the
        result to a NumPy type (possibly object), which may be expensive.

        This table lays out the different array types for each extension
        dtype within pandas.

        ================== ================================
        dtype              array type
        ================== ================================
        category[T]        ndarray[T] (same dtype as input)
        period             ndarray[object] (Periods)
        interval           ndarray[object] (Intervals)
        IntegerNA          ndarray[object]
        datetime64[ns, tz] datetime64[ns]? object?
        ================== ================================

        See Also
        --------
        array : Get the actual data stored within.

        Examples
        --------
        >>> ser = pd.Series(pd.Categorical(['a', 'b', 'a']))
        >>> ser.to_numpy()
        array(['a', 'b', 'a'], dtype=object)
        """
        if is_extension_array_dtype(self.dtype):
            return np.asarray(self._values)
        return self._values

    @property
    def _ndarray_values(self):
        # type: () -> np.ndarray
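The review question above asks what is lost when `to_numpy()` converts an extension array. A rough illustration for the categorical case, assuming a hypothetical ordered categorical with one unused category (the data is made up for this sketch):

```python
import pandas as pd

# Hypothetical categorical with an unused category "c" and an ordering.
cat = pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"], ordered=True)
ser = pd.Series(cat)

out = ser.to_numpy()

# The element values survive the conversion...
print(out.tolist())

# ...but the category list and ordering live only on the Categorical.
print(list(ser.array.categories), ser.array.ordered)
print(out.dtype, hasattr(out, "categories"))
```

The values compare equal element by element, while the categories (including the unused one) and the ordered flag are not represented in the resulting ndarray.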
2 changes: 1 addition & 1 deletion pandas/core/indexes/base.py
@@ -735,7 +735,7 @@ def _values(self):
         values
         _ndarray_values
         """
-        return self.values
+        return self._data

     def get_values(self):
         """
20 changes: 20 additions & 0 deletions pandas/core/indexes/multi.py
@@ -288,6 +288,26 @@ def _verify_integrity(self, labels=None, levels=None):
    def levels(self):
        return self._levels

    @property
    def _values(self):
        # TODO: remove
        # We override here, since our parent uses _data, which we don't use.
        return self.values

    @property
    def array(self):
        """
        Raises a ValueError for `MultiIndex` because there's no single
        array backing a MultiIndex.

        Raises
        ------
        ValueError
        """
        msg = ("MultiIndex has no single backing array. Use "
               "'MultiIndex.to_numpy()' to get a NumPy array of tuples.")
        raise ValueError(msg)

    @property
    def _is_homogeneous_type(self):
        """Whether the levels of a MultiIndex all have the same dtype.
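From the caller's side, the `MultiIndex.array` override in this hunk could be exercised roughly as follows. The index contents mirror the PR's own test data; the `raised` flag is a helper introduced only for this sketch:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["A"], ["a", "b"]])

# Accessing .array is expected to raise, since no single array backs
# a MultiIndex.
try:
    idx.array
    raised = False
except ValueError as err:
    raised = True
    print(err)

# The error message points to to_numpy(), which yields an array of tuples.
tuples = idx.to_numpy()
print(tuples)
```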
51 changes: 51 additions & 0 deletions pandas/tests/test_base.py
@@ -1269,3 +1269,54 @@ def test_ndarray_values(array, expected):
    r_values = pd.Index(array)._ndarray_values
    tm.assert_numpy_array_equal(l_values, r_values)
    tm.assert_numpy_array_equal(l_values, expected)


@pytest.mark.parametrize("array, attr", [
[Review comment, Contributor]
Maybe put in pandas/tests/arrays/test_arrays.py?

[Reply, Contributor Author]
Doesn't exist yet :), though my pd.array PR is creating it.

This seemed a bit more appropriate since it's next to our tests for ndarray_values.

    (np.array([1, 2], dtype=np.int64), None),
    (pd.Categorical(['a', 'b']), '_codes'),
    (pd.core.arrays.period_array(['2000', '2001'], freq='D'), '_data'),
    (pd.core.arrays.integer_array([0, np.nan]), '_data'),
    (pd.core.arrays.IntervalArray.from_breaks([0, 1]), '_left'),
    (pd.SparseArray([0, 1]), '_sparse_values'),
    # TODO: DatetimeArray(add)
])
@pytest.mark.parametrize('box', [pd.Series, pd.Index])
def test_array(array, attr, box):
    if array.dtype.name in ('Int64', 'Sparse[int64, 0]'):
        pytest.skip("No index type for {}".format(array.dtype))
    result = box(array, copy=False).array

    if attr:
        array = getattr(array, attr)
        result = getattr(result, attr)

    assert result is array


def test_array_multiindex_raises():
    idx = pd.MultiIndex.from_product([['A'], ['a', 'b']])
    with pytest.raises(ValueError, match='MultiIndex'):
        idx.array


@pytest.mark.parametrize('array, expected', [
    (np.array([1, 2], dtype=np.int64), np.array([1, 2], dtype=np.int64)),
    (pd.Categorical(['a', 'b']), np.array(['a', 'b'], dtype=object)),
    (pd.core.arrays.period_array(['2000', '2001'], freq='D'),
     np.array([pd.Period('2000', freq="D"), pd.Period('2001', freq='D')])),
    (pd.core.arrays.integer_array([0, np.nan]),
     np.array([0, np.nan], dtype=object)),
    (pd.core.arrays.IntervalArray.from_breaks([0, 1, 2]),
     np.array([pd.Interval(0, 1), pd.Interval(1, 2)], dtype=object)),
    (pd.SparseArray([0, 1]), np.array([0, 1], dtype=np.int64)),
    # TODO: DatetimeArray(add)
])
@pytest.mark.parametrize('box', [pd.Series, pd.Index])
def test_to_numpy(array, expected, box):
    thing = box(array)

    if array.dtype.name in ('Int64', 'Sparse[int64, 0]'):
        pytest.skip("No index type for {}".format(array.dtype))

    result = thing.to_numpy()
    tm.assert_numpy_array_equal(result, expected)
[Review comment, Member]
Should we also test for the case where it is not a copy?

[Reply, Contributor Author]
What do you mean here? (In case you missed it, the first case is a regular ndarray, so that won't be a copy. Though perhaps you're saying I should assert this for that case?)

[Reply, Member]
Yes, that's what I meant. If we return a view, and people can rely on it, we should test it.
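One way the no-copy check discussed in this thread could look is sketched below. This is only an illustration and not the test that was eventually added to the PR; it assumes `np.shares_memory` is an acceptable way to detect a view, and the outcome may depend on the pandas version and its copy semantics:

```python
import numpy as np
import pandas as pd

# A plain int64 ndarray, as in the first parametrized case of the test.
arr = np.array([1, 2], dtype=np.int64)

# copy=False asks the constructor not to copy the input data.
ser = pd.Series(arr, copy=False)

result = ser.to_numpy()

# If to_numpy() returns a view, the result and the input overlap in memory.
shares = np.shares_memory(result, arr)
print(shares)
```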