
ENH: add NDArrayBackedExtensionArray to public API #56755

Closed
wants to merge 47 commits into from
47 commits
1f93779
ENH: add NDArrayBackedExtensionArray to public API
tswast Jan 21, 2022
522b548
add whatsnew
tswast Jan 21, 2022
ee4e23d
Merge branch 'main' into python-db-dtypes-pandas-issue28
jreback Jan 23, 2022
945f840
add NDArrayBackedExtensionArray to pandas.core.arrays.__init__
tswast Jan 24, 2022
721ae11
add tests for extensions api
tswast Jan 24, 2022
ae68f9d
add docs
tswast Jan 24, 2022
05d0e08
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Jan 24, 2022
1ad0338
Merge remote-tracking branch 'origin/python-db-dtypes-pandas-issue28'…
tswast Jan 24, 2022
38113c8
add autosummary for methods and attributes
tswast Jan 24, 2022
18ec784
remove unreferenced methods from docs
tswast Jan 24, 2022
2919f60
fix docstrings
tswast Jan 25, 2022
0c52366
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Jan 25, 2022
319ac2b
use doc decorator
tswast Jan 26, 2022
8513863
add code samples and reference to test suite
tswast Jan 26, 2022
5309895
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Jan 26, 2022
827f483
Merge branch 'main' into python-db-dtypes-pandas-issue28
jreback Mar 19, 2022
2cd9b31
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Apr 6, 2022
cc75eda
add missing methods to extension docs
tswast Apr 6, 2022
ca323bb
Merge remote-tracking branch 'origin/python-db-dtypes-pandas-issue28'…
tswast Apr 6, 2022
bfd31f0
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast Apr 7, 2022
396da54
Merge branch 'main' into python-db-dtypes-pandas-issue28
jreback Apr 10, 2022
27cf80e
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast May 20, 2022
c716826
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast Jun 7, 2022
f4df0e9
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast Aug 25, 2022
8876b9a
clarify _validate_searchsorted_value and 2d backing array
tswast Aug 26, 2022
1bdd1cd
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast Aug 29, 2022
4b0a948
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Nov 22, 2022
5920778
Merge branch 'python-db-dtypes-pandas-issue28' of github.com:tswast/p…
tswast Nov 22, 2022
38018e6
DOC: make insert docstring have single line summary
tswast Nov 23, 2022
9277cf5
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Nov 23, 2022
0b86bd5
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast Nov 28, 2022
a5ac8ba
datearray changes
andrewgsavage Jan 6, 2024
8a621c5
Merge branch 'main' into ndbacked
andrewgsavage Jan 6, 2024
0e674d4
whatsnew
andrewgsavage Jan 6, 2024
f7e353a
test ndbacked is part of api
andrewgsavage Jan 6, 2024
01191d1
docstrings, extending docs
andrewgsavage Jan 7, 2024
ce4eeef
whatsnew
andrewgsavage Jan 7, 2024
7019bc7
docstring
andrewgsavage Jan 13, 2024
5f99f57
aggressive docstring
andrewgsavage Jan 13, 2024
47f8917
lint
andrewgsavage Jan 13, 2024
4f8c055
lint
andrewgsavage Jan 13, 2024
1aaaa9a
docstrings
andrewgsavage Jan 13, 2024
8a66d3a
to_numpy example
andrewgsavage Jan 13, 2024
a8fe040
to_numpy example
andrewgsavage Jan 13, 2024
f2cbd4b
remove ndarray example
andrewgsavage Jan 13, 2024
552f7a3
value_counts example
andrewgsavage Jan 13, 2024
6ae423d
use base.extensiontests in exmaple
andrewgsavage Jan 28, 2024
87 changes: 87 additions & 0 deletions doc/source/development/extending.rst
@@ -135,6 +135,93 @@ by some other storage type, like Python lists.
See the `extension array source`_ for the interface definition. The docstrings
and comments contain guidance for properly implementing the interface.

:class:`~pandas.api.extensions.NDArrayBackedExtensionArray`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For ExtensionArrays backed by a single NumPy array, the
:class:`~pandas.api.extensions.NDArrayBackedExtensionArray` class can save you
some effort. It contains a private property ``_ndarray`` with the backing NumPy
array and implements the extension array interface.

Implement the following:

``_box_func``
    Convert from array values to the type you wish to expose to users.

``_internal_fill_value``
    Scalar used to denote the ``NA`` value inside ``self._ndarray``, e.g. ``-1``
    for ``Categorical``, ``iNaT`` for ``Period``.

``_validate_scalar``
    Convert an object to a value that can be stored in the backing NumPy array.

``_validate_setitem_value``
    Convert a value or sequence of values into a form that can be assigned into
    the backing NumPy array.

.. code-block:: python

    class DateArray(NDArrayBackedExtensionArray):
        _internal_fill_value = numpy.datetime64("NaT")

        def __init__(self, values):
            backing_array_dtype = "<M8[ns]"
            super().__init__(values=values, dtype=backing_array_dtype)

        def _box_func(self, value):
            if pandas.isna(value):
                return pandas.NaT
            return value.astype("datetime64[us]").item().date()

        def _validate_scalar(self, scalar):
            if pandas.isna(scalar):
                return numpy.datetime64("NaT")
            elif isinstance(scalar, datetime.date):
                return pandas.Timestamp(
                    year=scalar.year, month=scalar.month, day=scalar.day
                ).to_datetime64()
            else:
                raise TypeError("Invalid value type", scalar)

        def _validate_setitem_value(self, value):
            if pandas.api.types.is_list_like(value):
                return numpy.array(
                    [self._validate_scalar(v) for v in value], dtype=self._ndarray.dtype
                )
            return self._validate_scalar(value)
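A standalone sketch of this scalar-validation pattern, runnable without pandas (the helper name and the NaN handling here are illustrative, not pandas API):

```python
import datetime

import numpy as np

# Illustrative stand-in for DateArray._validate_scalar: convert a user-facing
# scalar into something storable in a datetime64[ns] backing array.
def validate_scalar(scalar):
    # None and float("nan") both count as missing and map to NaT.
    if scalar is None or scalar != scalar:
        return np.datetime64("NaT")
    if isinstance(scalar, datetime.date):
        return np.datetime64(scalar).astype("datetime64[ns]")
    raise TypeError("Invalid value type", scalar)

print(validate_scalar(datetime.date(2024, 1, 15)))  # 2024-01-15T00:00:00.000000000
print(validate_scalar(None))                        # NaT
```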


To support 2D arrays, use the ``_from_backing_data`` helper method to wrap
results that are NumPy arrays of the same dtype as ``_ndarray``, as in the
reduction below.

.. code-block:: python

    class CustomArray(NDArrayBackedExtensionArray):

        ...

        def min(self, *, axis: Optional[int] = None, skipna: bool = True, **kwargs):
            pandas.compat.numpy.function.validate_min((), kwargs)
            result = pandas.core.nanops.nanmin(
                values=self._ndarray, axis=axis, mask=self.isna(), skipna=skipna
            )
            if axis is None or self.ndim == 1:
                return self._box_func(result)
            return self._from_backing_data(result)

Review thread on this example:

Member: this is implicitly assuming that the ordering of self matches the ordering of self._ndarray

Author: Is that an issue? If someone wants control over that, they could use ExtensionArray instead of NDArrayBackedExtensionArray.

Contributor: In a conversation with @jorisvandenbossche, two ideas were clarified for me:

  1. In the case of ordering, examples include IP addresses (https://cyberpandas.readthedocs.io/en/latest/usage.html#pandas-integration), which can be ordered in a number of ways.
  2. On the topic of composing multiple extension arrays (e.g. PintArray and UncertaintyArray), we might look at how the composition of PintArray and Arrow arrays (https://arrow.apache.org/docs/python/pandas.html) might work, and then argue either how we can follow those patterns or why new patterns/assumptions are needed.

If I've misunderstood either the problem or the proposed next steps, happy to edit the above to correct.

@MichaelTiemannOSC (Mar 27, 2024): Regarding the ordering question, both NumPy and pandas select min and max (and also sort) based only on the real component of complex numbers:

    import pandas as pd
    xx = pd.Series([4+3j, 3+10j, 2+4j])
    xx.min()
    # (2+4j)
    xx.max()
    # (4+3j)
    xx.map(lambda x: abs(x))
    # 0     5.000000
    # 1    10.440307
    # 2     4.472136
    xx.sort_values()
    # 2    2.0+ 4.0j
    # 1    3.0+10.0j
    # 0    4.0+ 3.0j
    # dtype: complex128

Thus it is up to us to decide what sorting behavior applies to uncertainties (I propose ordering based on magnitude only, not on error terms, except when the error is NaN, in which case the value is treated as NaN).

When we look at the composition question (2, above), we will look at having the subclasses deal with this entirely (meaning we can implement an EA of units which might also have uncertain values). And of course update the documentation...
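As a side note on the ordering discussion above: NumPy's comparison of complex values is lexicographic (real part first, imaginary part breaks ties), which can be checked without pandas:

```python
import numpy as np

# Complex comparison in NumPy is lexicographic: real parts are compared
# first, and imaginary parts only break ties.
a = np.array([4 + 3j, 3 + 10j, 2 + 4j, 2 + 1j])

print(a.min())     # (2+1j) -- tie on real part 2 broken by imaginary part
print(np.sort(a))  # 2+1j, 2+4j, 3+10j, 4+3j
```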


Subclass the tests in :mod:`pandas.tests.extension.base` in your test suite to
validate your implementation.

.. code-block:: python

    @pytest.fixture
    def data():
        return CustomArray(numpy.arange(-10, 10, 1))


    class TestCustomArray(base.ExtensionTests):
        pass


.. _extending.extension.operator:

:class:`~pandas.api.extensions.ExtensionArray` operator support
21 changes: 21 additions & 0 deletions doc/source/reference/extensions.rst
@@ -24,6 +24,7 @@ objects.
   :template: autosummary/class_without_autosummary.rst

   api.extensions.ExtensionArray
   api.extensions.NDArrayBackedExtensionArray
   arrays.NumpyExtensionArray

.. We need this autosummary so that methods and attributes are generated.
@@ -68,6 +69,26 @@ objects.
   api.extensions.ExtensionArray.ndim
   api.extensions.ExtensionArray.shape
   api.extensions.ExtensionArray.tolist
   api.extensions.NDArrayBackedExtensionArray.dtype
   api.extensions.NDArrayBackedExtensionArray.argmax
   api.extensions.NDArrayBackedExtensionArray.argmin
   api.extensions.NDArrayBackedExtensionArray.argsort
   api.extensions.NDArrayBackedExtensionArray.astype
   api.extensions.NDArrayBackedExtensionArray.dropna
   api.extensions.NDArrayBackedExtensionArray.equals
   api.extensions.NDArrayBackedExtensionArray.factorize
   api.extensions.NDArrayBackedExtensionArray.fillna
   api.extensions.NDArrayBackedExtensionArray.insert
   api.extensions.NDArrayBackedExtensionArray.isin
   api.extensions.NDArrayBackedExtensionArray.isna
   api.extensions.NDArrayBackedExtensionArray.searchsorted
   api.extensions.NDArrayBackedExtensionArray.shift
   api.extensions.NDArrayBackedExtensionArray.take
   api.extensions.NDArrayBackedExtensionArray.to_numpy
   api.extensions.NDArrayBackedExtensionArray.tolist
   api.extensions.NDArrayBackedExtensionArray.unique
   api.extensions.NDArrayBackedExtensionArray.value_counts
   api.extensions.NDArrayBackedExtensionArray.view

Additionally, we have some utility methods for ensuring your object
behaves correctly.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.2.0.rst
@@ -313,6 +313,7 @@ Other enhancements

- :meth:`~DataFrame.to_sql` with method parameter set to ``multi`` works with Oracle on the backend
- :attr:`Series.attrs` / :attr:`DataFrame.attrs` now uses a deepcopy for propagating ``attrs`` (:issue:`54134`).
- :class:`NDArrayBackedExtensionArray` now exposed in the public API (:issue:`45544`)
- :func:`get_dummies` now returning extension dtypes ``boolean`` or ``bool[pyarrow]`` that are compatible with the input dtype (:issue:`56273`)
- :func:`read_csv` now supports ``on_bad_lines`` parameter with ``engine="pyarrow"`` (:issue:`54480`)
- :func:`read_sas` returns ``datetime64`` dtypes with resolutions better matching those stored natively in SAS, and avoids returning object-dtype in cases that cannot be stored with ``datetime64[ns]`` dtype (:issue:`56127`)
2 changes: 2 additions & 0 deletions pandas/api/extensions/__init__.py
@@ -18,6 +18,7 @@
from pandas.core.arrays import (
    ExtensionArray,
    ExtensionScalarOpsMixin,
    NDArrayBackedExtensionArray,
)

__all__ = [
@@ -30,4 +31,5 @@
    "take",
    "ExtensionArray",
    "ExtensionScalarOpsMixin",
    "NDArrayBackedExtensionArray",
]
2 changes: 2 additions & 0 deletions pandas/core/arrays/__init__.py
@@ -1,3 +1,4 @@
from pandas.core.arrays._mixins import NDArrayBackedExtensionArray
from pandas.core.arrays.arrow import ArrowExtensionArray
from pandas.core.arrays.base import (
    ExtensionArray,
@@ -34,6 +35,7 @@
    "FloatingArray",
    "IntegerArray",
    "IntervalArray",
    "NDArrayBackedExtensionArray",
    "NumpyExtensionArray",
    "PeriodArray",
    "period_array",
39 changes: 24 additions & 15 deletions pandas/core/arrays/_mixins.py
@@ -92,6 +92,17 @@ def method(self, *args, **kwargs):
class NDArrayBackedExtensionArray(NDArrayBacked, ExtensionArray):
    """
    ExtensionArray that is backed by a single NumPy ndarray.

    Notes
    -----
    This class is part of the public API, but may be adjusted in non-user-facing
    ways more aggressively than the regular API.

    Examples
    --------
    Please see the following:

    https://pandas.pydata.org/docs/development/extending.html#NDArrayBackedExtensionArray
    """

    _ndarray: np.ndarray
@@ -114,6 +125,7 @@ def _validate_scalar(self, value):

    # ------------------------------------------------------------------------

    @doc(ExtensionArray.view)
    def view(self, dtype: Dtype | None = None) -> ArrayLike:
        # We handle datetime64, datetime64tz, timedelta64, and period
        # dtypes here. Everything else we pass through to the underlying
@@ -154,6 +166,7 @@ def view(self, dtype: Dtype | None = None) -> ArrayLike:
        # Sequence[int]]], List[Any], _DTypeDict, Tuple[Any, Any]]]"
        return arr.view(dtype=dtype)  # type: ignore[arg-type]

    @doc(ExtensionArray.take)
    def take(
        self,
        indices: TakeIndexer,
@@ -440,20 +453,8 @@ def _where(self: Self, mask: npt.NDArray[np.bool_], value) -> Self:
    # ------------------------------------------------------------------------
    # Index compat methods

    @doc(ExtensionArray.insert)
    def insert(self, loc: int, item) -> Self:
        """
        Make new ExtensionArray inserting new item at location. Follows
        Python list.append semantics for negative values.

        Parameters
        ----------
        loc : int
        item : object

        Returns
        -------
        type(self)
        """
        loc = validate_insert_loc(loc, len(self))

        code = self._validate_scalar(item)
@@ -474,16 +475,24 @@ def insert(self, loc: int, item) -> Self:

    def value_counts(self, dropna: bool = True) -> Series:
        """
-       Return a Series containing counts of unique values.
+       Return a Series containing counts of each unique value.

        Parameters
        ----------
        dropna : bool, default True
-           Don't include counts of NA values.
+           Don't include counts of missing values.

        Returns
        -------
        Series

        Examples
        --------
        >>> arr = pd.array([4, 5])
        >>> arr.value_counts()
        4    1
        5    1
        Name: count, dtype: Int64
        """
        if self.ndim != 1:
            raise NotImplementedError

Review thread on this example:

Member: this won't give a NDArrayBackedEA

Author: The docstrings for the other methods are the EA docstrings, so I've used an example similar to those here. Is there a better way of going about this? I can't see how to write an example using a NDArrayBackedEA without several lines initialising an ExtensionDtype and NDArrayBackedEA.
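For intuition, the counting that ``value_counts`` performs over the backing ndarray can be sketched with ``numpy.unique`` alone (illustrative only; the actual implementation uses pandas' hashtable-based algorithms and handles ``dropna``):

```python
import numpy as np

# Count occurrences of each unique value in a backing ndarray, mimicking
# the totals that value_counts reports (NA handling omitted).
values = np.array([4, 5, 4, 4])
uniques, counts = np.unique(values, return_counts=True)

for u, c in zip(uniques, counts):
    print(u, c)
# 4 3
# 5 1
```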
1 change: 1 addition & 0 deletions pandas/tests/api/test_api.py
@@ -336,6 +336,7 @@ class TestApi(Base):
        "take",
        "ExtensionArray",
        "ExtensionScalarOpsMixin",
        "NDArrayBackedExtensionArray",
    ]

    def test_api(self):