Commit 8729566

WIP: NumPyBackedExtensionArray
Adds a NumPyBackedExtensionArray, a thin wrapper around ndarray implementing the EA interface. We use this to ensure that `Series.array -> ExtensionArray`, rather than a `Union[ndarray, ExtensionArray]`.
1 parent 9f29f88 commit 8729566

13 files changed, +592 -53 lines changed

doc/source/basics.rst

+6-6
@@ -71,8 +71,9 @@ the **array** property
    s.array
    s.index.array

-Depending on the data type (see :ref:`basics.dtypes`), :attr:`~Series.array`
-be either a NumPy array or an :ref:`ExtensionArray <extending.extension-type>`.
+
+:attr:`~Series.array` will always be an ``ExtensionArray``.
+
 If you know you need a NumPy array, use :meth:`~Series.to_numpy`
 or :meth:`numpy.asarray`.

@@ -81,8 +82,7 @@ or :meth:`numpy.asarray`.
    s.to_numpy()
    np.asarray(s)

-For Series and Indexes backed by NumPy arrays (like we have here), this will
-be the same as :attr:`~Series.array`. When the Series or Index is backed by
+When the Series or Index is backed by
 a :class:`~pandas.api.extension.ExtensionArray`, :meth:`~Series.to_numpy`
 may involve copying data and coercing values.

@@ -115,8 +115,8 @@ drawbacks:

 1. When your Series contains an :ref:`extension type <extending.extension-type>`, it's
    unclear whether :attr:`Series.values` returns a NumPy array or the extension array.
-   :attr:`Series.array` will always return the actual array backing the Series,
-   while :meth:`Series.to_numpy` will always return a NumPy array.
+   :attr:`Series.array` will never require copying data, while :meth:`Series.to_numpy`
+   will always return a NumPy array (potentially at the cost of copying / coercing values).
 2. When your DataFrame contains a mixture of data types, :attr:`DataFrame.values` may
    involve copying data and coercing values to a common dtype, a relatively expensive
    operation. :meth:`DataFrame.to_numpy`, being a method, makes it clearer that the
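
The distinction documented in these hunks can be illustrated with a short sketch. It assumes a pandas release (0.24 or later) in which `Series.array` and `Series.to_numpy` are available. The concrete wrapper class returned for NumPy-backed data varies between versions, so the example only checks against the public `ExtensionArray` base class.

```python
import numpy as np
import pandas as pd

# A Series backed by a plain NumPy array.
s = pd.Series([1, 2, 3])
arr = s.array        # an ExtensionArray wrapping the data
nd = s.to_numpy()    # an actual numpy.ndarray

print(isinstance(arr, pd.api.extensions.ExtensionArray))
print(isinstance(nd, np.ndarray))

# A Series backed by an extension type: .array hands back the
# Categorical itself, while to_numpy() materialises the values.
c = pd.Series(pd.Categorical(["a", "b", "a"]))
cat = c.array
cat_nd = c.to_numpy()
print(type(cat).__name__)
print(cat_nd.tolist())
```

In both cases `.array` returns the object backing the Series, while `.to_numpy()` is the call that may copy or coerce.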

doc/source/dsintro.rst

+6-2
@@ -146,11 +146,15 @@ If you need the actual array backing a ``Series``, use :attr:`Series.array`.

    s.array

-Again, this is often a NumPy array, but may instead be a
-:class:`~pandas.api.extensions.ExtensionArray`. See :ref:`basics.dtypes` for more.
 Accessing the array can be useful when you need to do some operation without the
 index (to disable :ref:`automatic alignment <dsintro.alignment>`, for example).

+:attr:`Series.array` will always be an :class:`~pandas.api.extensions.ExtensionArray`.
+Briefly, an ExtensionArray is a thin wrapper around one or more *real* arrays like a
+:class:`numpy.ndarray`. Pandas knows how to take an ``ExtensionArray`` and
+store it in a ``Series`` or a column of a ``DataFrame``.
+See :ref:`basics.dtypes` for more.
+
 While Series is ndarray-like, if you need an *actual* ndarray, then use
 :meth:`Series.to_numpy`.

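
The remark about disabling automatic alignment can be sketched as follows. This is an illustration only: it assumes a pandas release in which the array returned by `Series.array` supports arithmetic, and the series and labels are made up.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Series arithmetic matches values by label.
aligned = a + b

# Working on the backing arrays matches values by position instead,
# because no index is involved.
positional = a.array + b.array

print(aligned["x"], aligned["z"])
print(np.asarray(positional).tolist())
```

Here label `"x"` pairs 1 with 30 under alignment, whereas the positional result pairs 1 with 10.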

doc/source/whatsnew/v0.24.0.rst

+4-3
@@ -65,8 +65,9 @@ If you need an actual NumPy array, use :meth:`Series.to_numpy` or :meth:`Index.t
    idx.to_numpy()
    pd.Series(idx).to_numpy()

-For Series and Indexes backed by normal NumPy arrays, this will be the same thing (and the same
-as ``.values``).
+For Series and Indexes backed by normal NumPy arrays, :attr:`Series.array` will return a
+new :class:`NumPyBackedExtensionArray`, which is a thin (no-copy) wrapper around a
+:class:`numpy.ndarray`.

 .. ipython:: python

@@ -75,7 +76,7 @@ as ``.values``).
    ser.to_numpy()

 We haven't removed or deprecated :attr:`Series.values` or :attr:`DataFrame.values`, but we
-recommend and using ``.array`` or ``.to_numpy()`` instead.
+highly recommend using ``.array`` or ``.to_numpy()`` instead.

 See :ref:`basics.dtypes` and :ref:`dsintro.attrs` for more.


pandas/core/arrays/__init__.py

+1
@@ -9,3 +9,4 @@
 from .integer import (  # noqa
     IntegerArray, integer_array)
 from .sparse import SparseArray  # noqa
+from .numpy_ import NumPyExtensionArray, NumPyExtensionDtype  # noqa

pandas/core/arrays/categorical.py

+6-7
@@ -16,11 +16,11 @@
 from pandas.core.dtypes.cast import (
     coerce_indexer_dtype, maybe_infer_to_datetimelike)
 from pandas.core.dtypes.common import (
-    ensure_int64, ensure_object, ensure_platform_int, is_categorical,
-    is_categorical_dtype, is_datetime64_dtype, is_datetimelike, is_dict_like,
-    is_dtype_equal, is_extension_array_dtype, is_float_dtype, is_integer_dtype,
-    is_iterator, is_list_like, is_object_dtype, is_scalar, is_sequence,
-    is_timedelta64_dtype)
+    ensure_int64, ensure_object, ensure_platform_int, extract_array,
+    is_categorical, is_categorical_dtype, is_datetime64_dtype, is_datetimelike,
+    is_dict_like, is_dtype_equal, is_extension_array_dtype, is_float_dtype,
+    is_integer_dtype, is_iterator, is_list_like, is_object_dtype, is_scalar,
+    is_sequence, is_timedelta64_dtype)
 from pandas.core.dtypes.dtypes import CategoricalDtype
 from pandas.core.dtypes.generic import (
     ABCCategoricalIndex, ABCIndexClass, ABCSeries)
@@ -2078,8 +2078,7 @@ def __setitem__(self, key, value):
             `Categorical` does not have the same categories
         """

-        if isinstance(value, (ABCIndexClass, ABCSeries)):
-            value = value.array
+        value = extract_array(value)

         # require identical categories set
         if isinstance(value, Categorical):
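
The definition of `extract_array` is not part of the files shown here. Judging only from the two lines it replaces, it presumably unwraps a Series or Index to its backing array and passes anything else through. The helper below, `extract_array_sketch`, is a hypothetical re-creation of that behaviour using public pandas API, not the function added by this commit.

```python
import pandas as pd


def extract_array_sketch(obj):
    """Hypothetical stand-in: unwrap a Series or Index, pass others through."""
    if isinstance(obj, (pd.Series, pd.Index)):
        return obj.array
    return obj


wrapped = pd.Series(pd.Categorical(["a", "b"]))
unwrapped = extract_array_sketch(wrapped)   # the Categorical backing the Series
plain = [1, 2]
untouched = extract_array_sketch(plain)     # non-pandas input is returned as-is

print(type(unwrapped).__name__)
print(untouched is plain)
```

Centralising the unwrapping in one helper lets each `__setitem__` accept a Series, an Index, or a bare array without repeating the `isinstance` check.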

pandas/core/arrays/numpy_.py

+271
@@ -0,0 +1,271 @@
+import numpy as np
+
+from pandas._libs import lib
+
+from pandas.core.dtypes.common import extract_array
+from pandas.core.dtypes.dtypes import ExtensionDtype
+from pandas.core.dtypes.generic import ABCIndexClass, ABCSeries
+from pandas.core.dtypes.inference import is_list_like
+
+from pandas import compat
+from pandas.core import nanops
+
+from .base import ExtensionArray, ExtensionOpsMixin
+
+
+class NumPyExtensionDtype(ExtensionDtype):
+    _metadata = ('_dtype',)
+
+    def __init__(self, dtype):
+        assert isinstance(dtype, np.dtype)
+        self._dtype = dtype
+        self._name = dtype.name
+        self._type = dtype.type
+
+    @property
+    def name(self):
+        return self._name
+
+    @property
+    def type(self):
+        return self._type
+
+    @property
+    def _is_numeric(self):
+        # TODO: find numeric types
+        return True
+
+    @property
+    def _is_boolean(self):
+        return self.kind == 'b'  # object?
+
+    @classmethod
+    def construct_from_string(cls, string):
+        return cls(np.dtype(string))
+
+    def construct_array_type(cls):
+        return NumPyExtensionArray
+
+    @property
+    def kind(self):
+        return self._dtype.kind
+
+    @property
+    def itemsize(self):
+        return self._dtype.itemsize
+
+
+class NumPyExtensionArray(ExtensionArray, ExtensionOpsMixin):
+    __array_priority__ = 1000
+
+    def __init__(self, values):
+        if isinstance(values, type(self)):
+            values = values._ndarray
+        assert isinstance(values, np.ndarray)
+        assert values.ndim == 1
+
+        self._ndarray = values
+        self._dtype = NumPyExtensionDtype(values.dtype)
+
+    @classmethod
+    def _from_sequence(cls, scalars, dtype=None, copy=False):
+        #
+        # if isinstance(dtype, NumpyDtype):
+        #     dtype = dtype._dtype
+        # we deliberately ignore dtype to not deal with casting issues.
+
+        result = np.asarray(scalars)
+        if copy and result is scalars:
+            result = result.copy()
+        return cls(result)
+
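
The copy handling in `_from_sequence` relies on `np.asarray` returning its argument unchanged when that argument is already an ndarray. A minimal sketch of that logic, using a hypothetical free function `from_sequence_sketch` that returns a bare ndarray rather than the wrapper class:

```python
import numpy as np


def from_sequence_sketch(scalars, copy=False):
    """Hypothetical stand-in for `_from_sequence`, returning a bare ndarray."""
    result = np.asarray(scalars)
    # np.asarray hands back the very same object for ndarray input, so an
    # explicit copy is only needed in that case. Any other input (a list,
    # say) has already been converted into a fresh array.
    if copy and result is scalars:
        result = result.copy()
    return result


arr = np.array([1, 2, 3])
view = from_sequence_sketch(arr)              # no copy: same object
fresh = from_sequence_sketch(arr, copy=True)  # independent buffer
from_list = from_sequence_sketch([1, 2, 3], copy=True)

print(view is arr)
print(np.shares_memory(fresh, arr))
```

This is why the wrapper can be described as no-copy by default: wrapping an existing ndarray shares its memory unless `copy=True` is requested.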
+    @classmethod
+    def _from_factorized(cls, values, original):
+        return cls(values)
+
+    @classmethod
+    def _concat_same_type(cls, to_concat):
+        return cls(np.concatenate(to_concat))
+
+    @property
+    def dtype(self):
+        return self._dtype
+
+    def __array__(self, dtype=None):
+        return np.asarray(self._ndarray, dtype=dtype)
+
+    def __getitem__(self, item):
+        if isinstance(item, type(self)):
+            item = item._ndarray
+
+        result = self._ndarray[item]
+        if not lib.is_scalar(result):
+            result = type(self)(result)
+        return result
+
+    def __setitem__(self, key, value):
+        value = extract_array(value)
+
+        if not lib.is_scalar(key) and is_list_like(key):
+            key = np.asarray(key)
+            if not len(key):
+                # early return to avoid casting unnecessarily.
+                return
+
+        if not lib.is_scalar(value):
+            value = np.asarray(value)
+
+        values = self._ndarray
+        t = np.result_type(value, values)
+        if t != self._ndarray.dtype:
+            values = values.astype(t, casting='safe')
+            values[key] = value
+            self._dtype = NumPyExtensionDtype(t)
+            self._ndarray = values
+        else:
+            self._ndarray[key] = value
+
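
The upcasting branch of `__setitem__` can be followed step by step on a bare ndarray. The sketch below mirrors only the `np.result_type` / `astype(..., casting='safe')` logic; the variable names are illustrative and the wrapper class itself is left out.

```python
import numpy as np

values = np.array([1, 2, 3])   # integer dtype
value = 1.5                    # a float that does not fit an integer array

# Find a dtype able to hold both the existing data and the new value.
t = np.result_type(value, values)

if t != values.dtype:
    # Upcast first (integer to float64 is a "safe" cast), then assign.
    values = values.astype(t, casting="safe")
values[0] = value

print(values.dtype)
print(values.tolist())
```

Because `astype` returns a new array, the wrapper has to rebind both `_ndarray` and `_dtype` after such an assignment, which is what the branch in the diff does.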
+    def __len__(self):
+        return len(self._ndarray)
+
+    @property
+    def nbytes(self):
+        return self._ndarray.nbytes
+
+    def isna(self):
+        from pandas import isna
+
+        return isna(self._ndarray)
+
+    def fillna(self, value=None, method=None, limit=None):
+        from pandas.api.types import is_array_like
+        from pandas.util._validators import validate_fillna_kwargs
+        from pandas.core.missing import pad_1d, backfill_1d
+
+        # TODO: really need to implement `_values_for_fillna`.
+        value, method = validate_fillna_kwargs(value, method)
+
+        mask = self.isna()
+
+        if is_array_like(value):
+            if len(value) != len(self):
+                raise ValueError("Length of 'value' does not match. Got ({}) "
+                                 " expected {}".format(len(value), len(self)))
+            value = value[mask]
+
+        if mask.any():
+            if method is not None:
+                func = pad_1d if method == 'pad' else backfill_1d
+                new_values = func(self._ndarray, limit=limit,
+                                  mask=mask)
+                new_values = self._from_sequence(new_values, dtype=self.dtype)
+            else:
+                # fill with value
+                new_values = self.copy()
+                new_values[mask] = value
+        else:
+            new_values = self.copy()
+        return new_values
+
+    def take(self, indices, allow_fill=False, fill_value=None):
+        from pandas.core.algorithms import take
+
+        result = take(self._ndarray, indices, allow_fill=allow_fill,
+                      fill_value=fill_value)
+        return type(self)(result)
+
+    def copy(self, deep=False):
+        return type(self)(self._ndarray.copy())
+
+    def _values_for_argsort(self):
+        return self._ndarray
+
+    def _values_for_factorize(self):
+        return self._ndarray, -1
+
+    def unique(self):
+        from pandas import unique
+
+        return type(self)(unique(self._ndarray))
+
+    def _reduce(self, name, skipna=True, **kwargs):
+        meth = getattr(self, name, None)
+        if meth is None:
+            # raise from the parent
+            return super(NumPyExtensionArray, self)._reduce(
+                name, skipna=skipna, **kwargs
+            )
+
+        return meth(skipna=skipna, **kwargs)
+
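
`_reduce` dispatches by name: it looks the reduction up as a method on the array and reports an unsupported reduction otherwise. The toy class below, `ReducerSketch`, is a hypothetical pure-Python illustration of that lookup. It wraps a list, implements a single reduction, and raises `TypeError` directly instead of deferring to a parent class.

```python
class ReducerSketch(object):
    """Hypothetical toy container showing name-based reduction dispatch."""

    def __init__(self, data):
        self._data = list(data)

    def _reduce(self, name, skipna=True, **kwargs):
        # Look the reduction up by name; None means it is not implemented.
        meth = getattr(self, name, None)
        if meth is None:
            raise TypeError("cannot perform {} with this type".format(name))
        return meth(skipna=skipna, **kwargs)

    def sum(self, skipna=True):
        # NaN is the only float that is not equal to itself.
        vals = [v for v in self._data if not (skipna and v != v)]
        return sum(vals)


r = ReducerSketch([1.0, float("nan"), 2.0])
total = r._reduce("sum")              # NaN skipped by default

try:
    r._reduce("mode")                 # not implemented on the toy class
    failed = False
except TypeError:
    failed = True

print(total, failed)
```

Note that a plain `getattr` lookup finds any attribute with a matching name, so the set of reductions is defined implicitly by the method names on the class.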
+    def min(self, skipna=True):
+        return nanops.nanmin(self._ndarray, skipna=skipna)
+
+    def max(self, skipna=True):
+        return nanops.nanmax(self._ndarray, skipna=skipna)
+
+    def any(self, skipna=True):
+        return nanops.nanany(self._ndarray, skipna=skipna)
+
+    def all(self, skipna=True):
+        return nanops.nanall(self._ndarray, skipna=skipna)
+
+    def sum(self, skipna=True, min_count=0):
+        return nanops.nansum(self._ndarray, skipna=skipna,
+                             min_count=min_count)
+
+    def mean(self, skipna=True):
+        return nanops.nanmean(self._ndarray, skipna=skipna)
+
+    def median(self, skipna=True):
+        return nanops.nanmedian(self._ndarray, skipna=skipna)
+
+    def prod(self, min_count=0, skipna=True):
+        return nanops.nanprod(self._ndarray, min_count=min_count,
+                              skipna=skipna)
+
+    def std(self, skipna=True, ddof=1):
+        return nanops.nanstd(self._ndarray, skipna=skipna, ddof=ddof)
+
+    def var(self, skipna=True, ddof=1):
+        return nanops.nanvar(self._ndarray, skipna=skipna, ddof=ddof)
+
+    def kurt(self, skipna=True):
+        return nanops.nankurt(self._ndarray, skipna=skipna)
+
+    def skew(self, skipna=True):
+        return nanops.nanskew(self._ndarray, skipna=skipna)
+
+    def sem(self, skipna=True):
+        return nanops.nansem(self._ndarray, skipna=skipna)
+
+    def __invert__(self):
+        return type(self)(~self._ndarray)
+
+    @classmethod
+    def _create_arithmetic_method(cls, op):
+        def arithmetic_method(self, other):
+            if isinstance(other, (ABCIndexClass, ABCSeries)):
+                return NotImplemented
+
+            elif isinstance(other, cls):
+                other = other._ndarray
+
+            with np.errstate(all="ignore"):
+                result = op(self._ndarray, other)
+
+            if op is divmod:
+                a, b = result
+                return cls(a), cls(b)
+
+            return cls(result)
+
+        return compat.set_function_name(arithmetic_method,
+                                        "__{}__".format(op.__name__),
+                                        cls)
+
+    _create_comparison_method = _create_arithmetic_method
+
+
+NumPyExtensionArray._add_arithmetic_ops()
+NumPyExtensionArray._add_comparison_ops()
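
The operator machinery at the end of the file follows a factory pattern: one classmethod builds a dunder method from an operator function, and the methods are attached after the class body. A reduced sketch of that pattern, assuming only NumPy and the standard `operator` module; `WrapperSketch` and `_create_method` are hypothetical names, and the explicit `setattr` loop stands in for what `_add_arithmetic_ops()` and `_add_comparison_ops()` do in pandas.

```python
import operator

import numpy as np


class WrapperSketch(object):
    """Hypothetical thin wrapper around a 1-D ndarray."""

    def __init__(self, values):
        self._ndarray = np.asarray(values)

    @classmethod
    def _create_method(cls, op):
        def method(self, other):
            # Unwrap another wrapper so the op runs on raw ndarrays.
            if isinstance(other, cls):
                other = other._ndarray
            with np.errstate(all="ignore"):
                result = op(self._ndarray, other)
            return cls(result)

        method.__name__ = "__{}__".format(op.__name__)
        return method


# Attach the generated methods, one per operator function.
for op in (operator.add, operator.sub, operator.mul, operator.eq):
    name = "__{}__".format(op.__name__)
    setattr(WrapperSketch, name, WrapperSketch._create_method(op))

a = WrapperSketch([1, 2, 3])
b = WrapperSketch([10, 20, 30])
added = a + b
doubled = a * 2
equal = a == WrapperSketch([1, 0, 3])

print(added._ndarray.tolist())
print(equal._ndarray.tolist())
```

Reusing the same factory for comparisons, as the diff does with `_create_comparison_method = _create_arithmetic_method`, works because both kinds of operator simply forward to the underlying ndarray and rewrap the result.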
