Commit 01e99de

ENH: Allow storing ExtensionArrays in containers (#19520)
* ENH: non-interval changes
* COMPAT: py2 Super
* BUG: Use original object for extension array
* Consistent boxing / unboxing NumPy compat
* 32-bit compat
* Add a test array
* linting
* Default __iter__
* Tests for value_counts
* Implement value_counts
* Py2 compat
* Fixed dropna
* Test fixups
* Started setitem
* REF/Clean: Internal / External values
* Move to index base
* Setitem tests, decimal example
* Compat
* Fixed extension block tests. The only "API change" was that you can't just inherit from NonConsolidatableMixin, which is OK since 1. it's a mixin 2. geopandas also inherits from Block
* Clarify binop tests Make it clearer which bit might raise
* TST: Removed ops tests
* Cleanup unique handling
* Simplify object concat
* Use values for intersection I think eventually we'll want to ndarray_values for this, but it'll require a bit more work to support. Currently, using ndarary_values causes occasional failures on categorical.
* hmm
* More failing tests
* remove bad test
* better setitem
* Dropna works.
* Restore xfail test
* Test Categorical
* Xfail setitem tests
* TST: Skip JSON tests on py2
* Additional testing
* More tests
* ndarray_values
* API: Default ExtensionArray.astype (cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a) (cherry picked from commit fbf0a06)
* Simplify concat_as_object
* Py2 compat (cherry picked from commit b20e12c)
* Set-ops ugliness
* better docstrings
* tolist
* linting
* Moved dtypes (cherry picked from commit d136227)
* clean
* cleanup
* NumPy compat
* Use base _values for CategoricalIndex
* Update dev docs
* cleanup
* cleanup (cherry picked from commit 2425621)
* cleanup
* Linting
* Precision in tests
* Linting
* Move to extension
* Push _ndarray_values to ExtensionArray Now IndexOpsMixin._ndarray_values will dispatch all the way down to the EA. Subclasses like Categorical can override it as they see fit.
* Clean up tolist
* Move test locations
* Fixed test
* REF: Update per comments
* lint
* REF: Use _values for size and shape
* PERF: Implement size, shape for IntervalIndex
* PERF: Avoid materializing values for PeriodIndex shape, size
* Cleanup
* Override nbytes
* Remove unused change
* Docs
* Test cleanpu
* Always set PANDAS_TESTING_MODE
* Revert "Always set PANDAS_TESTING_MODE" This reverts commit a312ba5.
* Explicitly catch warnings or not
* fastparquet warnings
* Unicode literals strikes again. Only catch fp warning for newer numpy
* Restore circle env var
* More parquet test catching
* No stacklevel
* Lower bound on FP
* Exact bound for FP
* Don't use fastpath for ExtensionBlock make_block
* Consistently use _values
* TST: Additional constructor tests
* CLN: de-nested a bit
* _fill_value handling
* Handle user provided dtype in constructors. When the dtype matches, we allow it to proceed. When the dtype would require coercion, we raise.
* Document ExtensionBlock._maybe_coerce_values Also changes to use _values as we should
* Created ABCExtensionArray
* TST: Tests for is_object_dtype and is_string_dtype and EAs
* fixup! Handle user provided dtype in constructors.
* Doc for setitem
* Split base tests
* Revert test_parquet changes
* API: Removed _fill_value from the interface
* Push coercion to extension dtype till later
* Linting
* ERR: Better error message for coercion to 3rd party dtypes
* CLN: Make take_nd EA aware
* Revert sparse changes
* Other _typ for ABCExtensionArray
* Test cleanup and expansion. Tests for concating and aligning frames
* Copy if copy
* TST: remove self param for fixture
* Remove unnescessary EA handling in Series ctor
* API: Removed value_counts Moved setitem notes to comment
* More doc notes
* Handle expanding a DataFrame with an EA
* Added ExtensionDtype.__eq__ Support for astype
* linting
* REF: is_dtype_equal refactor Moved from PandasExtensionDtype to ExtensionDtype with one modification: catch TypeError explicitly.
* Remove reference to dtype being a class
* move
* Moved sparse check to take_nd
* Docstring
* Split tests
* Revert index change
* Copy changes
* Simplify EA implementation names comments for object vs. str missing values
* Linting
1 parent 0176f6e commit 01e99de

32 files changed: +1276 -130 lines changed

pandas/core/algorithms.py

+19 -7
@@ -15,11 +15,12 @@
     is_unsigned_integer_dtype, is_signed_integer_dtype,
     is_integer_dtype, is_complex_dtype,
     is_object_dtype,
+    is_extension_array_dtype,
     is_categorical_dtype, is_sparse,
     is_period_dtype,
     is_numeric_dtype, is_float_dtype,
     is_bool_dtype, needs_i8_conversion,
-    is_categorical, is_datetimetz,
+    is_datetimetz,
     is_datetime64_any_dtype, is_datetime64tz_dtype,
     is_timedelta64_dtype, is_interval_dtype,
     is_scalar, is_list_like,
@@ -547,7 +548,7 @@ def value_counts(values, sort=True, ascending=False, normalize=False,
         if is_categorical_dtype(values) or is_sparse(values):
 
             # handle Categorical and sparse,
-            result = Series(values).values.value_counts(dropna=dropna)
+            result = Series(values)._values.value_counts(dropna=dropna)
             result.name = name
             counts = result.values
 
@@ -1292,10 +1293,13 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None,
     """
     Specialized Cython take which sets NaN values in one pass
 
+    This dispatches to ``take`` defined on ExtensionArrays. It does not
+    currently dispatch to ``SparseArray.take`` for sparse ``arr``.
+
     Parameters
     ----------
-    arr : ndarray
-        Input array
+    arr : array-like
+        Input array.
     indexer : ndarray
         1-D array of indices to take, subarrays corresponding to -1 value
         indicies are filed with fill_value
@@ -1315,17 +1319,25 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None,
         If False, indexer is assumed to contain no -1 values so no filling
         will be done. This short-circuits computation of a mask. Result is
         undefined if allow_fill == False and -1 is present in indexer.
+
+    Returns
+    -------
+    subarray : array-like
+        May be the same type as the input, or cast to an ndarray.
     """
 
+    # TODO(EA): Remove these if / elifs as datetimeTZ, interval, become EAs
     # dispatch to internal type takes
-    if is_categorical(arr):
-        return arr.take_nd(indexer, fill_value=fill_value,
-                           allow_fill=allow_fill)
+    if is_extension_array_dtype(arr):
+        return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
     elif is_datetimetz(arr):
         return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
     elif is_interval_dtype(arr):
         return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
 
+    if is_sparse(arr):
+        arr = arr.get_values()
+
     if indexer is None:
         indexer = np.arange(arr.shape[axis], dtype=np.int64)
         dtype, fill_value = arr.dtype, arr.dtype.type()
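
The dispatch in `take_nd` can be pictured with a small standalone sketch. It is not the pandas implementation: `ListBackedArray` and `take_sketch` are hypothetical names, an `isinstance` check stands in for `is_extension_array_dtype`, and a list comprehension stands in for the Cython routines. It only illustrates the convention that `-1` in the indexer marks a position to be filled.

```python
class ListBackedArray:
    """Hypothetical stand-in for an extension array (not a pandas class)."""

    def __init__(self, values):
        self.data = list(values)

    def take(self, indexer, allow_fill=True, fill_value=None):
        # With allow_fill=True, -1 marks a position to fill with fill_value.
        if allow_fill:
            taken = [fill_value if i == -1 else self.data[i] for i in indexer]
        else:
            taken = [self.data[i] for i in indexer]
        return type(self)(taken)


def take_sketch(arr, indexer, fill_value=None, allow_fill=True):
    """Simplified illustration of the dispatch order shown in the hunk."""
    # Stand-in for ``is_extension_array_dtype(arr)``.
    if isinstance(arr, ListBackedArray):
        return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
    # Fallback for plain sequences; the real function uses Cython routines.
    return [fill_value if (allow_fill and i == -1) else arr[i]
            for i in indexer]


ea_result = take_sketch(ListBackedArray([10, 20, 30]), [2, -1, 0])
list_result = take_sketch([10, 20, 30], [2, -1, 0], fill_value=0)
print(ea_result.data)  # [30, None, 10]
print(list_result)     # [30, 0, 10]
```

The array-backed call returns the same container type it was given, which matches the "May be the same type as the input" note added to the docstring.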

pandas/core/arrays/base.py

+69 -21
@@ -25,14 +25,13 @@ class ExtensionArray(object):
     * isna
     * take
     * copy
-    * _formatting_values
     * _concat_same_type
 
-    Some additional methods are required to satisfy pandas' internal, private
+    Some additional methods are available to satisfy pandas' internal, private
     block API.
 
-    * _concat_same_type
     * _can_hold_na
+    * _formatting_values
 
     This class does not inherit from 'abc.ABCMeta' for performance reasons.
     Methods and properties required by the interface raise
@@ -53,13 +52,14 @@ class ExtensionArray(object):
     Extension arrays should be able to be constructed with instances of
     the class, i.e. ``ExtensionArray(extension_array)`` should return
     an instance, not error.
-
-    Additionally, certain methods and interfaces are required for proper
-    this array to be properly stored inside a ``DataFrame`` or ``Series``.
     """
+    # '_typ' is for pandas.core.dtypes.generic.ABCExtensionArray.
+    # Don't override this.
+    _typ = 'extension'
     # ------------------------------------------------------------------------
     # Must be a Sequence
     # ------------------------------------------------------------------------
+
     def __getitem__(self, item):
         # type (Any) -> Any
         """Select a subset of self.
@@ -92,7 +92,46 @@ def __getitem__(self, item):
         raise AbstractMethodError(self)
 
     def __setitem__(self, key, value):
-        # type: (Any, Any) -> None
+        # type: (Union[int, np.ndarray], Any) -> None
+        """Set one or more values inplace.
+
+        This method is not required to satisfy the pandas extension array
+        interface.
+
+        Parameters
+        ----------
+        key : int, ndarray, or slice
+            When called from, e.g. ``Series.__setitem__``, ``key`` will be
+            one of
+
+            * scalar int
+            * ndarray of integers.
+            * boolean ndarray
+            * slice object
+
+        value : ExtensionDtype.type, Sequence[ExtensionDtype.type], or object
+            value or values to be set of ``key``.
+
+        Returns
+        -------
+        None
+        """
+        # Some notes to the ExtensionArray implementor who may have ended up
+        # here. While this method is not required for the interface, if you
+        # *do* choose to implement __setitem__, then some semantics should be
+        # observed:
+        #
+        # * Setting multiple values : ExtensionArrays should support setting
+        #   multiple values at once, 'key' will be a sequence of integers and
+        #   'value' will be a same-length sequence.
+        #
+        # * Broadcasting : For a sequence 'key' and a scalar 'value',
+        #   each position in 'key' should be set to 'value'.
+        #
+        # * Coercion : Most users will expect basic coercion to work. For
+        #   example, a string like '2018-01-01' is coerced to a datetime
+        #   when setting on a datetime64ns array. In general, if the
+        #   __init__ method coerces that value, then so should __setitem__
         raise NotImplementedError(_not_implemented_message.format(
             type(self), '__setitem__')
         )
@@ -107,6 +146,16 @@ def __len__(self):
         # type: () -> int
         raise AbstractMethodError(self)
 
+    def __iter__(self):
+        """Iterate over elements of the array.
+
+        """
+        # This needs to be implemented so that pandas recognizes extension
+        # arrays as list-like. The default implementation makes successive
+        # calls to ``__getitem__``, which may be slower than necessary.
+        for i in range(len(self)):
+            yield self[i]
+
     # ------------------------------------------------------------------------
     # Required attributes
     # ------------------------------------------------------------------------
@@ -132,9 +181,9 @@ def nbytes(self):
         # type: () -> int
         """The number of bytes needed to store this object in memory.
 
-        If this is expensive to compute, return an approximate lower bound
-        on the number of bytes needed.
         """
+        # If this is expensive to compute, return an approximate lower bound
+        # on the number of bytes needed.
         raise AbstractMethodError(self)
 
     # ------------------------------------------------------------------------
@@ -184,8 +233,8 @@ def take(self, indexer, allow_fill=True, fill_value=None):
             will be done. This short-circuits computation of a mask. Result is
             undefined if allow_fill == False and -1 is present in indexer.
         fill_value : any, default None
-            Fill value to replace -1 values with. By default, this uses
-            the missing value sentinel for this type, ``self._fill_value``.
+            Fill value to replace -1 values with. If applicable, this should
+            use the sentinel missing value for this type.
 
         Notes
         -----
@@ -198,17 +247,20 @@ def take(self, indexer, allow_fill=True, fill_value=None):
 
         Examples
         --------
-        Suppose the extension array somehow backed by a NumPy structured array
-        and that the underlying structured array is stored as ``self.data``.
-        Then ``take`` may be written as
+        Suppose the extension array is backed by a NumPy array stored as
+        ``self.data``. Then ``take`` may be written as
 
         .. code-block:: python
 
            def take(self, indexer, allow_fill=True, fill_value=None):
                mask = indexer == -1
                result = self.data.take(indexer)
-               result[mask] = self._fill_value
+               result[mask] = np.nan  # NA for this type
                return type(self)(result)
+
+        See Also
+        --------
+        numpy.take
         """
         raise AbstractMethodError(self)
 
@@ -230,17 +282,12 @@ def copy(self, deep=False):
     # ------------------------------------------------------------------------
     # Block-related methods
     # ------------------------------------------------------------------------
-    @property
-    def _fill_value(self):
-        # type: () -> Any
-        """The missing value for this type, e.g. np.nan"""
-        return None
 
     def _formatting_values(self):
         # type: () -> np.ndarray
         # At the moment, this has to be an array since we use result.dtype
         """An array of values to be printed in, e.g. the Series repr"""
-        raise AbstractMethodError(self)
+        return np.array(self)
 
     @classmethod
     def _concat_same_type(cls, to_concat):
@@ -257,6 +304,7 @@ def _concat_same_type(cls, to_concat):
         """
         raise AbstractMethodError(cls)
 
+    @property
     def _can_hold_na(self):
         # type: () -> bool
         """Whether your array can hold missing values. True by default.
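
The `__setitem__` notes in this file (setting several values at once, broadcasting a scalar, and coercing the way `__init__` does) can be illustrated with a small standalone sketch. `DecimalList` is a hypothetical list-backed container, not a pandas `ExtensionArray` subclass, and it accepts plain Python lists where pandas would pass ndarrays. Its `__iter__` has the same shape as the default added in this commit.

```python
from decimal import Decimal


class DecimalList:
    """Hypothetical list-backed container (not a pandas ExtensionArray)."""

    def __init__(self, values):
        # The constructor coerces, so __setitem__ coerces the same way.
        self.data = [Decimal(v) for v in values]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        return self.data[item]

    def __iter__(self):
        # Successive __getitem__ calls, as in the default __iter__.
        for i in range(len(self)):
            yield self[i]

    def __setitem__(self, key, value):
        if isinstance(key, int):
            self.data[key] = Decimal(value)
            return
        if isinstance(key, slice):
            positions = list(range(*key.indices(len(self))))
        else:
            key = list(key)
            if key and all(isinstance(k, bool) for k in key):
                # Boolean mask: keep the positions flagged True.
                positions = [i for i, flag in enumerate(key) if flag]
            else:
                positions = key
        if isinstance(value, (str, int, Decimal)):
            # Broadcasting: one scalar for every selected position.
            values = [value] * len(positions)
        else:
            values = list(value)
            if len(values) != len(positions):
                raise ValueError("key and value must have the same length")
        for pos, val in zip(positions, values):
            self.data[pos] = Decimal(val)


arr = DecimalList(["1", "2", "3", "4"])
arr[0] = "10"                          # scalar key, coerced from str
arr[[1, 2]] = ["20", "30"]             # several positions at once
arr[[False, False, False, True]] = 40  # boolean mask, scalar value
arr[1:3] = 0                           # slice with broadcasting
print(list(arr))
```

The four assignments cover the four key kinds listed in the docstring: scalar int, integer sequence, boolean mask, and slice.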

pandas/core/dtypes/base.py

+47 -10
@@ -1,4 +1,7 @@
 """Extend pandas with custom array types"""
+import numpy as np
+
+from pandas import compat
 from pandas.errors import AbstractMethodError
 
 
@@ -23,6 +26,32 @@ class ExtensionDtype(object):
     def __str__(self):
         return self.name
 
+    def __eq__(self, other):
+        """Check whether 'other' is equal to self.
+
+        By default, 'other' is considered equal if
+
+        * it's a string matching 'self.name'.
+        * it's an instance of this type.
+
+        Parameters
+        ----------
+        other : Any
+
+        Returns
+        -------
+        bool
+        """
+        if isinstance(other, compat.string_types):
+            return other == self.name
+        elif isinstance(other, type(self)):
+            return True
+        else:
+            return False
+
+    def __ne__(self, other):
+        return not self.__eq__(other)
+
     @property
     def type(self):
         # type: () -> type
@@ -102,11 +131,12 @@ def construct_from_string(cls, string):
 
     @classmethod
     def is_dtype(cls, dtype):
-        """Check if we match 'dtype'
+        """Check if we match 'dtype'.
 
         Parameters
         ----------
-        dtype : str or dtype
+        dtype : object
+            The object to check.
 
         Returns
         -------
@@ -118,12 +148,19 @@ def is_dtype(cls, dtype):
 
         1. ``cls.construct_from_string(dtype)`` is an instance
            of ``cls``.
-        2. 'dtype' is ``cls`` or a subclass of ``cls``.
+        2. ``dtype`` is an object and is an instance of ``cls``
+        3. ``dtype`` has a ``dtype`` attribute, and any of the above
+           conditions is true for ``dtype.dtype``.
         """
-        if isinstance(dtype, str):
-            try:
-                return isinstance(cls.construct_from_string(dtype), cls)
-            except TypeError:
-                return False
-        else:
-            return issubclass(dtype, cls)
+        dtype = getattr(dtype, 'dtype', dtype)
+
+        if isinstance(dtype, np.dtype):
+            return False
+        elif dtype is None:
+            return False
+        elif isinstance(dtype, cls):
+            return True
+        try:
+            return cls.construct_from_string(dtype) is not None
+        except TypeError:
+            return False
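
The new `__eq__` and `is_dtype` logic can be exercised without pandas in a minimal sketch. `SketchDtype` and `Holder` are hypothetical names, `str` replaces `compat.string_types` on the assumption of Python 3, and the `np.dtype` branch is left out so the example needs no third-party imports. The explicit `__hash__` is also an assumption, added because Python 3 drops the inherited hash when `__eq__` is defined.

```python
class SketchDtype:
    """Hypothetical dtype mirroring the comparison rules in the diff."""

    name = "sketch"

    def __eq__(self, other):
        # Equal to a string matching the name, or to another instance.
        if isinstance(other, str):
            return other == self.name
        return isinstance(other, type(self))

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Assumption: hash by name so instances stay usable as dict keys.
        return hash(self.name)

    @classmethod
    def construct_from_string(cls, string):
        if string == cls.name:
            return cls()
        raise TypeError("Cannot construct a '{}' from '{}'".format(
            cls.__name__, string))

    @classmethod
    def is_dtype(cls, dtype):
        # Unwrap containers that expose a ``dtype`` attribute.
        dtype = getattr(dtype, "dtype", dtype)
        if dtype is None:
            return False
        if isinstance(dtype, cls):
            return True
        try:
            return cls.construct_from_string(dtype) is not None
        except TypeError:
            return False


class Holder:
    """Hypothetical container exposing a ``dtype`` attribute."""

    dtype = SketchDtype()


print(SketchDtype() == "sketch")       # True
print(SketchDtype.is_dtype(Holder()))  # True
print(SketchDtype.is_dtype("int64"))   # False
```

Catching only `TypeError`, rather than using a bare `except`, keeps unrelated errors raised inside `construct_from_string` visible to the caller.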

pandas/core/dtypes/common.py

+1 -1
@@ -1708,9 +1708,9 @@ def is_extension_array_dtype(arr_or_dtype):
     """
     from pandas.core.arrays import ExtensionArray
 
-    # we want to unpack series, anything else?
     if isinstance(arr_or_dtype, (ABCIndexClass, ABCSeries)):
         arr_or_dtype = arr_or_dtype._values
+
     return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))
 
 

pandas/core/dtypes/dtypes.py

-25
@@ -66,13 +66,6 @@ def __hash__(self):
         raise NotImplementedError("sub-classes should implement an __hash__ "
                                   "method")
 
-    def __eq__(self, other):
-        raise NotImplementedError("sub-classes should implement an __eq__ "
-                                  "method")
-
-    def __ne__(self, other):
-        return not self.__eq__(other)
-
     def __getstate__(self):
         # pickle support; we don't want to pickle the cache
         return {k: getattr(self, k, None) for k in self._metadata}
@@ -82,24 +75,6 @@ def reset_cache(cls):
         """ clear the cache """
         cls._cache = {}
 
-    @classmethod
-    def is_dtype(cls, dtype):
-        """ Return a boolean if the passed type is an actual dtype that
-            we can match (via string or type)
-        """
-        if hasattr(dtype, 'dtype'):
-            dtype = dtype.dtype
-        if isinstance(dtype, np.dtype):
-            return False
-        elif dtype is None:
-            return False
-        elif isinstance(dtype, cls):
-            return True
-        try:
-            return cls.construct_from_string(dtype) is not None
-        except:
-            return False
-
 
 class CategoricalDtypeType(type):
     """

pandas/core/dtypes/generic.py

+2
@@ -57,6 +57,8 @@ def _check(cls, inst):
 ABCDateOffset = create_pandas_abc_type("ABCDateOffset", "_typ",
                                        ("dateoffset",))
 ABCInterval = create_pandas_abc_type("ABCInterval", "_typ", ("interval", ))
+ABCExtensionArray = create_pandas_abc_type("ABCExtensionArray", "_typ",
+                                           ("extension", "categorical",))
 
 
 class _ABCGeneric(type):
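
The `_typ`-based check behind `ABCExtensionArray` can be approximated with a short sketch. The body of `create_pandas_abc_type` is not shown in this diff, so `make_typ_checked_type` below is a hypothetical helper that only assumes the check compares an instance attribute against a tuple of allowed values. It illustrates why an array that sets `_typ = 'extension'` is recognised without importing the array class.

```python
def make_typ_checked_type(name, attr, allowed):
    """Hypothetical helper: isinstance() succeeds when ``attr`` is allowed."""

    class _Meta(type):
        def __instancecheck__(cls, inst):
            # Compare the instance's marker attribute to the allowed values.
            return getattr(inst, attr, None) in allowed

    return _Meta(name, (), {})


SketchABCExtensionArray = make_typ_checked_type(
    "SketchABCExtensionArray", "_typ", ("extension", "categorical"))


class MyArray:
    _typ = "extension"  # same marker value the diff adds to ExtensionArray


class Unrelated:
    pass


print(isinstance(MyArray(), SketchABCExtensionArray))    # True
print(isinstance(Unrelated(), SketchABCExtensionArray))  # False
```

Because the check reads an attribute rather than the class hierarchy, overriding `_typ` in a subclass would change the result, which is consistent with the "Don't override this" comment next to `_typ` in `ExtensionArray`.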
