Commit 54d621a

TomAugspurger authored and tm9k1 committed
[API/REF]: SparseArray is an ExtensionArray (pandas-dev#22325)
Makes SparseArray an ExtensionArray.

* Fixed DataFrame.__setitem__ for updating to sparse. Closes pandas-dev#22367
* Fixed Series[sparse].to_sparse Closes pandas-dev#22389

Closes pandas-dev#21978
Closes pandas-dev#19506
Closes pandas-dev#22835
1 parent 3feaa79 commit 54d621a

50 files changed, +3346 -1421 lines changed

doc/source/whatsnew/v0.24.0.txt

+46 -7
@@ -381,6 +381,37 @@ is the case with :attr:`Period.end_time`, for example

     p.end_time

+.. _whatsnew_0240.api_breaking.sparse_values:
+
+Sparse Data Structure Refactor
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``SparseArray``, the array backing ``SparseSeries`` and the columns in a ``SparseDataFrame``,
+is now an extension array (:issue:`21978`, :issue:`19056`, :issue:`22835`).
+To conform to this interface and for consistency with the rest of pandas, some API breaking
+changes were made:
+
+- ``SparseArray`` is no longer a subclass of :class:`numpy.ndarray`. To convert a SparseArray to a NumPy array, use :meth:`numpy.asarray`.
+- ``SparseArray.dtype`` and ``SparseSeries.dtype`` are now instances of :class:`SparseDtype`, rather than ``np.dtype``. Access the underlying dtype with ``SparseDtype.subtype``.
+- :meth:`numpy.asarray(sparse_array)` now returns a dense array with all the values, not just the non-fill-value values (:issue:`14167`)
+- ``SparseArray.take`` now matches the API of :meth:`pandas.api.extensions.ExtensionArray.take` (:issue:`19506`):
+
+  * The default value of ``allow_fill`` has changed from ``False`` to ``True``.
+  * The ``out`` and ``mode`` parameters are no longer accepted (previously, this raised if they were specified).
+  * Passing a scalar for ``indices`` is no longer allowed.
+
+- The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a ``SparseSeries``.
+- ``SparseDataFrame.combine`` and ``DataFrame.combine_first`` no longer support combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype SparseArray.
+- Setting :attr:`SparseArray.fill_value` to a fill value with a different dtype is now allowed.
+
+
+Some new warnings are issued for operations that require or are likely to materialize a large dense array:
+
+- A :class:`errors.PerformanceWarning` is issued when using fillna with a ``method``, as a dense array is constructed to create the filled array. Filling with a ``value`` is the efficient way to fill a sparse array.
+- A :class:`errors.PerformanceWarning` is now issued when concatenating sparse Series with differing fill values. The fill value from the first sparse array continues to be used.
+
+In addition to these API breaking changes, many :ref:`performance improvements and bug fixes have been made <whatsnew_0240.bug_fixes.sparse>`.
+
 .. _whatsnew_0240.api_breaking.frame_to_dict_index_orient:

 Raise ValueError in ``DataFrame.to_dict(orient='index')``
@@ -574,6 +605,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
 - Added :meth:`pandas.api.types.register_extension_dtype` to register an extension type with pandas (:issue:`22664`)
 - Series backed by an ``ExtensionArray`` now work with :func:`util.hash_pandas_object` (:issue:`23066`)
 - Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
+- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).
 - Support for reduction operations such as ``sum``, ``mean`` via opt-in base class method override (:issue:`22762`)

 .. _whatsnew_0240.api.incompatibilities:
@@ -656,6 +688,7 @@ Other API Changes
 - :class:`pandas.io.formats.style.Styler` supports a ``number-format`` property when using :meth:`~pandas.io.formats.style.Styler.to_excel` (:issue:`22015`)
 - :meth:`DataFrame.corr` and :meth:`Series.corr` now raise a ``ValueError`` along with a helpful error message instead of a ``KeyError`` when supplied with an invalid method (:issue:`22298`)
 - :meth:`shift` will now always return a copy, instead of the previous behaviour of returning self when shifting by 0 (:issue:`22397`)
+- Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (:issue:`22784`)

 .. _whatsnew_0240.deprecations:

@@ -897,13 +930,6 @@ Groupby/Resample/Rolling
 - :func:`RollingGroupby.agg` and :func:`ExpandingGroupby.agg` now support multiple aggregation functions as parameters (:issue:`15072`)
 - Bug in :meth:`DataFrame.resample` and :meth:`Series.resample` when resampling by a weekly offset (``'W'``) across a DST transition (:issue:`9119`, :issue:`21459`)

-Sparse
-^^^^^^
-
--
--
--
-
 Reshaping
 ^^^^^^^^^

@@ -922,6 +948,19 @@ Reshaping
 - Bug in :func:`merge_asof` when merging on float values within defined tolerance (:issue:`22981`)
 - Bug in :func:`pandas.concat` when concatenating a multicolumn DataFrame with tz-aware data against a DataFrame with a different number of columns (:issue:`22796`)

+.. _whatsnew_0240.bug_fixes.sparse:
+
+Sparse
+^^^^^^
+
+- Updating a boolean, datetime, or timedelta column to be Sparse now works (:issue:`22367`)
+- Bug in :meth:`Series.to_sparse` with Series already holding sparse data not constructing properly (:issue:`22389`)
+- Providing a ``sparse_index`` to the SparseArray constructor no longer defaults the na-value to ``np.nan`` for all dtypes. The correct na_value for ``data.dtype`` is now used.
+- Bug in ``SparseArray.nbytes`` under-reporting its memory usage by not including the size of its sparse index.
+- Improved performance of :meth:`Series.shift` for non-NA ``fill_value``, as values are no longer converted to a dense array.
+- Bug in ``DataFrame.groupby`` not including ``fill_value`` in the groups for non-NA ``fill_value`` when grouping by a sparse column (:issue:`5078`)
+- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`)
+
 Build Changes
 ^^^^^^^^^^^^^
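
The ``take`` and densification notes in the whatsnew entry above can be illustrated with a small pure-Python sketch. This is not pandas code: ``ToySparse``, its attributes, and the ``na_value`` parameter are hypothetical names chosen for illustration. The sketch assumes the ``ExtensionArray.take`` convention that, with ``allow_fill=True``, an index of ``-1`` marks a missing slot and any other negative index is rejected, while with ``allow_fill=False`` negative indices count from the end.

```python
class ToySparse:
    """Toy sparse container (hypothetical): keeps only non-fill values."""

    def __init__(self, dense, fill_value=0):
        self.fill_value = fill_value
        self.length = len(dense)
        # Positions and values of the entries that differ from fill_value.
        self.sp_index = [i for i, v in enumerate(dense) if v != fill_value]
        self.sp_values = [dense[i] for i in self.sp_index]

    def to_dense(self):
        # Every position is materialised, fill values included.
        out = [self.fill_value] * self.length
        for i, v in zip(self.sp_index, self.sp_values):
            out[i] = v
        return out

    def take(self, indices, allow_fill, na_value=None):
        if isinstance(indices, int):
            raise ValueError("indices must be a sequence, not a scalar")
        dense = self.to_dense()
        if not allow_fill:
            # Plain positional indexing: negatives count from the end.
            return [dense[i] for i in indices]
        if any(i < -1 for i in indices):
            raise ValueError("with allow_fill=True only -1 may be negative")
        # -1 marks a missing slot that receives na_value.
        return [na_value if i == -1 else dense[i] for i in indices]


arr = ToySparse([0, 0, 5, 0, 7])
dense = arr.to_dense()
filled = arr.take([2, -1, 4], allow_fill=True)
wrapped = arr.take([2, -1, 4], allow_fill=False)
```

The same index list gives different results depending on ``allow_fill``, which is why a change in its default is listed as API breaking.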

pandas/_libs/sparse.pyx

+8
@@ -68,6 +68,10 @@ cdef class IntIndex(SparseIndex):
         output += 'Indices: %s\n' % repr(self.indices)
         return output

+    @property
+    def nbytes(self):
+        return self.indices.nbytes
+
     def check_integrity(self):
         """
         Checks the following:
@@ -359,6 +363,10 @@ cdef class BlockIndex(SparseIndex):

         return output

+    @property
+    def nbytes(self):
+        return self.blocs.nbytes + self.blengths.nbytes
+
     @property
     def ngaps(self):
         return self.length - self.npoints
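
The two ``nbytes`` properties added above sum the byte sizes of each index's backing arrays. The following sketch, which is not taken from pandas, shows why the two layouts can report different sizes for the same positions. The helper names are hypothetical, the positions are assumed to be sorted, and the standard-library ``array`` type with typecode ``"i"`` stands in for the integer arrays the real classes hold.

```python
from array import array


def int_index_nbytes(indices):
    # IntIndex-style layout: one integer per stored (non-fill) position.
    idx = array("i", indices)
    return idx.itemsize * len(idx)


def block_index_nbytes(indices):
    # BlockIndex-style layout: contiguous runs kept as (start, length) pairs.
    blocs, blengths = [], []
    for i in indices:
        if blocs and i == blocs[-1] + blengths[-1]:
            blengths[-1] += 1      # position extends the current run
        else:
            blocs.append(i)        # position starts a new run
            blengths.append(1)
    starts, lengths = array("i", blocs), array("i", blengths)
    return starts.itemsize * len(starts) + lengths.itemsize * len(lengths)


positions = [0, 1, 2, 3, 10, 11]
int_bytes = int_index_nbytes(positions)
block_bytes = block_index_nbytes(positions)
```

With two runs of consecutive positions, the block layout stores four integers where the integer layout stores six, so it reports fewer bytes.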

pandas/core/arrays/base.py

+18 -3
@@ -287,10 +287,25 @@ def astype(self, dtype, copy=True):
         return np.array(self, dtype=dtype, copy=copy)

     def isna(self):
-        # type: () -> np.ndarray
-        """Boolean NumPy array indicating if each value is missing.
+        # type: () -> Union[ExtensionArray, np.ndarray]
+        """
+        A 1-D array indicating if each value is missing.
+
+        Returns
+        -------
+        na_values : Union[np.ndarray, ExtensionArray]
+            In most cases, this should return a NumPy ndarray. For
+            exceptional cases like ``SparseArray``, where returning
+            an ndarray would be expensive, an ExtensionArray may be
+            returned.
+
+        Notes
+        -----
+        If returning an ExtensionArray, then

-        This should return a 1-D array the same length as 'self'.
+        * ``na_values._is_boolean`` should be True
+        * `na_values` should implement :func:`ExtensionArray._reduce`
+        * ``na_values.any`` and ``na_values.all`` should be implemented
         """
         raise AbstractMethodError(self)
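
As a rough illustration of the contract in the docstring above, the sketch below builds a mask object that records only the missing positions instead of a dense boolean array. It is a toy under stated assumptions, not the pandas implementation: ``ToyBoolMask`` and ``toy_isna`` are hypothetical names, ``None`` stands in for a missing value, and the ``_reduce`` hook mentioned in the notes is left out.

```python
class ToyBoolMask:
    """Hypothetical boolean mask that stores only the True positions."""

    _is_boolean = True

    def __init__(self, length, true_positions):
        self.length = length
        self.true_positions = frozenset(true_positions)

    def __len__(self):
        return self.length

    def any(self):
        # True if at least one position is flagged.
        return len(self.true_positions) > 0

    def all(self):
        # True if every position is flagged.
        return len(self.true_positions) == self.length


def toy_isna(values):
    # Record where values are missing rather than building a dense mask.
    missing = [i for i, v in enumerate(values) if v is None]
    return ToyBoolMask(len(values), missing)


mask = toy_isna([1.0, None, 3.0])
```

Callers that only need ``mask.any()`` or ``mask.all()`` never force the mask into a dense array, which is the cost the docstring says a sparse array wants to avoid.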

pandas/core/common.py

+3 -1
@@ -14,7 +14,9 @@

 from pandas import compat
 from pandas.compat import iteritems, PY36, OrderedDict
-from pandas.core.dtypes.generic import ABCSeries, ABCIndex, ABCIndexClass
+from pandas.core.dtypes.generic import (
+    ABCSeries, ABCIndex, ABCIndexClass
+)
 from pandas.core.dtypes.common import (
     is_integer, is_bool_dtype, is_extension_array_dtype, is_array_like
 )

pandas/core/dtypes/common.py

+14 -3
@@ -12,6 +12,7 @@
     PeriodDtype, IntervalDtype,
     PandasExtensionDtype, ExtensionDtype,
     _pandas_registry)
+from pandas.core.sparse.dtype import SparseDtype
 from pandas.core.dtypes.generic import (
     ABCCategorical, ABCPeriodIndex, ABCDatetimeIndex, ABCSeries,
     ABCSparseArray, ABCSparseSeries, ABCCategoricalIndex, ABCIndexClass,
@@ -180,8 +181,10 @@ def is_sparse(arr):
     >>> is_sparse(bsr_matrix([1, 2, 3]))
     False
     """
+    from pandas.core.sparse.dtype import SparseDtype

-    return isinstance(arr, (ABCSparseArray, ABCSparseSeries))
+    dtype = getattr(arr, 'dtype', arr)
+    return isinstance(dtype, SparseDtype)


 def is_scipy_sparse(arr):
@@ -1643,8 +1646,9 @@ def is_bool_dtype(arr_or_dtype):
     True
     >>> is_bool_dtype(pd.Categorical([True, False]))
     True
+    >>> is_bool_dtype(pd.SparseArray([True, False]))
+    True
     """
-
     if arr_or_dtype is None:
         return False
     try:
@@ -1751,6 +1755,8 @@ def is_extension_array_dtype(arr_or_dtype):
     array interface. In pandas, this includes:

     * Categorical
+    * Sparse
+    * Interval

     Third-party libraries may implement arrays or types satisfying
     this interface as well.
@@ -1873,7 +1879,8 @@ def _get_dtype(arr_or_dtype):
             return PeriodDtype.construct_from_string(arr_or_dtype)
         elif is_interval_dtype(arr_or_dtype):
             return IntervalDtype.construct_from_string(arr_or_dtype)
-    elif isinstance(arr_or_dtype, (ABCCategorical, ABCCategoricalIndex)):
+    elif isinstance(arr_or_dtype, (ABCCategorical, ABCCategoricalIndex,
+                                   ABCSparseArray, ABCSparseSeries)):
         return arr_or_dtype.dtype

     if hasattr(arr_or_dtype, 'dtype'):
@@ -1921,6 +1928,10 @@ def _get_dtype_type(arr_or_dtype):
         elif is_interval_dtype(arr_or_dtype):
             return Interval
         return _get_dtype_type(np.dtype(arr_or_dtype))
+    elif isinstance(arr_or_dtype, (ABCSparseSeries, ABCSparseArray,
+                                   SparseDtype)):
+        dtype = getattr(arr_or_dtype, 'dtype', arr_or_dtype)
+        return dtype.type
     try:
         return arr_or_dtype.dtype.type
     except AttributeError:
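
The rewritten ``is_sparse`` above switches from checking the container class to checking the dtype, using ``getattr(arr, 'dtype', arr)`` so that a bare dtype is accepted as well. The following self-contained sketch shows that pattern with stand-in classes. All ``Toy*`` names are hypothetical and none of this is pandas code.

```python
class ToySparseDtype:
    """Hypothetical stand-in for SparseDtype."""


class ToySparseArray:
    # An array type that always carries the sparse dtype.
    dtype = ToySparseDtype()


class ToySeries:
    # A generic container whose sparseness is carried only by its dtype.
    def __init__(self, dtype):
        self.dtype = dtype


def toy_is_sparse(obj):
    # Objects without a .dtype attribute are treated as the dtype itself.
    dtype = getattr(obj, "dtype", obj)
    return isinstance(dtype, ToySparseDtype)


checks = [
    toy_is_sparse(ToySparseArray()),             # sparse array
    toy_is_sparse(ToySeries(ToySparseDtype())),  # container with sparse dtype
    toy_is_sparse(ToySparseDtype()),             # bare dtype
    toy_is_sparse(ToySeries("float64")),         # container with dense dtype
    toy_is_sparse([1, 2, 3]),                    # no dtype attribute at all
]
```

One consequence of this design is that any container holding sparse values is recognised without the check having to know about the container's class.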

pandas/core/dtypes/concat.py

+18 -54
@@ -93,11 +93,13 @@ def _get_series_result_type(result, objs=None):
 def _get_frame_result_type(result, objs):
     """
     return appropriate class of DataFrame-like concat
-    if all blocks are SparseBlock, return SparseDataFrame
+    if all blocks are sparse, return SparseDataFrame
     otherwise, return 1st obj
     """

-    if result.blocks and all(b.is_sparse for b in result.blocks):
+    if (result.blocks and (
+            all(is_sparse(b) for b in result.blocks) or
+            all(isinstance(obj, ABCSparseDataFrame) for obj in objs))):
         from pandas.core.sparse.api import SparseDataFrame
         return SparseDataFrame
     else:
@@ -554,61 +556,23 @@ def _concat_sparse(to_concat, axis=0, typs=None):
     a single array, preserving the combined dtypes
     """

-    from pandas.core.sparse.array import SparseArray, _make_index
+    from pandas.core.sparse.array import SparseArray

-    def convert_sparse(x, axis):
-        # coerce to native type
-        if isinstance(x, SparseArray):
-            x = x.get_values()
-        else:
-            x = np.asarray(x)
-        x = x.ravel()
-        if axis > 0:
-            x = np.atleast_2d(x)
-        return x
+    fill_values = [x.fill_value for x in to_concat
+                   if isinstance(x, SparseArray)]

-    if typs is None:
-        typs = get_dtype_kinds(to_concat)
+    if len(set(fill_values)) > 1:
+        raise ValueError("Cannot concatenate SparseArrays with different "
+                         "fill values")

-    if len(typs) == 1:
-        # concat input as it is if all inputs are sparse
-        # and have the same fill_value
-        fill_values = {c.fill_value for c in to_concat}
-        if len(fill_values) == 1:
-            sp_values = [c.sp_values for c in to_concat]
-            indexes = [c.sp_index.to_int_index() for c in to_concat]
-
-            indices = []
-            loc = 0
-            for idx in indexes:
-                indices.append(idx.indices + loc)
-                loc += idx.length
-            sp_values = np.concatenate(sp_values)
-            indices = np.concatenate(indices)
-            sp_index = _make_index(loc, indices, kind=to_concat[0].sp_index)
-
-            return SparseArray(sp_values, sparse_index=sp_index,
-                               fill_value=to_concat[0].fill_value)
-
-    # input may be sparse / dense mixed and may have different fill_value
-    # input must contain sparse at least 1
-    sparses = [c for c in to_concat if is_sparse(c)]
-    fill_values = [c.fill_value for c in sparses]
-    sp_indexes = [c.sp_index for c in sparses]
-
-    # densify and regular concat
-    to_concat = [convert_sparse(x, axis) for x in to_concat]
-    result = np.concatenate(to_concat, axis=axis)
-
-    if not len(typs - {'sparse', 'f', 'i'}):
-        # sparsify if inputs are sparse and dense numerics
-        # first sparse input's fill_value and SparseIndex is used
-        result = SparseArray(result.ravel(), fill_value=fill_values[0],
-                             kind=sp_indexes[0])
-    else:
-        # coerce to object if needed
-        result = result.astype('object')
-    return result
+    fill_value = fill_values[0]
+
+    # TODO: Fix join unit generation so we aren't passed this.
+    to_concat = [x if isinstance(x, SparseArray)
+                 else SparseArray(x.squeeze(), fill_value=fill_value)
+                 for x in to_concat]
+
+    return SparseArray._concat_same_type(to_concat)


 def _concat_rangeindex_same_dtype(indexes):
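
The control flow of the new ``_concat_sparse`` body shown in the hunk above can be sketched without pandas. ``ToySparseChunk`` and ``toy_concat_sparse`` are hypothetical names, plain lists stand in for dense arrays, and, like the function in the diff, the sketch assumes at least one sparse input is present.

```python
class ToySparseChunk:
    """Hypothetical stand-in for a sparse array with a fill value."""

    def __init__(self, values, fill_value):
        self.values = list(values)
        self.fill_value = fill_value


def toy_concat_sparse(to_concat):
    # Collect the fill values of the sparse inputs only.
    fill_values = [x.fill_value for x in to_concat
                   if isinstance(x, ToySparseChunk)]
    if len(set(fill_values)) > 1:
        raise ValueError("Cannot concatenate SparseArrays with different "
                         "fill values")
    fill_value = fill_values[0]
    # Wrap dense inputs so every piece shares one type and one fill value.
    wrapped = [x if isinstance(x, ToySparseChunk)
               else ToySparseChunk(x, fill_value)
               for x in to_concat]
    merged = [v for chunk in wrapped for v in chunk.values]
    return ToySparseChunk(merged, fill_value)


result = toy_concat_sparse(
    [ToySparseChunk([0, 1], 0), [2, 0], ToySparseChunk([3], 0)])

try:
    toy_concat_sparse([ToySparseChunk([0, 1], 0), ToySparseChunk([1, 2], 1)])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```

Converting dense inputs up front means the final step only ever concatenates pieces of the same type, which is what lets the real function delegate to ``SparseArray._concat_same_type``.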

pandas/core/dtypes/missing.py

+13
@@ -499,6 +499,19 @@ def na_value_for_dtype(dtype, compat=True):
     Returns
     -------
     np.dtype or a pandas dtype
+
+    Examples
+    --------
+    >>> na_value_for_dtype(np.dtype('int64'))
+    0
+    >>> na_value_for_dtype(np.dtype('int64'), compat=False)
+    nan
+    >>> na_value_for_dtype(np.dtype('float64'))
+    nan
+    >>> na_value_for_dtype(np.dtype('bool'))
+    False
+    >>> na_value_for_dtype(np.dtype('datetime64[ns]'))
+    NaT
     """
     dtype = pandas_dtype(dtype)
504517

pandas/core/internals/__init__.py

+1 -1
@@ -5,7 +5,7 @@
     make_block, # io.pytables, io.packers
     FloatBlock, IntBlock, ComplexBlock, BoolBlock, ObjectBlock,
     TimeDeltaBlock, DatetimeBlock, DatetimeTZBlock,
-    CategoricalBlock, ExtensionBlock, SparseBlock, ScalarBlock,
+    CategoricalBlock, ExtensionBlock, ScalarBlock,
     Block)
 from .managers import ( # noqa:F401
     BlockManager, SingleBlockManager,
