
Commit 341585a

ENH: Sparse dtypes
1 parent 10bf721 commit 341585a

18 files changed, +696 -176 lines changed

doc/source/sparse.rst

+55
@@ -132,6 +132,61 @@ keeps an arrays of all of the locations where the data are not equal to the
 fill value. The ``block`` format tracks only the locations and sizes of blocks
 of data.
 
+.. _sparse.dtype:
+
+Sparse Dtypes
+-------------
+
+Sparse data should have the same dtype as its dense representation. Currently,
+``float64``, ``int64`` and ``bool`` dtypes are supported. The default
+``fill_value`` depends on the original dtype:
+
+- ``float64``: ``np.nan``
+- ``int64``: ``0``
+- ``bool``: ``False``
+
+.. ipython:: python
+
+   s = pd.Series([1, np.nan, np.nan])
+   s
+   s.to_sparse()
+
+   s = pd.Series([1, 0, 0])
+   s
+   s.to_sparse()
+
+   s = pd.Series([True, False, True])
+   s
+   s.to_sparse()
+
+You can change the dtype using ``.astype()``; the result is also sparse. Note that
+``.astype()`` also converts the ``fill_value`` so that the dense representation is preserved.
+
+
+.. ipython:: python
+
+   s = pd.Series([1, 0, 0, 0, 0])
+   s
+   ss = s.to_sparse()
+   ss
+   ss.astype(np.float64)
+
+``.astype()`` raises if any value, including the ``fill_value``, cannot be coerced to the specified dtype.
+
+.. code-block:: ipython
+
+   In [1]: ss = pd.Series([1, np.nan, np.nan]).to_sparse()
+   0    1.0
+   1    NaN
+   2    NaN
+   dtype: float64
+   BlockIndex
+   Block locations: array([0], dtype=int32)
+   Block lengths: array([1], dtype=int32)
+
+   In [2]: ss.astype(np.int64)
+   ValueError: unable to coerce current fill_value nan to int64 dtype
+
 .. _sparse.calculation:
 
 Sparse Calculation
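
The default ``fill_value`` rules and the ``.astype()`` failure documented in the hunk above can be sketched in plain Python. This is an illustration only, not the pandas implementation: the names ``DEFAULT_FILL_VALUE``, ``PYTHON_TYPE`` and ``coerce_fill_value`` are hypothetical, and dtypes are represented by their string names.

```python
import math

# Hypothetical table mirroring the documented defaults (not pandas internals).
DEFAULT_FILL_VALUE = {"float64": float("nan"), "int64": 0, "bool": False}

# Hypothetical mapping from dtype name to a Python type used for casting.
PYTHON_TYPE = {"float64": float, "int64": int, "bool": bool}


def coerce_fill_value(fill_value, dtype_name):
    """Cast fill_value to the target dtype, raising if it cannot be represented."""
    is_nan = isinstance(fill_value, float) and math.isnan(fill_value)
    if is_nan and dtype_name != "float64":
        # NaN has no integer or boolean equivalent, so refuse to convert it.
        raise ValueError(
            "unable to coerce current fill_value nan to %s dtype" % dtype_name
        )
    return PYTHON_TYPE[dtype_name](fill_value)


# An integer fill value converts cleanly to float ...
print(coerce_fill_value(DEFAULT_FILL_VALUE["int64"], "float64"))

# ... but the float default (NaN) cannot become an integer.
try:
    coerce_fill_value(DEFAULT_FILL_VALUE["float64"], "int64")
except ValueError as exc:
    print(exc)
```

The explicit NaN check is needed because plain ``bool(float("nan"))`` would silently return ``True`` instead of failing.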

doc/source/whatsnew/v0.19.0.txt

+54 -1
@@ -17,6 +17,7 @@ Highlights include:
 - ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>`
 - pandas development api, see :ref:`here <whatsnew_0190.dev_api>`
 - ``PeriodIndex`` now has its own ``period`` dtype, and changed to be more consistent with other ``Index`` classes. See :ref:`here <whatsnew_0190.api.period>`
+- Sparse data structures gained enhanced support for ``int`` and ``bool`` dtypes, see :ref:`here <whatsnew_0190.sparse>`
 
 .. contents:: What's new in v0.19.0
     :local:
@@ -975,6 +976,51 @@ Sparse Changes
 
 These changes allow pandas to handle sparse data with more dtypes and make for a smoother data-handling experience.
 
+
+``int64`` and ``bool`` support enhancements
+"""""""""""""""""""""""""""""""""""""""""""
+
+Sparse data structures gained enhanced support for ``int64`` and ``bool`` dtypes (:issue:`667`, :issue:`13849`)
+
+Previously, sparse data were ``float64`` dtype by default, even if all inputs were ``int`` or ``bool`` dtype. You had to specify ``dtype`` explicitly to create sparse data with ``int64`` dtype. Also, ``fill_value`` had to be specified explicitly because its default was ``np.nan``, which doesn't appear in ``int64`` or ``bool`` data.
+
+.. code-block:: ipython
+
+   In [1]: pd.SparseArray([1, 2, 0, 0])
+   Out[1]:
+   [1.0, 2.0, 0.0, 0.0]
+   Fill: nan
+   IntIndex
+   Indices: array([0, 1, 2, 3], dtype=int32)
+
+   # specifying int64 dtype, but all values are stored in sp_values because
+   # fill_value default is np.nan
+   In [2]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64)
+   Out[2]:
+   [1, 2, 0, 0]
+   Fill: nan
+   IntIndex
+   Indices: array([0, 1, 2, 3], dtype=int32)
+
+   In [3]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64, fill_value=0)
+   Out[3]:
+   [1, 2, 0, 0]
+   Fill: 0
+   IntIndex
+   Indices: array([0, 1], dtype=int32)
+
+As of v0.19.0, sparse data keeps the input dtype and uses a more appropriate default ``fill_value`` (``0`` for ``int64`` dtype, ``False`` for ``bool`` dtype).
+
+.. ipython:: python
+
+   pd.SparseArray([1, 2, 0, 0], dtype=np.int64)
+   pd.SparseArray([True, False, False, False])
+
+See the :ref:`docs <sparse.dtype>` for more details.
+
+Operators now preserve dtypes
+"""""""""""""""""""""""""""""
+
 - Sparse data structures can now preserve ``dtype`` after arithmetic ops (:issue:`13848`)
 
 .. ipython:: python
@@ -1001,6 +1047,9 @@ Note that the limitation is applied to ``fill_value`` which default is ``np.nan`
    Out[7]:
    ValueError: unable to coerce current fill_value nan to int64 dtype
 
+Other sparse fixes
+""""""""""""""""""
+
 - Subclassed ``SparseDataFrame`` and ``SparseSeries`` now preserve class types when slicing or transposing. (:issue:`13787`)
 - ``SparseArray`` with ``bool`` dtype now supports logical (bool) operators (:issue:`14000`)
 - Bug in ``SparseSeries`` with ``MultiIndex`` ``[]`` indexing may raise ``IndexError`` (:issue:`13144`)
@@ -1011,6 +1060,11 @@ Note that the limitation is applied to ``fill_value`` which default is ``np.nan`
 - Bug in ``SparseArray`` and ``SparseSeries`` don't apply ufunc to ``fill_value`` (:issue:`13853`)
 - Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`)
 - Bug in single row slicing on multi-type ``SparseDataFrame``s, types were previously forced to float (:issue:`13917`)
+- Bug in ``SparseSeries`` slicing changes integer dtype to float (:issue:`8292`)
+- Bug in ``SparseDataFrame`` comparison ops may raise ``TypeError`` (:issue:`13001`)
+- Bug in ``SparseDataFrame.isnull`` raises ``ValueError`` (:issue:`8276`)
+- Bug in ``SparseSeries`` representation with ``bool`` dtype may raise ``IndexError`` (:issue:`13110`)
+- Bug in ``SparseSeries`` and ``SparseDataFrame`` of ``bool`` or ``int64`` dtype may display its values like ``float64`` dtype (:issue:`13110`)
 - Bug in sparse indexing using ``SparseArray`` with ``bool`` dtype may return incorrect result (:issue:`13985`)
 - Bug in ``SparseArray`` created from ``SparseSeries`` may lose ``dtype`` (:issue:`13999`)
 - Bug in ``SparseSeries`` comparison with dense returns normal ``Series`` rather than ``SparseSeries`` (:issue:`13999`)
@@ -1053,7 +1107,6 @@ New behaviour:
    In [2]: i.get_indexer(['b', 'b', 'c']).dtype
    Out[2]: dtype('int64')
 
-
 .. _whatsnew_0190.deprecations:
 
 Deprecations
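
Why the default ``fill_value`` matters for integer data, as described in the "``int64`` and ``bool`` support enhancements" notes above, can be sketched with a toy encoder. This is an assumption-laden illustration in plain Python, not pandas' ``SparseArray``; ``sparse_encode`` is a hypothetical helper that stores only the entries differing from the fill value.

```python
import math


def sparse_encode(values, fill_value):
    """Return (indices, sp_values) for the entries that differ from fill_value."""
    fill_is_nan = isinstance(fill_value, float) and math.isnan(fill_value)

    def is_fill(v):
        if fill_is_nan:
            # NaN never compares equal to itself, so test for it explicitly.
            return isinstance(v, float) and math.isnan(v)
        return v == fill_value

    indices = [i for i, v in enumerate(values) if not is_fill(v)]
    return indices, [values[i] for i in indices]


data = [1, 2, 0, 0]

# With a NaN fill value nothing in integer data matches, so every entry is stored.
nan_indices, nan_values = sparse_encode(data, float("nan"))
print(nan_indices)  # [0, 1, 2, 3]

# With fill_value=0 only the two non-zero entries are stored.
zero_indices, zero_values = sparse_encode(data, 0)
print(zero_indices)  # [0, 1]
```

The two index lists correspond to the ``Indices`` arrays shown in ``Out[2]`` and ``Out[3]`` of the release notes.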

pandas/core/generic.py

+9 -4
@@ -3779,24 +3779,29 @@ def asof(self, where, subset=None):
     # ----------------------------------------------------------------------
     # Action Methods
 
-    def isnull(self):
-        """
+    _shared_docs['isnull'] = """
         Return a boolean same-sized object indicating if the values are null.
 
         See Also
         --------
         notnull : boolean inverse of isnull
         """
+
+    @Appender(_shared_docs['isnull'])
+    def isnull(self):
         return isnull(self).__finalize__(self)
 
-    def notnull(self):
-        """Return a boolean same-sized object indicating if the values are
+    _shared_docs['isnotnull'] = """
+        Return a boolean same-sized object indicating if the values are
         not null.
 
         See Also
         --------
         isnull : boolean inverse of notnull
         """
+
+    @Appender(_shared_docs['isnotnull'])
+    def notnull(self):
         return notnull(self).__finalize__(self)
 
     def clip(self, lower=None, upper=None, axis=None, *args, **kwargs):
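
The hunk above moves the ``isnull`` and ``notnull`` docstrings into ``_shared_docs`` and attaches them with ``@Appender``, presumably so other classes can reuse the same text. A minimal sketch of that pattern follows; ``append_doc`` and the two classes are hypothetical stand-ins, not pandas' ``Appender`` or its data structures.

```python
# Registry of docstrings shared between classes (stand-in for _shared_docs).
shared_docs = {}

shared_docs["isnull"] = """
Return a boolean same-sized object indicating if the values are null.
"""


def append_doc(addendum):
    """Decorator factory: append `addendum` to the wrapped function's docstring."""

    def decorator(func):
        func.__doc__ = (func.__doc__ or "") + addendum
        return func

    return decorator


class DenseThing:
    @append_doc(shared_docs["isnull"])
    def isnull(self):
        return "dense result"


class SparseThing:
    # A second class reuses the same text without duplicating it.
    @append_doc(shared_docs["isnull"])
    def isnull(self):
        return "sparse result"


print(DenseThing.isnull.__doc__ == SparseThing.isnull.__doc__)
```

Keeping the text in one dictionary means a wording fix is made once and shows up in every decorated method.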

pandas/core/internals.py

-3
@@ -2478,9 +2478,6 @@ def fill_value(self):
 
     @fill_value.setter
     def fill_value(self, v):
-        # we may need to upcast our fill to match our dtype
-        if issubclass(self.dtype.type, np.floating):
-            v = float(v)
         self.values.fill_value = v
 
     def to_dense(self):

pandas/formats/format.py

+3
@@ -21,6 +21,7 @@
                                  is_numeric_dtype,
                                  is_datetime64_dtype,
                                  is_timedelta64_dtype)
+from pandas.types.generic import ABCSparseArray
 
 from pandas.core.base import PandasObject
 from pandas.core.index import Index, MultiIndex, _ensure_index
@@ -1966,6 +1967,8 @@ def _format(x):
         vals = self.values
         if isinstance(vals, Index):
             vals = vals._values
+        elif isinstance(vals, ABCSparseArray):
+            vals = vals.values
 
         is_float_type = lib.map_infer(vals, is_float) & notnull(vals)
         leading_space = is_float_type.any()

pandas/io/tests/test_pickle.py

+7
@@ -163,6 +163,13 @@ def compare_index_period(self, result, expected, typ, version):
         tm.assert_equal(result.freqstr, 'M')
         tm.assert_index_equal(result.shift(2), expected.shift(2))
 
+    def compare_sp_frame_float(self, result, expected, typ, version):
+        if LooseVersion(version) <= '0.18.1':
+            tm.assert_sp_frame_equal(result, expected, exact_indices=False,
+                                     check_dtype=False)
+        else:
+            tm.assert_sp_frame_equal(result, expected)
+
     def read_pickles(self, version):
         if not is_platform_little_endian():
             raise nose.SkipTest("known failure on non-little endian")
