Behaviour of Categorical inputs to sparse data structures #19278

jnothman · 2018-01-17T07:32:46Z

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> c = pd.Categorical(list('abcabc'))
>>> c
[a, b, c, a, b, c]
Categories (3, object): [a, b, c]
>>> pd.Series(c).dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
>>> pd.Series(c).to_sparse().dtype
dtype('O')
>>> pd.SparseArray(c)
[a, b, c, a, b, c]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3, 4, 5], dtype=int32)

>>> pd.SparseArray(c).dtype
dtype('O')
>>> pd.SparseSeries(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/joel/anaconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/sparse/series.py", line 175, in __init__
    length = len(index)
TypeError: object of type 'NoneType' has no len()
>>> pd.DataFrame({'a': c})['a']
0    a
1    b
2    c
3    a
4    b
5    c
Name: a, dtype: category
Categories (3, object): [a, b, c]
>>> pd.SparseDataFrame({'a': c})['a']
0    a
1    b
2    c
3    a
4    b
5    c
Name: a, dtype: object
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([6], dtype=int32)

Problem description

Categoricals are upcast to object dtype when put into SparseArray and SparseDataFrame (or when calling Series.to_sparse()). This is inconsistent with the categorical dtype retained by dense Series and DataFrame.
SparseSeries raises an error when constructed with a categorical argument. This is inconsistent with the SparseArray and SparseDataFrame behaviour.
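
The dense half of this contrast can be checked without any of the sparse classes. The following is a minimal sketch, not a reproduction of the sparse behaviour: it shows that Series and DataFrame keep the categorical dtype, and that converting the same Categorical to a plain NumPy array yields object dtype. Treating that conversion as the source of the object dtype seen above is an assumption, not something this report establishes about the sparse constructors.

```python
import numpy as np
import pandas as pd

c = pd.Categorical(list("abcabc"))

# Dense containers keep the categorical dtype.
dense_dtype = pd.Series(c).dtype
frame_dtype = pd.DataFrame({"a": c})["a"].dtype
print(isinstance(dense_dtype, pd.CategoricalDtype))
print(dense_dtype == frame_dtype)

# Converting to a plain NumPy array drops the categorical dtype.
# Assumption: a similar conversion is behind the object dtype reported above.
as_numpy = np.asarray(c)
print(as_numpy.dtype)
```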

Expected Output

SparseDataFrame({'a': c})['a'].dtype == SparseSeries(c).dtype == SparseArray(c).dtype == Series(c).dtype

or at a minimum:

SparseSeries(c) raises no error, and produces object dtype.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0+unknown
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

jreback · 2018-01-17T12:40:04Z

Why would you actually want to do this? Using object dtypes in sparse has very little utility and is barely supported.

jnothman · 2018-01-17T12:48:00Z

Okay, so make it invalid. I'll admit I don't have a use case for it; it just came up when looking into the implementation of unstack and how to make that sparse. I still think the SparseSeries error is inappropriate.

jreback · 2018-01-19T22:12:19Z

would appreciate a PR

cr458 · 2018-01-20T13:01:51Z

@jnothman am happy to have a look at this if it is needed. Am having a little trouble understanding what you mean by 'make it invalid', though.

LEO-E-100 · 2018-01-23T13:51:06Z

I am currently working on this issue and am thinking that the solution is to add a check to the SparseSeries which checks if the data is a CategoricalDtype:

elif isinstance(data, CategoricalDtype):
    if dtype is not None:
        data = data.astype(dtype)
    if index is None:
        index = data.index.view()
    else:
        data = data.reindex(index, copy=False)

However the boolean isinstance(c, CategoricalDtype) returns false even if c.dtype returns CategoricalDtype. I suspect I am missing something important here but I cannot find how to make this boolean true on a Categorical datatype.

For reference, this elif block would be added at around line 174 of pandas/core/sparse/series.py.
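
As a standalone illustration of the steps in that branch, here is a minimal sketch written as a free function rather than as part of the SparseSeries constructor. The name `prepare_categorical` is hypothetical and not part of pandas. It tests the dtype of the data instead of the data object itself, and it wraps a bare Categorical in a Series first, since a Categorical has no index of its own.

```python
import pandas as pd


def prepare_categorical(data, index=None, dtype=None):
    """Hypothetical helper (not pandas API): normalise categorical input.

    Returns a (Series, index) pair, mirroring the steps of the elif branch.
    """
    # A bare Categorical has no index, so wrap it in a Series first.
    if isinstance(data, pd.Categorical):
        data = pd.Series(data)
    # Test the dtype of the data, not the data object itself.
    if not isinstance(getattr(data, "dtype", None), pd.CategoricalDtype):
        raise TypeError("expected categorical data")
    if dtype is not None:
        data = data.astype(dtype)
    if index is None:
        index = data.index
    else:
        data = data.reindex(index)
    return data, index


c = pd.Categorical(list("abcabc"))
series, idx = prepare_categorical(c)
print(series.dtype, len(idx))

# Passing an index selects and reorders rows via reindex.
subset, _ = prepare_categorical(pd.Series(c), index=[0, 1, 2])
print(list(subset))
```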

TomAugspurger · 2018-01-23T16:06:52Z

> However the boolean isinstance(c, CategoricalDtype) returns false even if c.dtype returns CategoricalDtype.

CategoricalDtype is the class of array.dtype for a categorical array. You could use

from pandas.api.types import is_categorical_dtype

if is_categorical_dtype(data):
    ...
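
To make the distinction concrete, here is a small sketch of the three checks side by side. It uses `isinstance` on the `.dtype` attribute rather than `is_categorical_dtype`; that the two agree for these inputs is an assumption of the sketch.

```python
import pandas as pd

c = pd.Categorical(list("abcabc"))

# The array itself is a Categorical, not a CategoricalDtype.
array_is_dtype = isinstance(c, pd.CategoricalDtype)

# Its .dtype attribute is the CategoricalDtype instance.
dtype_is_dtype = isinstance(c.dtype, pd.CategoricalDtype)

# The same dtype check works for a Series wrapping the Categorical.
series_is_categorical = isinstance(pd.Series(c).dtype, pd.CategoricalDtype)

print(array_is_dtype, dtype_is_dtype, series_is_categorical)
# prints: False True True
```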

OmerJog · 2019-01-23T12:37:23Z

@LEO-E-100 are you still working on this issue?

TomAugspurger · 2019-09-17T21:02:45Z

We can re-purpose this issue to be for allowing SparseArray[ExtensionDtype, fill_value]. It's not exactly straightforward though.
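
For context, a sparse dtype pairs a subtype with a fill value. The sketch below builds a SparseArray over the integer codes of a Categorical and rebuilds the Categorical afterwards. This is only a workaround under the assumption that storing codes is acceptable; it is not the SparseArray[ExtensionDtype, fill_value] support described above, and nobody in this thread proposed it.

```python
import numpy as np
import pandas as pd

c = pd.Categorical(list("aaabaa"))

# Codes are small integers; here 0 ("a") is by far the most common value.
sparse_codes = pd.arrays.SparseArray(c.codes, fill_value=0)

# The sparse dtype is a (NumPy subtype, fill_value) pair.
print(sparse_codes.dtype.subtype, sparse_codes.dtype.fill_value)

# Only the codes that differ from the fill value are stored explicitly.
print(len(sparse_codes.sp_values))

# Rebuild the Categorical from the densified codes and the original categories.
roundtrip = pd.Categorical.from_codes(
    np.asarray(sparse_codes), categories=c.categories
)
print(list(roundtrip) == list(c))
```

The categories live outside the sparse array here, so any code that consumes `sparse_codes` has to carry `c.categories` alongside it.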

jreback added Error Reporting Incorrect or improved errors from pandas Sparse Sparse Data Type labels Jan 19, 2018

jreback added this to the Next Major Release milestone Jan 19, 2018

jreback added Effort Low good first issue labels Jan 19, 2018

jnothman mentioned this issue Feb 14, 2018

ENH: Add sparse option in pivot_table / pivot / unstack #14493

Open

xchoudhury mentioned this issue Apr 10, 2018

added check if categoricalDtype for issue #19278 #20644

Closed

4 tasks

jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 23, 2019

TomAugspurger mentioned this issue Sep 16, 2019

Remove SparseSeries and SparseDataFrame #28425

Merged

jbrockmendel removed the Effort Low label Oct 21, 2019

mroeschke added Enhancement and removed good first issue Error Reporting Incorrect or improved errors from pandas labels Apr 20, 2020

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
