Behaviour of Categorical inputs to sparse data structures #19278

Open
jnothman opened this issue Jan 17, 2018 · 8 comments
Labels
Constructors (Series/DataFrame/Index/pd.array Constructors), Enhancement, Sparse (Sparse Data Type)

Comments

@jnothman
Contributor

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> c = pd.Categorical(list('abcabc'))
>>> c
[a, b, c, a, b, c]
Categories (3, object): [a, b, c]
>>> pd.Series(c).dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
>>> pd.Series(c).to_sparse().dtype
dtype('O')
>>> pd.SparseArray(c)
[a, b, c, a, b, c]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3, 4, 5], dtype=int32)

>>> pd.SparseArray(c).dtype
dtype('O')
>>> pd.SparseSeries(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/joel/anaconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/sparse/series.py", line 175, in __init__
    length = len(index)
TypeError: object of type 'NoneType' has no len()
>>> pd.DataFrame({'a': c})['a']
0    a
1    b
2    c
3    a
4    b
5    c
Name: a, dtype: category
Categories (3, object): [a, b, c]
>>> pd.SparseDataFrame({'a': c})['a']
0    a
1    b
2    c
3    a
4    b
5    c
Name: a, dtype: object
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([6], dtype=int32)

Problem description

  • Categoricals are upcast to object dtype when put into SparseArray and SparseDataFrame (or when calling Series.to_sparse()). This is inconsistent with the categorical dtype retained by dense Series and DataFrame.
  • SparseSeries raises an error when constructed with a categorical argument. This is inconsistent with the SparseArray and SparseDataFrame behaviour.

Expected Output

SparseDataFrame({'a': c})['a'].dtype == SparseSeries(c).dtype == SparseArray(c).dtype == Series(c).dtype

or at a minimum:

SparseSeries(c) raises no error, and produces object dtype.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0+unknown
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Jan 17, 2018

Why would you actually want to do this? Using object dtypes in sparse has very little utility and is barely supported.

@jnothman
Contributor Author

jnothman commented Jan 17, 2018 via email

jreback added the Error Reporting (Incorrect or improved errors from pandas) and Sparse (Sparse Data Type) labels on Jan 19, 2018
jreback added this to the Next Major Release milestone on Jan 19, 2018
@jreback
Contributor

jreback commented Jan 19, 2018

would appreciate a PR

@cr458

cr458 commented Jan 20, 2018

@jnothman I am happy to have a look at this if it is needed. I am having a little trouble understanding what you mean by 'make it invalid', though.

@LEO-E-100

I am currently working on this issue and am thinking that the solution is to add a check to the SparseSeries which checks if the data is a CategoricalDtype:

elif isinstance(data, CategoricalDtype):
    if dtype is not None:
        data = data.astype(dtype)
    if index is None:
        index = data.index.view()
    else:
        data = data.reindex(index, copy=False)

However the boolean isinstance(c, CategoricalDtype) returns false even if c.dtype returns CategoricalDtype. I suspect I am missing something important here but I cannot find how to make this boolean true on a Categorical datatype.

For reference, this elif block would be added at around line 174 of pandas/core/sparse/series.py.

@TomAugspurger
Contributor

> However the boolean isinstance(c, CategoricalDtype) returns false even if c.dtype returns CategoricalDtype.

CategoricalDtype is the class of array.dtype for a categorical array. You could use

if is_categorical_dtype(data):
   ...
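
A minimal sketch of the distinction described above, assuming a pandas version where `CategoricalDtype` can be imported from `pandas.api.types`. The variable names are illustrative only. It checks the array's `.dtype` with `isinstance` rather than calling `is_categorical_dtype`, since that helper has been deprecated in later pandas releases.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

c = pd.Categorical(list("abcabc"))

# The array itself is a Categorical, not a dtype object,
# so this check is False.
array_is_dtype = isinstance(c, CategoricalDtype)

# The array's .dtype attribute is the CategoricalDtype instance,
# so this check is True.
dtype_is_categorical = isinstance(c.dtype, CategoricalDtype)

print(array_is_dtype)        # False
print(dtype_is_categorical)  # True
```

The same `.dtype` check also works when `data` is a Series holding categorical values, because the Series exposes the dtype of its underlying array.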

@OmerJog

OmerJog commented Jan 23, 2019

@LEO-E-100 are you still working on this issue?

jbrockmendel added the Constructors (Series/DataFrame/Index/pd.array Constructors) label on Jul 23, 2019
@TomAugspurger
Contributor

We can re-purpose this issue to be for allowing SparseArray[ExtensionDtype, fill_value]. It's not exactly straightforward though.
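
Until a sparse array can wrap an extension dtype directly, one possible workaround is to store only the categorical's integer codes sparsely and keep the categories alongside them. The sketch below is an assumption for illustration, not something proposed in this thread and not how pandas implements anything. It assumes a pandas version that provides `pd.arrays.SparseArray`, and it picks code 0 as the fill value arbitrarily.

```python
import numpy as np
import pandas as pd

c = pd.Categorical(list("aaabaa"))

# Store the integer codes sparsely. Code 0 (category 'a') is chosen as the
# fill value here, so only the single 'b' entry is stored explicitly.
sparse_codes = pd.arrays.SparseArray(c.codes, fill_value=0)

# Rebuild the Categorical from the densified codes and the saved categories.
restored = pd.Categorical.from_codes(
    np.asarray(sparse_codes), categories=c.categories
)

print(sparse_codes.npoints)      # number of explicitly stored values
print(list(restored) == list(c))
```

This keeps the categories but loses the categorical dtype on the sparse container itself, which is the gap that SparseArray[ExtensionDtype, fill_value] would close.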

mroeschke added the Enhancement label and removed the good first issue and Error Reporting (Incorrect or improved errors from pandas) labels on Apr 20, 2020
mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022