Index astype('category') does not return a CategoricalIndex #18630

Closed
nmusolino opened this issue Dec 4, 2017 · 5 comments · Fixed by #18677
Labels
API Design Categorical Categorical Data Type
Comments

@nmusolino
Contributor

nmusolino commented Dec 4, 2017

Code Sample, a copy-pastable example if possible

In [1]: import pandas

In [2]: idx = pandas.Index(['a', 'b', 'c'])

In [3]: idx
Out[3]: Index(['a', 'b', 'c'], dtype='object')

In [4]: idx.astype('category')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-b8d97d97d03f> in <module>()
----> 1 idx.astype('category')

C:\...\pandas\indexes\base.py in astype(self, dtype, copy)
    889     @Appender(_index_shared_docs['astype'])
    890     def astype(self, dtype, copy=True):
--> 891         return Index(self.values.astype(dtype, copy=copy), name=self.name,
    892                      dtype=dtype)
    893

TypeError: data type "category" not understood

Problem description

The documentation for this method reads:

Create an Index with values cast to dtypes. The class of a new Index is determined by dtype.

Since there is a CategoricalIndex type, it is reasonable for a user to expect that .astype('category') would return a CategoricalIndex object.

As a workaround for the issue, users can construct a CategoricalIndex directly:

In [7]: pandas.CategoricalIndex(idx)
Out[7]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

Expected Output

The method should return a CategoricalIndex equal to the following:

In [5]: pandas.CategoricalIndex(['a', 'b', 'c'])
Out[5]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')
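
The expectation above can be written as a small check. This is only a sketch: it assumes pandas is importable as `pd`, and the names `expected` and `result` are introduced here for illustration. On the pandas 0.19.1 build reported in this issue the `astype` call raises the `TypeError` shown earlier, so the sketch catches that case rather than assuming a fixed release.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])
# What the issue asks astype('category') to return
expected = pd.CategoricalIndex(['a', 'b', 'c'])

try:
    result = idx.astype('category')
except TypeError as exc:
    # Behaviour reported in this issue (pandas 0.19.1)
    result = None
    print("astype('category') failed:", exc)
else:
    # Report the class returned and whether it matches the expectation
    print(type(result).__name__, result.equals(expected))
```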

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: 1.5.0
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.3
html5lib: 0.999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

@TomAugspurger
Contributor

That seems reasonable. We would also want to accept CategoricalDtype there.

Are you able to submit a pull request?
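
A rough illustration of what accepting a `CategoricalDtype` could look like from the caller's side. This is a sketch under assumptions, not the change made in the fixing pull request: it assumes a pandas release where `Index.astype` accepts a `CategoricalDtype` instance, and the category order and `ordered=True` flag are chosen here purely for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

idx = pd.Index(['a', 'b', 'c'])

# Illustrative dtype: explicit category order, marked as ordered
dtype = CategoricalDtype(categories=['c', 'b', 'a'], ordered=True)

converted = idx.astype(dtype)
print(type(converted).__name__)
print(list(converted.categories), converted.ordered)
```

Passing a dtype instance rather than the string 'category' lets the caller control the categories and their ordering instead of having them inferred from the values.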

@TomAugspurger TomAugspurger added API Design Categorical Categorical Data Type labels Dec 4, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Dec 4, 2017
@jreback
Contributor

jreback commented Dec 5, 2017

Note that this should be tested for every Index type's .astype('category').
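
One possible shape for such a check, sketched as a plain loop rather than the project's actual test suite. The dictionary keys and sample values are made up for illustration, and the sketch assumes a pandas release in which the fix is present.

```python
import pandas as pd

# One small sample per index flavour (illustrative values only)
indexes = {
    "object": pd.Index(['a', 'b', 'c']),
    "int": pd.Index([1, 2, 3]),
    "float": pd.Index([1.5, 2.5, 3.5]),
    "range": pd.RangeIndex(3),
    "datetime": pd.date_range('2017-12-04', periods=3),
    "timedelta": pd.timedelta_range('1 day', periods=3),
    "period": pd.period_range('2017-12', periods=3, freq='M'),
    "interval": pd.interval_range(start=0, end=3),
}

# Record the class returned by astype('category') for each flavour
results = {
    name: type(index.astype('category')).__name__
    for name, index in indexes.items()
}

for name, cls_name in results.items():
    print(name, '->', cls_name)
```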

@jschendel
Member

jschendel commented Dec 7, 2017

A couple of questions:

  1. It looks like IntervalIndex.astype('category') already has some logic intentionally written to return a Categorical, not a CategoricalIndex. Should this be changed for consistency with the other types of index? Or was there a specific reason it was implemented this way? I don't immediately see a reason why we shouldn't return a CategoricalIndex. (see here for code)

  2. Should MultiIndex.astype('category') return categories consisting of tuples? Or should this not be supported for MultiIndex?
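
To see how a particular pandas installation currently handles the MultiIndex case from the second question, one option is simply to try the call and record what happens. The sketch below does not assume an outcome; the variable `outcome` and the sample tuples are illustrative, and the result may differ between pandas versions.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([('a', 1), ('b', 2)])

try:
    # If the conversion is supported, record the class that comes back
    outcome = type(mi.astype('category')).__name__
except (NotImplementedError, TypeError) as exc:
    # Otherwise record the error the library raised
    outcome = "{}: {}".format(type(exc).__name__, exc)

print(outcome)
```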

@TomAugspurger
Contributor

TomAugspurger commented Dec 7, 2017 via email

@jreback
Contributor

jreback commented Dec 7, 2017

  1. This could probably be changed. I wrote it this way because we needed to convert an IntervalIndex to categorical for indexing, but I don't fully remember whether I later discarded that need. This should return a CategoricalIndex instead.

@jreback jreback modified the milestones: Next Major Release, 0.22.0 Dec 7, 2017
4 participants