
Commit 1ee7a2a

support CategoricalIndex
raise KeyError when accessing invalid elements
setting elements not in the categories is equivalent to .append() (which coerces to an Index)
1 parent 8d2818e commit 1ee7a2a

14 files changed

+1094
-210
lines changed

.gitignore

+1-1
Original file line numberDiff line numberDiff line change
@@ -55,7 +55,7 @@ dist
5555
######################
5656
.directory
5757
.gdb_history
58-
.DS_Store?
58+
.DS_Store
5959
ehthumbs.db
6060
Icon?
6161
Thumbs.db

doc/source/advanced.rst

+70-2
@@ -594,7 +594,76 @@ faster than fancy indexing.
594594
timeit ser.ix[indexer]
595595
timeit ser.take(indexer)
596596

597-
.. _indexing.float64index:
597+
.. _indexing.categoricalindex:
598+
599+
CategoricalIndex
600+
----------------
601+
602+
.. versionadded:: 0.16.1
603+
604+
We introduce a ``CategoricalIndex``, a new type of index object that is useful for supporting
605+
indexing with duplicates. This is a container around a ``Categorical`` (introduced in v0.15.0)
606+
and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
607+
setting the index of a ``DataFrame/Series`` with a ``category`` dtype would convert this to a regular object-based ``Index``.
608+
609+
.. ipython:: python
610+
611+
df = DataFrame({'A' : np.arange(6),
612+
'B' : Series(list('aabbca')).astype('category',
613+
categories=list('cab'))
614+
})
615+
df
616+
df.dtypes
617+
df.B.cat.categories
618+
619+
Setting the index will create a ``CategoricalIndex``
620+
621+
.. ipython:: python
622+
623+
df2 = df.set_index('B')
624+
df2.index
625+
df2.index.categories
626+
627+
Indexing works similarly to an ``Index`` with duplicates
628+
629+
.. ipython:: python
630+
631+
df2.loc['a']
632+
633+
# and preserves the CategoricalIndex
634+
df2.loc['a'].index
635+
df2.loc['a'].index.categories
636+
637+
Sorting will order by the order of the categories
638+
639+
.. ipython:: python
640+
641+
df2.sort_index()
642+
643+
Groupby operations on the index will preserve the index nature as well
644+
645+
.. ipython:: python
646+
647+
df2.groupby(level=0).sum()
648+
df2.groupby(level=0).sum().index
649+
650+
.. warning::
651+
652+
Reshaping and comparison operations on a ``CategoricalIndex`` must have the same categories
653+
or a ``TypeError`` will be raised.
654+
655+
.. code-block:: python
656+
657+
In [10]: df3 = DataFrame({'A' : np.arange(6),
658+
'B' : Series(list('aabbca')).astype('category',
659+
categories=list('abc'))
660+
}).set_index('B')
661+
662+
In [11]: df3.index.categories
663+
Out[11]: Index([u'a', u'b', u'c'], dtype='object')
664+
665+
In [12]: pd.concat([df2, df3])
666+
TypeError: categories must match existing categories when appending
598667
599668
Float64Index
600669
------------
@@ -706,4 +775,3 @@ Of course if you need integer based selection, then use ``iloc``
706775
.. ipython:: python
707776
708777
dfir.iloc[0:5]
709-
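
The documentation added above builds its example with ``astype('category', categories=...)``, a call form from the pandas version this commit targets. The sketch below restates the same walkthrough using ``pd.CategoricalDtype``, on the assumption that a current pandas release is installed; the variable names are illustrative and not part of the commit.

```python
import numpy as np
import pandas as pd

# Category order is deliberately non-alphabetical: c, then a, then b.
cat_type = pd.CategoricalDtype(categories=list("cab"))
df = pd.DataFrame({
    "A": np.arange(6),
    "B": pd.Series(list("aabbca")).astype(cat_type),
})

# Setting a category-dtype column as the index yields a CategoricalIndex.
df2 = df.set_index("B")
is_categorical = isinstance(df2.index, pd.CategoricalIndex)

# Label lookup on a duplicated key returns every matching row.
rows_a = df2.loc["a", "A"].tolist()

# Sorting follows the category order, not the lexical order.
sorted_labels = df2.sort_index().index.tolist()

# Grouping on the index level keeps the categorical index.
grouped = df2.groupby(level=0, observed=False).sum()

# A label that is not among the categories raises KeyError.
try:
    df2.loc["z"]
    missing_raises = False
except KeyError:
    missing_raises = True
```

``observed=False`` is passed explicitly only to keep the grouping behaviour unambiguous across pandas versions.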

doc/source/api.rst

+20
@@ -1289,6 +1289,26 @@ Selecting
12891289
Index.slice_indexer
12901290
Index.slice_locs
12911291

1292+
.. _api.categoricalindex:
1293+
1294+
CategoricalIndex
1295+
----------------
1296+
1297+
.. autosummary::
1298+
:toctree: generated/
1299+
1300+
CategoricalIndex
1301+
1302+
Categorical Components
1303+
~~~~~~~~~~~~~~~~~~~~~~
1304+
1305+
.. autosummary::
1306+
:toctree: generated/
1307+
1308+
CategoricalIndex.codes
1309+
CategoricalIndex.categories
1310+
CategoricalIndex.ordered
1311+
12921312
.. _api.datetimeindex:
12931313

12941314
DatetimeIndex

doc/source/whatsnew/v0.16.1.txt

+40
@@ -7,6 +7,10 @@ This is a minor bug-fix release from 0.16.0 and includes a large number of
77
bug fixes along with several new features, enhancements, and performance improvements.
88
We recommend that all users upgrade to this version.
99

10+
Highlights include:
11+
12+
- Support for a ``CategoricalIndex``, a category based index, see :ref:`here <whatsnew_0161.enhancements.categoricalindex>`
13+
1014
.. contents:: What's new in v0.16.1
1115
:local:
1216
:backlinks: none
@@ -17,10 +21,46 @@ We recommend that all users upgrade to this version.
1721
Enhancements
1822
~~~~~~~~~~~~
1923

24+
.. _whatsnew_0161.enhancements.categoricalindex:
25+
26+
CategoricalIndex
27+
^^^^^^^^^^^^^^^^
28+
29+
We introduce a ``CategoricalIndex``, a new type of index object that is useful for supporting
30+
indexing with duplicates. This is a container around a ``Categorical`` (introduced in v0.15.0)
31+
and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
32+
setting the index of a ``DataFrame/Series`` with a ``category`` dtype would convert this to a regular object-based ``Index``.
33+
34+
.. ipython :: python
35+
36+
df = DataFrame({'A' : np.arange(6),
37+
'B' : Series(list('aabbca')).astype('category',
38+
categories=list('cab'))
39+
})
40+
df
41+
df.dtypes
42+
df.B.cat.categories
43+
44+
# setting the index will create a CategoricalIndex
45+
df2 = df.set_index('B')
46+
df2.index
47+
df2.index.categories
48+
49+
# indexing works similarly to an Index with duplicates
50+
df2.loc['a']
2051

52+
# and preserves the CategoricalIndex
53+
df2.loc['a'].index
54+
df2.loc['a'].index.categories
2155

56+
# sorting will order by the order of the categories
57+
df2.sort_index()
2258

59+
# groupby operations on the index will preserve the index nature as well
60+
df2.groupby(level=0).sum()
61+
df2.groupby(level=0).sum().index
2362

63+
See the :ref:`documentation <indexing.categoricalindex>` for more. (:issue:`7629`)
2464

2565
.. _whatsnew_0161.api:
2666

pandas/core/api.py

+1-1
@@ -8,7 +8,7 @@
88
from pandas.core.categorical import Categorical
99
from pandas.core.groupby import Grouper
1010
from pandas.core.format import set_eng_float_format
11-
from pandas.core.index import Index, Int64Index, Float64Index, MultiIndex
11+
from pandas.core.index import Index, CategoricalIndex, Int64Index, Float64Index, MultiIndex
1212

1313
from pandas.core.series import Series, TimeSeries
1414
from pandas.core.frame import DataFrame

pandas/core/categorical.py

+33-5
@@ -14,10 +14,11 @@
1414
import pandas.core.common as com
1515
from pandas.util.decorators import cache_readonly
1616

17-
from pandas.core.common import (CategoricalDtype, ABCSeries, isnull, notnull,
17+
from pandas.core.common import (CategoricalDtype, ABCSeries, ABCCategoricalIndex, isnull, notnull,
1818
is_categorical_dtype, is_integer_dtype, is_object_dtype,
1919
_possibly_infer_to_datetimelike, get_dtype_kinds,
2020
is_list_like, is_sequence, is_null_slice, is_bool,
21+
is_dtypes_equal,
2122
_ensure_platform_int, _ensure_object, _ensure_int64,
2223
_coerce_indexer_dtype, _values_from_object, take_1d)
2324
from pandas.util.terminal import get_terminal_size
@@ -79,7 +80,7 @@ def f(self, other):
7980

8081
def maybe_to_categorical(array):
8182
""" coerce to a categorical if a series is given """
82-
if isinstance(array, ABCSeries):
83+
if isinstance(array, (ABCSeries, ABCCategoricalIndex)):
8384
return array.values
8485
return array
8586

@@ -233,12 +234,17 @@ def __init__(self, values, categories=None, ordered=False, name=None, fastpath=F
233234
cat = values
234235
if isinstance(values, ABCSeries):
235236
cat = values.values
237+
if isinstance(values, ABCCategoricalIndex):
238+
ordered = values.ordered
239+
cat = values.values
240+
236241
if categories is None:
237242
categories = cat.categories
238243
values = values.__array__()
239244

240245
elif isinstance(values, Index):
241-
pass
246+
#values = np.array(values)
247+
ordered = True
242248

243249
else:
244250

@@ -302,11 +308,27 @@ def copy(self):
302308
return Categorical(values=self._codes.copy(),categories=self.categories,
303309
name=self.name, ordered=self.ordered, fastpath=True)
304310

311+
def astype(self, dtype):
312+
""" coerce this type to another dtype """
313+
if is_categorical_dtype(dtype):
314+
return self
315+
return np.array(self, dtype=dtype)
316+
305317
@cache_readonly
306318
def ndim(self):
307319
"""Number of dimensions of the Categorical """
308320
return self._codes.ndim
309321

322+
@cache_readonly
323+
def size(self):
324+
""" return the len of myself """
325+
return len(self)
326+
327+
@cache_readonly
328+
def itemsize(self):
329+
""" return the size of a single category """
330+
return self.categories.itemsize
331+
310332
def reshape(self, new_shape, **kwargs):
311333
""" compat with .reshape """
312334
return self
@@ -1596,14 +1618,20 @@ def _delegate_method(self, name, *args, **kwargs):
15961618
##### utility routines #####
15971619

15981620
def _get_codes_for_values(values, categories):
1599-
""""
1621+
"""
16001622
utility routine to turn values into codes given the specified categories
16011623
"""
16021624

16031625
from pandas.core.algorithms import _get_data_algo, _hashtables
1604-
if values.dtype != categories.dtype:
1626+
if not is_dtypes_equal(values.dtype,categories.dtype):
1627+
values = _ensure_object(values)
1628+
categories = _ensure_object(categories)
1629+
1630+
if is_object_dtype(values):
16051631
values = _ensure_object(values)
1632+
if is_object_dtype(categories):
16061633
categories = _ensure_object(categories)
1634+
16071635
(hash_klass, vec_klass), vals = _get_data_algo(values, _hashtables)
16081636
t = hash_klass(len(categories))
16091637
t.map_locations(_values_from_object(categories))
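
For readers unfamiliar with what ``_get_codes_for_values`` computes, here is a pure-Python sketch of the idea: build a lookup from each category to its position, then translate every value into that position. It is an illustration only. The real routine uses pandas' typed hash tables, and the helper name ``codes_for_values`` and the use of ``-1`` for values outside the categories are assumptions made for this sketch.

```python
def codes_for_values(values, categories):
    """Map each value to its position in ``categories``; -1 marks a miss."""
    # Build the lookup once: category -> integer position.
    positions = {category: i for i, category in enumerate(categories)}
    # Translate every value; anything not in the categories becomes -1.
    return [positions.get(value, -1) for value in values]


codes = codes_for_values(list("aabbca"), list("cab"))
codes_with_miss = codes_for_values(["a", "z"], list("cab"))
```

Storing small integer codes plus one copy of each category is what makes an index with many duplicated labels cheap to hold and compare.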

pandas/core/common.py

+18
@@ -72,6 +72,7 @@ def _check(cls, inst):
7272
ABCDatetimeIndex = create_pandas_abc_type("ABCDatetimeIndex", "_typ", ("datetimeindex",))
7373
ABCTimedeltaIndex = create_pandas_abc_type("ABCTimedeltaIndex", "_typ", ("timedeltaindex",))
7474
ABCPeriodIndex = create_pandas_abc_type("ABCPeriodIndex", "_typ", ("periodindex",))
75+
ABCCategoricalIndex = create_pandas_abc_type("ABCCategoricalIndex", "_typ", ("categoricalindex",))
7576
ABCSeries = create_pandas_abc_type("ABCSeries", "_typ", ("series",))
7677
ABCDataFrame = create_pandas_abc_type("ABCDataFrame", "_typ", ("dataframe",))
7778
ABCPanel = create_pandas_abc_type("ABCPanel", "_typ", ("panel",))
@@ -2438,9 +2439,26 @@ def _get_dtype_type(arr_or_dtype):
24382439
return np.dtype(arr_or_dtype).type
24392440
elif isinstance(arr_or_dtype, CategoricalDtype):
24402441
return CategoricalDtypeType
2442+
elif isinstance(arr_or_dtype, compat.string_types):
2443+
if is_categorical_dtype(arr_or_dtype):
2444+
return CategoricalDtypeType
2445+
return _get_dtype_type(np.dtype(arr_or_dtype))
24412446
return arr_or_dtype.dtype.type
24422447

24432448

2449+
def is_dtypes_equal(source, target):
2450+
""" return a boolean if the dtypes are equal """
2451+
source = _get_dtype_type(source)
2452+
target = _get_dtype_type(target)
2453+
2454+
try:
2455+
return source == target
2456+
except TypeError:
2457+
2458+
# invalid comparison
2459+
# object == category will hit this
2460+
return False
2461+
24442462
def is_any_int_dtype(arr_or_dtype):
24452463
tipo = _get_dtype_type(arr_or_dtype)
24462464
return issubclass(tipo, np.integer)
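
The ``is_dtypes_equal`` helper added in this commit is internal to ``pandas/core/common.py``. As a rough illustration of the comparison it performs, the sketch below uses ``pandas.api.types.is_dtype_equal``, the public function available in current pandas releases. Treating the two as equivalent is an assumption, since the names and implementations differ.

```python
import pandas as pd
from pandas.api.types import is_dtype_equal

# A dtype string and the matching Python type resolve to the same dtype.
same_int = is_dtype_equal("int", int)

# object vs. category is the case the commit's comment singles out:
# the comparison reports False rather than raising.
object_vs_category = is_dtype_equal(object, "category")

# A CategoricalDtype instance matches the "category" string alias.
category_vs_category = is_dtype_equal(pd.CategoricalDtype(), "category")
```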
