Commit 61bd9ca

support CategoricalIndex

raise KeyError when accessing invalid elements
setting elements not in the categories is equivalent to .append() (which coerces to an Index)
1 parent 161f38d commit 61bd9ca

18 files changed, +1950 -375 lines changed

doc/source/advanced.rst

+89 -1
@@ -594,6 +594,95 @@ faster than fancy indexing.
    timeit ser.ix[indexer]
    timeit ser.take(indexer)

+.. _indexing.categoricalindex:
+
+CategoricalIndex
+----------------
+
+.. versionadded:: 0.16.1
+
+We introduce a ``CategoricalIndex``, a new type of index object that is useful for supporting
+indexing with duplicates. This is a container around a ``Categorical`` (introduced in v0.15.0)
+and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
+setting the index of a ``DataFrame/Series`` with a ``category`` dtype would convert this to a regular object-based ``Index``.
+
+.. ipython:: python
+
+   df = DataFrame({'A' : np.arange(6),
+                   'B' : Series(list('aabbca')).astype('category',
+                                                        categories=list('cab'))
+                   })
+   df
+   df.dtypes
+   df.B.cat.categories
+
+Setting the index will create a ``CategoricalIndex``
+
+.. ipython:: python
+
+   df2 = df.set_index('B')
+   df2.index
+
+Indexing with ``__getitem__/.iloc/.loc/.ix`` works similarly to an ``Index`` with duplicates.
+The indexers MUST be in the category or the operation will raise.
+
+.. ipython:: python
+
+   df2.loc['a']
+
+These PRESERVE the ``CategoricalIndex``
+
+.. ipython:: python
+
+   df2.loc['a'].index
+
+Sorting will order by the order of the categories
+
+.. ipython:: python
+
+   df2.sort_index()
+
+Groupby operations on the index will preserve the index nature as well
+
+.. ipython:: python
+
+   df2.groupby(level=0).sum()
+   df2.groupby(level=0).sum().index
+
+Reindexing operations will return a resulting index based on the type of the passed
+indexer, meaning that passing a list will return a plain-old-``Index``; indexing with
+a ``Categorical`` will return a ``CategoricalIndex``, indexed according to the categories
+of the PASSED ``Categorical`` dtype. This allows one to arbitrarily index these even with
+values NOT in the categories, similarly to how you can reindex ANY pandas index.
+
+.. ipython :: python
+
+   df2.reindex(['a','e'])
+   df2.reindex(['a','e']).index
+   df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
+   df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
+
+.. warning::
+
+   Reshaping and comparison operations on a ``CategoricalIndex`` must have the same categories
+   or a ``TypeError`` will be raised.
+
+   .. code-block:: python
+
+      In [10]: df3 = DataFrame({'A' : np.arange(6),
+                                'B' : Series(list('aabbca')).astype('category',
+                                                                    categories=list('abc'))
+                               }).set_index('B')
+
+      In [11]: df3.index
+      Out[11]:
+      CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'],
+                       categories=[u'a', u'b', u'c'],
+                       ordered=False)
+
+      In [12]: pd.concat([df2,df3])
+      TypeError: categories must match existing categories when appending
+
 .. _indexing.float64index:

 Float64Index
@@ -706,4 +795,3 @@ Of course if you need integer based selection, then use ``iloc``
 .. ipython:: python

    dfir.iloc[0:5]
-
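
Note on the documentation added in this file: the examples use the 0.16.1-era ``astype('category', categories=...)`` spelling. The following is only a minimal sketch of the same indexing behaviour, assuming a recent pandas release where the categories are supplied through ``pd.CategoricalDtype`` (that spelling is an assumption here and is not part of this commit):

```python
import numpy as np
import pandas as pd

# Assumption: recent pandas, where categories are passed via CategoricalDtype
# rather than the astype('category', categories=...) form used in the diff.
cat_type = pd.CategoricalDtype(categories=list("cab"))
df = pd.DataFrame({"A": np.arange(6),
                   "B": pd.Series(list("aabbca")).astype(cat_type)})

# Setting a category-dtype column as the index yields a CategoricalIndex.
df2 = df.set_index("B")
is_categorical = isinstance(df2.index, pd.CategoricalIndex)

# Label-based selection keeps the index type.
selection_keeps_type = isinstance(df2.loc["a"].index, pd.CategoricalIndex)

# Sorting follows the category order (c, a, b), not alphabetical order.
sorted_labels = list(df2.sort_index().index)

# A label that is not one of the categories raises KeyError.
try:
    df2.loc["e"]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True
```

The ``KeyError`` branch corresponds to the commit message ("raise KeyError when accessing invalid elements"); exact error text may differ between versions.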

doc/source/api.rst

+28
@@ -1291,6 +1291,34 @@ Selecting
    Index.slice_indexer
    Index.slice_locs

+.. _api.categoricalindex:
+
+CategoricalIndex
+----------------
+
+.. autosummary::
+   :toctree: generated/
+
+   CategoricalIndex
+
+Categorical Components
+~~~~~~~~~~~~~~~~~~~~~~
+
+.. autosummary::
+   :toctree: generated/
+
+   CategoricalIndex.codes
+   CategoricalIndex.categories
+   CategoricalIndex.ordered
+   CategoricalIndex.rename_categories
+   CategoricalIndex.reorder_categories
+   CategoricalIndex.add_categories
+   CategoricalIndex.remove_categories
+   CategoricalIndex.remove_unused_categories
+   CategoricalIndex.set_categories
+   CategoricalIndex.as_ordered
+   CategoricalIndex.as_unordered
+
 .. _api.datetimeindex:

 DatetimeIndex
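
The "Categorical Components" entries listed above are the ``Categorical`` attributes and methods exposed on the index. As a rough illustration only, assuming a recent pandas release and reusing the ``'aabbca'`` values from the docs in this commit, a few of them could be exercised like this:

```python
import pandas as pd

# Assumption: a recent pandas release; values mirror the 'aabbca' example.
ci = pd.CategoricalIndex(list("aabbca"), categories=list("cab"))

codes = ci.codes.tolist()         # position of each label within categories
categories = list(ci.categories)  # category order as given: c, a, b
is_ordered = ci.ordered

# The category-modifying methods return a new index; ci itself is unchanged.
with_d = ci.add_categories(["d"])
without_c = ci[:4].remove_unused_categories()  # first four labels never use 'c'
as_ordered = ci.as_ordered()
```

Because each method returns a new object, the original index keeps its categories after these calls.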

doc/source/whatsnew/v0.16.1.txt

+75
@@ -7,6 +7,10 @@ This is a minor bug-fix release from 0.16.0 and includes a a large number of
 bug fixes along several new features, enhancements, and performance improvements.
 We recommend that all users upgrade to this version.

+Highlights include:
+
+- Support for a ``CategoricalIndex``, a category based index, see :ref:`here <whatsnew_0161.enhancements.categoricalindex>`
+
 .. contents:: What's new in v0.16.1
    :local:
    :backlinks: none
@@ -31,6 +35,7 @@ Enhancements
   will return a `np.array` instead of a boolean `Index` (:issue:`8875`). This enables the following expression
   to work naturally:

+
   .. ipython:: python

      idx = Index(['a1', 'a2', 'b1', 'b2'])
@@ -40,6 +45,7 @@ Enhancements
      s[s.index.str.startswith('a')]

 - ``DataFrame.mask()`` and ``Series.mask()`` now support same keywords as ``where`` (:issue:`8801`)
+
 - ``drop`` function can now accept ``errors`` keyword to suppress ValueError raised when any of label does not exist in the target data. (:issue:`6736`)

   .. ipython:: python
@@ -54,6 +60,75 @@ Enhancements
 - Allow timedelta string conversion when leading zero is missing from time definition, ie `0:00:00` vs `00:00:00`. (:issue:`9570`)
 - Allow Panel.shift with ``axis='items'`` (:issue:`9890`)

+
+.. _whatsnew_0161.enhancements.categoricalindex:
+
+CategoricalIndex
+^^^^^^^^^^^^^^^^
+
+We introduce a ``CategoricalIndex``, a new type of index object that is useful for supporting
+indexing with duplicates. This is a container around a ``Categorical`` (introduced in v0.15.0)
+and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
+setting the index of a ``DataFrame/Series`` with a ``category`` dtype would convert this to a regular object-based ``Index``.
+
+.. ipython :: python
+
+   df = DataFrame({'A' : np.arange(6),
+                   'B' : Series(list('aabbca')).astype('category',
+                                                        categories=list('cab'))
+                   })
+   df
+   df.dtypes
+   df.B.cat.categories
+
+setting the index will create a ``CategoricalIndex``
+
+.. ipython :: python
+
+   df2 = df.set_index('B')
+   df2.index
+
+indexing with ``__getitem__/.iloc/.loc/.ix`` works similarly to an Index with duplicates.
+The indexers MUST be in the category or the operation will raise.
+
+.. ipython :: python
+
+   df2.loc['a']
+
+and preserves the ``CategoricalIndex``
+
+.. ipython :: python
+
+   df2.loc['a'].index
+
+sorting will order by the order of the categories
+
+.. ipython :: python
+
+   df2.sort_index()
+
+groupby operations on the index will preserve the index nature as well
+
+.. ipython :: python
+
+   df2.groupby(level=0).sum()
+   df2.groupby(level=0).sum().index
+
+reindexing operations will return a resulting index based on the type of the passed
+indexer, meaning that passing a list will return a plain-old-``Index``; indexing with
+a ``Categorical`` will return a ``CategoricalIndex``, indexed according to the categories
+of the PASSED ``Categorical`` dtype. This allows one to arbitrarily index these even with
+values NOT in the categories, similarly to how you can reindex ANY pandas index.
+
+.. ipython :: python
+
+   df2.reindex(['a','e'])
+   df2.reindex(['a','e']).index
+   df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
+   df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
+
+See the :ref:`documentation <indexing.categoricalindex>` for more. (:issue:`7629`)
+
 .. _whatsnew_0161.api:

 API changes
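
The reindexing rule described in the CategoricalIndex section of this file (a list gives back a plain ``Index``, a ``Categorical`` gives back a ``CategoricalIndex``) can be sketched as below. This is a hedged illustration for a recent pandas release rather than the 0.16.1 API: it deliberately uses a frame with unique labels, on the assumption that newer versions may reject reindexing an axis that contains duplicates, and the name ``df3`` here is unrelated to the ``df3`` in the docs.

```python
import numpy as np
import pandas as pd

# Assumption: recent pandas. Unique labels are used because reindexing an
# axis with duplicate labels may be rejected in newer versions.
df3 = pd.DataFrame({"A": np.arange(3),
                    "B": pd.Series(list("abc")).astype("category")}).set_index("B")

# Reindexing with a plain list gives back a plain Index.
by_list = df3.reindex(["a", "e"])
list_gives_plain_index = not isinstance(by_list.index, pd.CategoricalIndex)

# Reindexing with a Categorical gives back a CategoricalIndex that carries
# the categories of the passed Categorical.
target = pd.Categorical(["a", "e"], categories=list("abe"))
by_categorical = df3.reindex(target)
categorical_kept = isinstance(by_categorical.index, pd.CategoricalIndex)
result_categories = list(by_categorical.index.categories)

# 'e' is not a label of the original index, so its row is filled with NaN.
missing_row_is_nan = bool(by_categorical["A"].isna().iloc[1])
```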

pandas/core/api.py

+1 -1
@@ -8,7 +8,7 @@
 from pandas.core.categorical import Categorical
 from pandas.core.groupby import Grouper
 from pandas.core.format import set_eng_float_format
-from pandas.core.index import Index, Int64Index, Float64Index, MultiIndex
+from pandas.core.index import Index, CategoricalIndex, Int64Index, Float64Index, MultiIndex

 from pandas.core.series import Series, TimeSeries
 from pandas.core.frame import DataFrame

pandas/core/base.py

+4 -2
@@ -121,7 +121,7 @@ def _delegate_method(self, name, *args, **kwargs):
         raise TypeError("You cannot call method {name}".format(name=name))

     @classmethod
-    def _add_delegate_accessors(cls, delegate, accessors, typ):
+    def _add_delegate_accessors(cls, delegate, accessors, typ, overwrite=False):
         """
         add accessors to cls from the delegate class

@@ -131,6 +131,8 @@ def _add_delegate_accessors(cls, delegate, accessors, typ):
         delegate : the class to get methods/properties & doc-strings
         acccessors : string list of accessors to add
         typ : 'property' or 'method'
+        overwrite : boolean, default False
+           overwrite the method/property in the target class if it exists

         """

@@ -164,7 +166,7 @@ def f(self, *args, **kwargs):
                 f = _create_delegator_method(name)

             # don't overwrite existing methods/properties
-            if not hasattr(cls, name):
+            if overwrite or not hasattr(cls, name):
                 setattr(cls,name,f)

170172
