Commit ecf8514

support CategoricalIndex
raise KeyError when accessing invalid elements
setting elements not in the categories is equiv of .append() (which coerces to an Index)
1 parent 85059a4 commit ecf8514

18 files changed

+1950
-375
lines changed

doc/source/advanced.rst

+89-1
Original file line numberDiff line numberDiff line change
@@ -594,6 +594,95 @@ faster than fancy indexing.
594594
timeit ser.ix[indexer]
595595
timeit ser.take(indexer)
596596

597+
.. _indexing.categoricalindex:
598+
599+
CategoricalIndex
600+
----------------
601+
602+
.. versionadded:: 0.16.1
603+
604+
We introduce a ``CategoricalIndex``, a new type of index object that is useful for supporting
605+
indexing with duplicates. This is a container around a ``Categorical`` (introduced in v0.15.0)
606+
and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
607+
setting the index of a ``DataFrame/Series`` with a ``category`` dtype would convert this to regular object-based ``Index``.
608+
609+
.. ipython:: python
610+
611+
df = DataFrame({'A' : np.arange(6),
612+
'B' : Series(list('aabbca')).astype('category',
613+
categories=list('cab'))
614+
})
615+
df
616+
df.dtypes
617+
df.B.cat.categories
618+
619+
Setting the index will create a ``CategoricalIndex``
620+
621+
.. ipython:: python
622+
623+
df2 = df.set_index('B')
624+
df2.index
625+
626+
Indexing with ``__getitem__/.iloc/.loc/.ix`` works similarly to an ``Index`` with duplicates.
627+
The indexers MUST be in the categories or the operation will raise a ``KeyError``.
628+
629+
.. ipython:: python
630+
631+
df2.loc['a']
632+
633+
These PRESERVE the ``CategoricalIndex``
634+
635+
.. ipython:: python
636+
637+
df2.loc['a'].index
638+
639+
Sorting will order by the order of the categories
640+
641+
.. ipython:: python
642+
643+
df2.sort_index()
644+
645+
Groupby operations on the index will preserve the index nature as well
646+
647+
.. ipython:: python
648+
649+
df2.groupby(level=0).sum()
650+
df2.groupby(level=0).sum().index
651+
652+
Reindexing operations will return a resulting index based on the type of the passed
653+
indexer, meaning that passing a list will return a plain-old-``Index``; indexing with
654+
a ``Categorical`` will return a ``CategoricalIndex``, indexed according to the categories
655+
of the PASSED ``Categorical`` dtype. This allows one to arbitrarily index these even with
656+
values NOT in the categories, similarly to how you can reindex ANY pandas index.
657+
658+
.. ipython :: python
659+
660+
df2.reindex(['a','e'])
661+
df2.reindex(['a','e']).index
662+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
663+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
664+
665+
.. warning::
666+
667+
Reshaping and comparison operations on a ``CategoricalIndex`` must have the same categories
668+
or a ``TypeError`` will be raised.
669+
670+
.. code-block:: python
671+
672+
In [10]: df3 = DataFrame({'A' : np.arange(6),
673+
'B' : Series(list('aabbca')).astype('category',
674+
categories=list('abc'))
675+
}).set_index('B')
676+
677+
In [11]: df3.index
678+
Out[11]:
679+
CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'],
680+
categories=[u'a', u'b', u'c'],
681+
ordered=False)
682+
683+
In [12]: pd.concat([df2,df3])
684+
TypeError: categories must match existing categories when appending
685+
597686
.. _indexing.float64index:
598687
599688
Float64Index
@@ -706,4 +795,3 @@ Of course if you need integer based selection, then use ``iloc``
706795
.. ipython:: python
707796
708797
dfir.iloc[0:5]
709-

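As a reading aid for the ``CategoricalIndex`` documentation added above, here is a minimal sketch of the same workflow. It assumes a recent pandas release where categories are declared through ``pd.CategoricalDtype``; the ``astype('category', categories=...)`` spelling in the diff belongs to the pandas version of this commit and may not be accepted by newer releases. Variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumption: a recent pandas where categories are declared via CategoricalDtype.
cat_dtype = pd.CategoricalDtype(categories=list("cab"))
df = pd.DataFrame({
    "A": np.arange(6),
    "B": pd.Series(list("aabbca")).astype(cat_dtype),
})

# Setting a category-dtype column as the index yields a CategoricalIndex.
df2 = df.set_index("B")
is_categorical_index = isinstance(df2.index, pd.CategoricalIndex)

# Label lookup behaves like an Index with duplicates and keeps the index type.
rows_a = df2.loc["a"]
values_a = rows_a["A"].tolist()
keeps_categorical = isinstance(rows_a.index, pd.CategoricalIndex)

# Sorting follows the order of the categories ('c', 'a', 'b'), not the alphabet.
sorted_labels = list(df2.sort_index().index)

# A label that is not among the categories raises KeyError.
try:
    df2.loc["e"]
    missing_raises = False
except KeyError:
    missing_raises = True
```

The sketch only exercises lookup and sorting. The reshaping and comparison rules from the warning block are left out, since their behaviour may differ between pandas versions.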
doc/source/api.rst

+28
Original file line numberDiff line numberDiff line change
@@ -1291,6 +1291,34 @@ Selecting
12911291
Index.slice_indexer
12921292
Index.slice_locs
12931293

1294+
.. _api.categoricalindex:
1295+
1296+
CategoricalIndex
1297+
----------------
1298+
1299+
.. autosummary::
1300+
:toctree: generated/
1301+
1302+
CategoricalIndex
1303+
1304+
Categorical Components
1305+
~~~~~~~~~~~~~~~~~~~~~~
1306+
1307+
.. autosummary::
1308+
:toctree: generated/
1309+
1310+
CategoricalIndex.codes
1311+
CategoricalIndex.categories
1312+
CategoricalIndex.ordered
1313+
CategoricalIndex.rename_categories
1314+
CategoricalIndex.reorder_categories
1315+
CategoricalIndex.add_categories
1316+
CategoricalIndex.remove_categories
1317+
CategoricalIndex.remove_unused_categories
1318+
CategoricalIndex.set_categories
1319+
CategoricalIndex.as_ordered
1320+
CategoricalIndex.as_unordered
1321+
12941322
.. _api.datetimeindex:
12951323

12961324
DatetimeIndex

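The accessors listed in the ``api.rst`` hunk above can be tried out directly on a ``CategoricalIndex``. The following is a small sketch, assuming a recent pandas release in which these attributes and methods are exposed on the index itself; the sample data is made up for illustration.

```python
import pandas as pd

# Illustrative data: six labels drawn from the categories 'c', 'a', 'b'.
ci = pd.CategoricalIndex(list("aabbca"), categories=list("cab"))

# codes are integer positions into categories ('c' -> 0, 'a' -> 1, 'b' -> 2).
codes = ci.codes.tolist()
categories = list(ci.categories)
ordered = ci.ordered

# The category-manipulating methods return a new index; ci is left unchanged.
extended = ci.add_categories(["d"])
ordered_ci = ci.as_ordered()
renamed = ci.rename_categories({"a": "A"})
```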
doc/source/whatsnew/v0.16.1.txt

+75
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,10 @@ This is a minor bug-fix release from 0.16.0 and includes a large number of
77
bug fixes along with several new features, enhancements, and performance improvements.
88
We recommend that all users upgrade to this version.
99

10+
Highlights include:
11+
12+
- Support for a ``CategoricalIndex``, a category based index, see :ref:`here <whatsnew_0161.enhancements.categoricalindex>`
13+
1014
.. contents:: What's new in v0.16.1
1115
:local:
1216
:backlinks: none
@@ -31,6 +35,7 @@ Enhancements
3135
will return a `np.array` instead of a boolean `Index` (:issue:`8875`). This enables the following expression
3236
to work naturally:
3337

38+
3439
.. ipython:: python
3540

3641
idx = Index(['a1', 'a2', 'b1', 'b2'])
@@ -40,6 +45,7 @@ Enhancements
4045
s[s.index.str.startswith('a')]
4146

4247
- ``DataFrame.mask()`` and ``Series.mask()`` now support same keywords as ``where`` (:issue:`8801`)
48+
4349
- ``drop`` function can now accept ``errors`` keyword to suppress the ``ValueError`` raised when any of the labels does not exist in the target data. (:issue:`6736`)
4450

4551
.. ipython:: python
@@ -58,6 +64,75 @@ Enhancements
5864

5965
- ``DataFrame`` and ``Series`` now have ``_constructor_expanddim`` property as overridable constructor for one higher dimensionality data. This should be used only when it is really needed, see :ref:`here <ref-subclassing-pandas>`
6066

67+
.. _whatsnew_0161.enhancements.categoricalindex:
68+
69+
CategoricalIndex
70+
^^^^^^^^^^^^^^^^
71+
72+
We introduce a ``CategoricalIndex``, a new type of index object that is useful for supporting
73+
indexing with duplicates. This is a container around a ``Categorical`` (introduced in v0.15.0)
74+
and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
75+
setting the index of a ``DataFrame/Series`` with a ``category`` dtype would convert this to regular object-based ``Index``.
76+
77+
.. ipython :: python
78+
79+
df = DataFrame({'A' : np.arange(6),
80+
'B' : Series(list('aabbca')).astype('category',
81+
categories=list('cab'))
82+
})
83+
df
84+
df.dtypes
85+
df.B.cat.categories
86+
87+
setting the index will create a ``CategoricalIndex``
88+
89+
.. ipython :: python
90+
91+
df2 = df.set_index('B')
92+
df2.index
93+
94+
indexing with ``__getitem__/.iloc/.loc/.ix`` works similarly to an Index with duplicates.
95+
The indexers MUST be in the categories or the operation will raise a ``KeyError``.
96+
97+
.. ipython :: python
98+
99+
df2.loc['a']
100+
101+
and preserves the ``CategoricalIndex``
102+
103+
.. ipython :: python
104+
105+
df2.loc['a'].index
106+
107+
sorting will order by the order of the categories
108+
109+
.. ipython :: python
110+
111+
df2.sort_index()
112+
113+
groupby operations on the index will preserve the index nature as well
114+
115+
.. ipython :: python
116+
117+
df2.groupby(level=0).sum()
118+
df2.groupby(level=0).sum().index
119+
120+
reindexing operations will return a resulting index based on the type of the passed
121+
indexer, meaning that passing a list will return a plain-old-``Index``; indexing with
122+
a ``Categorical`` will return a ``CategoricalIndex``, indexed according to the categories
123+
of the PASSED ``Categorical`` dtype. This allows one to arbitrarily index these even with
124+
values NOT in the categories, similarly to how you can reindex ANY pandas index.
125+
126+
.. ipython :: python
127+
128+
df2.reindex(['a','e'])
129+
df2.reindex(['a','e']).index
130+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
131+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
132+
133+
See the :ref:`documentation <indexing.categoricalindex>` for more. (:issue:`7629`)
135+
61136
.. _whatsnew_0161.api:
62137

63138
API changes

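The reindexing rule described in the whatsnew text (a list gives back a plain ``Index``, a ``Categorical`` gives back a ``CategoricalIndex`` built from the passed categories) can be sketched as follows. This assumes a recent pandas release. The frame ``unique_df`` is a hypothetical stand-in with unique labels, chosen because newer releases may refuse to reindex an axis that contains duplicate labels, as ``df2`` does.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with unique categorical labels 'a', 'b', 'c'.
unique_df = pd.DataFrame({
    "A": np.arange(3),
    "B": pd.Series(list("abc")).astype("category"),
}).set_index("B")

# Reindexing with a list: the result carries a regular Index.
by_list = unique_df.reindex(["a", "e"])

# Reindexing with a Categorical: the result carries a CategoricalIndex
# whose categories are those of the passed Categorical.
by_categorical = unique_df.reindex(
    pd.Categorical(["a", "e"], categories=list("abe"))
)
```

In both cases the label ``'e'`` is absent from the original index, so its row is filled with missing values, as with any other reindex.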
pandas/core/api.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@
88
from pandas.core.categorical import Categorical
99
from pandas.core.groupby import Grouper
1010
from pandas.core.format import set_eng_float_format
11-
from pandas.core.index import Index, Int64Index, Float64Index, MultiIndex
11+
from pandas.core.index import Index, CategoricalIndex, Int64Index, Float64Index, MultiIndex
1212

1313
from pandas.core.series import Series, TimeSeries
1414
from pandas.core.frame import DataFrame

pandas/core/base.py

+4-2
Original file line numberDiff line numberDiff line change
@@ -121,7 +121,7 @@ def _delegate_method(self, name, *args, **kwargs):
121121
raise TypeError("You cannot call method {name}".format(name=name))
122122

123123
@classmethod
124-
def _add_delegate_accessors(cls, delegate, accessors, typ):
124+
def _add_delegate_accessors(cls, delegate, accessors, typ, overwrite=False):
125125
"""
126126
add accessors to cls from the delegate class
127127
@@ -131,6 +131,8 @@ def _add_delegate_accessors(cls, delegate, accessors, typ):
131131
delegate : the class to get methods/properties & doc-strings
132132
acccessors : string list of accessors to add
133133
typ : 'property' or 'method'
134+
overwrite : boolean, default False
135+
overwrite the method/property in the target class if it exists
134136
135137
"""
136138

@@ -164,7 +166,7 @@ def f(self, *args, **kwargs):
164166
f = _create_delegator_method(name)
165167

166168
# don't overwrite existing methods/properties
167-
if not hasattr(cls, name):
169+
if overwrite or not hasattr(cls, name):
168170
setattr(cls,name,f)
169171

170172

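The effect of the new ``overwrite`` flag in ``_add_delegate_accessors`` can be illustrated with a simplified, pandas-free stand-in. This is a sketch of the general delegation pattern, not the pandas implementation: ``Engine``, ``Car``, ``add_delegate_methods`` and the ``_delegate`` attribute are hypothetical names introduced here for illustration.

```python
class Engine:
    """Hypothetical delegate class that supplies the real behaviour."""

    def start(self):
        return "engine started"

    def stop(self):
        return "engine stopped"


def add_delegate_methods(cls, delegate, names, overwrite=False):
    """Attach forwarding methods to cls for each name found on delegate."""

    def make_forwarder(name):
        def f(self, *args, **kwargs):
            # Forward the call to the wrapped delegate instance.
            return getattr(self._delegate, name)(*args, **kwargs)

        f.__name__ = name
        f.__doc__ = getattr(delegate, name).__doc__
        return f

    for name in names:
        # Same guard as in the diff: keep existing attributes unless told otherwise.
        if overwrite or not hasattr(cls, name):
            setattr(cls, name, make_forwarder(name))


class Car:
    def __init__(self):
        self._delegate = Engine()

    def start(self):
        return "car started"


# Default: the existing Car.start is kept, the missing stop is added.
add_delegate_methods(Car, Engine, ["start", "stop"])
car = Car()
kept = car.start()
added = car.stop()

# With overwrite=True the existing method is replaced by the forwarder.
add_delegate_methods(Car, Engine, ["start"], overwrite=True)
replaced = car.start()
```

With the default ``overwrite=False`` the helper behaves as before the change. Passing ``overwrite=True`` lets a subclass replace an inherited method or property with a delegating one.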