Commit beac7d3

support CategoricalIndex
raise KeyError when accessing invalid elements
setting elements not in the categories is equivalent to .append() (which coerces to an Index)
1 parent cef3c85 commit beac7d3

18 files changed, with 1935 additions and 369 deletions

doc/source/advanced.rst

+88-2
Original file line number | Diff line number | Diff line change
@@ -594,7 +594,94 @@ faster than fancy indexing.
594594
timeit ser.ix[indexer]
595595
timeit ser.take(indexer)
596596

597-
.. _indexing.float64index:
597+
.. _indexing.categoricalindex:
598+
599+
CategoricalIndex
600+
----------------
601+
602+
.. versionadded:: 0.16.1
603+
604+
We introduce a ``CategoricalIndex``, a new type of index object that is useful for supporting
605+
indexing with duplicates. This is a container around a ``Categorical`` (introduced in v0.15.0)
606+
and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
607+
setting the index of a ``DataFrame/Series`` with a ``category`` dtype would convert this to a regular object-based ``Index``.
608+
609+
.. ipython:: python
610+
611+
df = DataFrame({'A' : np.arange(6),
612+
'B' : Series(list('aabbca')).astype('category',
613+
categories=list('cab'))
614+
})
615+
df
616+
df.dtypes
617+
df.B.cat.categories
618+
619+
Setting the index will create a ``CategoricalIndex``
620+
621+
.. ipython:: python
622+
623+
df2 = df.set_index('B')
624+
df2.index
625+
626+
Indexing with ``__getitem__/.iloc/.loc/.ix`` works similarly to an ``Index`` with duplicates.
627+
The indexers MUST be in the category or the operation will raise.
628+
629+
.. ipython:: python
630+
631+
df2.loc['a']
632+
633+
These PRESERVE the ``CategoricalIndex``
634+
635+
.. ipython:: python
636+
637+
df2.loc['a'].index
638+
639+
Sorting will order by the order of the categories
640+
641+
.. ipython:: python
642+
643+
df2.sort_index()
644+
645+
Groupby operations on the index will preserve the index nature as well
646+
647+
.. ipython:: python
648+
649+
df2.groupby(level=0).sum()
650+
df2.groupby(level=0).sum().index
651+
652+
Reindexing operations will return a resulting index based on the type of the passed
653+
indexer, meaning that passing a list will return a plain-old-``Index``; indexing with
654+
a ``Categorical`` will return a ``CategoricalIndex``, indexed according to the categories
655+
of the PASSED ``Categorical`` dtype. This allows one to arbitrarily index these even with
656+
values NOT in the categories, similarly to how you can reindex ANY pandas index.
657+
658+
.. ipython:: python
659+
660+
df2.reindex(['a','e'])
661+
df2.reindex(['a','e']).index
662+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
663+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
664+
665+
.. warning::
666+
667+
Reshaping and comparison operations on a ``CategoricalIndex`` must have the same categories
668+
or a ``TypeError`` will be raised.
669+
670+
.. code-block:: python
671+
672+
In [10]: df3 = DataFrame({'A' : np.arange(6),
673+
'B' : Series(list('aabbca')).astype('category',
674+
categories=list('abc'))
675+
}).set_index('B')
676+
677+
In [11]: df3.index
678+
Out[11]:
679+
CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'],
680+
categories=[u'a', u'b', u'c'],
681+
ordered=False)
682+
683+
In [12]: pd.concat([df2,df3])
684+
TypeError: categories must match existing categories when appending
598685
599686
Float64Index
600687
------------
@@ -706,4 +793,3 @@ Of course if you need integer based selection, then use ``iloc``
706793
.. ipython:: python
707794
708795
dfir.iloc[0:5]
709-
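The indexing behaviour described in the ``CategoricalIndex`` hunk above can be tried out with a short script. This is a minimal sketch, assuming a current pandas release: it builds the categorical column with ``pd.Categorical(..., categories=...)`` instead of the ``astype('category', categories=...)`` spelling used in the 0.16.1 docs, and it only exercises ``.loc`` and ``sort_index``. The variable names are illustrative and not part of the commit.

```python
import numpy as np
import pandas as pd

# Same data as the docs example; the category order is deliberately c, a, b.
df = pd.DataFrame({
    "A": np.arange(6),
    "B": pd.Categorical(list("aabbca"), categories=list("cab")),
})

# Setting a categorical column as the index yields a CategoricalIndex.
df2 = df.set_index("B")
index_is_categorical = isinstance(df2.index, pd.CategoricalIndex)

# Label lookup works like an Index with duplicates and keeps the index type.
rows_a = df2.loc["a"]

# A label that is not one of the categories raises KeyError.
try:
    df2.loc["e"]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

# Sorting follows the order of the categories, not alphabetical order.
sorted_labels = df2.sort_index().index.tolist()
```

Because the categories were declared as ``c, a, b``, the sorted index starts with ``'c'`` even though the index is not marked as ordered.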

doc/source/api.rst

+29
@@ -1233,6 +1233,7 @@ Modifying and Computations
12331233
Index.max
12341234
Index.order
12351235
Index.reindex
1236+
Index.reindex_non_unique
12361237
Index.repeat
12371238
Index.take
12381239
Index.putmask
@@ -1291,6 +1292,34 @@ Selecting
12911292
Index.slice_indexer
12921293
Index.slice_locs
12931294

1295+
.. _api.categoricalindex:
1296+
1297+
CategoricalIndex
1298+
----------------
1299+
1300+
.. autosummary::
1301+
:toctree: generated/
1302+
1303+
CategoricalIndex
1304+
1305+
Categorical Components
1306+
~~~~~~~~~~~~~~~~~~~~~~
1307+
1308+
.. autosummary::
1309+
:toctree: generated/
1310+
1311+
CategoricalIndex.codes
1312+
CategoricalIndex.categories
1313+
CategoricalIndex.ordered
1314+
CategoricalIndex.rename_categories
1315+
CategoricalIndex.reorder_categories
1316+
CategoricalIndex.add_categories
1317+
CategoricalIndex.remove_categories
1318+
CategoricalIndex.remove_unused_categories
1319+
CategoricalIndex.set_categories
1320+
CategoricalIndex.as_ordered
1321+
CategoricalIndex.as_unordered
1322+
12941323
.. _api.datetimeindex:
12951324

12961325
DatetimeIndex
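The "Categorical Components" entries added to ``api.rst`` above are accessors delegated to the underlying ``Categorical``. The following is a minimal sketch of a few of them, assuming a current pandas release where these attributes and methods are available on ``CategoricalIndex``; the index values and variable names are made up for illustration.

```python
import pandas as pd

ci = pd.CategoricalIndex(list("aabbca"), categories=list("cab"))

# Integer position of each value within ``categories``.
codes = ci.codes.tolist()
categories = ci.categories.tolist()
ordered = ci.ordered

# Each method returns a new index; the original ``ci`` is unchanged.
with_d = ci.add_categories(["d"])
trimmed = with_d.remove_unused_categories()
reordered = ci.reorder_categories(list("abc"))
renamed = ci.rename_categories({"a": "x"})
as_ordered = ci.as_ordered()
```

Reordering changes the codes but not the values, while renaming changes the values but not the codes.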

doc/source/whatsnew/v0.16.1.txt

+75
@@ -7,6 +7,10 @@ This is a minor bug-fix release from 0.16.0 and includes a large number of
77
bug fixes along with several new features, enhancements, and performance improvements.
88
We recommend that all users upgrade to this version.
99

10+
Highlights include:
11+
12+
- Support for a ``CategoricalIndex``, a category-based index, see :ref:`here <whatsnew_0161.enhancements.categoricalindex>`
13+
1014
.. contents:: What's new in v0.16.1
1115
:local:
1216
:backlinks: none
@@ -31,6 +35,7 @@ Enhancements
3135
will return a `np.array` instead of a boolean `Index` (:issue:`8875`). This enables the following expression
3236
to work naturally:
3337

38+
3439
.. ipython:: python
3540

3641
idx = Index(['a1', 'a2', 'b1', 'b2'])
@@ -40,6 +45,7 @@ Enhancements
4045
s[s.index.str.startswith('a')]
4146

4247
- ``DataFrame.mask()`` and ``Series.mask()`` now support same keywords as ``where`` (:issue:`8801`)
48+
4349
- ``drop`` function can now accept ``errors`` keyword to suppress the ``ValueError`` raised when any of the labels does not exist in the target data. (:issue:`6736`)
4450

4551
.. ipython:: python
@@ -53,6 +59,75 @@ Enhancements
5359

5460
- Allow timedelta string conversion when leading zero is missing from time definition, ie `0:00:00` vs `00:00:00`. (:issue:`9570`)
5561

62+
63+
.. _whatsnew_0161.enhancements.categoricalindex:
64+
65+
CategoricalIndex
66+
^^^^^^^^^^^^^^^^
67+
68+
We introduce a ``CategoricalIndex``, a new type of index object that is useful for supporting
69+
indexing with duplicates. This is a container around a ``Categorical`` (introduced in v0.15.0)
70+
and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
71+
setting the index of a ``DataFrame/Series`` with a ``category`` dtype would convert this to a regular object-based ``Index``.
72+
73+
.. ipython :: python
74+
75+
df = DataFrame({'A' : np.arange(6),
76+
'B' : Series(list('aabbca')).astype('category',
77+
categories=list('cab'))
78+
})
79+
df
80+
df.dtypes
81+
df.B.cat.categories
82+
83+
setting the index will create a ``CategoricalIndex``
84+
85+
.. ipython :: python
86+
87+
df2 = df.set_index('B')
88+
df2.index
89+
90+
indexing with ``__getitem__/.iloc/.loc/.ix`` works similarly to an Index with duplicates.
91+
The indexers MUST be in the category or the operation will raise.
92+
93+
.. ipython :: python
94+
95+
df2.loc['a']
96+
97+
and preserves the ``CategoricalIndex``
98+
99+
.. ipython :: python
100+
101+
df2.loc['a'].index
102+
103+
sorting will order by the order of the categories
104+
105+
.. ipython :: python
106+
107+
df2.sort_index()
108+
109+
groupby operations on the index will preserve the index nature as well
110+
111+
.. ipython :: python
112+
113+
df2.groupby(level=0).sum()
114+
df2.groupby(level=0).sum().index
115+
116+
reindexing operations will return a resulting index based on the type of the passed
117+
indexer, meaning that passing a list will return a plain-old-``Index``; indexing with
118+
a ``Categorical`` will return a ``CategoricalIndex``, indexed according to the categories
119+
of the PASSED ``Categorical`` dtype. This allows one to arbitrarily index these even with
120+
values NOT in the categories, similarly to how you can reindex ANY pandas index.
121+
122+
.. ipython :: python
123+
124+
df2.reindex(['a','e'])
125+
df2.reindex(['a','e']).index
126+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
127+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
128+
129+
See the :ref:`documentation <indexing.categoricalindex>` for more. (:issue:`7629`)
130+
56131
.. _whatsnew_0161.api:
57132

58133
API changes
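The reindexing and groupby notes in the ``CategoricalIndex`` section above can be checked with a short script. This is a sketch under stated assumptions: it targets a current pandas release, it reindexes a frame with unique labels (an assumption made so the sketch stays portable across versions, whereas the 0.16.1 example reindexes ``df2`` with duplicate labels), and it passes ``observed=False`` to ``groupby`` explicitly. The frame names mirror the docs, but the code is not taken from the commit.

```python
import numpy as np
import pandas as pd

# Unique labels for the reindex part of the sketch.
df3 = pd.DataFrame({
    "A": np.arange(3),
    "B": pd.Categorical(list("abc"), categories=list("cab")),
}).set_index("B")

# A plain list gives back a plain Index; 'e' is not a category, so its row is NaN.
by_list = df3.reindex(["a", "e"])

# A Categorical gives back a CategoricalIndex using the PASSED categories.
by_cat = df3.reindex(pd.Categorical(["a", "e"], categories=list("abcde")))

# Duplicate labels for the groupby part, as in the docs example.
df2 = pd.DataFrame({
    "A": np.arange(6),
    "B": pd.Categorical(list("aabbca"), categories=list("cab")),
}).set_index("B")

# Grouping on the index level keeps the categorical nature of the index.
sums = df2.groupby(level=0, observed=False).sum()
```

In both reindex calls the row for ``'e'`` is filled with ``NaN``; only the type of the resulting index differs.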

pandas/core/api.py

+1-1
@@ -8,7 +8,7 @@
88
from pandas.core.categorical import Categorical
99
from pandas.core.groupby import Grouper
1010
from pandas.core.format import set_eng_float_format
11-
from pandas.core.index import Index, Int64Index, Float64Index, MultiIndex
11+
from pandas.core.index import Index, CategoricalIndex, Int64Index, Float64Index, MultiIndex
1212

1313
from pandas.core.series import Series, TimeSeries
1414
from pandas.core.frame import DataFrame

pandas/core/base.py

+4-2
@@ -121,7 +121,7 @@ def _delegate_method(self, name, *args, **kwargs):
121121
raise TypeError("You cannot call method {name}".format(name=name))
122122

123123
@classmethod
124-
def _add_delegate_accessors(cls, delegate, accessors, typ):
124+
def _add_delegate_accessors(cls, delegate, accessors, typ, overwrite=False):
125125
"""
126126
add accessors to cls from the delegate class
127127
@@ -131,6 +131,8 @@ def _add_delegate_accessors(cls, delegate, accessors, typ):
131131
delegate : the class to get methods/properties & doc-strings
132132
accessors : string list of accessors to add
133133
typ : 'property' or 'method'
134+
overwrite : boolean, default False
135+
overwrite the method/property in the target class if it exists
134136
135137
"""
136138

@@ -164,7 +166,7 @@ def f(self, *args, **kwargs):
164166
f = _create_delegator_method(name)
165167

166168
# don't overwrite existing methods/properties
167-
if not hasattr(cls, name):
169+
if overwrite or not hasattr(cls, name):
168170
setattr(cls,name,f)
169171

170172
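The effect of the new ``overwrite`` flag in the hunk above can be illustrated without pandas. The following is a simplified, hypothetical sketch of the same pattern, not the pandas implementation: ``Engine``, ``Car``, ``add_delegate_methods`` and the ``_delegate`` attribute are invented names, and only methods (not properties) are handled.

```python
class Engine:
    """Hypothetical delegate class; not part of pandas."""

    def start(self):
        """Start the engine."""
        return "engine started"

    def stop(self):
        """Stop the engine."""
        return "engine stopped"


def add_delegate_methods(cls, delegate, names, overwrite=False):
    """Attach forwarding methods to ``cls`` for each name in ``names``."""

    def make_delegator(name):
        def f(self, *args, **kwargs):
            # Forward the call to the wrapped delegate instance.
            return getattr(self._delegate, name)(*args, **kwargs)

        f.__name__ = name
        f.__doc__ = getattr(delegate, name).__doc__
        return f

    for name in names:
        # Same guard as in the diff: keep existing attributes unless told otherwise.
        if overwrite or not hasattr(cls, name):
            setattr(cls, name, make_delegator(name))


class Car:
    def __init__(self):
        self._delegate = Engine()

    def stop(self):
        return "car stopped"


class KeepCar(Car):
    pass


class OverwriteCar(Car):
    pass


add_delegate_methods(KeepCar, Engine, ["start", "stop"])
add_delegate_methods(OverwriteCar, Engine, ["start", "stop"], overwrite=True)

keep_result = (KeepCar().start(), KeepCar().stop())
overwrite_result = (OverwriteCar().start(), OverwriteCar().stop())
```

With the default ``overwrite=False`` the inherited ``Car.stop`` survives, while ``overwrite=True`` replaces it with the forwarding method. This is the distinction the new keyword argument introduces.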
