Commit fa7c29e

Merge pull request pandas-dev#9741 from jreback/ci
ENH: support CategoricalIndex (GH7629)
2 parents 85059a4 + ecf8514 commit fa7c29e

18 files changed

+1950
-375
lines changed

doc/source/advanced.rst

+89-1
Original file line numberDiff line numberDiff line change
@@ -594,6 +594,95 @@ faster than fancy indexing.
594594
timeit ser.ix[indexer]
595595
timeit ser.take(indexer)
596596

597+
.. _indexing.categoricalindex:
598+
599+
CategoricalIndex
600+
----------------
601+
602+
.. versionadded:: 0.16.1
603+
604+
We introduce a ``CategoricalIndex``, a new type of index object that is useful for supporting
605+
indexing with duplicates. This is a container around a ``Categorical`` (introduced in v0.15.0)
606+
and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
607+
setting the index of a ``DataFrame/Series`` with a ``category`` dtype would convert this to a regular object-based ``Index``.
608+
609+
.. ipython:: python
610+
611+
df = DataFrame({'A' : np.arange(6),
612+
'B' : Series(list('aabbca')).astype('category',
613+
categories=list('cab'))
614+
})
615+
df
616+
df.dtypes
617+
df.B.cat.categories
618+
619+
Setting the index will create a ``CategoricalIndex``
620+
621+
.. ipython:: python
622+
623+
df2 = df.set_index('B')
624+
df2.index
625+
626+
Indexing with ``__getitem__/.iloc/.loc/.ix`` works similarly to an ``Index`` with duplicates.
627+
The indexers MUST be in the categories or the operation will raise.
628+
629+
.. ipython:: python
630+
631+
df2.loc['a']
632+
633+
These PRESERVE the ``CategoricalIndex``
634+
635+
.. ipython:: python
636+
637+
df2.loc['a'].index
638+
639+
Sorting will order by the order of the categories
640+
641+
.. ipython:: python
642+
643+
df2.sort_index()
644+
645+
Groupby operations on the index will preserve the index nature as well
646+
647+
.. ipython:: python
648+
649+
df2.groupby(level=0).sum()
650+
df2.groupby(level=0).sum().index
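The behaviour described so far can be sketched as follows. This is a minimal sketch rather than the documentation's own example: it assumes a recent pandas release, so the column is built with ``pd.Categorical(..., categories=...)`` instead of the ``astype('category', categories=...)`` call shown in the diff, and ``observed=False`` is passed to ``groupby`` explicitly.

```python
import numpy as np
import pandas as pd

# Same data as the documentation example; categories are deliberately
# given in the non-alphabetical order c, a, b.
df = pd.DataFrame({
    "A": np.arange(6),
    "B": pd.Categorical(list("aabbca"), categories=list("cab")),
})

# Setting the categorical column as the index yields a CategoricalIndex.
df2 = df.set_index("B")
is_categorical = isinstance(df2.index, pd.CategoricalIndex)

# Label-based selection keeps the index type.
selected = df2.loc["a"]
keeps_type = isinstance(selected.index, pd.CategoricalIndex)

# Sorting follows the category order (c, a, b), not alphabetical order.
sorted_labels = list(df2.sort_index().index)

# Grouping on the index level also keeps a CategoricalIndex on the result.
grouped = df2.groupby(level=0, observed=False).sum()
grouped_labels = list(grouped.index)
```

Here ``sorted_labels`` comes out in category order, with ``'c'`` first, which is the point of the sorting paragraph above.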
651+
652+
Reindexing operations will return a resulting index based on the type of the passed
653+
indexer, meaning that passing a list will return a plain-old-``Index``; indexing with
654+
a ``Categorical`` will return a ``CategoricalIndex``, indexed according to the categories
655+
of the PASSED ``Categorical`` dtype. This allows one to arbitrarily index these even with
656+
values NOT in the categories, similarly to how you can reindex ANY pandas index.
657+
658+
.. ipython :: python
659+
660+
df2.reindex(['a','e'])
661+
df2.reindex(['a','e']).index
662+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
663+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
664+
665+
.. warning::
666+
667+
Reshaping and comparison operations on a ``CategoricalIndex`` must have the same categories
668+
or a ``TypeError`` will be raised.
669+
670+
.. code-block:: python
671+
672+
In [10]: df3 = DataFrame({'A' : np.arange(6),
673+
'B' : Series(list('aabbca')).astype('category',
674+
categories=list('abc'))
675+
}).set_index('B')
676+
677+
In [11]: df3.index
678+
Out[11]:
679+
CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'],
680+
categories=[u'a', u'b', u'c'],
681+
ordered=False)
682+
683+
In [12]: pd.concat([df2,df3])
684+
TypeError: categories must match existing categories when appending
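The comparison half of this warning can be sketched as below. This is a hedged illustration that assumes a recent pandas release. The index names ``left``, ``same`` and ``other`` are made up for the example. How ``pd.concat`` treats mismatched categories may differ between pandas versions, so the sketch leaves it out.

```python
import pandas as pd

left = pd.CategoricalIndex(list("aab"), categories=list("abc"))
same = pd.CategoricalIndex(list("abb"), categories=list("abc"))
other = pd.CategoricalIndex(list("aab"), categories=list("abd"))

# Identical categories: element-wise comparison is allowed.
matches = (left == same).tolist()

# Different categories: the comparison is rejected with a TypeError.
try:
    left == other
    raised = False
except TypeError:
    raised = True
```

The values of ``left`` and ``other`` are the same here. Only the declared categories differ, and that difference alone makes the comparison fail.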
685+
597686
.. _indexing.float64index:
598687
599688
Float64Index
@@ -706,4 +795,3 @@ Of course if you need integer based selection, then use ``iloc``
706795
.. ipython:: python
707796
708797
dfir.iloc[0:5]
709-

doc/source/api.rst

+28
@@ -1291,6 +1291,34 @@ Selecting
12911291
Index.slice_indexer
12921292
Index.slice_locs
12931293

1294+
.. _api.categoricalindex:
1295+
1296+
CategoricalIndex
1297+
----------------
1298+
1299+
.. autosummary::
1300+
:toctree: generated/
1301+
1302+
CategoricalIndex
1303+
1304+
Categorical Components
1305+
~~~~~~~~~~~~~~~~~~~~~~
1306+
1307+
.. autosummary::
1308+
:toctree: generated/
1309+
1310+
CategoricalIndex.codes
1311+
CategoricalIndex.categories
1312+
CategoricalIndex.ordered
1313+
CategoricalIndex.rename_categories
1314+
CategoricalIndex.reorder_categories
1315+
CategoricalIndex.add_categories
1316+
CategoricalIndex.remove_categories
1317+
CategoricalIndex.remove_unused_categories
1318+
CategoricalIndex.set_categories
1319+
CategoricalIndex.as_ordered
1320+
CategoricalIndex.as_unordered
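As a rough illustration of the components listed above, the sketch below calls several of them on a small ``CategoricalIndex``. It assumes a recent pandas release in which these attributes and methods are available on the index itself. The variable names are illustrative only.

```python
import pandas as pd

ci = pd.CategoricalIndex(list("aabbca"), categories=list("cab"))

# codes are integer positions into ci.categories.
codes = ci.codes.tolist()
categories = list(ci.categories)
ordered = ci.ordered

# Renaming is positional: c -> x, a -> y, b -> z.
renamed = ci.rename_categories(["x", "y", "z"])

# Adding a category does not change the values; removing unused
# categories drops it again because no element uses it.
extended = ci.add_categories(["d"])
trimmed = extended.remove_unused_categories()

# as_ordered returns a new index whose categories carry an order.
as_ordered = ci.as_ordered()
```

Each method returns a new index rather than modifying ``ci`` in place, so ``ci`` keeps its original three categories throughout.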
1321+
12941322
.. _api.datetimeindex:
12951323

12961324
DatetimeIndex

doc/source/whatsnew/v0.16.1.txt

+75
@@ -7,6 +7,10 @@ This is a minor bug-fix release from 0.16.0 and includes a a large number of
77
bug fixes along with several new features, enhancements, and performance improvements.
88
We recommend that all users upgrade to this version.
99

10+
Highlights include:
11+
12+
- Support for a ``CategoricalIndex``, a category based index, see :ref:`here <whatsnew_0161.enhancements.categoricalindex>`
13+
1014
.. contents:: What's new in v0.16.1
1115
:local:
1216
:backlinks: none
@@ -31,6 +35,7 @@ Enhancements
3135
will return a `np.array` instead of a boolean `Index` (:issue:`8875`). This enables the following expression
3236
to work naturally:
3337

38+
3439
.. ipython:: python
3540

3641
idx = Index(['a1', 'a2', 'b1', 'b2'])
@@ -40,6 +45,7 @@ Enhancements
4045
s[s.index.str.startswith('a')]
4146

4247
- ``DataFrame.mask()`` and ``Series.mask()`` now support the same keywords as ``where`` (:issue:`8801`)
48+
4349
- ``drop`` function can now accept ``errors`` keyword to suppress the ``ValueError`` raised when any of the labels does not exist in the target data. (:issue:`6736`)
4450

4551
.. ipython:: python
@@ -58,6 +64,75 @@ Enhancements
5864

5965
- ``DataFrame`` and ``Series`` now have ``_constructor_expanddim`` property as overridable constructor for one higher dimensionality data. This should be used only when it is really needed, see :ref:`here <ref-subclassing-pandas>`
6066

67+
.. _whatsnew_0161.enhancements.categoricalindex:
68+
69+
CategoricalIndex
70+
^^^^^^^^^^^^^^^^
71+
72+
We introduce a ``CategoricalIndex``, a new type of index object that is useful for supporting
73+
indexing with duplicates. This is a container around a ``Categorical`` (introduced in v0.15.0)
74+
and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
75+
setting the index of a ``DataFrame/Series`` with a ``category`` dtype would convert this to a regular object-based ``Index``.
76+
77+
.. ipython :: python
78+
79+
df = DataFrame({'A' : np.arange(6),
80+
'B' : Series(list('aabbca')).astype('category',
81+
categories=list('cab'))
82+
})
83+
df
84+
df.dtypes
85+
df.B.cat.categories
86+
87+
Setting the index will create a ``CategoricalIndex``
88+
89+
.. ipython :: python
90+
91+
df2 = df.set_index('B')
92+
df2.index
93+
94+
Indexing with ``__getitem__/.iloc/.loc/.ix`` works similarly to an ``Index`` with duplicates.
95+
The indexers MUST be in the categories or the operation will raise.
96+
97+
.. ipython :: python
98+
99+
df2.loc['a']
100+
101+
and preserves the ``CategoricalIndex``
102+
103+
.. ipython :: python
104+
105+
df2.loc['a'].index
106+
107+
sorting will order by the order of the categories
108+
109+
.. ipython :: python
110+
111+
df2.sort_index()
112+
113+
groupby operations on the index will preserve the index nature as well
114+
115+
.. ipython :: python
116+
117+
df2.groupby(level=0).sum()
118+
df2.groupby(level=0).sum().index
119+
120+
Reindexing operations will return a resulting index based on the type of the passed
121+
indexer, meaning that passing a list will return a plain-old-``Index``; indexing with
122+
a ``Categorical`` will return a ``CategoricalIndex``, indexed according to the categories
123+
of the PASSED ``Categorical`` dtype. This allows one to arbitrarily index these even with
124+
values NOT in the categories, similarly to how you can reindex ANY pandas index.
125+
126+
.. ipython :: python
127+
128+
df2.reindex(['a','e'])
129+
df2.reindex(['a','e']).index
130+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
131+
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
132+
133+
See the :ref:`documentation <indexing.categoricalindex>` for more. (:issue:`7629`)
135+
61136
.. _whatsnew_0161.api:
62137

63138
API changes

pandas/core/api.py

+1-1
@@ -8,7 +8,7 @@
88
from pandas.core.categorical import Categorical
99
from pandas.core.groupby import Grouper
1010
from pandas.core.format import set_eng_float_format
11-
from pandas.core.index import Index, Int64Index, Float64Index, MultiIndex
11+
from pandas.core.index import Index, CategoricalIndex, Int64Index, Float64Index, MultiIndex
1212

1313
from pandas.core.series import Series, TimeSeries
1414
from pandas.core.frame import DataFrame

pandas/core/base.py

+4-2
@@ -121,7 +121,7 @@ def _delegate_method(self, name, *args, **kwargs):
121121
raise TypeError("You cannot call method {name}".format(name=name))
122122

123123
@classmethod
124-
def _add_delegate_accessors(cls, delegate, accessors, typ):
124+
def _add_delegate_accessors(cls, delegate, accessors, typ, overwrite=False):
125125
"""
126126
add accessors to cls from the delegate class
127127
@@ -131,6 +131,8 @@ def _add_delegate_accessors(cls, delegate, accessors, typ):
131131
delegate : the class to get methods/properties & doc-strings
132132
acccessors : string list of accessors to add
133133
typ : 'property' or 'method'
134+
overwrite : boolean, default False
135+
overwrite the method/property in the target class if it exists
134136
135137
"""
136138

@@ -164,7 +166,7 @@ def f(self, *args, **kwargs):
164166
f = _create_delegator_method(name)
165167

166168
# don't overwrite existing methods/properties
167-
if not hasattr(cls, name):
169+
if overwrite or not hasattr(cls, name):
168170
setattr(cls,name,f)
169171

170172
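The effect of the new ``overwrite`` flag can be sketched in plain Python. This is a simplified stand-in, not the pandas implementation: the ``Delegating`` class, the ``_make_forwarder`` helper, and the ``_target`` attribute are hypothetical names chosen for illustration, and only method delegation is shown.

```python
def _make_forwarder(delegate, name):
    """Build a method that forwards the call to ``self._target``."""
    def f(self, *args, **kwargs):
        return getattr(self._target, name)(*args, **kwargs)
    f.__name__ = name
    f.__doc__ = getattr(delegate, name).__doc__
    return f


class Delegating:
    """Hypothetical base class that forwards selected methods."""

    def __init__(self, target):
        self._target = target

    @classmethod
    def add_delegate_methods(cls, delegate, names, overwrite=False):
        for name in names:
            # Install the forwarder only when the name is free, or when
            # the caller explicitly asks to replace an existing attribute.
            if overwrite or not hasattr(cls, name):
                setattr(cls, name, _make_forwarder(delegate, name))


class Keep(Delegating):
    def upper(self):
        return "kept"


class Replace(Delegating):
    def upper(self):
        return "kept"


Keep.add_delegate_methods(str, ["upper", "lower"])
Replace.add_delegate_methods(str, ["upper", "lower"], overwrite=True)

keep_result = Keep("abc").upper()        # existing method survives
replace_result = Replace("abc").upper()  # existing method is replaced
lower_result = Keep("ABC").lower()       # free name is always delegated
```

With the default ``overwrite=False`` the existing ``upper`` is left untouched, which matches the behaviour before this change. Passing ``overwrite=True`` lets the delegated version win.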
