ENH: support CategoricalIndex (GH7629) #9741

jreback · 2015-03-27T22:01:57Z

docs / whatsnew
~~auto-create a CategoricalIndex when grouping by a Categorical (this doesn't ATM)~~
adding a value not in the Index, e.g. df2.loc['d'] = 5 should do what? (currently will coerce to an Index)
pd.concat([df2,df]) should STILL have a CategoricalIndex (yep)?
implement min/max
fix groupby on cat column
add Categorical wrapper methods
make repr evalable / fix
contains should be on values not categories

A CategoricalIndex is essentially a drop-in replacement for Index that works nicely for non-unique values. It uses a Categorical to represent itself. The behavior is very similar to using a duplicated Index (say, for indexing).

Groupby works naturally (and returns another CategoricalIndex). The only real departure is that .sort_index() works like you would expect, sorting by category order (which is a good thing :)). This makes set/reset index idempotent w.r.t. Categoricals, and the representation saves memory.

This doesn't change the API at all. IOW, this is not turned on by default; you have to either use set/reset, assign an index, or pass a Categorical to Index.

In [1]: df = DataFrame({'A' : np.arange(6,dtype='int64'),
   ...:                         'B' : Series(list('aabbca')).astype('category',categories=list('cab')) })

In [2]: df
Out[2]: 
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [3]: df.dtypes
Out[3]: 
A       int64
B    category
dtype: object

In [5]: df.B.cat.categories
Out[5]: Index([u'c', u'a', u'b'], dtype='object')

In [6]: df2 = df.set_index('B')
In [7]: df2
Out[7]: 
   A
B   
a  0
a  1
b  2
b  3
c  4
a  5

In [8]: df2.index
Out[8]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], dtype='category')

In [9]: df2.index.categories
Out[9]: Index([u'c', u'a', u'b'], dtype='object')

In [10]: df2.index.codes     
Out[10]: array([1, 1, 2, 2, 0, 1], dtype=int8)

In [11]: df2.loc['a']
Out[11]: 
   A
B   
a  0
a  1
a  5

In [12]: df2.loc['a'].index 
Out[12]: CategoricalIndex([u'a', u'a', u'a'], dtype='category')

In [13]: df2.loc['a'].index.categories
Out[13]: Index([u'c', u'a', u'b'], dtype='object')

In [14]: df2.sort_index() 
Out[14]: 
   A
B   
c  4
a  0
a  1
a  5
b  2
b  3

In [15]: df2.groupby(level=0).sum()
Out[15]: 
   A
B   
a  6
b  5
c  4

In [16]: df2.groupby(level=0).sum().index
Out[16]: CategoricalIndex([u'a', u'b', u'c'], dtype='category')
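
The session above uses the 2015-era `astype('category', categories=...)` signature. As a hedged sketch, the same walkthrough can be reproduced on a recent pandas, assuming `pd.CategoricalDtype` is available (it is not part of this PR). The names `cat_type`, `rows_a`, and `totals` are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumption: a recent pandas, where the category order is given via CategoricalDtype.
cat_type = pd.CategoricalDtype(categories=list("cab"))
df = pd.DataFrame({
    "A": np.arange(6, dtype="int64"),
    "B": pd.Series(list("aabbca")).astype(cat_type),
})

# set_index on a categorical column yields a CategoricalIndex.
df2 = df.set_index("B")
index_type = type(df2.index).__name__

# The codes point into the categories ['c', 'a', 'b'].
codes = df2.index.codes.tolist()

# A label lookup returns every row carrying that label.
rows_a = df2.loc["a", "A"].tolist()

# sort_index orders rows by category position, not alphabetically.
sorted_labels = list(df2.sort_index().index)

# Group on the index level; observed=True is passed explicitly.
totals = df2.groupby(level=0, observed=True)["A"].sum().to_dict()
```

The point of the sketch is only that the index keeps its categories through lookup, sorting, and grouping.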

jreback · 2015-03-27T22:04:28Z

cc @TomAugspurger
cc @jorisvandenbossche
cc @JanSchulz
cc @shoyer
cc @mrocklin

So the main point to note here is that we don't have the concept of ordered in a CategoricalIndex (if you pass it, it's a ValueError), because by definition an Index IS ordered.

This is actually a good thing: it makes the entire discussion we had w.r.t. the groupby sort issue moot (well, if you are grouping by a CategoricalIndex anyhow).

jreback · 2015-03-27T22:11:04Z

Are there operations that currently return, say, an Index that really should return a CategoricalIndex? E.g. similarly to how pd.cut should return an IntervalIndex (when @shoyer finishes :)

shoyer · 2015-03-27T22:17:06Z

@jreback What happens if I try to insert new values into a categorical index that aren't already in the categories?

jreback · 2015-03-27T22:24:43Z

@shoyer

hmm, I think

df2.loc['d'] = 6 should raise (broken now)

jreback · 2015-03-27T22:26:37Z

In [24]: pd.concat([df2,df2])
Out[24]: 
   A
B   
a  0
a  1
b  2
b  3
c  4
a  6
a  0
a  1
b  2
b  3
c  4
a  6

In [25]: pd.concat([df2,df2]).index
Out[25]: Index([u'a', u'a', u'b', u'b', u'c', u'a', u'a', u'a', u'b', u'b', u'c', u'a'], dtype='object')

This I could make work, but IIRC we decided to have this merge the categories (though in this case they are the same)....hmmm

jreback · 2015-03-27T22:57:48Z

In [8]: df2 = DataFrame({'A' : np.arange(6,dtype='int64'),
   ...:                         'B' : Series(list('aabbca')).astype('category',categories=list('cab')) }).set_index('B')

In [9]: df2
Out[9]: 
   A
B   
a  0
a  1
b  2
b  3
c  4
a  5

In [10]: df2.index        
Out[10]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], dtype='category')

In [11]: df2.loc['d'] = 10

In [12]: df2
Out[12]: 
    A
a   0
a   1
b   2
b   3
c   4
a   5
d  10

In [13]: df2.index
Out[13]: Index([u'a', u'a', u'b', u'b', u'c', u'a', u'd'], dtype='object')

In [14]: df2.index = pd.CategoricalIndex(df2.index)

In [15]: df2.index
Out[15]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a', u'd'], dtype='category')

In [16]: df2.index.categories
Out[16]: Index([u'a', u'b', u'c', u'd'], dtype='object')

So the appending operation works, but converts you to an Index (and you lose the ordering by definition).
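
A small sketch of what "losing the ordering" means when the plain Index is wrapped again, assuming a recent pandas: without explicit categories they are inferred in sorted order, so the original `c, a, b` order has to be passed back in. The variable names are illustrative.

```python
import pandas as pd

# The plain Index left behind after the coercion shown above.
plain = pd.Index(list("aabbcad"))

# Rebuilding without categories infers them in sorted order.
inferred = pd.CategoricalIndex(plain)
inferred_categories = list(inferred.categories)

# Passing categories explicitly restores a chosen order (with 'd' appended).
explicit = pd.CategoricalIndex(plain, categories=list("cabd"))
explicit_categories = list(explicit.categories)
explicit_codes = explicit.codes.tolist()
```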

TomAugspurger · 2015-03-28T16:42:46Z

pandas/core/index.py

+        result._reset_identity()
+        return result
+
+    def equals(self, other):


Does CategoricalIndex(['a', 'b'], categories=['a', 'b']) == CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c']) return True? i.e. the values are the same but the categories (possible values) differ.

I just checked on Categoricals and we raise a TypeError if the categories aren't identical.

In [1]: c1 = pd.Categorical(['a', 'b'], categories=['a', 'b'])

In [2]: c2 = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])

In [5]: c1 == c2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-d8d43a43a02a> in <module>()
----> 1 c1 == c2

/Users/tom.augspurger/Envs/py3/lib/python3.4/site-packages/pandas-0.16.0_19_g8d2818e-py3.4-macosx-10.10-x86_64.egg/pandas/core/categorical.py in f(self, other)
     38             if (len(self.categories) != len(other.categories)) or \
     39                     not ((self.categories == other.categories).all()):
---> 40                 raise TypeError("Categoricals can only be compared if 'categories' are the same")
     41             if not (self.ordered == other.ordered):
     42                 raise TypeError("Categoricals can only be compared if 'ordered' is the same")

TypeError: Categoricals can only be compared if 'categories' are the same

We should probably raise here too. Ohh, and maybe that's handled in self._data == other._data?
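
For reference, a compact version of the check above that can be run as a script; a sketch assuming a recent pandas, where comparing Categoricals with different categories still raises TypeError. The flag name `raised` is illustrative.

```python
import pandas as pd

c1 = pd.Categorical(["a", "b"], categories=["a", "b"])
c2 = pd.Categorical(["a", "b"], categories=["a", "b", "c"])

# Same values, different sets of possible values: the comparison is rejected.
try:
    c1 == c2
    raised = False
except TypeError:
    raised = True

# Comparing against itself is fine and is elementwise.
self_mask = (c1 == c1).tolist()
```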

jreback · 2015-03-28T17:46:53Z

@TomAugspurger

I now dispatch to Categorical for comparisons, providing conversions for Index and Categorical, but they must match in categories/ordered

In [4]: ci1 = CategoricalIndex(['a', 'b'], categories=['a', 'b'])

In [5]: ci2 = CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'])

In [6]: ci1.equals(ci1)
Out[6]: True

# this is always safe (e.g. it returns boolean)
In [7]: ci1.equals(ci2)
Out[7]: False

In [8]: ci1==ci1
Out[8]: array([ True,  True], dtype=bool)

# not the same categories
In [9]: ci1==ci2
TypeError: categorical index comparisions must have the same categories

# scalar comparison
In [10]: ci1=='a'
Out[10]: array([ True, False], dtype=bool)

# ok as all values are in the categories
In [11]: ci1==Index(['a','b'])
Out[11]: array([ True,  True], dtype=bool)

# cannot compare vs unordered
In [13]: ci1 == pd.Categorical(ci1.values, ordered=False)
TypeError: categorical index comparisions must be ordered

# this is not allowed because 'c' is not a category
In [14]: ci1==Index(['a','b','c'])
TypeError: cannot compare versus non-convertible Index type
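
The distinction between `.equals()` (always returns a boolean) and `==` (elementwise, and strict about categories) can be sketched as follows. This assumes a recent pandas; the exact error messages shown above may have changed since, so only the exception type is checked.

```python
import pandas as pd

ci1 = pd.CategoricalIndex(["a", "b"], categories=["a", "b"])
ci2 = pd.CategoricalIndex(["a", "b"], categories=["a", "b", "c"])

# equals() is safe to call on anything: it answers True or False.
same = ci1.equals(ci1)
different = ci1.equals(ci2)

# A scalar comparison is elementwise.
scalar_mask = (ci1 == "a").tolist()

# Elementwise comparison across different categories is an error.
try:
    ci1 == ci2
    mismatch_raises = False
except TypeError:
    mismatch_raises = True
```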

jreback · 2015-03-28T17:52:54Z

I fixed the concat issues so that they behave the same as concatenating columns (e.g. non-matching categories make this raise).

In [15]:  a = Series(np.arange(6,dtype='int64'))

In [16]:  b = Series(list('aabbca'))

In [17]: df2 = DataFrame({'A' : a, 'B' : b.astype('category',categories=list('cab')) }).set_index('B')

# concat of same-categories is good
In [22]: df2.index
Out[22]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], dtype='category')

In [24]: pd.concat([df2,df2]).index
Out[24]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a', u'a', u'a', u'b', u'b', u'c', u'a'], dtype='category')

# concat of not-same categories is an error
In [25]: df3 = DataFrame({'A' : a, 'B' : b.astype('category',categories=list('abc')) }).set_index('B')

In [26]: df3.index.categories
Out[26]: Index([u'a', u'b', u'c'], dtype='object')

In [27]: df2.index.categories
Out[27]: Index([u'c', u'a', u'b'], dtype='object')

In [28]: pd.concat([df2,df3])
TypeError: categories must match existing categories when appending
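
A sketch of the same-categories case on a recent pandas, where the concatenated frame keeps its CategoricalIndex. The mismatched case is deliberately left out, since its behavior may differ between pandas versions. `cat_type` and `combined` are illustrative names.

```python
import numpy as np
import pandas as pd

# Assumption: a recent pandas with CategoricalDtype.
cat_type = pd.CategoricalDtype(categories=list("cab"))
idx = pd.CategoricalIndex(list("aabbca"), dtype=cat_type)
df2 = pd.DataFrame({"A": np.arange(6, dtype="int64")}, index=idx)

# Both inputs share identical categories, so the index type survives.
combined = pd.concat([df2, df2])
combined_is_categorical = isinstance(combined.index, pd.CategoricalIndex)
combined_categories = list(combined.index.categories)
```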

shoyer · 2015-03-28T21:51:01Z

pandas/core/categorical.py

+        if is_categorical_dtype(dtype):
+            return self
+        elif is_object_dtype(dtype):
+            return np.array(self)


I don't like this, particularly because the dtype of the returned array will usually not be object.

Why don't you simply do return np.array(self, dtype=dtype) and allow any valid numpy dtype?

That will cause numpy to error if dtype=='category'.

Or are you just talking about where I have is_object_dtype (and just making that return np.array(self, dtype=dtype))? That would seem to be ok.

Yes, just for the cases when we already know it's not dtype='category'
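
To illustrate the concern: converting a Categorical to an array gives the dtype of its categories unless a dtype is requested. A sketch assuming a recent pandas and NumPy; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1, 2, 1])

# With no dtype requested, the array takes the dtype of the categories (integers here).
default_array = np.asarray(cat)

# A requested NumPy dtype is honored.
float_array = np.asarray(cat, dtype="float64")
object_array = np.asarray(cat, dtype=object)
```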

shoyer · 2015-03-28T22:53:46Z

pandas/core/index.py

+        hash(key)
+        return key in self.categories
+
+    def __array__(self, result=None):


__array__ actually is supposed to take an optional dtype argument, not result, which should be passed on to the np.array call below and eventually on to Categorical, which should return an array of the appropriate type.

hmm, ok, we don't do this for Series, i'll fix for Categorical/CategoricalIndex here, maybe make a separate issue for this
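
A minimal sketch of the `__array__` protocol with an optional dtype argument. `Labels` is a hypothetical container written for this illustration, not pandas code; the `copy` parameter is accepted only because newer NumPy releases may pass it.

```python
import numpy as np


class Labels:
    """Hypothetical codes-plus-categories container (not the pandas implementation)."""

    def __init__(self, codes, categories):
        self._codes = np.asarray(codes)
        self._categories = np.asarray(categories)

    def __array__(self, dtype=None, copy=None):
        # Materialize the values, then honor the requested dtype if one was given.
        values = self._categories[self._codes]
        return values if dtype is None else values.astype(dtype)


labels = Labels([0, 1, 0], [10, 20])
as_default = np.asarray(labels)
as_float = np.asarray(labels, dtype="float64")
```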

shoyer · 2015-03-28T23:03:30Z

What about Categorical levels in a MultiIndex?

shoyer · 2015-04-11T02:32:34Z

pandas/tests/test_indexing.py

+        # not all labels in the categories
+        self.assertRaises(KeyError, lambda : self.df2.loc[['a','d']])
+
+    def test_reindexing(self):


@jreback you are my hero :)
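
The behavior covered by that test can be sketched on a recent pandas as follows (list lookups with a missing label raise KeyError); `idx` and `raised` are illustrative names.

```python
import pandas as pd

idx = pd.CategoricalIndex(list("aabbca"), categories=list("cab"))
df2 = pd.DataFrame({"A": range(6)}, index=idx)

# 'd' is not among the categories, so the list lookup fails.
try:
    df2.loc[["a", "d"]]
    raised = False
except KeyError:
    raised = True

# A list of labels that are all present works and keeps duplicates.
found = df2.loc[["a"], "A"].tolist()
```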

shoyer · 2015-04-11T02:34:32Z

did another read through -- looking pretty good to me!

jorisvandenbossche · 2015-04-11T11:10:19Z

pandas/tests/test_index.py

+        self.assertTrue(ci1.equals(ci1))
+        self.assertFalse(ci1.equals(ci2))
+        self.assertTrue(ci1.equals(ci1.astype(object)))
+        self.assertTrue(ci1.astype(object).equals(ci1))


Should there also be a test for identical? (or is it already somewhere else?)
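
A sketch of what such a check could look like, assuming a recent pandas where `Index.identical` also compares names and types in addition to `equals`. This is an illustration, not the test added in the PR.

```python
import pandas as pd

ci1 = pd.CategoricalIndex(["a", "b"], categories=["a", "b"])
ci2 = pd.CategoricalIndex(["a", "b"], categories=["a", "b", "c"])

# Identical to itself and to an unmodified copy.
same_object = ci1.identical(ci1)
copy_identical = ci1.identical(ci1.copy())

# A different name or different categories breaks identity.
renamed_identical = ci1.identical(ci1.rename("letters"))
other_categories_identical = ci1.identical(ci2)
```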

jreback · 2015-04-12T14:06:11Z

@shoyer @jorisvandenbossche @JanSchulz any other comments....going to merge

shoyer · 2015-04-14T04:32:02Z

Looks good to me!


jorisvandenbossche · 2015-04-14T08:33:28Z

doc/source/advanced.rst

@@ -594,7 +594,94 @@ faster than fancy indexing.
   timeit ser.ix[indexer]
   timeit ser.take(indexer)

-.. _indexing.float64index:


small issue: you removed the label of the "Float64Index" section below this

jorisvandenbossche · 2015-04-14T15:36:17Z

Another thing: you added the reindex_non_unique and can_reindex methods to the Index API. I would vote for either keeping them private or, if we want them public, documenting them and listing them as an enhancement.

can_reindex does not seem like something that needs to be public, I think?
reindex_non_unique seems to be used within reindex, so maybe it's also not necessary to have it public?

jorisvandenbossche · 2015-04-14T15:38:28Z

Ah, but I see that you added reindex_non_unique explicitly to the api.rst page. What is the use case of this method over plain reindex?

jreback · 2015-04-16T13:45:25Z

ok, marked reindex_non_unique/can_reindex as internal methods (and removed from api.rst).
I just prefer that a method which is called internally across pandas, but is not for public consumption, not have a leading underscore. These are only called in core/indexing.py and core/internals.py.

jorisvandenbossche · 2015-04-16T19:37:53Z

@jreback I know we already had a similar discussion before, but even if the docstring says that it is an "internal, non-public method", it will still appear in tab completion, and there will be an API page for those methods in the documentation (as this is generated automatically), making them de facto public.
Besides that, we already have rather too many methods on our objects exposed to the user, so I really don't like adding more if they are not supposed to be used.

But I know it is a difficult discussion, and the line between "public for other parts of pandas" and "really an internal helper function" is not always clear or easy to draw.
But as these are not used that much within pandas itself, I would opt for truly internal methods.
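
The tab-completion point can be sketched with a name-based filter similar in spirit to what completers and doc generators commonly apply. `ToyIndex` and `public_names` are hypothetical names for this illustration, not pandas code.

```python
class ToyIndex:
    """Hypothetical stand-in for an Index class (not pandas code)."""

    def reindex(self, target):
        return list(target)

    def reindex_non_unique(self, target):
        # No leading underscore: shows up as if it were public.
        return list(target)

    def _can_reindex(self, indexer):
        # Leading underscore: hidden by the filter below.
        return True


def public_names(obj):
    """List attribute names that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


visible = public_names(ToyIndex())
```

Here `reindex_non_unique` is listed next to `reindex`, while `_can_reindex` is not, whatever the docstring says.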

jreback · 2015-04-16T22:46:26Z

ok, I made them _can_reindex/_reindex_non_unique.

I left join (which is I think public, even though it says internal).

any more comments.......?

jreback · 2015-04-20T11:20:04Z

bombs away!

jorisvandenbossche · 2015-04-20T12:32:09Z

doc/source/whatsnew/v0.16.1.txt

+   df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
+
+See the :ref:`documentation <advanced.categoricalindex>` for more. (:issue:`7629`)
+>>>>>>> support CategoricalIndex


small leftover from rebasing

jorisvandenbossche · 2015-04-20T12:33:15Z

just a small issue in the whatsnew

But thanks a lot! It was an extensive, but a good discussion!

mrocklin · 2015-04-20T14:35:36Z

Woot! Thanks @jreback

jreback added the Enhancement, Indexing, and Categorical labels Mar 27, 2015

jreback added this to the 0.16.1 milestone Mar 27, 2015

TomAugspurger reviewed Mar 28, 2015

jreback force-pushed the ci branch 2 times, most recently from 9c22b53 to c1730ef Compare March 28, 2015 17:42

jreback force-pushed the ci branch 3 times, most recently from 379eda0 to 918c01a Compare March 28, 2015 19:29

shoyer reviewed Mar 28, 2015

jreback force-pushed the ci branch from 71a5737 to 1ee7a2a Compare March 28, 2015 22:44

shoyer reviewed Mar 28, 2015

shoyer reviewed Apr 11, 2015

jorisvandenbossche reviewed Apr 11, 2015

jreback mentioned this pull request Apr 11, 2015

Add key to sorting functions #3942

Closed

jreback force-pushed the ci branch 2 times, most recently from 162dee9 to d44e812 Compare April 11, 2015 20:14

jreback force-pushed the ci branch from d44e812 to d6c5d04 Compare April 13, 2015 13:43

jorisvandenbossche reviewed Apr 14, 2015

jreback force-pushed the ci branch 2 times, most recently from beac7d3 to 2f0953e Compare April 16, 2015 13:43

jreback force-pushed the ci branch from 2f0953e to 501cd93 Compare April 16, 2015 14:05

mrocklin mentioned this pull request Apr 16, 2015

pframe: allow string column index dask/dask#156

Closed

jreback force-pushed the ci branch from 501cd93 to 61bd9ca Compare April 16, 2015 22:40

support CategoricalIndex

ecf8514

raise KeyError when accessing invalid elements
setting elements not in the categories is equiv of .append() (which coerces to an Index)

jreback force-pushed the ci branch from 61bd9ca to ecf8514 Compare April 20, 2015 11:19

jreback added a commit that referenced this pull request Apr 20, 2015

Merge pull request #9741 from jreback/ci

fa7c29e

ENH: support CategoricalIndex (GH7629)

jreback merged commit fa7c29e into pandas-dev:master Apr 20, 2015

jorisvandenbossche reviewed Apr 20, 2015

shoyer mentioned this pull request Apr 21, 2015

Add a more memory-efficient RangeIndex-sort of thing to avoid large arange(N) indexes in some cases #939

Closed

stanwest mentioned this pull request Oct 25, 2021

Group by a categorical Series of unequal length #44180

Merged

4 tasks
