ENH: support CategoricalIndex (GH7629) #9741

Merged
merged 1 commit into from
Apr 20, 2015

Conversation

jreback
Contributor

@jreback jreback commented Mar 27, 2015

closes #7629
xref #8613
xref #8074

  • docs / whatsnew
  • auto-create a CategoricalIndex when grouping by a Categorical (this doesn't ATM)
  • adding a value not in the Index, e.g. df2.loc['d'] = 5 should do what? (currently will coerce to an Index)
  • pd.concat([df2,df]) should STILL have a CategoricalIndex (yep)?
  • implement min/max
  • fix groupby on cat column
  • add Categorical wrapper methods
  • make repr evalable / fix
  • contains should be on values not categories

A CategoricalIndex is essentially a drop-in replacement for Index, that works nicely for non-unique values. It uses a Categorical to represent itself. The behavior is very similar to using a duplicated Index (for say indexing).

Groupby works naturally (and returns another CategoricalIndex). The only real departure is that .sort_index() works as you would expect, sorting by category order (which is a good thing :). This also makes set/reset index idempotent w.r.t. Categoricals, and the representation saves memory.

This doesn't change the API at all. IOW, this is not turned on by default; you have to either use set/reset, assign an index, or pass a Categorical to Index.

In [1]: df = DataFrame({'A' : np.arange(6,dtype='int64'),
   ...:                         'B' : Series(list('aabbca')).astype('category',categories=list('cab')) })

In [2]: df
Out[2]: 
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [3]: df.dtypes
Out[3]: 
A       int64
B    category
dtype: object

In [5]: df.B.cat.categories
Out[5]: Index([u'c', u'a', u'b'], dtype='object')

In [6]: df2 = df.set_index('B')
In [7]: df2
Out[7]: 
   A
B   
a  0
a  1
b  2
b  3
c  4
a  5

In [8]: df2.index
Out[8]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], dtype='category')

In [9]: df2.index.categories
Out[9]: Index([u'c', u'a', u'b'], dtype='object')

In [10]: df2.index.codes     
Out[10]: array([1, 1, 2, 2, 0, 1], dtype=int8)

In [11]: df2.loc['a']
Out[11]: 
   A
B   
a  0
a  1
a  5

In [12]: df2.loc['a'].index 
Out[12]: CategoricalIndex([u'a', u'a', u'a'], dtype='category')

In [13]: df2.loc['a'].index.categories
Out[13]: Index([u'c', u'a', u'b'], dtype='object')

In [14]: df2.sort_index() 
Out[14]: 
   A
B   
c  4
a  0
a  1
a  5
b  2
b  3

In [15]: df2.groupby(level=0).sum()
Out[15]: 
   A
B   
a  6
b  5
c  4

In [16]: df2.groupby(level=0).sum().index
Out[16]: CategoricalIndex([u'a', u'b', u'c'], dtype='category')
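
The session above uses the astype('category', categories=...) spelling from the pandas version under development at the time. As a hedged sketch for later pandas releases, where that keyword form of astype is no longer accepted, the same frame can be built with pd.CategoricalDtype. The variable names are illustrative, and observed=True is passed to groupby only so the sketch does not depend on that parameter's default.

```python
import numpy as np
import pandas as pd

# Fix the category order explicitly: 'c', then 'a', then 'b'.
cat_type = pd.CategoricalDtype(categories=list("cab"))

df = pd.DataFrame({
    "A": np.arange(6, dtype="int64"),
    "B": pd.Series(list("aabbca")).astype(cat_type),
})

# Setting a categorical column as the index yields a CategoricalIndex.
df2 = df.set_index("B")
index_type = type(df2.index).__name__
codes = df2.index.codes.tolist()

# sort_index() follows the category order, not alphabetical order.
sorted_labels = df2.sort_index().index.tolist()

# Grouping on the index level produces one row per category.
totals = df2.groupby(level=0, observed=True).sum()["A"].to_dict()
```

Note that in later releases the grouped result is ordered by category (c, a, b) rather than alphabetically, so the sketch compares the totals as a dict instead of relying on row order.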

@jreback jreback added Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Categorical Categorical Data Type labels Mar 27, 2015
@jreback jreback added this to the 0.16.1 milestone Mar 27, 2015
@jreback
Contributor Author

jreback commented Mar 27, 2015

cc @TomAugspurger
cc @jorisvandenbossche
cc @JanSchulz
cc @shoyer
cc @mrocklin

So the main point to note here is that we don't have the concept of ordered in a CategoricalIndex (if you pass it, it's a ValueError), because by definition an Index IS ordered.

This is actually a good thing: it makes the entire discussion we had w.r.t. the groupby sort issue moot (well, if you are grouping by a CategoricalIndex anyhow).

@jreback
Contributor Author

jreback commented Mar 27, 2015

Are there operations that currently return, say, an Index that really should return a CategoricalIndex? E.g. similarly to how pd.cut should return an IntervalIndex (when @shoyer finishes :)

@shoyer
Copy link
Member

shoyer commented Mar 27, 2015

@jreback What happens if I try to insert new values into a categorical index that aren't already in the categories?

@jreback
Contributor Author

jreback commented Mar 27, 2015

@shoyer

hmm, I think

df2.loc['d'] = 6 should raise (broken now)

@jreback
Contributor Author

jreback commented Mar 27, 2015

In [24]: pd.concat([df2,df2])
Out[24]: 
   A
B   
a  0
a  1
b  2
b  3
c  4
a  6
a  0
a  1
b  2
b  3
c  4
a  6

In [25]: pd.concat([df2,df2]).index
Out[25]: Index([u'a', u'a', u'b', u'b', u'c', u'a', u'a', u'a', u'b', u'b', u'c', u'a'], dtype='object')

This I could make work, but IIRC we decided to have this merge the categories (though in this case they are the same)....hmmm

@jreback
Contributor Author

jreback commented Mar 27, 2015

In [8]: df2 = DataFrame({'A' : np.arange(6,dtype='int64'),
   ...:                         'B' : Series(list('aabbca')).astype('category',categories=list('cab')) }).set_index('B')

In [9]: df2
Out[9]: 
   A
B   
a  0
a  1
b  2
b  3
c  4
a  5

In [10]: df2.index        
Out[10]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], dtype='category')

In [11]: df2.loc['d'] = 10

In [12]: df2
Out[12]: 
    A
a   0
a   1
b   2
b   3
c   4
a   5
d  10

In [13]: df2.index
Out[13]: Index([u'a', u'a', u'b', u'b', u'c', u'a', u'd'], dtype='object')

In [14]: df2.index = pd.CategoricalIndex(df2.index)

In [15]: df2.index
Out[15]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a', u'd'], dtype='category')

In [16]: df2.index.categories
Out[16]: Index([u'a', u'b', u'c', u'd'], dtype='object')

So the appending operation works, but converts you to an Index (and you lose the category ordering by definition).
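
Re-wrapping the coerced Index in a CategoricalIndex infers the categories in sorted order, as Out[16] shows. A minimal sketch of keeping the original order instead, assuming the new label 'd' should simply be placed last (that placement is a choice made for this example, not something the PR prescribes):

```python
import pandas as pd

# Labels as they look after the coercion to a plain object Index.
labels = pd.Index(list("aabbcad"), dtype=object)

# Letting pandas infer the categories gives sorted order: a, b, c, d.
inferred = pd.CategoricalIndex(labels)

# Passing categories explicitly keeps 'c', 'a', 'b' and appends 'd'.
explicit = pd.CategoricalIndex(labels, categories=list("cabd"))

inferred_categories = inferred.categories.tolist()
explicit_categories = explicit.categories.tolist()
explicit_codes = explicit.codes.tolist()
```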

result._reset_identity()
return result

def equals(self, other):
Contributor

Does CategoricalIndex(['a', 'b'], categories=['a', 'b']) == CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c']) return True? i.e. the values are the same but the categories (possible values) differ.

I just checked on Categoricals and we raise a TypeError if the categories aren't identical.

In [1]: c1 = pd.Categorical(['a', 'b'], categories=['a', 'b'])

In [2]: c2 = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])

In [5]: c1 == c2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-d8d43a43a02a> in <module>()
----> 1 c1 == c2

/Users/tom.augspurger/Envs/py3/lib/python3.4/site-packages/pandas-0.16.0_19_g8d2818e-py3.4-macosx-10.10-x86_64.egg/pandas/core/categorical.py in f(self, other)
     38             if (len(self.categories) != len(other.categories)) or \
     39                     not ((self.categories == other.categories).all()):
---> 40                 raise TypeError("Categoricals can only be compared if 'categories' are the same")
     41             if not (self.ordered == other.ordered):
     42                 raise TypeError("Categoricals can only be compared if 'ordered' is the same")

TypeError: Categoricals can only be compared if 'categories' are the same

We should probably raise here too. Ohh, and maybe that's handled in self._data == other._data?

@jreback jreback force-pushed the ci branch 2 times, most recently from 9c22b53 to c1730ef Compare March 28, 2015 17:42
@jreback
Contributor Author

jreback commented Mar 28, 2015

@TomAugspurger

I now dispatch to Categorical for comparisons, providing conversions for Index and Categorical, but they must match in categories/ordered

In [4]: ci1 = CategoricalIndex(['a', 'b'], categories=['a', 'b'])

In [5]: ci2 = CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'])

In [6]: ci1.equals(ci1)
Out[6]: True

# this is always safe (e.g. it returns boolean)
In [7]: ci1.equals(ci2)
Out[7]: False

In [8]: ci1==ci1
Out[8]: array([ True,  True], dtype=bool)

# not the same categories
In [9]: ci1==ci2
TypeError: categorical index comparisions must have the same categories

# scalar comparison
In [10]: ci1=='a'
Out[10]: array([ True, False], dtype=bool)

# ok as all values are in the categories
In [11]: ci1==Index(['a','b'])
Out[11]: array([ True,  True], dtype=bool)

# cannot compare vs unordered
In [13]: ci1 == pd.Categorical(ci1.values, ordered=False)
TypeError: categorical index comparisions must be ordered

# this is not allowed because 'c' is not a category
In [14]: ci1==Index(['a','b','c'])
TypeError: cannot compare versus non-convertible Index type
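
The distinction between equals (always returns a boolean) and == (element-wise, and strict about categories) can be reproduced as a small script. This is a sketch only: it catches TypeError without asserting on the message, since the exact wording differs between pandas versions.

```python
import pandas as pd

ci1 = pd.CategoricalIndex(["a", "b"], categories=["a", "b"])
ci2 = pd.CategoricalIndex(["a", "b"], categories=["a", "b", "c"])

# equals() is safe to call on anything and returns a plain bool.
equals_self = ci1.equals(ci1)
equals_other = ci1.equals(ci2)

# == is element-wise; comparing against a scalar is allowed.
elementwise_self = (ci1 == ci1).tolist()
scalar_match = (ci1 == "a").tolist()

# == between indexes whose categories differ raises instead.
try:
    ci1 == ci2
    mismatch_raised = False
except TypeError:
    mismatch_raised = True
```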

@jreback
Contributor Author

jreback commented Mar 28, 2015

I fixed the concat issues to be the same as concatenating columns (e.g. non-matching categories make this raise).

In [15]:  a = Series(np.arange(6,dtype='int64'))

In [16]:  b = Series(list('aabbca'))

In [17]: df2 = DataFrame({'A' : a, 'B' : b.astype('category',categories=list('cab')) }).set_index('B')

# concat of same-categories is good
In [22]: df2.index
Out[22]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], dtype='category')

In [24]: pd.concat([df2,df2]).index
Out[24]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a', u'a', u'a', u'b', u'b', u'c', u'a'], dtype='category')

# concat of not-same categories is an error
In [25]: df3 = DataFrame({'A' : a, 'B' : b.astype('category',categories=list('abc')) }).set_index('B')

In [26]: df3.index.categories
Out[26]: Index([u'a', u'b', u'c'], dtype='object')

In [27]: df2.index.categories
Out[27]: Index([u'c', u'a', u'b'], dtype='object')

In [28]: pd.concat([df2,df3])
TypeError: categories must match existing categories when appending
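
A sketch of the same-categories case using pd.CategoricalDtype, for pandas releases that no longer accept astype('category', categories=...). It deliberately covers only the matching case: how concat treats mismatched categories has changed across pandas versions, so the TypeError shown above should not be assumed to hold everywhere.

```python
import numpy as np
import pandas as pd

cat_type = pd.CategoricalDtype(categories=list("cab"))

df2 = pd.DataFrame({
    "A": np.arange(6, dtype="int64"),
    "B": pd.Series(list("aabbca")).astype(cat_type),
}).set_index("B")

# Both inputs share identical categories, so the index type survives.
combined = pd.concat([df2, df2])
combined_type = type(combined.index).__name__
combined_categories = combined.index.categories.tolist()
```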

@jreback jreback force-pushed the ci branch 3 times, most recently from 379eda0 to 918c01a Compare March 28, 2015 19:29
if is_categorical_dtype(dtype):
    return self
elif is_object_dtype(dtype):
    return np.array(self)
Member

I don't like this, particularly because the dtype of the returned array will usually not be object.

Why don't you simply do return np.array(self, dtype=dtype) and allow any valid numpy dtype?

Contributor Author

that will cause numpy to error if dtype=='category';

or are you just talking about where I have is_object_dtype (and just making that return np.array(self, dtype=dtype))? That would seem to be ok

Member

Yes, just for the cases when we already know it's not dtype='category'

Contributor Author

fixed
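
The concern in this thread, that converting to an array does not necessarily give an object array, can be illustrated with integer categories. This is a sketch against a current pandas and NumPy, not the code from the PR:

```python
import numpy as np
import pandas as pd

ci = pd.CategoricalIndex([1, 2, 1])

# With no dtype requested, the array takes the dtype of the categories.
default_kind = np.asarray(ci).dtype.kind

# Any valid NumPy dtype can be requested explicitly.
as_float = np.asarray(ci, dtype="float64")

# astype(object) leaves the categorical world and returns a plain Index.
as_object = ci.astype(object)
```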

hash(key)
return key in self.categories

def __array__(self, result=None):
Member

__array__ actually is supposed to take an optional dtype argument, not result, which should be passed on to the np.array call below and eventually on to Categorical, which should return an array of the appropriate type.

Contributor Author

hmm, ok, we don't do this for Series, i'll fix for Categorical/CategoricalIndex here, maybe make a separate issue for this
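
A minimal sketch of the __array__ protocol being discussed, using a hypothetical TinyCategorical class (not part of pandas) that stores codes and categories. The optional copy parameter is included only because newer NumPy versions may pass it; it is ignored here.

```python
import numpy as np


class TinyCategorical:
    """Hypothetical stand-in for a categorical container."""

    def __init__(self, codes, categories):
        self.codes = np.asarray(codes)
        self.categories = np.asarray(categories)

    def __array__(self, dtype=None, copy=None):
        # Materialize the values by looking up each code in the categories.
        values = self.categories[self.codes]
        # Honor the dtype that NumPy (or the caller) asked for, if any.
        if dtype is not None:
            values = values.astype(dtype)
        return values


tiny = TinyCategorical(codes=[1, 1, 0], categories=[10, 20])
default_values = np.asarray(tiny)
float_values = np.asarray(tiny, dtype="float64")
```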

@shoyer
Member

shoyer commented Mar 28, 2015

What about Categorical levels in a MultiIndex?

# not all labels in the categories
self.assertRaises(KeyError, lambda : self.df2.loc[['a','d']])

def test_reindexing(self):
Member

@jreback you are my hero :)
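
The assertRaises line in the test context above checks that selecting a label outside the categories is a KeyError. A standalone sketch of that check, written for a current pandas with pd.CategoricalDtype (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

cat_type = pd.CategoricalDtype(categories=list("cab"))
df2 = pd.DataFrame({
    "A": np.arange(6, dtype="int64"),
    "B": pd.Series(list("aabbca")).astype(cat_type),
}).set_index("B")

# 'a' exists in the index, so this selects three rows.
rows_for_a = len(df2.loc[["a"]])

# 'd' is neither a label nor a category, so the lookup fails.
try:
    df2.loc[["a", "d"]]
    missing_raised = False
except KeyError:
    missing_raised = True
```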

@shoyer
Member

shoyer commented Apr 11, 2015

did another read through -- looking pretty good to me!

self.assertTrue(ci1.equals(ci1))
self.assertFalse(ci1.equals(ci2))
self.assertTrue(ci1.equals(ci1.astype(object)))
self.assertTrue(ci1.astype(object).equals(ci1))
Member

Should there also be a test for identical? (or is it already somewhere else?)
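
For context on the question, equals compares values only, while identical also requires the same index type and metadata. A sketch of the difference, not the test that was added to the PR:

```python
import pandas as pd

ci1 = pd.CategoricalIndex(["a", "b"], categories=["a", "b"])
as_object = ci1.astype(object)

# Same values, so equals() is True in both directions.
equals_forward = ci1.equals(as_object)
equals_backward = as_object.equals(ci1)

# Different index types, so identical() is False.
identical_to_object = ci1.identical(as_object)

# A copy keeps the type, dtype and name, so identical() is True.
identical_to_copy = ci1.identical(ci1.copy())
```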

@jreback jreback force-pushed the ci branch 2 times, most recently from 162dee9 to d44e812 Compare April 11, 2015 20:14
@jreback
Contributor Author

jreback commented Apr 12, 2015

@shoyer @jorisvandenbossche @JanSchulz any other comments....going to merge

@shoyer
Member

shoyer commented Apr 14, 2015

Looks good to me!

@@ -594,7 +594,94 @@ faster than fancy indexing.
timeit ser.ix[indexer]
timeit ser.take(indexer)

.. _indexing.float64index:
Member

small issue: you removed the label of the "Float64Index" section below this

Contributor Author

fixed

@jorisvandenbossche
Member

Another thing: you added the reindex_non_unique and can_reindex methods to the Index API. I would vote for either keeping them private or, if we want them public, documenting them and adding them as an enhancement.

can_reindex does not seem like something that needs to be public I think?
reindex_non_unique seems to be used within reindex, so maybe also not necessary to have it public?

@jorisvandenbossche
Member

Ah, but I see that you added reindex_non_unique explicitly to the api.rst page. What is the use case of this method over plain reindex?

@jreback jreback force-pushed the ci branch 2 times, most recently from beac7d3 to 2f0953e Compare April 16, 2015 13:43
@jreback
Contributor Author

jreback commented Apr 16, 2015

ok, marked reindex_non_unique/can_reindex as internal methods (and removed from api.rst).
I just prefer that a method called internally across pandas not have a leading underscore, even though it is not for public consumption. These are only called in core/indexing.py and core/internals.py.

@jorisvandenbossche
Member

@jreback I know we already had some similar discussion before, but even if the docstring says that it is an "internal, non-public method", nevertheless, it will appear in tab completion, and there will be an api page for those methods in the documentation (as this is done automatically), making them de facto public.
Besides that, we already have rather too many methods on our objects exposed to the user, so I really don't like adding more if they are not supposed to be used.

But I know it is a difficult discussion, and the line between a "public for other parts of pandas" and "really an internal helper function" is not always clear and easy to draw.
But as these are not used that much within pandas itself, I would opt for truly internal methods.
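
A small sketch of why a docstring alone does not make a method private: tools that list an object's public surface typically filter on the leading underscore and nothing else. DemoIndex and public_names are hypothetical names for this illustration and are not part of pandas.

```python
class DemoIndex:
    """Hypothetical class used only to illustrate naming."""

    def reindex(self):
        return "public"

    def can_reindex(self):
        """Internal, non-public method (but still listed as public)."""
        return True

    def _reindex_non_unique(self):
        return "internal"


def public_names(obj):
    # Mimic a simple completer: keep names without a leading underscore.
    return sorted(name for name in dir(obj) if not name.startswith("_"))


names = public_names(DemoIndex())
```

Here can_reindex shows up next to reindex despite its docstring, while _reindex_non_unique is filtered out.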

@jreback
Contributor Author

jreback commented Apr 16, 2015

ok I made _can_reindex/_reindex_non_unique.

I left join (which is I think public, even though it says internal).

any more comments.......?

raise KeyError when accessing invalid elements
setting elements not in the categories is equiv of .append() (which coerces to an Index)
jreback added a commit that referenced this pull request Apr 20, 2015
ENH: support CategoricalIndex (GH7629)
@jreback jreback merged commit fa7c29e into pandas-dev:master Apr 20, 2015
@jreback
Contributor Author

jreback commented Apr 20, 2015

bombs away!

df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index

See the :ref:`documentation <advanced.categoricalindex>` for more. (:issue:`7629`)
>>>>>>> support CategoricalIndex
Member

small leftover from rebasing

Contributor Author

fixed
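
The whatsnew line in the diff context above reindexes with a Categorical. A hedged sketch of how the type of the indexer shapes the resulting index, assuming a frame with a unique index (df3 is an illustrative name; current pandas refuses to reindex an axis with duplicate labels, so df2 is not reused here):

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({
    "A": np.arange(3),
    "B": pd.Series(list("abc")).astype("category"),
}).set_index("B")

# Reindexing with a list gives back a plain Index.
by_list = df3.reindex(["a", "e"])
list_index_type = type(by_list.index).__name__

# Reindexing with a Categorical gives back a CategoricalIndex.
by_categorical = df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
categorical_index_type = type(by_categorical.index).__name__

# 'e' was not in the original index, so its row is filled with NaN.
missing_mask = by_list["A"].isna().tolist()
```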

@jorisvandenbossche
Member

just a small issue in the whatsnew

But thanks a lot! It was an extensive, but a good discussion!

@mrocklin
Contributor

Woot! Thanks @jreback

Development

Successfully merging this pull request may close these issues.

Implement CategoricalIndex
5 participants