ENH: support CategoricalIndex (GH7629) #9741
cc @TomAugspurger So the main point to note here is that we don't have the concept of This is actually a good thing, it makes the entire discussion we had w.r.t. the groupby sort issue moot (well if you are grouping by a
are there operations that currently return say an
@jreback What happens if I try to insert new values into a categorical index that aren't already in the categories?
hmm, I think
This I could make work, but IIRC we decided to have this merge the categories (though in this case they are the same)....hmmm
so the appending operation works, but converts you to a
result._reset_identity()
return result

def equals(self, other):
Does CategoricalIndex(['a', 'b'], categories=['a', 'b']) == CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c']) return True? i.e. the values are the same but the categories (possible values) differ.
I just checked on Categoricals and we raise a TypeError if the categories aren't identical.
In [1]: c1 = pd.Categorical(['a', 'b'], categories=['a', 'b'])
In [2]: c2 = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
In [5]: c1 == c2
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-d8d43a43a02a> in <module>()
----> 1 c1 == c2
/Users/tom.augspurger/Envs/py3/lib/python3.4/site-packages/pandas-0.16.0_19_g8d2818e-py3.4-macosx-10.10-x86_64.egg/pandas/core/categorical.py in f(self, other)
38 if (len(self.categories) != len(other.categories)) or \
39 not ((self.categories == other.categories).all()):
---> 40 raise TypeError("Categoricals can only be compared if 'categories' are the same")
41 if not (self.ordered == other.ordered):
42 raise TypeError("Categoricals can only be compared if 'ordered' is the same")
TypeError: Categoricals can only be compared if 'categories' are the same
We should probably raise here too. Ohh, and maybe that's handled in self._data == other._data?
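For reference, a minimal sketch of the behavior under discussion, assuming a reasonably recent pandas. `raises_type_error` is a hypothetical helper written only for this illustration; `c1` and `c2` mirror the session above.

```python
import pandas as pd

c1 = pd.Categorical(['a', 'b'], categories=['a', 'b'])
c2 = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])


def raises_type_error(left, right):
    """Hypothetical helper: True if the elementwise `left == right` raises TypeError."""
    try:
        left == right
    except TypeError:
        return True
    return False


# Same values, different categories: the elementwise comparison is refused.
categorical_mismatch = raises_type_error(c1, c2)

# Index.equals() is a boolean check rather than an elementwise comparison,
# so it reports False for mismatched categories instead of raising.
ci1 = pd.CategoricalIndex(c1)
ci2 = pd.CategoricalIndex(c2)
same_index = ci1.equals(ci2)

# Comparing against the same labels held in an object Index still succeeds.
same_as_object = ci1.equals(ci1.astype(object))
```

The split between a raising `==` and a non-raising `equals` matches the test assertions quoted further down in this review.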
9c22b53
to
c1730ef
Compare
I now dispatch to Categorical for comparisons, providing conversions for Index and Categorical, but they must match in categories/ordered
I fixed the concat issues to be the same as concatenating columns (e.g. non-matching categories make this raise).
379eda0
to
918c01a
Compare
if is_categorical_dtype(dtype):
return self
elif is_object_dtype(dtype):
return np.array(self)
I don't like this, particularly because the dtype of the returned array will usually not be object. Why don't you simply do return np.array(self, dtype=dtype) and allow any valid numpy dtype?
that will cause numpy to error if dtype=='category'; or are you just talking about where I have is_object_dtype (and just make the return np.array(self, dtype=dtype))? That would seem to be ok
Yes, just for the cases when we already know it's not dtype='category'
fixed
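A small sketch of the suggestion in this thread: keep the categorical case separate and pass any other requested dtype straight to `np.array`. The function `to_array` is a hypothetical stand-in for illustration and is not the code in this PR.

```python
import numpy as np
import pandas as pd


def to_array(index, dtype):
    """Hypothetical helper mirroring the astype branch discussed above."""
    # NumPy does not understand 'category', so that case is handled first.
    if isinstance(dtype, str) and dtype == 'category':
        return index
    # Any valid NumPy dtype is forwarded unchanged.
    return np.array(index, dtype=dtype)


ci = pd.CategoricalIndex([1, 2, 1], categories=[1, 2])

as_float = to_array(ci, float)        # values converted to float64
as_object = to_array(ci, object)      # values boxed as Python objects
unchanged = to_array(ci, 'category')  # returned as-is
```

Forwarding `dtype` means the returned array has the dtype the caller asked for, which is the concern raised about the `is_object_dtype` branch.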
hash(key)
return key in self.categories

def __array__(self, result=None):
__array__ actually is supposed to take an optional dtype argument, not result, which should be passed on to the np.array call below and eventually on to Categorical, which should return an array of the appropriate type.
hmm, ok, we don't do this for Series, i'll fix for Categorical/CategoricalIndex here, maybe make a separate issue for this
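To illustrate the `__array__` protocol mentioned above, here is a toy sketch using only NumPy. `ToyCategorical` is a hypothetical class invented for this example and is not pandas' implementation; the extra `copy=None` parameter is included on the assumption that newer NumPy releases may pass it.

```python
import numpy as np


class ToyCategorical:
    """Hypothetical container: integer codes pointing into an array of categories."""

    def __init__(self, codes, categories):
        self.codes = np.asarray(codes)
        self.categories = np.asarray(categories)

    def __array__(self, dtype=None, copy=None):
        # Materialize the values by looking up each code in the categories.
        values = self.categories[self.codes]
        # Honor the dtype requested by the caller, if any.
        return values if dtype is None else values.astype(dtype)


toy = ToyCategorical(codes=[0, 1, 0], categories=[10, 20])

default_values = np.asarray(toy)             # keeps the categories' dtype
float_values = np.asarray(toy, dtype=float)  # dtype is forwarded to __array__
```

Because NumPy hands the requested dtype to `__array__`, the object can produce an array of the appropriate type itself.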
What about Categorical levels in a MultiIndex?
# not all labels in the categories
self.assertRaises(KeyError, lambda : self.df2.loc[['a','d']])

def test_reindexing(self):
@jreback you are my hero :)
did another read through -- looking pretty good to me!
self.assertTrue(ci1.equals(ci1))
self.assertFalse(ci1.equals(ci2))
self.assertTrue(ci1.equals(ci1.astype(object)))
self.assertTrue(ci1.astype(object).equals(ci1))
Should there also be a test for identical? (or is it already somewhere else?)
162dee9
to
d44e812
Compare
@shoyer @jorisvandenbossche @JanSchulz any other comments....going to merge
Looks good to me! On Sun, Apr 12, 2015 at 10:06 AM, jreback [email protected]
@@ -594,7 +594,94 @@ faster than fancy indexing.
timeit ser.ix[indexer]
timeit ser.take(indexer)

.. _indexing.float64index:
small issue: you removed the label of the "Float64Index" section below this
fixed
Another thing: you added the
Ah, but I see that you added
beac7d3
to
2f0953e
Compare
ok, marked
@jreback I know we already had some similar discussion before, but even if the docstring says that it is an "internal, non-public method", nevertheless, it will appear in tab completion, and there will be an api page for those methods in the documentation (as this is done automatically), making them de facto public. But I know it is a difficult discussion, and the line between a "public for other parts of pandas" and "really an internal helper function" is not always clear and easy to draw.
ok I made I left any more comments.......?
raise KeyError when accessing invalid elements setting elements not in the categories is equiv of .append() (which coerces to an Index)
ENH: support CategoricalIndex (GH7629)
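The KeyError behavior named in the commit message can be sketched as follows, assuming a reasonably recent pandas. `df2` is rebuilt here with made-up data, and `raises_key_error` is a hypothetical helper for this illustration only.

```python
import pandas as pd

# A frame indexed by a non-unique CategoricalIndex with categories c, a, b.
df2 = pd.DataFrame(
    {'A': range(6)},
    index=pd.CategoricalIndex(list('aabbca'), categories=list('cab'), name='B'),
)

# 'a' occurs three times in the index, so three rows are selected.
rows_a = df2.loc['a']


def raises_key_error(indexer):
    """Hypothetical helper: True if df2.loc[indexer] raises KeyError."""
    try:
        df2.loc[indexer]
    except KeyError:
        return True
    return False


# 'd' is not one of the categories, so both lookups are rejected.
missing_scalar = raises_key_error('d')
missing_in_list = raises_key_error(['a', 'd'])
```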
bombs away!
df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index

See the :ref:`documentation <advanced.categoricalindex>` for more. (:issue:`7629`)
>>>>>>> support CategoricalIndex
small leftover from rebasing
fixed
just a small issue in the whatsnew. But thanks a lot! It was an extensive, but good, discussion!
Woot! Thanks @jreback
closes #7629
xref #8613
xref #8074

- auto-create a CategoricalIndex when grouping by a Categorical (this doesn't ATM)
- df2.loc['d'] = 5 should do what? (currently will coerce to an Index)
- pd.concat([df2,df]) should STILL have a CategoricalIndex (yep)?
- min/max
- Categorical wrapper methods

A CategoricalIndex is essentially a drop-in replacement for Index, that works nicely for non-unique values. It uses a Categorical to represent itself. The behavior is very similar to using a duplicated Index (for say indexing).

Groupby works naturally (and returns another CategoricalIndex). The only real departure is that .sort_index() works like you would expect (which is a good thing :). Clearly this will provide idempotency for set/reset index w.r.t. Categoricals, and thus memory savings by its representation.

This doesn't change the API at all. IOW, this is not turned on by default, you have to either use set/reset, assign an index, or pass a Categorical to Index.
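
A short sketch of the opt-in paths and the sorting behavior described above, assuming a reasonably recent pandas. The frame and its column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': range(6),
    'B': pd.Categorical(list('aabbca'), categories=list('cab')),
})

# set_index on a categorical column yields a CategoricalIndex.
df2 = df.set_index('B')
index_type = type(df2.index).__name__

# sort_index follows the category order ('c', 'a', 'b'), not lexical order.
sorted_labels = list(df2.sort_index().index)

# reset_index restores the categorical column, so set/reset round-trips.
round_trip_dtype = str(df2.reset_index()['B'].dtype)

# Passing a Categorical to Index also produces a CategoricalIndex.
from_categorical = pd.Index(pd.Categorical(['x', 'y', 'x']))
```

Nothing here happens implicitly: the index only becomes categorical because the column (or the values passed to `Index`) were already categorical.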