-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Ensure Index.astype('category') returns a CategoricalIndex #18677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -1934,7 +1934,7 @@ def pandas_dtype(dtype): | |||
except TypeError: | |||
pass | |||
|
|||
elif dtype.startswith('interval[') or dtype.startswith('Interval['): | |||
elif dtype.startswith('interval') or dtype.startswith('Interval'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this because switching to pandas_dtype
caused a test to break since it was passing 'interval'
as the dtype, which appears to be valid:
In [1]: from pandas.core.dtypes.common import is_interval_dtype
In [2]: is_interval_dtype('interval')
Out[2]: True
pandas/core/indexes/multi.py
Outdated
@@ -2715,7 +2716,7 @@ def difference(self, other): | |||
|
|||
@Appender(_index_shared_docs['astype']) | |||
def astype(self, dtype, copy=True): | |||
if not is_object_dtype(np.dtype(dtype)): | |||
if not is_object_dtype(pandas_dtype(dtype)): | |||
raise TypeError('Setting %s dtype to anything other than object ' | |||
'is not supported' % self.__class__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this change since np.dtype
doesn't recognize 'category'
, and would raise a different error message than the one specified here; pandas_dtype
ensures that this error message will be raised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh yes, we should have very very limited np.dtype
calls anywhere; pandas_dtype
is the general version
I think it should be CategoricalIndex |
pandas/core/indexes/category.py
Outdated
if is_interval_dtype(dtype): | ||
from pandas import IntervalIndex | ||
return IntervalIndex.from_intervals(np.array(self)) | ||
elif is_categorical_dtype(dtype) and (dtype == self.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should use dtype.equals(self.dtype)
here I think
pandas/core/indexes/multi.py
Outdated
@@ -2715,7 +2716,7 @@ def difference(self, other): | |||
|
|||
@Appender(_index_shared_docs['astype']) | |||
def astype(self, dtype, copy=True): | |||
if not is_object_dtype(np.dtype(dtype)): | |||
if not is_object_dtype(pandas_dtype(dtype)): | |||
raise TypeError('Setting %s dtype to anything other than object ' | |||
'is not supported' % self.__class__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh yes, we should have very very limited np.dtype
calls anywhere; pandas_dtype
is the general version
pandas/tests/indexes/common.py
Outdated
@@ -1058,3 +1059,20 @@ def test_putmask_with_wrong_mask(self): | |||
|
|||
with pytest.raises(ValueError): | |||
index.putmask('foo', 1) | |||
|
|||
def test_astype_category(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you pass a CategoricalDtype
here as well (with and w/o ordered) and make this parameterized
@@ -362,8 +363,15 @@ def test_astype(self, closed): | |||
tm.assert_index_equal(result, idx) | |||
assert result.equals(idx) | |||
|
|||
result = idx.astype('category') | |||
def test_astype_category(self, closed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
pandas/tests/indexes/test_multi.py
Outdated
@@ -554,6 +555,15 @@ def test_astype(self): | |||
with tm.assert_raises_regex(TypeError, "^Setting.*dtype.*object"): | |||
self.index.astype(np.dtype(int)) | |||
|
|||
def test_astype_category(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
Codecov Report
@@ Coverage Diff @@
## master #18677 +/- ##
==========================================
- Coverage 91.6% 91.59% -0.02%
==========================================
Files 153 153
Lines 51306 51339 +33
==========================================
+ Hits 46998 47022 +24
- Misses 4308 4317 +9
Continue to review full report at Codecov.
|
Question regarding how this should behave with a Setup: In [3]: ci = pd.CategoricalIndex(list('abca'))
In [4]: new_dtype = CategoricalDtype(ordered=True)
In [5]: new_dtype
Out[5]: CategoricalDtype(categories=None, ordered=True)
In [6]: ci.dtype
Out[6]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False) Then In [7]: ci.dtype == new_dtype
Out[7]: True Which appears to be intentional, per the comment prior to the implementation: pandas/pandas/core/dtypes/dtypes.py Lines 216 to 222 in 9629fef
In this case, should doing
|
373f9d4
to
b4f354f
Compare
Made review related updates and a few additional fixes/changes:
Also modified the The only slightly counter-intuitive thing is that |
pandas/core/indexes/category.py
Outdated
dtype = CategoricalDtype(new_categories, new_ordered) | ||
|
||
# fastpath if dtypes are equal | ||
if dtype == self.dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that dtype.equals(self.dtype)
raises AttributeError: 'CategoricalDtype' object has no attribute 'equals'
.
The only other thing I found for dtype comparison is is_dtype_equal
, but under the hood that basically just does ==
within a try
/except
to catch if the dtypes aren't comparable. We're within elif is_categorical_dtype(dtype)
here though, so is_dtype_equal
seem superfluous, but could still use it if preferred.
I agree that from a user point of view this should change the ordered attribute.
I am wondering if it would not be possible to change this. It is clear that we need to keep
That doesn't sound fully as the desired behaviour .. although I am not fully sure :-) |
Yes, agreed that it's unclear. In that case, I think it's less surprising for
That was for backwards comparability. I wouldn't use it as an argument of consistency, since it's a bad thing to be consistent about :) |
b4f354f
to
8066780
Compare
Updates:
|
8066780
to
3697d6e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. small comments.
pandas/core/indexes/category.py
Outdated
@@ -341,9 +341,26 @@ def __array__(self, dtype=None): | |||
|
|||
@Appender(_index_shared_docs['astype']) | |||
def astype(self, dtype, copy=True): | |||
if isinstance(dtype, compat.string_types) and dtype == 'category': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you actually need this check as the dtype == self.dtype
line below should pick this up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this, or some variation of it, is necessary in order to guarantee that CI.astype('category')
doesn't change anything. The issue being that pandas_dtype('category')
returns CDT(None, False)
, which would change a CI
with ordered=True
to False
.
pandas/core/indexes/datetimes.py
Outdated
@@ -916,6 +917,10 @@ def astype(self, dtype, copy=True): | |||
elif copy is True: | |||
return self.copy() | |||
return self | |||
elif is_categorical_dtype(dtype): | |||
from pandas.core.indexes.category import CategoricalIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pattern is to import CI
at the top, but ok here too
e.g. from pandas.core.indexes.category import CategoricalIndex
should work
pandas/core/indexes/interval.py
Outdated
@@ -632,8 +632,9 @@ def astype(self, dtype, copy=True): | |||
elif is_object_dtype(dtype): | |||
return Index(self.values, dtype=object) | |||
elif is_categorical_dtype(dtype): | |||
from pandas import Categorical | |||
return Categorical(self, ordered=True) | |||
from pandas.core.indexes.category import CategoricalIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
'is not supported' % self.__class__) | ||
dtype = pandas_dtype(dtype) | ||
if is_categorical_dtype(dtype): | ||
msg = '> 1 ndim Categorical are not supported at this time' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test fo this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrote a test for it in test_multi.py
, which overrides the test in common.py
:
pandas/core/indexes/numeric.py
Outdated
@@ -313,10 +314,14 @@ def astype(self, dtype, copy=True): | |||
values = self._values.astype(dtype, copy=copy) | |||
elif is_object_dtype(dtype): | |||
values = self._values.astype('object', copy=copy) | |||
elif is_categorical_dtype(dtype): | |||
from pandas.core.indexes.category import CategoricalIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
@@ -517,6 +518,10 @@ def astype(self, dtype, copy=True, how='start'): | |||
return self.to_timestamp(how=how).tz_localize(dtype.tz) | |||
elif is_period_dtype(dtype): | |||
return self.asfreq(freq=dtype.freq) | |||
elif is_categorical_dtype(dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
side thing, I think that we could make a more generic astype in indexes.base
and remove some boiler plate maybe (of course separate PR), you can make an issue if you want (or just PR!)
3697d6e
to
31d4d62
Compare
Updated to move the |
Looks good to me |
31d4d62
to
6042131
Compare
Moved the categorical dtype update code to a new Not sure if the name/location is appropriate, but can rename/move if need be. For the time being, I've maintained the existing logic of |
@@ -1053,6 +1053,10 @@ def _to_embed(self, keep_tz=False, dtype=None): | |||
|
|||
@Appender(_index_shared_docs['astype']) | |||
def astype(self, dtype, copy=True): | |||
if is_categorical_dtype(dtype): | |||
from .category import CategoricalIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we have an import issue if we import this at the top (with the fully qualified path)?
thanks @jschendel nice patch! keep em coming! |
git diff upstream/master -u -- "*.py" | flake8 --diff
Notes:
MultiIndex.astype('category')
raises per @TomAugspurger's comment in the issue.IntervalIndex.astype('category')
return aCategorical
withordered=True
instead ofCategoricalIndex
, since it looks like someone previously intentionally implemented it this way. I don't immediately see a reason why, but left it as is. Would be straightforward to make this consistent and return aCategoricalIndex
.CategoricalIndex
.