Skip to content

BUG: Ensure Index.astype('category') returns a CategoricalIndex #18677

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Dec 11, 2017

Conversation

jschendel
Copy link
Member

Notes:

  • MultiIndex.astype('category') raises per @TomAugspurger's comment in the issue.
  • IntervalIndex.astype('category') return a Categorical with ordered=True instead of CategoricalIndex, since it looks like someone previously intentionally implemented it this way. I don't immediately see a reason why, but left it as is. Would be straightforward to make this consistent and return a CategoricalIndex.
  • All other types of index should return a CategoricalIndex.

@@ -1934,7 +1934,7 @@ def pandas_dtype(dtype):
except TypeError:
pass

elif dtype.startswith('interval[') or dtype.startswith('Interval['):
elif dtype.startswith('interval') or dtype.startswith('Interval'):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this because switching to pandas_dtype caused a test to break since it was passing 'interval' as the dtype, which appears to be valid:

In [1]: from pandas.core.dtypes.common import is_interval_dtype

In [2]: is_interval_dtype('interval')
Out[2]: True

@@ -2715,7 +2716,7 @@ def difference(self, other):

@Appender(_index_shared_docs['astype'])
def astype(self, dtype, copy=True):
if not is_object_dtype(np.dtype(dtype)):
if not is_object_dtype(pandas_dtype(dtype)):
raise TypeError('Setting %s dtype to anything other than object '
'is not supported' % self.__class__)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change since np.dtype doesn't recognize 'category', and would raise a different error message than the one specified here; pandas_dtype ensures that this error message will be raised.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh yes, we should have very very limited np.dtype calls anywhere; pandas_dtype is the general version

@jorisvandenbossche jorisvandenbossche added Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions labels Dec 7, 2017
@jorisvandenbossche
Copy link
Member

IntervalIndex.astype('category') return a Categorical with ordered=True instead of CategoricalIndex, since it looks like someone previously intentionally implemented it this way.

I think it should be CategoricalIndex

if is_interval_dtype(dtype):
from pandas import IntervalIndex
return IntervalIndex.from_intervals(np.array(self))
elif is_categorical_dtype(dtype) and (dtype == self.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should use dtype.equals(self.dtype) here I think

@@ -2715,7 +2716,7 @@ def difference(self, other):

@Appender(_index_shared_docs['astype'])
def astype(self, dtype, copy=True):
if not is_object_dtype(np.dtype(dtype)):
if not is_object_dtype(pandas_dtype(dtype)):
raise TypeError('Setting %s dtype to anything other than object '
'is not supported' % self.__class__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh yes, we should have very very limited np.dtype calls anywhere; pandas_dtype is the general version

@@ -1058,3 +1059,20 @@ def test_putmask_with_wrong_mask(self):

with pytest.raises(ValueError):
index.putmask('foo', 1)

def test_astype_category(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you pass a CategoricalDtype here as well (with and w/o ordered) and make this parameterized

@@ -362,8 +363,15 @@ def test_astype(self, closed):
tm.assert_index_equal(result, idx)
assert result.equals(idx)

result = idx.astype('category')
def test_astype_category(self, closed):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -554,6 +555,15 @@ def test_astype(self):
with tm.assert_raises_regex(TypeError, "^Setting.*dtype.*object"):
self.index.astype(np.dtype(int))

def test_astype_category(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@codecov
Copy link

codecov bot commented Dec 7, 2017

Codecov Report

Merging #18677 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18677      +/-   ##
==========================================
- Coverage    91.6%   91.59%   -0.02%     
==========================================
  Files         153      153              
  Lines       51306    51339      +33     
==========================================
+ Hits        46998    47022      +24     
- Misses       4308     4317       +9
Flag Coverage Δ
#multiple 89.45% <100%> (ø) ⬆️
#single 40.74% <31.7%> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 96.29% <100%> (+0.01%) ⬆️
pandas/core/dtypes/dtypes.py 95.27% <100%> (+0.13%) ⬆️
pandas/core/indexes/numeric.py 97.36% <100%> (+0.03%) ⬆️
pandas/core/indexes/interval.py 93.8% <100%> (ø) ⬆️
pandas/core/indexes/datetimes.py 95.7% <100%> (+0.01%) ⬆️
pandas/core/indexes/base.py 96.43% <100%> (ø) ⬆️
pandas/core/indexes/category.py 97.23% <100%> (+0.03%) ⬆️
pandas/core/indexes/period.py 92.93% <100%> (+0.03%) ⬆️
pandas/core/dtypes/common.py 94.45% <100%> (ø) ⬆️
pandas/core/indexes/timedeltas.py 91.26% <100%> (+0.05%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae74c2b...6042131. Read the comment docs.

@jschendel
Copy link
Member Author

Question regarding how this should behave with a CategoricalIndex.

Setup:

In [3]: ci = pd.CategoricalIndex(list('abca'))

In [4]: new_dtype = CategoricalDtype(ordered=True)

In [5]: new_dtype
Out[5]: CategoricalDtype(categories=None, ordered=True)

In [6]: ci.dtype
Out[6]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)

Then ci.dtype and new_dtype are considered equal despite ordered being different:

In [7]: ci.dtype == new_dtype
Out[7]: True

Which appears to be intentional, per the comment prior to the implementation:

elif self.categories is None or other.categories is None:
# We're forced into a suboptimal corner thanks to math and
# backwards compatibility. We require that `CDT(...) == 'category'`
# for all CDTs **including** `CDT(None, ...)`. Therefore, *all*
# CDT(., .) = CDT(None, False) and *all*
# CDT(., .) = CDT(None, True).
return True

In this case, should doing ci.astype(new_dtype) change ordered to True?

  • From a user standpoint I'd expect this to change ordered to True, as this would logically be the user's intention.
  • From a codebase consistency standpoint, it seems like since the two dtypes are equal, doing an astype shouldn't change anything

@jschendel jschendel force-pushed the idx-astype-category branch from 373f9d4 to b4f354f Compare December 8, 2017 08:09
@jschendel
Copy link
Member Author

Made review related updates and a few additional fixes/changes:

  • IntervalIndex.astype('category') has been changed to return a CategoricalIndex
  • Modified MultiIndex.astype('category') to raise a NotImplementedError stating that >1 ndim categoricals aren't currently supported
    • Mirrors a similar error message raised by DataFrame.astype('category').

Also modified the CategoricalIndex.astype behavior to address my question above. I chose the first bulleted option. The logic I implemented is that if a CategoricalDtype with categories or ordered being None is passed, None is replaced by the corresponding value from the CategoricalIndex's dtype. Basically None corresponds to "don't change this attribute".

The only slightly counter-intuitive thing is that pandas_dtype('category') returns a CategoricalDtype with ordered=False, meaning that if you have an ordered CategoricalIndex and do .astype('category') it would keep the same categories but switch ordered to False. Should be straightforward to make this not change anything if that'd be preferred.

dtype = CategoricalDtype(new_categories, new_ordered)

# fastpath if dtypes are equal
if dtype == self.dtype:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that dtype.equals(self.dtype) raises AttributeError: 'CategoricalDtype' object has no attribute 'equals'.

The only other thing I found for dtype comparison is is_dtype_equal, but under the hood that basically just does == within a try/except to catch if the dtypes aren't comparable. We're within elif is_categorical_dtype(dtype) here though, so is_dtype_equal seem superfluous, but could still use it if preferred.

@jorisvandenbossche
Copy link
Member

should doing ci.astype(new_dtype) change ordered to True?

I agree that from a user point of view this should change the ordered attribute.

From a codebase consistency standpoint, it seems like since the two dtypes are equal, doing an astype shouldn't change anything

I am wondering if it would not be possible to change this. It is clear that we need to keep cat_dtype == 'category' to be True. But that is already handled a few lines above the snippet you showed.
So I don't fully understand the comment why it is needed that all CDT(None, ordered=True/False) needs to be equal regardless of the ordered attribute.

cc @TomAugspurger

if you have an ordered CategoricalIndex and do .astype('category') it would keep the same categories but switch ordered to False. Should be straightforward to make this not change anything if that'd be preferred.

That doesn't sound fully as the desired behaviour .. although I am not fully sure :-)

@TomAugspurger
Copy link
Contributor

if you have an ordered CategoricalIndex and do .astype('category') it would keep the same categories but switch ordered to False. Should be straightforward to make this not change anything if that'd be preferred.

That doesn't sound fully as the desired behaviour .. although I am not fully sure :-)

Yes, agreed that it's unclear. In that case, I think it's less surprising for 'category' to behave as CDT(None, None), i.e. don't change anything.

So I don't fully understand the comment why it is needed that all CDT(None, ordered=True/False) needs to be equal regardless of the ordered attribute.

That was for backwards comparability. I wouldn't use it as an argument of consistency, since it's a bad thing to be consistent about :)

@jschendel jschendel force-pushed the idx-astype-category branch from b4f354f to 8066780 Compare December 9, 2017 00:08
@jschendel
Copy link
Member Author

Updates:

  • Modified CategoricalIndex.astype('category') to not change anything
  • Fixed some tests that previously relied on IntervalIndex.astype('category') returning a Categorical with ordered=True instead of a CategoricalIndex
    • This only impacted the creation of expected in the tests; didn't change result.

@jschendel jschendel force-pushed the idx-astype-category branch from 8066780 to 3697d6e Compare December 9, 2017 01:38
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. small comments.

@@ -341,9 +341,26 @@ def __array__(self, dtype=None):

@Appender(_index_shared_docs['astype'])
def astype(self, dtype, copy=True):
if isinstance(dtype, compat.string_types) and dtype == 'category':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you actually need this check as the dtype == self.dtype line below should pick this up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this, or some variation of it, is necessary in order to guarantee that CI.astype('category') doesn't change anything. The issue being that pandas_dtype('category') returns CDT(None, False), which would change a CI with ordered=True to False.

@@ -916,6 +917,10 @@ def astype(self, dtype, copy=True):
elif copy is True:
return self.copy()
return self
elif is_categorical_dtype(dtype):
from pandas.core.indexes.category import CategoricalIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pattern is to import CI at the top, but ok here too

e.g. from pandas.core.indexes.category import CategoricalIndex should work

@@ -632,8 +632,9 @@ def astype(self, dtype, copy=True):
elif is_object_dtype(dtype):
return Index(self.values, dtype=object)
elif is_categorical_dtype(dtype):
from pandas import Categorical
return Categorical(self, ordered=True)
from pandas.core.indexes.category import CategoricalIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

'is not supported' % self.__class__)
dtype = pandas_dtype(dtype)
if is_categorical_dtype(dtype):
msg = '> 1 ndim Categorical are not supported at this time'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test fo this?

Copy link
Member Author

@jschendel jschendel Dec 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -313,10 +314,14 @@ def astype(self, dtype, copy=True):
values = self._values.astype(dtype, copy=copy)
elif is_object_dtype(dtype):
values = self._values.astype('object', copy=copy)
elif is_categorical_dtype(dtype):
from pandas.core.indexes.category import CategoricalIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -517,6 +518,10 @@ def astype(self, dtype, copy=True, how='start'):
return self.to_timestamp(how=how).tz_localize(dtype.tz)
elif is_period_dtype(dtype):
return self.asfreq(freq=dtype.freq)
elif is_categorical_dtype(dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

side thing, I think that we could make a more generic astype in indexes.base and remove some boiler plate maybe (of course separate PR), you can make an issue if you want (or just PR!)

@jschendel
Copy link
Member Author

Updated to move the CI imports, and replied to the other comments in the latest review.

@jorisvandenbossche
Copy link
Member

Looks good to me

@jschendel
Copy link
Member Author

Moved the categorical dtype update code to a new _update_dtype method of CategoricalDtype, per #18710 (comment)

Not sure if the name/location is appropriate, but can rename/move if need be. For the time being, I've maintained the existing logic of .astype('category') not updating anything.

@jreback jreback added this to the 0.22.0 milestone Dec 11, 2017
@@ -1053,6 +1053,10 @@ def _to_embed(self, keep_tz=False, dtype=None):

@Appender(_index_shared_docs['astype'])
def astype(self, dtype, copy=True):
if is_categorical_dtype(dtype):
from .category import CategoricalIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we have an import issue if we import this at the top (with the fully qualified path)?

@jreback jreback merged commit 3821040 into pandas-dev:master Dec 11, 2017
@jreback
Copy link
Contributor

jreback commented Dec 11, 2017

thanks @jschendel nice patch! keep em coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Index astype('category') does not return a CategoricalIndex
4 participants