Skip to content

BUG: Allow IntervalIndex to be constructed from categorical data with appropriate dtype #21254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 4, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.1.txt
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,7 @@ Indexing
- Bug in :meth:`Series.reset_index` where appropriate error was not raised with an invalid level name (:issue:`20925`)
- Bug in :func:`interval_range` when ``start``/``periods`` or ``end``/``periods`` are specified with float ``start`` or ``end`` (:issue:`21161`)
- Bug in :meth:`MultiIndex.set_names` where error raised for a ``MultiIndex`` with ``nlevels == 1`` (:issue:`21149`)
- Bug in :class:`IntervalIndex` constructors where creating an ``IntervalIndex`` from categorical data was not fully supported (:issue:`21243`, issue:`21253`)
-

I/O
Expand Down
4 changes: 4 additions & 0 deletions pandas/core/indexes/interval.py
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,10 @@ def maybe_convert_platform_interval(values):
-------
array
"""
if is_categorical_dtype(values):
# GH 21243/21253
values = np.array(values)

if isinstance(values, (list, tuple)) and len(values) == 0:
# GH 19016
# empty lists/tuples get object dtype by default, but this is not
Expand Down
23 changes: 22 additions & 1 deletion pandas/tests/indexes/interval/test_construction.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,9 @@

from pandas import (
Interval, IntervalIndex, Index, Int64Index, Float64Index, Categorical,
date_range, timedelta_range, period_range, notna)
CategoricalIndex, date_range, timedelta_range, period_range, notna)
from pandas.compat import lzip
from pandas.core.dtypes.common import is_categorical_dtype
from pandas.core.dtypes.dtypes import IntervalDtype
import pandas.core.common as com
import pandas.util.testing as tm
Expand Down Expand Up @@ -111,6 +112,22 @@ def test_constructor_string(self, constructor, breaks):
with tm.assert_raises_regex(TypeError, msg):
constructor(**self.get_kwargs_from_breaks(breaks))

@pytest.mark.parametrize('cat_constructor', [
Categorical, CategoricalIndex])
def test_constructor_categorical_valid(self, constructor, cat_constructor):
# GH 21243/21253
if isinstance(constructor, partial) and constructor.func is Index:
# Index is defined to create CategoricalIndex from categorical data
pytest.skip()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being skipped due to the following behavior:

In [2]: cat = pd.Categorical([pd.Interval(0, 1), pd.Interval(1, 2), pd.Interval(0, 1)])

In [3]: pd.Index(cat, dtype='interval')
Out[3]: CategoricalIndex([(0, 1], (1, 2], (0, 1]], categories=[(0, 1], (1, 2]], ordered=False, dtype='category')

This happens because the Index code is structured so that categorical takes precedence over interval:

# categorical
if is_categorical_dtype(data) or is_categorical_dtype(dtype):
from .category import CategoricalIndex
return CategoricalIndex(data, dtype=dtype, copy=copy, name=name,
**kwargs)
# interval
if is_interval_dtype(data) or is_interval_dtype(dtype):
from .interval import IntervalIndex
closed = kwargs.get('closed', None)
return IntervalIndex(data, dtype=dtype, name=name, copy=copy,
closed=closed)

The code above could be restructured so that the dtype argument, if present, takes precedence over the type of data. Seems like that would be more sensible than the current approach for this corner case, but on the fence about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I c. ok can open an issue about this, but yes I would agree should infer with a passed dtype first before switching on the type of the data.


breaks = np.arange(10, dtype='int64')
expected = IntervalIndex.from_breaks(breaks)

cat_breaks = cat_constructor(breaks)
result_kwargs = self.get_kwargs_from_breaks(cat_breaks)
result = constructor(**result_kwargs)
tm.assert_index_equal(result, expected)

def test_generic_errors(self, constructor):
# filler input data to be used when supplying invalid kwargs
filler = self.get_kwargs_from_breaks(range(10))
Expand Down Expand Up @@ -238,6 +255,8 @@ def get_kwargs_from_breaks(self, breaks, closed='right'):
tuples = lzip(breaks[:-1], breaks[1:])
if isinstance(breaks, (list, tuple)):
return {'data': tuples}
elif is_categorical_dtype(breaks):
return {'data': breaks._constructor(tuples)}
return {'data': com._asarray_tuplesafe(tuples)}

def test_constructor_errors(self):
Expand Down Expand Up @@ -286,6 +305,8 @@ def get_kwargs_from_breaks(self, breaks, closed='right'):

if isinstance(breaks, list):
return {'data': ivs}
elif is_categorical_dtype(breaks):
return {'data': breaks._constructor(ivs)}
return {'data': np.array(ivs, dtype=object)}

def test_generic_errors(self, constructor):
Expand Down