
Commit d26363b

pijucha authored and jreback committed
BUG/DEPR: Categorical: keep dtype in MultiIndex (pandas-dev#13743), deprecate .from_array
Categorical dtype is now also preserved in `groupby`, `set_index`, `stack`, `get_dummies`, and `make_axis_dummies`. closes pandas-dev#13743 closes pandas-dev#13854
1 parent ccec504 commit d26363b


14 files changed

+370
-91
lines changed


doc/source/whatsnew/v0.19.0.txt

+75-1
Original file line number | Diff line number | Diff line change
@@ -977,6 +977,79 @@ New Behavior:
977977
pd.Index([1, 2, 3]).unique()
978978
pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], tz='Asia/Tokyo').unique()
979979

980+
.. _whatsnew_0190.api.multiindex:
981+
982+
``MultiIndex`` constructors preserve categorical dtypes
983+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
984+
985+
``MultiIndex.from_arrays`` and ``MultiIndex.from_product`` will now preserve categorical dtype
986+
in ``MultiIndex`` levels. (:issue:`13743`, :issue:`13854`)
987+
988+
.. ipython:: python
989+
990+
cat = pd.Categorical(['a', 'b'], categories=list("bac"))
991+
lvl1 = ['foo', 'bar']
992+
midx = pd.MultiIndex.from_arrays([cat, lvl1])
993+
midx
994+
995+
Previous Behavior:
996+
997+
.. code-block:: ipython
998+
999+
In [4]: midx.levels[0]
1000+
Out[4]: Index(['b', 'a', 'c'], dtype='object')
1001+
1002+
In [5]: midx.get_level_values(0)
1003+
Out[5]: Index(['a', 'b'], dtype='object')
1004+
1005+
New Behavior:
1006+
1007+
.. ipython:: python
1008+
1009+
midx.levels[0]
1010+
midx.get_level_values(0)
1011+
1012+
An analogous change has been made to ``MultiIndex.from_product``.
1013+
As a consequence, ``groupby`` and ``set_index`` also preserve categorical dtypes in indexes.
1014+
1015+
.. ipython:: python
1016+
1017+
df = pd.DataFrame({'A': [0, 1], 'B': [10, 11], 'C': cat})
1018+
df_grouped = df.groupby(by=['A', 'C']).first()
1019+
df_set_idx = df.set_index(['A', 'C'])
1020+
1021+
Previous Behavior:
1022+
1023+
.. code-block:: ipython
1024+
1025+
In [11]: df_grouped.index.levels[1]
1026+
Out[11]: Index(['b', 'a', 'c'], dtype='object', name='C')
1027+
In [12]: df_grouped.reset_index().dtypes
1028+
Out[12]:
1029+
A int64
1030+
C object
1031+
B float64
1032+
dtype: object
1033+
1034+
In [13]: df_set_idx.index.levels[1]
1035+
Out[13]: Index(['b', 'a', 'c'], dtype='object', name='C')
1036+
In [14]: df_set_idx.reset_index().dtypes
1037+
Out[14]:
1038+
A int64
1039+
C object
1040+
B int64
1041+
dtype: object
1042+
1043+
New Behavior:
1044+
1045+
.. ipython:: python
1046+
1047+
df_grouped.index.levels[1]
1048+
df_grouped.reset_index().dtypes
1049+
1050+
df_set_idx.index.levels[1]
1051+
df_set_idx.reset_index().dtypes
1052+
9801053
.. _whatsnew_0190.api.autogenerated_chunksize_index:
9811054

9821055
``read_csv`` will progressively enumerate chunks
@@ -1173,7 +1246,7 @@ Deprecations
11731246
- ``Panel4D`` and ``PanelND`` constructors are deprecated and will be removed in a future version. The recommended way to represent these types of n-dimensional data is with the `xarray package <http://xarray.pydata.org/en/stable/>`__. Pandas provides a :meth:`~Panel4D.to_xarray` method to automate this conversion. (:issue:`13564`)
11741247
- ``pandas.tseries.frequencies.get_standard_freq`` is deprecated. Use ``pandas.tseries.frequencies.to_offset(freq).rule_code`` instead. (:issue:`13874`)
11751248
- ``pandas.tseries.frequencies.to_offset``'s ``freqstr`` keyword is deprecated in favor of ``freq``. (:issue:`13874`)
1176-
1249+
- ``Categorical.from_array`` has been deprecated and will be removed in a future version (:issue:`13854`)
11771250

11781251
.. _whatsnew_0190.prior_deprecations:
11791252

@@ -1388,3 +1461,4 @@ Bug Fixes
13881461
- Bug in ``.to_string()`` when called with an integer ``line_width`` and ``index=False``, which raised an ``UnboundLocalError`` because ``idx`` was referenced before assignment.
13891462

13901463
- Bug in ``eval()`` where the ``resolvers`` argument would not accept a list (:issue:`14095`)
1464+
- Bug in ``stack``, ``get_dummies`` and ``make_axis_dummies`` where categorical dtypes were not preserved in (multi)indexes (:issue:`13854`)
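
The whatsnew entry above shows ``MultiIndex.from_arrays`` and only mentions that ``MultiIndex.from_product`` behaves the same way. A minimal sketch of the ``from_product`` case follows, assuming a pandas version that includes this change; the variable names ``level0`` and ``values0`` are illustrative and not part of the commit.

```python
import pandas as pd

# Same categorical as in the whatsnew example: 'c' is an unused category
# and the category order ('b', 'a', 'c') is not alphabetical.
cat = pd.Categorical(["a", "b"], categories=list("bac"))

# Cartesian product of the categorical with a plain list.
midx = pd.MultiIndex.from_product([cat, ["foo", "bar"]])

# The first level keeps the categorical dtype, including the unused
# category and the original category order.
level0 = midx.levels[0]
values0 = midx.get_level_values(0)

print(level0.dtype)
print(values0.dtype)
```

Before this change the first level would have been a plain object-dtype ``Index``, so the category order and orderedness were lost after a round trip through the index.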

pandas/core/categorical.py

+50-1
@@ -5,7 +5,7 @@
55
import types
66

77
from pandas import compat, lib
8-
from pandas.compat import u
8+
from pandas.compat import u, lzip
99

1010
from pandas.types.generic import ABCSeries, ABCIndexClass, ABCCategoricalIndex
1111
from pandas.types.missing import isnull, notnull
@@ -17,6 +17,7 @@
1717
_ensure_platform_int,
1818
is_dtype_equal,
1919
is_datetimelike,
20+
is_categorical,
2021
is_categorical_dtype,
2122
is_integer_dtype, is_bool,
2223
is_list_like, is_sequence,
@@ -411,6 +412,8 @@ def base(self):
411412
@classmethod
412413
def from_array(cls, data, **kwargs):
413414
"""
415+
DEPRECATED: Use ``Categorical`` instead.
416+
414417
Make a Categorical type from a single array-like object.
415418
416419
For internal compatibility with numpy arrays.
@@ -421,6 +424,8 @@ def from_array(cls, data, **kwargs):
421424
Can be an Index or array-like. The categories are assumed to be
422425
the unique values of `data`.
423426
"""
427+
warn("Categorical.from_array is deprecated, use Categorical instead",
428+
FutureWarning, stacklevel=2)
424429
return cls(data, **kwargs)
425430

426431
@classmethod
@@ -1959,3 +1964,47 @@ def _convert_to_list_like(list_like):
19591964
else:
19601965
# is this reached?
19611966
return [list_like]
1967+
1968+
1969+
def _factorize_from_iterable(values):
1970+
"""
1971+
Factorize an input `values` into `categories` and `codes`. Preserves
1972+
categorical dtype in `categories`.
1973+
1974+
*This is an internal function*
1975+
1976+
Parameters
1977+
----------
1978+
values : list-like
1979+
1980+
Returns
1981+
-------
1982+
codes : np.array
1983+
categories : Index
1984+
If `values` has a categorical dtype, then `categories` is
1985+
a CategoricalIndex keeping the categories and order of `values`.
1986+
"""
1987+
from pandas.indexes.category import CategoricalIndex
1988+
1989+
if is_categorical(values):
1990+
if isinstance(values, (ABCCategoricalIndex, ABCSeries)):
1991+
values = values._values
1992+
categories = CategoricalIndex(values.categories,
1993+
categories=values.categories,
1994+
ordered=values.ordered)
1995+
codes = values.codes
1996+
else:
1997+
cat = Categorical(values, ordered=True)
1998+
categories = cat.categories
1999+
codes = cat.codes
2000+
return codes, categories
2001+
2002+
2003+
def _factorize_from_iterables(iterables):
2004+
"""
2005+
A higher-level wrapper over `_factorize_from_iterable`.
2006+
See `_factorize_from_iterable` for more info.
2007+
2008+
*This is an internal function*
2009+
"""
2010+
return lzip(*[_factorize_from_iterable(it) for it in iterables])
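
The two helpers added above are internal to pandas. The idea can be approximated with public API only, as in the sketch below. The function names ``factorize_preserving_categorical`` and ``factorize_many`` are hypothetical stand-ins, not the commit's code, and the sketch assumes a pandas version that exposes ``pd.CategoricalDtype``.

```python
import pandas as pd


def factorize_preserving_categorical(values):
    """Return (codes, categories), keeping categorical dtype if present.

    Hypothetical public-API approximation of the internal helper.
    """
    if isinstance(getattr(values, "dtype", None), pd.CategoricalDtype):
        # Re-wrapping categorical data keeps its categories and orderedness.
        cat = pd.Categorical(values)
        categories = pd.CategoricalIndex(
            cat.categories, categories=cat.categories, ordered=cat.ordered
        )
        return cat.codes, categories
    # Non-categorical input: categories are the sorted unique values.
    cat = pd.Categorical(values, ordered=True)
    return cat.codes, cat.categories


def factorize_many(iterables):
    """Transpose per-iterable (codes, categories) pairs.

    Returns a list of two tuples: all codes, then all categories.
    """
    return list(zip(*[factorize_preserving_categorical(it) for it in iterables]))


cat = pd.Categorical(["a", "b"], categories=list("bac"))
all_codes, all_levels = factorize_many([cat, ["foo", "bar"]])
```

Note that the transposition yields tuples rather than lists, which is why a caller that needs to mutate or index the results as lists has to convert them explicitly.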

pandas/core/panel.py

+1-8
@@ -21,7 +21,6 @@
2121
from pandas import compat
2222
from pandas.compat import (map, zip, range, u, OrderedDict, OrderedDefaultdict)
2323
from pandas.compat.numpy import function as nv
24-
from pandas.core.categorical import Categorical
2524
from pandas.core.common import PandasError, _try_sort, _default_index
2625
from pandas.core.frame import DataFrame
2726
from pandas.core.generic import NDFrame, _shared_docs
@@ -103,13 +102,7 @@ def panel_index(time, panels, names=None):
103102
if names is None:
104103
names = ['time', 'panel']
105104
time, panels = _ensure_like_indices(time, panels)
106-
time_factor = Categorical.from_array(time, ordered=True)
107-
panel_factor = Categorical.from_array(panels, ordered=True)
108-
109-
labels = [time_factor.codes, panel_factor.codes]
110-
levels = [time_factor.categories, panel_factor.categories]
111-
return MultiIndex(levels, labels, sortorder=None, names=names,
112-
verify_integrity=False)
105+
return MultiIndex.from_arrays([time, panels], sortorder=None, names=names)
113106

114107

115108
class Panel(NDFrame):
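
The ``panel_index`` change above replaces hand-built levels and codes with a single ``MultiIndex.from_arrays`` call. A rough sketch of why the two are interchangeable for plain (non-categorical) inputs follows. It assumes a recent pandas, where the constructor keyword is ``codes`` (the commit-era name was ``labels``), and the sample data is made up for illustration.

```python
import pandas as pd

# Made-up sample inputs standing in for the `time` and `panels` arguments.
time = [2000, 2000, 2001, 2001]
panels = ["x", "y", "x", "y"]
names = ["time", "panel"]

# Old approach: factorize each array by hand, then assemble the index.
factors = [pd.Categorical(arr, ordered=True) for arr in (time, panels)]
manual = pd.MultiIndex(
    levels=[f.categories for f in factors],
    codes=[f.codes for f in factors],  # called `labels` at the time of the commit
    names=names,
)

# New approach: let from_arrays do the factorization.
direct = pd.MultiIndex.from_arrays([time, panels], names=names)

print(manual.equals(direct))
```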

pandas/core/reshape.py

+10-14
@@ -18,7 +18,7 @@
1818
from pandas.sparse.array import SparseArray
1919
from pandas._sparse import IntIndex
2020

21-
from pandas.core.categorical import Categorical
21+
from pandas.core.categorical import Categorical, _factorize_from_iterable
2222
from pandas.core.groupby import get_group_index, _compress_group_index
2323

2424
import pandas.core.algorithms as algos
@@ -166,9 +166,8 @@ def get_result(self):
166166
if self.is_categorical is not None:
167167
categories = self.is_categorical.categories
168168
ordered = self.is_categorical.ordered
169-
values = [Categorical.from_array(values[:, i],
170-
categories=categories,
171-
ordered=ordered)
169+
values = [Categorical(values[:, i], categories=categories,
170+
ordered=ordered)
172171
for i in range(values.shape[-1])]
173172

174173
return DataFrame(values, index=index, columns=columns)
@@ -471,8 +470,8 @@ def stack(frame, level=-1, dropna=True):
471470
def factorize(index):
472471
if index.is_unique:
473472
return index, np.arange(len(index))
474-
cat = Categorical(index, ordered=True)
475-
return cat.categories, cat.codes
473+
codes, categories = _factorize_from_iterable(index)
474+
return categories, codes
476475

477476
N, K = frame.shape
478477
if isinstance(frame.columns, MultiIndex):
@@ -1107,8 +1106,7 @@ def check_len(item, name):
11071106
def _get_dummies_1d(data, prefix, prefix_sep='_', dummy_na=False,
11081107
sparse=False, drop_first=False):
11091108
# Series avoids inconsistent NaN handling
1110-
cat = Categorical.from_array(Series(data), ordered=True)
1111-
levels = cat.categories
1109+
codes, levels = _factorize_from_iterable(Series(data))
11121110

11131111
def get_empty_Frame(data, sparse):
11141112
if isinstance(data, Series):
@@ -1124,10 +1122,10 @@ def get_empty_Frame(data, sparse):
11241122
if not dummy_na and len(levels) == 0:
11251123
return get_empty_Frame(data, sparse)
11261124

1127-
codes = cat.codes.copy()
1125+
codes = codes.copy()
11281126
if dummy_na:
1129-
codes[codes == -1] = len(cat.categories)
1130-
levels = np.append(cat.categories, np.nan)
1127+
codes[codes == -1] = len(levels)
1128+
levels = np.append(levels, np.nan)
11311129

11321130
# if dummy_na, we just fake a nan level. drop_first will drop it again
11331131
if drop_first and len(levels) == 1:
@@ -1212,9 +1210,7 @@ def make_axis_dummies(frame, axis='minor', transform=None):
12121210
labels = frame.index.labels[num]
12131211
if transform is not None:
12141212
mapped_items = items.map(transform)
1215-
cat = Categorical.from_array(mapped_items.take(labels), ordered=True)
1216-
labels = cat.codes
1217-
items = cat.categories
1213+
labels, items = _factorize_from_iterable(mapped_items.take(labels))
12181214

12191215
values = np.eye(len(items), dtype=float)
12201216
values = values.take(labels, axis=0)
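
One visible effect of routing ``_get_dummies_1d`` through the new helper is that the dummy columns follow the categories of the input rather than its sorted unique values. The sketch below illustrates this with the public ``pd.get_dummies``. It assumes a pandas version that includes this change, and the exact dtype of the dummy values varies between versions, so only column labels and counts are inspected.

```python
import pandas as pd

# 'c' is an unused category; the category order is 'b', 'a', 'c'.
cat = pd.Categorical(["a", "b"], categories=list("bac"))

dummies = pd.get_dummies(pd.Series(cat))

# One column per category, in category order, including the unused one.
print(list(dummies.columns))
print(dummies.sum())
```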

pandas/indexes/multi.py

+8-11
@@ -852,8 +852,6 @@ def from_arrays(cls, arrays, sortorder=None, names=None):
852852
MultiIndex.from_product : Make a MultiIndex from cartesian product
853853
of iterables
854854
"""
855-
from pandas.core.categorical import Categorical
856-
857855
if len(arrays) == 1:
858856
name = None if names is None else names[0]
859857
return Index(arrays[0], name=name)
@@ -864,9 +862,9 @@ def from_arrays(cls, arrays, sortorder=None, names=None):
864862
if len(arrays[i]) != len(arrays[i - 1]):
865863
raise ValueError('all arrays must be same length')
866864

867-
cats = [Categorical.from_array(arr, ordered=True) for arr in arrays]
868-
levels = [c.categories for c in cats]
869-
labels = [c.codes for c in cats]
865+
from pandas.core.categorical import _factorize_from_iterables
866+
867+
labels, levels = _factorize_from_iterables(arrays)
870868
if names is None:
871869
names = [getattr(arr, "name", None) for arr in arrays]
872870

@@ -952,15 +950,14 @@ def from_product(cls, iterables, sortorder=None, names=None):
952950
MultiIndex.from_arrays : Convert list of arrays to MultiIndex
953951
MultiIndex.from_tuples : Convert list of tuples to MultiIndex
954952
"""
955-
from pandas.core.categorical import Categorical
953+
from pandas.core.categorical import _factorize_from_iterables
956954
from pandas.tools.util import cartesian_product
957955

958-
categoricals = [Categorical.from_array(it, ordered=True)
959-
for it in iterables]
960-
labels = cartesian_product([c.codes for c in categoricals])
956+
labels, levels = _factorize_from_iterables(iterables)
957+
labels = cartesian_product(labels)
961958

962-
return MultiIndex(levels=[c.categories for c in categoricals],
963-
labels=labels, sortorder=sortorder, names=names)
959+
return MultiIndex(levels=levels, labels=labels, sortorder=sortorder,
960+
names=names)
964961

965962
@property
966963
def nlevels(self):

pandas/io/pytables.py

+7-6
@@ -37,7 +37,7 @@
3737
from pandas.formats.printing import adjoin, pprint_thing
3838
from pandas.core.common import _asarray_tuplesafe, PerformanceWarning
3939
from pandas.core.algorithms import match, unique
40-
from pandas.core.categorical import Categorical
40+
from pandas.core.categorical import Categorical, _factorize_from_iterables
4141
from pandas.core.internals import (BlockManager, make_block,
4242
_block2d_to_blocknd,
4343
_factor_indexer, _block_shape)
@@ -3736,11 +3736,12 @@ def read(self, where=None, columns=None, **kwargs):
37363736
if not self.read_axes(where=where, **kwargs):
37373737
return None
37383738

3739-
factors = [Categorical.from_array(
3740-
a.values, ordered=True) for a in self.index_axes]
3741-
levels = [f.categories for f in factors]
3742-
N = [len(f.categories) for f in factors]
3743-
labels = [f.codes for f in factors]
3739+
lst_vals = [a.values for a in self.index_axes]
3740+
labels, levels = _factorize_from_iterables(lst_vals)
3741+
# labels and levels are tuples but lists are expected
3742+
labels = list(labels)
3743+
levels = list(levels)
3744+
N = [len(lvl) for lvl in levels]
37443745

37453746
# compute the key
37463747
key = _factor_indexer(N[1:], labels)

pandas/tests/frame/test_alter_axes.py

+15
@@ -664,3 +664,18 @@ def test_assign_columns(self):
664664
frame.columns = ['foo', 'bar', 'baz', 'quux', 'foo2']
665665
assert_series_equal(self.frame['C'], frame['baz'], check_names=False)
666666
assert_series_equal(self.frame['hi'], frame['foo2'], check_names=False)
667+
668+
def test_set_index_preserve_categorical_dtype(self):
669+
# GH13743, GH13854
670+
df = DataFrame({'A': [1, 2, 1, 1, 2],
671+
'B': [10, 16, 22, 28, 34],
672+
'C1': pd.Categorical(list("abaab"),
673+
categories=list("bac"),
674+
ordered=False),
675+
'C2': pd.Categorical(list("abaab"),
676+
categories=list("bac"),
677+
ordered=True)})
678+
for cols in ['C1', 'C2', ['A', 'C1'], ['A', 'C2'], ['C1', 'C2']]:
679+
result = df.set_index(cols).reset_index()
680+
result = result.reindex(columns=df.columns)
681+
tm.assert_frame_equal(result, df)
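
The test above checks the round trip for several column combinations. A stripped-down, standalone version of the same idea is sketched here, assuming a pandas version that includes this change; the frame mirrors the whatsnew example rather than the test fixture.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0, 1],
    "B": [10, 11],
    "C": pd.Categorical(["a", "b"], categories=list("bac")),
})

# Move 'A' and 'C' into a MultiIndex, then back into columns.
indexed = df.set_index(["A", "C"])
round_trip = indexed.reset_index().reindex(columns=df.columns)

# The level built from 'C' and the restored column both stay categorical.
print(indexed.index.levels[1].dtype)
print(round_trip.dtypes)
```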

pandas/tests/frame/test_reshape.py

+16
@@ -707,3 +707,19 @@ def _test_stack_with_multiindex(multiindex):
707707
columns=Index(['B', 'C'], name='Upper'),
708708
dtype=df.dtypes[0])
709709
assert_frame_equal(result, expected)
710+
711+
def test_stack_preserve_categorical_dtype(self):
712+
# GH13854
713+
for ordered in [False, True]:
714+
for labels in [list("yxz"), list("yxy")]:
715+
cidx = pd.CategoricalIndex(labels, categories=list("xyz"),
716+
ordered=ordered)
717+
df = DataFrame([[10, 11, 12]], columns=cidx)
718+
result = df.stack()
719+
720+
# `MultiIndex.from_product` preserves categorical dtype -
721+
# it's tested elsewhere.
722+
midx = pd.MultiIndex.from_product([df.index, cidx])
723+
expected = Series([10, 11, 12], index=midx)
724+
725+
tm.assert_series_equal(result, expected)
