Commit 96b8bb1

jschendel authored and jreback committed
ENH: Implement DataFrame.astype('category') (#18099)
1 parent 6ef4be3 commit 96b8bb1

4 files changed, +139 -24 lines changed

doc/source/categorical.rst

+84 -14
@@ -46,9 +46,14 @@ The categorical data type is useful in the following cases:
 
 See also the :ref:`API docs on categoricals<api.categorical>`.
 
+.. _categorical.objectcreation:
+
 Object Creation
 ---------------
 
+Series Creation
+~~~~~~~~~~~~~~~
+
 Categorical ``Series`` or columns in a ``DataFrame`` can be created in several ways:
 
 By specifying ``dtype="category"`` when constructing a ``Series``:
@@ -77,7 +82,7 @@ discrete bins. See the :ref:`example on tiling <reshaping.tile.cut>` in the docs
     df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
     df.head(10)
 
-By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to a `DataFrame`.
+By passing a :class:`pandas.Categorical` object to a ``Series`` or assigning it to a ``DataFrame``.
 
 .. ipython:: python
 
@@ -89,6 +94,55 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
     df["B"] = raw_cat
     df
 
+Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
+
+.. ipython:: python
+
+    df.dtypes
+
+DataFrame Creation
+~~~~~~~~~~~~~~~~~~
+
+Similar to the previous section where a single column was converted to categorical, all columns in a
+``DataFrame`` can be batch converted to categorical either during or after construction.
+
+This can be done during construction by specifying ``dtype="category"`` in the ``DataFrame`` constructor:
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype="category")
+    df.dtypes
+
+Note that the categories present in each column differ; the conversion is done column by column, so
+only labels present in a given column are categories:
+
+.. ipython:: python
+
+    df['A']
+    df['B']
+
+
+.. versionadded:: 0.23.0
+
+Analogously, all columns in an existing ``DataFrame`` can be batch converted using :meth:`DataFrame.astype`:
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
+    df_cat = df.astype('category')
+    df_cat.dtypes
+
+This conversion is likewise done column by column:
+
+.. ipython:: python
+
+    df_cat['A']
+    df_cat['B']
+
+
+Controlling Behavior
+~~~~~~~~~~~~~~~~~~~~
+
 In the examples above where we passed ``dtype='category'``, we used the default
 behavior:
 
@@ -108,21 +162,36 @@ of :class:`~pandas.api.types.CategoricalDtype`.
     s_cat = s.astype(cat_type)
     s_cat
 
-Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
+Similarly, a ``CategoricalDtype`` can be used with a ``DataFrame`` to ensure that categories
+are consistent among all columns.
 
 .. ipython:: python
 
-    df.dtypes
+    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
+    cat_type = CategoricalDtype(categories=list('abcd'),
+                                ordered=True)
+    df_cat = df.astype(cat_type)
+    df_cat['A']
+    df_cat['B']
 
 .. note::
 
-    In contrast to R's `factor` function, categorical data is not converting input values to
-    strings and categories will end up the same data type as the original values.
+    To perform table-wise conversion, where all labels in the entire ``DataFrame`` are used as
+    categories for each column, the ``categories`` parameter can be determined programatically by
+    ``categories = pd.unique(df.values.ravel())``.
 
-.. note::
+If you already have ``codes`` and ``categories``, you can use the
+:func:`~pandas.Categorical.from_codes` constructor to save the factorize step
+during normal constructor mode:
 
-    In contrast to R's `factor` function, there is currently no way to assign/change labels at
-    creation time. Use `categories` to change the categories after creation time.
+.. ipython:: python
+
+    splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
+    s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
+
+
+Regaining Original Data
+~~~~~~~~~~~~~~~~~~~~~~~
 
 To get back to the original ``Series`` or NumPy array, use
 ``Series.astype(original_dtype)`` or ``np.asarray(categorical)``:
@@ -136,14 +205,15 @@ To get back to the original ``Series`` or NumPy array, use
     s2.astype(str)
     np.asarray(s2)
 
-If you already have `codes` and `categories`, you can use the
-:func:`~pandas.Categorical.from_codes` constructor to save the factorize step
-during normal constructor mode:
+.. note::
 
-.. ipython:: python
+    In contrast to R's `factor` function, categorical data is not converting input values to
+    strings; categories will end up the same data type as the original values.
 
-    splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
-    s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
+.. note::
+
+    In contrast to R's `factor` function, there is currently no way to assign/change labels at
+    creation time. Use `categories` to change the categories after creation time.
 
 .. _categorical.categoricaldtype:
 

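The note added in this file describes table-wise conversion only in prose. A minimal sketch of it, next to the default column-wise behavior, might look like the following. It assumes a pandas release that includes this change (0.23.0 or later); the names `col_wise`, `all_labels`, `shared_type`, and `table_wise` are illustrative and do not appear in the commit.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})

# Column-wise conversion: each column only receives the labels it contains.
col_wise = df.astype('category')

# Table-wise conversion: collect every label in the frame first, then apply
# one shared dtype to all columns (illustrative names, not from the commit).
all_labels = pd.unique(df.values.ravel())
shared_type = CategoricalDtype(categories=all_labels)
table_wise = df.astype(shared_type)

print(list(col_wise['A'].cat.categories))
print(list(col_wise['B'].cat.categories))
print(list(table_wise['A'].cat.categories))
print(list(table_wise['B'].cat.categories))
```

With the shared dtype, both columns carry the full label set, whereas the column-wise result keeps a separate label set per column.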
doc/source/whatsnew/v0.23.0.txt

+31
@@ -268,6 +268,37 @@ The :func:`DataFrame.assign` now accepts dependent keyword arguments for python
 
     df.assign(A=df.A+1, C= lambda df: df.A* -1)
 
+
+.. _whatsnew_0230.enhancements.astype_category:
+
+``DataFrame.astype`` performs column-wise conversion to ``Categorical``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:meth:`DataFrame.astype` can now perform column-wise conversion to ``Categorical`` by supplying the string ``'category'`` or
+a :class:`~pandas.api.types.CategoricalDtype`. Previously, attempting this would raise a ``NotImplementedError``. See the
+:ref:`categorical.objectcreation` section of the documentation for more details and examples. (:issue:`12860`, :issue:`18099`)
+
+Supplying the string ``'category'`` performs column-wise conversion, with only labels appearing in a given column set as categories:
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
+    df = df.astype('category')
+    df['A'].dtype
+    df['B'].dtype
+
+
+Supplying a ``CategoricalDtype`` will make the categories in each column consistent with the supplied dtype:
+
+.. ipython:: python
+
+    from pandas.api.types import CategoricalDtype
+    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
+    cdt = CategoricalDtype(categories=list('abcd'), ordered=True)
+    df = df.astype(cdt)
+    df['A'].dtype
+    df['B'].dtype
+
 .. _whatsnew_0230.enhancements.other:
 
 Other Enhancements

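One way to observe the difference between the two forms described in the release note is to compare the resulting column dtypes directly. The sketch below assumes a pandas release that includes this change; the names `by_string`, `by_dtype`, `same_by_string`, and `same_by_dtype` are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})

# String form: categories are inferred separately for each column.
by_string = df.astype('category')

# Explicit dtype: every column receives the same categories and ordering.
cdt = CategoricalDtype(categories=list('abcd'), ordered=True)
by_dtype = df.astype(cdt)

# CategoricalDtype equality takes the categories into account.
same_by_string = by_string['A'].dtype == by_string['B'].dtype
same_by_dtype = by_dtype['A'].dtype == by_dtype['B'].dtype
print(same_by_string, same_by_dtype)
```

The first comparison is expected to be false because the inferred label sets differ between columns, while the second is true because both columns share the supplied dtype.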
pandas/core/generic.py

+7 -2
@@ -18,6 +18,7 @@
     is_number,
     is_integer, is_bool,
     is_bool_dtype,
+    is_categorical_dtype,
     is_numeric_dtype,
     is_datetime64_dtype,
     is_timedelta64_dtype,
@@ -4429,14 +4430,18 @@ def astype(self, dtype, copy=True, errors='raise', **kwargs):
                 if col_name not in self:
                     raise KeyError('Only a column name can be used for the '
                                    'key in a dtype mappings argument.')
-            from pandas import concat
             results = []
             for col_name, col in self.iteritems():
                 if col_name in dtype:
                     results.append(col.astype(dtype[col_name], copy=copy))
                 else:
                     results.append(results.append(col.copy() if copy else col))
-            return concat(results, axis=1, copy=False)
+            return pd.concat(results, axis=1, copy=False)
+
+        elif is_categorical_dtype(dtype) and self.ndim > 1:
+            # GH 18099: columnwise conversion to categorical
+            results = (self[col].astype(dtype, copy=copy) for col in self)
+            return pd.concat(results, axis=1, copy=False)
 
         # else, only a single dtype is given
         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,

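The new `elif` branch converts each column on its own and then reassembles the results with `pd.concat`. The same idea can be sketched outside of pandas internals as a standalone function. This is an illustration under stated assumptions, not the library's implementation: `astype_category_columnwise` is a hypothetical helper, and the `copy` arguments used in the diff are omitted for brevity.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def astype_category_columnwise(frame, dtype):
    """Convert every column of ``frame`` to ``dtype``, one column at a time.

    Hypothetical helper that mirrors the shape of the new branch; it is not
    part of pandas.
    """
    # Each column is converted independently, so with dtype='category' the
    # categories are inferred per column rather than for the whole frame.
    results = [frame[col].astype(dtype) for col in frame]
    # Reassemble the converted columns side by side.
    return pd.concat(results, axis=1)


df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
converted = astype_category_columnwise(df, 'category')
print(converted.dtypes)
```

Iterating over a `DataFrame` yields its column labels, which is why `for col in frame` (and `for col in self` in the diff) visits one column per step.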
pandas/tests/frame/test_dtypes.py

+17 -8
@@ -8,11 +8,11 @@
 
 import numpy as np
 from pandas import (DataFrame, Series, date_range, Timedelta, Timestamp,
-                    compat, concat, option_context)
+                    Categorical, compat, concat, option_context)
 from pandas.compat import u
 from pandas import _np_version_under1p14
 
-from pandas.core.dtypes.dtypes import DatetimeTZDtype
+from pandas.core.dtypes.dtypes import DatetimeTZDtype, CategoricalDtype
 from pandas.tests.frame.common import TestData
 from pandas.util.testing import (assert_series_equal,
                                  assert_frame_equal,
@@ -619,12 +619,21 @@ def test_astype_duplicate_col(self):
         expected = concat([a1_str, b, a2_str], axis=1)
         assert_frame_equal(result, expected)
 
-    @pytest.mark.parametrize('columns', [['x'], ['x', 'y'], ['x', 'y', 'z']])
-    def test_categorical_astype_ndim_raises(self, columns):
-        # GH 18004
-        msg = '> 1 ndim Categorical are not supported at this time'
-        with tm.assert_raises_regex(NotImplementedError, msg):
-            DataFrame(columns=columns).astype('category')
+    @pytest.mark.parametrize('dtype', [
+        'category',
+        CategoricalDtype(),
+        CategoricalDtype(ordered=True),
+        CategoricalDtype(ordered=False),
+        CategoricalDtype(categories=list('abcdef')),
+        CategoricalDtype(categories=list('edba'), ordered=False),
+        CategoricalDtype(categories=list('edcb'), ordered=True)], ids=repr)
+    def test_astype_categorical(self, dtype):
+        # GH 18099
+        d = {'A': list('abbc'), 'B': list('bccd'), 'C': list('cdde')}
+        df = DataFrame(d)
+        result = df.astype(dtype)
+        expected = DataFrame({k: Categorical(d[k], dtype=dtype) for k in d})
+        tm.assert_frame_equal(result, expected)
 
     @pytest.mark.parametrize("cls", [
         pd.api.types.CategoricalDtype,

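The new test asserts that `DataFrame.astype(dtype)` matches a frame built column by column from `Categorical(values, dtype=dtype)`. A standalone version of that check for a single parametrization might look like this. It is a sketch that assumes the public `pd.testing.assert_frame_equal` helper in place of the internal `tm` module used by the test suite, and it picks only the `list('abcdef')` case from the parameter list.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

d = {'A': list('abbc'), 'B': list('bccd'), 'C': list('cdde')}
dtype = CategoricalDtype(categories=list('abcdef'))

# Frame-level conversion, as exercised by the new code path.
result = pd.DataFrame(d).astype(dtype)

# Reference result: build each column as a Categorical with the same dtype.
expected = pd.DataFrame({k: pd.Categorical(v, dtype=dtype) for k, v in d.items()})

# Raises AssertionError if the two frames differ.
pd.testing.assert_frame_equal(result, expected)
```

Building the expected frame one `Categorical` at a time keeps the reference independent of the code path under test.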