Commit fc68669

ENH: Implement DataFrame.astype('category')
1 parent ce77b79 commit fc68669

4 files changed

+139
-28
lines changed

doc/source/categorical.rst

+83 -18
@@ -44,11 +44,26 @@ The categorical data type is useful in the following cases:
 * As a signal to other Python libraries that this column should be treated as a categorical
   variable (e.g. to use suitable statistical methods or plot types).

+.. note::
+
+    In contrast to R's `factor` function, categorical data is not converting input values to
+    strings and categories will end up the same data type as the original values.
+
+.. note::
+
+    In contrast to R's `factor` function, there is currently no way to assign/change labels at
+    creation time. Use `categories` to change the categories after creation time.
+
 See also the :ref:`API docs on categoricals<api.categorical>`.

+.. _categorical.objectcreation:
+
 Object Creation
 ---------------

+Series Creation
+~~~~~~~~~~~~~~~
+
 Categorical ``Series`` or columns in a ``DataFrame`` can be created in several ways:

 By specifying ``dtype="category"`` when constructing a ``Series``:
@@ -77,7 +92,7 @@ discrete bins. See the :ref:`example on tiling <reshaping.tile.cut>` in the docs
     df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
     df.head(10)

-By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to a `DataFrame`.
+By passing a :class:`pandas.Categorical` object to a ``Series`` or assigning it to a ``DataFrame``.

 .. ipython:: python

@@ -89,6 +104,56 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
     df["B"] = raw_cat
     df

+Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
+
+.. ipython:: python
+
+    df.dtypes
+
+DataFrame Creation
+~~~~~~~~~~~~~~~~~~
+
+Columns in a ``DataFrame`` can be batch converted to categorical, either at the time of construction
+or after construction. The conversion to categorical is done on a column by column basis; labels present
+in one column will not be carried over and used as categories in another column.
+
+Columns can be batch converted by specifying ``dtype="category"`` when constructing a ``DataFrame``:
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype="category")
+    df.dtypes
+
+Note that the categories present in each column differ; since the conversion is done on a column by column
+basis, only labels present in a given column are categories:
+
+.. ipython:: python
+
+    df['A']
+    df['B']
+
+
+.. versionadded:: 0.23.0
+
+Similarly, columns in an existing ``DataFrame`` can be batch converted using :meth:`DataFrame.astype`:
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
+    df_cat = df.astype('category')
+    df_cat.dtypes
+
+This conversion is likewise done on a column by column basis:
+
+.. ipython:: python
+
+    df_cat['A']
+    df_cat['B']
+
+
+Controlling Behavior
+~~~~~~~~~~~~~~~~~~~~
+
 In the examples above where we passed ``dtype='category'``, we used the default
 behavior:

@@ -108,21 +173,30 @@ of :class:`~pandas.api.types.CategoricalDtype`.
     s_cat = s.astype(cat_type)
     s_cat

-Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
+Similarly, a ``CategoricalDtype`` can be used with a ``DataFrame`` to ensure that categories
+are consistent among all columns.

 .. ipython:: python

-    df.dtypes
+    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
+    cat_type = CategoricalDtype(categories=list('abcd'),
+                                ordered=True)
+    df_cat = df.astype(cat_type)
+    df_cat['A']
+    df_cat['B']

-.. note::
+If you already have `codes` and `categories`, you can use the
+:func:`~pandas.Categorical.from_codes` constructor to save the factorize step
+during normal constructor mode:

-    In contrast to R's `factor` function, categorical data is not converting input values to
-    strings and categories will end up the same data type as the original values.
+.. ipython:: python

-.. note::
+    splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
+    s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))

-    In contrast to R's `factor` function, there is currently no way to assign/change labels at
-    creation time. Use `categories` to change the categories after creation time.
+
+Regaining Original Data
+~~~~~~~~~~~~~~~~~~~~~~~

 To get back to the original ``Series`` or NumPy array, use
 ``Series.astype(original_dtype)`` or ``np.asarray(categorical)``:
@@ -136,15 +210,6 @@ To get back to the original ``Series`` or NumPy array, use
     s2.astype(str)
     np.asarray(s2)

-If you already have `codes` and `categories`, you can use the
-:func:`~pandas.Categorical.from_codes` constructor to save the factorize step
-during normal constructor mode:
-
-.. ipython:: python
-
-    splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
-    s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
-
 .. _categorical.categoricaldtype:

 CategoricalDtype
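
The ``from_codes`` and "Regaining Original Data" passages touched by this diff can be illustrated with a small round trip. This is a minimal sketch, not part of the commit; the fixed ``codes`` array is an assumption chosen for reproducibility in place of the random ``splitter`` used in the docs.

```python
import numpy as np
import pandas as pd

# Integer codes index into the categories list: 0 -> "train", 1 -> "test".
codes = np.array([0, 1, 1, 0, 1])
cat = pd.Categorical.from_codes(codes, categories=["train", "test"])
s = pd.Series(cat)

# Going back to plain values: either a NumPy array or a non-categorical Series.
as_array = np.asarray(s)
as_str = s.astype(str)

print(list(as_array))
print(as_str.dtype)
```

Because the codes are supplied directly, the constructor does not need to factorize the values first, which is the step the documentation says is saved.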

doc/source/whatsnew/v0.23.0.txt

+32
@@ -259,6 +259,38 @@ The :func:`DataFrame.assign` now accepts dependent keyword arguments for python

     df.assign(A=df.A+1, C= lambda df: df.A* -1)

+
+.. _whatsnew_0230.enhancements.astype_category:
+
+``DataFrame.astype`` performs columnwise conversion to ``Categorical``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:meth:`DataFrame.astype` can now perform columnwise conversion to ``Categorical`` by supplying the string ``'category'`` or a :class:`~pandas.api.types.CategoricalDtype`.
+Previously, attempting this would raise a ``NotImplementedError``. (:issue:`18099`)
+
+Supplying the string ``'category'`` performs columnwise conversion, with only labels appearing in a given column set as categories:
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
+    df = df.astype('category')
+    df['A'].dtype
+    df['B'].dtype
+
+
+Supplying a ``CategoricalDtype`` will make the categories in each column consistent with the supplied dtype:
+
+.. ipython:: python
+
+    from pandas.api.types import CategoricalDtype
+    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
+    cdt = CategoricalDtype(categories=list('abcd'), ordered=True)
+    df = df.astype(cdt)
+    df['A'].dtype
+    df['B'].dtype
+
+See the :ref:`categorical.objectcreation` section of the documentation for more details and examples.
+
 .. _whatsnew_0230.enhancements.other:

 Other Enhancements
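
The difference between the two forms described in this whatsnew entry can be checked directly. The following is a minimal sketch, assuming a pandas version that includes this change (0.23.0 or later); the variable names are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# String alias: each column infers its categories from its own values only.
per_column = df.astype("category")
cats_a = list(per_column["A"].cat.categories)
cats_b = list(per_column["B"].cat.categories)

# Explicit dtype: every column receives the same categories and ordering.
shared_type = CategoricalDtype(categories=list("abcd"), ordered=True)
shared = df.astype(shared_type)
same_dtype = shared["A"].dtype == shared["B"].dtype

print(cats_a, cats_b, same_dtype)
```

With the string alias the two columns end up with different dtypes, so an explicit ``CategoricalDtype`` is the safer choice when columns need to be compared or combined later.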

pandas/core/generic.py

+7 -2
@@ -18,6 +18,7 @@
     is_number,
     is_integer, is_bool,
     is_bool_dtype,
+    is_categorical_dtype,
     is_numeric_dtype,
     is_datetime64_dtype,
     is_timedelta64_dtype,
@@ -4429,14 +4430,18 @@ def astype(self, dtype, copy=True, errors='raise', **kwargs):
                 if col_name not in self:
                     raise KeyError('Only a column name can be used for the '
                                    'key in a dtype mappings argument.')
-            from pandas import concat
             results = []
             for col_name, col in self.iteritems():
                 if col_name in dtype:
                     results.append(col.astype(dtype[col_name], copy=copy))
                 else:
                     results.append(col.copy() if copy else col)
-            return concat(results, axis=1, copy=False)
+            return pd.concat(results, axis=1, copy=False)
+
+        elif is_categorical_dtype(dtype) and self.ndim > 1:
+            # GH 18099: columnwise conversion to categorical
+            results = (self[col].astype(dtype, copy=copy) for col in self)
+            return pd.concat(results, axis=1, copy=False)

         # else, only a single dtype is given
         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
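
The new ``elif`` branch converts each column on its own and reassembles the results. A standalone sketch of that idea, using only public pandas API, might look like the following. The helper name ``astype_categorical_columnwise`` is hypothetical and not part of pandas, and the ``copy`` arguments from the commit are left out to keep the sketch short.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def astype_categorical_columnwise(frame, dtype):
    """Hypothetical helper: cast every column separately, then rejoin.

    Assumes unique column labels, so that ``frame[col]`` is a Series.
    """
    converted = [frame[col].astype(dtype) for col in frame]
    return pd.concat(converted, axis=1)


df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
out = astype_categorical_columnwise(df, "category")
print(out.dtypes)
```

Casting one ``Series`` at a time is what lets the string ``'category'`` infer a separate set of categories for each column.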

pandas/tests/frame/test_dtypes.py

+17 -8
@@ -8,11 +8,11 @@

 import numpy as np
 from pandas import (DataFrame, Series, date_range, Timedelta, Timestamp,
-                    compat, concat, option_context)
+                    Categorical, compat, concat, option_context)
 from pandas.compat import u
 from pandas import _np_version_under1p14

-from pandas.core.dtypes.dtypes import DatetimeTZDtype
+from pandas.core.dtypes.dtypes import DatetimeTZDtype, CategoricalDtype
 from pandas.tests.frame.common import TestData
 from pandas.util.testing import (assert_series_equal,
                                  assert_frame_equal,
@@ -619,12 +619,21 @@ def test_astype_duplicate_col(self):
         expected = concat([a1_str, b, a2_str], axis=1)
         assert_frame_equal(result, expected)

-    @pytest.mark.parametrize('columns', [['x'], ['x', 'y'], ['x', 'y', 'z']])
-    def test_categorical_astype_ndim_raises(self, columns):
-        # GH 18004
-        msg = '> 1 ndim Categorical are not supported at this time'
-        with tm.assert_raises_regex(NotImplementedError, msg):
-            DataFrame(columns=columns).astype('category')
+    @pytest.mark.parametrize('dtype', [
+        'category',
+        CategoricalDtype(),
+        CategoricalDtype(ordered=True),
+        CategoricalDtype(ordered=False),
+        CategoricalDtype(categories=list('abcdef')),
+        CategoricalDtype(categories=list('edba'), ordered=False),
+        CategoricalDtype(categories=list('edcb'), ordered=True)], ids=repr)
+    def test_astype_categorical(self, dtype):
+        # GH 18099
+        d = {'A': list('abbc'), 'B': list('bccd'), 'C': list('cdde')}
+        df = DataFrame(d)
+        result = df.astype(dtype)
+        expected = DataFrame({k: Categorical(d[k], dtype=dtype) for k in d})
+        tm.assert_frame_equal(result, expected)

     @pytest.mark.parametrize("cls", [
         pd.api.types.CategoricalDtype,
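
The check made by ``test_astype_categorical`` can be reproduced outside the test suite for a single parametrization. This is a sketch under assumptions: it uses the public ``pandas.testing`` module rather than the internal ``pandas.util.testing`` import shown above, and it picks one dtype whose categories cover every value in the data.

```python
import pandas as pd
import pandas.testing as tm
from pandas.api.types import CategoricalDtype

d = {"A": list("abbc"), "B": list("bccd"), "C": list("cdde")}
dtype = CategoricalDtype(categories=list("abcdef"), ordered=True)

# Frame-level conversion should match building each column as a Categorical.
result = pd.DataFrame(d).astype(dtype)
expected = pd.DataFrame({k: pd.Categorical(d[k], dtype=dtype) for k in d})

# Raises AssertionError if the two frames differ in values or dtypes.
tm.assert_frame_equal(result, expected)
```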
