Commit e57f189

TomAugspurger authored and jreback committed
Categorical type (#16015)
Closes #14711 Closes #15078 Closes #14676
1 parent ecd2ad9 commit e57f189

31 files changed, +1092 -288 lines changed

doc/source/advanced.rst (+3 -1)

@@ -638,9 +638,11 @@ and allows efficient indexing and storage of an index with a large number of dup
 
 .. ipython:: python
 
+    from pandas.api.types import CategoricalDtype
+
     df = pd.DataFrame({'A': np.arange(6),
                        'B': list('aabbca')})
-    df['B'] = df['B'].astype('category', categories=list('cab'))
+    df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
     df
     df.dtypes
     df.B.cat.categories
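
The hunk above builds a categorical column with the category order ``c, a, b``. As a minimal sketch of why that order matters for indexing, the snippet below sets the column as the index and sorts it. The name ``df2`` and the ``print`` call are illustrative and not part of the commit.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

# Using the categorical column as the index yields a CategoricalIndex.
df2 = df.set_index("B")
index_categories = list(df2.index.categories)

# Sorting the index follows the category order (c, a, b), not alphabetical order.
sorted_labels = df2.sort_index().index.tolist()

print(type(df2.index).__name__, index_categories, sorted_labels)
```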

doc/source/api.rst (+4 -1)

@@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
 Categorical
 ~~~~~~~~~~~
 
-If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
+.. autoclass:: api.types.CategoricalDtype
+   :members: categories, ordered
+
+If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
 data. This accessor is similar to the ``Series.dt`` or ``Series.str`` and has the
 following usable methods and properties:
 
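
The changed paragraph above describes the ``Series.cat`` accessor. A short sketch of what the accessor exposes on a Series with a ``CategoricalDtype`` follows. The sample data and variable names are made up for illustration, and only accessor members that exist in current pandas are used.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "a"], dtype=CategoricalDtype(["a", "b", "c"]))

# Properties describing the categorical type and its integer encoding.
categories = list(s.cat.categories)
is_ordered = s.cat.ordered
codes = s.cat.codes.tolist()

# Accessor methods return a new Series rather than modifying ``s``.
renamed = s.cat.rename_categories({"a": "x", "b": "y", "c": "z"})
trimmed = s.cat.remove_unused_categories()

print(categories, is_ordered, codes)
print(renamed.tolist(), list(trimmed.cat.categories))
```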
doc/source/categorical.rst (+95 -8)

@@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
     df["B"] = raw_cat
     df
 
-You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
+Anywhere above we passed a keyword ``dtype='category'``, we used the default behavior of
+
+1. categories are inferred from the data
+2. categories are unordered.
+
+To control those behaviors, instead of passing ``'category'``, use an instance
+of :class:`~pandas.api.types.CategoricalDtype`.
 
 .. ipython:: python
 
-    s = pd.Series(["a","b","c","a"])
-    s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
+    from pandas.api.types import CategoricalDtype
+
+    s = pd.Series(["a", "b", "c", "a"])
+    cat_type = CategoricalDtype(categories=["b", "c", "d"],
+                                ordered=True)
+    s_cat = s.astype(cat_type)
     s_cat
 
 Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
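
The two defaults listed in the hunk above (inferred categories, unordered) can be contrasted with an explicit ``CategoricalDtype`` side by side. This is a small sketch, not text from the commit. The category list ``["c", "b", "a", "d"]`` is chosen only for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])

# Default behavior: categories inferred from the data, unordered.
inferred = s.astype("category")

# Explicit behavior: caller-chosen categories (including the unused "d") and an order.
explicit = s.astype(CategoricalDtype(categories=["c", "b", "a", "d"], ordered=True))

print(list(inferred.cat.categories), inferred.cat.ordered)
print(list(explicit.cat.categories), explicit.cat.ordered)
```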
@@ -133,6 +143,75 @@ constructor to save the factorize step during normal constructor mode:
     splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
     s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
 
+.. _categorical.categoricaldtype:
+
+CategoricalDtype
+----------------
+
+.. versionchanged:: 0.21.0
+
+A categorical's type is fully described by
+
+1. ``categories``: a sequence of unique values and no missing values
+2. ``ordered``: a boolean
+
+This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
+The ``categories`` argument is optional, which implies that the actual categories
+should be inferred from whatever is present in the data when the
+:class:`pandas.Categorical` is created. The categories are assumed to be unordered
+by default.
+
+.. ipython:: python
+
+    from pandas.api.types import CategoricalDtype
+
+    CategoricalDtype(['a', 'b', 'c'])
+    CategoricalDtype(['a', 'b', 'c'], ordered=True)
+    CategoricalDtype()
+
+A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
+expects a `dtype`. For example :func:`pandas.read_csv`,
+:func:`pandas.DataFrame.astype`, or in the Series constructor.
+
+.. note::
+
+   As a convenience, you can use the string ``'category'`` in place of a
+   :class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
+   the categories being unordered, and equal to the set values present in the
+   array. In other words, ``dtype='category'`` is equivalent to
+   ``dtype=CategoricalDtype()``.
+
+Equality Semantics
+~~~~~~~~~~~~~~~~~~
+
+Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
+whenever they have the same categories and orderedness. When comparing two
+unordered categoricals, the order of the ``categories`` is not considered
+
+.. ipython:: python
+
+    c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)
+
+    # Equal, since order is not considered when ordered=False
+    c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)
+
+    # Unequal, since the second CategoricalDtype is ordered
+    c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)
+
+All instances of ``CategoricalDtype`` compare equal to the string ``'category'``
+
+.. ipython:: python
+
+    c1 == 'category'
+
+.. warning::
+
+   Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
+   and since all instances ``CategoricalDtype`` compare equal to ``'category'``,
+   all instances of ``CategoricalDtype`` compare equal to a
+   ``CategoricalDtype(None, False)``, regardless of ``categories`` or
+   ``ordered``.
+
 Description
 -----------
 
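
The equality rules added in the hunk above can be checked directly. The sketch below repeats the two comparisons from the added docs and adds one more case, two ordered dtypes whose categories are listed in a different order, which is an extension for illustration rather than something the commit shows. It deliberately leaves out the comparison against ``CategoricalDtype(None, False)`` described in the warning, since that statement is specific to the version this commit targets.

```python
from pandas.api.types import CategoricalDtype

c1 = CategoricalDtype(["a", "b", "c"], ordered=False)

# Unordered dtypes: the listing order of the categories is ignored.
same_set = c1 == CategoricalDtype(["b", "c", "a"], ordered=False)

# Same categories but different orderedness: not equal.
ordered_differs = c1 == CategoricalDtype(["a", "b", "c"], ordered=True)

# Ordered dtypes: the listing order is part of the type.
o1 = CategoricalDtype(["a", "b", "c"], ordered=True)
o2 = CategoricalDtype(["c", "b", "a"], ordered=True)
order_matters = o1 == o2

# Any CategoricalDtype compares equal to the string 'category'.
matches_string = c1 == "category"

print(same_set, ordered_differs, order_matters, matches_string)
```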
@@ -184,7 +263,7 @@ It's also possible to pass in the categories in a specific order:
 
 .. ipython:: python
 
-    s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
+    s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
     s
 
     # categories
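
The ``astype`` call in the hunk above is one of several places that accept a ``CategoricalDtype``. The docs added by this commit also name the Series constructor and ``read_csv``. The following sketch reuses one dtype instance in all three. The CSV text and the column name ``letter`` are invented for the example.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

letters = CategoricalDtype(list("abcd"))

# 1. Converting an existing Series.
via_astype = pd.Series(list("babc")).astype(letters)

# 2. Passing the dtype to the Series constructor.
via_constructor = pd.Series(list("babc"), dtype=letters)

# 3. Passing the dtype per column to read_csv (hypothetical in-memory CSV).
csv_text = "letter\nb\na\nb\nc\n"
via_read_csv = pd.read_csv(io.StringIO(csv_text), dtype={"letter": letters})["letter"]

for result in (via_astype, via_constructor, via_read_csv):
    print(list(result.cat.categories), result.cat.codes.tolist())
```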
@@ -301,7 +380,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``
 
     s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
     s.sort_values(inplace=True)
-    s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
+    s = pd.Series(["a","b","c","a"]).astype(
+        CategoricalDtype(ordered=True)
+    )
     s.sort_values(inplace=True)
     s
     s.min(), s.max()
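
The hunk above sorts an ordered categorical and takes its minimum and maximum. To make the effect of the order visible, this sketch uses an explicit, non-alphabetical category order (``c < b < a``), which is an assumption made for illustration. The commit's own example lets pandas infer the categories.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

reverse_order = CategoricalDtype(["c", "b", "a"], ordered=True)
s = pd.Series(["a", "b", "c", "a"]).astype(reverse_order)

# Sorting and min/max follow the category order, not the lexical order.
sorted_values = s.sort_values().tolist()
lowest, highest = s.min(), s.max()

# On an unordered categorical, min() is not defined and raises TypeError.
unordered = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))
try:
    unordered.min()
    unordered_min_raised = False
except TypeError:
    unordered_min_raised = True

print(sorted_values, lowest, highest, unordered_min_raised)
```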
@@ -401,9 +482,15 @@ categories or a categorical with any list-like object, will raise a TypeError.
 
 .. ipython:: python
 
-    cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
-    cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
-    cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
+    cat = pd.Series([1,2,3]).astype(
+        CategoricalDtype([3, 2, 1], ordered=True)
+    )
+    cat_base = pd.Series([2,2,2]).astype(
+        CategoricalDtype([3, 2, 1], ordered=True)
+    )
+    cat_base2 = pd.Series([2,2,2]).astype(
+        CategoricalDtype(ordered=True)
+    )
 
     cat
     cat_base
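
The three Series defined in the hunk above exist to demonstrate comparisons. A brief sketch of how they behave: with the order ``3 < 2 < 1``, comparisons follow the category order, and comparing against a categorical with different categories raises ``TypeError``, as the hunk's context line states. The result variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# Same categories and order: elementwise comparison uses the order 3 < 2 < 1.
greater_than_base = (cat > cat_base).tolist()

# Comparing against a scalar that is one of the categories also works.
greater_than_two = (cat > 2).tolist()

# Different categories (cat_base2 only has the inferred category 2): TypeError.
try:
    cat > cat_base2
    mismatch_raised = False
except TypeError:
    mismatch_raised = True

print(greater_than_base, greater_than_two, mismatch_raised)
```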

doc/source/merging.rst (+8 -3)

@@ -830,8 +830,10 @@ The left frame.
 
 .. ipython:: python
 
+    from pandas.api.types import CategoricalDtype
+
     X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
-    X = X.astype('category', categories=['foo', 'bar'])
+    X = X.astype(CategoricalDtype(categories=['foo', 'bar']))
 
     left = pd.DataFrame({'X': X,
                          'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +844,11 @@ The right frame.
 
 .. ipython:: python
 
-    right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
-                          'Z': [1, 2]})
+    right = pd.DataFrame({
+        'X': pd.Series(['foo', 'bar'],
+                       dtype=CategoricalDtype(['foo', 'bar'])),
+        'Z': [1, 2]
+    })
     right
     right.dtypes
 
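
The two merging.rst hunks above build a left and a right frame whose key column ``X`` shares one ``CategoricalDtype``. As a sketch of what that sets up, the snippet below merges two such frames and inspects the key's dtype afterwards. Fixed sample values replace the random data from the docs so the result is reproducible, and the ``pd.merge`` call is an assumed continuation rather than part of this diff.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

key_type = CategoricalDtype(categories=["foo", "bar"])

left = pd.DataFrame({
    "X": pd.Series(["foo", "bar", "foo", "bar"], dtype=key_type),
    "Y": ["one", "two", "three", "one"],
})
right = pd.DataFrame({
    "X": pd.Series(["foo", "bar"], dtype=key_type),
    "Z": [1, 2],
})

# Both key columns have the same categorical dtype, so the merged key keeps it.
merged = pd.merge(left, right, on="X", how="outer")

print(merged.dtypes)
```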

doc/source/whatsnew/v0.21.0.txt (+27)

@@ -10,6 +10,8 @@ users upgrade to this version.
 Highlights include:
 
 - Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
+- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
+  categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.
 
 Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.
 
@@ -89,6 +91,31 @@ This does not raise any obvious exceptions, but also does not create a new colum
 
 Setting a list-like data structure into a new attribute now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.
 
+.. _whatsnew_0210.enhancements.categorical_dtype:
+
+``CategoricalDtype`` for specifying categoricals
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
+expanded to include the ``categories`` and ``ordered`` attributes. A
+``CategoricalDtype`` can be used to specify the set of categories and
+orderedness of an array, independent of the data themselves. This can be useful,
+e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
+:issue:`15078`, :issue:`16015`):
+
+.. ipython:: python
+
+    from pandas.api.types import CategoricalDtype
+
+    s = pd.Series(['a', 'b', 'c', 'a'])  # strings
+    dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
+    s.astype(dtype)
+
+The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
+``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
+
+See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.
+
 .. _whatsnew_0210.enhancements.other:
 
 Other Enhancements
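
The whatsnew entry above says that ``.dtype`` on a ``Categorical``, a ``CategoricalIndex``, or a categorical ``Series`` returns a ``CategoricalDtype`` instance. A small sketch of that claim, reusing the entry's own ``s`` and ``dtype`` and adding illustrative names for the results:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])  # strings
dtype = CategoricalDtype(categories=["a", "b", "c", "d"], ordered=True)
converted = s.astype(dtype)

# The dtype of each categorical container is a CategoricalDtype instance.
series_dtype = converted.dtype
array_dtype = pd.Categorical(["a", "b"]).dtype
index_dtype = pd.CategoricalIndex(["a", "b"]).dtype

# The instance carries the categories and orderedness that were requested.
print(list(series_dtype.categories), series_dtype.ordered)
print(type(array_dtype).__name__, type(index_dtype).__name__)
```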
