Commit 507467f

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still available with categories=None (the default), is to infer the categories from whatever is present.

This change will make it easy to support specifying categories that are known ahead of time in other places, e.g. .astype, .read_csv, and the Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
1 parent 4004367 commit 507467f
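
The behavior described in the commit message can be sketched as follows. This is a minimal illustration, not code from the commit: the variable names (`size_dtype`, `raw`, `sizes`, `inferred`) and the sample values are made up, and it assumes a pandas version that ships `pandas.api.types.CategoricalDtype`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare the categories and their order before seeing any data
# (hypothetical example values).
size_dtype = CategoricalDtype(categories=["small", "medium", "large"],
                              ordered=True)

raw = pd.Series(["medium", "small", "medium", "huge"])
sizes = raw.astype(size_dtype)

# Values that are not among the declared categories become missing.
print(sizes.isna().tolist())

# Because the dtype is ordered, min() and max() follow the declared order.
print(sizes.min(), sizes.max())

# With the plain string 'category', the categories are inferred from the data.
inferred = raw.astype("category")
print(list(inferred.cat.categories))
```

Declaring the dtype up front keeps the category set stable even when a particular batch of data does not contain every category, which is the use case the commit message points to for `.astype`, `.read_csv`, and the Series constructor.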

31 files changed, +1079 -288 lines changed

doc/source/advanced.rst

+3 -1
@@ -638,9 +638,11 @@ and allows efficient indexing and storage of an index with a large number of dup

 .. ipython:: python

+   from pandas.api.types import CategoricalDtype
+
    df = pd.DataFrame({'A': np.arange(6),
                       'B': list('aabbca')})
-   df['B'] = df['B'].astype('category', categories=list('cab'))
+   df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
    df
    df.dtypes
    df.B.cat.categories

doc/source/api.rst

+4 -1
@@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
 Categorical
 ~~~~~~~~~~~

-If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
+.. autoclass:: api.types.CategoricalDtype
+   :members: categories, ordered
+
+If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
 data. This accessor is similar to the ``Series.dt`` or ``Series.str`` and has the
 following usable methods and properties:

doc/source/categorical.rst

+95 -8
@@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
    df["B"] = raw_cat
    df

-You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
+Anywhere above we passed a keyword ``dtype='category'``, we used the default behavior of
+
+1. categories are inferred from the data
+2. categories are unordered.
+
+To control those behaviors, instead of passing ``'category'``, use an instance
+of :class:`~pandas.api.types.CategoricalDtype`.

 .. ipython:: python

-   s = pd.Series(["a","b","c","a"])
-   s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
+   from pandas.api.types import CategoricalDtype
+
+   s = pd.Series(["a", "b", "c", "a"])
+   cat_type = CategoricalDtype(categories=["b", "c", "d"],
+                               ordered=True)
+   s_cat = s.astype(cat_type)
    s_cat

 Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
@@ -133,6 +143,75 @@ constructor to save the factorize step during normal constructor mode:
    splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
    s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))

+.. _categorical.categoricaldtype:
+
+CategoricalDtype
+----------------
+
+.. versionchanged:: 0.21.0
+
+A categorical's type is fully described by
+
+1. ``categories``: a sequence of unique values and no missing values
+2. ``ordered``: a boolean
+
+This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
+The ``categories`` argument is optional, which implies that the actual categories
+should be inferred from whatever is present in the data when the
+:class:`pandas.Categorical` is created. The categories are assumed to be unordered
+by default.
+
+.. ipython:: python
+
+   from pandas.api.types import CategoricalDtype
+
+   CategoricalDtype(['a', 'b', 'c'])
+   CategoricalDtype(['a', 'b', 'c'], ordered=True)
+   CategoricalDtype()
+
+A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
+expects a `dtype`. For example :func:`pandas.read_csv`,
+:func:`pandas.DataFrame.astype`, or in the Series constructor.
+
+.. note::
+
+   As a convenience, you can use the string ``'category'`` in place of a
+   :class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
+   the categories being unordered, and equal to the set of values present in the
+   array. In other words, ``dtype='category'`` is equivalent to
+   ``dtype=CategoricalDtype()``.
+
+Equality Semantics
+~~~~~~~~~~~~~~~~~~
+
+Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
+whenever they have the same categories and orderedness. When comparing two
+unordered categoricals, the order of the ``categories`` is not considered.
+
+.. ipython:: python
+
+   c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)
+
+   # Equal, since order is not considered when ordered=False
+   c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)
+
+   # Unequal, since the second CategoricalDtype is ordered
+   c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)
+
+All instances of ``CategoricalDtype`` compare equal to the string ``'category'``.
+
+.. ipython:: python
+
+   c1 == 'category'
+
+.. warning::
+
+   Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
+   and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
+   all instances of ``CategoricalDtype`` compare equal to a
+   ``CategoricalDtype(None, False)``, regardless of ``categories`` or
+   ``ordered``.
+
 Description
 -----------

@@ -184,7 +263,7 @@ It's also possible to pass in the categories in a specific order:

 .. ipython:: python

-   s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
+   s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
    s

    # categories
@@ -301,7 +380,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``

    s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
    s.sort_values(inplace=True)
-   s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
+   s = pd.Series(["a","b","c","a"]).astype(
+       CategoricalDtype(ordered=True)
+   )
    s.sort_values(inplace=True)
    s
    s.min(), s.max()
@@ -401,9 +482,15 @@ categories or a categorical with any list-like object, will raise a TypeError.

 .. ipython:: python

-   cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
-   cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
-   cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
+   cat = pd.Series([1,2,3]).astype(
+       CategoricalDtype([3, 2, 1], ordered=True)
+   )
+   cat_base = pd.Series([2,2,2]).astype(
+       CategoricalDtype([3, 2, 1], ordered=True)
+   )
+   cat_base2 = pd.Series([2,2,2]).astype(
+       CategoricalDtype(ordered=True)
+   )

    cat
    cat_base

doc/source/merging.rst

+8 -3
@@ -830,8 +830,10 @@ The left frame.

 .. ipython:: python

+   from pandas.api.types import CategoricalDtype
+
    X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
-   X = X.astype('category', categories=['foo', 'bar'])
+   X = X.astype(CategoricalDtype(categories=['foo', 'bar']))

    left = pd.DataFrame({'X': X,
                         'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +844,11 @@ The right frame.

 .. ipython:: python

-   right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
-                         'Z': [1, 2]})
+   right = pd.DataFrame({
+       'X': pd.Series(['foo', 'bar'],
+                      dtype=CategoricalDtype(['foo', 'bar'])),
+       'Z': [1, 2]
+   })
    right
    right.dtypes

doc/source/whatsnew/v0.21.0.txt

+27
@@ -10,6 +10,8 @@ users upgrade to this version.
 Highlights include:

 - Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
+- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
+  categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.

 Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.

@@ -89,6 +91,31 @@ This does not raise any obvious exceptions, but also does not create a new colum

 Setting a list-like data structure into a new attribute now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.

+.. _whatsnew_0210.enhancements.categorical_dtype:
+
+``CategoricalDtype`` for specifying categoricals
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
+expanded to include the ``categories`` and ``ordered`` attributes. A
+``CategoricalDtype`` can be used to specify the set of categories and
+orderedness of an array, independent of the data themselves. This can be useful,
+e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
+:issue:`15078`, :issue:`16015`):
+
+.. ipython:: python
+
+   from pandas.api.types import CategoricalDtype
+
+   s = pd.Series(['a', 'b', 'c', 'a'])  # strings
+   dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
+   s.astype(dtype)
+
+The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
+``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
+
+See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.
+
 .. _whatsnew_0210.enhancements.other:

 Other Enhancements
