Commit 7314570

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still possible with categories=None (the default), is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories that are known ahead of time in other places, e.g. .astype, .read_csv, and the Series constructor.

Closes #14711
Closes #15078
Closes #14676
1 parent 94266d4 commit 7314570

31 files changed

+887
-250
lines changed

doc/source/advanced.rst

+3-1
@@ -638,9 +638,11 @@ and allows efficient indexing and storage of an index with a large number of dup
638638

639639
.. ipython:: python
640640
641+
from pandas.api.types import CategoricalDtype
642+
641643
df = pd.DataFrame({'A': np.arange(6),
642644
'B': list('aabbca')})
643-
df['B'] = df['B'].astype('category', categories=list('cab'))
645+
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
644646
df
645647
df.dtypes
646648
df.B.cat.categories
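
The effect of this hunk can be checked with a small standalone sketch. It assumes a pandas version in which ``CategoricalDtype`` is importable from ``pandas.api.types``; the names ``inferred`` and ``explicit`` are illustrative and do not appear in the commit.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

# dtype='category' infers the categories from the data (sorted unique values).
inferred = df["B"].astype("category")

# A CategoricalDtype fixes the categories, and their order, ahead of time.
explicit = df["B"].astype(CategoricalDtype(list("cab")))

inferred_categories = list(inferred.cat.categories)
explicit_categories = list(explicit.cat.categories)

print(inferred_categories)  # ['a', 'b', 'c']
print(explicit_categories)  # ['c', 'a', 'b']
```

Both columns hold the same values; only the category listing differs, which is what the replaced ``astype('category', categories=...)`` call used to control.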

doc/source/api.rst

+4-1
@@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
646646
Categorical
647647
~~~~~~~~~~~
648648

649-
If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
649+
.. autoclass:: api.types.CategoricalDtype
650+
:members: categories, ordered
651+
652+
If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
650653
data. This accessor is similar to the ``Series.dt`` or ``Series.str`` and has the
651654
following usable methods and properties:
652655

doc/source/categorical.rst

+90-8
@@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
8989
df["B"] = raw_cat
9090
df
9191
92-
You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
92+
Anywhere above we passed a keyword ``dtype='category'``, we used the default behavior of
93+
94+
1. categories are inferred from the data
95+
2. categories are unordered.
96+
97+
To control those behaviors, instead of passing ``'category'``, use an instance
98+
of :class:`~pandas.api.types.CategoricalDtype`.
9399

94100
.. ipython:: python
95101
96-
s = pd.Series(["a","b","c","a"])
97-
s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
102+
from pandas.api.types import CategoricalDtype
103+
104+
s = pd.Series(["a", "b", "c", "a"])
105+
cat_type = CategoricalDtype(categories=["b", "c", "d"],
106+
ordered=False)
107+
s_cat = s.astype(cat_type)
98108
s_cat
99109
100110
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
@@ -133,6 +143,70 @@ constructor to save the factorize step during normal constructor mode:
133143
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
134144
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
135145
146+
.. _categorical.categoricaldtype:
147+
148+
CategoricalDtype
149+
----------------
150+
151+
.. versionchanged:: 0.21.0
152+
153+
A categorical's type is fully described by
154+
155+
1. its categories: a sequence of unique values and no missing values
156+
2. its orderedness: a boolean
157+
158+
This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
159+
The ``categories`` argument is optional, which implies that the actual categories
160+
should be inferred from whatever is present in the data when the
161+
:class:`pandas.Categorical` is created.
162+
163+
.. ipython:: python
164+
165+
from pandas.api.types import CategoricalDtype
166+
167+
CategoricalDtype(['a', 'b', 'c'])
168+
CategoricalDtype(['a', 'b', 'c'], ordered=True)
169+
CategoricalDtype()
170+
171+
A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
172+
expects a `dtype`, for example in :func:`pandas.read_csv`,
173+
:func:`pandas.DataFrame.astype`, or in the Series constructor.
174+
175+
As a convenience, you can use the string ``'category'`` in place of a
176+
:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
177+
the categories being unordered, and equal to the set of values present in the
178+
array. In other words, ``dtype='category'`` is equivalent to
179+
``dtype=CategoricalDtype()``.
180+
181+
Equality Semantics
182+
~~~~~~~~~~~~~~~~~~
183+
184+
Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal whenever they have
185+
the same categories and orderedness. When comparing two unordered categoricals, the
186+
order of the ``categories`` is not considered.
187+
188+
.. ipython:: python
189+
190+
c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)
191+
192+
# Equal, since order is not considered when ordered=False
193+
c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)
194+
195+
# Unequal, since the second CategoricalDtype is ordered
196+
c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)
197+
198+
All instances of ``CategoricalDtype`` compare equal to the string ``'category'``.
199+
200+
.. ipython:: python
201+
202+
c1 == 'category'
203+
204+
.. warning::
205+
206+
Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
207+
and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
208+
all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``.
209+
136210
Description
137211
-----------
138212

@@ -182,7 +256,7 @@ It's also possible to pass in the categories in a specific order:
182256

183257
.. ipython:: python
184258
185-
s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
259+
s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
186260
s
187261
188262
# categories
@@ -295,7 +369,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``
295369
296370
s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
297371
s.sort_values(inplace=True)
298-
s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
372+
s = pd.Series(["a","b","c","a"]).astype(
373+
CategoricalDtype(ordered=True)
374+
)
299375
s.sort_values(inplace=True)
300376
s
301377
s.min(), s.max()
@@ -395,9 +471,15 @@ categories or a categorical with any list-like object, will raise a TypeError.
395471

396472
.. ipython:: python
397473
398-
cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
399-
cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
400-
cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
474+
cat = pd.Series([1,2,3]).astype(
475+
CategoricalDtype([3, 2, 1], ordered=True)
476+
)
477+
cat_base = pd.Series([2,2,2]).astype(
478+
CategoricalDtype([3, 2, 1], ordered=True)
479+
)
480+
cat_base2 = pd.Series([2,2,2]).astype(
481+
CategoricalDtype(ordered=True)
482+
)
401483
402484
cat
403485
cat_base
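
The equality and ordering rules described in this file's hunks can be sketched as follows. This is a minimal illustration rather than part of the commit; it deliberately avoids comparing against ``CategoricalDtype(None)``, since that corner case (see the warning above) may behave differently across pandas versions.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

unordered = CategoricalDtype(["a", "b", "c"], ordered=False)

# Unordered dtypes compare equal regardless of the order of the categories.
same_set = unordered == CategoricalDtype(["b", "c", "a"], ordered=False)
# Orderedness is part of the type, so this comparison is False.
ordered_differs = unordered == CategoricalDtype(["a", "b", "c"], ordered=True)
# Every CategoricalDtype compares equal to the string 'category'.
matches_string = unordered == "category"

# With ordered=True the category order defines comparisons: here 3 < 2 < 1.
reversed_dtype = CategoricalDtype([3, 2, 1], ordered=True)
cat = pd.Series([1, 2, 3]).astype(reversed_dtype)
cat_base = pd.Series([2, 2, 2]).astype(reversed_dtype)
greater = (cat > cat_base).tolist()
smallest, largest = cat.min(), cat.max()

# Categories are inferred here, so the dtype differs from reversed_dtype.
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))
try:
    cat > cat_base2
    raised_type_error = False
except TypeError:
    raised_type_error = True
```

Only the value ``1`` ranks above ``2`` under the reversed ordering, so ``greater`` is true for the first element alone, and comparing against ``cat_base2`` fails because its categories do not match.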

doc/source/merging.rst

+8-3
@@ -830,8 +830,10 @@ The left frame.
830830

831831
.. ipython:: python
832832
833+
from pandas.api.types import CategoricalDtype
834+
833835
X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
834-
X = X.astype('category', categories=['foo', 'bar'])
836+
X = X.astype(CategoricalDtype(categories=['foo', 'bar']))
835837
836838
left = pd.DataFrame({'X': X,
837839
'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +844,11 @@ The right frame.
842844

843845
.. ipython:: python
844846
845-
right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
846-
'Z': [1, 2]})
847+
right = pd.DataFrame({
848+
'X': pd.Series(['foo', 'bar'],
849+
dtype=CategoricalDtype(['foo', 'bar'])),
850+
'Z': [1, 2]
851+
})
847852
right
848853
right.dtypes
849854
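
As a rough check of the two construction styles used in this file, the sketch below builds both frames from one shared dtype and merges them. The fixed input values and the name ``shared`` are assumptions made for reproducibility; the documentation itself uses random data.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

shared = CategoricalDtype(categories=["foo", "bar"])

# Left frame: convert an existing Series with astype.
left = pd.DataFrame({
    "X": pd.Series(["foo", "bar", "foo"]).astype(shared),
    "Y": ["one", "two", "three"],
})

# Right frame: pass the dtype directly to the Series constructor.
right = pd.DataFrame({
    "X": pd.Series(["foo", "bar"], dtype=shared),
    "Z": [1, 2],
})

result = pd.merge(left, right, on="X")
print(result.dtypes)
```

Because both key columns carry the same categories and orderedness, the merged ``X`` column is expected to remain categorical instead of falling back to ``object``.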

doc/source/whatsnew/v0.21.0.txt

+26
@@ -10,6 +10,8 @@ users upgrade to this version.
1010
Highlights include:
1111

1212
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
13+
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
14+
categoricals independent of the data (:issue:`14711`, :issue:`15078`)
1315

1416
Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.
1517

@@ -88,6 +90,30 @@ This does not raise any obvious exceptions, but also does not create a new colum
8890

8991
Setting a list-like data structure into a new attribute now raises a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.
9092

93+
.. _whatsnew_0210.enhancements.categorical_dtype:
94+
95+
``CategoricalDtype`` for specifying categoricals
96+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
97+
98+
:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
99+
expanded to include the ``categories`` and ``ordered`` attributes. A
100+
``CategoricalDtype`` can be used to specify the set of categories and
101+
orderedness of an array, independent of the data themselves. This can be useful,
102+
e.g., when converting string data to a ``Categorical``:
103+
104+
.. ipython:: python
105+
106+
from pandas.api.types import CategoricalDtype
107+
108+
s = pd.Series(['a', 'b', 'c', 'a']) # strings
109+
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
110+
s.astype(dtype)
111+
112+
The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
113+
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
114+
115+
See :ref:`CategoricalDtype <categorical.categoricaldtype>` for more.
116+
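
The behavior summarized in this whatsnew entry can be sketched as below. The CSV text and the column name ``letter`` are made up for illustration, and the ``read_csv`` usage assumes a pandas version that accepts a ``CategoricalDtype`` in its ``dtype`` mapping.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

dtype = CategoricalDtype(categories=["a", "b", "c", "d"], ordered=True)

# Converting strings: 'd' never occurs in the data but remains a category.
s = pd.Series(["a", "b", "c", "a"]).astype(dtype)
returns_instance = isinstance(s.dtype, CategoricalDtype)
categories = list(s.dtype.categories)
is_ordered = s.dtype.ordered
codes = s.cat.codes.tolist()

# The same dtype object can be handed to read_csv for a given column.
csv_text = "letter\na\nb\nc\na\n"
frame = pd.read_csv(io.StringIO(csv_text), dtype={"letter": dtype})
same_dtype = frame["letter"].dtype == dtype
```

Reusing one dtype object across ``astype`` and ``read_csv`` keeps the categories and orderedness consistent without depending on which values happen to appear in each input.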
91117
.. _whatsnew_0210.enhancements.other:
92118

93119
Other Enhancements
