Commit 7e816ed

ENH: Parametrized CategoricalDtype
We extended CategoricalDtype to accept optional categories and ordered arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still available with categories=None (the default), is to infer the categories from whatever is present in the data.

This change will make it easy to support categories that are known ahead of time in other places, e.g. .astype, .read_csv, and the Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
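As a rough illustration of the difference between declared and inferred categories, the following sketch uses `.astype` and the Series constructor. The variable names are illustrative, and the behavior shown (values outside the declared categories become missing) is what recent pandas releases do; it is an assumption that this matches the state of the code at this exact commit.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare the categories and their orderedness before seeing any data.
dtype = CategoricalDtype(categories=["a", "b"], ordered=True)

raw = pd.Series(["a", "b", "a", "c"])

# Convert with the declared dtype: "c" is not a declared category,
# so it becomes a missing value instead of a new category.
converted = raw.astype(dtype)

# The same dtype can be passed to the Series constructor.
built = pd.Series(["a", "b", "a", "c"], dtype=dtype)

# With the plain string, categories are inferred from the data
# and the result is unordered.
inferred = raw.astype("category")

print(converted.cat.categories.tolist(), converted.cat.ordered)
print(inferred.cat.categories.tolist(), inferred.cat.ordered)
```

Declaring the dtype once and reusing it keeps the category set identical across several Series, which is what the commit message means by specifying categories ahead of time.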
1 parent 64c8a8d commit 7e816ed

22 files changed

+629
-171
lines changed

doc/source/advanced.rst

+1-1
Original file line numberDiff line numberDiff line change
@@ -654,7 +654,7 @@ setting the index of a ``DataFrame/Series`` with a ``category`` dtype would conv
654654
655655
df = pd.DataFrame({'A': np.arange(6),
656656
'B': list('aabbca')})
657-
df['B'] = df['B'].astype('category', categories=list('cab'))
657+
df['B'] = df['B'].astype(pd.api.types.CategoricalDtype(list('cab')))
658658
df
659659
df.dtypes
660660
df.B.cat.categories

doc/source/categorical.rst

+82-8
Original file line numberDiff line numberDiff line change
@@ -96,12 +96,20 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
9696
df["B"] = raw_cat
9797
df
9898
99-
You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
99+
Wherever we passed the keyword ``dtype='category'`` above, we used the default behavior:
100+
101+
1. categories are inferred from the data
102+
2. categories are unordered.
103+
104+
To control those behaviors, instead of passing ``'category'``, use an instance
105+
of :class:`~pandas.api.types.CategoricalDtype`.
100106

101107
.. ipython:: python
102108
103-
s = pd.Series(["a","b","c","a"])
104-
s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
109+
s = pd.Series(["a", "b", "c", "a"])
110+
cat_type = pd.api.types.CategoricalDtype(categories=["b", "c", "d"],
111+
ordered=False)
112+
s_cat = s.astype(cat_type)
105113
s_cat
106114
107115
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
@@ -140,6 +148,62 @@ constructor to save the factorize step during normal constructor mode:
140148
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
141149
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
142150
151+
CategoricalDtype
152+
----------------
153+
154+
.. versionchanged:: 0.21.0
155+
156+
A categorical's type is fully described by 1.) its categories (an iterable with
157+
unique values and no missing values), and 2.) its orderedness (a boolean).
158+
This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
159+
The ``categories`` argument is optional, which implies that the actual categories
160+
should be inferred from whatever is present in the data when the
161+
:class:`pandas.Categorical` is created.
162+
163+
.. ipython:: python
164+
165+
pd.api.types.CategoricalDtype(['a', 'b', 'c'])
166+
pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=True)
167+
pd.api.types.CategoricalDtype()
168+
169+
A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
170+
expects a ``dtype``, for example :func:`pandas.read_csv`,
171+
:func:`pandas.DataFrame.astype`, or the Series constructor.
172+
173+
As a convenience, you can use the string ``'category'`` in place of a
174+
:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
175+
the categories being unordered, and equal to the set of values present in the
176+
array. In other words, ``dtype='category'`` is equivalent to
177+
``dtype=pd.api.types.CategoricalDtype()``.
178+
179+
Equality Semantics
180+
~~~~~~~~~~~~~~~~~~
181+
182+
Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal whenever they have
183+
the same categories and orderedness. When comparing two unordered categoricals, the
184+
order of the ``categories`` is not considered.
185+
186+
.. ipython:: python
187+
188+
c1 = pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=False)
189+
# Equal, since order is not considered when ordered=False
190+
c1 == pd.api.types.CategoricalDtype(['b', 'c', 'a'], ordered=False)
191+
# Unequal, since the second CategoricalDtype is ordered
192+
c1 == pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=True)
193+
194+
All instances of ``CategoricalDtype`` compare equal to the string ``'category'``.
195+
196+
.. ipython:: python
197+
198+
c1 == 'category'
199+
200+
201+
.. warning::
202+
203+
Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
204+
and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
205+
all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``.
206+
143207
Description
144208
-----------
145209

@@ -189,7 +253,9 @@ It's also possible to pass in the categories in a specific order:
189253

190254
.. ipython:: python
191255
192-
s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
256+
s = pd.Series(list('babc')).astype(
257+
pd.api.types.CategoricalDtype(list('abcd'))
258+
)
193259
s
194260
195261
# categories
@@ -306,7 +372,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``
306372
307373
s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
308374
s.sort_values(inplace=True)
309-
s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
375+
s = pd.Series(["a","b","c","a"]).astype(
376+
pd.api.types.CategoricalDtype(ordered=True)
377+
)
310378
s.sort_values(inplace=True)
311379
s
312380
s.min(), s.max()
@@ -406,9 +474,15 @@ categories or a categorical with any list-like object, will raise a TypeError.
406474

407475
.. ipython:: python
408476
409-
cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
410-
cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
411-
cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
477+
cat = pd.Series([1,2,3]).astype(
478+
pd.api.types.CategoricalDtype([3, 2, 1], ordered=True)
479+
)
480+
cat_base = pd.Series([2,2,2]).astype(
481+
pd.api.types.CategoricalDtype([3, 2, 1], ordered=True)
482+
)
483+
cat_base2 = pd.Series([2,2,2]).astype(
484+
pd.api.types.CategoricalDtype(ordered=True)
485+
)
412486
413487
cat
414488
cat_base
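The equality semantics described in the ``categorical.rst`` changes above can be checked with a short sketch. It mirrors the documented examples and assumes a recent pandas release; the variable names are illustrative. The comparison against ``CategoricalDtype(None)`` mentioned in the warning is left out on purpose, since that behavior may differ between versions.

```python
from pandas.api.types import CategoricalDtype

c1 = CategoricalDtype(["a", "b", "c"], ordered=False)

# Unordered dtypes ignore the order in which categories are listed.
same_unordered = c1 == CategoricalDtype(["b", "c", "a"], ordered=False)

# Orderedness is part of the type, so these two differ.
differs_by_ordered = c1 == CategoricalDtype(["a", "b", "c"], ordered=True)

# For ordered dtypes, the listing order of the categories matters.
o1 = CategoricalDtype(["a", "b"], ordered=True)
o2 = CategoricalDtype(["b", "a"], ordered=True)
differs_by_order = o1 == o2

# Any CategoricalDtype compares equal to the string 'category'.
equals_string = c1 == "category"

print(same_unordered, differs_by_ordered, differs_by_order, equals_string)
```

Because of the last rule, ``dtype == 'category'`` only tells you that a dtype is categorical; to compare categories and orderedness, compare against another ``CategoricalDtype`` instance.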

doc/source/merging.rst

+6-3
Original file line numberDiff line numberDiff line change
@@ -831,7 +831,7 @@ The left frame.
831831
.. ipython:: python
832832
833833
X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
834-
X = X.astype('category', categories=['foo', 'bar'])
834+
X = X.astype(pd.api.types.CategoricalDtype(categories=['foo', 'bar']))
835835
836836
left = pd.DataFrame({'X': X,
837837
'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +842,11 @@ The right frame.
842842

843843
.. ipython:: python
844844
845-
right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
846-
'Z': [1, 2]})
845+
right = pd.DataFrame({
846+
'X': pd.Series(['foo', 'bar'],
847+
dtype=pd.api.types.CategoricalDtype(['foo', 'bar'])),
848+
'Z': [1, 2]
849+
})
847850
right
848851
right.dtypes
849852

doc/source/whatsnew/v0.21.0.txt

+26
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,8 @@ Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations
2222
New features
2323
~~~~~~~~~~~~
2424

25+
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
26+
categoricals independent of the data (:issue:`14711`, :issue:`15078`)
2527
- Support for `PEP 519 -- Adding a file system path protocol
2628
<https://www.python.org/dev/peps/pep-0519/>`_ on most readers and writers (:issue:`13823`)
2729
- Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`,
@@ -106,6 +108,30 @@ This does not permit that column to be accessed as an attribute:
106108

107109
Both of these now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.
108110

111+
.. _whatsnew_0210.enhancements.categorical_dtype:
112+
113+
``CategoricalDtype`` for specifying categoricals
114+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
115+
116+
:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
117+
expanded to include the ``categories`` and ``ordered`` attributes. A
118+
``CategoricalDtype`` can be used to specify the set of categories and
119+
orderedness of an array, independent of the data themselves. This can be useful,
120+
e.g., when converting string data to a ``Categorical``:
121+
122+
.. ipython:: python
123+
124+
from pandas.api.types import CategoricalDtype
125+
126+
s = pd.Series(['a', 'b', 'c', 'a']) # strings
127+
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
128+
s.astype(dtype)
129+
130+
The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
131+
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
132+
133+
See :ref:`CategoricalDtype <categorical.categoricaldtype>` for more.
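A minimal sketch of what the ``.dtype`` change means in practice, assuming a recent pandas release (the variable names are illustrative): the dtype returned by a categorical ``Series``, ``Categorical``, or ``CategoricalIndex`` carries the categories and orderedness as attributes.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])  # strings
dtype = CategoricalDtype(categories=["a", "b", "c", "d"], ordered=True)
converted = s.astype(dtype)

# The resulting dtype is a CategoricalDtype instance that exposes
# the categories and orderedness it was created with.
result_dtype = converted.dtype
print(isinstance(result_dtype, CategoricalDtype))
print(result_dtype.categories.tolist(), result_dtype.ordered)

# Categorical and CategoricalIndex expose the same kind of dtype.
cat_dtype = pd.Categorical(["x", "y", "x"]).dtype
idx_dtype = pd.CategoricalIndex(["x", "y", "x"]).dtype
print(type(cat_dtype).__name__, type(idx_dtype).__name__)
```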
134+
109135
.. _whatsnew_0210.enhancements.other:
110136

111137
Other Enhancements
