Commit 80ae7a1

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still possible with categories=None (the default), is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories that are known ahead of time in other places, e.g. .astype, .read_csv, and the Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
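As a rough illustration of what the commit message describes, the sketch below declares the categories ahead of time and applies them with `.astype`. It imports the class from `pandas.api.types`, the path used in the documentation changes in this commit. The sample values are made up. Values outside the declared categories becoming missing is how pandas releases that ship this feature behave, not something the commit message states.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare the categories and orderedness before seeing any data.
dtype = CategoricalDtype(categories=["a", "b"], ordered=True)

# "c" is not one of the declared categories, so it becomes missing (NaN).
s = pd.Series(["a", "b", "c", "a"]).astype(dtype)

categories = list(s.cat.categories)
is_ordered = bool(s.cat.ordered)
n_missing = int(s.isna().sum())
```

With `categories=None` (the default) the categories would instead be inferred from the values present, which matches the previous behavior.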
1 parent e6aed2e commit 80ae7a1

22 files changed, +629 -171 lines changed

doc/source/advanced.rst

+1-1
@@ -640,7 +640,7 @@ and allows efficient indexing and storage of an index with a large number of dup

     df = pd.DataFrame({'A': np.arange(6),
                        'B': list('aabbca')})
-    df['B'] = df['B'].astype('category', categories=list('cab'))
+    df['B'] = df['B'].astype(pd.api.types.CategoricalDtype(list('cab')))
     df
     df.dtypes
     df.B.cat.categories

doc/source/categorical.rst

+82-8
@@ -89,12 +89,20 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
     df["B"] = raw_cat
     df

-You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
+Anywhere above that we passed the keyword ``dtype='category'``, we used the default behavior:
+
+1. categories are inferred from the data
+2. categories are unordered.
+
+To control those behaviors, instead of passing ``'category'``, use an instance
+of :class:`~pandas.api.types.CategoricalDtype`.

 .. ipython:: python

-    s = pd.Series(["a","b","c","a"])
-    s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
+    s = pd.Series(["a", "b", "c", "a"])
+    cat_type = pd.api.types.CategoricalDtype(categories=["b", "c", "d"],
+                                             ordered=False)
+    s_cat = s.astype(cat_type)
     s_cat

 Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
@@ -133,6 +141,62 @@ constructor to save the factorize step during normal constructor mode:
     splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
     s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))

+CategoricalDtype
+----------------
+
+.. versionchanged:: 0.21.0
+
+A categorical's type is fully described by 1.) its categories (an iterable with
+unique values and no missing values), and 2.) its orderedness (a boolean).
+This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
+The ``categories`` argument is optional, which implies that the actual categories
+should be inferred from whatever is present in the data when the
+:class:`pandas.Categorical` is created.
+
+.. ipython:: python
+
+    pd.api.types.CategoricalDtype(['a', 'b', 'c'])
+    pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=True)
+    pd.api.types.CategoricalDtype()
+
+A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
+expects a ``dtype``. For example :func:`pandas.read_csv`,
+:meth:`pandas.DataFrame.astype`, or the Series constructor.
+
+As a convenience, you can use the string ``'category'`` in place of a
+:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
+the categories being unordered, and equal to the set of values present in the
+array. In other words, ``dtype='category'`` is equivalent to
+``dtype=pd.api.types.CategoricalDtype()``.
+
+Equality Semantics
+~~~~~~~~~~~~~~~~~~
+
+Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal whenever they have
+the same categories and orderedness. When comparing two unordered categoricals, the
+order of the ``categories`` is not considered.
+
+.. ipython:: python
+
+    c1 = pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=False)
+    # Equal, since order is not considered when ordered=False
+    c1 == pd.api.types.CategoricalDtype(['b', 'c', 'a'], ordered=False)
+    # Unequal, since the second CategoricalDtype is ordered
+    c1 == pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=True)
+
+All instances of ``CategoricalDtype`` compare equal to the string ``'category'``.
+
+.. ipython:: python
+
+    c1 == 'category'
+
+
+.. warning::
+
+    Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
+    and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
+    all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``.
+
 Description
 -----------

@@ -182,7 +246,9 @@ It's also possible to pass in the categories in a specific order:

 .. ipython:: python

-    s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
+    s = pd.Series(list('babc')).astype(
+        pd.api.types.CategoricalDtype(list('abcd'))
+    )
     s

     # categories
@@ -295,7 +361,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``

     s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
     s.sort_values(inplace=True)
-    s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
+    s = pd.Series(["a","b","c","a"]).astype(
+        pd.api.types.CategoricalDtype(ordered=True)
+    )
     s.sort_values(inplace=True)
     s
     s.min(), s.max()
@@ -395,9 +463,15 @@ categories or a categorical with any list-like object, will raise a TypeError.

 .. ipython:: python

-    cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
-    cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
-    cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
+    cat = pd.Series([1,2,3]).astype(
+        pd.api.types.CategoricalDtype([3, 2, 1], ordered=True)
+    )
+    cat_base = pd.Series([2,2,2]).astype(
+        pd.api.types.CategoricalDtype([3, 2, 1], ordered=True)
+    )
+    cat_base2 = pd.Series([2,2,2]).astype(
+        pd.api.types.CategoricalDtype(ordered=True)
+    )

     cat
     cat_base
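
The equality rules introduced in the "Equality Semantics" section of `doc/source/categorical.rst` above can be checked with a short script. This is a sketch for illustration only; the variable names are hypothetical. It deliberately leaves out the comparison against `CategoricalDtype(None)` described in the warning, since that result may differ between pandas versions.

```python
from pandas.api.types import CategoricalDtype

c1 = CategoricalDtype(["a", "b", "c"], ordered=False)

# Unordered dtypes ignore the order in which categories are listed.
same_unordered = bool(c1 == CategoricalDtype(["b", "c", "a"], ordered=False))

# Orderedness is part of the type, so this comparison is unequal.
differs_by_ordered = bool(c1 == CategoricalDtype(["a", "b", "c"], ordered=True))

# For ordered dtypes, the order of the categories matters as well.
o1 = CategoricalDtype(["a", "b", "c"], ordered=True)
reordered = bool(o1 == CategoricalDtype(["c", "b", "a"], ordered=True))

# Any CategoricalDtype compares equal to the string 'category'.
matches_string = bool(o1 == "category")
```

One consequence is that `some_dtype == 'category'` only tells you the dtype is categorical; compare against a fully specified `CategoricalDtype` to check categories and orderedness.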

doc/source/merging.rst

+6-3
@@ -831,7 +831,7 @@ The left frame.
 .. ipython:: python

     X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
-    X = X.astype('category', categories=['foo', 'bar'])
+    X = X.astype(pd.api.types.CategoricalDtype(categories=['foo', 'bar']))

     left = pd.DataFrame({'X': X,
                          'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +842,11 @@ The right frame.

 .. ipython:: python

-    right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
-                          'Z': [1, 2]})
+    right = pd.DataFrame({
+        'X': pd.Series(['foo', 'bar'],
+                       dtype=pd.api.types.CategoricalDtype(['foo', 'bar'])),
+        'Z': [1, 2]
+    })
     right
     right.dtypes
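
The rewritten `right` frame above passes the dtype to the Series constructor instead of calling `astype`. The sketch below shows that pattern next to `read_csv`, which the new `categorical.rst` text names as another place a `CategoricalDtype` can be supplied. The column name `shirt`, the category labels, and the in-memory CSV text are invented for the example.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# A dtype declared once and reused wherever pandas accepts a dtype.
size_type = CategoricalDtype(["small", "medium", "large"], ordered=True)

# Series constructor: no separate astype() call is needed.
from_constructor = pd.Series(["small", "large"], dtype=size_type)

# read_csv: map a column name to the dtype (hypothetical in-memory CSV).
csv_text = "shirt\nsmall\nlarge\n"
from_csv = pd.read_csv(io.StringIO(csv_text), dtype={"shirt": size_type})

# "medium" never appears in the data but is still a declared category.
csv_categories = list(from_csv["shirt"].cat.categories)
```

Because both objects share one dtype, their categories and ordering agree even though neither contains every category.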

doc/source/whatsnew/v0.21.0.txt

+26
@@ -22,6 +22,8 @@ Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations
 New features
 ~~~~~~~~~~~~

+- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
+  categoricals independent of the data (:issue:`14711`, :issue:`15078`)
 - Support for `PEP 519 -- Adding a file system path protocol
   <https://www.python.org/dev/peps/pep-0519/>`_ on most readers and writers (:issue:`13823`)
 - Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`,
@@ -88,6 +90,30 @@ This does not raise any obvious exceptions, but also does not create a new colum

 Setting a list-like data structure into a new attribute now raises a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.

+.. _whatsnew_0210.enhancements.categorical_dtype:
+
+``CategoricalDtype`` for specifying categoricals
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
+expanded to include the ``categories`` and ``ordered`` attributes. A
+``CategoricalDtype`` can be used to specify the set of categories and
+orderedness of an array, independent of the data themselves. This can be useful,
+e.g., when converting string data to a ``Categorical``:
+
+.. ipython:: python
+
+    from pandas.api.types import CategoricalDtype
+
+    s = pd.Series(['a', 'b', 'c', 'a'])  # strings
+    dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
+    s.astype(dtype)
+
+The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
+``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
+
+See :ref:`CategoricalDtype <categorical.categoricaldtype>` for more.
+
 .. _whatsnew_0210.enhancements.other:

 Other Enhancements
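
The whatsnew entry above notes that `.dtype` on a `Categorical`, a `CategoricalIndex`, or a categorical `Series` now returns a `CategoricalDtype` instance. A minimal check, assuming a pandas release that includes this change (the sample values are arbitrary):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Categorical(["a", "b", "a"])
idx = pd.CategoricalIndex(["a", "b", "a"])
ser = pd.Series(["a", "b", "a"], dtype="category")

# Each container exposes its type as a CategoricalDtype instance.
all_categorical = all(
    isinstance(obj.dtype, CategoricalDtype) for obj in (cat, idx, ser)
)

# With dtype='category' the categories are inferred from the data
# and the result is unordered.
inferred_categories = list(ser.dtype.categories)
inferred_ordered = bool(ser.dtype.ordered)
```

The returned dtype carries the `categories` and `ordered` attributes, so it can be passed on to `astype` to give another Series the same type.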
