Commit a7eb835

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still possible with categories=None, the default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories that are known ahead of time in other places, e.g. .astype, .read_csv, and the Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
1 parent 1abaecb commit a7eb835

21 files changed, +510 -170 lines changed

doc/source/advanced.rst

+1 -1
@@ -654,7 +654,7 @@ setting the index of a ``DataFrame/Series`` with a ``category`` dtype would conv
 
 df = pd.DataFrame({'A': np.arange(6),
 'B': list('aabbca')})
-df['B'] = df['B'].astype('category', categories=list('cab'))
+df['B'] = df['B'].astype(pd.CategoricalDtype(list('cab')))
 df
 df.dtypes
 df.B.cat.categories
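
The replacement line in this hunk can be tried on its own. The following is a minimal sketch, assuming a pandas version that exposes `pd.CategoricalDtype`; the variable names `category_order` and `codes` are illustrative and not part of the commit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(6), 'B': list('aabbca')})
df['B'] = df['B'].astype(pd.CategoricalDtype(list('cab')))

# The categories keep the order passed to CategoricalDtype
# rather than being sorted.
category_order = list(df['B'].cat.categories)

# Each value is stored as an integer code that indexes into the categories.
codes = df['B'].cat.codes.tolist()

print(category_order)
print(codes)
```

The point of the sketch is that the category order comes from the dtype, not from the data.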

doc/source/categorical.rst

+70 -8
@@ -96,12 +96,19 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
 df["B"] = raw_cat
 df
 
-You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
+Anywhere above we passed a keyword ``dtype='category'``, we used the default behavior of
+
+1. categories are inferred from the data
+2. categories are unordered.
+
+To control those behaviors, instead of passing ``'category'``, use an instance
+of :class:`CategoricalDtype`.
 
 .. ipython:: python
 
-s = pd.Series(["a","b","c","a"])
-s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
+s = pd.Series(["a", "b", "c", "a"])
+cat_type = pd.CategoricalDtype(categories=["b", "c", "d"], ordered=False)
+s_cat = s.astype(cat_type)
 s_cat
 
 Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
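
One consequence of the example in this hunk is easy to miss: `"a"` is not among the requested categories. The sketch below reuses the hunk's own names and adds two illustrative ones (`missing_mask`, `codes`) to show what happens to such values, assuming current pandas behavior.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"])
cat_type = pd.CategoricalDtype(categories=["b", "c", "d"], ordered=False)
s_cat = s.astype(cat_type)

# Values that are not listed in the categories become missing (NaN).
missing_mask = s_cat.isna().tolist()

# Missing values are stored with the sentinel code -1.
codes = s_cat.cat.codes.tolist()

print(missing_mask)
print(codes)
```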
@@ -140,6 +147,61 @@ constructor to save the factorize step during normal constructor mode:
 splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
 s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
 
+CategoricalDtype
+----------------
+
+.. versionchanged:: 0.21.0
+
+A categorical's type is fully described by 1.) its categories (an iterable with
+unique values and no missing values), and 2.) its orderedness (a boolean).
+This information can be stored in a :class:`~pandas.CategoricalDtype`.
+The ``categories`` argument is optional, which implies that the actual categories
+should be inferred from whatever is present in the data when the
+:class:`pandas.Categorical` is created.
+
+.. ipython:: python
+
+pd.CategoricalDtype(['a', 'b', 'c'])
+pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
+pd.CategoricalDtype()
+
+A :class:`~pandas.CategoricalDtype` can be used in any place pandas expects a
+`dtype`. For example :func:`pandas.read_csv`, :func:`pandas.DataFrame.astype`,
+or the Series constructor.
+
+As a convenience, you can use the string `'category'` in place of a
+:class:`pandas.CategoricalDtype` when you want the default behavior of
+the categories being unordered, and equal to the set of values present in the array.
+In other words, ``dtype='category'`` is equivalent to ``dtype=pd.CategoricalDtype()``.
+
+Equality Semantics
+~~~~~~~~~~~~~~~~~~
+
+Two instances of :class:`pandas.CategoricalDtype` compare equal whenever they have
+the same categories and orderedness. When comparing two unordered categoricals, the
+order of the ``categories`` is not considered.
+
+.. ipython:: python
+
+c1 = pd.CategoricalDtype(['a', 'b', 'c'], ordered=False)
+# Equal, since order is not considered when ordered=False
+c1 == pd.CategoricalDtype(['b', 'c', 'a'], ordered=False)
+# Unequal, since the second CategoricalDtype is ordered
+c1 == pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
+
+All instances of ``CategoricalDtype`` compare equal to the string ``'category'``
+
+.. ipython:: python
+
+c1 == 'category'
+
+
+.. warning::
+
+Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
+and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
+all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``
+
 Description
 -----------
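
The equality rules added in this hunk can be checked directly. This is a minimal sketch with illustrative variable names. It adds one case the hunk does not show, two ordered dtypes whose categories are listed in a different order. It leaves out the comparison against `CategoricalDtype(None)` from the warning, since that behavior may differ across pandas versions.

```python
import pandas as pd

c1 = pd.CategoricalDtype(['a', 'b', 'c'], ordered=False)

# Unordered dtypes: the listing order of the categories is ignored.
same_unordered = c1 == pd.CategoricalDtype(['b', 'c', 'a'], ordered=False)

# Orderedness is part of the type, so these two differ.
ordered_vs_unordered = c1 == pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)

# Ordered dtypes: the listing order defines the ordering, so it matters.
o1 = pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
o2 = pd.CategoricalDtype(['c', 'b', 'a'], ordered=True)
same_ordered = o1 == o2

# Any CategoricalDtype compares equal to the string 'category'.
equals_string = c1 == 'category'

print(same_unordered, ordered_vs_unordered, same_ordered, equals_string)
```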
@@ -189,7 +251,7 @@ It's also possible to pass in the categories in a specific order:
 
 .. ipython:: python
 
-s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
+s = pd.Series(list('babc')).astype(pd.CategoricalDtype(list('abcd')))
 s
 
 # categories
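
The dtype in this hunk lists a category, `'d'`, that never occurs in the data. The sketch below shows that such a category still belongs to the Series. The use of `value_counts` is an illustration chosen here and is not part of the diff.

```python
import pandas as pd

s = pd.Series(list('babc')).astype(pd.CategoricalDtype(list('abcd')))

# All four categories are part of the dtype, including the unused 'd'.
categories = list(s.cat.categories)

# value_counts reports every category, with a count of 0 for unused ones.
counts = s.value_counts().to_dict()

print(categories)
print(counts)
```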
@@ -306,7 +368,7 @@ meaning and certain operations are possible. If the categorical is unordered, ``
 
 s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
 s.sort_values(inplace=True)
-s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
+s = pd.Series(["a","b","c","a"]).astype(pd.CategoricalDtype(ordered=True))
 s.sort_values(inplace=True)
 s
 s.min(), s.max()
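
With `ordered=True`, sorting and `min`/`max` follow the category order rather than lexical order. The sketch below contrasts the inferred order from this hunk with an explicitly reversed one. The `reversed_order` dtype and the result names are assumptions made for illustration.

```python
import pandas as pd

values = ["a", "b", "c", "a"]

# Categories inferred from the data, marked as ordered: a < b < c.
s = pd.Series(values).astype(pd.CategoricalDtype(ordered=True))
inferred_min, inferred_max = s.min(), s.max()

# Explicit, reversed category order: c < b < a.
reversed_order = pd.CategoricalDtype(["c", "b", "a"], ordered=True)
r = pd.Series(values).astype(reversed_order)
reversed_min, reversed_max = r.min(), r.max()
reversed_sorted = r.sort_values().tolist()

print(inferred_min, inferred_max)
print(reversed_min, reversed_max, reversed_sorted)
```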
@@ -406,9 +468,9 @@ categories or a categorical with any list-like object, will raise a TypeError.
 
 .. ipython:: python
 
-cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
-cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
-cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
+cat = pd.Series([1,2,3]).astype(pd.CategoricalDtype([3, 2, 1], ordered=True))
+cat_base = pd.Series([2,2,2]).astype(pd.CategoricalDtype([3, 2, 1], ordered=True))
+cat_base2 = pd.Series([2,2,2]).astype(pd.CategoricalDtype(ordered=True))
 
 cat
 cat_base
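
The three Series defined in this hunk are set up for the comparison rules named in the hunk header. The following sketch is hedged: it shares one dtype object (`order`, an illustrative name) between `cat` and `cat_base`, and the `try`/`except` wrapper is added here only to record the `TypeError`.

```python
import pandas as pd

order = pd.CategoricalDtype([3, 2, 1], ordered=True)  # means 3 < 2 < 1
cat = pd.Series([1, 2, 3]).astype(order)
cat_base = pd.Series([2, 2, 2]).astype(order)

# Same categories and order: element-wise comparison follows 3 < 2 < 1.
greater = (cat > cat_base).tolist()

# Inferred categories here are just [2], so the dtypes do not match.
cat_base2 = pd.Series([2, 2, 2]).astype(pd.CategoricalDtype(ordered=True))
try:
    cat > cat_base2
    mismatch_raises = False
except TypeError:
    mismatch_raises = True

print(greater, mismatch_raises)
```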

doc/source/merging.rst

+5 -3
@@ -831,7 +831,7 @@ The left frame.
 .. ipython:: python
 
 X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
-X = X.astype('category', categories=['foo', 'bar'])
+X = X.astype(pd.CategoricalDtype(categories=['foo', 'bar']))
 
 left = pd.DataFrame({'X': X,
 'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +842,10 @@ The right frame.
 
 .. ipython:: python
 
-right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
-'Z': [1, 2]})
+right = pd.DataFrame({
+'X': pd.Series(['foo', 'bar'], dtype=pd.CategoricalDtype(['foo', 'bar'])),
+'Z': [1, 2]
+})
 right
 right.dtypes
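
A short sketch of what these two frames are for: when both key columns share the same `CategoricalDtype`, the merged key should keep that dtype. The data is fixed here instead of random so the result is reproducible, and the name `key_dtype` is an assumption made for illustration.

```python
import pandas as pd

key_dtype = pd.CategoricalDtype(['foo', 'bar'])

left = pd.DataFrame({
    'X': pd.Series(['foo', 'bar', 'foo'], dtype=key_dtype),
    'Y': ['one', 'two', 'three'],
})
right = pd.DataFrame({
    'X': pd.Series(['foo', 'bar'], dtype=key_dtype),
    'Z': [1, 2],
})

# Both key columns have identical categorical dtypes,
# so the merged key column stays categorical.
result = pd.merge(left, right, how='outer')
print(result.dtypes)
```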

doc/source/whatsnew/v0.21.0.txt

+24
@@ -22,6 +22,8 @@ Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations
 New features
 ~~~~~~~~~~~~
 
+- New user-facing :class:`CategoricalDtype` for specifying categoricals independent
+of the data (:issue:`14711`, :issue:`15078`)
 - Support for `PEP 519 -- Adding a file system path protocol
 <https://www.python.org/dev/peps/pep-0519/>`_ on most readers and writers (:issue:`13823`)
 - Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`,
@@ -106,6 +108,28 @@ This does not permit that column to be accessed as an attribute:
 
 Both of these now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.
 
+.. _whatsnew_0210.enhancements.categorical_dtype:
+
+``CategoricalDtype`` for specifying categoricals
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`CategoricalDtype` has been added to the public API and expanded to
+include the ``categories`` and ``ordered`` attributes. A ``CategoricalDtype``
+can be used to specify the set of categories and orderedness of an array,
+independent of the data themselves. This can be useful, e.g., when converting
+string data to a ``Categorical``:
+
+.. ipython:: python
+
+s = pd.Series(['a', 'b', 'c', 'a']) # strings
+dtype = pd.CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
+s.astype(dtype)
+
+The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
+``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
+
+See :ref:`CategoricalDtype <categorical.categoricaldtype>` for more.
+
 .. _whatsnew_0210.enhancements.other:
 
 Other Enhancements
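
The release note says that `.dtype` now returns a `CategoricalDtype` instance carrying `categories` and `ordered`. The sketch below checks that on a `Series` and a `CategoricalIndex`. The names `converted` and `index` are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'])  # strings
dtype = pd.CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
converted = s.astype(dtype)

# The dtype of the converted Series is a CategoricalDtype instance
# that exposes the categories and the orderedness.
print(type(converted.dtype).__name__)
print(list(converted.dtype.categories), converted.dtype.ordered)

# The same dtype object can parametrize a CategoricalIndex.
index = pd.CategoricalIndex(['a', 'b'], dtype=dtype)
print(index.dtype == dtype)
```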

pandas/core/api.py

+1
@@ -6,6 +6,7 @@
 
 from pandas.core.algorithms import factorize, unique, value_counts
 from pandas.core.dtypes.missing import isna, isnull, notna, notnull
+from pandas.core.dtypes.dtypes import CategoricalDtype
 from pandas.core.categorical import Categorical
 from pandas.core.groupby import Grouper
 from pandas.io.formats.format import set_eng_float_format
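
With this import in place, the class is reachable as `pd.CategoricalDtype`. The sketch below passes it as a `dtype` to the Series constructor and to `read_csv`. The commit message describes `read_csv` support as something this change makes easy to implement, so the `read_csv` call assumes a pandas version where that support exists. The column name `grade`, its categories, and the CSV text are invented for illustration.

```python
import io

import pandas as pd

grade_dtype = pd.CategoricalDtype(['low', 'mid', 'high'], ordered=True)

# Series constructor: the dtype fixes the categories and their order up front.
from_constructor = pd.Series(['mid', 'low'], dtype=grade_dtype)

# read_csv: assumes a pandas version that accepts a CategoricalDtype per column.
csv_text = "grade\nhigh\nlow\nmid\n"
from_csv = pd.read_csv(io.StringIO(csv_text), dtype={'grade': grade_dtype})

print(from_constructor.dtype)
print(from_csv['grade'].dtype)
```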
