Commit 6f175a7

Doc and test unexpected values
1 parent 508dd1e commit 6f175a7

3 files changed: 43 additions, 4 deletions

doc/source/io.rst

+9-1
@@ -482,14 +482,22 @@ that column's ``dtype``.
482482
dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)
483483
pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes
484484
485+
When using ``dtype=CategoricalDtype``, "unexpected" values outside of
486+
``dtype.categories`` are treated as missing values.
487+
488+
dtype = CategoricalDtype(['a', 'b', 'd']) # No 'c'
489+
pd.read_csv(StringIO(data), dtype={'col1': dtype}).col1
490+
491+
This matches the behavior of :meth:`Categorical.set_categories`.
492+
485493
.. note::
486494

487495
With ``dtype='category'``, the resulting categories will always be parsed
488496
as strings (object dtype). If the categories are numeric they can be
489497
converted using the :func:`to_numeric` function, or as appropriate, another
490498
converter such as :func:`to_datetime`.
491499

492-
When ``dtype`` is a ``CategoricalDtype`` with homogenous ``categoriess`` (
500+
When ``dtype`` is a ``CategoricalDtype`` with homogenous ``categories`` (
493501
all numeric, all datetimes, etc.), the conversion is done automatically.
494502

495503
.. ipython:: python
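
The "unexpected values" behavior documented in this hunk can be tried in isolation. The following is a minimal sketch, not part of the commit: it assumes a recent pandas, so it uses `io.StringIO` instead of `pandas.compat.StringIO`, and the sample `data` string and the names `parsed` and `reference` are illustrative.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

# Illustrative input (an assumption): one column containing 'c',
# which is absent from the categories declared below.
data = "col1\na\nb\nc\nd"
dtype = CategoricalDtype(["a", "b", "d"])  # No 'c'

parsed = pd.read_csv(StringIO(data), dtype={"col1": dtype})["col1"]

# For comparison: restrict an inferred categorical with set_categories.
reference = pd.Series(list("abcd"), dtype="category").cat.set_categories(
    ["a", "b", "d"]
)

# Show which positions ended up missing in each case.
print(parsed.isna().tolist())
print(reference.isna().tolist())
```

Under these assumptions, both series should mark the row holding 'c' as missing, which is the parity with `Categorical.set_categories` that the added documentation describes.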

doc/source/whatsnew/v0.21.0.txt

+26-3
@@ -119,7 +119,7 @@ expanded to include the ``categories`` and ``ordered`` attributes. A
119119
``CategoricalDtype`` can be used to specify the set of categories and
120120
orderedness of an array, independent of the data themselves. This can be useful,
121121
e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
122-
:issue:`15078`, :issue:`16015`):
122+
:issue:`15078`, :issue:`16015`, :issue:`17643`):
123123

124124
.. ipython:: python
125125

@@ -129,8 +129,33 @@ e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
129129
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
130130
s.astype(dtype)
131131

132+
One place that deserves special mention is in :meth:`read_csv`. Previously, with
133+
``dtype={'col': 'category'}``, the returned values and categories would always
134+
be strings.
135+
136+
.. ipython:: python
137+
138+
from pandas.compat import StringIO
139+
140+
data = 'A,B\na,1\nb,2\nc,3'
141+
pd.read_csv(StringIO(data), dtype={'B': 'category'}).B.cat.categories
142+
143+
Notice the "object" dtype.
144+
145+
With a ``CategoricalDtype`` of all numerics, datetimes, or
146+
timedeltas, we can automatically convert to the correct type
147+
148+
dtype = {'B': CategoricalDtype([1, 2, 3])}
149+
pd.read_csv(StringIO(data), dtype=dtype).B.cat.categories
150+
151+
The values have been correctly interpreted as integers.
152+
132153
The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
133154
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
155+
For the most part, this is backwards compatible, though the string repr has changed.
156+
If you were previously using ``str(s.dtype) == 'category'`` to detect categorical data,
157+
switch to :func:`api.types.is_categorical_dtype`, which is compatible with the old and
158+
new ``CategoricalDtype``.
134159

135160
See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.
136161

@@ -163,8 +188,6 @@ Other Enhancements
163188
- :func:`Categorical.rename_categories` now accepts a dict-like argument as `new_categories` and only updates the categories found in that dict. (:issue:`17336`)
164189
- :func:`read_excel` raises ``ImportError`` with a better message if ``xlrd`` is not installed. (:issue:`17613`)
165190
- :meth:`DataFrame.assign` will preserve the original order of ``**kwargs`` for Python 3.6+ users instead of sorting the column names
166-
- Pass a :class:`~pandas.api.types.CategoricalDtype` to :meth:`read_csv` to parse categorical
167-
data as numeric, datetimes, or timedeltas, instead of strings. See :ref:`here <io.categorical>`. (:issue:`17643`)
168191

169192

170193
.. _whatsnew_0210.api_breaking:
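
The two `read_csv` calls from the whatsnew entry above can be compared side by side. This is a hedged sketch rather than the commit's own example: it assumes a recent pandas (hence `io.StringIO`), the names `as_strings` and `as_ints` are illustrative, and the `isinstance` check is a swapped-in detection approach, not the `is_categorical_dtype` helper that the whatsnew text recommends.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype, is_integer_dtype

data = "A,B\na,1\nb,2\nc,3"

# dtype='category': categories are parsed as strings.
as_strings = pd.read_csv(StringIO(data), dtype={"B": "category"})["B"]

# CategoricalDtype with numeric categories: values are converted to match.
as_ints = pd.read_csv(
    StringIO(data), dtype={"B": CategoricalDtype([1, 2, 3])}
)["B"]

print(as_strings.cat.categories.tolist())
print(as_ints.cat.categories.tolist())
print(is_integer_dtype(as_ints.cat.categories))

# Detection by instance check (an assumption, see the note above).
print(isinstance(as_ints.dtype, CategoricalDtype))
```

Comparing the two printed category lists makes the difference concrete: string labels in the first case, integers in the second.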

pandas/tests/io/parser/dtypes.py

+8
@@ -210,6 +210,14 @@ def test_categoricaldtype_coerces_timedelta(self):
210210
result = self.read_csv(StringIO(data), dtype=dtype)
211211
tm.assert_frame_equal(result, expected)
212212

213+
def test_categoricaldtype_unexpected_categories(self):
214+
dtype = {'b': CategoricalDtype(['a', 'b', 'd', 'e'])}
215+
data = "b\nd\na\nc\nd" # Unexpected c
216+
expected = pd.DataFrame({"b": Categorical(list('dacd'),
217+
dtype=dtype['b'])})
218+
result = self.read_csv(StringIO(data), dtype=dtype)
219+
tm.assert_frame_equal(result, expected)
220+
213221
def test_categorical_categoricaldtype_chunksize(self):
214222
# GH 10153
215223
data = """a,b
