Commit def3bce

ENH: Accept CategoricalDtype in read_csv (pandas-dev#17643)
* ENH: Accept CategoricalDtype in CSV reader
* rework
* Fixed basic implementation
* Added casting
* Doc and cleanup
* Fixed assignment of categoricals
* Doc and test unexpected values
* DOC: fixups
* More coercion, use _recode_for_categories
* Refactor with maybe_convert_for_categorical
* PEP8
* Type for 32bit
* REF: refactor to new method
* py2 compat
* Refactored
* More in Categorical
* fixup! More in Categorical
1 parent 2310faa commit def3bce

7 files changed

+278
-25
lines changed

doc/source/io.rst

+34-5
@@ -452,7 +452,8 @@ Specifying Categorical dtype
452452

453453
.. versionadded:: 0.19.0
454454

455-
``Categorical`` columns can be parsed directly by specifying ``dtype='category'``
455+
``Categorical`` columns can be parsed directly by specifying ``dtype='category'`` or
456+
``dtype=CategoricalDtype(categories, ordered)``.
456457

457458
.. ipython:: python
458459
@@ -468,12 +469,40 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification
468469
469470
pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes
470471
472+
.. versionadded:: 0.21.0
473+
474+
Specifying ``dtype='category'`` will result in an unordered ``Categorical``
475+
whose ``categories`` are the unique values observed in the data. For more
476+
control over the categories and order, create a
477+
:class:`~pandas.api.types.CategoricalDtype` ahead of time, and pass that for
478+
that column's ``dtype``.
479+
480+
.. ipython:: python
481+
482+
from pandas.api.types import CategoricalDtype
483+
484+
dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)
485+
pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes
486+
487+
When using ``dtype=CategoricalDtype``, "unexpected" values outside of
488+
``dtype.categories`` are treated as missing values.
489+
490+
.. ipython:: python
491+
492+
dtype = CategoricalDtype(['a', 'b', 'd']) # No 'c'
493+
pd.read_csv(StringIO(data), dtype={'col1': dtype}).col1
494+
495+
This matches the behavior of :meth:`Categorical.set_categories`.
496+
471497
.. note::
472498

473-
The resulting categories will always be parsed as strings (object dtype).
474-
If the categories are numeric they can be converted using the
475-
:func:`to_numeric` function, or as appropriate, another converter
476-
such as :func:`to_datetime`.
499+
With ``dtype='category'``, the resulting categories will always be parsed
500+
as strings (object dtype). If the categories are numeric they can be
501+
converted using the :func:`to_numeric` function, or as appropriate, another
502+
converter such as :func:`to_datetime`.
503+
504+
When ``dtype`` is a ``CategoricalDtype`` with homogeneous ``categories`` (
505+
all numeric, all datetimes, etc.), the conversion is done automatically.
477506

478507
.. ipython:: python
479508
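
A minimal sketch of the documented behavior, not part of the commit. It assumes Python 3 (``io.StringIO`` rather than ``pandas.compat.StringIO``) and a hypothetical three-row sample; values outside ``dtype.categories`` should come back as missing.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample data; the docs define their own `data` elsewhere.
data = "col1,col2\na,1\nb,2\nc,3\n"

# 'c' is deliberately left out of the declared categories.
dtype = CategoricalDtype(["a", "b", "d"])
col = pd.read_csv(StringIO(data), dtype={"col1": dtype})["col1"]

# The declared categories are kept as given; the unexpected 'c' is missing.
print(col.cat.categories.tolist())
print(col.isna().tolist())
```

The categories stay exactly as declared, including the unused ``'d'``, which is the same outcome ``Categorical.set_categories`` would give.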

doc/source/whatsnew/v0.21.0.txt

+31-2
@@ -119,7 +119,7 @@ expanded to include the ``categories`` and ``ordered`` attributes. A
119119
``CategoricalDtype`` can be used to specify the set of categories and
120120
orderedness of an array, independent of the data themselves. This can be useful,
121121
e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
122-
:issue:`15078`, :issue:`16015`):
122+
:issue:`15078`, :issue:`16015`, :issue:`17643`):
123123

124124
.. ipython:: python
125125

@@ -129,8 +129,37 @@ e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
129129
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
130130
s.astype(dtype)
131131

132+
One place that deserves special mention is in :meth:`read_csv`. Previously, with
133+
``dtype={'col': 'category'}``, the returned values and categories would always
134+
be strings.
135+
136+
.. ipython:: python
137+
:suppress:
138+
139+
from pandas.compat import StringIO
140+
141+
.. ipython:: python
142+
143+
data = 'A,B\na,1\nb,2\nc,3'
144+
pd.read_csv(StringIO(data), dtype={'B': 'category'}).B.cat.categories
145+
146+
Notice the "object" dtype.
147+
148+
With a ``CategoricalDtype`` of all numerics, datetimes, or
149+
timedeltas, we can automatically convert to the correct type.
150+
151+
dtype = {'B': CategoricalDtype([1, 2, 3])}
152+
pd.read_csv(StringIO(data), dtype=dtype).B.cat.categories
153+
154+
The values have been correctly interpreted as integers.
155+
132156
The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
133-
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
157+
``Series`` with categorical type will now return an instance of
158+
``CategoricalDtype``. For the most part, this is backwards compatible, though
159+
the string repr has changed. If you were previously using ``str(s.dtype) ==
160+
'category'`` to detect categorical data, switch to
161+
:func:`pandas.api.types.is_categorical_dtype`, which is compatible with the old
162+
and new ``CategoricalDtype``.
134163

135164
See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.
136165
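
The contrast described in this whatsnew entry can be sketched as follows. This is an illustration under assumptions, not text from the commit: it uses ``io.StringIO`` (Python 3) and checks for categorical data with ``isinstance`` against ``CategoricalDtype``, which is one possible alternative to the ``is_categorical_dtype`` helper named above.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

data = "A,B\na,1\nb,2\nc,3"

# dtype='category': categories are inferred from the text and stay strings.
inferred = pd.read_csv(StringIO(data), dtype={"B": "category"})["B"]

# Explicit numeric categories: the parsed values are converted to integers.
explicit = pd.read_csv(
    StringIO(data), dtype={"B": CategoricalDtype([1, 2, 3])}
)["B"]

print(inferred.cat.categories.tolist())
print(explicit.cat.categories.dtype)

# Detect categorical data without comparing str(dtype) to 'category'.
is_cat = isinstance(explicit.dtype, CategoricalDtype)
```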

pandas/_libs/parsers.pyx

+11-13
@@ -45,7 +45,7 @@ from pandas.core.dtypes.common import (
4545
is_bool_dtype, is_object_dtype,
4646
is_string_dtype, is_datetime64_dtype,
4747
pandas_dtype)
48-
from pandas.core.categorical import Categorical
48+
from pandas.core.categorical import Categorical, _recode_for_categories
4949
from pandas.core.algorithms import take_1d
5050
from pandas.core.dtypes.concat import union_categoricals
5151
from pandas import Index
@@ -1267,19 +1267,14 @@ cdef class TextReader:
12671267
return self._string_convert(i, start, end, na_filter,
12681268
na_hashset)
12691269
elif is_categorical_dtype(dtype):
1270+
# TODO: I suspect that _categorical_convert could be
1271+
# optimized when dtype is an instance of CategoricalDtype
12701272
codes, cats, na_count = _categorical_convert(
12711273
self.parser, i, start, end, na_filter,
12721274
na_hashset, self.c_encoding)
1273-
# sort categories and recode if necessary
1274-
cats = Index(cats)
1275-
if not cats.is_monotonic_increasing:
1276-
unsorted = cats.copy()
1277-
cats = cats.sort_values()
1278-
indexer = cats.get_indexer(unsorted)
1279-
codes = take_1d(indexer, codes, fill_value=-1)
1280-
1281-
return Categorical(codes, categories=cats, ordered=False,
1282-
fastpath=True), na_count
1275+
cat = Categorical._from_inferred_categories(cats, codes, dtype)
1276+
return cat, na_count
1277+
12831278
elif is_object_dtype(dtype):
12841279
return self._string_convert(i, start, end, na_filter,
12851280
na_hashset)
@@ -2230,8 +2225,11 @@ def _concatenate_chunks(list chunks):
22302225
if common_type == np.object:
22312226
warning_columns.append(str(name))
22322227

2233-
if is_categorical_dtype(dtypes.pop()):
2234-
result[name] = union_categoricals(arrs, sort_categories=True)
2228+
dtype = dtypes.pop()
2229+
if is_categorical_dtype(dtype):
2230+
sort_categories = isinstance(dtype, str)
2231+
result[name] = union_categoricals(arrs,
2232+
sort_categories=sort_categories)
22352233
else:
22362234
result[name] = np.concatenate(arrs)
22372235
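
To see why ``sort_categories`` is only enabled for the string ``'category'`` dtype, here is a small sketch using the public ``union_categoricals`` function. The two chunks are hypothetical stand-ins for what the reader would produce with an explicit ``CategoricalDtype(['c', 'b', 'a'])``.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two chunks that share the same user-specified category order.
cats = ["c", "b", "a"]
chunk1 = pd.Categorical(["a", "b"], categories=cats)
chunk2 = pd.Categorical(["b", "c"], categories=cats)

# sort_categories=False keeps the user-specified order.
kept = union_categoricals([chunk1, chunk2], sort_categories=False)

# sort_categories=True re-sorts lexically, discarding that order.
resorted = union_categoricals([chunk1, chunk2], sort_categories=True)

print(kept.categories.tolist())
print(resorted.categories.tolist())
```

Sorting is harmless when the categories were inferred, but it would silently override an explicit ordering, which appears to be the reason for the ``isinstance(dtype, str)`` check.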

pandas/core/categorical.py

+55
@@ -21,6 +21,8 @@
2121
_ensure_platform_int,
2222
is_dtype_equal,
2323
is_datetimelike,
24+
is_datetime64_dtype,
25+
is_timedelta64_dtype,
2426
is_categorical,
2527
is_categorical_dtype,
2628
is_integer_dtype,
@@ -510,6 +512,59 @@ def base(self):
510512
""" compat, we are always our own object """
511513
return None
512514

515+
@classmethod
516+
def _from_inferred_categories(cls, inferred_categories, inferred_codes,
517+
dtype):
518+
"""Construct a Categorical from inferred values
519+
520+
For inferred categories (`dtype` is 'category' or has no categories) the categories are sorted.
521+
For explicit `dtype`, the `inferred_categories` are cast to the
522+
appropriate type.
523+
524+
Parameters
525+
----------
526+
527+
inferred_categories : Index
528+
inferred_codes : Index
529+
dtype : CategoricalDtype or 'category'
530+
531+
Returns
532+
-------
533+
Categorical
534+
"""
535+
from pandas import Index, to_numeric, to_datetime, to_timedelta
536+
537+
cats = Index(inferred_categories)
538+
539+
known_categories = (isinstance(dtype, CategoricalDtype) and
540+
dtype.categories is not None)
541+
542+
if known_categories:
543+
# Convert to a specialized type with `dtype` if specified
544+
if dtype.categories.is_numeric():
545+
cats = to_numeric(inferred_categories, errors='coerce')
546+
elif is_datetime64_dtype(dtype.categories):
547+
cats = to_datetime(inferred_categories, errors='coerce')
548+
elif is_timedelta64_dtype(dtype.categories):
549+
cats = to_timedelta(inferred_categories, errors='coerce')
550+
551+
if known_categories:
552+
# recode from observation order to dtype.categories order
553+
categories = dtype.categories
554+
codes = _recode_for_categories(inferred_codes, cats, categories)
555+
elif not cats.is_monotonic_increasing:
556+
# sort categories and recode for unknown categories
557+
unsorted = cats.copy()
558+
categories = cats.sort_values()
559+
codes = _recode_for_categories(inferred_codes, unsorted,
560+
categories)
561+
dtype = CategoricalDtype(categories, ordered=False)
562+
else:
563+
dtype = CategoricalDtype(cats, ordered=False)
564+
codes = inferred_codes
565+
566+
return cls(codes, dtype=dtype, fastpath=True)
567+
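
The recoding step can be illustrated with public API only. The helper below, ``recode``, is a hypothetical stand-in written for this note; it is not the private ``_recode_for_categories`` and may differ from it in details such as dtype handling.

```python
import numpy as np
import pandas as pd


def recode(codes, old_categories, new_categories):
    """Map codes that index `old_categories` onto `new_categories`.

    Categories absent from `new_categories`, and existing -1 codes,
    both map to -1 (missing).
    """
    old = pd.Index(old_categories)
    new = pd.Index(new_categories)
    # Position of each old category within the new categories, -1 if absent.
    indexer = new.get_indexer(old)
    # Append a -1 sentinel so that a code of -1 indexes the last slot.
    lookup = np.append(indexer, -1)
    return lookup[np.asarray(codes)]


# Observation order 'a', 'b', 'c' recoded to the dtype order 'c', 'b', 'a'.
reordered = recode([0, 1, 1, 2], ["a", "b", "c"], ["c", "b", "a"])

# The observed 'c' is not among the target categories, so it becomes -1.
unexpected = recode([0, 1, 2, 0], ["d", "a", "c"], ["a", "b", "d", "e"])

cat = pd.Categorical.from_codes(unexpected, categories=["a", "b", "d", "e"])
```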
513568
@classmethod
514569
def from_array(cls, data, **kwargs):
515570
"""

pandas/io/parsers.py

+14-5
@@ -21,6 +21,7 @@
2121
is_float, is_dtype_equal,
2222
is_object_dtype, is_string_dtype,
2323
is_scalar, is_categorical_dtype)
24+
from pandas.core.dtypes.dtypes import CategoricalDtype
2425
from pandas.core.dtypes.missing import isna
2526
from pandas.core.dtypes.cast import astype_nansafe
2627
from pandas.core.index import (Index, MultiIndex, RangeIndex,
@@ -1602,12 +1603,20 @@ def _cast_types(self, values, cast_type, column):
16021603
"""
16031604

16041605
if is_categorical_dtype(cast_type):
1605-
# XXX this is for consistency with
1606-
# c-parser which parses all categories
1607-
# as strings
1608-
if not is_object_dtype(values):
1606+
known_cats = (isinstance(cast_type, CategoricalDtype) and
1607+
cast_type.categories is not None)
1608+
1609+
if not is_object_dtype(values) and not known_cats:
1610+
# XXX this is for consistency with
1611+
# c-parser which parses all categories
1612+
# as strings
16091613
values = astype_nansafe(values, str)
1610-
values = Categorical(values)
1614+
1615+
cats = Index(values).unique().dropna()
1616+
values = Categorical._from_inferred_categories(
1617+
cats, cats.get_indexer(values), cast_type
1618+
)
1619+
16111620
else:
16121621
try:
16131622
values = astype_nansafe(values, cast_type, copy=True)
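
The Python-parser path above first derives categories and codes from the raw column and then hands them to ``_from_inferred_categories``. A minimal sketch of that first step, using a hypothetical object array as the column values:

```python
import numpy as np
import pandas as pd

# Hypothetical parsed column containing one missing value.
values = np.array(["d", "a", np.nan, "d"], dtype=object)

# Unique non-missing values, in order of first appearance.
cats = pd.Index(values).unique().dropna()

# Integer position of each value within `cats`; missing values get -1.
codes = cats.get_indexer(values)

print(cats.tolist())
print(codes.tolist())
```

Because the missing value was dropped from ``cats``, it has no position there and ``get_indexer`` reports ``-1``, which is the code a ``Categorical`` uses for missing data.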

pandas/tests/io/parser/dtypes.py

+99
@@ -149,6 +149,105 @@ def test_categorical_dtype_chunksize(self):
149149
for actual, expected in zip(actuals, expecteds):
150150
tm.assert_frame_equal(actual, expected)
151151

152+
@pytest.mark.parametrize('ordered', [False, True])
153+
@pytest.mark.parametrize('categories', [
154+
['a', 'b', 'c'],
155+
['a', 'c', 'b'],
156+
['a', 'b', 'c', 'd'],
157+
['c', 'b', 'a'],
158+
])
159+
def test_categorical_categoricaldtype(self, categories, ordered):
160+
data = """a,b
161+
1,a
162+
1,b
163+
1,b
164+
2,c"""
165+
expected = pd.DataFrame({
166+
"a": [1, 1, 1, 2],
167+
"b": Categorical(['a', 'b', 'b', 'c'],
168+
categories=categories,
169+
ordered=ordered)
170+
})
171+
dtype = {"b": CategoricalDtype(categories=categories,
172+
ordered=ordered)}
173+
result = self.read_csv(StringIO(data), dtype=dtype)
174+
tm.assert_frame_equal(result, expected)
175+
176+
def test_categorical_categoricaldtype_unsorted(self):
177+
data = """a,b
178+
1,a
179+
1,b
180+
1,b
181+
2,c"""
182+
dtype = CategoricalDtype(['c', 'b', 'a'])
183+
expected = pd.DataFrame({
184+
'a': [1, 1, 1, 2],
185+
'b': Categorical(['a', 'b', 'b', 'c'], categories=['c', 'b', 'a'])
186+
})
187+
result = self.read_csv(StringIO(data), dtype={'b': dtype})
188+
tm.assert_frame_equal(result, expected)
189+
190+
def test_categoricaldtype_coerces_numeric(self):
191+
dtype = {'b': CategoricalDtype([1, 2, 3])}
192+
data = "b\n1\n1\n2\n3"
193+
expected = pd.DataFrame({'b': Categorical([1, 1, 2, 3])})
194+
result = self.read_csv(StringIO(data), dtype=dtype)
195+
tm.assert_frame_equal(result, expected)
196+
197+
def test_categoricaldtype_coerces_datetime(self):
198+
dtype = {
199+
'b': CategoricalDtype(pd.date_range('2017', '2019', freq='AS'))
200+
}
201+
data = "b\n2017-01-01\n2018-01-01\n2019-01-01"
202+
expected = pd.DataFrame({'b': Categorical(dtype['b'].categories)})
203+
result = self.read_csv(StringIO(data), dtype=dtype)
204+
tm.assert_frame_equal(result, expected)
205+
206+
dtype = {
207+
'b': CategoricalDtype([pd.Timestamp("2014")])
208+
}
209+
data = "b\n2014-01-01\n2014-01-01T00:00:00"
210+
expected = pd.DataFrame({'b': Categorical([pd.Timestamp('2014')] * 2)})
211+
result = self.read_csv(StringIO(data), dtype=dtype)
212+
tm.assert_frame_equal(result, expected)
213+
214+
def test_categoricaldtype_coerces_timedelta(self):
215+
dtype = {'b': CategoricalDtype(pd.to_timedelta(['1H', '2H', '3H']))}
216+
data = "b\n1H\n2H\n3H"
217+
expected = pd.DataFrame({'b': Categorical(dtype['b'].categories)})
218+
result = self.read_csv(StringIO(data), dtype=dtype)
219+
tm.assert_frame_equal(result, expected)
220+
221+
def test_categoricaldtype_unexpected_categories(self):
222+
dtype = {'b': CategoricalDtype(['a', 'b', 'd', 'e'])}
223+
data = "b\nd\na\nc\nd" # Unexpected c
224+
expected = pd.DataFrame({"b": Categorical(list('dacd'),
225+
dtype=dtype['b'])})
226+
result = self.read_csv(StringIO(data), dtype=dtype)
227+
tm.assert_frame_equal(result, expected)
228+
229+
def test_categorical_categoricaldtype_chunksize(self):
230+
# GH 10153
231+
data = """a,b
232+
1,a
233+
1,b
234+
1,b
235+
2,c"""
236+
cats = ['a', 'b', 'c']
237+
expecteds = [pd.DataFrame({'a': [1, 1],
238+
'b': Categorical(['a', 'b'],
239+
categories=cats)}),
240+
pd.DataFrame({'a': [1, 2],
241+
'b': Categorical(['b', 'c'],
242+
categories=cats)},
243+
index=[2, 3])]
244+
dtype = CategoricalDtype(cats)
245+
actuals = self.read_csv(StringIO(data), dtype={'b': dtype},
246+
chunksize=2)
247+
248+
for actual, expected in zip(actuals, expecteds):
249+
tm.assert_frame_equal(actual, expected)
250+
152251
def test_empty_pass_dtype(self):
153252
data = 'one,two'
154253
result = self.read_csv(StringIO(data), dtype={'one': 'u1'})
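
For readers who want to try the chunked case outside the test harness, here is a standalone sketch. It assumes plain ``pd.read_csv`` in place of the suite's ``self.read_csv`` and uses an unsorted category order to make the preserved ordering visible.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

data = "a,b\n1,a\n1,b\n1,b\n2,c"
dtype = CategoricalDtype(["c", "b", "a"])

# Read two rows at a time; every chunk should carry the full category set.
chunks = list(pd.read_csv(StringIO(data), dtype={"b": dtype}, chunksize=2))

for chunk in chunks:
    print(chunk["b"].cat.categories.tolist(), chunk["b"].tolist())
```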
