Accept CategoricalDtype in read_csv #17643
Changes from 13 commits
e83a0b8
388e8a9
c5f6e04
4b588cd
e32d5be
508dd1e
6f175a7
1545734
de9e3ee
b80cff8
b028827
fc34080
d100f0c
8600c50
8c4ab5b
96d5144
3de75cd
f03798d
9325a93
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -452,7 +452,8 @@ Specifying Categorical dtype | |
|
||
.. versionadded:: 0.19.0 | ||
|
||
``Categorical`` columns can be parsed directly by specifying ``dtype='category'`` | ||
``Categorical`` columns can be parsed directly by specifying ``dtype='category'`` or | ||
``dtype=CategoricalDtype(categories, ordered)``. | ||
|
||
.. ipython:: python | ||
|
||
|
@@ -468,12 +469,40 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification | |
|
||
pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes | ||
|
||
.. versionadded:: 0.21.0 | ||
|
||
Specifying ``dtype='category'`` will result in an unordered ``Categorical`` | ||
whose ``categories`` are the unique values observed in the data. For more | ||
control over the categories and order, create a | ||
:class:`~pandas.api.types.CategoricalDtype` ahead of time, and pass that for | ||
that column's ``dtype``. | ||
|
||
.. ipython:: python | ||
|
||
from pandas.api.types import CategoricalDtype | ||
|
||
dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True) | ||
pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes | ||
|
||
When using ``dtype=CategoricalDtype``, "unexpected" values outside of | ||
``dtype.categories`` are treated as missing values. | ||
|
||
.. ipython:: python | ||
|
||
dtype = CategoricalDtype(['a', 'b', 'd']) # No 'c' | ||
Review comment: missing |
||
pd.read_csv(StringIO(data), dtype={'col1': dtype}).col1 | ||
|
||
This matches the behavior of :meth:`Categorical.set_categories`. | ||
|
||
.. note:: | ||
|
||
The resulting categories will always be parsed as strings (object dtype). | ||
If the categories are numeric they can be converted using the | ||
:func:`to_numeric` function, or as appropriate, another converter | ||
such as :func:`to_datetime`. | ||
With ``dtype='category'``, the resulting categories will always be parsed | ||
as strings (object dtype). If the categories are numeric they can be | ||
converted using the :func:`to_numeric` function, or as appropriate, another | ||
converter such as :func:`to_datetime`. | ||
|
||
When ``dtype`` is a ``CategoricalDtype`` with homogeneous ``categories`` | ||
(all numeric, all datetimes, etc.), the conversion is done automatically. | ||
|
||
.. ipython:: python | ||
|
||
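The handling of values outside ``dtype.categories`` described in this hunk can be sketched as follows. This is only an illustration, assuming a pandas version that accepts a ``CategoricalDtype`` in ``read_csv`` (0.21.0 or later); the CSV content and variable names are invented for the example and are not the data used in the docs.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical CSV content, not the data used in the documentation hunk.
data = "col1,col2\na,1\nb,2\nc,3\nd,4"

# 'c' is deliberately left out of the declared categories.
dtype = CategoricalDtype(["a", "b", "d"])
col1 = pd.read_csv(StringIO(data), dtype={"col1": dtype})["col1"]

# The same outcome, reached by inferring categories first and then
# restricting them with set_categories.
inferred = pd.read_csv(StringIO(data), dtype={"col1": "category"})["col1"]
recoded = inferred.cat.set_categories(["a", "b", "d"])

print(col1.isna().tolist())     # which rows became missing
print(recoded.isna().tolist())  # same rows, via set_categories
```

In both cases the row holding ``'c'`` should come back as a missing value, which is the equivalence with ``Categorical.set_categories`` that the text describes.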
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -119,7 +119,7 @@ expanded to include the ``categories`` and ``ordered`` attributes. A | |
``CategoricalDtype`` can be used to specify the set of categories and | ||
orderedness of an array, independent of the data themselves. This can be useful, | ||
e.g., when converting string data to a ``Categorical`` (:issue:`14711`, | ||
:issue:`15078`, :issue:`16015`): | ||
:issue:`15078`, :issue:`16015`, :issue:`17643`): | ||
|
||
.. ipython:: python | ||
|
||
|
@@ -129,8 +129,37 @@ e.g., when converting string data to a ``Categorical`` (:issue:`14711`, | |
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True) | ||
s.astype(dtype) | ||
|
||
One place that deserves special mention is in :meth:`read_csv`. Previously, with | ||
Review comment: maybe a separate sub-section for this |
||
``dtype={'col': 'category'}``, the returned values and categories would always | ||
be strings. | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
||
from pandas.compat import StringIO | ||
Review comment: in general we put this in the hidden code block at the top of the file, as people shouldn't use this from pandas, but just import it themselves |
||
|
||
.. ipython:: python | ||
|
||
data = 'A,B\na,1\nb,2\nc,3' | ||
pd.read_csv(StringIO(data), dtype={'B': 'category'}).B.cat.categories | ||
|
||
Notice the "object" dtype. | ||
|
||
With a ``CategoricalDtype`` of all numerics, datetimes, or | ||
timedeltas, we can automatically convert to the correct type: | ||
|
||
.. ipython:: python | ||
|
||
dtype = {'B': CategoricalDtype([1, 2, 3])} | ||
pd.read_csv(StringIO(data), dtype=dtype).B.cat.categories | ||
|
||
The values have been correctly interpreted as integers. | ||
|
||
The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a | ||
``Series`` with categorical type will now return an instance of ``CategoricalDtype``. | ||
``Series`` with categorical type will now return an instance of | ||
``CategoricalDtype``. For the most part, this is backwards compatible, though | ||
the string repr has changed. If you were previously using ``str(s.dtype) == | ||
'category'`` to detect categorical data, switch to | ||
:func:`pandas.api.types.is_categorical_dtype`, which is compatible with the old | ||
and new ``CategoricalDtype``. | ||
|
||
See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more. | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -45,9 +45,10 @@ from pandas.core.dtypes.common import ( | |
is_bool_dtype, is_object_dtype, | ||
is_string_dtype, is_datetime64_dtype, | ||
pandas_dtype) | ||
from pandas.core.categorical import Categorical | ||
from pandas.core.categorical import Categorical, _recode_for_categories | ||
from pandas.core.algorithms import take_1d | ||
from pandas.core.dtypes.concat import union_categoricals | ||
from pandas.core.dtypes.cast import maybe_convert_for_categorical | ||
from pandas import Index | ||
|
||
import pandas.io.common as com | ||
|
@@ -1267,19 +1268,35 @@ cdef class TextReader: | |
return self._string_convert(i, start, end, na_filter, | ||
na_hashset) | ||
elif is_categorical_dtype(dtype): | ||
# TODO: I suspect that _categorical_convert could be | ||
# optimized when dtype is an instance of CategoricalDtype | ||
codes, cats, na_count = _categorical_convert( | ||
self.parser, i, start, end, na_filter, | ||
na_hashset, self.c_encoding) | ||
# sort categories and recode if necessary | ||
cats = Index(cats) | ||
if not cats.is_monotonic_increasing: | ||
|
||
cats = maybe_convert_for_categorical(cats, dtype) | ||
|
||
if (isinstance(dtype, CategoricalDtype) and | ||
dtype.categories is not None): | ||
# recode for dtype.categories | ||
categories = dtype.categories | ||
Review comment: use
Reply: Fixed (will wait to push until I hear back about #17643 (comment)) |
||
codes = _recode_for_categories(codes, cats, categories) | ||
ordered = dtype.ordered | ||
elif not cats.is_monotonic_increasing: | ||
# sort categories and recode if necessary | ||
unsorted = cats.copy() | ||
cats = cats.sort_values() | ||
indexer = cats.get_indexer(unsorted) | ||
codes = take_1d(indexer, codes, fill_value=-1) | ||
categories = cats.sort_values() | ||
Review comment: I would move ALL of this logic and simply create a new factory for |
||
codes = _recode_for_categories(codes, unsorted, categories) | ||
ordered = False | ||
else: | ||
categories = cats | ||
ordered = False | ||
|
||
cat = Categorical(codes, categories=categories, ordered=ordered, | ||
fastpath=True) | ||
|
||
return Categorical(codes, categories=cats, ordered=False, | ||
fastpath=True), na_count | ||
return cat, na_count | ||
elif is_object_dtype(dtype): | ||
return self._string_convert(i, start, end, na_filter, | ||
na_hashset) | ||
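The recoding step in this hunk (map the codes produced by the parser onto the categories declared in the dtype) can be sketched with public pandas and NumPy calls. ``recode_codes`` below is a hypothetical helper written for illustration; it is not the private ``_recode_for_categories`` function, whose exact behavior is not shown in this diff.

```python
import numpy as np
import pandas as pd


def recode_codes(codes, old_categories, new_categories):
    """Map integer codes from old_categories onto new_categories.

    Hypothetical helper for illustration; not the private pandas function.
    """
    codes = np.asarray(codes)
    # Position of each old category within the new ones (-1 if absent).
    indexer = pd.Index(new_categories).get_indexer(pd.Index(old_categories))
    # Append -1 so that the missing-value code -1 indexes the last slot
    # of the lookup table and therefore stays -1.
    lookup = np.append(indexer, -1)
    return lookup[codes]


old = ["b", "a", "c"]     # categories in the order the parser saw them
codes = [0, 1, 2, -1, 1]  # b, a, c, <missing>, a
new = ["a", "b", "d"]     # declared categories; 'c' is not among them

new_codes = recode_codes(codes, old, new)
cat = pd.Categorical.from_codes(new_codes, categories=new, ordered=True)
print(cat)
```

Values whose old category does not appear in the declared categories end up with code ``-1``, which is how they become missing in the resulting ``Categorical``.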
|
@@ -2230,8 +2247,11 @@ def _concatenate_chunks(list chunks): | |
if common_type == np.object: | ||
warning_columns.append(str(name)) | ||
|
||
if is_categorical_dtype(dtypes.pop()): | ||
result[name] = union_categoricals(arrs, sort_categories=True) | ||
dtype = dtypes.pop() | ||
if is_categorical_dtype(dtype): | ||
sort_categories = isinstance(dtype, str) | ||
Review comment: str -> string_types |
||
result[name] = union_categoricals(arrs, | ||
sort_categories=sort_categories) | ||
else: | ||
result[name] = np.concatenate(arrs) | ||
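The choice of ``sort_categories`` when concatenating chunks can be illustrated with the public ``union_categoricals`` function. The sketch below assumes that chunks parsed with ``dtype='category'`` each infer their own categories, while chunks parsed with an explicit ``CategoricalDtype`` all share the declared ones; the chunk contents are made up.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype, union_categoricals

# Chunks parsed with dtype='category': each chunk infers its own categories.
chunk1 = pd.Categorical(["b", "c"])
chunk2 = pd.Categorical(["a", "b"])
unsorted_union = union_categoricals([chunk1, chunk2])
sorted_union = union_categoricals([chunk1, chunk2], sort_categories=True)

# Chunks parsed with an explicit dtype share the declared categories,
# so the union is taken without sorting to keep the declared order.
dtype = CategoricalDtype(["d", "c", "b", "a"], ordered=True)
typed1 = pd.Categorical(["d", "c"], dtype=dtype)
typed2 = pd.Categorical(["a"], dtype=dtype)
typed_union = union_categoricals([typed1, typed2])

print(list(unsorted_union.categories))
print(list(sorted_union.categories))
print(list(typed_union.categories), typed_union.ordered)
```

Sorting makes the inferred case independent of chunk boundaries, whereas for a declared dtype the user-supplied order is the one worth preserving.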
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,7 +24,8 @@ | |
_ensure_int32, _ensure_int64, | ||
_NS_DTYPE, _TD_DTYPE, _INT64_DTYPE, | ||
_POSSIBLY_CAST_DTYPES) | ||
from .dtypes import ExtensionDtype, DatetimeTZDtype, PeriodDtype | ||
from .dtypes import (ExtensionDtype, DatetimeTZDtype, PeriodDtype, | ||
CategoricalDtype) | ||
from .generic import (ABCDatetimeIndex, ABCPeriodIndex, | ||
ABCSeries) | ||
from .missing import isna, notna | ||
|
@@ -604,6 +605,41 @@ def conv(r, dtype): | |
return [conv(r, dtype) for r, dtype in zip(result, dtypes)] | ||
|
||
|
||
def maybe_convert_for_categorical(categories, dtype): | ||
"""Convert ``categories`` depending on ``dtype``. | ||
|
||
Converts to numeric, datetime, or timedelta types, when ``dtype`` is | ||
a CategoricalDtype with known, non-object categories. | ||
|
||
Parameters | ||
---------- | ||
categories : array-like | ||
dtype : CategoricalDtype | ||
|
||
Returns | ||
------- | ||
new_categories : array or Index | ||
|
||
Examples | ||
-------- | ||
>>> maybe_convert_for_categorical(['1', '2'], CategoricalDtype([1, 2])) | ||
array([ 1, 2]) | ||
>>> maybe_convert_for_categorical([1, 'a'], CategoricalDtype([1, 2])) | ||
array([ 1., nan]) | ||
""" | ||
if isinstance(dtype, CategoricalDtype) and dtype.categories is not None: | ||
Review comment: in reality this is just an Index routine maybe
and the if isinstance(dtype, ....) logic can be in
Review comment: see my comment below, you can simply fold this in to |
||
from pandas import to_numeric, to_datetime, to_timedelta | ||
|
||
if dtype.categories.is_numeric(): | ||
categories = to_numeric(categories, errors='coerce') | ||
elif is_datetime64_dtype(dtype.categories): | ||
categories = to_datetime(categories, errors='coerce') | ||
elif is_timedelta64_dtype(dtype.categories): | ||
categories = to_timedelta(categories, errors='coerce') | ||
|
||
return categories | ||
|
||
|
||
def astype_nansafe(arr, dtype, copy=True): | ||
""" return a view if copy is False, but | ||
need to be very careful as the result shape could change! """ | ||
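The conversion performed by ``maybe_convert_for_categorical`` in this file can be approximated with public API only. ``convert_like_categories`` below is a hypothetical stand-in written for this sketch: it uses the ``pandas.api.types`` checks instead of ``Index.is_numeric``, and it skips boolean categories, which is an assumption of the sketch rather than something stated in the diff.

```python
import pandas as pd
from pandas.api.types import (
    CategoricalDtype,
    is_bool_dtype,
    is_datetime64_any_dtype,
    is_numeric_dtype,
    is_timedelta64_dtype,
)


def convert_like_categories(values, dtype):
    """Coerce parsed values to the type of dtype.categories, if known.

    Hypothetical stand-in for illustration; not the function in the diff.
    """
    if not isinstance(dtype, CategoricalDtype) or dtype.categories is None:
        # No declared categories: leave the parsed values untouched.
        return values
    cats = dtype.categories
    if is_numeric_dtype(cats) and not is_bool_dtype(cats):
        return pd.to_numeric(values, errors="coerce")
    if is_datetime64_any_dtype(cats):
        return pd.to_datetime(values, errors="coerce")
    if is_timedelta64_dtype(cats):
        return pd.to_timedelta(values, errors="coerce")
    return values


numeric = convert_like_categories(["1", "2"], CategoricalDtype([1, 2]))
coerced = convert_like_categories(["1", "a"], CategoricalDtype([1, 2]))
untouched = convert_like_categories(["1", "2"], "category")
print(numeric, coerced, untouched)
```

With ``errors="coerce"``, entries that cannot be converted become missing, so they later fall outside the declared categories instead of raising.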
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,8 +21,10 @@ | |
is_float, is_dtype_equal, | ||
is_object_dtype, is_string_dtype, | ||
is_scalar, is_categorical_dtype) | ||
from pandas.core.dtypes.dtypes import CategoricalDtype | ||
from pandas.core.dtypes.missing import isna | ||
from pandas.core.dtypes.cast import astype_nansafe | ||
from pandas.core.dtypes.cast import (astype_nansafe, | ||
maybe_convert_for_categorical) | ||
from pandas.core.index import (Index, MultiIndex, RangeIndex, | ||
_ensure_index_from_sequences) | ||
from pandas.core.series import Series | ||
|
@@ -1605,9 +1607,18 @@ def _cast_types(self, values, cast_type, column): | |
# XXX this is for consistency with | ||
# c-parser which parses all categories | ||
# as strings | ||
if not is_object_dtype(values): | ||
known_cats = (isinstance(cast_type, CategoricalDtype) and | ||
cast_type.categories is not None) | ||
Review comment: none of this logic should live here either. move to pandas.core.dtypes.cast.py (also ok with a new module pandas.core.dtypes.categorical.py if it's simpler)
Reply: Refactored most of this to |
||
|
||
categories = ordered = None | ||
Review comment: why is this not using
Reply: I'm not sure how much cleaner 3de75cd is. These really don't share much code, since the python parser has values, while the C parser has categories and codes. And the python parser has to maybe cast values to strings with |
||
if known_cats: | ||
values = maybe_convert_for_categorical(values, cast_type) | ||
categories = cast_type.categories | ||
ordered = cast_type.ordered | ||
elif not is_object_dtype(values): | ||
values = astype_nansafe(values, str) | ||
values = Categorical(values) | ||
values = Categorical(values, categories=categories, | ||
ordered=ordered) | ||
else: | ||
try: | ||
values = astype_nansafe(values, cast_type, copy=True) | ||
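The value-based path of the Python parser, as opposed to the code-based path of the C parser, can be sketched as below. This is a simplified illustration using public calls only; the input array and variable names are invented, and the real ``_cast_types`` method handles more cases than shown here.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# Strings as a parser might hand them over; '9' is not a declared category.
raw = np.array(["1", "2", "3", "9"], dtype=object)

# Known categories: coerce the values to the categories' type, then build
# the Categorical against the declared categories and orderedness.
dtype = CategoricalDtype([1, 2, 3])
converted = pd.to_numeric(raw, errors="coerce")
known = pd.Categorical(
    converted, categories=dtype.categories, ordered=dtype.ordered
)

# No declared categories: values stay strings and categories are inferred.
inferred = pd.Categorical(raw)

print(known.codes.tolist(), list(known.categories))
print(list(inferred.categories))
```

Because the Python parser works on values, it never needs an explicit recoding step; building the ``Categorical`` against the declared categories assigns the codes directly.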
|
Review comment: versionadded here
Review comment: maybe a sub-section for this?