
Accept CategoricalDtype in read_csv #17643


Merged: 19 commits, Oct 2, 2017
Changes from 6 commits
27 changes: 22 additions & 5 deletions doc/source/io.rst
@@ -452,7 +452,8 @@ Specifying Categorical dtype

.. versionadded:: 0.19.0

``Categorical`` columns can be parsed directly by specifying ``dtype='category'``
``Categorical`` columns can be parsed directly by specifying ``dtype='category'`` or
``dtype=CategoricalDtype(categories, ordered)``.

.. ipython:: python

@@ -468,12 +469,28 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification

pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes

Specifying ``dtype='category'`` will result in an unordered ``Categorical``

Contributor: versionadded here

Contributor: maybe a sub-section for this?

whose ``categories`` are the unique values observed in the data. For more
control on the categories and order, create a
:class:`~pandas.api.types.CategoricalDtype` ahead of time, and pass that for
that column's ``dtype``.

.. ipython:: python

from pandas.api.types import CategoricalDtype

dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)
pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes

.. note::

The resulting categories will always be parsed as strings (object dtype).
If the categories are numeric they can be converted using the
:func:`to_numeric` function, or as appropriate, another converter
such as :func:`to_datetime`.
With ``dtype='category'``, the resulting categories will always be parsed
as strings (object dtype). If the categories are numeric they can be
converted using the :func:`to_numeric` function, or as appropriate, another
converter such as :func:`to_datetime`.
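
For illustration, a minimal standalone sketch of the conversion this note recommends (not part of the diff; the column name and data are made up):

    from io import StringIO
    import pandas as pd

    data = 'col1\n1\n2\n3'
    df = pd.read_csv(StringIO(data), dtype={'col1': 'category'})
    # with dtype='category' the categories come back as strings
    df['col1'].cat.categories          # Index(['1', '2', '3'], dtype='object')
    # convert them to numeric after parsing
    df['col1'] = df['col1'].cat.rename_categories(
        pd.to_numeric(df['col1'].cat.categories))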

When ``dtype`` is a ``CategoricalDtype`` with homogenous ``categoriess`` (

Member: categoriess -> categories

all numeric, all datetimes, etc.), the conversion is done automatically.

.. ipython:: python

Expand Down
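
The ipython block above is collapsed in this view; roughly, the automatic conversion it documents behaves like the following sketch (made-up data, assuming all-numeric categories):

    from io import StringIO
    import pandas as pd
    from pandas.api.types import CategoricalDtype

    data = 'col1\n1\n2\n3'
    dtype = CategoricalDtype([1, 2, 3])
    df = pd.read_csv(StringIO(data), dtype={'col1': dtype})
    # the categories are numeric rather than strings, because the supplied
    # CategoricalDtype had homogeneous (all-numeric) categories
    df['col1'].cat.categories
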
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -163,6 +163,8 @@ Other Enhancements
- :func:`Categorical.rename_categories` now accepts a dict-like argument as `new_categories` and only updates the categories found in that dict. (:issue:`17336`)
- :func:`read_excel` raises ``ImportError`` with a better message if ``xlrd`` is not installed. (:issue:`17613`)
- :meth:`DataFrame.assign` will preserve the original order of ``**kwargs`` for Python 3.6+ users instead of sorting the column names
- Pass a :class:`~pandas.api.types.CategoricalDtype` to :meth:`read_csv` to parse categorical

Member: I would clarify this should be passed to the dtype keyword?

Also, apart from the fact you can also have non-string categories, are there not more benefits (like being able to specify the categories yourself, specific order, ... performance?)?

Contributor Author: Perhaps I'll merge this with the main section for CategoricalDtype. (no extra performance yet though)

data as numeric, datetimes, or timedeltas, instead of strings. See :ref:`here <io.categorical>`. (:issue:`17643`)


.. _whatsnew_0210.api_breaking:
Expand Down
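
On the reviewer's question about other benefits: supplying the categories yourself also controls which values are treated as valid; anything not listed is parsed as missing, and the ordering you specify is kept. A small illustrative sketch (made-up data, not part of the PR):

    from io import StringIO
    import pandas as pd
    from pandas.api.types import CategoricalDtype

    data = 'col1\na\nb\nd'
    dtype = CategoricalDtype(['a', 'b', 'c'], ordered=True)
    df = pd.read_csv(StringIO(data), dtype={'col1': dtype})
    # 'd' is not among the specified categories, so it is parsed as NaN,
    # and the requested ordering a < b < c is preserved
    df['col1'].isna().tolist()         # [False, False, True]
    df['col1'].cat.ordered             # True
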
51 changes: 42 additions & 9 deletions pandas/_libs/parsers.pyx
@@ -48,7 +48,7 @@ from pandas.core.dtypes.common import (
from pandas.core.categorical import Categorical
from pandas.core.algorithms import take_1d
from pandas.core.dtypes.concat import union_categoricals
from pandas import Index
from pandas import Index, to_numeric, to_datetime, to_timedelta

import pandas.io.common as com

@@ -1267,19 +1267,49 @@ cdef class TextReader:
return self._string_convert(i, start, end, na_filter,
na_hashset)
elif is_categorical_dtype(dtype):
# TODO: I suspect that _categorical_convert could be
# optimized when dtype is an instance of CategoricalDtype
codes, cats, na_count = _categorical_convert(
self.parser, i, start, end, na_filter,
na_hashset, self.c_encoding)
# sort categories and recode if necessary
cats = Index(cats)
if not cats.is_monotonic_increasing:

# Determine if we should convert inferred string
# categories to a specialized type
if (isinstance(dtype, CategoricalDtype) and

Contributor: I would rather move this entire section to a free function (except for the actual constructor), maybe:

    cats, dtype = infer_categorical_dtype(cats)  # put in pandas.core.dtypes.cast.py
    cats = Categorical(cats, codes, dtype=dtype)

NONE of this logic should be here.

dtype.categories is not None):
if dtype.categories.is_numeric():
# is ignore correct?
cats = to_numeric(cats, errors='ignore')
elif dtype.categories.is_all_dates:

Contributor: I think this may leave open corner cases where strings don't map 1->1 with categories? For example:

    cats:
    # DatetimeIndex(['2014-01-01'], dtype='datetime64[ns]', freq=None)

    data:
    # ['2014-01-01', '2014-01-01T00:00:00', '2014-01-01']

Contributor Author: Sorry, I don't follow. This passes:

        dtype = {
            'b': CategoricalDtype([pd.Timestamp("2014")])
        }
        # Two representations of the same value
        data = "b\n2014-01-01\n2014-01-01T00:00:00"
        expected = pd.DataFrame({'b': Categorical([pd.Timestamp('2014')] * 2)})
        result = self.read_csv(StringIO(data), dtype=dtype)
        tm.assert_frame_equal(result, expected)

Contributor: Does result['b'] not have duplicated categories? Sorry, don't have it checked out locally, only guessing.

Contributor Author: No problem. It has multiple values, but the categories are unique.

In [10]: pd.read_csv(StringIO(data), dtype=dtype).b.dtype
Out[10]: CategoricalDtype(categories=['2014-01-01'], ordered=False)

The categories passed to the Categorical constructor later on comes directly from dtype.categories, which is unique. The coercion is done on the values so it's OK if different string forms are coerced to the same value.

# is ignore correct?
if is_datetime64_dtype(dtype.categories):
cats = to_datetime(cats, errors='ignore')
else:
cats = to_timedelta(cats, errors='ignore')

if (isinstance(dtype, CategoricalDtype) and
dtype.categories is not None):
# recode for dtype.categories
categories = dtype.categories

Contributor Author: Fixed (will wait to push until I hear back about #17643 (comment))

indexer = categories.get_indexer(cats)
codes = take_1d(indexer, codes, fill_value=-1)
ordered = dtype.ordered
elif not cats.is_monotonic_increasing:
# sort categories and recode if necessary
unsorted = cats.copy()
cats = cats.sort_values()
indexer = cats.get_indexer(unsorted)
categories = cats.sort_values()

Contributor: I would move ALL of this logic and simply create a new factory for Categorical.infer_from_categories(cats, codes, dtype=dtype) (and even fold in the maybe_convert_for_categorical). This just makes parsing code longer and longer; we want to push down logic to the dtypes.

indexer = categories.get_indexer(unsorted)
codes = take_1d(indexer, codes, fill_value=-1)
ordered = False
else:
categories = cats
ordered = False

cat = Categorical(codes, categories=categories, ordered=ordered,
fastpath=True)

return Categorical(codes, categories=cats, ordered=False,
fastpath=True), na_count
return cat, na_count
elif is_object_dtype(dtype):
return self._string_convert(i, start, end, na_filter,
na_hashset)
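
The recoding performed above (mapping the codes returned by _categorical_convert onto the user-supplied categories) can be sketched in plain Python roughly as follows. The free-function form and its name are illustrative; in the PR the logic is inline in the Cython reader and relies on PR-era internals (take_1d, the fastpath Categorical constructor):

    from pandas import Categorical, Index
    from pandas.core.algorithms import take_1d

    def recode_for_dtype(cats, codes, dtype):
        # `cats` holds the unique values seen by the parser and `codes`
        # (an integer ndarray) indexes into them; remap both onto the
        # categories requested by the user's CategoricalDtype
        categories = dtype.categories
        indexer = categories.get_indexer(Index(cats))
        # parsed values whose category is missing from dtype.categories
        # get code -1, i.e. NaN in the result
        new_codes = take_1d(indexer, codes, fill_value=-1)
        return Categorical(new_codes, categories=categories,
                           ordered=dtype.ordered, fastpath=True)

For example, with inferred cats ['a', 'b', 'c'], codes [0, 1, 1, 2] and dtype CategoricalDtype(['c', 'b', 'a']), the recoded codes are [2, 1, 1, 0].
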
@@ -2230,8 +2260,11 @@ def _concatenate_chunks(list chunks):
if common_type == np.object:
warning_columns.append(str(name))

if is_categorical_dtype(dtypes.pop()):
result[name] = union_categoricals(arrs, sort_categories=True)
dtype = dtypes.pop()
if is_categorical_dtype(dtype):
sort_categories = isinstance(dtype, str)

Contributor: str -> string_types

result[name] = union_categoricals(arrs,
sort_categories=sort_categories)
else:
result[name] = np.concatenate(arrs)

Expand Down
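
The _concatenate_chunks change only sorts categories when the user asked for the plain string dtype 'category'; with an explicit CategoricalDtype the chunks already share the user's categories, and sorting would discard the requested order. A standalone illustration with made-up data, using the same union_categoricals helper the parser imports:

    import pandas as pd
    from pandas.core.dtypes.concat import union_categoricals

    # two chunks that share an explicit, unsorted category order
    chunk1 = pd.Categorical(['a', 'b'], categories=['c', 'b', 'a'])
    chunk2 = pd.Categorical(['b', 'c'], categories=['c', 'b', 'a'])

    # explicit CategoricalDtype: keep the user's order
    union_categoricals([chunk1, chunk2], sort_categories=False).categories
    # Index(['c', 'b', 'a'], dtype='object')

    # dtype='category' given as a string: sort, as before this PR
    union_categoricals([chunk1, chunk2], sort_categories=True).categories
    # Index(['a', 'b', 'c'], dtype='object')
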
24 changes: 20 additions & 4 deletions pandas/io/parsers.py
@@ -12,15 +12,17 @@

import numpy as np

from pandas import compat
from pandas import compat, to_numeric, to_timedelta
from pandas.compat import (range, lrange, PY3, StringIO, lzip,
zip, string_types, map, u)
from pandas.core.dtypes.common import (
is_integer, _ensure_object,
is_list_like, is_integer_dtype,
is_float, is_dtype_equal,
is_object_dtype, is_string_dtype,
is_scalar, is_categorical_dtype)
is_scalar, is_categorical_dtype,
is_datetime64_dtype, is_timedelta64_dtype)
from pandas.core.dtypes.dtypes import CategoricalDtype
from pandas.core.dtypes.missing import isna
from pandas.core.dtypes.cast import astype_nansafe
from pandas.core.index import (Index, MultiIndex, RangeIndex,
@@ -1605,9 +1607,23 @@ def _cast_types(self, values, cast_type, column):
# XXX this is for consistency with
# c-parser which parses all categories
# as strings
if not is_object_dtype(values):
known_cats = (isinstance(cast_type, CategoricalDtype) and
cast_type.categories is not None)

Contributor: none of this logic should live here either. Move to pandas.core.dtypes.cast.py (also ok with a new module pandas.core.dtypes.categorical.py if it's simpler).

Contributor Author: Refactored most of this to pandas.core.dtypes.cast

str_values = is_object_dtype(values)

if known_cats and str_values:
if cast_type.categories.is_numeric():
values = to_numeric(values, errors='ignore')
elif is_datetime64_dtype(cast_type.categories):
values = tools.to_datetime(values, errors='ignore')
elif is_timedelta64_dtype(cast_type.categories):
values = to_timedelta(values, errors='ignore')
values = Categorical(values, categories=cast_type.categories,
ordered=cast_type.ordered)
elif not is_object_dtype(values):
values = astype_nansafe(values, str)
values = Categorical(values)
else:

Contributor: any reason you are not handling this case as well? (I get that it conflates the purpose of _from_inferred_categories a bit), but in reality this is just like passing dtype=None.

I don't like to scatter casting/inference code around; very hard to figure out what's going on when it's not in one place.

values = Categorical(values)
else:
try:
values = astype_nansafe(values, cast_type, copy=True)
Expand Down
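
For the Python engine, the new branch in _cast_types boils down to the following simplified standalone sketch (the helper name is made up; per the review discussion this logic was later refactored into pandas.core.dtypes.cast):

    import pandas as pd
    from pandas.api.types import (CategoricalDtype, is_object_dtype,
                                  is_datetime64_dtype, is_timedelta64_dtype)

    def cast_to_categorical(values, cast_type):
        known_cats = (isinstance(cast_type, CategoricalDtype) and
                      cast_type.categories is not None)
        if known_cats and is_object_dtype(values):
            # coerce the parsed strings so they compare equal to the
            # user-supplied categories
            if cast_type.categories.is_numeric():
                values = pd.to_numeric(values, errors='ignore')
            elif is_datetime64_dtype(cast_type.categories):
                values = pd.to_datetime(values, errors='ignore')
            elif is_timedelta64_dtype(cast_type.categories):
                values = pd.to_timedelta(values, errors='ignore')
            return pd.Categorical(values, categories=cast_type.categories,
                                  ordered=cast_type.ordered)
        # no explicit categories: fall back to inferring them from the data
        return pd.Categorical(values)
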
83 changes: 83 additions & 0 deletions pandas/tests/io/parser/dtypes.py
@@ -149,6 +149,89 @@ def test_categorical_dtype_chunksize(self):
for actual, expected in zip(actuals, expecteds):
tm.assert_frame_equal(actual, expected)

@pytest.mark.parametrize('ordered', [False, True])
@pytest.mark.parametrize('categories', [
['a', 'b', 'c'],
['a', 'c', 'b'],
['a', 'b', 'c', 'd'],
['c', 'b', 'a'],
])
def test_categorical_categoricaldtype(self, categories, ordered):
data = """a,b
1,a
1,b
1,b
2,c"""
expected = pd.DataFrame({
"a": [1, 1, 1, 2],
"b": Categorical(['a', 'b', 'b', 'c'],
categories=categories,
ordered=ordered)
})
dtype = {"b": CategoricalDtype(categories=categories,
ordered=ordered)}
result = self.read_csv(StringIO(data), dtype=dtype)
tm.assert_frame_equal(result, expected)

def test_categorical_categoricaldtype_unsorted(self):
data = """a,b
1,a
1,b
1,b
2,c"""
dtype = CategoricalDtype(['c', 'b', 'a'])
expected = pd.DataFrame({
'a': [1, 1, 1, 2],
'b': Categorical(['a', 'b', 'b', 'c'], categories=['c', 'b', 'a'])
})
result = self.read_csv(StringIO(data), dtype={'b': dtype})
tm.assert_frame_equal(result, expected)

def test_categoricaldtype_coerces_numeric(self):
dtype = {'b': CategoricalDtype([1, 2, 3])}
data = "b\n1\n1\n2\n3"
expected = pd.DataFrame({'b': Categorical([1, 1, 2, 3])})
result = self.read_csv(StringIO(data), dtype=dtype)
tm.assert_frame_equal(result, expected)

def test_categoricaldtype_coerces_datetime(self):
dtype = {
'b': CategoricalDtype(pd.date_range('2017', '2019', freq='AS'))
}
data = "b\n2017-01-01\n2018-01-01\n2019-01-01"
expected = pd.DataFrame({'b': Categorical(dtype['b'].categories)})
result = self.read_csv(StringIO(data), dtype=dtype)
tm.assert_frame_equal(result, expected)

def test_categoricaldtype_coerces_timedelta(self):
dtype = {'b': CategoricalDtype(pd.to_timedelta(['1H', '2H', '3H']))}
data = "b\n1H\n2H\n3H"
expected = pd.DataFrame({'b': Categorical(dtype['b'].categories)})
result = self.read_csv(StringIO(data), dtype=dtype)
tm.assert_frame_equal(result, expected)

def test_categorical_categoricaldtype_chunksize(self):
# GH 10153
data = """a,b
1,a
1,b
1,b
2,c"""
cats = ['a', 'b', 'c']
expecteds = [pd.DataFrame({'a': [1, 1],
'b': Categorical(['a', 'b'],
categories=cats)}),
pd.DataFrame({'a': [1, 2],
'b': Categorical(['b', 'c'],
categories=cats)},
index=[2, 3])]
dtype = CategoricalDtype(cats)
actuals = self.read_csv(StringIO(data), dtype={'b': dtype},
chunksize=2)

for actual, expected in zip(actuals, expecteds):
tm.assert_frame_equal(actual, expected)

def test_empty_pass_dtype(self):
data = 'one,two'
result = self.read_csv(StringIO(data), dtype={'one': 'u1'})
Expand Down