-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Accept CategoricalDtype in read_csv #17643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Accept CategoricalDtype in read_csv #17643
Conversation
e8c4619
to
ccbaa04
Compare
pandas/_libs/parsers.pyx
Outdated
@@ -1267,6 +1267,8 @@ cdef class TextReader: | |||
return self._string_convert(i, start, end, na_filter, | |||
na_hashset) | |||
elif is_categorical_dtype(dtype): | |||
# TODO: I suspect that this could be optimized when dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't spent any time optimizing this. It could presumably be made faster when we know the categories ahead of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. There should be a fastpath for this (or at least implement a different method to extract them).
doc/source/io.rst
Outdated
|
||
dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True) | ||
pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes | ||
|
||
.. note:: | ||
|
||
The resulting categories will always be parsed as strings (object dtype). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How clever do we want to be here? If we have CategoricalDtype([1, 2, 3])
and see a CSV with 1,2,3
, should we interpret those as integers? I'm not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ideally we would - I actually had it working that way at one point the original PR, but the implementation was too complex / duplicative - so we decided not to. But I don't think it will be as bad in the categories known in advance case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'm thinking we limit it to the case where all the categories are the same type. I'll see how difficult it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes we would have to cast to the categories dtype.
doc/source/io.rst
Outdated
@@ -468,6 +469,18 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification | |||
|
|||
pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes | |||
|
|||
Specifying ``dtype='cateogry'`` will result in a ``Categorical`` that is | |||
unordered, and whose ``categories`` are the unique values observed in the data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no comma after "unordered"
pandas/_libs/parsers.pyx
Outdated
cat = cat.set_ordered(ordered=dtype.ordered) | ||
else: | ||
cat = cat.set_categories(dtype.categories, | ||
ordered=dtype.ordered) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you could refactor this a little and write it as such:
if isinstance(dtype, CategoricalDtype):
if dtype.categories is not None:
cat = cat.set_categories(dtype.categories)
cat = cat.set_ordered(ordered=dtype.ordered)
ccbaa04
to
4b588cd
Compare
This should be ready to go. My earlier implementation was buggy and only worked when the data were already sorted. Casting is now implemented by
One question I had is how to control options passed to that function. I've simply hardcoded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What (should) happens when there are values in the csv file column that are not specified in the categories?(error or coerce to NaN)? (I would also mention this in the docs)
doc/source/io.rst
Outdated
converted using the :func:`to_numeric` function, or as appropriate, another | ||
converter such as :func:`to_datetime`. | ||
|
||
When ``dtype`` is a ``CategoricalDtype`` with homogenous ``categoriess`` ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
categoriess -> categories
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -163,6 +163,8 @@ Other Enhancements | |||
- :func:`Categorical.rename_categories` now accepts a dict-like argument as `new_categories` and only updates the categories found in that dict. (:issue:`17336`) | |||
- :func:`read_excel` raises ``ImportError`` with a better message if ``xlrd`` is not installed. (:issue:`17613`) | |||
- :meth:`DataFrame.assign` will preserve the original order of ``**kwargs`` for Python 3.6+ users instead of sorting the column names | |||
- Pass a :class:`~pandas.api.types.CategoricalDtype` to :meth:`read_csv` to parse categorical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would clarify this should be passed to the dtype keyword?
Also, apart from the fact you can also have non-string categories, are there not more benefits (like being able to specify the categories yourself, specific order, ... performance?) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I'll merge this with the main section for CategoricalDtype. (no extra performance yet though)
Ah I forgot about this case. Yes, I think we will insert NaNs then. In my mind this should behave like a |
aa72ffe
to
6f175a7
Compare
Codecov Report
@@ Coverage Diff @@
## master #17643 +/- ##
=========================================
Coverage ? 91.24%
=========================================
Files ? 163
Lines ? 49819
Branches ? 0
=========================================
Hits ? 45456
Misses ? 4363
Partials ? 0
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17643 +/- ##
=========================================
Coverage ? 91.24%
=========================================
Files ? 163
Lines ? 49819
Branches ? 0
=========================================
Hits ? 45456
Misses ? 4363
Partials ? 0
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17643 +/- ##
==========================================
- Coverage 91.27% 91.23% -0.04%
==========================================
Files 163 163
Lines 49765 49848 +83
==========================================
+ Hits 45421 45480 +59
- Misses 4344 4368 +24
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor doc comments
doc/source/io.rst
Outdated
When using ``dtype=CategoricalDtype``, "unexpected" values outside of | ||
``dtype.categories`` are treated as missing values. | ||
|
||
dtype = CategoricalDtype(['a', 'b', 'd']) # No 'c' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing .. ipython:: python
directive here
doc/source/whatsnew/v0.21.0.txt
Outdated
The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a | ||
``Series`` with categorical type will now return an instance of ``CategoricalDtype``. | ||
For the most part, this is backwards compatible, though the string repr has changed. | ||
If you were previously using ``str(s.dtype == 'category')`` to detect categorical data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing closing parenthesis around s.dtype (actually the closing one is in the wrong place)
doc/source/whatsnew/v0.21.0.txt
Outdated
The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a | ||
``Series`` with categorical type will now return an instance of ``CategoricalDtype``. | ||
For the most part, this is backwards compatible, though the string repr has changed. | ||
If you were previously using ``str(s.dtype == 'category')`` to detect categorical data, | ||
switch to :func:`api.types.is_categorical_dtype`, which is compatible with the old and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add pandas in the api.types.is_categorical_dtype
doc/source/whatsnew/v0.21.0.txt
Outdated
|
||
.. ipython:: python | ||
|
||
from pandas.compat import StringIO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general we put this in the hidden code block at the top of the file, as people shouldn't use this from pandas, but just import it themselves
pandas/_libs/parsers.pyx
Outdated
if (isinstance(dtype, CategoricalDtype) and | ||
dtype.categories is not None): | ||
# recode for dtype.categories | ||
categories = dtype.categories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use _recode_for_categories
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed (will wait to push until I hear back about #17643 (comment))
pandas/_libs/parsers.pyx
Outdated
if dtype.categories.is_numeric(): | ||
# is ignore correct? | ||
cats = to_numeric(cats, errors='ignore') | ||
elif dtype.categories.is_all_dates: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may leave open corner cases where strings don't map 1->1 with categories? For example:
cats:
# DatetimeIndex(['2014-01-01'], dtype='datetime64[ns]', freq=None)
data:
# ['2014-01-01', '2014-01-01T00:00:00', '2014-01-01']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't follow. This passes:
dtype = {
'b': CategoricalDtype([pd.Timestamp("2014")])
}
# Two representations of the same value
data = "b\n2014-01-01\n2014-01-01T00:00:00"
expected = pd.DataFrame({'b': Categorical([pd.Timestamp('2014')] * 2)})
result = self.read_csv(StringIO(data), dtype=dtype)
tm.assert_frame_equal(result, expected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does result['b']
not have duplicated categories? Sorry, don't have it checked out locally, only guessing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem. It has multiple values, but the categories are unique.
In [10]: pd.read_csv(StringIO(data), dtype=dtype).b.dtype
Out[10]: CategoricalDtype(categories=['2014-01-01'], ordered=False)
The categories
passed to the Categorical
constructor later on comes directly from dtype.categories
, which is unique. The coercion is done on the values so it's OK if different string forms are coerced to the same value.
doc/source/io.rst
Outdated
@@ -468,12 +469,38 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification | |||
|
|||
pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes | |||
|
|||
Specifying ``dtype='cateogry'`` will result in an unordered ``Categorical`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionadded here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a sub-section for this?
@@ -129,8 +129,37 @@ e.g., when converting string data to a ``Categorical`` (:issue:`14711`, | |||
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True) | |||
s.astype(dtype) | |||
|
|||
One place that deserves special mention is in :meth:`read_csv`. Previously, with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a separate sub-section for this
pandas/_libs/parsers.pyx
Outdated
|
||
# Determine if we should convert inferred string | ||
# categories to a specialized type | ||
if (isinstance(dtype, CategoricalDtype) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather move this entire section to a free function (except for the actual constructor)
maybe
cats, dtype = infer_categorical_dtype(cats) # put in pandas.core.dtypes.cast.py
cats = Categorical(cats, codes, dtype=dtype)
NONE of this logic should be here
result[name] = union_categoricals(arrs, sort_categories=True) | ||
dtype = dtypes.pop() | ||
if is_categorical_dtype(dtype): | ||
sort_categories = isinstance(dtype, str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str -> string_types
pandas/io/parsers.py
Outdated
@@ -1605,9 +1607,23 @@ def _cast_types(self, values, cast_type, column): | |||
# XXX this is for consistency with | |||
# c-parser which parses all categories | |||
# as strings | |||
if not is_object_dtype(values): | |||
known_cats = (isinstance(cast_type, CategoricalDtype) and | |||
cast_type.categories is not None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
none of this logic should live here either. move to pandas.core.dtypes.cast.py (also ok with a new module pandas.core.dtypes.categorical.py if its simpler)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored most of this to pandas.core.dtypes.cast
pandas/_libs/parsers.pyx
Outdated
cats = cats.sort_values() | ||
indexer = cats.get_indexer(unsorted) | ||
codes = take_1d(indexer, codes, fill_value=-1) | ||
categories = cats.sort_values() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move ALL of this logic and simply create a new factory for Categorical.infer_from_categories(cats, codes, dtype=dtype)
(and even fold in the maybe_convert_for_categorical). This just makes parsing code longer and longer; we want to push down logic to the dtypes.
dtype = CategoricalDtype(cats, ordered=False) | ||
codes = inferred_codes | ||
|
||
return cls(codes, dtype=dtype, fastpath=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
much nicer
pandas/core/categorical.py
Outdated
------- | ||
Categorical | ||
""" | ||
from pandas.core.dtypes.cast import maybe_convert_for_categorical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah not sure we need maybe_covnert_for_categorical now, maybe move it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it as it's own method since the python parser still needed it too. That one is different enough since it doesn't have codes, just values.
pandas/core/dtypes/cast.py
Outdated
>>> maybe_convert_for_categorical([1, 'a'], CategoricalDtype([1, 2])) | ||
array([ 1., nan]) | ||
""" | ||
if isinstance(dtype, CategoricalDtype) and dtype.categories is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in reaility this just an Index routine maybe
Index(categories).coerce_to_dtype(dtype.categories)
and if the ifisinstance(dtype, ....) logic can be in Categorical_from_inferred....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comment below, you can simply fold this in to Categorical._from_infererd_categories
/ not averse to making the inside of this an Index routine though (as its just coercing on the index type).
Hmm, seems like the compiler error is back on circle CI. Looking into it. |
pandas/io/parsers.py
Outdated
known_cats = (isinstance(cast_type, CategoricalDtype) and | ||
cast_type.categories is not None) | ||
|
||
categories = ordered = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not using Categorical._inferred_from_categories
, this code duplication is just making technical debt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how much cleaner 3de75cd is. This really don't share much code, since the python parser has values, while the C parser has categories and codes. And the python parser has to maybe cast values to strings with cast_type='category'
.
pandas/core/dtypes/cast.py
Outdated
>>> maybe_convert_for_categorical([1, 'a'], CategoricalDtype([1, 2])) | ||
array([ 1., nan]) | ||
""" | ||
if isinstance(dtype, CategoricalDtype) and dtype.categories is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comment below, you can simply fold this in to Categorical._from_infererd_categories
/ not averse to making the inside of this an Index routine though (as its just coercing on the index type).
pandas/io/parsers.py
Outdated
values = Categorical._from_inferred_categories( | ||
cats, cats.get_indexer(values), cast_type | ||
) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason you are not handling this case as well? (I get that it conflates the purpose of _from_inferred_categories
a bit), but in reality this is just like passing dtype=None
.
I don't like to scatter casting/inferrence code around, very hard to figure out what's going on when when its not in 1 place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment lgtm otherwise
pandas/core/categorical.py
Outdated
Parameters | ||
---------- | ||
|
||
inferred_categories, inferred_codes : Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separate lines for params
pandas/core/categorical.py
Outdated
cats = Index(inferred_categories) | ||
|
||
# Convert to a specialzed type with `dtype` is specified | ||
if (isinstance(dtype, CategoricalDtype) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dtype by definition is already a CDT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could also be the string 'category'
. I've clarified the docstring.
All green. Merging. I opened up #17743 for optimizing |
Thanks! |
thanks @TomAugspurger this is great! |
* ENH: Accept CategoricalDtype in CSV reader * rework * Fixed basic implementation * Added casting * Doc and cleanup * Fixed assignment of categoricals * Doc and test unexpected values * DOC: fixups * More coercion, use _recode_for_categories * Refactor with maybe_convert_for_categorical * PEP8 * Type for 32bit * REF: refactor to new method * py2 compat * Refactored * More in Categorical * fixup! More in Categorical
* ENH: Accept CategoricalDtype in CSV reader * rework * Fixed basic implementation * Added casting * Doc and cleanup * Fixed assignment of categoricals * Doc and test unexpected values * DOC: fixups * More coercion, use _recode_for_categories * Refactor with maybe_convert_for_categorical * PEP8 * Type for 32bit * REF: refactor to new method * py2 compat * Refactored * More in Categorical * fixup! More in Categorical
* ENH: Accept CategoricalDtype in CSV reader * rework * Fixed basic implementation * Added casting * Doc and cleanup * Fixed assignment of categoricals * Doc and test unexpected values * DOC: fixups * More coercion, use _recode_for_categories * Refactor with maybe_convert_for_categorical * PEP8 * Type for 32bit * REF: refactor to new method * py2 compat * Refactored * More in Categorical * fixup! More in Categorical
This is for after #16015
cc @chris-b1