Skip to content

Accept CategoricalDtype in read_csv #17643

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Oct 2, 2017

Conversation

TomAugspurger
Copy link
Contributor

import pandas as pd
from io import StringIO
from pandas.api.types import CategoricalDtype

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'

dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)
pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes

This is for after #16015

cc @chris-b1

@TomAugspurger
Copy link
Contributor Author

I squashed everything from #16015 down to a single commit, so the changes here are just ccbaa04

@@ -1267,6 +1267,8 @@ cdef class TextReader:
return self._string_convert(i, start, end, na_filter,
na_hashset)
elif is_categorical_dtype(dtype):
# TODO: I suspect that this could be optimized when dtype
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't spent any time optimizing this. It could presumably be made faster when we know the categories ahead of time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. There should be a fastpath for this (or at least implement a different method to extract them).


dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)
pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes

.. note::

The resulting categories will always be parsed as strings (object dtype).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How clever do we want to be here? If we have CategoricalDtype([1, 2, 3]) and see a CSV with 1,2,3, should we interpret those as integers? I'm not sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ideally we would - I actually had it working that way at one point the original PR, but the implementation was too complex / duplicative - so we decided not to. But I don't think it will be as bad in the categories known in advance case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'm thinking we limit it to the case where all the categories are the same type. I'll see how difficult it is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we would have to cast to the categories dtype.

@jreback jreback added Categorical Categorical Data Type IO CSV read_csv, to_csv labels Sep 23, 2017
@@ -468,6 +469,18 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification

pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes

Specifying ``dtype='cateogry'`` will result in a ``Categorical`` that is
unordered, and whose ``categories`` are the unique values observed in the data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no comma after "unordered"

cat = cat.set_ordered(ordered=dtype.ordered)
else:
cat = cat.set_categories(dtype.categories,
ordered=dtype.ordered)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you could refactor this a little and write it as such:

if isinstance(dtype, CategoricalDtype):
   if dtype.categories is not None:
      cat = cat.set_categories(dtype.categories)
   cat = cat.set_ordered(ordered=dtype.ordered)

@TomAugspurger
Copy link
Contributor Author

This should be ready to go. My earlier implementation was buggy and only worked when the data were already sorted.

Casting is now implemented by

  1. checking if dtype.categories is {numeric,datetime,timedelta} type
  2. calling the appropriate to_* function to cast the values / inferred categories

One question I had is how to control options passed to that function. I've simply hardcoded errors='ignore'. I'm leery about trying to be clever here.

@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Sep 25, 2017
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What (should) happens when there are values in the csv file column that are not specified in the categories?(error or coerce to NaN)? (I would also mention this in the docs)

converted using the :func:`to_numeric` function, or as appropriate, another
converter such as :func:`to_datetime`.

When ``dtype`` is a ``CategoricalDtype`` with homogenous ``categoriess`` (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

categoriess -> categories

@@ -163,6 +163,8 @@ Other Enhancements
- :func:`Categorical.rename_categories` now accepts a dict-like argument as `new_categories` and only updates the categories found in that dict. (:issue:`17336`)
- :func:`read_excel` raises ``ImportError`` with a better message if ``xlrd`` is not installed. (:issue:`17613`)
- :meth:`DataFrame.assign` will preserve the original order of ``**kwargs`` for Python 3.6+ users instead of sorting the column names
- Pass a :class:`~pandas.api.types.CategoricalDtype` to :meth:`read_csv` to parse categorical
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would clarify this should be passed to the dtype keyword?

Also, apart from the fact you can also have non-string categories, are there not more benefits (like being able to specify the categories yourself, specific order, ... performance?) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I'll merge this with the main section for CategoricalDtype. (no extra performance yet though)

@TomAugspurger
Copy link
Contributor Author

What (should) happens when there are values in the csv file column that are not specified in the categories?

Ah I forgot about this case. Yes, I think we will insert NaNs then. In my mind this should behave like a .set_categories(dtype.categories) after the fact. I'll add tests and docs for this tomorrow.

@codecov
Copy link

codecov bot commented Sep 26, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@7e87385). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #17643   +/-   ##
=========================================
  Coverage          ?   91.24%           
=========================================
  Files             ?      163           
  Lines             ?    49819           
  Branches          ?        0           
=========================================
  Hits              ?    45456           
  Misses            ?     4363           
  Partials          ?        0
Flag Coverage Δ
#multiple 89.04% <100%> (?)
#single 40.31% <14.28%> (?)
Impacted Files Coverage Δ
pandas/io/parsers.py 95.51% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e87385...6f175a7. Read the comment docs.

@codecov
Copy link

codecov bot commented Sep 26, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@7e87385). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #17643   +/-   ##
=========================================
  Coverage          ?   91.24%           
=========================================
  Files             ?      163           
  Lines             ?    49819           
  Branches          ?        0           
=========================================
  Hits              ?    45456           
  Misses            ?     4363           
  Partials          ?        0
Flag Coverage Δ
#multiple 89.04% <100%> (?)
#single 40.31% <14.28%> (?)
Impacted Files Coverage Δ
pandas/io/parsers.py 95.51% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e87385...6f175a7. Read the comment docs.

@codecov
Copy link

codecov bot commented Sep 26, 2017

Codecov Report

Merging #17643 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17643      +/-   ##
==========================================
- Coverage   91.27%   91.23%   -0.04%     
==========================================
  Files         163      163              
  Lines       49765    49848      +83     
==========================================
+ Hits        45421    45480      +59     
- Misses       4344     4368      +24
Flag Coverage Δ
#multiple 89.03% <100%> (-0.02%) ⬇️
#single 40.32% <7.4%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/categorical.py 95.73% <100%> (+0.02%) ⬆️
pandas/io/parsers.py 95.49% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/tools/datetimes.py 82.97% <0%> (-0.83%) ⬇️
pandas/core/common.py 91.42% <0%> (-0.56%) ⬇️
pandas/core/indexes/multi.py 96.39% <0%> (-0.51%) ⬇️
pandas/core/config.py 87.7% <0%> (-0.39%) ⬇️
pandas/core/indexes/category.py 97.46% <0%> (-0.29%) ⬇️
pandas/core/groupby.py 92.04% <0%> (-0.2%) ⬇️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db1206a...9325a93. Read the comment docs.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor doc comments

When using ``dtype=CategoricalDtype``, "unexpected" values outside of
``dtype.categories`` are treated as missing values.

dtype = CategoricalDtype(['a', 'b', 'd']) # No 'c'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing .. ipython:: python directive here

The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
For the most part, this is backwards compatible, though the string repr has changed.
If you were previously using ``str(s.dtype == 'category')`` to detect categorical data,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing closing parenthesis around s.dtype (actually the closing one is in the wrong place)

The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
For the most part, this is backwards compatible, though the string repr has changed.
If you were previously using ``str(s.dtype == 'category')`` to detect categorical data,
switch to :func:`api.types.is_categorical_dtype`, which is compatible with the old and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add pandas in the api.types.is_categorical_dtype


.. ipython:: python

from pandas.compat import StringIO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general we put this in the hidden code block at the top of the file, as people shouldn't use this from pandas, but just import it themselves

if (isinstance(dtype, CategoricalDtype) and
dtype.categories is not None):
# recode for dtype.categories
categories = dtype.categories
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed (will wait to push until I hear back about #17643 (comment))

if dtype.categories.is_numeric():
# is ignore correct?
cats = to_numeric(cats, errors='ignore')
elif dtype.categories.is_all_dates:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may leave open corner cases where strings don't map 1->1 with categories? For example:

cats:
# DatetimeIndex(['2014-01-01'], dtype='datetime64[ns]', freq=None)

data:
# ['2014-01-01', '2014-01-01T00:00:00', '2014-01-01']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't follow. This passes:

        dtype = {
            'b': CategoricalDtype([pd.Timestamp("2014")])
        }
        # Two representations of the same value
        data = "b\n2014-01-01\n2014-01-01T00:00:00"
        expected = pd.DataFrame({'b': Categorical([pd.Timestamp('2014')] * 2)})
        result = self.read_csv(StringIO(data), dtype=dtype)
        tm.assert_frame_equal(result, expected)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does result['b'] not have duplicated categories? Sorry, don't have it checked out locally, only guessing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. It has multiple values, but the categories are unique.

In [10]: pd.read_csv(StringIO(data), dtype=dtype).b.dtype
Out[10]: CategoricalDtype(categories=['2014-01-01'], ordered=False)

The categories passed to the Categorical constructor later on comes directly from dtype.categories, which is unique. The coercion is done on the values so it's OK if different string forms are coerced to the same value.

@@ -468,12 +469,38 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification

pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes

Specifying ``dtype='cateogry'`` will result in an unordered ``Categorical``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionadded here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a sub-section for this?

@@ -129,8 +129,37 @@ e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
s.astype(dtype)

One place that deserves special mention is in :meth:`read_csv`. Previously, with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a separate sub-section for this


# Determine if we should convert inferred string
# categories to a specialized type
if (isinstance(dtype, CategoricalDtype) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather move this entire section to a free function (except for the actual constructor)

maybe

cats, dtype = infer_categorical_dtype(cats) # put in pandas.core.dtypes.cast.py
cats = Categorical(cats, codes, dtype=dtype)

NONE of this logic should be here

result[name] = union_categoricals(arrs, sort_categories=True)
dtype = dtypes.pop()
if is_categorical_dtype(dtype):
sort_categories = isinstance(dtype, str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str -> string_types

@@ -1605,9 +1607,23 @@ def _cast_types(self, values, cast_type, column):
# XXX this is for consistency with
# c-parser which parses all categories
# as strings
if not is_object_dtype(values):
known_cats = (isinstance(cast_type, CategoricalDtype) and
cast_type.categories is not None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none of this logic should live here either. move to pandas.core.dtypes.cast.py (also ok with a new module pandas.core.dtypes.categorical.py if its simpler)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored most of this to pandas.core.dtypes.cast

cats = cats.sort_values()
indexer = cats.get_indexer(unsorted)
codes = take_1d(indexer, codes, fill_value=-1)
categories = cats.sort_values()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move ALL of this logic and simply create a new factory for Categorical.infer_from_categories(cats, codes, dtype=dtype) (and even fold in the maybe_convert_for_categorical). This just makes parsing code longer and longer; we want to push down logic to the dtypes.

dtype = CategoricalDtype(cats, ordered=False)
codes = inferred_codes

return cls(codes, dtype=dtype, fastpath=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much nicer

-------
Categorical
"""
from pandas.core.dtypes.cast import maybe_convert_for_categorical
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah not sure we need maybe_covnert_for_categorical now, maybe move it here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it as it's own method since the python parser still needed it too. That one is different enough since it doesn't have codes, just values.

>>> maybe_convert_for_categorical([1, 'a'], CategoricalDtype([1, 2]))
array([ 1., nan])
"""
if isinstance(dtype, CategoricalDtype) and dtype.categories is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in reaility this just an Index routine maybe

Index(categories).coerce_to_dtype(dtype.categories)

and if the ifisinstance(dtype, ....) logic can be in Categorical_from_inferred....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment below, you can simply fold this in to Categorical._from_infererd_categories / not averse to making the inside of this an Index routine though (as its just coercing on the index type).

@TomAugspurger
Copy link
Contributor Author

Hmm, seems like the compiler error is back on circle CI. Looking into it.

known_cats = (isinstance(cast_type, CategoricalDtype) and
cast_type.categories is not None)

categories = ordered = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not using Categorical._inferred_from_categories, this code duplication is just making technical debt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much cleaner 3de75cd is. This really don't share much code, since the python parser has values, while the C parser has categories and codes. And the python parser has to maybe cast values to strings with cast_type='category'.

>>> maybe_convert_for_categorical([1, 'a'], CategoricalDtype([1, 2]))
array([ 1., nan])
"""
if isinstance(dtype, CategoricalDtype) and dtype.categories is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment below, you can simply fold this in to Categorical._from_infererd_categories / not averse to making the inside of this an Index routine though (as its just coercing on the index type).

values = Categorical._from_inferred_categories(
cats, cats.get_indexer(values), cast_type
)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason you are not handling this case as well? (I get that it conflates the purpose of _from_inferred_categories a bit), but in reality this is just like passing dtype=None.

I don't like to scatter casting/inferrence code around, very hard to figure out what's going on when when its not in 1 place.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment lgtm otherwise

Parameters
----------

inferred_categories, inferred_codes : Index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate lines for params

cats = Index(inferred_categories)

# Convert to a specialzed type with `dtype` is specified
if (isinstance(dtype, CategoricalDtype) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dtype by definition is already a CDT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could also be the string 'category'. I've clarified the docstring.

@TomAugspurger
Copy link
Contributor Author

All green. Merging.

I opened up #17743 for optimizing _categorical_convert in the C parser. I won't have time to get to it for the release though.

@TomAugspurger TomAugspurger merged commit def3bce into pandas-dev:master Oct 2, 2017
@jorisvandenbossche
Copy link
Member

Thanks!

@TomAugspurger TomAugspurger deleted the categorical-csv-2 branch October 2, 2017 14:12
@jreback
Copy link
Contributor

jreback commented Oct 2, 2017

thanks @TomAugspurger this is great!

ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017
* ENH: Accept CategoricalDtype in CSV reader

* rework

* Fixed basic implementation

* Added casting

* Doc and cleanup

* Fixed assignment of categoricals

* Doc and test unexpected values

* DOC: fixups

* More coercion, use _recode_for_categories

* Refactor with maybe_convert_for_categorical

* PEP8

* Type for 32bit

* REF: refactor to new method

* py2 compat

* Refactored

* More in Categorical

* fixup! More in Categorical
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
* ENH: Accept CategoricalDtype in CSV reader

* rework

* Fixed basic implementation

* Added casting

* Doc and cleanup

* Fixed assignment of categoricals

* Doc and test unexpected values

* DOC: fixups

* More coercion, use _recode_for_categories

* Refactor with maybe_convert_for_categorical

* PEP8

* Type for 32bit

* REF: refactor to new method

* py2 compat

* Refactored

* More in Categorical

* fixup! More in Categorical
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
* ENH: Accept CategoricalDtype in CSV reader

* rework

* Fixed basic implementation

* Added casting

* Doc and cleanup

* Fixed assignment of categoricals

* Doc and test unexpected values

* DOC: fixups

* More coercion, use _recode_for_categories

* Refactor with maybe_convert_for_categorical

* PEP8

* Type for 32bit

* REF: refactor to new method

* py2 compat

* Refactored

* More in Categorical

* fixup! More in Categorical
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants