Accept CategoricalDtype in read_csv #17643

TomAugspurger · 2017-09-23T11:27:36Z

import pandas as pd
from io import StringIO
from pandas.api.types import CategoricalDtype

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'

dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)
pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes

This is for after #16015

cc @chris-b1

TomAugspurger · 2017-09-23T11:29:52Z

I squashed everything from #16015 down to a single commit, so the changes here are just ccbaa04

TomAugspurger · 2017-09-23T11:30:56Z

pandas/_libs/parsers.pyx

@@ -1267,6 +1267,8 @@ cdef class TextReader:
            return self._string_convert(i, start, end, na_filter,
                                        na_hashset)
        elif is_categorical_dtype(dtype):
+            # TODO: I suspect that this could be optimized when dtype


I haven't spent any time optimizing this. It could presumably be made faster when we know the categories ahead of time.

I agree. There should be a fastpath for this (or at least implement a different method to extract them).

TomAugspurger · 2017-09-23T11:31:57Z

doc/source/io.rst

+
+   dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)
+   pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes
+
 .. note::

   The resulting categories will always be parsed as strings (object dtype).


How clever do we want to be here? If we have CategoricalDtype([1, 2, 3]) and see a CSV with 1,2,3, should we interpret those as integers? I'm not sure.

I think ideally we would - I actually had it working that way at one point in the original PR, but the implementation was too complex / duplicative - so we decided not to. But I don't think it will be as bad in the categories-known-in-advance case.

OK. I'm thinking we limit it to the case where all the categories are the same type. I'll see how difficult it is.

yes we would have to cast to the categories dtype.
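For reference, a minimal sketch of the behaviour being discussed, assuming the casting to the categories' dtype is in place (as it is in the version of this PR that was eventually merged). The column name and data are made up for illustration.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

# Illustrative data: a single column whose raw CSV text is "1", "2", "3", "2".
data = "col1\n1\n2\n3\n2"

# All categories share one type (integers), the case discussed above.
dtype = CategoricalDtype([1, 2, 3])

result = pd.read_csv(StringIO(data), dtype={"col1": dtype})

# The parsed strings are cast to match the integer categories,
# so every value is found in dtype.categories.
print(result["col1"].cat.categories)
print(result["col1"].isna().any())
```

Limiting this to homogeneous categories keeps the rule simple: one target type, one conversion function.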

gfyoung · 2017-09-24T19:19:22Z

doc/source/io.rst

@@ -468,6 +469,18 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification

   pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes

+Specifying ``dtype='cateogry'`` will result in a ``Categorical`` that is
+unordered, and whose ``categories`` are the unique values observed in the data.


nit: no comma after "unordered"

gfyoung · 2017-09-24T19:25:09Z

pandas/_libs/parsers.pyx

+                        cat = cat.set_ordered(ordered=dtype.ordered)
+                else:
+                    cat = cat.set_categories(dtype.categories,
+                                             ordered=dtype.ordered)


I wonder if you could refactor this a little and write it as such:

if isinstance(dtype, CategoricalDtype):
    if dtype.categories is not None:
        cat = cat.set_categories(dtype.categories)
    cat = cat.set_ordered(ordered=dtype.ordered)

TomAugspurger · 2017-09-25T18:38:35Z

This should be ready to go. My earlier implementation was buggy and only worked when the data were already sorted.

Casting is now implemented by:

- checking whether dtype.categories is a {numeric,datetime,timedelta} type
- calling the appropriate to_* function to cast the values / inferred categories

One question I had is how to control options passed to that function. I've simply hardcoded errors='ignore'. I'm leery about trying to be clever here.
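The two steps can be sketched roughly as follows. This is not the code in the PR: `coerce_like_categories` is a hypothetical helper, and it uses `errors="coerce"` (unparseable values become missing) rather than the hardcoded `errors='ignore'` mentioned above, purely as an assumption for the example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def coerce_like_categories(values, dtype):
    """Hypothetical helper: cast raw strings to the type of dtype.categories."""
    kind = dtype.categories.dtype.kind
    if kind in "iuf":  # integer, unsigned or float categories
        return pd.to_numeric(values, errors="coerce")
    if kind == "M":  # datetime categories
        return pd.to_datetime(values, errors="coerce")
    if kind == "m":  # timedelta categories
        return pd.to_timedelta(values, errors="coerce")
    return values  # e.g. string categories: nothing to cast


# Numeric categories: "x" cannot be parsed and becomes NaN.
numeric = coerce_like_categories(["1", "2", "x"], CategoricalDtype([1, 2, 3]))

# Timedelta categories: "oops" cannot be parsed and becomes NaT.
td_dtype = CategoricalDtype(pd.to_timedelta(["1 days"]))
deltas = coerce_like_categories(["1 days", "oops"], td_dtype)
```

Which `errors=` mode is right is exactly the open question here; the sketch only shows where that choice would sit.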

jorisvandenbossche

What happens (or should happen) when there are values in the csv file column that are not specified in the categories: error, or coerce to NaN? I would also mention this in the docs.

jorisvandenbossche · 2017-09-25T22:01:20Z

doc/source/io.rst

+   converted using the :func:`to_numeric` function, or as appropriate, another
+   converter such as :func:`to_datetime`.
+
+   When ``dtype`` is a ``CategoricalDtype`` with homogenous ``categoriess`` (


categoriess -> categories

jorisvandenbossche · 2017-09-25T22:03:14Z

doc/source/whatsnew/v0.21.0.txt

@@ -163,6 +163,8 @@ Other Enhancements
 - :func:`Categorical.rename_categories` now accepts a dict-like argument as `new_categories` and only updates the categories found in that dict. (:issue:`17336`)
 - :func:`read_excel` raises ``ImportError`` with a better message if ``xlrd`` is not installed. (:issue:`17613`)
 - :meth:`DataFrame.assign` will preserve the original order of ``**kwargs`` for Python 3.6+ users instead of sorting the column names
+- Pass a :class:`~pandas.api.types.CategoricalDtype` to :meth:`read_csv` to parse categorical


I would clarify this should be passed to the dtype keyword?

Also, apart from the fact that you can also have non-string categories, are there not more benefits (like being able to specify the categories yourself, a specific order, ... performance)?

Perhaps I'll merge this with the main section for CategoricalDtype. (no extra performance yet though)

TomAugspurger · 2017-09-25T22:12:56Z

What (should) happens when there are values in the csv file column that are not specified in the categories?

Ah I forgot about this case. Yes, I think we will insert NaNs then. In my mind this should behave like a .set_categories(dtype.categories) after the fact. I'll add tests and docs for this tomorrow.
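A small sketch of the intended equivalence, using the data from the PR description and a dtype that deliberately omits 'c'. The variable names are illustrative only.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
dtype = CategoricalDtype(["a", "b", "d"])  # no 'c'

# Categories supplied up front: 'c' is not a category, so it becomes NaN.
via_reader = pd.read_csv(StringIO(data), dtype={"col1": dtype})["col1"]

# Categories inferred from the data, then restricted after the fact.
inferred = pd.read_csv(StringIO(data), dtype={"col1": "category"})["col1"]
after_fact = inferred.cat.set_categories(dtype.categories)

print(via_reader.isna().tolist())
print(after_fact.isna().tolist())
```

In both cases the row holding 'c' ends up with code -1, i.e. a missing value.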

codecov · 2017-09-26T13:52:16Z

Codecov Report

❗ No coverage uploaded for pull request base (master@7e87385). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #17643   +/-   ##
=========================================
  Coverage          ?   91.24%           
=========================================
  Files             ?      163           
  Lines             ?    49819           
  Branches          ?        0           
=========================================
  Hits              ?    45456           
  Misses            ?     4363           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`89.04% <100%> (?)`
#single	`40.31% <14.28%> (?)`

Impacted Files	Coverage Δ
pandas/io/parsers.py	`95.51% <100%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e87385...6f175a7. Read the comment docs.

codecov · 2017-09-26T13:52:40Z

Codecov Report

Merging #17643 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17643      +/-   ##
==========================================
- Coverage   91.27%   91.23%   -0.04%     
==========================================
  Files         163      163              
  Lines       49765    49848      +83     
==========================================
+ Hits        45421    45480      +59     
- Misses       4344     4368      +24

Flag	Coverage Δ
#multiple	`89.03% <100%> (-0.02%)`	⬇️
#single	`40.32% <7.4%> (-0.09%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/categorical.py	`95.73% <100%> (+0.02%)`	⬆️
pandas/io/parsers.py	`95.49% <100%> (ø)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/tools/datetimes.py	`82.97% <0%> (-0.83%)`	⬇️
pandas/core/common.py	`91.42% <0%> (-0.56%)`	⬇️
pandas/core/indexes/multi.py	`96.39% <0%> (-0.51%)`	⬇️
pandas/core/config.py	`87.7% <0%> (-0.39%)`	⬇️
pandas/core/indexes/category.py	`97.46% <0%> (-0.29%)`	⬇️
pandas/core/groupby.py	`92.04% <0%> (-0.2%)`	⬇️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db1206a...9325a93. Read the comment docs.

jorisvandenbossche

some minor doc comments

jorisvandenbossche · 2017-09-26T14:21:47Z

doc/source/io.rst

+When using ``dtype=CategoricalDtype``, "unexpected" values outside of
+``dtype.categories`` are treated as missing values.
+
+   dtype = CategoricalDtype(['a', 'b', 'd'])  # No 'c'


missing .. ipython:: python directive here

jorisvandenbossche · 2017-09-26T14:24:32Z

doc/source/whatsnew/v0.21.0.txt

 The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
 ``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
+For the most part, this is backwards compatible, though the string repr has changed.
+If you were previously using ``str(s.dtype == 'category')`` to detect categorical data,


missing closing parenthesis around s.dtype (actually the closing one is in the wrong place)

jorisvandenbossche · 2017-09-26T14:25:04Z

doc/source/whatsnew/v0.21.0.txt

 The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
 ``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
+For the most part, this is backwards compatible, though the string repr has changed.
+If you were previously using ``str(s.dtype == 'category')`` to detect categorical data,
+switch to :func:`api.types.is_categorical_dtype`, which is compatible with the old and


I would add pandas in the api.types.is_categorical_dtype

jorisvandenbossche · 2017-09-26T14:27:10Z

doc/source/whatsnew/v0.21.0.txt

+
+.. ipython:: python
+
+   from pandas.compat import StringIO


in general we put this in the hidden code block at the top of the file, as people shouldn't use this from pandas, but just import it themselves

chris-b1 · 2017-09-26T18:43:07Z

pandas/_libs/parsers.pyx

+            if (isinstance(dtype, CategoricalDtype) and
+                    dtype.categories is not None):
+                # recode for dtype.categories
+                categories = dtype.categories


use _recode_for_categories here?

Fixed (will wait to push until I hear back about #17643 (comment))

chris-b1 · 2017-09-26T18:46:27Z

pandas/_libs/parsers.pyx

+                if dtype.categories.is_numeric():
+                    # is ignore correct?
+                    cats = to_numeric(cats, errors='ignore')
+                elif dtype.categories.is_all_dates:


I think this may leave open corner cases where strings don't map 1->1 with categories? For example:

cats:
# DatetimeIndex(['2014-01-01'], dtype='datetime64[ns]', freq=None)
data:
# ['2014-01-01', '2014-01-01T00:00:00', '2014-01-01']

Sorry, I don't follow. This passes:

dtype = {
    'b': CategoricalDtype([pd.Timestamp("2014")])
}
# Two representations of the same value
data = "b\n2014-01-01\n2014-01-01T00:00:00"
expected = pd.DataFrame({'b': Categorical([pd.Timestamp('2014')] * 2)})
result = self.read_csv(StringIO(data), dtype=dtype)
tm.assert_frame_equal(result, expected)

Does result['b'] not have duplicated categories? Sorry, don't have it checked out locally, only guessing.

No problem. It has multiple values, but the categories are unique.

In [10]: pd.read_csv(StringIO(data), dtype=dtype).b.dtype
Out[10]: CategoricalDtype(categories=['2014-01-01'], ordered=False)

The categories passed to the Categorical constructor later on come directly from dtype.categories, which is unique. The coercion is done on the values, so it's OK if different string forms are coerced to the same value.
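To make the point concrete, here is a sketch of why duplicated string forms do not produce duplicated categories. It does not call the parser; parsing each string with `pd.Timestamp` is an assumption standing in for the parser's coercion step.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

dtype = CategoricalDtype([pd.Timestamp("2014-01-01")])

# Three strings, two distinct spellings, one underlying timestamp.
raw = ["2014-01-01", "2014-01-01T00:00:00", "2014-01-01"]

# Coerce the *values*; the categories are never rebuilt from the data.
values = pd.DatetimeIndex([pd.Timestamp(v) for v in raw])

# Look each coerced value up in the fixed, unique dtype.categories.
codes = dtype.categories.get_indexer(values)
cat = pd.Categorical.from_codes(codes, dtype=dtype)

print(codes)
print(cat.categories)
```

Every value maps to position 0 of the single category, so the result has three values and one category.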

jreback · 2017-09-26T18:42:02Z

doc/source/io.rst

@@ -468,12 +469,38 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification

   pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes

+Specifying ``dtype='cateogry'`` will result in an unordered ``Categorical``


versionadded here

maybe a sub-section for this?

jreback · 2017-09-26T18:43:12Z

doc/source/whatsnew/v0.21.0.txt

@@ -129,8 +129,37 @@ e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
   dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
   s.astype(dtype)

+One place that deserves special mention is in :meth:`read_csv`. Previously, with


maybe a separate sub-section for this

jreback · 2017-09-26T18:49:22Z

pandas/_libs/parsers.pyx

+
+            # Determine if we should convert inferred string
+            # categories to a specialized type
+            if (isinstance(dtype, CategoricalDtype) and


I would rather move this entire section to a free function (except for the actual constructor)

maybe

cats, dtype = infer_categorical_dtype(cats)  # put in pandas.core.dtypes.cast.py
cats = Categorical(cats, codes, dtype=dtype)

NONE of this logic should be here

jreback · 2017-09-26T18:51:25Z

pandas/_libs/parsers.pyx

-            result[name] = union_categoricals(arrs, sort_categories=True)
+        dtype = dtypes.pop()
+        if is_categorical_dtype(dtype):
+            sort_categories = isinstance(dtype, str)


str -> string_types

jreback · 2017-09-26T18:52:07Z

pandas/io/parsers.py

@@ -1605,9 +1607,23 @@ def _cast_types(self, values, cast_type, column):
            # XXX this is for consistency with
            # c-parser which parses all categories
            # as strings
-            if not is_object_dtype(values):
+            known_cats = (isinstance(cast_type, CategoricalDtype) and
+                          cast_type.categories is not None)


None of this logic should live here either. Move it to pandas.core.dtypes.cast.py (also OK with a new module pandas.core.dtypes.categorical.py if it's simpler).

Refactored most of this to pandas.core.dtypes.cast

jreback · 2017-09-28T11:49:19Z

pandas/_libs/parsers.pyx

-                cats = cats.sort_values()
-                indexer = cats.get_indexer(unsorted)
-                codes = take_1d(indexer, codes, fill_value=-1)
+                categories = cats.sort_values()


I would move ALL of this logic and simply create a new factory for Categorical.infer_from_categories(cats, codes, dtype=dtype) (and even fold in the maybe_convert_for_categorical). This just makes parsing code longer and longer; we want to push down logic to the dtypes.
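As a rough illustration of what such a factory has to do when the categories are known in advance, the sketch below recodes parser-inferred codes onto `dtype.categories` using only public pandas API. `recode_inferred` is a hypothetical name, not the method added in this PR, and the sketch ignores the casting step and the case where `dtype.categories` is None.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype


def recode_inferred(inferred_categories, inferred_codes, dtype):
    """Hypothetical helper: map parser-inferred codes onto dtype.categories."""
    inferred_codes = np.asarray(inferred_codes, dtype="int64")
    # Position of each inferred category inside the requested categories;
    # -1 marks an inferred category that the dtype does not list.
    indexer = dtype.categories.get_indexer(pd.Index(inferred_categories))
    # Translate each row's code; rows that were already missing stay missing.
    new_codes = np.where(inferred_codes == -1, -1, indexer[inferred_codes])
    return pd.Categorical.from_codes(new_codes, dtype=dtype)


dtype = CategoricalDtype(["d", "c", "b", "a"], ordered=True)

# The parser saw the values a, a, c and inferred categories ['a', 'c'].
cat = recode_inferred(["a", "c"], [0, 0, 1], dtype)

# 'z' is not in dtype.categories, so its rows become missing.
with_unknown = recode_inferred(["a", "z"], [0, 1, 1], dtype)
```

Keeping this in one place means the parsers only need to hand over categories and codes.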

jreback · 2017-09-28T13:31:54Z

pandas/core/categorical.py

+            dtype = CategoricalDtype(cats, ordered=False)
+            codes = inferred_codes
+
+        return cls(codes, dtype=dtype, fastpath=True)


jreback · 2017-09-28T13:32:29Z

pandas/core/categorical.py

+        -------
+        Categorical
+        """
+        from pandas.core.dtypes.cast import maybe_convert_for_categorical


Yeah, not sure we need maybe_convert_for_categorical now; maybe move it here.

I left it as its own method since the python parser still needed it too. That one is different enough since it doesn't have codes, just values.

jreback · 2017-09-28T13:35:40Z

pandas/core/dtypes/cast.py

+    >>> maybe_convert_for_categorical([1, 'a'], CategoricalDtype([1, 2]))
+    array([  1.,  nan])
+    """
+    if isinstance(dtype, CategoricalDtype) and dtype.categories is not None:


In reality this is maybe just an Index routine:

Index(categories).coerce_to_dtype(dtype.categories)

and the if isinstance(dtype, ....) logic can be in Categorical._from_inferred....

See my comment below: you can simply fold this into Categorical._from_inferred_categories. I'm not averse to making the inside of this an Index routine though (as it's just coercing on the index type).

TomAugspurger · 2017-09-28T13:49:05Z

Hmm, seems like the compiler error is back on circle CI. Looking into it.

jreback · 2017-09-29T10:19:30Z

pandas/io/parsers.py

+            known_cats = (isinstance(cast_type, CategoricalDtype) and
+                          cast_type.categories is not None)
+
+            categories = ordered = None


Why is this not using Categorical._from_inferred_categories? This code duplication is just creating technical debt.

I'm not sure how much cleaner 3de75cd is. These really don't share much code, since the python parser has values, while the C parser has categories and codes. And the python parser has to maybe cast values to strings with cast_type='category'.

jreback · 2017-09-29T10:20:34Z

pandas/core/dtypes/cast.py

+    >>> maybe_convert_for_categorical([1, 'a'], CategoricalDtype([1, 2]))
+    array([  1.,  nan])
+    """
+    if isinstance(dtype, CategoricalDtype) and dtype.categories is not None:


See my comment below: you can simply fold this into Categorical._from_inferred_categories. I'm not averse to making the inside of this an Index routine though (as it's just coercing on the index type).

jreback · 2017-09-29T12:11:21Z

pandas/io/parsers.py

+                values = Categorical._from_inferred_categories(
+                    cats, cats.get_indexer(values), cast_type
+                )
+            else:


Any reason you are not handling this case as well? I get that it conflates the purpose of _from_inferred_categories a bit, but in reality this is just like passing dtype=None.

I don't like to scatter casting/inference code around; it's very hard to figure out what's going on when it's not in one place.

jreback

minor comment lgtm otherwise

jreback · 2017-09-30T21:29:04Z

pandas/core/categorical.py

+        Parameters
+        ----------
+
+        inferred_categories, inferred_codes : Index


separate lines for params

jreback · 2017-09-30T21:29:49Z

pandas/core/categorical.py

+        cats = Index(inferred_categories)
+
+        # Convert to a specialzed type with `dtype` is specified
+        if (isinstance(dtype, CategoricalDtype) and


dtype by definition is already a CDT

It could also be the string 'category'. I've clarified the docstring.

TomAugspurger · 2017-10-02T14:07:53Z

All green. Merging.

I opened up #17743 for optimizing _categorical_convert in the C parser. I won't have time to get to it for the release though.

jorisvandenbossche · 2017-10-02T14:11:32Z

Thanks!

jreback · 2017-10-02T14:32:55Z

thanks @TomAugspurger this is great!

* ENH: Accept CategoricalDtype in CSV reader
* rework
* Fixed basic implementation
* Added casting
* Doc and cleanup
* Fixed assignment of categoricals
* Doc and test unexpected values
* DOC: fixups
* More coercion, use _recode_for_categories
* Refactor with maybe_convert_for_categorical
* PEP8
* Type for 32bit
* REF: refactor to new method
* py2 compat
* Refactored
* More in Categorical
* fixup! More in Categorical

TomAugspurger force-pushed the categorical-csv-2 branch from e8c4619 to ccbaa04 · September 23, 2017 11:28

TomAugspurger commented · Sep 23, 2017

jreback added the Categorical (Categorical Data Type) and IO CSV (read_csv, to_csv) labels · Sep 23, 2017

TomAugspurger added 2 commits · September 24, 2017 09:01
e83a0b8 ENH: Accept CategoricalDtype in CSV reader
388e8a9 rework

gfyoung reviewed · Sep 24, 2017

TomAugspurger added 2 commits · September 24, 2017 15:39
c5f6e04 Fixed basic implementation
4b588cd Added casting

TomAugspurger force-pushed the categorical-csv-2 branch from ccbaa04 to 4b588cd · September 25, 2017 18:23

e32d5be Doc and cleanup

TomAugspurger added this to the 0.21.0 milestone · Sep 25, 2017

508dd1e Fixed assignment of categoricals

jorisvandenbossche reviewed · Sep 25, 2017

6f175a7 Doc and test unexpected values

TomAugspurger force-pushed the categorical-csv-2 branch from aa72ffe to 6f175a7 · September 26, 2017 13:52

jorisvandenbossche reviewed · Sep 26, 2017

1545734 DOC: fixups

chris-b1 reviewed · Sep 26, 2017

jreback requested changes · Sep 26, 2017

TomAugspurger added 2 commits · September 26, 2017 14:12
de9e3ee Merge remote-tracking branch 'upstream/master' into categorical-csv-2
b80cff8 More coercion, use _recode_for_categories

TomAugspurger added 3 commits · September 26, 2017 15:50
b028827 Refactor with maybe_convert_for_categorical
fc34080 PEP8
d100f0c Type for 32bit

jreback requested changes · Sep 28, 2017

8600c50 REF: refactor to new method

jreback reviewed · Sep 28, 2017

TomAugspurger added 2 commits · September 28, 2017 08:50
8c4ab5b Merge remote-tracking branch 'upstream/master' into categorical-csv-2
96d5144 py2 compat

jreback requested changes · Sep 29, 2017

3de75cd Refactored

jreback reviewed · Sep 29, 2017

f03798d More in Categorical

jreback approved these changes · Sep 30, 2017

9325a93 fixup! More in Categorical

TomAugspurger mentioned this pull request · Oct 2, 2017
PERF: Optimize _categorical_convert CSV parser when categories are known ahead of time #17743 (Open)

TomAugspurger merged commit def3bce into pandas-dev:master · Oct 2, 2017

TomAugspurger deleted the categorical-csv-2 branch · October 2, 2017 14:12
