
ENH: parse categoricals in read_csv #13406


Closed (wants to merge 11 commits)
21 changes: 21 additions & 0 deletions asv_bench/benchmarks/parser_vb.py
@@ -114,6 +114,27 @@ def teardown(self):
os.remove('test.csv')


class read_csv_categorical(object):
goal_time = 0.2

def setup(self):
N = 100000
group1 = ['aaaaaaaa', 'bbbbbbb', 'cccccccc', 'dddddddd', 'eeeeeeee']
df = DataFrame({'a': np.random.choice(group1, N).astype('object'),
'b': np.random.choice(group1, N).astype('object'),
'c': np.random.choice(group1, N).astype('object')})
df.to_csv('strings.csv', index=False)

def time_read_csv_categorical_post(self):
read_csv('strings.csv').apply(pd.Categorical)

def time_read_csv_categorical_direct(self):
read_csv('strings.csv', dtype='category')

def teardown(self):
os.remove('strings.csv')


class read_table_multiple_date(object):
goal_time = 0.2

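A standalone sketch of what the benchmark above measures: parsing repeated strings directly as `'category'` versus converting after the fact. It stays in memory via `StringIO` instead of writing the benchmark's `strings.csv`, and uses a smaller N; only the shape of the comparison is taken from the benchmark.

```python
import numpy as np
import pandas as pd
from io import StringIO

# Build a small CSV of repeated strings, mirroring the benchmark's setup.
group1 = ['aaaaaaaa', 'bbbbbbb', 'cccccccc', 'dddddddd', 'eeeeeeee']
rng = np.random.RandomState(0)
csv = pd.DataFrame({'a': rng.choice(group1, 1000)}).to_csv(index=False)

# "post" path: parse as object, then convert each column to Categorical.
post = pd.read_csv(StringIO(csv)).apply(pd.Categorical)
# "direct" path: parse straight into category dtype.
direct = pd.read_csv(StringIO(csv), dtype='category')

print(str(direct['a'].dtype))  # category
```

The direct path avoids materializing the intermediate object column, which is where the parse-time and memory savings come from.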
37 changes: 37 additions & 0 deletions doc/source/io.rst
@@ -500,6 +500,43 @@ worth trying.
data that was read in. It is important to note that the overall column will be
marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes.

.. _io.categorical:

Specifying Categorical dtype
''''''''''''''''''''''''''''

.. versionadded:: 0.19.0

``Categorical`` columns can be parsed directly by specifying ``dtype='category'``.

.. ipython:: python

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'

pd.read_csv(StringIO(data))
pd.read_csv(StringIO(data)).dtypes
pd.read_csv(StringIO(data), dtype='category').dtypes

Individual columns can be parsed as a ``Categorical`` using a dict specification.

.. ipython:: python

pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes

.. note::

The resulting categories will always be parsed as strings (object dtype).
If the categories are numeric they can be converted using the
:func:`to_numeric` function, or as appropriate, another converter
such as :func:`to_datetime`.

.. ipython:: python

df = pd.read_csv(StringIO(data), dtype='category')
df.dtypes
df['col3']
df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)
Contributor: this is not idiomatic; we don't assign directly to .categories, instead use .set_categories:

In [17]: df = pd.DataFrame({'A' : pd.Categorical(['1','2','1'])})

In [18]: df.A
Out[18]:
0    1
1    2
2    1
Name: A, dtype: category
Categories (2, object): [1, 2]

In [19]: df['A'] = df['A'].cat.set_categories(pd.to_numeric(df['A'].cat.categories), rename=True)

In [20]: df.A
Out[20]:
0    1
1    2
2    1
Name: A, dtype: category
Categories (2, int64): [1, 2]

I could see adding a .apply_categories accessor, so this would become:

df['A'] = df['A'].cat.apply_categories(pd.to_numeric)

which would do the apply-set-with-rename idiom.
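A minimal sketch of the `apply_categories` helper proposed above; this accessor is NOT a real pandas API. Note that on pandas >= 2.0 the `set_categories(..., rename=True)` keyword used in the comment is gone, so the sketch uses `rename_categories`, which renames categories in place while leaving the codes untouched.

```python
import pandas as pd

def apply_categories(s, func):
    """Hypothetical helper: apply `func` to the categories of Series `s`,
    keeping the codes unchanged (the apply-set-with-rename idiom)."""
    return s.cat.rename_categories(func(s.cat.categories))

df = pd.DataFrame({'A': pd.Categorical(['1', '2', '1'])})
df['A'] = apply_categories(df['A'], pd.to_numeric)
print(df['A'].cat.categories.tolist())  # [1, 2]
```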

Contributor: actually, on second thought, this is OK; assigning directly to cat.categories is fine.

Contributor @jankatins, Jul 21, 2016:

IMO assigning to categories is OK; we even have it in the docs: https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L1857

The code is here and checks for the same length as the old cats: https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L546-L570

set_categories (source) was designed to do more:

This method can be used to perform more than one action of adding,
removing, and reordering simultaneously and is therefore faster than
performing the individual steps via the more specialised methods.
On the other hand, this method does not do checks (e.g., whether the
old categories are included in the new categories on a reorder), which
can result in surprising changes, for example when using special string
dtypes on Python 3, which do not consider an S1 string equal to a
single-char Python string.

-> IMO assigning is better, and set_categories should be avoided unless you really have a performance bottleneck due to multiple calls to change the categories...
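A small sketch illustrating the "no checks" caveat quoted above: `set_categories` matches old categories by value, so a value whose old category is missing from the new categories silently becomes NaN rather than raising.

```python
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'a']))
# 'b' is absent from the new categories, so its values are silently dropped.
out = s.cat.set_categories(['a', 'c'])
print(out.isna().tolist())  # [False, True, False]
```

This is the kind of surprising change the quoted docstring warns about, and why length-checked renaming is the safer default.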

df['col3']


Naming and Using Columns
49 changes: 49 additions & 0 deletions doc/source/whatsnew/v0.19.0.txt
@@ -12,6 +12,7 @@ Highlights include:
- :func:`merge_asof` for asof-style time-series joining, see :ref:`here <whatsnew_0190.enhancements.asof_merge>`
- ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>`
- pandas development api, see :ref:`here <whatsnew_0190.dev_api>`
- :func:`read_csv` now supports parsing ``Categorical`` data, see :ref:`here <whatsnew_0190.enhancements.read_csv_categorical>`

.. contents:: What's new in v0.19.0
:local:
@@ -195,6 +196,14 @@ default of the index) in a DataFrame.
:func:`read_csv` has improved support for duplicate column names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. ipython:: python
:suppress:

from pandas.compat import StringIO

.. _whatsnew_0190.enhancements.read_csv_dupe_col_names_support:


:ref:`Duplicate column names <io.dupe_names>` are now supported in :func:`read_csv` whether
they are in the file or passed in as the ``names`` parameter (:issue:`7160`, :issue:`9424`)

@@ -222,6 +231,46 @@ New behaviour:

In [2]: pd.read_csv(StringIO(data), names=names)


.. _whatsnew_0190.enhancements.read_csv_categorical:

:func:`read_csv` supports parsing ``Categorical`` directly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :func:`read_csv` function now supports parsing a ``Categorical`` column when
specified as a dtype (:issue:`10153`). Depending on the structure of the data,
this can result in a faster parse time and lower memory usage compared to
converting to ``Categorical`` after parsing. See the io :ref:`docs here <io.categorical>`.

.. ipython:: python

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'

pd.read_csv(StringIO(data))
pd.read_csv(StringIO(data)).dtypes
pd.read_csv(StringIO(data), dtype='category').dtypes

Individual columns can be parsed as a ``Categorical`` using a dict specification.

.. ipython:: python

pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes

.. note::

The resulting categories will always be parsed as strings (object dtype).
If the categories are numeric they can be converted using the
:func:`to_numeric` function, or as appropriate, another converter
such as :func:`to_datetime`.

.. ipython:: python

df = pd.read_csv(StringIO(data), dtype='category')
df.dtypes
df['col3']
df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)
df['col3']
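The equivalence described above can be checked with a quick sketch: parsing with `dtype='category'` yields category columns whose values match the plainly parsed data (all categories come back as strings). This is an illustrative check, not part of the PR.

```python
import pandas as pd
from io import StringIO

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'

direct = pd.read_csv(StringIO(data), dtype='category')

# Every column is categorical, and the numeric column's categories
# are parsed as strings, matching the note above.
print(bool((direct.dtypes == 'category').all()))  # True
print(direct['col3'].astype(str).tolist())        # ['1', '2', '3']
```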

.. _whatsnew_0190.enhancements.semi_month_offsets:

Semi-Month Offsets
94 changes: 93 additions & 1 deletion pandas/io/tests/parser/c_parser_only.py
@@ -12,9 +12,10 @@

import pandas as pd
import pandas.util.testing as tm
from pandas import DataFrame, Series, Index, MultiIndex
from pandas import DataFrame, Series, Index, MultiIndex, Categorical
from pandas import compat
from pandas.compat import StringIO, range, lrange
from pandas.types.dtypes import CategoricalDtype


class CParserTests(object):
@@ -135,6 +136,11 @@ def test_passing_dtype(self):
dtype={'A': 'timedelta64', 'B': 'float64'},
index_col=0)

# valid but unsupported - fixed width unicode string
self.assertRaises(TypeError, self.read_csv, path,
dtype={'A': 'U8'},
index_col=0)

# see gh-12048: empty frame
actual = self.read_csv(StringIO('A,B'), dtype=str)
expected = DataFrame({'A': [], 'B': []}, index=[], dtype=str)
@@ -184,6 +190,92 @@ def test_pass_dtype(self):
self.assertEqual(result['one'].dtype, 'u1')
self.assertEqual(result['two'].dtype, 'object')

def test_categorical_dtype(self):
# GH 10153
data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""
expected = pd.DataFrame({'a': Categorical(['1', '1', '2']),
'b': Categorical(['a', 'a', 'b']),
'c': Categorical(['3.4', '3.4', '4.5'])})
actual = self.read_csv(StringIO(data), dtype='category')
tm.assert_frame_equal(actual, expected)

actual = self.read_csv(StringIO(data), dtype=CategoricalDtype())
tm.assert_frame_equal(actual, expected)

actual = self.read_csv(StringIO(data), dtype={'a': 'category',
'b': 'category',
'c': CategoricalDtype()})
tm.assert_frame_equal(actual, expected)

actual = self.read_csv(StringIO(data), dtype={'b': 'category'})
expected = pd.DataFrame({'a': [1, 1, 2],
'b': Categorical(['a', 'a', 'b']),
'c': [3.4, 3.4, 4.5]})
tm.assert_frame_equal(actual, expected)

actual = self.read_csv(StringIO(data), dtype={1: 'category'})
tm.assert_frame_equal(actual, expected)

# unsorted
data = """a,b,c
1,b,3.4
1,b,3.4
2,a,4.5"""
expected = pd.DataFrame({'a': Categorical(['1', '1', '2']),
'b': Categorical(['b', 'b', 'a']),
'c': Categorical(['3.4', '3.4', '4.5'])})
actual = self.read_csv(StringIO(data), dtype='category')
tm.assert_frame_equal(actual, expected)

# missing
data = """a,b,c
1,b,3.4
1,nan,3.4
2,a,4.5"""
expected = pd.DataFrame({'a': Categorical(['1', '1', '2']),
'b': Categorical(['b', np.nan, 'a']),
'c': Categorical(['3.4', '3.4', '4.5'])})
actual = self.read_csv(StringIO(data), dtype='category')
tm.assert_frame_equal(actual, expected)

def test_categorical_dtype_encoding(self):
# GH 10153
pth = tm.get_data_path('unicode_series.csv')
encoding = 'latin-1'
expected = self.read_csv(pth, header=None, encoding=encoding)
expected[1] = Categorical(expected[1])
actual = self.read_csv(pth, header=None, encoding=encoding,
dtype={1: 'category'})
tm.assert_frame_equal(actual, expected)

pth = tm.get_data_path('utf16_ex.txt')
encoding = 'utf-16'
expected = self.read_table(pth, encoding=encoding)
expected = expected.apply(Categorical)
actual = self.read_table(pth, encoding=encoding, dtype='category')
tm.assert_frame_equal(actual, expected)

def test_categorical_dtype_chunksize(self):
# GH 10153
data = """a,b
1,a
1,b
1,b
2,c"""
expecteds = [pd.DataFrame({'a': [1, 1],
'b': Categorical(['a', 'b'])}),
pd.DataFrame({'a': [1, 2],
'b': Categorical(['b', 'c'])},
index=[2, 3])]
actuals = self.read_csv(StringIO(data), dtype={'b': 'category'},
chunksize=2)

for actual, expected in zip(actuals, expecteds):
tm.assert_frame_equal(actual, expected)
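As the chunksize test above shows, each chunk infers its own categories, so the dtypes of successive chunks need not agree. A hedged sketch of how a caller can recombine such chunks safely, using `pandas.api.types.union_categoricals` (the data string mirrors the test's):

```python
import pandas as pd
from io import StringIO
from pandas.api.types import union_categoricals

data = 'a,b\n1,a\n1,b\n1,b\n2,c'
chunks = list(pd.read_csv(StringIO(data), dtype={'b': 'category'},
                          chunksize=2))

# Each chunk's 'b' may have different categories; union them into one
# Categorical covering all values seen across chunks.
combined = union_categoricals([c['b'] for c in chunks])
print(list(combined))  # ['a', 'b', 'b', 'c']
```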

def test_pass_dtype_as_recarray(self):
if compat.is_platform_windows() and self.low_memory:
raise nose.SkipTest(