BUG: read_csv(dtype='category') raises with many categories #18186

Closed
adbull opened this issue Nov 9, 2017 · 5 comments · Fixed by #18402
Labels
Categorical (Categorical Data Type), IO CSV (read_csv, to_csv), Regression (Functionality that used to work in a prior pandas version)
Comments

adbull commented Nov 9, 2017

Code Sample, a copy-pastable example if possible

import io
import pandas as pd
csv = io.StringIO('\n'.join(map(str, range(10**6))))
df = pd.read_csv(csv, dtype='category')

results in

  File "bug.py", line 5, in <module>
    df = pd.read_csv(csv, dtype='category')
  File "lib/python3.6/site-packages/pandas/io/parsers.py", line 705, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "lib/python3.6/site-packages/pandas/io/parsers.py", line 451, in _read
    data = parser.read(nrows)
  File "lib/python3.6/site-packages/pandas/io/parsers.py", line 1065, in read
    ret = self._engine.read(nrows)
  File "lib/python3.6/site-packages/pandas/io/parsers.py", line 1828, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 894, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 944, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 2218, in pandas._libs.parsers._concatenate_chunks
  File "lib/python3.6/site-packages/numpy/core/numerictypes.py", line 1016, in find_common_type
    array_types = [dtype(x) for x in array_types]
  File "lib/python3.6/site-packages/numpy/core/numerictypes.py", line 1016, in <listcomp>
    array_types = [dtype(x) for x in array_types]
TypeError: data type not understood

Problem description

read_csv now raises when reading a column with many unique values as a category. This appears to be a regression in 0.21.0, due to the introduction of CategoricalDtype.
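
The last frames of the traceback show NumPy failing to interpret one of the dtypes handed to it inside `_concatenate_chunks`. The snippet below is a minimal sketch of that failure mode, not the parser's actual code path. It assumes (the report does not state this explicitly) that each chunk of a large file infers its own categories, so the per-chunk dtypes differ and get passed on to NumPy. The category values are made up for illustration.

```python
import numpy as np
from pandas.api.types import CategoricalDtype

# Hypothetical per-chunk dtypes: each chunk saw different values.
first_chunk = CategoricalDtype(categories=['0', '1'])
second_chunk = CategoricalDtype(categories=['2', '3'])

# Dtypes with different categories compare unequal, so the set has two members.
chunk_dtypes = {first_chunk, second_chunk}
print(len(chunk_dtypes))  # 2

# NumPy cannot turn a pandas extension dtype into an np.dtype.
try:
    np.dtype(first_chunk)
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)
```

The exact wording of the error message depends on the NumPy version, so only the exception type is printed here.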

Expected Output

No exception.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.11.11-300.fc26.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 4.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: 0.1.3
pandas_gbq: None
pandas_datareader: None

TomAugspurger commented Nov 9, 2017

Thanks for the report.

Something like

diff --git a/pandas/_libs/parsers.pyx b/pandas/_libs/parsers.pyx
index 85857c158..e3a2a2186 100644
--- a/pandas/_libs/parsers.pyx
+++ b/pandas/_libs/parsers.pyx
@@ -2228,8 +2228,9 @@ def _concatenate_chunks(list chunks):
         arrs = [chunk.pop(name) for chunk in chunks]
         # Check each arr for consistent types.
         dtypes = set([a.dtype for a in arrs])
-        if len(dtypes) > 1:
-            common_type = np.find_common_type(dtypes, [])
+        numpy_dtypes = {x for x in dtypes if not is_categorical_dtype(x)}
+        if len(numpy_dtypes) > 1:
+            common_type = np.find_common_type(numpy_dtypes, [])
             if common_type == np.object:
                 warning_columns.append(str(name))

is what we want, though that's specific to Categoricals. We would want to avoid sending any of our extension dtypes to np.find_common_type.
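
The idea in the diff above, keeping extension dtypes away from NumPy's type promotion, can be sketched outside the parser. This is an illustration under stated assumptions, not the patch that was merged: `concat_column` is a hypothetical helper, `isinstance(dt, np.dtype)` stands in for the `is_categorical_dtype` check so that any extension dtype is filtered out, `np.result_type` stands in for `np.find_common_type`, and `union_categoricals` is one possible way to merge categorical chunks.

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals


def concat_column(arrs):
    """Hypothetical helper: combine the per-chunk arrays of one column."""
    dtypes = {a.dtype for a in arrs}
    # pandas extension dtypes (e.g. CategoricalDtype) are not np.dtype instances.
    numpy_dtypes = {dt for dt in dtypes if isinstance(dt, np.dtype)}
    if len(numpy_dtypes) < len(dtypes):
        # Assumption: if any chunk is categorical, all chunks are.
        return union_categoricals(arrs)
    # Only genuine NumPy dtypes ever reach NumPy's promotion logic.
    common = np.result_type(*numpy_dtypes)
    return np.concatenate([np.asarray(a, dtype=common) for a in arrs])


# Categorical chunks with different categories are merged, not promoted.
merged = concat_column([pd.Categorical(['0', '1']), pd.Categorical(['2', '3'])])
print(list(merged))  # ['0', '1', '2', '3']

# Plain NumPy chunks still go through ordinary type promotion.
numeric = concat_column([np.array([1, 2]), np.array([0.5])])
print(numeric.dtype)  # float64
```

The sketch omits the mixed-type warning that the original function collects in `warning_columns`.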

TomAugspurger added the Categorical, IO CSV, and Regression labels Nov 9, 2017
TomAugspurger added this to the 0.21.1 milestone Nov 9, 2017

jreback commented Nov 9, 2017

use pandas_dtype here or find_common_type
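
For readers unfamiliar with `pandas_dtype`, here is a short sketch of what it does, using only the public `pandas.api.types` namespace. It resolves a dtype specification to either a NumPy dtype or a pandas extension dtype, which makes it possible to tell the two apart before involving NumPy. The `find_common_type` in the comment above presumably refers to a pandas-internal helper rather than NumPy's function; that is an assumption, and the helper is not shown here because it is not public API.

```python
import numpy as np
from pandas.api.types import CategoricalDtype, pandas_dtype

# A string alias for an extension dtype resolves to the pandas dtype object.
cat = pandas_dtype('category')
print(isinstance(cat, CategoricalDtype))  # True

# NumPy aliases and types resolve to plain np.dtype instances.
i64 = pandas_dtype('int64')
f64 = pandas_dtype(np.float64)
print(isinstance(i64, np.dtype), isinstance(f64, np.dtype))  # True True

# An already-resolved extension dtype is not an np.dtype, which is how
# the two families can be separated.
print(isinstance(cat, np.dtype))  # False
```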

@TomAugspurger

@adbull do you have time to submit a PR with a fix like that, along with some tests?

@tomanizer

Is there a workaround that would help us run existing code on 0.21.0?
Or should users downgrade to 0.20.3?

@TomAugspurger

@tomanizer not sure off the top of my head. Reading them as strings and then converting to categorical I suppose, but that may not be an option depending on memory usage. You could presumably use chunksize=., but that may introduce problems as well.
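
The two workarounds mentioned above could look roughly like this. It is a sketch, not a tested fix for 0.21.0: the column name `value`, the row count, and the chunk size are arbitrary choices for illustration, and `union_categoricals` is one way (an assumption, not something proposed in this thread) to stitch the chunks back together when their categories differ.

```python
import io

import pandas as pd
from pandas.api.types import union_categoricals

# Small stand-in for the large file in the report.
data = '\n'.join(map(str, range(1000)))

# Workaround 1: read as strings, then convert to categorical.
# This holds the full string column in memory before converting.
df = pd.read_csv(io.StringIO(data), header=None, names=['value'], dtype=str)
df['value'] = df['value'].astype('category')

# Workaround 2: read in chunks and merge the categoricals afterwards.
chunks = pd.read_csv(io.StringIO(data), header=None, names=['value'],
                     dtype='category', chunksize=250)
combined = union_categoricals([chunk['value'] for chunk in chunks])
series = pd.Series(combined, name='value')

print(len(df), len(series))  # 1000 1000
```

The second variant never materialises the whole column as strings, but the resulting category order follows the order in which values first appear across chunks.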

We're doing a bugfix release Wednesday or Thursday. If you have a chance to submit a PR before then we can get it merged.

@sam-cohan sam-cohan mentioned this issue Nov 21, 2017
sam-cohan added a commit to sam-cohan/pandas that referenced this issue Nov 21, 2017
sam-cohan added a commit to sam-cohan/pandas that referenced this issue Nov 22, 2017