Error reading an empty CSV with known column names and dtype 'category' #14606

spitz-dan-l · 2016-11-07T20:50:31Z

Hello, I've found a corner case where specifying category dtypes in pd.read_csv causes an error when it ought to return an empty dataframe.

A small, complete example of the issue

from io import StringIO
import pandas as pd

pd.read_csv(StringIO(''), names=['a'], dtype={'a': 'object'})  # works

pd.read_csv(StringIO(''), names=['a'], dtype={'a': 'category'})  # breaks

Expected Output: an empty DataFrame, as returned in the 'object' case. Actual output:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
   1506         try:
-> 1507             data = self._reader.read(nrows)
   1508         except StopIteration:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:10364)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:11033)()

StopIteration: 

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-819a0c4e04cb> in <module>()
----> 1 pd.read_csv(StringIO(''), header=None, names=['a'], dtype={'a': 'category'})

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    643                     skip_blank_lines=skip_blank_lines)
    644 
--> 645         return _read(filepath_or_buffer, kwds)
    646 
    647     parser_f.__name__ = name

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    398         return parser
    399 
--> 400     data = parser.read()
    401     parser.close()
    402     return data

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
    936                 raise ValueError('skipfooter not supported for iteration')
    937 
--> 938         ret = self._engine.read(nrows)
    939 
    940         if self.options.get('as_recarray'):

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
   1513                 index, columns, col_dict = _get_empty_meta(
   1514                     names, self.index_col, self.index_names,
-> 1515                     dtype=self.kwds.get('dtype'))
   1516 
   1517                 if self.usecols is not None:

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in _get_empty_meta(columns, index_col, index_names, dtype)
   2803     col_dict = dict((col_name,
   2804                      np.empty(0, dtype=dtype.get(col_name, np.object)))
-> 2805                     for col_name in columns)
   2806 
   2807     return index, columns, col_dict

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in <genexpr>(.0)
   2803     col_dict = dict((col_name,
   2804                      np.empty(0, dtype=dtype.get(col_name, np.object)))
-> 2805                     for col_name in columns)
   2806 
   2807     return index, columns, col_dict

TypeError: data type "category" not understood

The problem appears to be in _get_empty_meta(), where the 'category' dtype is passed straight to np.empty. I don't know the idiomatic way to construct an empty Categorical Series, but that is what needs to happen.
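The failure in the traceback can be reproduced in isolation. The sketch below is illustrative only: `empty_column` is a hypothetical helper, not the function pandas uses internally. It shows that np.empty rejects 'category' because it only understands NumPy dtypes, while a pandas constructor accepts it.

```python
import numpy as np
import pandas as pd


def empty_column(dtype):
    """Hypothetical helper: build a zero-length column for the given dtype."""
    # pd.Series understands pandas extension dtypes such as 'category',
    # whereas np.empty only accepts NumPy dtypes.
    return pd.Series([], dtype=dtype)


# np.empty cannot interpret the pandas-only dtype string and raises TypeError.
try:
    np.empty(0, dtype="category")
    numpy_error = None
except TypeError as exc:
    numpy_error = exc

col = empty_column("category")
print(type(numpy_error).__name__, "|", col.dtype, "|", len(col))
```

Going through a pandas constructor rather than np.empty is one way to handle both NumPy and pandas dtypes with the same code path.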

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: None
pip: 8.1.1
setuptools: 20.3
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: 0.6.7.None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.8
boto: None
pandas_datareader: None

chris-b1 · 2016-11-07T21:03:13Z

Thanks for the report. You can construct an empty Categorical like this. The categories dtype is assumed to be float64, which is a bit odd, but it is used in several places as the "don't know" dtype.

In [5]: pd.Categorical([])
Out[5]: [], Categories (0, float64): []

or

In [17]: pd.Series(dtype='category')
Out[17]: 
Series([], dtype: category
Categories (0, float64): [])

jreback · 2016-11-07T21:05:58Z

You can specify the dtype, so I think it maybe should be object. (It shouldn't matter if you union, as it will coerce, I think; that may need checking.)

jreback · 2016-11-07T21:06:44Z

You could also give it np.empty(0, dtype='object') instead of an empty list to the Categorical constructor.
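A minimal sketch of the two constructions discussed above, for comparison. The variable names are illustrative. The categories dtype inferred for an empty list has varied between pandas versions, so the sketch prints the dtypes instead of assuming them.

```python
import numpy as np
import pandas as pd

# Empty list: pandas has to guess the categories dtype.
default_cat = pd.Categorical([])

# Empty object array: the caller supplies the dtype of the (empty) values.
object_cat = pd.Categorical(np.empty(0, dtype="object"))

print("from []:          ", default_cat.categories.dtype)
print("from object array:", object_cat.categories.dtype)
```

Either value can back an empty categorical column; the difference is only in which dtype the empty categories index carries.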

gfyoung · 2016-11-26T07:29:35Z

@jreback : #14717 patches this up with a test included to confirm!

jorisvandenbossche · 2016-11-26T11:44:22Z

@gfyoung Thanks for the notice. Your PR indeed fixed this issue, but I added one more test to specifically confirm the fix of this issue: #14752 (dtype='category' did not give an error before, while dtype={'a': 'category'} did)
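A check along these lines covers both spellings of the dtype argument. This is a sketch based on the reproduction in the report, not the test that was merged; `read_empty` is a hypothetical helper, and the result assumes a pandas version that includes the fix.

```python
from io import StringIO

import pandas as pd


def read_empty(dtype):
    """Hypothetical helper: read an empty CSV with one known column name."""
    return pd.read_csv(StringIO(""), names=["a"], dtype=dtype)


# dtype given as a single string (did not raise before the fix)
by_name = read_empty("category")

# dtype given per column (raised TypeError in 0.19.1)
by_dict = read_empty({"a": "category"})

for frame in (by_name, by_dict):
    print(list(frame.columns), frame["a"].dtype, len(frame))
```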

chris-b1 added Bug IO CSV read_csv, to_csv Categorical Categorical Data Type labels Nov 7, 2016

chris-b1 added this to the Next Major Release milestone Nov 7, 2016

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Nov 26, 2016

TST: add test to confirm GH14606 (specify category dtype for empty)

371eae2

Issue pandas-dev#14606 was fixed by PR pandas-dev#14717, adding one more specific test to confirm this

jorisvandenbossche modified the milestones: 0.19.2, Next Major Release Nov 26, 2016

jorisvandenbossche mentioned this issue Nov 26, 2016

TST: add test to confirm GH14606 (specify category dtype for empty) #14752

Merged

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Nov 26, 2016

TST: add test to confirm GH14606 (specify category dtype for empty)

bd88c60

Issue pandas-dev#14606 was fixed by PR pandas-dev#14717, adding one more specific test to confirm this

jorisvandenbossche closed this as completed in #14752 Dec 10, 2016

jorisvandenbossche added a commit that referenced this issue Dec 10, 2016

TST: add test to confirm GH14606 (specify category dtype for empty) (#…

3710f2e

…14752) Issue #14606 was fixed by PR #14717, adding one more specific test to confirm this

ischurov pushed a commit to ischurov/pandas that referenced this issue Dec 19, 2016

TST: add test to confirm GH14606 (specify category dtype for empty) (p…

47f9730

…andas-dev#14752) Issue pandas-dev#14606 was fixed by PR pandas-dev#14717, adding one more specific test to confirm this
