Error reading an empty CSV with known column names and dtype 'category' #14606

Closed
spitz-dan-l opened this issue Nov 7, 2016 · 5 comments · Fixed by #14752
Labels
Bug Categorical Categorical Data Type IO CSV read_csv, to_csv

@spitz-dan-l

Hello, I've found a corner case where specifying category dtypes in pd.read_csv causes an error when it ought to return an empty dataframe.

A small, complete example of the issue

from io import StringIO
import pandas as pd

pd.read_csv(StringIO(''), names=['a'], dtype={'a': 'object'})  # works
pd.read_csv(StringIO(''), names=['a'], dtype={'a': 'category'})  # breaks

Actual Output (an empty DataFrame was expected)

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
   1506         try:
-> 1507             data = self._reader.read(nrows)
   1508         except StopIteration:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:10364)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:11033)()

StopIteration: 

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-819a0c4e04cb> in <module>()
----> 1 pd.read_csv(StringIO(''), header=None, names=['a'], dtype={'a': 'category'})

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    643                     skip_blank_lines=skip_blank_lines)
    644 
--> 645         return _read(filepath_or_buffer, kwds)
    646 
    647     parser_f.__name__ = name

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    398         return parser
    399 
--> 400     data = parser.read()
    401     parser.close()
    402     return data

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
    936                 raise ValueError('skipfooter not supported for iteration')
    937 
--> 938         ret = self._engine.read(nrows)
    939 
    940         if self.options.get('as_recarray'):

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
   1513                 index, columns, col_dict = _get_empty_meta(
   1514                     names, self.index_col, self.index_names,
-> 1515                     dtype=self.kwds.get('dtype'))
   1516 
   1517                 if self.usecols is not None:

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in _get_empty_meta(columns, index_col, index_names, dtype)
   2803     col_dict = dict((col_name,
   2804                      np.empty(0, dtype=dtype.get(col_name, np.object)))
-> 2805                     for col_name in columns)
   2806 
   2807     return index, columns, col_dict

/Users/dspitz/miniconda3/lib/python3.5/site-packages/pandas/io/parsers.py in <genexpr>(.0)
   2803     col_dict = dict((col_name,
   2804                      np.empty(0, dtype=dtype.get(col_name, np.object)))
-> 2805                     for col_name in columns)
   2806 
   2807     return index, columns, col_dict

TypeError: data type "category" not understood

The problem appears to be in _get_empty_meta(), where the 'category' dtype is passed straight to np.empty; NumPy does not recognize that dtype, hence the TypeError. I don't know the idiomatic way to construct an empty Categorical series, but that is what needs to happen here.
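To make the diagnosis concrete, here is a minimal sketch of the failing step and one possible way around it. The helper name make_empty_columns is hypothetical and only imitates the col_dict step shown in the traceback; it is not the code pandas actually uses, and building the columns with pd.Series is an assumption about how such a fix could look.

```python
import numpy as np
import pandas as pd


def make_empty_columns(columns, dtype=None):
    """Hypothetical stand-in for the col_dict step of _get_empty_meta().

    Builds each empty column with pd.Series, which understands pandas
    dtypes such as 'category', instead of np.empty, which does not.
    """
    dtype = dtype or {}
    return {
        name: pd.Series([], dtype=dtype.get(name, object))
        for name in columns
    }


# np.empty only knows NumPy dtypes, so 'category' is rejected.
try:
    np.empty(0, dtype="category")
    numpy_accepts_category = True
except TypeError:
    numpy_accepts_category = False

# Columns without an explicit dtype fall back to object, as in the traceback.
cols = make_empty_columns(["a", "b"], dtype={"a": "category"})
frame = pd.DataFrame(cols)
```

The resulting frame has no rows, with column a typed as category and column b as object.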

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: None
pip: 8.1.1
setuptools: 20.3
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: 0.6.7.None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.8
boto: None
pandas_datareader: None

@chris-b1
Contributor

chris-b1 commented Nov 7, 2016

Thanks for the report. You can construct an empty Categorical like this. The categories dtype is assumed to be float64, which is a bit odd, but float64 is used in several places as the "don't know" dtype.

In [5]: pd.Categorical([])
Out[5]: [], Categories (0, float64): []

or

In [17]: pd.Series(dtype='category')
Out[17]: 
Series([], dtype: category
Categories (0, float64): [])

@chris-b1 chris-b1 added Bug IO CSV read_csv, to_csv Categorical Categorical Data Type labels Nov 7, 2016
@chris-b1 chris-b1 added this to the Next Major Release milestone Nov 7, 2016
@jreback
Contributor

jreback commented Nov 7, 2016

You can specify the dtype, so I think it maybe should be object. (It shouldn't matter if you union, as it will coerce, I think; that may need checking.)

@jreback
Contributor

jreback commented Nov 7, 2016

You could also give np.empty(0, dtype='object'), instead of an empty list, to the Categorical constructor.
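The two constructions discussed above can be compared side by side. This is only a sketch: the variable names are illustrative, and the categories dtype reported for the empty-list form may differ between pandas versions, so the code prints it rather than assuming a value.

```python
import numpy as np
import pandas as pd

# Empty list: pandas has to guess a dtype for the (empty) categories.
from_list = pd.Categorical([])

# Empty object array: the input already carries an explicit dtype.
from_object_array = pd.Categorical(np.empty(0, dtype=object))

# Inspect what each construction chose for its categories.
print(from_list.categories.dtype)
print(from_object_array.categories.dtype)

# Either one can back an empty categorical Series.
empty_series = pd.Series(from_object_array)
```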

@gfyoung
Member

gfyoung commented Nov 26, 2016

@jreback : #14717 patches this up with a test included to confirm!

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Nov 26, 2016
Issue pandas-dev#14606 was fixed by PR pandas-dev#14717, adding one more specific test to confirm this
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.2, Next Major Release Nov 26, 2016
@jorisvandenbossche
Member

@gfyoung Thanks for the notice. Your PR indeed fixed this issue, but I added one more test to specifically confirm the fix of this issue: #14752 (dtype='category' did not give an error before, while dtype={'a': 'category'} did)
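
A check covering both spellings of the dtype argument might look like the sketch below. It assumes a pandas release that includes the fix (0.19.2 or later, going by the milestone above); the helper read_empty is hypothetical and is not the test that was merged.

```python
from io import StringIO

import pandas as pd


def read_empty(dtype):
    # Same call as in the report; only the dtype argument varies.
    return pd.read_csv(StringIO(""), names=["a"], dtype=dtype)


# Scalar form: reported as working even before the fix.
scalar_form = read_empty("category")

# Dict form: the one that raised TypeError in the report.
dict_form = read_empty({"a": "category"})

print(scalar_form.dtypes)
print(dict_form.dtypes)
```

On a fixed version, both calls should return an empty DataFrame whose single column a is categorical.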

jorisvandenbossche added a commit that referenced this issue Dec 10, 2016
…14752)

Issue #14606 was fixed by PR #14717, adding one more specific test to confirm this
ischurov pushed a commit to ischurov/pandas that referenced this issue Dec 19, 2016