BUG: read_csv raises exception if first row is incomplete #6710

Closed
altaurog opened this issue Mar 26, 2014 · 10 comments
Labels
Bug IO CSV read_csv, to_csv

@altaurog

xref #4749
xref #8985

This is a bit like issue #4749, but the conditions are different.

I have been able to reproduce it by specifying usecols:

In [98]: mydata='1,2,3\n1,2\n1,2\n'
In [99]: pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
Out[99]: 
   a   c
0  1   3
1  1 NaN
2  1 NaN

[3 rows x 2 columns]
In [100]: mydata='1,2\n1,2,3\n4,5,6\n'
In [101]: pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-101-f60127771eeb> in <module>()
----> 1 pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])

/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
    418                     infer_datetime_format=infer_datetime_format)
    419 
--> 420         return _read(filepath_or_buffer, kwds)
    421 
    422     parser_f.__name__ = name

/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    223         return parser
    224 
--> 225     return parser.read()
    226 
    227 _parser_defaults = {

/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
    624                 raise ValueError('skip_footer not supported for iteration')
    625 
--> 626         ret = self._engine.read(nrows)
    627 
    628         if self.options.get('as_recarray'):

/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
   1068 
   1069         try:
-> 1070             data = self._reader.read(nrows)
   1071         except StopIteration:
   1072             if nrows is None:

/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader.read (pandas/parser.c:6866)()

/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7086)()

/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._read_rows (pandas/parser.c:7691)()

/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:7575)()

/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.raise_parser_error (pandas/parser.c:19038)()

CParserError: Error tokenizing data. C error: Expected 2 fields in line 2, saw 3

In [102]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-4-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.20
numpy: 1.8.0
scipy: None
statsmodels: None
IPython: 1.2.1
sphinx: 1.1.3
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: 0.8.0
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: 0.7.4
xlsxwriter: None
sqlalchemy: None
lxml: None
bs4: None
html5lib: 0.999
bq: None
apiclient: None
@jreback
Contributor

jreback commented Mar 26, 2014

hmm..

I think the error is useful here, rather than silently providing short-row behavior. You said the header had 3 fields (e.g. by specifying names and then using those columns).

What would you have the results be?

@altaurog
Author

I don't see why the behavior when the first row is short should differ from when the second row is short. If I specify three columns in the header, then I want three columns. Any row that has fewer values than that should be filled out with NaN, assuming the dtype supports it, thus:

In [100]: mydata='1,2\n1,2,3\n4,5,6\n'
In [101]: pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
Out[101]: 
   a   c
0  1 NaN
1  1   3
2  4   6

[3 rows x 2 columns]

The current behavior is inconsistent and frankly not very useful. My understanding from the docs is that the intention is to handle real-world data files intelligently and gracefully so we can go about getting our work done. Raising an exception in this case is neither intelligent nor graceful.

In any case, I might as well mention that the obvious workaround is to omit usecols and slice the DataFrame afterwards:

In [100]: mydata='1,2\n1,2,3\n4,5,6\n'
In [101]: df = pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'])[['a', 'c']]

The result is the same, but this approach sacrifices "much faster parsing time and lower memory usage" which we were promised with usecols.
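For anyone who wants to keep usecols despite the exception, here is a minimal sketch of a pre-padding workaround (the `read_csv_padded` helper is hypothetical, not part of pandas): normalize every row to the expected field count with the stdlib csv module before handing the text to read_csv, so no row is shorter than the header.

```python
import csv
import io

import pandas as pd


def read_csv_padded(text, names, usecols):
    """Hypothetical helper: pad short rows with empty fields so that
    read_csv with usecols never sees fewer fields than len(names)."""
    rows = csv.reader(io.StringIO(text))
    padded = "\n".join(
        ",".join(row + [""] * (len(names) - len(row))) for row in rows
    )
    return pd.read_csv(io.StringIO(padded), names=names, usecols=usecols)


mydata = "1,2\n1,2,3\n4,5,6\n"
df = read_csv_padded(mydata, names=["a", "b", "c"], usecols=["a", "c"])
# The short first row's missing third field becomes NaN in column 'c'.
```

This pays the cost of tokenizing the file twice, so it gives up the same usecols performance benefit as the slicing workaround, but it keeps the call site looking like the original.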

@hayd hayd added Bug labels Mar 26, 2014
@jreback
Contributor

jreback commented Mar 26, 2014

the difference is that the first row is a header, and this is a misspecification of the names

so you are asking pandas to decide which parameter is valid here

I think that is a good reason to raise an exception

that said, I would consider a pull request to modify this to just take the names (and maybe issue a warning) - I think as a user you would want to know that your file is malformed?

otherwise you would have set header=None and just used skiprows, no?

@altaurog
Author

Thanks for your explanation.

In this case, the first row is not a header and the file is not malformed any more than it would be if subsequent lines were short. Perhaps I misunderstood, but I was under the impression that header is set to None implicitly when I specify names in the call to read_csv. In any case, the exception is raised even with an explicit header=None.

@jreback
Contributor

jreback commented Mar 26, 2014

hmm...that could be a bug then if header=None - want to try to fix this?

there is a complex interaction between the parameters; this is most probably an untested case.

@jreback jreback added this to the 0.15.0 milestone Mar 26, 2014
@altaurog
Author

I'm new to pandas. (I am really impressed and very much enjoying it, I should add.) I would love to fix this, but I had a look at the code and wasn't quite able to follow it. Unfortunately, I don't have the time right now to track this down and fix it. The performance benefits of usecols aren't a big loss for me, so I'm just going to press ahead with the workaround I mentioned above. The fact that it worked without usecols seemed to me to indicate a bug, though, so I thought I should at least create an issue for it even if I can't fix it. Now you have one fewer untested case.

@jreback
Contributor

jreback commented Mar 26, 2014

@altaurog great...thanks for reporting. marked as a bug.

@evanpw
Contributor

evanpw commented Sep 18, 2015

The case where usecols is set actually has explicitly different behavior (in parser.pyx):

# Enforce this unless usecols
if not self.has_usecols:
    self.parser.expected_fields = len(self.names)

Do we want to get rid of this check? I don't see a good reason for it (and all tests pass without it), but maybe I'm missing something.

@gfyoung
Member

gfyoung commented Aug 22, 2016

An update here: seems like the C engine is happy, but the Python engine is a little sad:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = '1,2\n1,2,3'
>>> usecols = ['a', 'c']
>>> names = ['a', 'b', 'c']
>>>
>>> read_csv(StringIO(data), usecols=usecols, names=names, engine='c')
   a    c
0  1  NaN
1  1  3.0
>>> read_csv(StringIO(data), usecols=usecols, names=names, engine='python')
...
ValueError: Number of passed names did not match number of header fields in the file

I suspect the error in the Python engine is being raised because we only check the first row and compare it to the header.
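The contrast between the two engines can be sketched with a toy tokenizer (hypothetical code, not the actual pandas internals): a strict parser that only compares the first row against names raises as soon as a later row is longer, while a tolerant parser pads every short row instead.

```python
def parse_strict(text, names):
    """Sketch of the Python-engine-style check: validate only the
    first row's field count against the passed names."""
    rows = [line.split(",") for line in text.splitlines()]
    if len(rows[0]) != len(names):
        raise ValueError(
            "Number of passed names did not match number of "
            "header fields in the file"
        )
    return rows


def parse_tolerant(text, names):
    """Sketch of the tolerant behavior: pad every short row with
    missing markers up to len(names)."""
    rows = [line.split(",") for line in text.splitlines()]
    return [row + [None] * (len(names) - len(row)) for row in rows]


data = "1,2\n1,2,3"
names = ["a", "b", "c"]

parse_tolerant(data, names)  # [['1', '2', None], ['1', '2', '3']]
```

Under this sketch, `parse_strict` happens to accept `data` here (the first row is only checked against names, which would fail since it has 2 fields, not 3) - the point is simply that a per-row padding rule, as in `parse_tolerant`, gives consistent results no matter which row is short.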

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 5, 2017
@jreback jreback modified the milestones: 0.20.0, Next Major Release Jan 9, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 10, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 11, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
@dmitriyshashkin

Strangely enough, I'm having the same problem in 1.3.5.
