csv parsing is too restrictive #9294

Closed
kousu opened this issue Jan 19, 2015 · 9 comments
Labels: Error Reporting, IO CSV, Usage Question

kousu commented Jan 19, 2015

On this csv file:

DecisionM,IntelligentM,freq,total
0, 5, 9, 20 
0, 6, 21,33
0, 7, 35,65
0, 8, 35,83
0, 9, 14,41
0, 10, 10,26
1, 5, 11,20
1, 6, 12,33
1, 7, 30,65
1, 8, 48,83
1, 9, 27, 41,,
1, 10, 16, 26

pandas.read_csv() gives me

---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-9-28b9c031f12d> in <module>()
----> 1 pandas.read_csv("speeddating.csv")

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    463                     skip_blank_lines=skip_blank_lines)
    464 
--> 465         return _read(filepath_or_buffer, kwds)
    466 
    467     parser_f.__name__ = name

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    249         return parser
    250 
--> 251     return parser.read()
    252 
    253 _parser_defaults = {

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
    708                 raise ValueError('skip_footer not supported for iteration')
    709 
--> 710         ret = self._engine.read(nrows)
    711 
    712         if self.options.get('as_recarray'):

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
   1157 
   1158         try:
-> 1159             data = self._reader.read(nrows)
   1160         except StopIteration:
   1161             if nrows is None:

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader.read (pandas/parser.c:7403)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7643)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader._read_rows (pandas/parser.c:8265)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8139)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.raise_parser_error (pandas/parser.c:20776)()

CParserError: Error tokenizing data. C error: Expected 4 fields in line 12, saw 6

This seems odd to me. There's nothing in those trailing fields to worry about anyway; they can just be dropped. Even if there were data there, a field without a column to go with it should be dropped too. Lots of CSV files are messy like this; that's basically why they're used all the time.

I am on pandas 0.15.2 and Python 3.

jreback (Contributor) commented Jan 19, 2015

You need to explicitly allow this; otherwise it's designed to give an error.

In [8]: read_csv(StringIO(data),error_bad_lines=False)
Skipping line 12: expected 4 fields, saw 6

Out[8]: 
    DecisionM  IntelligentM  freq  total
0           0             5     9     20
1           0             6    21     33
2           0             7    35     65
3           0             8    35     83
4           0             9    14     41
5           0            10    10     26
6           1             5    11     20
7           1             6    12     33
8           1             7    30     65
9           1             8    48     83
10          1            10    16     26
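
For readers on current pandas: `error_bad_lines`/`warn_bad_lines` were later replaced by `on_bad_lines` (pandas 1.3+), so the equivalent of the call above today would be roughly this sketch:

```python
from io import StringIO

import pandas as pd

# A shortened version of the file from this issue; the "1, 9, 27, 41,,"
# row has 6 fields while the header declares 4.
data = """DecisionM,IntelligentM,freq,total
1, 8, 48,83
1, 9, 27, 41,,
1, 10, 16, 26
"""

# pandas >= 1.3: on_bad_lines="skip" replaces error_bad_lines=False;
# on_bad_lines="warn" keeps the old skip-and-report behaviour.
df = pd.read_csv(StringIO(data), on_bad_lines="skip")
print(df.shape)  # the 6-field row is dropped, leaving 2 rows x 4 columns
```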

jreback closed this as completed Jan 19, 2015
jreback added the IO CSV, Error Reporting, and Usage Question labels Jan 19, 2015
szeitlin commented

when I try adding error_bad_lines=False to mine, I get this very verbose output:

b'Skipping line 6: expected 2 fields, saw 8\nSkipping line 7: expected 2 fields, saw 8\nSkipping line 8: expected 2 fields, saw 8\nSkipping line 9: expected 2 fields, saw 8\nSkipping line 10: expected 2 fields, saw 8\nSkipping line 11: expected 2 fields, saw 8\nSkipping line 12: expected 2 fields, saw 8\nSkipping line 13: expected 2 fields, saw 8\nSkipping line 14: expected 2 fields, saw 8\nSkipping line 15: expected 2 fields, saw 8\ etc.

Looks like pandas is assuming the first line is the start of the data table, and is trying to tokenize based on the number of fields it found there, even though the entire rest of the file has 8 fields per row.

I know that the header on this file is such that if I do skiprows=4, pandas can read it just fine.

I don't think R has the ability to autodetect headers, but it sure would be nice if pandas had an option to do it, or if we could specify the number of rows to look at, to help determine what might be a header (?). Just an idea.
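
The heuristic described above can be sketched with the stdlib csv module alone; `guess_header_row` is a hypothetical helper, not a pandas feature, and it simply assumes the data table starts at the first row whose field count matches the most common field count in a small sample:

```python
import csv
from io import StringIO

def guess_header_row(text, sample_rows=20):
    """Guess which row starts the data table (hypothetical helper).

    Counts fields per row over a small sample and returns the index of
    the first row matching the most common field count.
    """
    rows = list(csv.reader(StringIO(text)))[:sample_rows]
    counts = [len(r) for r in rows]
    modal = max(set(counts), key=counts.count)
    return next(i for i, c in enumerate(counts) if c == modal)

# A file with a free-form two-line preamble before the real header:
messy = "file exported 2015-01-19\nnotes,\na,b,c,d\n1,2,3,4\n5,6,7,8\n"
skip = guess_header_row(messy)
print(skip)  # 2; pd.read_csv(StringIO(messy), skiprows=skip) would then work
```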

TomAugspurger (Contributor) commented

Just my 2 cents, but I don't think pandas should try to guess here. Even though warning on every line past the first one is almost surely not what the user wants, it could also indicate a bad file, where a column wasn't written correctly.

Perhaps with another keyword or option to read_csv, but even then this seems like something that's tough to get right often enough to rely on. My preference is to put that burden on the user. Thoughts?

kawochen (Contributor) commented

Input files should follow RFC 4180 as closely as possible. The optional header should be the first line, so we shouldn't need to look past the first line to determine the header.

szeitlin commented

My point is that it's a common problem, and pandas could help alleviate the issue, which would save everyone time and duplicated effort.

Even if there were just an optional flag to return a list of tuples structured as (line number, number of fields in that line), that kind of output would be more useful than what it does now, which actually already includes that information, just not in a usable format.
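
The diagnostic proposed here is a one-liner with the stdlib csv module; a minimal sketch:

```python
import csv
from io import StringIO

def field_counts(text):
    """Return (line_number, n_fields) for every row, 1-indexed."""
    return [(i, len(row))
            for i, row in enumerate(csv.reader(StringIO(text)), start=1)]

data = "a,b\n1,2\n1,2,3,4\n"
# Keep only the rows whose field count disagrees with the header's 2:
bad = [(n, k) for n, k in field_counts(data) if k != 2]
print(bad)  # [(3, 4)]
```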

jreback (Contributor) commented Sep 30, 2015

@szeitlin there are a number of issues w.r.t. how to report bad lines / warnings, e.g. see #5686

so maybe someone can propose an API, e.g. we could do something like (the first two calls replicate the current API):

df = read_csv(......, errors='raise')
df = read_csv(......, errors='warn')
df, errors = read_csv(...., return_errors=True)

and errors would be something like:

errors = [ { 'line' : 10, 'contents' : ..... ,'error' : ..... }, ..... ]
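
For what it's worth, newer pandas (1.4+) grew something close to this via a callable `on_bad_lines` (python engine only). A sketch of collecting an errors list in roughly the proposed shape, though the callable only receives the parsed fields, not the line number:

```python
from io import StringIO

import pandas as pd

errors = []

def on_bad(bad_line):
    # bad_line arrives as a list of the parsed fields of the bad row.
    errors.append({"n_fields": len(bad_line), "contents": bad_line})
    return None  # returning None drops the row

data = "a,b\n1,2\n3,4,5\n6,7\n"
df = pd.read_csv(StringIO(data), engine="python", on_bad_lines=on_bad)
print(len(df), errors)  # 2 [{'n_fields': 3, 'contents': ['3', '4', '5']}]
```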

randomgambit commented

Something I was curious about is whether we can still retrieve these bogus cells in some way... My understanding is that error_bad_lines=False will just skip these lines. What about keeping them and adding some fake columns to hold them?

szeitlin commented

So this was a while ago, but I ended up actually catching those errors and parsing them, in order to figure out what was going on with my files. It worked ok, but it's probably kind of fragile (if anything about this part of pandas changes, it will break). I'd post the code here, but it was at my old job, so I don't have access to the repo anymore. 😂
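
Reconstructing that approach as a sketch (not szeitlin's actual code): the C parser's error message embeds the line number and both field counts, which a regex can pull out. Fragile by design, since it breaks if pandas ever rewords the message.

```python
import re

# The message text from the traceback at the top of this issue:
msg = "Error tokenizing data. C error: Expected 4 fields in line 12, saw 6"

m = re.search(r"Expected (\d+) fields in line (\d+), saw (\d+)", msg)
expected, line_no, saw = map(int, m.groups())
print(line_no, expected, saw)  # 12 4 6
```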

randomgambit commented

Yeah, these bogus lines are unfortunately difficult to catch (at least in native pandas):
http://stackoverflow.com/questions/38902553/error-tokenizing-data-during-pandas-read-csv-how-to-actually-see-the-bad-lines
