csv parsing is too restrictive #9294

Closed
kousu opened this issue Jan 19, 2015 · 9 comments
Labels: Error Reporting, IO CSV, Usage Question

kousu commented Jan 19, 2015

On this csv file:

DecisionM,IntelligentM,freq,total
0, 5, 9, 20 
0, 6, 21,33
0, 7, 35,65
0, 8, 35,83
0, 9, 14,41
0, 10, 10,26
1, 5, 11,20
1, 6, 12,33
1, 7, 30,65
1, 8, 48,83
1, 9, 27, 41,,
1, 10, 16, 26

pandas.read_csv() gives me

---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-9-28b9c031f12d> in <module>()
----> 1 pandas.read_csv("speeddating.csv")

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    463                     skip_blank_lines=skip_blank_lines)
    464 
--> 465         return _read(filepath_or_buffer, kwds)
    466 
    467     parser_f.__name__ = name

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    249         return parser
    250 
--> 251     return parser.read()
    252 
    253 _parser_defaults = {

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
    708                 raise ValueError('skip_footer not supported for iteration')
    709 
--> 710         ret = self._engine.read(nrows)
    711 
    712         if self.options.get('as_recarray'):

/usr/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
   1157 
   1158         try:
-> 1159             data = self._reader.read(nrows)
   1160         except StopIteration:
   1161             if nrows is None:

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader.read (pandas/parser.c:7403)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7643)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader._read_rows (pandas/parser.c:8265)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8139)()

/usr/lib/python3.4/site-packages/pandas/parser.cpython-34m.so in pandas.parser.raise_parser_error (pandas/parser.c:20776)()

CParserError: Error tokenizing data. C error: Expected 4 fields in line 12, saw 6

This seems odd to me. There's nothing in those trailing fields to worry about anyway; they can just be dropped. Even if there were data there, a field without a column to go with it should be dropped too. Lots of CSV files are messy like this; that's basically why they're used all the time.

I am on pandas 0.15.2 and Python 3.

jreback (Contributor) commented Jan 19, 2015

You need to explicitly allow this; otherwise it's designed to give an error.

In [8]: read_csv(StringIO(data),error_bad_lines=False)
Skipping line 12: expected 4 fields, saw 6

Out[8]: 
    DecisionM  IntelligentM  freq  total
0           0             5     9     20
1           0             6    21     33
2           0             7    35     65
3           0             8    35     83
4           0             9    14     41
5           0            10    10     26
6           1             5    11     20
7           1             6    12     33
8           1             7    30     65
9           1             8    48     83
10          1            10    16     26
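
For readers on current pandas: `error_bad_lines`/`warn_bad_lines` were later replaced by `on_bad_lines` (pandas 1.3+), so the equivalent of the call above today would be roughly this sketch:

```python
from io import StringIO

import pandas as pd

# A shortened version of the file from this issue; the "1, 9, 27, 41,,"
# row has 6 fields while the header declares 4.
data = """DecisionM,IntelligentM,freq,total
1, 8, 48,83
1, 9, 27, 41,,
1, 10, 16, 26
"""

# pandas >= 1.3: on_bad_lines="skip" replaces error_bad_lines=False;
# on_bad_lines="warn" keeps the old skip-and-report behaviour.
df = pd.read_csv(StringIO(data), on_bad_lines="skip")
print(df.shape)  # the 6-field row is dropped, leaving 2 rows x 4 columns
```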

jreback closed this as completed Jan 19, 2015
jreback added the IO CSV, Error Reporting, and Usage Question labels Jan 19, 2015
szeitlin commented

when I try adding error_bad_lines=False to mine, I get this very verbose output:

b'Skipping line 6: expected 2 fields, saw 8\nSkipping line 7: expected 2 fields, saw 8\nSkipping line 8: expected 2 fields, saw 8\nSkipping line 9: expected 2 fields, saw 8\nSkipping line 10: expected 2 fields, saw 8\nSkipping line 11: expected 2 fields, saw 8\nSkipping line 12: expected 2 fields, saw 8\nSkipping line 13: expected 2 fields, saw 8\nSkipping line 14: expected 2 fields, saw 8\nSkipping line 15: expected 2 fields, saw 8\ etc.

Looks like pandas is assuming the first line is the start of the data table, and is trying to tokenize based on the number of fields it found there, even though the entire rest of the file has 8 fields per row.

I know that the header on this file is such that if I do skiprows=4, pandas can read it just fine.

I don't think R has the ability to autodetect headers, but it sure would be nice if pandas had an option to do it, or if we could specify the number of rows to look at, to help determine what might be a header (?). Just an idea.
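
The heuristic described above can be sketched with the stdlib csv module alone; `guess_header_row` is a hypothetical helper, not a pandas feature, and it simply assumes the data table starts at the first row whose field count matches the most common field count in a small sample:

```python
import csv
from io import StringIO

def guess_header_row(text, sample_rows=20):
    """Guess which row starts the data table (hypothetical helper).

    Counts fields per row over a small sample and returns the index of
    the first row matching the most common field count.
    """
    rows = list(csv.reader(StringIO(text)))[:sample_rows]
    counts = [len(r) for r in rows]
    modal = max(set(counts), key=counts.count)
    return next(i for i, c in enumerate(counts) if c == modal)

# A file with a free-form two-line preamble before the real header:
messy = "file exported 2015-01-19\nnotes,\na,b,c,d\n1,2,3,4\n5,6,7,8\n"
skip = guess_header_row(messy)
print(skip)  # 2; pd.read_csv(StringIO(messy), skiprows=skip) would then work
```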

TomAugspurger (Contributor) commented

Just my 2 cents, but I don't think pandas should try to guess here. Even though warning on every line past the first one is almost surely not what the user wants, it could also indicate a bad file, where a column wasn't written correctly.

Perhaps with another keyword or option to read_csv, but even then this seems like something that's tough to get right often enough to rely on. My preference is to put that burden on the user. Thoughts?

kawochen (Contributor) commented

Input files should follow RFC 4180 as closely as possible. The optional header should be the first line, so we shouldn't need to look past the first line to determine the header.

szeitlin commented

My point is that it's a common problem, and pandas could help alleviate the issue, which would save everyone time and duplicated effort.

Even if there were just an optional flag to return a list of tuples structured as (line number, number of fields in that line), that kind of output would be more useful than what it does now, which actually already includes that information, just not in a usable format.
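
The diagnostic proposed here is a one-liner with the stdlib csv module; a minimal sketch:

```python
import csv
from io import StringIO

def field_counts(text):
    """Return (line_number, n_fields) for every row, 1-indexed."""
    return [(i, len(row))
            for i, row in enumerate(csv.reader(StringIO(text)), start=1)]

data = "a,b\n1,2\n1,2,3,4\n"
# Keep only the rows whose field count disagrees with the header's 2:
bad = [(n, k) for n, k in field_counts(data) if k != 2]
print(bad)  # [(3, 4)]
```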

jreback (Contributor) commented Sep 30, 2015

@szeitlin there are a number of issues w.r.t. how to report bad lines / warnings, e.g. see #5686

so maybe someone can propose an API, e.g. we could do something like (the first two calls replicate the current API):

df = read_csv(......, errors='raise')
df = read_csv(......, errors='warn')
df, errors = read_csv(...., return_errors=True)

and errors would be something like:

errors = [ { 'line' : 10, 'contents' : ..... ,'error' : ..... }, ..... ]
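
For what it's worth, newer pandas (1.4+) grew something close to this via a callable `on_bad_lines` (python engine only). A sketch of collecting an errors list in roughly the proposed shape, though the callable only receives the parsed fields, not the line number:

```python
from io import StringIO

import pandas as pd

errors = []

def on_bad(bad_line):
    # bad_line arrives as a list of the parsed fields of the bad row.
    errors.append({"n_fields": len(bad_line), "contents": bad_line})
    return None  # returning None drops the row

data = "a,b\n1,2\n3,4,5\n6,7\n"
df = pd.read_csv(StringIO(data), engine="python", on_bad_lines=on_bad)
print(len(df), errors)  # 2 [{'n_fields': 3, 'contents': ['3', '4', '5']}]
```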

randomgambit commented

Something I was curious about is whether we can still retrieve these bogus cells in some way... My understanding is that error_bad_lines=False will just skip these lines. What about keeping them and adding some fake columns to hold them?

szeitlin commented

So this was a while ago, but I ended up actually catching those errors and parsing them, in order to figure out what was going on with my files. It worked ok, but it's probably kind of fragile (if anything about this part of pandas changes, it will break). I'd post the code here, but it was at my old job, so I don't have access to the repo anymore. 😂
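
Reconstructing that approach as a sketch (not szeitlin's actual code): the C parser's error message embeds the line number and both field counts, which a regex can pull out. Fragile by design, since it breaks if pandas ever rewords the message.

```python
import re

# The message text from the traceback at the top of this issue:
msg = "Error tokenizing data. C error: Expected 4 fields in line 12, saw 6"

m = re.search(r"Expected (\d+) fields in line (\d+), saw (\d+)", msg)
expected, line_no, saw = map(int, m.groups())
print(line_no, expected, saw)  # 12 4 6
```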

randomgambit commented

Yeah, these bogus lines are unfortunately difficult to catch (at least in native pandas):
http://stackoverflow.com/questions/38902553/error-tokenizing-data-during-pandas-read-csv-how-to-actually-see-the-bad-lines
