Read CSV error_bad_lines does not error for too many values in first data row #12519

jarekszymczak · 2016-03-03T13:13:37Z

Hi, I would like to report unexpected behaviour related to the error_bad_lines option (I mention the option by name to make this report easier to find if someone else runs into the same problem).

Given the following two CSV files:
simple.csv

Col1,Col2
4,Here
5,7,Invalid
8,Row
9,Does
10,Break

simple2.csv

Col1,Col2
5,7,Invalid
4,Row
8,Does
9,Not
10,Break

The following code breaks, as expected:

df = pd.read_csv('simple.csv')

Whether the following code should break is open to discussion (though the behaviour is inconsistent, depending on whether the error is in the first data row or a later one):

df2 = pd.read_csv('simple2.csv')

Here the first column is read as the index, so this comes down to another reported issue (regarding erroring on too few values), which I will not go into here.

So the result is as follows:

    Col1        Col2
5   7       Invalid
4   Row     NaN
8   Does    NaN
9   Not     NaN
10  Break   NaN
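For reference, this first result can be reproduced without files on disk. The sketch below is an illustration rather than part of the original report: it assumes a reasonably recent pandas and feeds the same content as simple2.csv through `io.StringIO` (the `SIMPLE2` name is just a placeholder). Because the first data row has one more field than the header, `read_csv` uses the first column as the index.

```python
import io

import pandas as pd

# Same content as simple2.csv: the header has 2 fields, the first data row has 3.
SIMPLE2 = "Col1,Col2\n5,7,Invalid\n4,Row\n8,Does\n9,Not\n10,Break\n"

df2 = pd.read_csv(io.StringIO(SIMPLE2))

# The extra leading field is taken as the index instead of being reported as an error.
print(df2.index.tolist())
print(df2.columns.tolist())
print(df2)
```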

However, even if I explicitly specify that there is no index in the CSV:

df3 = pd.read_csv('simple2.csv', index_col=False)

It still works, yielding the result:

    Col1    Col2
0   5       7
1   4       Row
2   8       Does
3   9       Not
4   10      Break

And this, I believe, is definitely a bug. I discovered it by accident: in the CSV I was about to read, the comma was also used as the decimal separator in the first column, and the totally corrupted CSV ended up being read and parsed as a DataFrame.
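Until the parser behaviour is settled, one way to guard against this kind of silent corruption is to check field counts before handing the file to pandas. The following is a minimal sketch using only the standard library `csv` module; `find_ragged_rows` is a hypothetical helper name, not a pandas function, and it assumes the first line is the header.

```python
import csv
import io


def find_ragged_rows(text, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    expected = len(next(reader))  # the header row defines the expected width
    return [
        (reader.line_num, len(row))
        for row in reader
        if row and len(row) != expected  # skip blank lines
    ]


# Same content as simple2.csv from the report above.
SIMPLE2 = "Col1,Col2\n5,7,Invalid\n4,Row\n8,Does\n9,Not\n10,Break\n"

bad = find_ragged_rows(SIMPLE2)
print(bad)
```

An empty result means every non-blank row has as many fields as the header; anything else can be rejected before calling `read_csv`.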

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: 1.3.7
pip: 8.0.3
setuptools: 20.1.1
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
Jinja2: 2.8


jreback · 2016-03-03T13:52:39Z

looks like it. thanks!

pull-requests are welcome to fix!

vlfom · 2016-03-04T15:20:00Z

I'd like to contribute.
The problem is in this line: https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L539
However, the tests say that such behaviour is expected (lines 376-390): https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_cparser.py#L376
So is this really a bug?

jreback · 2016-03-04T15:23:56Z

http://pandas.pydata.org/pandas-docs/stable/contributing.html are the contributing docs

you can submit a pr for code comments

Addresses issue in pandas-dev#12519 by raising exception when 'filepath_or_buffer' in 'read_csv' contains different number of fields in input lines.

gfyoung · 2016-06-03T11:05:44Z

These phenomena also exist with the Python parser, but I will say the second example is handled in internal documentation (see here). In light of that, I wouldn't consider the behaviour (the df2 case) inconsistent or problematic AFAICT.

phofl · 2021-11-28T13:21:18Z

We raise a ParserWarning in this case since 1.3, see #21768
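To see what an installed version does in the `index_col=False` case, the warnings can be recorded explicitly. This is only a sketch: it assumes the content of simple2.csv from the original report, and whether anything is recorded depends on the pandas version (per the comment above, a ParserWarning is expected from 1.3 onwards).

```python
import io
import warnings

import pandas as pd

# Same content as simple2.csv: the first data row has one field too many.
SIMPLE2 = "Col1,Col2\n5,7,Invalid\n4,Row\n8,Does\n9,Not\n10,Break\n"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure no warning is filtered out
    df3 = pd.read_csv(io.StringIO(SIMPLE2), index_col=False)

# Print whatever was recorded; older versions may record nothing here.
for w in caught:
    print(w.category.__name__, w.message)

print(df3)
```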

jreback added Bug IO CSV read_csv, to_csv Difficulty Intermediate labels Mar 3, 2016

jreback added this to the Next Major Release milestone Mar 3, 2016

vlfom mentioned this issue Mar 4, 2016

BUG: Add exception when different number of fields present in file lines #12526

Closed

jbrockmendel removed Difficulty Intermediate labels Oct 21, 2019

phofl closed this as completed Nov 28, 2021
