Read CSV error_bad_lines does not error for too many values in first data row #12519

Closed
jarekszymczak opened this issue Mar 3, 2016 · 5 comments
Labels
Bug IO CSV read_csv, to_csv

Comments

@jarekszymczak

Hi, I would like to report unexpected behaviour related to the error_bad_lines option (I mention the option by name to make this report easier to find if someone runs into the same problem).

Given the following two CSV files:
simple.csv

Col1,Col2
4,Here
5,7,Invalid
8,Row
9,Does
10,Break

simple2.csv

Col1,Col2
5,7,Invalid
4,Row
8,Does
9,Not
10,Break

The following code breaks, as expected:

df = pd.read_csv('simple.csv')

Whether the following code should break is open to discussion (though the behaviour is inconsistent, depending on whether the error is in the first data row or a later one):

df2 = pd.read_csv('simple2.csv')

Here the first column is read as the index, so this comes down to another potential issue that has already been reported (regarding erroring on too few values), and I am not going into it here.

So the result is as follows:

    Col1    Col2
5   7       Invalid
4   Row     NaN
8   Does    NaN
9   Not     NaN
10  Break   NaN

However, even if I explicitly specify that there is no index column in the CSV:

df3 = pd.read_csv('simple2.csv', index_col=False)

It still works, yielding the result:

    Col1    Col2
0   5       7
1   4       Row
2   8       Does
3   9       Not
4   10      Break

And I believe this is definitely a bug. I discovered it by accident: in the CSV I was about to read, the comma was also used as the decimal separator in the first column, and the totally corrupted CSV was still read and parsed into a DataFrame.
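As a possible stopgap for files like the ones above, the field counts can be checked before the data ever reaches read_csv. The following is only a minimal sketch using the standard-library csv module, not anything pandas does internally; the helper name find_ragged_rows is hypothetical, and it assumes a comma delimiter and no quoted fields spanning multiple lines.

```python
import csv
import io


def find_ragged_rows(text, delimiter=","):
    """Return (line_number, field_count) for each data row whose field
    count differs from the header's. Line numbers are 1-based and the
    header counts as line 1. Hypothetical helper, not a pandas API."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to check
    expected = len(header)
    return [
        (reader.line_num, len(row))
        for row in reader
        if row and len(row) != expected  # blank lines are skipped
    ]


# Contents of simple.csv and simple2.csv from this report
SIMPLE = "Col1,Col2\n4,Here\n5,7,Invalid\n8,Row\n9,Does\n10,Break\n"
SIMPLE2 = "Col1,Col2\n5,7,Invalid\n4,Row\n8,Does\n9,Not\n10,Break\n"

print(find_ragged_rows(SIMPLE))   # bad row is on line 3
print(find_ragged_rows(SIMPLE2))  # bad row is on line 2, the first data row
```

Both files are flagged this way, regardless of whether the extra value is in the first data row or a later one.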

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: 1.3.7
pip: 8.0.3
setuptools: 20.1.1
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
Jinja2: 2.8

@jreback jreback added this to the Next Major Release milestone Mar 3, 2016
@jreback
Contributor

jreback commented Mar 3, 2016

looks like it. thanks!

pull-requests are welcome to fix!

@vlfom

vlfom commented Mar 4, 2016

I'd like to contribute.
The problem is in this line: https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L539
However, the tests say that such behaviour is expected (lines 376-390): https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_cparser.py#L376
So is this really a bug?

@jreback
Contributor

jreback commented Mar 4, 2016

http://pandas.pydata.org/pandas-docs/stable/contributing.html are the contributing docs

you can submit a pr for code comments

vlfom added a commit to vlfom/pandas that referenced this issue Mar 4, 2016
Addresses issue in pandas-dev#12519 by raising exception when 'filepath_or_buffer'
in 'read_csv' contains different number of fields in input lines.
@gfyoung
Member

gfyoung commented Jun 3, 2016

These phenomena also exist with the Python parser, but I will say the second example is handled in internal documentation (see here). In light of that, I wouldn't consider the behaviour (the df2 case) inconsistent or problematic AFAICT.

@phofl
Member

phofl commented Nov 28, 2021

We have raised a ParserWarning in this case since 1.3; see #21768.
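For anyone who would rather have the df3 case fail outright than warn, the warning can be escalated to an exception with the standard warnings module. This is a rough sketch that assumes pandas 1.3 or later, where the ParserWarning mentioned above is emitted; read_strict is a hypothetical wrapper name, not part of pandas.

```python
import io
import warnings

import pandas as pd

# Contents of simple2.csv from the original report
SIMPLE2 = "Col1,Col2\n5,7,Invalid\n4,Row\n8,Does\n9,Not\n10,Break\n"


def read_strict(text):
    """Read CSV text with index_col=False, turning any ParserWarning
    into an exception. Hypothetical wrapper; assumes pandas >= 1.3."""
    with warnings.catch_warnings():
        # Escalate only ParserWarning; other warnings keep their filters.
        warnings.simplefilter("error", category=pd.errors.ParserWarning)
        return pd.read_csv(io.StringIO(text), index_col=False)


try:
    read_strict(SIMPLE2)
except pd.errors.ParserWarning as exc:
    print("rejected:", exc)
```

The filter is scoped to the with block, so the rest of the program keeps its normal warning behaviour.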

@phofl phofl closed this as completed Nov 28, 2021