read_csv crash if second input line is malformed #14782

Closed
karlacio opened this issue Dec 1, 2016 · 28 comments
Labels
Bug IO CSV read_csv, to_csv Segfault Non-Recoverable Error
Milestone

Comments

@karlacio

karlacio commented Dec 1, 2016

Problem description

Added 2020/04/21:

Similar issue to the OP's, but it segfaults (example taken from the comment below):

import pandas
import io

pandas.read_csv(io.StringIO("\na\nb\n"),  skip_blank_lines=False, header=1)

===

I'm using read_csv to import files containing data in 9 columns.
Some lines have a 'time tag' composed of two columns (the first line is always a 'time tag').
Some lines are malformed in various ways.

I use read_csv to read the data and skip bad lines with just a warning, and usually it works.

Recently I found a file that crashes my program; the crash seems to be caused by a malformed second line (not 9 fields).

The code:

import pandas as pd
print("before read_csv")
d = pd.read_csv("inputfile.txt",
                header=0,
                names=(11,22,33,44,55,66,77,88,99),
                compression='infer',
                error_bad_lines=False, warn_bad_lines=True)
print("after read_csv")
print(d)

works on input like this:

1,2
1,2,3,4,5,6,7,8,9
yunk
1,2,3,4,5,6,7,8,9

with output:

before read_csv
after read_csv
     11   22   33   44   55   66   77   88   99
0     1  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
1  yunk  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2     1  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0

But fails on input like this:

1,2
yunk
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9

with output:

before read_csv
b'Skipping line 3: expected 2 fields, saw 9\nSkipping line 4: expected 2 fields, saw 9\n'
Traceback (most recent call last):
  File ".../bug_pandas.read_scv.py", line 42, in <module>
    error_bad_lines=False, warn_bad_lines=True)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 1507, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:10364)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10640)
  File "pandas/parser.pyx", line 945, in pandas.parser.TextReader._read_rows (pandas/parser.c:11677)
  File "pandas/parser.pyx", line 1007, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:12627)
pandas.io.common.CParserError: Too many columns specified: expected 9 and found 2

Process finished with exit code 1

The problem arises both on OSX and on CentOS 7.

I tried to find out whether this is an already known bug, but I found nothing.
Sorry if it's already known.

Carlo.

Output of pd.show_versions() on OSX 10.11.6

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Output of pd.show_versions() on CentOS 7

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 29.0.1
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@gfyoung
Member

gfyoung commented Dec 2, 2016

@karlacio : Why are you specifying header if you are passing in names? That seems illogical since both are for the top row of the CSV file.

@karlacio
Author

karlacio commented Dec 2, 2016

You are right. If I remove header or set it to None it works.
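
For illustration, a minimal sketch of that fix on the failing input (hypothetical snippet, not from the original report; error_bad_lines is omitted because, with header=None, no row exceeds the 9 names):

```python
import io
import pandas as pd

# The failing input: with names alone (header defaults to None),
# the 9 names define the columns and shorter rows are padded with NaN.
data = "1,2\nyunk\n1,2,3,4,5,6,7,8,9\n1,2,3,4,5,6,7,8,9\n"
d = pd.read_csv(io.StringIO(data),
                header=None,
                names=(11, 22, 33, 44, 55, 66, 77, 88, 99))
print(d.shape)  # (4, 9)
```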

I don't know why it was there, maybe I hastily read the docs:

header : int or list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

In any case, is the code's behavior correct?
Should read_csv raise an error even when I set header incorrectly?

@gfyoung
Member

gfyoung commented Dec 2, 2016

@karlacio : Actually, that phrase is admittedly a little confusing. @jreback @jorisvandenbossche ? I suspect a documentation fix is needed there.

Because we process header before applying names, the answer to your last question is yes.

@karlacio
Author

karlacio commented Dec 2, 2016

@gfyoung
Ok thanks!

@jreback
Contributor

jreback commented Dec 4, 2016

yeah that does look a bit confusing. someone want to re-word? (I would put a blank line or 2 somewhere; that is a lot of text).

I am also suspicious of the utility of: Explicitly pass header=0 to be able to replace existing names.

@gfyoung can you divine from the code how this is useful?

@jreback jreback added the IO CSV read_csv, to_csv label Dec 4, 2016
@gfyoung
Member

gfyoung commented Dec 4, 2016

@jreback : Indeed, the doc just seems outright wrong. Here's what happens:

  1. If header is a single index, then names overrides it if passed in.
  2. If header is multi-index, then passed in names will cause an Exception to be raised.

I'll put up a PR to update this. Note that we already have tests in header.py to confirm.

@jorisvandenbossche
Member

Sorry, what exactly is unclear about the current phrasing?
If you pass a names arg, you can use header=0 to replace the existing header row (otherwise the first row will be seen as 'data'). You could question its usefulness, since skiprows=1 achieves the same effect, but to me that is a clear explanation. And based on that, the originally reported issue seems like a bug to me.
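
The equivalence mentioned here can be sketched as follows (illustrative snippet, not from the thread):

```python
import io
import pandas as pd

s = "a,b\n1,2\n3,4\n"

# header=0 + names: the header row 'a,b' is replaced by the names
df1 = pd.read_csv(io.StringIO(s), names=["A", "B"], header=0)

# skiprows=1 + names: the header row is skipped, names label the data
df2 = pd.read_csv(io.StringIO(s), names=["A", "B"], skiprows=1)

assert df1.equals(df2)  # both give the same two-row frame
```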

@gfyoung
Member

gfyoung commented Dec 4, 2016

@jorisvandenbossche : The documentation is outright wrong. We even have tests that confirm this.

@jorisvandenbossche
Member

Can you point to them?

@gfyoung
Member

gfyoung commented Dec 4, 2016

I can't point to them right now, but if you search for the word "names" in that file, you should find them easily.

@jorisvandenbossche
Member

jorisvandenbossche commented Dec 4, 2016

Not sure what 'that' file is, but looking at the header tests, I find a test that explicitly tests this header=0, names=[...] behaviour of replacing the header in the data:

@gfyoung
Member

gfyoung commented Dec 4, 2016

I specifically mentioned header.py in a previous comment. And yes, that is the test that confirms the incorrectness of the documentation.

@jorisvandenbossche
Member

Sorry, overlooked that.

But, again, as asked above, can you clarify why you think this shows the incorrectness of the documentation?
The docstring says

Explicitly pass header=0 to be able to replace existing names.

so in the test, the existing names in the data (the header row) are replaced by the values in the passed names arg by also supplying header=0. Otherwise those existing names in the data would be included as a data row.

@gfyoung
Member

gfyoung commented Dec 4, 2016

So reading it again, I think I see what the intended meaning is, which technically is correct. However, I think we should clarify the interaction between names and header and close this issue.

@gfyoung
Member

gfyoung commented Dec 4, 2016

Also I would not consider this a bug because the header rows are used for determining the number of columns that we expect in the data.

@jorisvandenbossche
Member

IMO at least one of the two examples is a bug: either both should raise (if we say that the header row should match the number of columns), or both should work (if we decide that when replacing the header row, the number of columns doesn't need to match).

@gfyoung
Member

gfyoung commented Dec 4, 2016

It's a little more involved than that. The header rows, as well as the row following the header, are used to figure out the expected number of columns. That's why the second example fails and the first does not.

@gfyoung
Member

gfyoung commented Jan 3, 2017

@jreback , @jorisvandenbossche : While it does seem like we resolved @karlacio 's immediate issue, some sort of patch is needed here. @jorisvandenbossche is advocating what appears to be an API change, while I think that the behavior is alright as is and just needs doc clarification. @jreback thoughts?

@jorisvandenbossche
Member

OK, I think the confusion/disagreement is due to mixing several issues here:

1. Combination of header and names kwargs (in a simple case)

In [1]: s = """a,b
   ...: 1,2
   ...: 3,4"""

In [2]: pd.read_csv(StringIO(s), names=['A', 'B'])
Out[2]: 
   A  B
0  a  b
1  1  2
2  3  4

In [3]: pd.read_csv(StringIO(s), names=['A', 'B'], header=0)
Out[3]: 
   A  B
0  1  2
1  3  4

Here, as stated in the docs, you can pass header=0 to replace the header in the file. This is what I meant above with: "I don't see what is wrong with the docs" and so questioned your "the doc just seems outright wrong" statement. @gfyoung Do you agree this behaviour is correct? And as documented? (possibly the wording can be improved)

2. Determination of the number of columns

and whether this is based on the header, on names, or on the number of values in the data.

@gfyoung you said above:

It's a little more involved than that. The header rows are used as well as the row following the header to figure out the expected number of columns. That's why the second example fails and not the first.

The main problem here is, I think, that it is not clear what exactly this "row following the header" (first data row) is.
In the examples below I try to reason through what this would mean and what I would therefore expect. In the last case, the actual behavior does not match.

# Consider these two files, starting with one or two shorter rows
s1 = """1,2
1,2,3,4
1,2,3,4
1,2,3,4"""

s2 = """1,2
1,2
1,2,3,4
1,2,3,4"""

## by default, first row is taken as header
# based on first two rows -> 4 columns (but first two set as index because header is shorter)
In [28]: pd.read_csv(StringIO(s1))
Out[28]: 
     1  2
1 2  3  4
  2  3  4
  2  3  4

# based on first two rows -> 2 columns inferred -> error as further on you get more columns
In [29]: pd.read_csv(StringIO(s2))
...
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 4

## passing names -> this is used as header, all rows in file as data rows
# based on header (=names) and first row of file (first data row) -> 4 columns inferred
In [24]: pd.read_csv(StringIO(s1), names=list('abcd'))
Out[24]: 
   a  b    c    d
0  1  2  NaN  NaN
1  1  2  3.0  4.0
2  1  2  3.0  4.0
3  1  2  3.0  4.0

# based on header (=names) and first row of file (first data row) -> 4 columns inferred
In [25]: pd.read_csv(StringIO(s2), names=list('abcd'))
Out[25]: 
   a  b    c    d
0  1  2  NaN  NaN
1  1  2  NaN  NaN
2  1  2  3.0  4.0
3  1  2  3.0  4.0

## also passing header=0 -> names overwrite first (header) row of file
# based on header (=names) and second row of file (first data row) -> 4 columns inferred
In [26]: pd.read_csv(StringIO(s1), names=list('abcd'), header=0)
Out[26]: 
   a  b  c  d
0  1  2  3  4
1  1  2  3  4
2  1  2  3  4

## also passing header=0 -> names overwrite first (header) row of file
# based on header (=names) and second row of file (first data row) -> 4 columns inferred 
# -> NOT the case -> in reality only 2 columns inferred
In [27]: pd.read_csv(StringIO(s2), names=list('abcd'), header=0)
...
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 4

So it seems that names is not used to determine the number of columns when header=0 (but when header=None, names is used). Only the actual rows in the file (i.e. the header row in the file before it is replaced by names, and not the header passed through names) are used in this case. This seems like an inconsistency between header=None and header=0 to me.

Of course it can be discussed whether the header row or names should be used to determine the number of columns in the case of header=0, but since you are replacing the existing header with names, I would expect names to be used.
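
Under that expectation (names defines the column count when it replaces the header), skiprows=1 plus names is a working alternative on s2 (illustrative sketch):

```python
import io
import pandas as pd

s2 = "1,2\n1,2\n1,2,3,4\n1,2,3,4\n"

# Skip the short header row instead of replacing it via header=0:
# the 4 names now determine the column count, so no ParserError is
# raised, and the remaining short row is padded with NaN.
df = pd.read_csv(io.StringIO(s2), names=list("abcd"), skiprows=1)
print(df.shape)  # (3, 4)
```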

@jorisvandenbossche
Member

Related issue for the second bullet: #6710

@gfyoung
Member

gfyoung commented Jan 4, 2017

@jorisvandenbossche : I disagree that your examples are contradictory. In your final example, the original header is 1,2 (first line), and first data row is also 1,2 (second line). That is why you get the error.

@jorisvandenbossche
Member

But you explicitly say to replace the original header; shouldn't the replaced header be used then?
In pd.read_csv(StringIO(s2), names=list('abcd')), the header provided through names is used to determine the number of columns.

@gfyoung
Member

gfyoung commented Jan 4, 2017

@jorisvandenbossche : Based on how the code behaves, "replacing the header" means using different names for the header (without changing the number of names provided). That is why we would not use names to determine the number of columns if a header row is specified.

Perhaps that needs clarification or verification (in any case, the Python engine's behavior is inconsistent with the C engine's if you run it through the examples you provided), but I still don't believe there is a bug.

@jbrockmendel jbrockmendel added the Segfault Non-Recoverable Error label Oct 16, 2019
@mroeschke mroeschke added Bug Needs Discussion Requires discussion from core team before further action labels Apr 4, 2020
@mfuellbier

I found a segfault in read_csv. I am not sure whether it is the same error as the one discussed here, so before opening a new ticket I wanted to check whether that is necessary.

This is the MWE, which creates a segmentation fault:

import pandas
import io

pandas.read_csv(io.StringIO("\na\nb\n"),  skip_blank_lines=False, header=1)

Shall I create a new ticket for that or is this basically the problem discussed here?

@gfyoung
Member

gfyoung commented Apr 21, 2020

@mfuellbier Yikes! That's not great, to say the least.

I copied your example into the issue description at the top, as it is similar to the OP's. Thanks!

@gfyoung gfyoung removed the Needs Discussion Requires discussion from core team before further action label Apr 21, 2020
@ahawryluk
Contributor

Good news,

pd.read_csv(io.StringIO("\na\nb\n"),  skip_blank_lines=False, header=1)

no longer causes a segfault on master. It returns

   a
0  b

@gfyoung
Member

gfyoung commented Mar 15, 2021

@ahawryluk : That's great! Would you like to add this test as a PR?

I would check the OP as well to see if we can close this issue out.
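
A sketch of what such a regression test might look like (the test name and placement here are illustrative, not the actual PR):

```python
import io
import pandas as pd
import pandas.testing as tm

def test_blank_first_line_with_header():
    # This call used to segfault (see the repro at the top of the issue);
    # it should now parse 'a' as the header and 'b' as the only data row.
    result = pd.read_csv(io.StringIO("\na\nb\n"),
                         skip_blank_lines=False, header=1)
    expected = pd.DataFrame({"a": ["b"]})
    tm.assert_frame_equal(result, expected)

test_blank_first_line_with_header()
```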

@karlacio
Author

OK to close.
Thanks!
