read_csv crash if second input line is malformed #14782

Closed
karlacio opened this issue Dec 1, 2016 · 28 comments
Labels
Bug IO CSV read_csv, to_csv Segfault Non-Recoverable Error
Milestone

Comments

@karlacio

karlacio commented Dec 1, 2016

Problem description

Added 2020/04/21:

Similar issue to the OP's, but it segfaults (example taken from the comment below):

import pandas
import io

pandas.read_csv(io.StringIO("\na\nb\n"),  skip_blank_lines=False, header=1)

===

I'm using read_csv to import files containing data in 9 columns.
Some lines have a 'time tag' composed of two columns (the first line is always a 'time tag').
Some lines are malformed in various ways.

I use read_csv to read the data and skip bad lines with just a warning, and usually it works.

Recently I found a file that crashes my program; the crash seems to be caused by a malformed second line (not 9 fields).

The code:

import pandas as pd
print("before read_csv")
d = pd.read_csv("inputfile.txt",
                header=0,
                names=(11,22,33,44,55,66,77,88,99),
                compression='infer',
                error_bad_lines=False, warn_bad_lines=True)
print("after read_csv")
print(d)

works on input like this:

1,2
1,2,3,4,5,6,7,8,9
yunk
1,2,3,4,5,6,7,8,9

with output:

before read_csv
after read_csv
     11   22   33   44   55   66   77   88   99
0     1  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
1  yunk  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2     1  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0

But fails on input like this:

1,2
yunk
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9

with output:

before read_csv
b'Skipping line 3: expected 2 fields, saw 9\nSkipping line 4: expected 2 fields, saw 9\n'
Traceback (most recent call last):
  File ".../bug_pandas.read_scv.py", line 42, in <module>
    error_bad_lines=False, warn_bad_lines=True)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File ".../Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/io/parsers.py", line 1507, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:10364)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10640)
  File "pandas/parser.pyx", line 945, in pandas.parser.TextReader._read_rows (pandas/parser.c:11677)
  File "pandas/parser.pyx", line 1007, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:12627)
pandas.io.common.CParserError: Too many columns specified: expected 9 and found 2

Process finished with exit code 1

The problem arises both on OSX and on CentOS 7.

I tried to find out whether this is an already known bug, but I found nothing.
Sorry if it's already known.

Carlo.

Output of pd.show_versions() on OSX 10.11.6

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Output of pd.show_versions() on CentOS 7

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 29.0.1
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@gfyoung
Member

gfyoung commented Dec 2, 2016

@karlacio : Why are you specifying header if you are passing in names? That seems illogical since both are for the top row of the CSV file.

@karlacio
Author

karlacio commented Dec 2, 2016

You are right. If I remove header or set it to None it works.
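
For illustration, a minimal sketch of that fix on the failing input (hypothetical snippet, not from the original report; error_bad_lines is omitted because, with header=None, no row exceeds the 9 names):

```python
import io
import pandas as pd

# The failing input: with names alone (header defaults to None),
# the 9 names define the columns and shorter rows are padded with NaN.
data = "1,2\nyunk\n1,2,3,4,5,6,7,8,9\n1,2,3,4,5,6,7,8,9\n"
d = pd.read_csv(io.StringIO(data),
                header=None,
                names=(11, 22, 33, 44, 55, 66, 77, 88, 99))
print(d.shape)  # (4, 9)
```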

I don't know why it was there, maybe I hastily read the docs:

header : int or list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

In any case, is the code's behavior correct?
Should read_csv raise an error even when I set header incorrectly?

@gfyoung
Member

gfyoung commented Dec 2, 2016

@karlacio : Actually, that phrase is admittedly a little confusing. @jreback @jorisvandenbossche ? I suspect a documentation fix is needed there.

Because we process header before applying names, the answer to your last question is yes.

@karlacio
Author

karlacio commented Dec 2, 2016

@gfyoung
Ok thanks!

@jreback
Contributor

jreback commented Dec 4, 2016

yeah that does look a bit confusing. someone want to re-word? (I would put a blank line or 2 somewhere; that is a lot of text).

I am also suspicious of the utility of: Explicitly pass header=0 to be able to replace existing names.

@gfyoung can you divine from the code how this is useful?

@jreback jreback added the IO CSV read_csv, to_csv label Dec 4, 2016
@gfyoung
Member

gfyoung commented Dec 4, 2016

@jreback : Indeed, the doc just seems outright wrong. Here's what happens:

  1. If header is a single index, then names overrides it if passed in.
  2. If header is multi-index, then passed in names will cause an Exception to be raised.

I'll put up a PR to update this. Note that we already have tests in header.py to confirm.

@jorisvandenbossche
Member

Sorry, what exactly is unclear about the current phrasing?
If you pass a names arg, you can use header=0 to replace the existing header row (otherwise the first row will be seen as 'data'). You could question its usefulness, since skiprows=1 achieves the same effect, but to me that is a clear explanation. And based on that, the originally reported issue seems like a bug to me.
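
The equivalence mentioned here can be sketched as follows (illustrative snippet, not from the thread):

```python
import io
import pandas as pd

s = "a,b\n1,2\n3,4\n"

# header=0 + names: the header row 'a,b' is replaced by the names
df1 = pd.read_csv(io.StringIO(s), names=["A", "B"], header=0)

# skiprows=1 + names: the header row is skipped, names label the data
df2 = pd.read_csv(io.StringIO(s), names=["A", "B"], skiprows=1)

assert df1.equals(df2)  # both give the same two-row frame
```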

@gfyoung
Member

gfyoung commented Dec 4, 2016

@jorisvandenbossche : The documentation is outright wrong. We even have tests that confirm this.

@jorisvandenbossche
Member

Can you point to them?

@gfyoung
Member

gfyoung commented Dec 4, 2016

I can't point to them right now, but if you search for the word "names" in that file, you should find them easily.

@jorisvandenbossche
Member

jorisvandenbossche commented Dec 4, 2016

Not sure what 'that' file is, but looking at the header tests, I find a test that explicitly tests this header=0, names=[...] behaviour of replacing the header in the data:

@gfyoung
Member

gfyoung commented Dec 4, 2016

I specifically mentioned header.py in a previous comment. And yes, that is the test that confirms the incorrectness of the documentation.

@jorisvandenbossche
Member

Sorry, overlooked that.

But, again, as asked above, can you clarify why you think this shows the incorrectness of the documentation?
The docstring says

Explicitly pass header=0 to be able to replace existing names.

so in the test, the existing names in the data (the header row) are replaced by the values in the passed names arg by also supplying header=0. Otherwise those existing names in the data would be included as a data row.

@gfyoung
Member

gfyoung commented Dec 4, 2016

So reading it again, I think I see what the intended meaning is, which technically is correct. However, I think we should clarify the interaction between names and header and close this issue.

@gfyoung
Member

gfyoung commented Dec 4, 2016

Also I would not consider this a bug because the header rows are used for determining the number of columns that we expect in the data.

@jorisvandenbossche
Member

IMO at least one of the two examples is a bug: either both should raise (if we say that the header row should match the number of columns), or both should work (if we decide that when replacing the header row, the number of columns doesn't need to match).

@gfyoung
Member

gfyoung commented Dec 4, 2016

It's a little more involved than that. The header rows, as well as the row following the header, are used to figure out the expected number of columns. That's why the second example fails and the first does not.

@gfyoung
Member

gfyoung commented Jan 3, 2017

@jreback , @jorisvandenbossche : While it does seem like we resolved @karlacio 's immediate issue, some sort of patch is needed here. @jorisvandenbossche is advocating what appears to be an API change, while I think that the behavior is alright as is and just needs doc clarification. @jreback thoughts?

@jorisvandenbossche
Member

OK, I think the confusion/disagreement is due to mixing several issues here:

1. Combination of header and names kwargs (in a simple case)

In [1]: s = """a,b
   ...: 1,2
   ...: 3,4"""

In [2]: pd.read_csv(StringIO(s), names=['A', 'B'])
Out[2]: 
   A  B
0  a  b
1  1  2
2  3  4

In [3]: pd.read_csv(StringIO(s), names=['A', 'B'], header=0)
Out[3]: 
   A  B
0  1  2
1  3  4

Here, as stated in the docs, you can pass header=0 to replace the header in the file. This is what I meant above with: "I don't see what is wrong with the docs" and so questioned your "the doc just seems outright wrong" statement. @gfyoung Do you agree this behaviour is correct? And as documented? (possibly the wording can be improved)

2. Determination of the number of columns

and whether this is based on the header, on names, or on the number of values in the data.

@gfyoung you said above:

It's a little more involved than that. The header rows are used as well as the row following the header to figure out the expected number of columns. That's why the second example fails and not the first.

The main problem here is, I think, that it is not clear what exactly this "row following the header" (first data row) is.
In the examples below I try to reason through what this would mean and what I would therefore expect. In the last case, the actual behavior does not match.

# Consider these two files, starting with one or two shorter rows
s1 = """1,2
1,2,3,4
1,2,3,4
1,2,3,4"""

s2 = """1,2
1,2
1,2,3,4
1,2,3,4"""

## by default, first row is taken as header
# based on first two rows -> 4 columns (but first two set as index because header is shorter)
In [28]: pd.read_csv(StringIO(s1))
Out[28]: 
     1  2
1 2  3  4
  2  3  4
  2  3  4

# based on first two rows -> 2 columns inferred -> error as further on you get more columns
In [29]: pd.read_csv(StringIO(s2))
...
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 4

## passing names -> this is used as header, all rows in file as data rows
# based on header (=names) and first row of file (first data row) -> 4 columns inferred
In [24]: pd.read_csv(StringIO(s1), names=list('abcd'))
Out[24]: 
   a  b    c    d
0  1  2  NaN  NaN
1  1  2  3.0  4.0
2  1  2  3.0  4.0
3  1  2  3.0  4.0

# based on header (=names) and first row of file (first data row) -> 4 columns inferred
In [25]: pd.read_csv(StringIO(s2), names=list('abcd'))
Out[25]: 
   a  b    c    d
0  1  2  NaN  NaN
1  1  2  NaN  NaN
2  1  2  3.0  4.0
3  1  2  3.0  4.0

## also passing header=0 -> names overwrite first (header) row of file
# based on header (=names) and second row of file (first data row) -> 4 columns inferred
In [26]: pd.read_csv(StringIO(s1), names=list('abcd'), header=0)
Out[26]: 
   a  b  c  d
0  1  2  3  4
1  1  2  3  4
2  1  2  3  4

## also passing header=0 -> names overwrite first (header) row of file
# based on header (=names) and second row of file (first data row) -> 4 columns inferred 
# -> NOT the case -> in reality only 2 columns inferred
In [27]: pd.read_csv(StringIO(s2), names=list('abcd'), header=0)
...
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 4

So it seems that names is not used to determine the number of columns when header=0 (but when header=None, names is used). Only the actual rows in the file (i.e. the header row in the file before it is replaced by names, and not the header passed through names) are used in this case. This seems like an inconsistency between header=None and header=0 to me.

Of course it can be discussed whether the header row or names should be used to determine the number of columns in the case of header=0, but since you are replacing the existing header with names, I would expect names to be used.
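
Under that expectation (names defines the column count when it replaces the header), skiprows=1 plus names is a working alternative on s2 (illustrative sketch):

```python
import io
import pandas as pd

s2 = "1,2\n1,2\n1,2,3,4\n1,2,3,4\n"

# Skip the short header row instead of replacing it via header=0:
# the 4 names now determine the column count, so no ParserError is
# raised, and the remaining short row is padded with NaN.
df = pd.read_csv(io.StringIO(s2), names=list("abcd"), skiprows=1)
print(df.shape)  # (3, 4)
```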

@jorisvandenbossche
Member

Related issue for the second bullet: #6710

@gfyoung
Member

gfyoung commented Jan 4, 2017

@jorisvandenbossche : I disagree that your examples are contradictory. In your final example, the original header is 1,2 (first line), and first data row is also 1,2 (second line). That is why you get the error.

@jorisvandenbossche
Member

But you explicitly say to replace the original header; shouldn't the replaced header be used then?
In pd.read_csv(StringIO(s2), names=list('abcd')), the header provided through names is used to determine the number of columns.

@gfyoung
Member

gfyoung commented Jan 4, 2017

@jorisvandenbossche : Based on how the code behaves, "replacing the header" means using different names for the header (without changing the number of names provided). That is why we would not use names to determine the number of columns if a header row is specified.

Perhaps that needs clarification or verification (in any case, the Python engine's behavior is inconsistent with the C engine's if you run it through the examples you provided), but I still don't believe there is a bug.

@jbrockmendel jbrockmendel added the Segfault Non-Recoverable Error label Oct 16, 2019
@mroeschke mroeschke added Bug Needs Discussion Requires discussion from core team before further action labels Apr 4, 2020
@mfuellbier

I found a segfault in read_csv. I am not sure whether it is the same error as the one discussed here, so before opening a new ticket I wanted to check whether that is necessary.

This is the MWE, which creates a segmentation fault:

import pandas
import io

pandas.read_csv(io.StringIO("\na\nb\n"),  skip_blank_lines=False, header=1)

Shall I create a new ticket for that or is this basically the problem discussed here?

@gfyoung
Member

gfyoung commented Apr 21, 2020

@mfuellbier Yikes! That's not great, to say the least.

I copied your example into the issue description at the top, as it is similar to the OP's. Thanks!

@gfyoung gfyoung removed the Needs Discussion Requires discussion from core team before further action label Apr 21, 2020
@ahawryluk
Contributor

Good news,

pd.read_csv(io.StringIO("\na\nb\n"),  skip_blank_lines=False, header=1)

no longer causes a segfault on master. It returns

   a
0  b

@gfyoung
Member

gfyoung commented Mar 15, 2021

@ahawryluk : That's great! Would you like to add this test as a PR?

I would check the OP as well to see if we can close this issue out.
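
A sketch of what such a regression test might look like (the test name and placement here are illustrative, not the actual PR):

```python
import io
import pandas as pd
import pandas.testing as tm

def test_blank_first_line_with_header():
    # This call used to segfault (see the repro at the top of the issue);
    # it should now parse 'a' as the header and 'b' as the only data row.
    result = pd.read_csv(io.StringIO("\na\nb\n"),
                         skip_blank_lines=False, header=1)
    expected = pd.DataFrame({"a": ["b"]})
    tm.assert_frame_equal(result, expected)

test_blank_first_line_with_header()
```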

@karlacio
Author

OK to close.
Thanks!
