Loading CSV files (using `read_csv`) with blank lines between header and data rows quits Python interpreter #28071

plartoo · 2019-08-21T19:07:05Z

Code Sample, a copy-pastable example if possible

import pandas as pd  # tested with pandas 0.25.0 on Python 3.6.8

# nrows=1: the interpreter exits without any error message
pd.read_csv('my_csv.csv', delimiter='|', header=4, nrows=1, skip_blank_lines=False)

# nrows=2: works and produces the output below
pd.read_csv('my_csv.csv', delimiter='|', header=4, nrows=2, skip_blank_lines=False)
   int_id_1_1_1  date_2019-01-01_2019-12-31_2  ascii_str_8_8_3  double_-1.0_1.0_4  integer_-1000_1000_5
0           NaN                           NaN              NaN                NaN                   NaN
1           NaN                           NaN              NaN                NaN                   NaN

Problem description

I have been trying to load a test CSV file ("my_csv.txt", attached) that has informational text on the second row, the column header on the fifth row, and data starting at the ninth row. As the Python code above shows, read_csv crashes the interpreter when nrows=1, but not when nrows>1.

I think there is an unhandled bug in pandas' read_csv when a CSV file has blank lines between the header and the first data row. Thank you for your hard work maintaining and extending this very useful library.

my_csv.txt
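The attachment is not reproduced in this thread. As a rough sketch, a file with the layout described above could be generated as follows. The column names are taken from the output shown earlier; the information text, the data values, and the helper name `build_sample_csv` are invented for illustration, and the rows not mentioned in the description are assumed to be blank.

```python
# Hypothetical stand-in for the attached file. Only the row positions and the
# column names come from the report; every value below is made up.
HEADER = (
    "int_id_1_1_1|date_2019-01-01_2019-12-31_2|ascii_str_8_8_3|"
    "double_-1.0_1.0_4|integer_-1000_1000_5"
)


def build_sample_csv() -> str:
    """Return pipe-delimited text: info on row 2, header on row 5, data from row 9."""
    rows = [
        "",                                  # row 1: assumed blank
        "some information text",             # row 2: information text
        "",                                  # row 3: assumed blank
        "",                                  # row 4: assumed blank
        HEADER,                              # row 5: header, i.e. header=4 (0-based)
        "",                                  # row 6: blank
        "",                                  # row 7: blank
        "",                                  # row 8: blank
        "1|2019-01-01|abcdefgh|0.5|10",      # row 9: first data row (invented)
        "2|2019-06-30|ijklmnop|-0.25|-20",   # row 10: second data row (invented)
    ]
    return "\n".join(rows) + "\n"
```

Writing the returned text to `my_csv.csv` should give a file with the same shape as the one described, though not necessarily the same content.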


TomAugspurger · 2019-08-21T20:01:16Z

Thanks for the report. Here is a self-contained example:

import pandas as pd
import io
s = """\

header

a,b\n

1,2\n
3,4"""
pd.read_csv(io.StringIO(s), header=3, nrows=1, skip_blank_lines=False)
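Because the crash takes the whole interpreter down, it can be convenient to run a reproducer like the one above in a child process and inspect the exit code. The following is a minimal sketch using only the standard library; `run_isolated` and `describe` are hypothetical helper names, not part of pandas.

```python
import subprocess
import sys


def run_isolated(source: str, timeout: float = 60.0) -> int:
    """Run `source` in a fresh Python interpreter and return its exit code."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # keep the child's output out of the parent session
        text=True,
        timeout=timeout,
    )
    return completed.returncode


def describe(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable message."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"
```

Passing the reproducer's source text to `run_isolated` on an affected pandas version would be expected to yield a non-zero code, while the parent session keeps running. The exact value depends on the platform.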

WillAyd · 2020-01-17T01:00:18Z

I believe I've narrowed it down to this line:

pandas/pandas/_libs/src/parser/tokenizer.c

Line 1192 in b68a9bb

char_count = (self->word_starts[word_deletions - 1] +

Debugging @TomAugspurger's example above, word_deletions is 0 at that point, so self->words[word_deletions - 1] would use a negative index, which is undefined behavior.

I think we could add a check that word_deletions is > 0 before that indexing operation, but I have to review some more to understand what the intent is here.
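The proposed guard can be modeled in Python as a rough illustration. This is not the C code from tokenizer.c and not necessarily the fix that was merged. The quoted line is truncated, so the second half of the expression (word length plus one terminator) is an assumption, as is returning 0 when nothing was consumed. The name `consumed_char_count` is hypothetical.

```python
from typing import List


def consumed_char_count(
    word_starts: List[int], words: List[str], word_deletions: int
) -> int:
    """Model of a guarded character count (hypothetical helper, not pandas code)."""
    if word_deletions <= 0:
        # No words were consumed, so there is nothing to index.
        # Assumption: in that case no characters need to be skipped either.
        return 0
    last = word_deletions - 1
    # Assumption: start offset of the last consumed word, plus its length,
    # plus one for the terminator that follows it in the buffer.
    return word_starts[last] + len(words[last]) + 1
```

Note that the analogy is imperfect: in Python an index of -1 silently wraps around to the last element, whereas in C it reads memory before the start of the array, which is why the unguarded version can crash.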

roberthdevries · 2020-03-09T22:04:49Z

take

roberthdevries · 2020-03-09T22:08:58Z

I do not claim to understand exactly what happens here, but I created a test that reproduces the problem and, with this fix, produces the expected results.

TomAugspurger added the IO CSV (read_csv, to_csv) and Segfault (Non-Recoverable Error) labels Aug 21, 2019

roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 9, 2020

Fix segfault in csv tokenizer (issue pandas-dev#28071)

1ca88b6

github-actions bot assigned roberthdevries Mar 9, 2020

roberthdevries mentioned this issue Mar 9, 2020

BUG: Fix segfault in csv tokenizer #32566

Merged

5 tasks

roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 11, 2020

Fix segfault in csv tokenizer (issue pandas-dev#28071)

344ab1d

roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 12, 2020

Fix segfault in csv tokenizer (issue pandas-dev#28071)

998fb48

jreback added this to the 1.1 milestone Mar 15, 2020

roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 15, 2020

Fix segfault in csv tokenizer (issue pandas-dev#28071)

6da5e2a

jreback closed this as completed in #32566 Mar 16, 2020
