Loading CSV files (using read_csv) with blank lines between header and data rows quits Python interpreter #28071

Closed
plartoo opened this issue Aug 21, 2019 · 4 comments · Fixed by #32566
Labels
IO CSV read_csv, to_csv Segfault Non-Recoverable Error

@plartoo

plartoo commented Aug 21, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd  # tested with pandas 0.25.0 on Python 3.6.8
pd.read_csv('my_csv.csv', delimiter='|', header=4, nrows=1, skip_blank_lines=False)  # the interpreter exits without any error message

pd.read_csv('my_csv.csv', delimiter='|', header=4, nrows=2, skip_blank_lines=False)  # this works and produces the output below
   int_id_1_1_1  date_2019-01-01_2019-12-31_2  ascii_str_8_8_3  double_-1.0_1.0_4  integer_-1000_1000_5
0           NaN                           NaN              NaN                NaN                   NaN
1           NaN                           NaN              NaN                NaN                   NaN

Problem description

I have been trying to load a test CSV file ("my_csv.txt", attached) that has an informational text line on the second row, the column header on the fifth row, and data starting on the ninth row. As the Python code above shows, read_csv crashes the interpreter when nrows=1, but not when nrows>1.

I think there is an unhandled bug in pandas' read_csv when a CSV file has blank lines between the header and the first data row. Thank you for your hard work maintaining and extending this very useful library.

my_csv.txt

@TomAugspurger
Contributor

Thanks for the report. Here is a self-contained example:

import pandas as pd
import io
s = """\

header

a,b\n

1,2\n
3,4"""
pd.read_csv(io.StringIO(s), header=3, nrows=1, skip_blank_lines=False)
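
Because the crash takes down the whole interpreter, it can be easier to run the reproducer in a child process and inspect its exit status. The following is a minimal sketch, not part of the original report: `run_isolated` and `REPRO` are hypothetical names, and it assumes pandas is installed in the same environment as the parent interpreter.

```python
import subprocess
import sys

# The reproducer from this comment, kept as source text for the child process.
# A raw string is used so the "\n" escapes reach the child unchanged.
REPRO = r'''
import io
import pandas as pd

s = """\

header

a,b\n

1,2\n
3,4"""
pd.read_csv(io.StringIO(s), header=3, nrows=1, skip_blank_lines=False)
'''


def run_isolated(code: str) -> int:
    """Run `code` in a fresh interpreter and return its exit status."""
    result = subprocess.run([sys.executable, "-c", code])
    return result.returncode


if __name__ == "__main__":
    status = run_isolated(REPRO)
    # On POSIX, a negative status means the child was killed by a signal
    # (such as SIGSEGV); 0 means read_csv returned normally.
    print("child exit status:", status)
```

Running the snippet this way keeps the parent session alive whether or not the installed pandas version is affected.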

@TomAugspurger TomAugspurger added IO CSV read_csv, to_csv Segfault Non-Recoverable Error labels Aug 21, 2019
@WillAyd
Member

WillAyd commented Jan 17, 2020

I believe I've narrowed it down to this line:

char_count = (self->word_starts[word_deletions - 1] +

Debugging @TomAugspurger's example above, word_deletions is 0 at that point, so self->word_starts[word_deletions - 1] is indexed with -1, which is undefined behavior.

I think we could add a check that word_deletions is > 0 before that indexing operation, but I have to review some more to understand what the intent is here.
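
The proposed check can be modelled in a few lines. This is a Python sketch of the idea only, not the actual C change: `guarded_last_word_start` is a hypothetical name, the list argument stands in for the parser's `word_starts` array, and returning `None` when no words have been consumed is an assumption about what the caller would want.

```python
from typing import List, Optional


def guarded_last_word_start(word_starts: List[int], word_deletions: int) -> Optional[int]:
    """Return the start offset of the last consumed word, or None if there is none.

    Hypothetical model of the guard: only index with word_deletions - 1
    when word_deletions is greater than 0.
    """
    if word_deletions <= 0:
        # Nothing has been consumed, so there is no valid index to read.
        return None
    return word_starts[word_deletions - 1]


starts = [0, 7, 9]
print(guarded_last_word_start(starts, 2))  # reads starts[1]
print(guarded_last_word_start(starts, 0))  # guarded: no indexing happens
```

Note that an unguarded `starts[0 - 1]` in Python silently wraps around to the last element instead of crashing, so the model needs the explicit check just as the C code would.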

roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 9, 2020
@roberthdevries
Contributor

take

@roberthdevries
Contributor

I do not claim to understand exactly what happens here, but I created a test that reproduces the problem and, with this fix, produces the expected results.

roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 11, 2020
roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 12, 2020
@jreback jreback added this to the 1.1 milestone Mar 15, 2020
roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 15, 2020
5 participants