read_csv crash if second input line is malformed #14782
@karlacio : Why are you specifying |
You are right. If I remove header or set it to None it works. I don't know why it was there, maybe I hastily read the docs:
In any case, is the code's behavior correct? |
@karlacio : Actually, that phrase is a little confusing admittedly. @jreback @jorisvandenbossche ? I suspect a documentation fix is needed there. Because we process |
@gfyoung |
Yeah, that does look a bit confusing. Does someone want to reword it? (I would put a blank line or two somewhere; that is a lot of text.) I am also suspicious of the utility of: @gfyoung can you divine from the code how this is useful? |
@jreback : Indeed, the doc just seems outright wrong. Here's what happens:
I'll put up a PR to update this. Note that we already have tests in header.py to confirm. |
Sorry, what exactly is unclear about the current phrasing? |
@jorisvandenbossche : The documentation is outright wrong. We even have tests that confirm this. |
Can you point to them? |
Can't indicate right now but if you search for the word "names," in that file, you should be able to find them easily. |
Not sure what 'that' file is, but looking at the header tests, I find a test that explicitly tests this: pandas/pandas/io/tests/parser/header.py, line 243 in 748000d |
I specifically mentioned header.py in a previous comment. And yes, that is the test that confirms the incorrectness of the documentation. |
Sorry, overlooked that. But, again, as asked above, can you clarify why you think this shows the incorrectness of the documentation?
So in the test, the existing names in the data (the header row) are replaced by the values in the passed |
So reading it again, I think I see what the intended meaning is, which technically is correct. However, I think we should clarify the interaction between names and header and close this issue. |
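For readers following along, the interaction between names and header mentioned above can be illustrated with a minimal sketch. The CSV text and the column names "x" and "y" are made up for illustration and are not taken from the thread; the sketch only covers the case where the number of names matches the number of columns.

```python
import io

import pandas as pd

# Hypothetical two-column input with a header row ("a,b").
csv_text = "a,b\n1,2\n3,4\n"

# header=0 marks the first line as the header row, and names= replaces it.
replaced = pd.read_csv(io.StringIO(csv_text), header=0, names=["x", "y"])
print(list(replaced.columns))  # ['x', 'y']
print(len(replaced))           # 2 data rows; the "a,b" line is consumed as the header

# header=None with names=: no line is treated as a header, so "a,b" becomes data.
kept = pd.read_csv(io.StringIO(csv_text), header=None, names=["x", "y"])
print(len(kept))               # 3 rows, the first one holding "a" and "b"
```

The disagreement in the rest of the thread concerns what should happen when the length of names and the width of the header row differ, which this sketch deliberately does not exercise.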
Also I would not consider this a bug because the header rows are used for determining the number of columns that we expect in the data. |
IMO at least one of the two examples is a bug: either both should raise (if we say that the header row should match the number of columns), or both should work (if we decide that, when replacing the header row, the number of columns doesn't need to match). |
It's a little more involved than that. The |
@jreback , @jorisvandenbossche : While it does seem like we resolved @karlacio 's immediate issue, some sort of patch is needed here. @jorisvandenbossche is advocating what appears to be an API change, while I think that the behavior is alright as is and just needs doc clarification. @jreback thoughts? |
OK, I think the confusion/disagreement is due to mixing several issues here:
1. Combination of
Here, as stated in the docs, you can pass
2. Determination of the number of columns and whether this is based on the
@gfyoung you said above:
The main problem here is, I think, that it is not clear what exactly is this "row following the header" (first data row).
So it seems that Of course it can be discussed whether the header row or |
Related issue for the second bullet: #6710 |
@jorisvandenbossche : I disagree that your examples are contradictory. In your final example, the original |
But you explicitly say to replace the original header, shouldn't you use the replaced header then? |
@jorisvandenbossche : Based on how the code behaves, replacing the header should mean using different names for the header (but not changing the number of names provided). That is why we would not use Perhaps that needs clarification or verification (in any case, the Python engine's behavior is inconsistent with that of the C engine if you run it through the examples you provided), but I still don't believe there is a bug. |
I found a seg fault in This is the MWE which creates a segmentation fault.
Shall I create a new ticket for that or is this basically the problem discussed here? |
@mfuellbier Yikes! That's not great to say the least. I copied your example into the issue description at top, as it is similar to the OP. Thanks! |
Good news, pd.read_csv(io.StringIO("\na\nb\n"), skip_blank_lines=False, header=1) no longer causes a segfault on master. It returns
|
@ahawryluk : That's great! Would you like to add this test as a PR? I would check the OP as well to see if we can close this issue out. |
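A regression test along those lines might look like the sketch below. The call is the one quoted in the comment above; the test function name is hypothetical, and the assertion is deliberately loose because the exact returned frame is not reproduced in this thread.

```python
import io

import pandas as pd


def test_blank_first_line_with_header_1():
    # Input from the thread: a blank first line, then "a", then "b".
    data = io.StringIO("\na\nb\n")

    # This call used to segfault; the point of the test is that it now returns.
    result = pd.read_csv(data, skip_blank_lines=False, header=1)

    # Only assert that parsing completed and produced a DataFrame.
    assert isinstance(result, pd.DataFrame)


test_blank_first_line_with_header_1()
```

A real test in the pandas suite would presumably compare against an expected frame and run for every parser engine; that is left out here since those details are not given above.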
Ok to close, |
Problem description
Added 2020/04/21:
Similar issue as OP but segfaults (from below)!
===
I'm using read_csv to import files containing data in 9 columns.
Some lines have a 'time tag' composed of two columns (the first line is always a 'time tag').
Some lines are malformed in various ways.
I use read_csv to read the data and skip bad lines with just a warning, and usually it works.
Recently I found a file that crashes my program, and the crash seems to be caused by a malformed second line (not 9 fields).
The code:
works on input like this:
with output:
But fails on input like this:
with output:
The problem arises both on OSX and on CentOS 7.
I tried to find if it's an already known bug but I found nothing.
Sorry if it's already known.
Carlo.
Output of pd.show_versions() on OSX 10.11.6
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
Output of pd.show_versions() on CentOS 7
INSTALLED VERSIONS
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 29.0.1
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None