Core dumped in read_csv (C engine) when reading multiple corrupted gzip files #12098
Comments
Please show an example with data as minimal as possible. Why does looping matter here?
Sorry, I forgot to attach the file. Here it is. The loop is to reproduce the problem without having to attach multiple files.

INSTALLED VERSIONS
commit: None
pandas: 0.17.1
@alessiodore can you see if you can narrow it down a bit more, please? E.g. keep chopping until you don't get the error, then back up.
I am not sure this is what you mean, but I changed my script to:
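(The script itself did not survive in this copy of the thread; the sketch below is an assumed reconstruction of the narrowing step. The 9657/9656 cutoff comes from the next sentence, while the file names and the loop length are made up for illustration.)

```python
import pandas as pd

# Read the raw bytes of the corrupted gzip log, write a truncated copy,
# and replay read_csv on it to find the smallest slice that still crashes.
with open('corrupted.log.gz', 'rb') as f:      # file name is an assumption
    log = f.read()

n = 9657                                       # [0:9657] crashes, [0:9656] does not
with open('sliced.log.gz', 'wb') as f:
    f.write(log[:n])

for i in range(1000):
    try:
        pd.read_csv('sliced.log.gz', compression='gzip',
                    delim_whitespace=True, header=None)
    except Exception as exc:
        print(i, exc)
```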
When I slice the file to 9656 I don't get the segmentation fault; at 9657 it segfaults.
Great. So extract the slice that is causing the error. We need a simple, copy-pastable example in order to pinpoint the problem; I don't want a file, rather a string of characters that reproduces it. Also try with and without the gzip to see if that is the problem. The more you can narrow it down the better.
I tried slicing the left part of the file (log[n:9657]) but I got the segmentation fault only for n=0. I also tried log[1:len(log)] and didn't get the segmentation fault.
Yes, ideally what you can do is something like:
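(The snippet that originally followed is missing here; the block below is a guess at the kind of self-contained repro being asked for, with io.BytesIO standing in for the attached file so the corrupted bytes can be pasted straight into the report. The byte string shown is a placeholder, not the real content.)

```python
import io
import pandas as pd

# Placeholder for the ~9657 corrupted gzip bytes, pasted inline as a literal
# so no attachment is needed; the real repro would embed the actual bytes.
data = b'\x1f\x8b\x08\x00...'

for i in range(1000):
    try:
        pd.read_csv(io.BytesIO(data), compression='gzip',
                    delim_whitespace=True, header=None)
    except Exception as exc:
        print(i, exc)
```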
i.e. a complete copy-pastable example that repros. Then we can use this to debug and as a test. I know narrowing down is not so fun :< but in order to fix these issues it's much better to have a simple example. Thanks!
I understand. I just wasn't entirely sure if it was okay to post a 10K-character string.
Can you cut this down?
I am not sure how I can give you a simpler example. I can reproduce the segmentation fault only with the file sliced from 0 to 9657. If I drop even just the first character ([1:len(log)]) I don't get the segfault, and there is also no segmentation fault if I take the file from 0 to 9656. The parser seems to detect that the file is corrupted, but when I try to read a certain number of corrupted files, at some point I get a segmentation fault. Unfortunately, this is all the information I have and this is the only way I could recreate the problem.
OK, this is reproducible. Thanks for the example.
If anyone is interested, cc @mcwitt.
The segfault with Python 2 is caused by the `Py_XDECREF(RDS(rds)->buffer);` line in the `del_rd_source` function in the io.c source file. The reference count for `rds->obj` is explicitly incremented in `new_rd_source()`, but I haven't found where the reference count for `rds->buffer` is incremented. Removing the `Py_XDECREF(RDS(rds)->buffer);` line in io.c allows the example code to run without a segfault. Does anyone know of a good reason to keep the call to `Py_XDECREF(RDS(rds)->buffer)` in the `del_rd_source` function?
@selasley I just looked this over. In the line which creates `result`:

```c
result = PyObject_CallObject(func, args);
```

This returns a new reference, so the reference count of this object should be 1. The problematic thing I'm seeing is actually this block:

```c
if (result == NULL) {
    PyGILState_Release(state);
    *bytes_read = 0;
    *status = CALLING_READ_FAILED;
    return NULL;
}
```

From first principles: If
To be on the safe side it would be better to always set
I put the call to `Py_XDECREF(RDS(rds)->buffer);` back in `del_rd_source()` and added the lines you suggested. The problem code runs without segfaulting and all tests pass in
Cool, I think just that one line
Will do. |
I am using read_csv to read some gzip-compressed log files. Some of these files are corrupted and cannot be decompressed.
At different iterations of the loop that reads these files, my script crashes with a core-dumped message:
*** Error in `/usr/bin/python': corrupted double-linked list: 0x0000000003836790 ***
or just:
Segmentation fault (core dumped)
This is a stripped-down version (just looping over one of the corrupted files) of the code where this error occurs:
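(The code block was dropped from this copy of the report; what follows is a minimal sketch reconstructed from the description, so the file name, iteration count, and header handling are assumptions, while delim_whitespace, the gzip compression, and the try/except around read_csv come from the surrounding text.)

```python
import pandas as pd

# Loop over the same corrupted gzip log many times; each call normally raises
# CParserError, but after some number of iterations the interpreter dies with
# "corrupted double-linked list" or a plain segmentation fault.
for i in range(1000):
    try:
        df = pd.read_csv('corrupted.log.gz',          # one of the corrupted logs
                         compression='gzip',
                         delim_whitespace=True,
                         header=None)
    except Exception as exc:
        print(i, exc)
```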
The traceback of the caught exception is:
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 285, in _read
return parser.read()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 747, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1197, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988)
File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)
File "pandas/parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)
File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
If I remove the delim_whitespace argument, the loop completes without a segmentation fault. I tried adding low_memory=False, but the program still crashes.
I am using pandas version 0.17.1 on Ubuntu 14.04.
It looks similar to issue #5664, but that problem should have been resolved in v0.16.1.