Unexpected segmentation fault in pd.read_csv C-engine #13703
If the issue persists on master, …
Hello, @gfyoung! I cloned the master branch, but the problem still persisted. I've managed to trace the source to a read access beyond an allocated memory block in either kh_get_str or kh_get_strbox (defined in src/klib/khash.h), called from _string_box_factorize (in parser.pyx), which is reached from the very last branch of _string_convert for the remaining 5 lines of the text input. Which of the two functions is hit depends on the na_filter setting. I redefined COLITER_NEXT to print out the last address accessed just before the crash:

```c
#define COLITER_NEXT(iter, word) do {                            \
        const int i = *iter.line_start++ + iter.col;             \
        word = i < *iter.line_start ? iter.words[i] : "";        \
        printf("%d, %p\n", i, (const void *) iter.words[i]);     \
    } while (0)
```

It printed the offending address. I also added this hex-dump helper to tokenizer.c:

```c
void _dump(const void *addr, size_t len)
{
    size_t i;
    unsigned char buff[17];
    const unsigned char *pc = (const unsigned char *) addr;

    buff[0] = '\0';  /* keep the final printf well-defined when len == 0 */
    printf("%p:\n", addr);
    for (i = 0; i < len; i++) {
        if ((i % 16) == 0) {
            /* Flush the ASCII column of the previous row, then print
               the offset of the new 16-byte row. */
            if (i != 0)
                printf("  %s\n", buff);
            printf("  %04zx ", i);
        }
        printf(" %02x", pc[i]);
        /* Build the printable-ASCII rendering of the current row. */
        if ((pc[i] < 0x20) || (pc[i] > 0x7e)) {
            buff[i % 16] = '.';
        } else {
            buff[i % 16] = pc[i];
        }
        buff[(i % 16) + 1] = '\0';
    }
    /* Pad the last (partial) row so the ASCII column lines up. */
    while ((i % 16) != 0) {
        printf("   ");
        i++;
    }
    printf("  %s\n", buff);
}
```
I borrowed it, with simplifications, from this gist to dump the memory contents of parser->words and parser->stream in a call to _string_box_factorize. It turns out that just before the crash the pointers in parser->words point into a memory region starting with the address that causes the crash. I strongly suspect that this problem is specific to OS X memory allocation.

PS: It seems that end_field records the pointers to words in a memory region (with their offsets going into parser->word_starts), and that region later becomes inaccessible.

PPS: I suspect parser_trim_buffers reallocates that memory but does not re-initialize the word pointers; a standalone C sketch of this failure mode follows the snippet below.

PPPS: Here is a snippet which does not use the attached data file:

```python
import pandas as pd
from cStringIO import StringIO
record_ = """9999-9,99:99,,,,ZZ,ZZ,,,ZZZ-ZZZZ,.Z-ZZZZ,-9.99,,,9.99,ZZZZZ,,-99,9,ZZZ-ZZZZ,ZZ-ZZZZ,,9.99,ZZZ-ZZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,999,ZZZ-ZZZZ,,ZZ-ZZZZ,,,,,ZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZ,,,9,9,9,9,99,99,999,999,ZZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZ,9,ZZ-ZZZZ,9.99,ZZ-ZZZZ,ZZ-ZZZZ,,,,ZZZZ,,,ZZ,ZZ,,,,,,,,,,,,,9,,,999.99,999.99,,,ZZZZZ,,,Z9,,,,,,,ZZZ,ZZZ,,,,,,,,,,,ZZZZZ,ZZZZZ,ZZZ-ZZZZZZ,ZZZ-ZZZZZZ,ZZ-ZZZZ,ZZ-ZZZZ,ZZ-ZZZZ,ZZ-ZZZZ,,,999999,999999,ZZZ,ZZZ,,,ZZZ,ZZZ,999.99,999.99,,,,ZZZ-ZZZ,ZZZ-ZZZ,-9.99,-9.99,9,9,,99,,9.99,9.99,9,9,9.99,9.99,,,,9.99,9.99,,99,,99,9.99,9.99,,,ZZZ,ZZZ,,999.99,,999.99,ZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,,,ZZZZZ,ZZZZZ,ZZZ,ZZZ,9,9,,,,,,ZZZ-ZZZZ,ZZZ999Z,,,999.99,,999.99,ZZZ-ZZZZ,,,9.999,9.999,9.999,9.999,-9.999,-9.999,-9.999,-9.999,9.999,9.999,9.999,9.999,9.999,9.999,9.999,9.999,99999,ZZZ-ZZZZ,,9.99,ZZZ,,,,,,,,ZZZ,,,,,9,,,,9,,,,,,,,,,ZZZ-ZZZZ,ZZZ-ZZZZ,,ZZZZZ,ZZZZZ,ZZZZZ,ZZZZZ,,,9.99,,ZZ-ZZZZ,ZZ-ZZZZ,ZZ,999,,,,ZZ-ZZZZ,ZZZ,ZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,,,99.99,99.99,,,9.99,9.99,9.99,9.99,ZZZ-ZZZZ,,,ZZZ-ZZZZZ,,,,,-9.99,-9.99,-9.99,-9.99,,,,,,,,,ZZZ-ZZZZ,,9,9.99,9.99,99ZZ,,-9.99,-9.99,ZZZ-ZZZZ,,,,,,,ZZZ-ZZZZ,9.99,9.99,9999,,,,,,,,,,-9.9,Z/Z-ZZZZ,999.99,9.99,,999.99,ZZ-ZZZZ,ZZ-ZZZZ,9.99,9.99,9.99,9.99,9.99,9.99,,ZZZ-ZZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZZ,ZZZ,ZZZ,ZZZ,ZZZ,9.99,,,-9.99,ZZ-ZZZZ,-999.99,,-9999,,999.99,,,,999.99,99.99,,,ZZ-ZZZZZZZZ,ZZ-ZZZZ-ZZZZZZZ,,,,ZZ-ZZ-ZZZZZZZZ,ZZZZZZZZ,ZZZ-ZZZZ,9999,999.99,ZZZ-ZZZZ,-9.99,-9.99,ZZZ-ZZZZ,99:99:99,,99,99,,9.99,,-99.99,,,,,,9.99,ZZZ-ZZZZ,-9.99,-9.99,9.99,9.99,,ZZZ,,,,,,,ZZZ,ZZZ,,,,,"""
csv_data = "\n".join([record_] * 173) + "\n"

for _ in range(2):
    iterator_ = pd.read_csv(StringIO(csv_data), header=None, engine="c",
                            dtype=object, chunksize=84, na_filter=True)
    for chunk_ in iterator_:
        print chunk_.iloc[0, 0], chunk_.iloc[-1, 0]
    print ">>>NEXT"
```
The problem seems to be in parser_trim_buffers, as it does not move the word pointers when the stream buffer is reallocated. If I swap the blocks at L1224--L1237 (/* trim stream */) and L1239--L1256 (/* trim words, word_starts */) and then copy the initialization of parser->words from make_stream_space into the trim-stream block (as shown in the snippet below), the problem goes away. Here is a new version of parser_trim_buffers:

```c
int parser_trim_buffers(parser_t *self) {
    /*
      Free memory
     */
    size_t new_cap;
    void *newptr;
    int i;

    /* trim words, word_starts */
    new_cap = _next_pow2(self->words_len) + 1;
    if (new_cap < self->words_cap) {
        TRACE(("parser_trim_buffers: new_cap < self->words_cap\n"));
        newptr = safe_realloc((void*) self->words, new_cap * sizeof(char*));
        if (newptr == NULL) {
            return PARSER_OUT_OF_MEMORY;
        } else {
            self->words = (char**) newptr;
        }
        newptr = safe_realloc((void*) self->word_starts, new_cap * sizeof(int));
        if (newptr == NULL) {
            return PARSER_OUT_OF_MEMORY;
        } else {
            self->word_starts = (int*) newptr;
            self->words_cap = new_cap;
        }
    }

    /* trim stream */
    new_cap = _next_pow2(self->stream_len) + 1;
    TRACE(("parser_trim_buffers: new_cap = %zu, stream_cap = %zu, lines_cap = %zu\n",
           new_cap, self->stream_cap, self->lines_cap));
    if (new_cap < self->stream_cap) {
        TRACE(("parser_trim_buffers: new_cap < self->stream_cap, calling safe_realloc\n"));
        newptr = safe_realloc((void*) self->stream, new_cap);
        if (newptr == NULL) {
            return PARSER_OUT_OF_MEMORY;
        } else {
            if (self->stream != newptr) {
                /* realloc moved the buffer: rebase pword_start and every
                   word pointer onto the new block, using the offsets
                   recorded in word_starts (same as make_stream_space). */
                self->pword_start = (char*) newptr + self->word_start;
                for (i = 0; i < self->words_len; ++i) {
                    self->words[i] = (char*) newptr + self->word_starts[i];
                }
            }
            self->stream = newptr;
            self->stream_cap = new_cap;
        }
    }

    ...

    return 0;
}
```
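The essential change is the rebase loop borrowed from make_stream_space: whenever safe_realloc moves self->stream, self->pword_start and every entry of self->words are recomputed from the offsets stored in self->word_starts, so nothing is left pointing into the freed block. (One nit: newptr is a void *, so the pointer arithmetic strictly needs the (char *) casts shown above; bare void * arithmetic only compiles as a GCC extension.)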
Awesome that you were able to fix your segfault! Here's what I would do now: …
I issued a pull request (#13788); however, there was one failed test and several deprecation warnings.
That test is OK; it's not currently engaged on Travis and fails on linux/macosx (and it already has an outstanding issue). Deprecation warnings are OK as well (though I do try to eliminate them periodically).
Dear developers,
I am using pandas in an application where I need to process large CSV files (around 1 GB each) which have approximately 800k records and 400+ columns of mixed type. That is why I decided to use the data-iterator functionality of pd.read_csv(). When experimenting with chunksize, my application seems to crash somewhere inside a TextReader__string_convert call. Here is an archive with a sample CSV data file that seems to cause the crash (it also includes crash dump reports, a copy of the example, and a snapshot of the versions of installed Python packages):
read_csv_crash.tar.gz
Code Sample
To run this example you would have to extract dataset.csv from the supplied archive. Please note that this crash does not seem to occur when the file is less than 260 KiB. Also note that playing with the low_memory setting did not alleviate the problem.
Expected Output
This code sample outputs this:
Python greetings string
OSX version

output of pd.show_versions()
The output of this call is attached to this issue: pd_show_versions.txt