Unexpected segmentation fault in pd.read_csv C-engine #13703

Closed · ivannz opened this issue Jul 19, 2016 · 6 comments
Labels: Bug, IO CSV (read_csv, to_csv)

ivannz (Contributor) commented Jul 19, 2016

Dear developers,

I am using pandas in an application that processes large CSV files (around 1 GB each) with approximately 800k records and 400+ columns of mixed type. That is why I decided to use the chunked-iterator functionality of pd.read_csv(). When experimenting with chunksize, the application crashes somewhere inside a TextReader._string_convert call.

Here is an archive with a sample CSV data file that seems to cause the crash (it also includes crash-dump reports, a copy of the example, and a snapshot of the installed Python package versions).
read_csv_crash.tar.gz

Code Sample

To run this example, extract dataset.csv from the supplied archive.

import pandas as pd
for n_lines in range(82, 87):
    filelike = open("dataset.csv", "r")
    iterator_ = pd.read_csv(filelike, header=None, engine="c",
                            dtype=object, chunksize=n_lines)
    for chunk_ in iterator_:
        print n_lines, chunk_.iloc[0, 0], chunk_.iloc[-1, 0]
    filelike.close()

Please note that the crash does not seem to occur when the file is smaller than 260 KiB. Also note that playing with the low_memory setting did not alleviate the problem.

Expected Output

In practice, the code sample outputs the following and then crashes:

82 9999-9 9999-9
82 9999-9 9999-9
82 9999-9 9999-9
83 9999-9 9999-9
83 9999-9 9999-9
83 9999-9 9999-9
84 9999-9 9999-9
84 9999-9 9999-9
Segmentation fault: 11

Output of pd.show_versions()

The output of this call is attached to this issue.
pd_show_versions.txt

Python interpreter banner

Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin

OS X version

OS Version:            Mac OS X 10.11.5 (15F34)
Model: Macmini6,2, BootROM MM61.0106.B0A, 4 processors, Intel Core i7, 2.6 GHz, 16 GB, SMC 2.8f
gfyoung (Member) commented Jul 25, 2016

  1. For the expected-output section, put what you expected to see, not what you actually saw; underneath that, put the actual output.

  2. Can you try your code sample again with the master branch?

  3. While I cannot reproduce this (with either 0.18.1 or master) on Linux (sorry, no access to a Mac at the moment), the fact that it's crashing with string and object dtype bears a resemblance to an earlier segfault we were seeing in a different part of the code.

If the issue persists on master: in parser.pyx, you can find the string_convert function here. Judging from your versions output, I suspect the segfault is in fact occurring in this function here. If my suspicion is correct, can you further specify which method call is causing the crash?

ivannz (Contributor, Author) commented Jul 25, 2016

Hello, @gfyoung!

I cloned the master branch, but the problem persisted.

I've managed to trace the crash to a read past the end of a large allocated memory block, in either kh_get_str or kh_get_strbox (defined in src/klib/khash.h), called from _string_box_factorize (in parser.pyx), which is reached via the very last branch of _string_convert for the remaining 5 lines of the text input. Which of the two functions is hit depends on the na_filter setting.
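
For context, kh_get_str hashes its key by walking the NUL-terminated string before probing the table, so the very first dereference of a dangling word pointer is enough to fault. Below is a paraphrase (my own simplification, not a verbatim copy) of the X31 string hash from src/klib/khash.h:

/* Paraphrase of khash's X31 string hash (src/klib/khash.h); kh_get_str
   applies it to the key before probing the table. If `s` is a stale
   pointer into a freed or moved stream buffer, the `*s` reads below hit
   unmapped memory and raise EXC_BAD_ACCESS. */
static inline unsigned int str_hash_x31(const char *s)
{
    unsigned int h = (unsigned int) *s;    /* first dereference of the key */
    if (h)
        for (++s; *s; ++s)                 /* walk to the NUL terminator */
            h = (h << 5) - h + (unsigned int) *s;
    return h;
}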

I debug-patched COLITER_NEXT in tokenizer.h with:

#define COLITER_NEXT(iter, word) do { \
    const int i = *iter.line_start++ + iter.col; \
    word = i < *iter.line_start ? iter.words[i]: ""; \
    printf("%d, %p\n", i, (const void*) iter.words[i]); \
    } while(0)

This prints the word index and pointer just before the crash: it printed 0 followed by the address at which the EXC_BAD_ACCESS occurs. Any attempt to print the word returned by COLITER_NEXT results in a segfault. Note that the word is later fed into both kh_* functions.

I also added this to tokenizer.c:

/* Print a hex + ASCII dump of `len` bytes starting at `addr`. */
void _dump(const void *addr, size_t len)
{
    size_t i;
    unsigned char buff[17];
    unsigned char *pc = (unsigned char*)addr;

    printf("%p:\n", addr);
    for (i = 0; i < len; i++) {
        if ((i % 16) == 0) {
            /* flush the ASCII column of the previous row */
            if (i != 0)
                printf("  %s\n", buff);
            printf("  %04zx ", i);
        }
        printf(" %02x", pc[i]);
        /* printable bytes go into the ASCII column, others become '.' */
        if ((pc[i] < 0x20) || (pc[i] > 0x7e)) {
            buff[i % 16] = '.';
        } else {
            buff[i % 16] = pc[i];
        }
        buff[(i % 16) + 1] = '\0';
    }
    /* pad the final row so the ASCII column lines up */
    while ((i % 16) != 0) {
        printf("   ");
        i++;
    }
    printf("  %s\n", buff);
}

I borrowed it, with simplifications, from this gist to dump the memory contents of parser->words and parser->stream during a call to _string_box_factorize. It turns out that just before the crash, the pointers in parser->words point to a memory region starting at the address that causes the crash.
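
For reference, the instrumentation looked roughly like the following; the parser_t fields are the real ones from tokenizer.h, but the helper and its placement are my own (hypothetical) additions:

/* Hypothetical debug hook (my own, not pandas code); parser_t and its
   fields come from tokenizer.h: */
static void _dump_parser_buffers(parser_t *self)
{
    /* the word pointers ... */
    _dump((const void *) self->words,
          (size_t) self->words_len * sizeof(char *));
    /* ... and the character stream they are supposed to point into */
    _dump((const void *) self->stream, (size_t) self->stream_len);
}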

I strongly suspect that this problem is specific to OS X memory allocation.

PS: It seems that end_field records pointers into the stream buffer in parser->words (with the corresponding offsets in parser->word_starts), and that buffer later becomes inaccessible.

PPS: I suspect parser_trim_buffers reallocates the underlying memory but does not re-base the pointers stored in parser->words.
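
To make this suspicion concrete, here is a standalone sketch (my own, not pandas code) of the hazard: pointers into a realloc'd buffer dangle once the block moves and must be recomputed from stored offsets, which is exactly the stream/words/word_starts relationship the parser maintains.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *stream = malloc(1 << 20);      /* large block, likely to move */
    if (stream == NULL) return 1;
    strcpy(stream, "9999-9");

    size_t word_start = 0;               /* offset, like word_starts[i] */
    char *word = stream + word_start;    /* pointer, like words[i] */

    char *newptr = realloc(stream, 64);  /* shrink: the block may relocate */
    if (newptr == NULL) { free(stream); return 1; }
    if (newptr != stream)
        word = newptr + word_start;      /* re-base, or `word` dangles */
    stream = newptr;

    printf("%s\n", word);                /* safe only because of the re-base */
    free(stream);
    return 0;
}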

PPPS: Here is a snippet that does not use the dataset.csv file, using StringIO instead, but still crashes with a segfault.

import pandas as pd
from cStringIO import StringIO
record_ = """9999-9,99:99,,,,ZZ,ZZ,,,ZZZ-ZZZZ,.Z-ZZZZ,-9.99,,,9.99,ZZZZZ,,-99,9,ZZZ-ZZZZ,ZZ-ZZZZ,,9.99,ZZZ-ZZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,999,ZZZ-ZZZZ,,ZZ-ZZZZ,,,,,ZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZ,,,9,9,9,9,99,99,999,999,ZZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZ,9,ZZ-ZZZZ,9.99,ZZ-ZZZZ,ZZ-ZZZZ,,,,ZZZZ,,,ZZ,ZZ,,,,,,,,,,,,,9,,,999.99,999.99,,,ZZZZZ,,,Z9,,,,,,,ZZZ,ZZZ,,,,,,,,,,,ZZZZZ,ZZZZZ,ZZZ-ZZZZZZ,ZZZ-ZZZZZZ,ZZ-ZZZZ,ZZ-ZZZZ,ZZ-ZZZZ,ZZ-ZZZZ,,,999999,999999,ZZZ,ZZZ,,,ZZZ,ZZZ,999.99,999.99,,,,ZZZ-ZZZ,ZZZ-ZZZ,-9.99,-9.99,9,9,,99,,9.99,9.99,9,9,9.99,9.99,,,,9.99,9.99,,99,,99,9.99,9.99,,,ZZZ,ZZZ,,999.99,,999.99,ZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,,,ZZZZZ,ZZZZZ,ZZZ,ZZZ,9,9,,,,,,ZZZ-ZZZZ,ZZZ999Z,,,999.99,,999.99,ZZZ-ZZZZ,,,9.999,9.999,9.999,9.999,-9.999,-9.999,-9.999,-9.999,9.999,9.999,9.999,9.999,9.999,9.999,9.999,9.999,99999,ZZZ-ZZZZ,,9.99,ZZZ,,,,,,,,ZZZ,,,,,9,,,,9,,,,,,,,,,ZZZ-ZZZZ,ZZZ-ZZZZ,,ZZZZZ,ZZZZZ,ZZZZZ,ZZZZZ,,,9.99,,ZZ-ZZZZ,ZZ-ZZZZ,ZZ,999,,,,ZZ-ZZZZ,ZZZ,ZZZ,ZZZ-ZZZZ,ZZZ-ZZZZ,,,99.99,99.99,,,9.99,9.99,9.99,9.99,ZZZ-ZZZZ,,,ZZZ-ZZZZZ,,,,,-9.99,-9.99,-9.99,-9.99,,,,,,,,,ZZZ-ZZZZ,,9,9.99,9.99,99ZZ,,-9.99,-9.99,ZZZ-ZZZZ,,,,,,,ZZZ-ZZZZ,9.99,9.99,9999,,,,,,,,,,-9.9,Z/Z-ZZZZ,999.99,9.99,,999.99,ZZ-ZZZZ,ZZ-ZZZZ,9.99,9.99,9.99,9.99,9.99,9.99,,ZZZ-ZZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZZ,ZZZ-ZZZZZ,ZZZ,ZZZ,ZZZ,ZZZ,9.99,,,-9.99,ZZ-ZZZZ,-999.99,,-9999,,999.99,,,,999.99,99.99,,,ZZ-ZZZZZZZZ,ZZ-ZZZZ-ZZZZZZZ,,,,ZZ-ZZ-ZZZZZZZZ,ZZZZZZZZ,ZZZ-ZZZZ,9999,999.99,ZZZ-ZZZZ,-9.99,-9.99,ZZZ-ZZZZ,99:99:99,,99,99,,9.99,,-99.99,,,,,,9.99,ZZZ-ZZZZ,-9.99,-9.99,9.99,9.99,,ZZZ,,,,,,,ZZZ,ZZZ,,,,,"""
csv_data = "\n".join([record_]*173) + "\n"

for _ in range(2):
    iterator_ = pd.read_csv(StringIO(csv_data), header=None, engine="c",
                            dtype=object, chunksize=84, na_filter=True)
    for chunk_ in iterator_:
        print chunk_.iloc[0, 0], chunk_.iloc[-1, 0]
    print ">>>NEXT"

ivannz (Contributor, Author) commented Jul 25, 2016

The problem seems to be in parser_trim_buffers, as it appears not to move the word pointers when the stream buffer is reallocated.

If I swap the blocks at L1224 -- L1237 (/* trim stream */) and L1239 -- L1256 (/* trim words, word_starts */), and then copy the initialization of parser->words from make_stream_space into the trim-stream block (as shown in the snippet below), the problem goes away. Trimming words and word_starts first guarantees that the offsets in word_starts are still valid at the moment the stream is reallocated, so every word pointer can be re-based against the new buffer.

Here is the new version of parser_trim_buffers:

int parser_trim_buffers(parser_t *self) {
    /*
      Free memory
     */
    size_t new_cap;
    void *newptr;

    int i;

    /* trim words, word_starts */
    new_cap = _next_pow2(self->words_len) + 1;
    if (new_cap < self->words_cap) {
        TRACE(("parser_trim_buffers: new_cap < self->words_cap\n"));
        newptr = safe_realloc((void*) self->words, new_cap * sizeof(char*));
        if (newptr == NULL) {
            return PARSER_OUT_OF_MEMORY;
        } else {
            self->words = (char**) newptr;
        }
        newptr = safe_realloc((void*) self->word_starts, new_cap * sizeof(int));
        if (newptr == NULL) {
            return PARSER_OUT_OF_MEMORY;
        } else {
            self->word_starts = (int*) newptr;
            self->words_cap = new_cap;
        }
    }

    /* trim stream */
    new_cap = _next_pow2(self->stream_len) + 1;
    TRACE(("parser_trim_buffers: new_cap = %zu, stream_cap = %zu, lines_cap = %zu\n",
           new_cap, self->stream_cap, self->lines_cap));
    if (new_cap < self->stream_cap) {
        TRACE(("parser_trim_buffers: new_cap < self->stream_cap, calling safe_realloc\n"));
        newptr = safe_realloc((void*) self->stream, new_cap);
        if (newptr == NULL) {
            return PARSER_OUT_OF_MEMORY;
        } else {
            if (self->stream != newptr) {
                /* realloc moved the buffer: re-base every pointer that
                   pointed into the old stream, using the stored offsets */
                TRACE(("parser_trim_buffers: moving word pointers\n"));

                self->pword_start = (char*) newptr + self->word_start;

                for (i = 0; i < self->words_len; ++i) {
                    self->words[i] = (char*) newptr + self->word_starts[i];
                }
            }

            self->stream = newptr;
            self->stream_cap = new_cap;

        }
    }

...

    return 0;
}

gfyoung (Member) commented Jul 25, 2016

Awesome that you were able to fix your segfault! Here's what I would do now:

  1. run all of the unit tests to see if your changes break any existing functionality

  2. if they don't, then submit this as a PR so that all of us can have a look!

  3. if they do cause failures, then I'll leave it up to you whether you want to investigate the causes. Feel free to put the patch up so that we can clone it and help figure things out.

ivannz (Contributor, Author) commented Jul 25, 2016

I opened a pull request (#13788); however, there was one failing test and several deprecation warnings:

FAIL: test_round_trip_frame_sep (pandas.io.tests.test_clipboard.TestClipboard)

jreback (Contributor) commented Jul 25, 2016

That test is OK; it's not currently enabled on Travis and fails on Linux/macOS (and it already has an outstanding issue).

Deprecation warnings are OK (though I do try to eliminate them periodically).
