BUG: set src->buffer = NULL after garbage collecting it in buffer_rd_… #12135

Closed · wants to merge 4 commits

Conversation

@selasley (Contributor)

Issue #12098

Add src->buffer = NULL; after garbage collecting src->buffer in the buffer_rd_bytes routine in io.c to fix the segfault

@wesm (Member) commented Jan 25, 2016

How large is the data file that's required to reproduce this? Since we understand the underlying cause now, we may be able to construct a much smaller one, maybe by appending garbage bytes to the end of a valid gzip file (which will cause the decompressor to fail and raise an exception, triggering this error, hopefully).
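
For illustration, a minimal sketch of that suggestion (hypothetical sizes and byte values, assuming Python 3's gzip module and a pandas build without the fix):

import gzip
import io

import pandas as pd

# Compress a valid whitespace-delimited payload, then append garbage
# bytes so the decompressor fails partway through rather than up front.
valid = gzip.compress(b'a b c\n' + b'1 2 3\n' * 5000)
corrupt = valid + b'\x55' * 300  # trailing junk; the amount is a guess

try:
    pd.read_csv(io.BytesIO(corrupt), compression='gzip',
                delim_whitespace=True)
except Exception as exc:
    # A fixed build raises cleanly here; a pre-fix build could
    # double-free src->buffer on a later buffer fill and segfault.
    print(type(exc).__name__, exc)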

@wesm (Member) commented Jan 25, 2016

Which is to say I would hate to merge this without a unit test.

@jreback (Contributor) commented Jan 25, 2016

The example is 10k bytes, so we could use that or try to construct one.

@selasley (Contributor, Author)

I tried removing parts of the data string from the original poster and creating a small corrupted gzip file, but was unable to reproduce the segfault. I'll try a few more small corrupted gzip files to see if I can get something small enough to fit in a unit test.

@wesm (Member) commented Jan 25, 2016

This is tricky because the first file read has to succeed to hit this bug, and the second has to fail. So it's a buffer-sizing issue (gzip has an internal buffer size).
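
Roughly, outside pandas (a sketch assuming Python 3; the 8 KiB chunk is an arbitrary stand-in for the parser's internal buffer, and exact exception types vary by Python version):

import gzip
import io

# ~30 KB of valid data forms one gzip member; the appended junk is
# only reached after that member has been fully decompressed.
payload = gzip.compress(b'1 2 3\n' * 5000) + b'\x55' * 300
stream = gzip.GzipFile(fileobj=io.BytesIO(payload))

print(len(stream.read(8192)))  # the first buffered read succeeds
try:
    while stream.read(8192):   # a later read runs into the garbage
        pass
except OSError as exc:         # raised once gzip reaches the junk
    print('decompressor failed:', exc)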

@jreback added the Bug and IO CSV (read_csv, to_csv) labels Jan 25, 2016
@jreback added this to the 0.18.0 milestone Jan 25, 2016
@@ -543,3 +543,5 @@ of columns didn't match the number of series provided (:issue:`12039`).
- Bug in ``.style`` indexes and multi-indexes not appearing (:issue:`11655`)

- Bug in ``.skew`` and ``.kurt`` due to roundoff error for highly similar values (:issue:`11974`)

- Bug in ``buffer_rd_bytes()``: set src->buffer = NULL after garbage collecting it (:issue:`12098`)
Review comment (Contributor):

More like: segfault in repeated reading of a gzipped input file.

@selasley (Contributor, Author)

Triggering the segfault varies depending on which Python and pandas versions are used. When I run the example with Python 2.7.11 and pandas 0.17.1, read_csv executes 5 times in the loop before the segfault occurs. The loop executes 9 times before the segfault when run with the latest master of pandas. On a different Mac running Python 2.7.10 and pandas 0.17.1, it loops once and then segfaults on the second call to read_csv. The example runs without segfaulting under Python 3.5.1 using BytesIO instead of StringIO.

@wesm (Member) commented Jan 25, 2016

That's fine, as long as the test is consistently flaky on some platforms. Since we are dealing with a double-free issue here, it will cause non-deterministic failures. If you run valgrind, you'll probably see the invalid deallocation.

@selasley (Contributor, Author)

I was able to make a test with a shorter data string that segfaults without the src->buffer = NULL; fix:

def test_buffer_rd_bytes(self):
    # GH 12098
    # src->buffer can be freed twice, leading to a segfault, if a corrupt
    # gzip file is read with read_csv and the buffer is filled more than
    # once before gzip raises an exception

    data = '\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x03\xED\xC3\x41\x09' \
           '\x00\x00\x08\x00\xB1\xB7\xB6\xBA\xFE\xA5\xCC\x21\x6C\xB0' \
           '\xA6\x4D' + '\x55' * 267 + \
           '\x7D\xF7\x00\x91\xE0\x47\x97\x14\x38\x04\x00' \
           '\x1f\x8b\x08\x00VT\x97V\x00\x03\xed]\xefO'
    for _ in range(100):
        try:
            pd.read_csv(StringIO(data),
                        compression='gzip',
                        delim_whitespace=True)
        except Exception:
            pass

I put the test in TestCParserHighMemory and in TestCParserLowMemory. If that is the proper place for them, I will update my PR.

@jreback (Contributor) commented Jan 26, 2016

Looks good, though use self.read_csv, which will trigger calls based on the parser engine (e.g. C/Python).
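
That is, something along these lines inside each parser test class (a sketch; self.read_csv is supplied by the test class and routes to the engine under test):

for _ in range(100):
    try:
        self.read_csv(StringIO(data),
                      compression='gzip',
                      delim_whitespace=True)
    except Exception:
        pass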

@jreback closed this in f32b44f Jan 27, 2016
@jreback (Contributor) commented Jan 27, 2016

@selasley thanks!

@selasley deleted the buffer_rd_bytes_fix branch January 27, 2016 16:56
@jbrockmendel (Member)

@selasley Looking at this test now, the except Exception: pass is hiding what looks like a bytes/str problem in py3. Was this test/problem supposed to be py2-specific?

@selasley (Contributor, Author)

I believe it was meant for Python 2, based on the Jan 25 comment.

@jbrockmendel (Member)

@jreback can you confirm the "this test is only relevant in py2" hypothesis? or suggest someone else to ask about it?

@jreback (Contributor) commented Sep 21, 2019

yeah might be py2 only

@jbrockmendel (Member)

@TomAugspurger or @jorisvandenbossche can I get a third opinion before ripping this test out?

@TomAugspurger (Contributor)

No idea. I don't think I'm familiar with the original issue.
