BUG: set src->buffer = NULL after garbage collecting it in buffer_rd_… #12135

Closed · wants to merge 4 commits

Conversation

@selasley (Contributor)

Issue #12098

Add src->buffer = NULL; after garbage collecting src->buffer in the buffer_rd_bytes routine in io.c to fix the segfault

@wesm (Member) commented Jan 25, 2016

How large is the data file that's required to reproduce this? Since we understand the underlying cause now, we may be able to construct a much smaller one, maybe by appending garbage bytes to the end of a valid gzip file (which will cause the decompressor to fail and raise an exception, triggering this error, hopefully).
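
For illustration, a minimal sketch of that suggestion (hypothetical sizes and byte values, assuming Python 3's gzip module and a pandas build without the fix):

import gzip
import io

import pandas as pd

# Compress a valid whitespace-delimited payload, then append garbage
# bytes so the decompressor fails partway through rather than up front.
valid = gzip.compress(b'a b c\n' + b'1 2 3\n' * 5000)
corrupt = valid + b'\x55' * 300  # trailing junk; the amount is a guess

try:
    pd.read_csv(io.BytesIO(corrupt), compression='gzip',
                delim_whitespace=True)
except Exception as exc:
    # A fixed build raises cleanly here; a pre-fix build could
    # double-free src->buffer on a later buffer fill and segfault.
    print(type(exc).__name__, exc)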

@wesm (Member) commented Jan 25, 2016

Which is to say I would hate to merge this without a unit test.

@jreback (Contributor) commented Jan 25, 2016

The example is 10k bytes, so we could use that or try to construct one.

@selasley (Contributor, Author)

I tried removing parts of the data string from the original poster and creating a small corrupted gzip file, but was unable to reproduce the segfault. I'll try a few more small corrupted gzip files to see if I can get something small enough to fit in a unit test.

@wesm (Member) commented Jan 25, 2016

This is tricky because the first file read has to succeed to hit this bug, and the second has to fail. So it's a buffer-sizing issue (gzip has an internal buffer size).
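
Roughly, outside pandas (a sketch assuming Python 3; the 8 KiB chunk is an arbitrary stand-in for the parser's internal buffer, and exact exception types vary by Python version):

import gzip
import io

# ~30 KB of valid data forms one gzip member; the appended junk is
# only reached after that member has been fully decompressed.
payload = gzip.compress(b'1 2 3\n' * 5000) + b'\x55' * 300
stream = gzip.GzipFile(fileobj=io.BytesIO(payload))

print(len(stream.read(8192)))  # the first buffered read succeeds
try:
    while stream.read(8192):   # a later read runs into the garbage
        pass
except OSError as exc:         # raised once gzip reaches the junk
    print('decompressor failed:', exc)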

@jreback added the Bug and IO CSV (read_csv, to_csv) labels Jan 25, 2016
@jreback added this to the 0.18.0 milestone Jan 25, 2016
@@ -543,3 +543,5 @@ of columns didn't match the number of series provided (:issue:`12039`).
- Bug in ``.style`` indexes and multi-indexes not appearing (:issue:`11655`)

- Bug in ``.skew`` and ``.kurt`` due to roundoff error for highly similar values (:issue:`11974`)

- Bug in ``buffer_rd_bytes()``: set src->buffer = NULL after garbage collecting it (:issue:`12098`)
Review comment (Contributor):

More like: segfault in repeated reading of a gzipped input file.

@selasley (Contributor, Author)

Triggering the segfault varies depending on which Python and pandas versions are used. When I run the example with Python 2.7.11 and pandas 0.17.1, read_csv executes 5 times in the loop before the segfault occurs. The loop executes 9 times before the segfault when run with the latest master of pandas. On a different Mac running Python 2.7.10 and pandas 0.17.1, it loops once and then segfaults on the second call to read_csv. The example runs without segfaulting under Python 3.5.1 using BytesIO instead of StringIO.

@wesm (Member) commented Jan 25, 2016

That's fine, as long as the test is consistently flaky on some platforms. Since we are dealing with a double-free issue here, it will cause non-deterministic failures. If you run valgrind, you'll probably see the invalid deallocation.

@selasley (Contributor, Author)

I was able to make a test with a shorter data string that segfaults without the src->buffer = NULL; fix:

def test_buffer_rd_bytes(self):
    # GH 12098
    # src->buffer can be freed twice, leading to a segfault, if a corrupt
    # gzip file is read with read_csv and the buffer is filled more than
    # once before gzip raises an exception

    data = '\x1F\x8B\x08\x00\x00\x00\x00\x00\x00\x03\xED\xC3\x41\x09' \
           '\x00\x00\x08\x00\xB1\xB7\xB6\xBA\xFE\xA5\xCC\x21\x6C\xB0' \
           '\xA6\x4D' + '\x55' * 267 + \
           '\x7D\xF7\x00\x91\xE0\x47\x97\x14\x38\x04\x00' \
           '\x1f\x8b\x08\x00VT\x97V\x00\x03\xed]\xefO'
    for _ in range(100):
        try:
            pd.read_csv(StringIO(data),
                        compression='gzip',
                        delim_whitespace=True)
        except Exception:
            pass

I put the test in TestCParserHighMemory and in TestCParserLowMemory. If that is the proper place for them, I will update my PR.

@jreback (Contributor) commented Jan 26, 2016

Looks good, though use self.read_csv, which will trigger calls based on the parser engine (e.g. C/Python).
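
That is, something along these lines inside each parser test class (a sketch; self.read_csv is supplied by the test class and routes to the engine under test):

for _ in range(100):
    try:
        self.read_csv(StringIO(data),
                      compression='gzip',
                      delim_whitespace=True)
    except Exception:
        pass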

@jreback closed this in f32b44f Jan 27, 2016
@jreback (Contributor) commented Jan 27, 2016

@selasley thanks!

@selasley deleted the buffer_rd_bytes_fix branch January 27, 2016 16:56
@jbrockmendel (Member)

@selasley Looking at this test now, the except Exception: pass is hiding what looks like a bytes/str problem in py3. Was this test/problem supposed to be py2-specific?

@selasley (Contributor, Author)

I believe it was meant for Python 2, based on the Jan 25 comment.

@jbrockmendel (Member)

@jreback can you confirm the "this test is only relevant in py2" hypothesis? or suggest someone else to ask about it?

@jreback (Contributor) commented Sep 21, 2019

yeah might be py2 only

@jbrockmendel (Member)

@TomAugspurger or @jorisvandenbossche can I get a third opinion before ripping this test out?

@TomAugspurger (Contributor)

No idea. I don't think I'm familiar with the original issue.
