BUG: set src->buffer = NULL after garbage collecting it in buffer_rd_… #12135
Conversation
How large is the data file that's required to reproduce this? Since we understand the underlying cause now, we may be able to construct a much smaller one, maybe by appending garbage bytes to the end of a valid gzip file (which will cause the decompressor to fail and raise an exception, triggering this error, hopefully).
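The garbage-appending idea suggested above can be sketched with the stdlib `gzip` module (the file contents here are made up for illustration; on CPython 3.x the decompressor raises `BadGzipFile`, an `OSError` subclass, when it encounters non-gzip trailing bytes):

```python
import gzip

# A valid gzip blob with non-gzip bytes appended after the trailer.
good = gzip.compress(b"a,b\n1,2\n3,4\n")
bad = good + b"\x01\x02\x03\x04"   # trailing garbage

# The intact blob round-trips fine.
assert gzip.decompress(good) == b"a,b\n1,2\n3,4\n"

# The corrupted blob makes the decompressor raise after the first member.
try:
    gzip.decompress(bad)
    failed = False
except (OSError, EOFError):        # BadGzipFile is an OSError subclass
    failed = True
```

This is the kind of small, self-contained corrupted input a unit test could build on the fly instead of shipping a 10k-byte data file.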
Which is to say, I would hate to merge this without a unit test.
The example is 10k bytes.
I tried removing parts of the original poster's data string and tried creating a small corrupted gzip file, but was unable to reproduce the segfault. I'll try a few more small corrupted gzip files to see if I can get something small enough to fit in a unit test.
This is tricky because the first file read has to succeed to hit this bug, and the second has to fail. So it's a buffer sizing issue (gzip has an internal buffer size).
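The succeed-then-fail sequence described here can be mimicked at the Python level by truncating a gzip stream mid-member: early chunked reads decompress cleanly, and a later one runs off the cut end. This is only a sketch of the failure shape; the sizes and the use of `os.urandom` are arbitrary choices, not taken from the PR:

```python
import gzip
import io
import os

# Incompressible payload, so the compressed stream stays large enough
# to span several read chunks.
payload = os.urandom(8192)
blob = gzip.compress(payload)
corrupt = blob[: len(blob) // 2]     # cut the stream in the middle

f = gzip.GzipFile(fileobj=io.BytesIO(corrupt))
first = f.read(1024)                 # early chunk: decompresses fine

got_eof = False
try:
    while f.read(1024):              # a later chunk hits the truncation
        pass
except EOFError:                     # "Compressed file ended before the
    got_eof = True                   #  end-of-stream marker was reached"
```

In the C parser the analogous situation is a successful first `buffer_rd_bytes` call followed by a failing one, which is why the buffer size relative to the input matters for triggering the bug.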
@@ -543,3 +543,5 @@ of columns didn't match the number of series provided (:issue:`12039`).
- Big in ``.style`` indexes and multi-indexes not appearing (:issue:`11655`)
- Bug in ``.skew`` and ``.kurt`` due to roundoff error for highly similar values (:issue:`11974`)
- Bug in ``buffer_rd_bytes()`` set src->buffer = NULL after garbage collecting it (:issue:`12098`)
more like: segfault in repeated reading of a gzipped input file.
Triggering the segfault varies depending on which python and pandas versions are used. When I run the example with python2.7.11 and pandas 0.17.1, read_csv executes 5 times in the loop before the segfault occurs. The loop executes 9 times before the segfault when run with the latest master of pandas. On a different Mac running python 2.7.10 and pandas 0.17.1 it loops once, then segfaults on the second call to read_csv. The example runs without segfaulting under python 3.5.1 using BytesIO instead of StringIO.
That's fine, as long as the test is consistently flaky on some platforms. Since we are dealing with a double-free issue here, it will cause non-deterministic failures. If you run valgrind you'll probably see the invalid deallocation.
I was able to make a test with a shorter data string that segfaults without the src->buffer = NULL; bug fix. I put the test in TestCParserHighMemory and in TestCParserLowMemory. If that is the proper place for them, I will update my PR.
looks good, though use …
@selasley thanks!
@selasley Looking at this test now, the …
I believe it was meant for python2 based on the Jan 25 comment.
@jreback can you confirm the "this test is only relevant in py2" hypothesis? or suggest someone else to ask about it?
yeah might be py2 only
@TomAugspurger or @jorisvandenbossche can I get a third opinion before ripping this test out?
No idea. I don't think I'm familiar with the original issue.
Issue #12098
Add src->buffer = NULL; after garbage collecting src->buffer in the buffer_rd_bytes routine in io.c to fix the segfault.