Segfault in pd.read_csv() using chunksize parameter #11793

Closed
OEP opened this issue Dec 8, 2015 · 6 comments

Labels
IO CSV read_csv, to_csv Segfault Non-Recoverable Error

Comments

@OEP

OEP commented Dec 8, 2015

Here is my repro script:

import pandas as pd
import sys

for df in pd.read_csv(sys.argv[1], chunksize=1000):
    print(df[['sum']].sum())

and I am attaching small.csv.gz, the smallest data set I know of that reproduces this segfault. Running python repro.py small.csv.gz reproduces the segfault in 0.17.1 on OSX Yosemite. I can't reproduce it with 0.13.1 or 0.17.1 on Ubuntu 14.04. Without chunksize, the file reads normally.
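For anyone re-running the repro many times, it can help to launch it in a child process so the crash is reported instead of killing the calling session. The following is a minimal sketch, assuming Python 3 on a POSIX system (the report above used Python 2.7); the helper name run_and_report is hypothetical and not part of pandas.

```python
import signal
import subprocess


def run_and_report(argv):
    """Run argv in a child process.

    Returns the name of the signal that killed the child (for example
    "SIGSEGV"), or None if the child exited on its own.
    """
    proc = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    # On POSIX, a negative return code -N means the child was
    # terminated by signal N.
    if proc.returncode < 0:
        return signal.Signals(-proc.returncode).name
    return None


# Example call (file names taken from the report above):
# run_and_report(["python", "repro.py", "small.csv.gz"])
```

A result of "SIGSEGV" would correspond to the crash described here; None only means the process was not killed by a signal, not that the output was correct.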

I tried my best to narrow it down. If you trim this file to under 2000 lines, the segfault does not occur. Once it goes over 2000 lines I start to see the segfault. Adding lines 1000 at a time, I notice the segfault is intermittent (I see it again at 6002 lines). It seems to me that the segfault does not occur when the file contains a multiple of chunksize items.
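To probe the "multiple of chunksize" observation, one could generate files with a controlled number of rows and compare runs. The sketch below uses only the standard library. The function names, the "id" column, and the generated values are hypothetical; only the "sum" column name comes from the repro script. chunk_lengths is plain arithmetic on row counts and does not model pandas internals.

```python
import csv
import gzip


def chunk_lengths(n_rows, chunksize):
    """Row counts of the chunks when n_rows data rows are read chunksize at a time."""
    full, rest = divmod(n_rows, chunksize)
    # A trailing partial chunk exists only when n_rows is not a multiple
    # of chunksize.
    return [chunksize] * full + ([rest] if rest else [])


def write_test_csv(path, n_rows):
    """Write a gzipped CSV with one header line and n_rows data rows."""
    with gzip.open(path, "wt", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id", "sum"])
        for i in range(n_rows):
            writer.writerow([i, i % 7])


print(chunk_lengths(2000, 1000))  # [1000, 1000]
print(chunk_lengths(2001, 1000))  # [1000, 1000, 1]
```

Note that the line counts quoted above may include the header line, so the number of data rows can be one less than the number of lines in the file.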

I installed via pip install pandas. I also repro'd this on latest master (43edd83) on OSX Yosemite.

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.4
pip: 1.5.6
setuptools: 8.2.1
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.15.1
statsmodels: None
IPython: 2.3.1
sphinx: 1.1.2
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
Jinja2: None

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000110bbf0bf

VM Regions Near 0x110bbf0bf:
    MALLOC_LARGE           0000000110b3f000-0000000110bbf000 [  512K] rw-/rwx SM=PRV  
--> 
    MALLOC_LARGE           0000000110cae000-0000000110e2e000 [ 1536K] rw-/rwx SM=PRV  

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   parser.so                       0x000000011069af3f __pyx_f_6pandas_6parser_10TextReader__convert_with_dtype + 2191
1   parser.so                       0x00000001106977ed __pyx_f_6pandas_6parser_10TextReader__convert_tokens + 3293
2   parser.so                       0x00000001106c47fe __pyx_pf_6pandas_6parser_10TextReader_16_convert_column_data + 3006
3   parser.so                       0x000000011069572b __pyx_f_6pandas_6parser_10TextReader__read_rows + 1371
4   parser.so                       0x0000000110693f65 __pyx_f_6pandas_6parser_10TextReader__read_low_memory + 869
5   parser.so                       0x00000001106c2b9e __pyx_pw_6pandas_6parser_10TextReader_9read + 174
6   org.python.python               0x000000010e1f77e6 PyEval_EvalFrameEx + 14392
7   org.python.python               0x000000010e1f3d7a PyEval_EvalCodeEx + 1409
8   org.python.python               0x000000010e1fa59d fast_function + 117
9   org.python.python               0x000000010e1f7400 PyEval_EvalFrameEx + 13394
10  org.python.python               0x000000010e1f3d7a PyEval_EvalCodeEx + 1409
11  org.python.python               0x000000010e1fa59d fast_function + 117
12  org.python.python               0x000000010e1f7400 PyEval_EvalFrameEx + 13394
13  org.python.python               0x000000010e18f67a gen_send_ex + 193
14  org.python.python               0x000000010e1f4525 PyEval_EvalFrameEx + 1399
15  org.python.python               0x000000010e1f3d7a PyEval_EvalCodeEx + 1409
16  org.python.python               0x000000010e1f37f3 PyEval_EvalCode + 54
17  org.python.python               0x000000010e2138a2 run_mod + 53
18  org.python.python               0x000000010e213945 PyRun_FileExFlags + 133
19  org.python.python               0x000000010e2134e2 PyRun_SimpleFileExFlags + 769
20  org.python.python               0x000000010e224c5b Py_Main + 3051
21  libdyld.dylib                   0x00007fff8c26c5c9 start + 1
@jdeschenes
Contributor

Were you able to reproduce this bug in 0.17.0 or 0.16.2?

@jreback added the "IO CSV" (read_csv, to_csv) label Dec 8, 2015
@OEP
Author

OEP commented Dec 9, 2015

@jdeschenes Yes, looks like I can reproduce it for OSX Yosemite on 0.17.0 and 0.16.2.

It works fine for both versions on Ubuntu 14.04.

@jreback
Contributor

jreback commented Dec 11, 2015

this is prob a dupe of #9726 (though your examples are better!)

@OEP
Author

OEP commented Dec 15, 2015

I tried to look into this a little more. I think the segfault is occurring in pandas/parser.pyx at line 1606 at a call to kh_get_str(), inside _try_int64_nogil().

...
if na_filter:
    for i in range(lines):
        COLITER_NEXT(it, word)
        k = kh_get_str(na_hashset, word)
        # in the hash table
...

I think word becomes an invalid reference in the COLITER_NEXT() macro, but I'm not sure what the issue is.

Given that the bug is OSX-only, I thought we might have run into a compiler quirk with clang. However, I can't repro with clang on Linux on a recent checkout.

@ivannz
Contributor

ivannz commented Jul 28, 2016

This one is resolved by PR #13788. I was able to reproduce the crash on 0.18.1 (same crash report as in issue #13703). Running on the build with the mentioned PR did not crash.

edit: The symptoms described by @OEP are exactly the same as in the mentioned issue: same invalid data returned by COLITER_NEXT.

@jbrockmendel added the "Segfault" (Non-Recoverable Error) label Nov 9, 2018
@mroeschke
Member

As mentioned in #11793 (comment), this has been fixed.
