"c"-engine read_table sigsegv when chunksize close to multiple of file length? #9726

ghost · 2015-03-25T11:58:31Z

I have the following script:

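import pandas as pd

# chromosome_file is the path to one tab-separated input file (sample rows are shown further down).
table_generator = pd.io.parsers.read_table(chromosome_file, sep="\t", engine="c",
                                           chunksize=50000, names=["Start", "End", "Strand"],
                                           usecols=[1, 2, 5])

for chunk in table_generator:
    print chunk
    print chromosome_file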

I get an error when the file length is close to a multiple of the chunksize:

fish: 'python exorcised.py' terminated by signal SIGSEGV (Address boundary error)

Changing the chunksize or the lengths of the files allows me to avoid the error.

The error seems to happen just before or while the tiny leftover chunk is read, since the print for the previous chunk was still displayed.
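One way to confirm that the interpreter really dies from SIGSEGV, without taking down the session that launches it, is to run the read in a child process and inspect its exit status. The following is only a sketch, assuming Python 3.5+ on a POSIX system; `run_isolated` is a hypothetical helper name and is not part of pandas.

```python
import signal
import subprocess
import sys


def run_isolated(snippet):
    """Run a Python snippet in a child interpreter.

    Returns (returncode, died_from_segv). Hypothetical helper for illustration.
    """
    proc = subprocess.run([sys.executable, "-c", snippet])
    # On POSIX, a return code of -N means the child was killed by signal N.
    return proc.returncode, proc.returncode == -signal.SIGSEGV
```

Passing the chunked-read script as `snippet` would then report the segfault as a flag instead of crashing the caller.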

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 14.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.14.1
statsmodels: 0.6.1
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.2
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

The files read look like the following:

chrY    59052460    59052659    Keratinocyte_H3K27me3_03    1   +
chrY    246219  246418  Melanocyte_H3K27me3_02  0   -
chrY    9978094 9978293 Melanocyte_H3K27me3_03  1   +
chrY    2472778 2472977 Fibroblast_H3K27me3_01  0   +
chrY    13266277    13266476    Keratinocyte_H3K27me3_03    1   -
chrY    699049  699248  Fibroblast_H3K27me3_02  0   -
chrY    23986624    23986823    Melanocyte_H3K27me3_01  0   -
chrY    562143  562342  Fibroblast_H3K27me3_03  1   +
chrY    23026706    23026905    Melanocyte_H3K27me3_01  0   -
chrY    17509636    17509835    Melanocyte_H3K27me3_03  1   +

With special symbols showing:

chrY^I10500^I10699^IMelanocyte_H3K27me3_01^I0^I+$

ghost · 2015-03-25T12:12:47Z

It only seems to happen reliably when the file length is close to the first multiple of the chunksize, for some reason, but I think I remember it happening for larger multiples too. Is anybody able to reproduce?

Two file lengths (line counts) for which it always happens:

50190 output_files/testing/chip/chr7.bed
50015 output_files/testing/chip/chr8.bed
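For anyone trying to reproduce, a synthetic file with one of the line counts above can stand in for the real data. This is a minimal sketch, assuming Python 3 and a recent pandas; `make_bed_text` and `chunk_lengths` are hypothetical helper names, the rows are randomly generated rather than taken from the original files, and `pd.read_csv` with `sep="\t"` is used in place of `read_table`. Whether it crashes on pandas 0.15.2 is not verified here.

```python
import io
import random

import pandas as pd


def make_bed_text(n_rows, seed=0):
    """Build n_rows of tab-separated, BED-like text (synthetic, made-up values)."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_rows):
        start = rng.randint(0, 59000000)
        strand = rng.choice("+-")
        lines.append("chrY\t%d\t%d\tSample_01\t0\t%s" % (start, start + 199, strand))
    return "\n".join(lines) + "\n"


def chunk_lengths(text, chunksize=50000):
    """Read the text in chunks with the C engine; return the row count of each chunk."""
    reader = pd.read_csv(
        io.StringIO(text),
        sep="\t",
        engine="c",
        chunksize=chunksize,
        names=["Start", "End", "Strand"],
        usecols=[1, 2, 5],
    )
    return [len(chunk) for chunk in reader]
```

On a pandas version without the bug, `chunk_lengths(make_bed_text(50015))` should give one full chunk followed by the small leftover chunk; on an affected version the process would be expected to crash around the second chunk.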

jreback · 2015-12-11T15:21:10Z

dupe of #11793

jreback added the Bug and IO CSV (read_csv, to_csv) labels Dec 11, 2015

jreback mentioned this issue Dec 11, 2015

Segfault in pd.read_csv() using chunksize parameter #11793

Closed

jreback closed this as completed Dec 11, 2015

jreback added the Duplicate Report label Dec 11, 2015
