Segmentation fault or UnicodeDecodeError when reading csv-file depending on chunksize. #5291

Closed
Hedendahl opened this issue Oct 21, 2013 · 12 comments · Fixed by #18128
Labels
IO CSV (read_csv, to_csv) · Unicode (Unicode strings)

Comments

@Hedendahl

I have encountered an issue with the CSV parser, pandas.io.parsers.read_csv. I get a segmentation fault or a UnicodeDecodeError when reading a CSV file in chunks, and the problem seems to depend on the size of the chunks.
Consider the following code:

import codecs
import csv
import pandas as pd


def create_csv_file(columns, rows):
    # Write a UTF-8 encoded CSV file in which every cell of a row holds
    # the row index as a float, e.g. "0.0,0.0,..." on the first line.
    csv_file_name = 'csv_test_file.csv'
    with codecs.open(csv_file_name, mode='w', encoding='utf_8') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',')

        for row in xrange(rows):
            csv_writer.writerow(
                [float(row)] * columns)

    return csv_file_name


def main():
    """Read the generated CSV file back in chunks of `chunksize` rows."""
    columns = 20
    rows = 10000
    chunksize = 999
    csv_file_name = create_csv_file(columns, rows)
    reader = pd.io.parsers.read_csv(csv_file_name,
                                    header=None,
                                    chunksize=chunksize,
                                    encoding='utf_8')

    # Print a running total of the rows requested so far (Python 2 syntax).
    for x, dataframe in enumerate(reader, 1):
        print x * chunksize


if __name__ == "__main__":
    main()

I get a segmentation fault from the attached code when the chunksize is 999 rows. If the chunksize is decreased to 998 rows, I instead get a UnicodeDecodeError. If the chunksize is increased to 1000 rows, the CSV file is read without problems. My first guess was that the problem appears when the last chunk contains too few rows, but I was surprised to find that reading the CSV file with the following settings,

    columns = 20
    rows = 1000
    chunksize = 99

worked properly.
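
A Python 3 adaptation of the same reproduction, for anyone retrying it today (a sketch only; the report above used Python 2.7 and pandas 0.12, and on a pandas release that contains the fix the loop should finish cleanly):

import csv
import pandas as pd


def create_csv_file(columns, rows, csv_file_name='csv_test_file.csv'):
    # Same data as the original script: each row repeats its own index
    # as a float, e.g. "0.0,0.0,..." on the first line.
    with open(csv_file_name, mode='w', encoding='utf_8', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',')
        for row in range(rows):
            csv_writer.writerow([float(row)] * columns)
    return csv_file_name


def main():
    columns, rows, chunksize = 20, 10000, 999
    reader = pd.read_csv(create_csv_file(columns, rows),
                         header=None,
                         chunksize=chunksize,
                         encoding='utf_8')
    for x, dataframe in enumerate(reader, 1):
        print(x * chunksize)


if __name__ == '__main__':
    main()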

@Hedendahl reopened this Oct 31, 2013
@Hedendahl
Author

I'm using Python 2.7.5 and pandas 0.12.0.

@guyrt
Contributor

guyrt commented Nov 2, 2013

@Hedendahl I'm not getting this error in master.

python test_5291.py 
999
1998
2997
3996
4995
5994
6993
7992
8991
9990
10989

Can you confirm?

@guyrt
Contributor

guyrt commented Nov 2, 2013

May have been this: #5156

@Hedendahl
Author

When I execute the code above with a chunksize of 999 rows, I get the following output:

python2.7 test_pandas_read_csv.py
999
1998
2997
3996
4995
5994
6993
7992
8991
9990
Segmentation fault: 11

If I change the chunksize to 998 rows, I instead get:

python2.7 test_pandas_read_csv.py
998
1996
2994
3992
4990
5988
6986
7984
8982
9980
Traceback (most recent call last):
  File "test_pandas_read_csv.py", line 35, in <module>
    main()
  File "test_pandas_read_csv.py", line 30, in main
    for x, dataframe in enumerate(reader, 1):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 578, in __iter__
    yield self.read(self.chunksize)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas/parser.c:6745)
  File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7146)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas/parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:10642)
  File "parser.pyx", line 1054, in pandas.parser.TextReader._string_convert (pandas/parser.c:10930)
  File "parser.pyx", line 1336, in pandas.parser._string_box_decode (pandas/parser.c:16091)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

@guyrt
Contributor

guyrt commented Nov 5, 2013

I tried the same test (switching the chunksize to 998) on master and it worked.

@Hedendahl could you try this on the latest master rather than version 0.12? You can download it here:
https://github.com/pydata/pandas/archive/master.zip
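
To confirm which build actually gets imported after installing, a quick check (works under both Python 2 and 3):

import pandas
print(pandas.__version__)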

@jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014
@jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@ivannz
Contributor

ivannz commented Jul 28, 2016

I believe PR #13788 could have solved this issue.

edit: I ran the use case on 0.18.1 and it indeed segfaulted with exactly the same kind of fault as in issue #13703. Now, on the latest build with the mentioned PR, it does not crash.

@jreback
Contributor

jreback commented Jul 28, 2016

Can you open a PR with a test based on this issue?

@ivannz
Contributor

ivannz commented Jul 28, 2016

Yes, but is it necessary? This fault occurs on 0.18.1 only for engine="c", and even without the encoding='utf-8' parameter, because the data is ASCII anyway. The following code reproduces the issue:

import pandas as pd
from cStringIO import StringIO
import pandas.util.testing as tm

def test_issue5291(self):
    # This test recreates the segfault in issue #5291.
    n_rows, n_columns, chunksize = 10000, 20, 999
    expected = pd.DataFrame([[float(row)] * n_columns
                             for row in range(n_rows)],
                            dtype=float, columns=None, index=None)
    # Now read the same dataframe in chunks from a CSV.
    data = "\r\n".join(",".join(["%.1f" % (r,)] * n_columns)
                       for r in range(n_rows)) + "\r\n"
    chunks_ = self.read_csv(StringIO(data), header=None,
                            chunksize=chunksize)
    result = pd.concat(chunks_, axis=0, ignore_index=True)
    # Check for data corruption if there was no segfault.
    tm.assert_frame_equal(result, expected)

# Passing the module as `self` makes `self.read_csv` resolve to
# `pd.read_csv`, so the test body runs outside the parser test classes.
test_issue5291(pd)

It is identical to test_parse_trim_buffers() in c_parser_only.py, though I admit that it takes a slightly different path inside _read_rows.

edit: The fault in the original example by @Hedendahl and in this one does not occur on the current master build.

@jreback
Contributor

jreback commented Jul 28, 2016

Why not add a test? Just include it in the same whatsnew entry and put the test right below the other one.

@jreback
Contributor

jreback commented Jul 28, 2016

Actually, instead just take the original test and run it both with and without the encoding parameter.
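
A sketch of what that could look like with pytest parametrization; the test name, the Python 3 io.StringIO usage, and the setup below are illustrative, not the committed test (pandas.util.testing was the helper path at the time; newer pandas exposes it as pandas.testing):

import io
import pytest
import pandas as pd
import pandas.util.testing as tm


# Hypothetical parametrized variant: run the chunked read both with and
# without an explicit encoding, forcing the C engine in both cases.
@pytest.mark.parametrize("encoding", [None, "utf-8"])
def test_issue5291_encoding(encoding):
    n_rows, n_columns, chunksize = 10000, 20, 999
    expected = pd.DataFrame([[float(row)] * n_columns
                             for row in range(n_rows)], dtype=float)
    data = "\r\n".join(",".join(["%.1f" % r] * n_columns)
                       for r in range(n_rows)) + "\r\n"
    chunks = pd.read_csv(io.StringIO(data), header=None,
                         chunksize=chunksize, encoding=encoding,
                         engine="c")
    result = pd.concat(chunks, axis=0, ignore_index=True)
    tm.assert_frame_equal(result, expected)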

@ivannz
Contributor

ivannz commented Jul 28, 2016

@jreback, should I include the extended test in the currently pending PR (#13819) or in the next one?

@jreback
Contributor

jreback commented Jul 28, 2016

No, separate things go in separate PRs.

TomAugspurger pushed a commit that referenced this issue Dec 11, 2017