Segmentation fault or UnicodeDecodeError when reading csv-file depending on chunksize. #5291

Closed
Hedendahl opened this issue Oct 21, 2013 · 12 comments · Fixed by #18128
Labels
IO CSV (read_csv, to_csv) · Unicode (Unicode strings)

Comments

@Hedendahl

I have encountered an issue with the CSV parser, pandas.io.parsers.read_csv. I get a segmentation fault or a UnicodeDecodeError when reading a CSV file in chunks, and the problem seems to depend on the size of the chunks.
Consider the following code:

import codecs
import csv
import pandas as pd


def create_csv_file(columns, rows):
    # Write a UTF-8 encoded CSV file in which every cell of a row holds
    # the row index as a float, e.g. "0.0,0.0,..." on the first line.
    csv_file_name = 'csv_test_file.csv'
    with codecs.open(csv_file_name, mode='w', encoding='utf_8') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',')

        for row in xrange(rows):
            csv_writer.writerow(
                [float(row)] * columns)

    return csv_file_name


def main():
    """Read the generated CSV file back in chunks of `chunksize` rows."""
    columns = 20
    rows = 10000
    chunksize = 999
    csv_file_name = create_csv_file(columns, rows)
    reader = pd.io.parsers.read_csv(csv_file_name,
                                    header=None,
                                    chunksize=chunksize,
                                    encoding='utf_8')

    # Print a running total of the rows requested so far (Python 2 syntax).
    for x, dataframe in enumerate(reader, 1):
        print x * chunksize


if __name__ == "__main__":
    main()

I get a segmentation fault from the attached code when the chunksize is 999 rows. If the chunksize is decreased to 998 rows, I instead get a UnicodeDecodeError. If the chunksize is increased to 1000 rows, the CSV file is read without problems. My first guess was that the problem appears when the last chunk contains too few rows, but I was surprised to find that reading the CSV file with the following settings,

    columns = 20
    rows = 1000
    chunksize = 99

worked properly.
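
A Python 3 adaptation of the same reproduction, for anyone retrying it today (a sketch only; the report above used Python 2.7 and pandas 0.12, and on a pandas release that contains the fix the loop should finish cleanly):

import csv
import pandas as pd


def create_csv_file(columns, rows, csv_file_name='csv_test_file.csv'):
    # Same data as the original script: each row repeats its own index
    # as a float, e.g. "0.0,0.0,..." on the first line.
    with open(csv_file_name, mode='w', encoding='utf_8', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',')
        for row in range(rows):
            csv_writer.writerow([float(row)] * columns)
    return csv_file_name


def main():
    columns, rows, chunksize = 20, 10000, 999
    reader = pd.read_csv(create_csv_file(columns, rows),
                         header=None,
                         chunksize=chunksize,
                         encoding='utf_8')
    for x, dataframe in enumerate(reader, 1):
        print(x * chunksize)


if __name__ == '__main__':
    main()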

@Hedendahl reopened this Oct 31, 2013
@Hedendahl
Author

I'm using Python 2.7.5 and pandas 0.12.0.

@guyrt
Contributor

guyrt commented Nov 2, 2013

@Hedendahl I'm not getting this error in master.

python test_5291.py 
999
1998
2997
3996
4995
5994
6993
7992
8991
9990
10989

Can you confirm?

@guyrt
Contributor

guyrt commented Nov 2, 2013

May have been this: #5156

@Hedendahl
Author

When I execute the code above with a chunksize of 999 rows, I get the following output:

python2.7 test_pandas_read_csv.py
999
1998
2997
3996
4995
5994
6993
7992
8991
9990
Segmentation fault: 11

If I change the chunksize to 998 rows, I instead get:

python2.7 test_pandas_read_csv.py
998
1996
2994
3992
4990
5988
6986
7984
8982
9980
Traceback (most recent call last):
  File "test_pandas_read_csv.py", line 35, in <module>
    main()
  File "test_pandas_read_csv.py", line 30, in main
    for x, dataframe in enumerate(reader, 1):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 578, in __iter__
    yield self.read(self.chunksize)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas/parser.c:6745)
  File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7146)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas/parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:10642)
  File "parser.pyx", line 1054, in pandas.parser.TextReader._string_convert (pandas/parser.c:10930)
  File "parser.pyx", line 1336, in pandas.parser._string_box_decode (pandas/parser.c:16091)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

@guyrt
Contributor

guyrt commented Nov 5, 2013

I tried the same test (switching the chunksize to 998) on master and it worked.

@Hedendahl could you try this on the latest master rather than version 0.12? You can download it here:
https://github.com/pydata/pandas/archive/master.zip
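
To confirm which build actually gets imported after installing, a quick check (works under both Python 2 and 3):

import pandas
print(pandas.__version__)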

@jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014
@jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@ivannz
Contributor

ivannz commented Jul 28, 2016

I believe PR #13788 could have solved this issue.

edit: I ran the use case on 0.18.1 and it indeed segfaulted with exactly the same kind of fault as in issue #13703. Now, on the latest build with the mentioned PR, it does not crash.

@jreback
Contributor

jreback commented Jul 28, 2016

Can you open a PR with a test based on this issue?

@ivannz
Contributor

ivannz commented Jul 28, 2016

Yes, but is it necessary? This fault occurs on 0.18.1 only for engine="c", and even without the encoding='utf-8' parameter, because the data is ASCII anyway. The following code reproduces the issue:

import pandas as pd
from cStringIO import StringIO
import pandas.util.testing as tm

def test_issue5291(self):
    # This test recreates the segfault in issue #5291.
    n_rows, n_columns, chunksize = 10000, 20, 999
    expected = pd.DataFrame([[float(row)] * n_columns
                             for row in range(n_rows)],
                            dtype=float, columns=None, index=None)
    # Now read the same dataframe in chunks from a CSV.
    data = "\r\n".join(",".join(["%.1f" % (r,)] * n_columns)
                       for r in range(n_rows)) + "\r\n"
    chunks_ = self.read_csv(StringIO(data), header=None,
                            chunksize=chunksize)
    result = pd.concat(chunks_, axis=0, ignore_index=True)
    # Check for data corruption if there was no segfault.
    tm.assert_frame_equal(result, expected)

# Passing the module as `self` makes `self.read_csv` resolve to
# `pd.read_csv`, so the test body runs outside the parser test classes.
test_issue5291(pd)

It is identical to test_parse_trim_buffers() in c_parser_only.py, though I admit that it takes a slightly different path inside _read_rows.

edit: The fault in the original example by @Hedendahl and in this one does not occur on the current master build.

@jreback
Contributor

jreback commented Jul 28, 2016

Why not add a test? Just include it in the same whatsnew entry and put the test right below the other one.

@jreback
Contributor

jreback commented Jul 28, 2016

Actually, instead just take the original test and run it both with and without the encoding parameter.
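
A sketch of what that could look like with pytest parametrization; the test name, the Python 3 io.StringIO usage, and the setup below are illustrative, not the committed test (pandas.util.testing was the helper path at the time; newer pandas exposes it as pandas.testing):

import io
import pytest
import pandas as pd
import pandas.util.testing as tm


# Hypothetical parametrized variant: run the chunked read both with and
# without an explicit encoding, forcing the C engine in both cases.
@pytest.mark.parametrize("encoding", [None, "utf-8"])
def test_issue5291_encoding(encoding):
    n_rows, n_columns, chunksize = 10000, 20, 999
    expected = pd.DataFrame([[float(row)] * n_columns
                             for row in range(n_rows)], dtype=float)
    data = "\r\n".join(",".join(["%.1f" % r] * n_columns)
                       for r in range(n_rows)) + "\r\n"
    chunks = pd.read_csv(io.StringIO(data), header=None,
                         chunksize=chunksize, encoding=encoding,
                         engine="c")
    result = pd.concat(chunks, axis=0, ignore_index=True)
    tm.assert_frame_equal(result, expected)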

@ivannz
Contributor

ivannz commented Jul 28, 2016

@jreback, should I include the extended test in the currently pending PR (#13819) or in the next one?

@jreback
Contributor

jreback commented Jul 28, 2016

No, separate things go in separate PRs.

TomAugspurger pushed a commit that referenced this issue Dec 11, 2017