read_csv parse issues with \r line ending and quoted items #3453

sandbox · 2013-04-25T05:22:37Z

read_csv appears to mishandle quoted items that contain the separator when the line ending is \r:

>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'), header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 399, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 215, in _read
    return parser.read()
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 631, in read
    ret = self._engine.read(nrows)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 954, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 644, in pandas._parser.TextReader.read (pandas/src/parser.c:5925)
  File "parser.pyx", line 666, in pandas._parser.TextReader._read_low_memory (pandas/src/parser.c:6145)
  File "parser.pyx", line 719, in pandas._parser.TextReader._read_rows (pandas/src/parser.c:6750)
  File "parser.pyx", line 706, in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6634)
  File "parser.pyx", line 1572, in pandas._parser.raise_parser_error (pandas/src/parser.c:17055)
pandas._parser.CParserError: Error tokenizing data. C error: Expected 3 fields in line 2, saw 4

EXPECTED BEHAVIOR:

>>> pd.read_csv(StringIO.StringIO(' a,b,c\n"a,b","e,d","f,f"'), header=None)

     0    1    2
0    a    b    c
1  a,b  e,d  f,f

With a \r line ending, the result should be the same as with \n.
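
For reference, Python's standard-library csv module produces the same field boundaries for both line endings. This is only a minimal Python 3 sketch (the sessions in this report are Python 2). The helper name `tokenize` is hypothetical, and the csv module is a separate parser from the C tokenizer in pandas, so it illustrates the expected result rather than how pandas works internally.

```python
import csv
import io


def tokenize(text):
    """Return the rows that the stdlib csv reader produces for `text`."""
    # newline='' leaves the line endings untranslated, so the csv reader
    # itself sees the trailing \r or \n and ends the record there.
    return list(csv.reader(io.StringIO(text, newline='')))


cr_rows = tokenize(' a,b,c\r"a,b","e,d","f,f"')
lf_rows = tokenize(' a,b,c\n"a,b","e,d","f,f"')

# Both inputs should yield two records of three fields each,
# with the commas inside the quotes kept as part of the field.
print(cr_rows == lf_rows)
print(cr_rows)
```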

This may belong in a separate bug report, but a possibly related issue occurs when header=None is omitted:

>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'))
     a    b    c
"a  b"  e,d  f,f

The above shows part of the first quoted item being used as the index column. The following shows what happens when we tell pandas to use index_col=False:

>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'), index_col=False)
    a   b    c
0  "a  b"  e,d

EXPECTED BEHAVIOR:

>>> pd.read_csv(StringIO.StringIO(' a,b,c\n"a,b","e,d","f,f"'))
     a    b    c
0  a,b  e,d  f,f

and with index_col=False

>>> pd.read_csv(StringIO.StringIO(' a,b,c\n"a,b","e,d","f,f"'), index_col=False)
     a    b    c
0  a,b  e,d  f,f

Here is my system information, in case it is needed:

>>> pd.__version__
'0.10.1'
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
>>> sys.platform
'darwin'
>>> os.name
'posix'


sandbox · 2013-04-25T05:31:20Z

Testing this with the latest from GitHub gives me the same issues:

>>> pd.__version__
'0.12.0.dev-1e2b447'
>>> import StringIO
>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'), header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 216, in _read
    return parser.read()
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 634, in read
    ret = self._engine.read(nrows)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 958, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas/src/parser.c:6014)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas/src/parser.c:6231)
  File "parser.pyx", line 729, in pandas._parser.TextReader._read_rows (pandas/src/parser.c:6833)
  File "parser.pyx", line 716, in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6718)
  File "parser.pyx", line 1582, in pandas._parser.raise_parser_error (pandas/src/parser.c:17131)
pandas._parser.CParserError: Error tokenizing data. C error: Expected 3 fields in line 2, saw 4
>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'))
     a    b    c
"a  b"  e,d  f,f
>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'), index_col=False)
    a   b    c
0  "a  b"  e,d

sandbox · 2013-04-25T05:37:32Z

I can also confirm that this error occurs in 0.11.0:

>>> pd.__version__
'0.11.0'
>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'), header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 216, in _read
    return parser.read()
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 633, in read
    ret = self._engine.read(nrows)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 957, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas/src/parser.c:5921)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas/src/parser.c:6138)
  File "parser.pyx", line 729, in pandas._parser.TextReader._read_rows (pandas/src/parser.c:6740)
  File "parser.pyx", line 716, in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6625)
  File "parser.pyx", line 1582, in pandas._parser.raise_parser_error (pandas/src/parser.c:17029)
pandas._parser.CParserError: Error tokenizing data. C error: Expected 3 fields in line 2, saw 4
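
On affected versions, one possible workaround is to rewrite the line endings as \n before handing the text to read_csv, since the \n input parses as expected above. The sketch below is an assumption on my part rather than anything pandas provides; `normalize_newlines` is a hypothetical helper name, and it will also rewrite a carriage return that legitimately sits inside a quoted item.

```python
import re


def normalize_newlines(text):
    """Rewrite CRLF and lone CR line endings as LF."""
    # Match CRLF first (via the optional \n) so it becomes a single \n
    # rather than two. Caveat: a CR inside a quoted item is rewritten too.
    return re.sub(r'\r\n?', '\n', text)


raw = ' a,b,c\r"a,b","e,d","f,f"'
fixed = normalize_newlines(raw)
print(repr(fixed))
```

The normalized string can then be wrapped in a StringIO object and passed to read_csv in the same way as the \n examples in this report.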

wesm · 2013-06-02T20:58:12Z

Looking

wesm · 2013-06-02T23:27:57Z

All set. Had to make a bit of a mess. We will need to clean up the tokenizer loop one of these days (being mindful of performance, of course).

In [1]: import StringIO

In [2]: pd.read_csv(StringIO.StringIO(' a,b,c\n"a,b","e,d","f,f"'), header=None)
Out[2]: 
     0    1    2
0    a    b    c
1  a,b  e,d  f,f

In [3]: pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'))
Out[3]: 
     a    b    c
0  a,b  e,d  f,f

This was referenced May 1, 2013

BUG: read_csv does not parse csv files with windows line terminator correctly #3501

Closed

to_csv does not quote fields with special characters #3503

Closed

ghost assigned wesm Jun 2, 2013

wesm closed this as completed in c549299 Jun 2, 2013
