read_csv parse issues with \r line ending and quoted items #3453

Closed
sandbox opened this issue Apr 25, 2013 · 4 comments
Labels: IO Data (IO issues that don't fit into a more specific label)



sandbox commented Apr 25, 2013

read_csv seems to mis-parse quoted fields that contain the separator when lines end with \r:

>>> import StringIO
>>> import pandas as pd
>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'), header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 399, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 215, in _read
    return parser.read()
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 631, in read
    ret = self._engine.read(nrows)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 954, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 644, in pandas._parser.TextReader.read (pandas/src/parser.c:5925)
  File "parser.pyx", line 666, in pandas._parser.TextReader._read_low_memory (pandas/src/parser.c:6145)
  File "parser.pyx", line 719, in pandas._parser.TextReader._read_rows (pandas/src/parser.c:6750)
  File "parser.pyx", line 706, in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6634)
  File "parser.pyx", line 1572, in pandas._parser.raise_parser_error (pandas/src/parser.c:17055)
pandas._parser.CParserError: Error tokenizing data. C error: Expected 3 fields in line 2, saw 4
EXPECTED BEHAVIOR:
>>> pd.read_csv(StringIO.StringIO(' a,b,c\n"a,b","e,d","f,f"'), header=None)

     0    1    2
0    a    b    c
1  a,b  e,d  f,f

This should behave the same as when the line ending is \n.
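
As a cross-check that does not depend on pandas, the same two inputs can be run through Python's standard csv module. This is only a sketch of the expected tokenization under Python 3, not a description of how read_csv works internally. The helper name parse is made up for illustration, and str.splitlines() is used on the assumption that no quoted field contains a newline.

```python
import csv

def parse(text):
    # Assumption: no quoted field contains a line break, so splitting the
    # text into lines up front (on \r, \n or \r\n) is safe for this input.
    return list(csv.reader(text.splitlines()))

# Same data as in the report, once with \r and once with \n.
cr_rows = parse(' a,b,c\r"a,b","e,d","f,f"')
lf_rows = parse(' a,b,c\n"a,b","e,d","f,f"')

print(cr_rows)
print(lf_rows)
```

Both calls should produce the same two rows of three fields each, which is the behavior expected from read_csv as well.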


This may belong in a separate bug report, but a possibly related issue occurs when header=None is omitted:

>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'))
     a    b    c
"a  b"  e,d  f,f

In the output above, the first quoted field is split at its comma and the first piece ("a) is used as the index. The following shows what happens when pandas is told to use index_col=False:

>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'), index_col=False)
    a   b    c
0  "a  b"  e,d
EXPECTED BEHAVIOR:
>>> pd.read_csv(StringIO.StringIO(' a,b,c\n"a,b","e,d","f,f"'))
     a    b    c
0  a,b  e,d  f,f

and with index_col=False

>>> pd.read_csv(StringIO.StringIO(' a,b,c\n"a,b","e,d","f,f"'), index_col=False)
     a    b    c
0  a,b  e,d  f,f

Here is my system information, in case it is needed:

>>> pd.__version__
'0.10.1'
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
>>> sys.platform
'darwin'
>>> os.name
'posix'

sandbox commented Apr 25, 2013

Testing with the latest from GitHub gives the same results:

>>> pd.__version__
'0.12.0.dev-1e2b447'
>>> import StringIO
>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'), header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 216, in _read
    return parser.read()
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 634, in read
    ret = self._engine.read(nrows)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 958, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas/src/parser.c:6014)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas/src/parser.c:6231)
  File "parser.pyx", line 729, in pandas._parser.TextReader._read_rows (pandas/src/parser.c:6833)
  File "parser.pyx", line 716, in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6718)
  File "parser.pyx", line 1582, in pandas._parser.raise_parser_error (pandas/src/parser.c:17131)
pandas._parser.CParserError: Error tokenizing data. C error: Expected 3 fields in line 2, saw 4
>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'))
     a    b    c
"a  b"  e,d  f,f
>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'), index_col=False)
    a   b    c
0  "a  b"  e,d


sandbox commented Apr 25, 2013

I can also confirm that this error occurs in 0.11.0:

>>> pd.__version__
'0.11.0'
>>> pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'), header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 216, in _read
    return parser.read()
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 633, in read
    ret = self._engine.read(nrows)
  File "/home/john/app/venv/lib/python2.7/site-packages/pandas/io/parsers.py", line 957, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas/src/parser.c:5921)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas/src/parser.c:6138)
  File "parser.pyx", line 729, in pandas._parser.TextReader._read_rows (pandas/src/parser.c:6740)
  File "parser.pyx", line 716, in pandas._parser.TextReader._tokenize_rows (pandas/src/parser.c:6625)
  File "parser.pyx", line 1582, in pandas._parser.raise_parser_error (pandas/src/parser.c:17029)
pandas._parser.CParserError: Error tokenizing data. C error: Expected 3 fields in line 2, saw 4

Member

wesm commented Jun 2, 2013

Looking

wesm closed this as completed in c549299 on Jun 2, 2013

Member

wesm commented Jun 2, 2013

All set. I had to make a bit of a mess; we will need to clean up the tokenizer loop one of these days (being mindful of performance, of course).

In [1]: import StringIO

In [2]: pd.read_csv(StringIO.StringIO(' a,b,c\n"a,b","e,d","f,f"'), header=None)
Out[2]: 
     0    1    2
0    a    b    c
1  a,b  e,d  f,f

In [3]: pd.read_csv(StringIO.StringIO(' a,b,c\r"a,b","e,d","f,f"'))
Out[3]: 
     a    b    c
0  a,b  e,d  f,f
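
For readers unfamiliar with what a tokenizer loop has to handle here, the following is a toy sketch in Python 3. It is not the pandas C tokenizer and makes no claim about how commit c549299 fixed the bug; the function name tokenize and its parameters are hypothetical. It only illustrates the rule at stake: \r, \n and \r\n each end a row, except inside quotes, where separators and line breaks are literal.

```python
def tokenize(text, sep=",", quote='"'):
    """Split CSV text into rows, treating \\r, \\n and \\r\\n as row endings."""
    rows, row, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < n and text[i + 1] == quote:
                    field.append(quote)  # doubled quote is a literal quote
                    i += 1
                else:
                    in_quotes = False    # closing quote
            else:
                field.append(ch)         # separators and line breaks are literal
        elif ch == quote:
            in_quotes = True
        elif ch == sep:
            row.append("".join(field))
            field = []
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1                   # consume the \n of a \r\n pair
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
        else:
            field.append(ch)
        i += 1
    if field or row:                     # flush a final row with no line ending
        row.append("".join(field))
        rows.append(row)
    return rows

cr = tokenize(' a,b,c\r"a,b","e,d","f,f"')
lf = tokenize(' a,b,c\n"a,b","e,d","f,f"')
crlf = tokenize(' a,b,c\r\n"a,b","e,d","f,f"')
print(cr)
```

With this rule, all three line-ending styles yield the same rows, matching the fixed output shown above. A real tokenizer also has to deal with buffering, escapes and performance, which this sketch ignores.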
