
Misleading read_csv() error message. It says lines but refers to rows. #22789


Closed
sinanonur opened this issue Sep 20, 2018 · 5 comments
Labels
Error Reporting Incorrect or improved errors from pandas good first issue IO CSV read_csv, to_csv

@sinanonur

This is not an error per se, but it might mislead others as it misled me.

from io import StringIO
# if not using python 3:
# from StringIO import StringIO 
import pandas as pd

TESTDATA = StringIO("""c1;c2;c3
    1,"some text"
    2,"some more 
    text that is 
    multiple lines"
    3,"more text"
    4,"text with missing quote
    """)

pd.read_csv(TESTDATA)

Problem description

This code results in the following message:

ParserError: Error tokenizing data. C error: EOF inside string starting at line 4
This seems correct at first sight, but the problematic string does not start at line 4; it starts at row 4. I was working with a very large file and could not debug it easily, so I spent considerable time trying to figure out what was wrong with that line of the file.

In a CSV file, a line does not necessarily correspond to a row: a quoted field may contain line breaks, so a single row can span several lines.
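
To illustrate the distinction, here is a minimal sketch that uses Python's standard csv module rather than pandas. The sample text and the variable names are made up for illustration, and the data is well-formed (every quote is closed).

```python
import csv
from io import StringIO

# Illustrative sample: one quoted field spans three physical lines.
text = (
    'c1,c2\n'
    '1,"some text"\n'
    '2,"some more\n'
    'text that is\n'
    'multiple lines"\n'
    '3,"more text"\n'
)

# Physical lines: what a text editor or `wc -l` would count.
physical_lines = len(text.splitlines())

# Logical rows: what a CSV parser returns (header included).
rows = list(csv.reader(StringIO(text)))

print(physical_lines, "physical lines,", len(rows), "rows")
```

Here the multi-line field keeps its embedded newlines, so the number of rows is smaller than the number of lines, and a row number cannot be used to jump to a line in the file.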

Expected Output

ParserError: Error tokenizing data. C error: EOF inside string starting at row 4

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.16.7-041607-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.5.1
pip: 18.0
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Sep 20, 2018

I'm not clear on the distinction you are trying to make - can you provide a more illustrative example?

@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label Sep 20, 2018
@miccoli
Contributor

miccoli commented Sep 23, 2018

@WillAyd : TESTDATA has 8 lines (len(TESTDATA.readlines())), which, if correctly parsed, should result in a DataFrame with 4 rows. In line 7 (1-based numbering)

    4,"text with missing quote

there is an error, which prevents the correct parsing of row 4 (1-based numbering not counting the header row, or 0-based counting also the header row).
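
The line count can be checked with the standard library alone. This is a small sketch with TESTDATA copied from the report above; the names `lines` and `bad_line` are only for illustration.

```python
from io import StringIO

TESTDATA = StringIO("""c1;c2;c3
    1,"some text"
    2,"some more 
    text that is 
    multiple lines"
    3,"more text"
    4,"text with missing quote
    """)

# Physical lines of the input, including the final whitespace-only line.
lines = TESTDATA.readlines()

# 1-based line 7 is index 6; it holds the field with the missing quote.
bad_line = lines[6].strip()

print(len(lines))
print(bad_line)
```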

I guess that the error message (sorry, I had no time to check the source) refers to a "logical" line 4 (0-based), where multi-line strings have already been collapsed into a single logical line.

IMHO error messages should contain an unambiguous reference to the line/column (1-based) of the original input file.
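
As a rough sketch of what such a reference could look like, the snippet below maps each logical row to the physical line it starts on, using the `line_num` attribute of `csv.reader` from the standard library. This is not how the pandas C tokenizer works: `row_start_lines` is a hypothetical helper, and the sample text is a comma-delimited variant of TESTDATA without the leading indentation.

```python
import csv
from io import StringIO


def row_start_lines(text):
    """Return the 1-based physical line on which each logical row starts.

    Hypothetical helper for illustration only; it relies on
    csv.reader.line_num, which counts the lines read from the source.
    """
    reader = csv.reader(StringIO(text))
    starts = []
    previous_end = 0  # last physical line consumed by the previous row
    for _ in reader:
        starts.append(previous_end + 1)
        previous_end = reader.line_num
    return starts


# Comma-delimited variant of TESTDATA (an assumption, not the original input).
text = (
    'c1,c2\n'
    '1,"some text"\n'
    '2,"some more\n'
    'text that is\n'
    'multiple lines"\n'
    '3,"more text"\n'
    '4,"text with missing quote\n'
)

starts = row_start_lines(text)

# Row 4 (0-based, counting the header) is the one reported by the error.
print("row 4 starts at physical line", starts[4])
```

With its default non-strict dialect, the csv module returns the unterminated last row instead of raising, which is why the loop reaches it. Blank lines are a caveat: csv.reader yields an empty row for them, whereas read_csv skips them by default, so the row indices of the two parsers can differ on such input.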

@gfyoung gfyoung added Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv and removed Needs Info Clarification about behavior needed to assess issue labels Sep 23, 2018
@gfyoung
Member

gfyoung commented Sep 23, 2018

This proposal seems fine to me.

@miccoli : Would you like to submit a PR for this?

@miccoli
Contributor

miccoli commented Sep 23, 2018

@gfyoung : I know, when you suggest an improvement you are morally obliged to implement it! 😃 Unfortunately I have to decline, since I have no spare time at the moment to dedicate to a new PR.

@gfyoung
Member

gfyoung commented Sep 23, 2018

@miccoli : No worries! Time is a limited resource, no need to remind me 😉
