Bug: pd.read_csv does not read lines with data containing leading quotes but not matching close quotes #22661

Closed
dhruvsakalley opened this issue Sep 11, 2018 · 3 comments
Labels: IO CSV (read_csv, to_csv); Needs Info (clarification about behavior needed to assess issue)

dhruvsakalley commented Sep 11, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.read_csv("pandas_bug.tsv", sep="\t", index_col=None, header=None, encoding='utf-8', skip_blank_lines=True, quotechar='"')

Problem description

The file pandas_bug.tsv looks like this (fields are tab-separated):
3 abc 5.6
4 "abc" 4.3
5 "error 3.3

This line of code raises an error:
ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
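
The report can be reproduced without the original file. The following is a minimal sketch, assuming the three rows shown above are tab-separated; the in-memory `data` string is a stand-in for pandas_bug.tsv, not the reporter's actual file, and the exact wording of the message may differ between pandas versions.

```python
import io

import pandas as pd

# Stand-in for pandas_bug.tsv (assumed tab-separated, as in the report).
# The third row opens a quote that is never closed.
data = '3\tabc\t5.6\n4\t"abc"\t4.3\n5\t"error\t3.3\n'

try:
    pd.read_csv(io.StringIO(data), sep="\t", header=None, quotechar='"')
    error_message = None
except pd.errors.ParserError as exc:
    # The default C parser reaches end-of-file while still inside the
    # quoted field that starts at "error, and reports that as an error.
    error_message = str(exc)

print(error_message)
```

With the default quoting mode, everything after the unmatched quote is treated as part of one quoted field, so the parser hits the end of the input before the field closes.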

In another case, with a larger file, this error does not occur and pandas silently skips some lines. However, when the same TSV file is converted to JSON via file I/O, the pandas read_json function handles this gracefully and adds an escape character in front of the string, e.g. "error

Expected Output

0	1	2

0 3 abc 5.6
1 4 abc 4.3
2 5 "error 3.3

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.7.0
pip: 10.0.1
setuptools: 36.2.7
Cython: 0.28.2
numpy: 1.14.5
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.4
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Can you fix your expected output? The formatting looks off.

And why do you expect that reading to succeed? You have a malformed CSV, so I would expect an exception to be thrown.

> for another case in a larger file, this error does not occur and pandas silently skips some lines,

This sounds like the real bug. Can you provide a reproducible example?

@gfyoung added the labels IO CSV (read_csv, to_csv) and Needs Info (clarification about behavior needed to assess issue) on Sep 11, 2018
@NeuroBobster

I posted this comment on issue #5500, but that issue was closed the following day and no one ever responded to my suggestion, so I am reposting it here:

For the reason I pointed out in my answer to this question:
https://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5/53173373#53173373
I would suggest making quoting=csv.QUOTE_NONE the default instead of csv.QUOTE_MINIMAL.
It's easier to realise what's going on when your strings are unexpectedly parsed with quotechars left in than to get an error when there is an odd number of quotechars, or no error but unexpected parsing when there is an even number of quotechars.
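
The two cases described here can be sketched as follows. The sample rows are hypothetical, made up for illustration, and are not taken from the reporter's larger file. With an even number of stray quotes the default quoting mode raises no error but merges several lines into one field, which is one plausible way lines can appear to be skipped. With quoting=csv.QUOTE_NONE every line is kept and the quote characters stay in the data.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: two rows each start a field with an unmatched quote,
# so the file contains an even number of quote characters.
even_quotes = '1\t"a\t1.0\n2\tb\t2.0\n3\t"c\t3.0\n'

# Default quoting (csv.QUOTE_MINIMAL): the quote on row 1 is closed by the
# quote on row 3, so the text in between ends up inside a single field.
df_default = pd.read_csv(io.StringIO(even_quotes), sep="\t", header=None)

# csv.QUOTE_NONE: quote characters are treated as ordinary data.
df_none = pd.read_csv(
    io.StringIO(even_quotes), sep="\t", header=None, quoting=csv.QUOTE_NONE
)

print(len(df_default), len(df_none))

# The same option applied to the sample from the original report.
odd_quotes = '3\tabc\t5.6\n4\t"abc"\t4.3\n5\t"error\t3.3\n'
df_report = pd.read_csv(
    io.StringIO(odd_quotes), sep="\t", header=None, quoting=csv.QUOTE_NONE
)
print(df_report)
```

Note that QUOTE_NONE does not reproduce the expected output from the original report exactly: the properly quoted value on the second row keeps its surrounding quotes instead of being read as abc.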

@mroeschke
Member

We're happy to reopen this issue when we can validate the issue with a reproducible example for your larger file with the line skipping.
