Bug: pd.read_csv does not read lines with data containing leading quotes but not matching close quotes #22661

dhruvsakalley · 2018-09-11T14:00:56Z

Code Sample, a copy-pastable example if possible

import pandas as pd

pd.read_csv("pandas_bug.tsv", sep="\t", index_col=None, header=None, encoding='utf-8', skip_blank_lines=True, quotechar='"')

Problem description

The file pandas_bug.tsv looks like this:
3 abc 5.6
4 "abc" 4.3
5 "error 3.3

This line of code results in an error:
ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
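
For readers who want to try this without a file on disk, here is a minimal sketch of the failure. It assumes the three columns of pandas_bug.tsv are separated by tabs (the `sep="\t"` argument suggests so), and the in-memory `data` string is a stand-in for the file. The exact wording of the message may differ between pandas versions.

```python
import io

import pandas as pd
from pandas.errors import ParserError

# Stand-in for pandas_bug.tsv; assumes the columns are tab-separated.
data = '3\tabc\t5.6\n4\t"abc"\t4.3\n5\t"error\t3.3\n'

try:
    pd.read_csv(io.StringIO(data), sep="\t", header=None, quotechar='"')
    message = None
except ParserError as exc:
    # The quote opened on the last row is never closed, so the tokenizer
    # reaches end-of-file while it is still inside a quoted field.
    message = str(exc)

print(message)
```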

In another case, with a larger file, this error does not occur and pandas silently skips some lines. However, when the same TSV file is converted to JSON via file I/O, the pandas read_json function handles this gracefully and adds an escape character in front of the quote, e.g. \"error

Expected Output

0	1	2

0 3 abc 5.6
1 4 abc 4.3
2 5 "error 3.3

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.7.0
pip: 10.0.1
setuptools: 36.2.7
Cython: 0.28.2
numpy: 1.14.5
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.4
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-09-11T14:09:23Z

Can you fix your expected output? The formatting looks off.

And why do you expect that reading to succeed? You have a malformed CSV, so I would expect an exception to be thrown.

> for another case in a larger file, this error does not occur and pandas silently skips some lines,

This sounds like the real bug. Can you provide a reproducible example?
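
One way such silent line loss could arise (an assumption, since the larger file was never shared) is a second stray quote further down the file. The parser then treats everything between the two quotes, newlines included, as one quoted field. A minimal sketch with made-up rows:

```python
import io

import pandas as pd

# Hypothetical data: the quote opened on the second line is closed
# by an unrelated quote two lines later.
data = (
    "3\tabc\t5.6\n"
    '5\t"error\t3.3\n'
    "6\tfoo\t1.0\n"
    '7\tbar"\t2.2\n'
    "8\tbaz\t9.9\n"
)

df = pd.read_csv(io.StringIO(data), sep="\t", header=None, quotechar='"')

# Five lines went in, but lines 2 to 4 are merged into a single row
# whose middle field contains the embedded tabs and newlines.
print(len(df))
print(repr(df.iloc[1, 1]))
```

No exception is raised here because the merged row still has exactly three fields, so comparing the row count against the file's line count is one way to detect the problem.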

NeuroBobster · 2018-12-20T15:10:31Z

I posted this comment on issue #5500, but the issue was closed the following day and no one ever responded to my suggestion, so I am reposting it here:

For the reason I pointed out in my answer to this question:
https://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5/53173373#53173373
I would suggest making quoting=csv.QUOTE_NONE the default instead of csv.QUOTE_MINIMAL.
It's easier to realise what's going on when your strings are unexpectedly parsed with quotechars left in than to get an error when there is an odd number of quotechars, or no error but unexpected parsing when there is an even number of quotechars.
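
As a sketch of what this setting does today when passed explicitly (the default is unchanged), the example below reads the three rows from the original report with `quoting=csv.QUOTE_NONE`. The in-memory string again assumes tab-separated columns.

```python
import csv
import io

import pandas as pd

# Stand-in for pandas_bug.tsv; assumes the columns are tab-separated.
data = '3\tabc\t5.6\n4\t"abc"\t4.3\n5\t"error\t3.3\n'

# QUOTE_NONE tells the parser to treat the quote character as ordinary data.
df = pd.read_csv(
    io.StringIO(data),
    sep="\t",
    header=None,
    quoting=csv.QUOTE_NONE,
)

print(df)
```

All three rows are read, but the quotes around "abc" are kept as part of the value, which differs from the expected output in the original report. Any stripping of matched quotes would have to be done afterwards.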

mroeschke · 2020-01-24T07:02:52Z

We're happy to reopen this issue when we can validate the issue with a reproducible example for your larger file with the line skipping.

gfyoung added the IO CSV (read_csv, to_csv) and Needs Info (Clarification about behavior needed to assess issue) labels Sep 11, 2018

mroeschke closed this as completed Jan 24, 2020
