BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine #14582

Merged
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.19.2.txt
@@ -30,6 +30,7 @@ Bug Fixes
- Compat with ``dateutil==2.6.0``; segfault reported in the testing suite (:issue:`14621`)
- Allow ``nanoseconds`` in ``Timestamp.replace`` as a kwarg (:issue:`14621`)
- Bug in ``pd.read_csv`` where reading files fails, if the number of headers is equal to the number of lines in the file (:issue:`14515`)
- Bug in ``pd.read_csv`` for the Python engine in which an unhelpful error message was being raised when multi-char delimiters were not being respected with quotes (:issue:`14582`)



5 changes: 5 additions & 0 deletions pandas/io/parsers.py
@@ -2515,6 +2515,11 @@ def _rows_to_cols(self, content):

            msg = ('Expected %d fields in line %d, saw %d' %
                   (col_len, row_num + 1, zip_len))
            if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:
Member

Is the second part of the check (self.quoting != csv.QUOTE_NONE) needed? AFAIU, even when you do pass this, quotes are still ignored by the regex used to split the line, so you can still run into this problem.

Member Author

@gfyoung gfyoung Nov 25, 2016

@jorisvandenbossche : If quoting=csv.QUOTE_NONE, all quotation marks are treated as data, so that's the user's fault, not ours. That's why the check is necessary.

Member

Although it's a user error, you can still run into this problem, so I think the message could still be useful. But I do see your point, so OK.

                # see gh-13374
                reason = ('Error could possibly be due to quotes being '
                          'ignored when a multi-char delimiter is used.')
                msg += '. ' + reason
            raise ValueError(msg)

        if self.usecols:
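
The message logic in the hunk above can be sketched as a standalone function. This is only an illustration: `build_field_count_message` and its parameters are hypothetical names that do not exist in pandas, and the sketch returns the message instead of raising `ValueError` so that both variants can be compared side by side.

```python
import csv


def build_field_count_message(col_len, row_num, zip_len, delimiter, quoting):
    """Hypothetical helper mirroring the message built in _rows_to_cols."""
    msg = ('Expected %d fields in line %d, saw %d' %
           (col_len, row_num + 1, zip_len))
    # The hint is appended only when quotes are meant to be honored
    # (quoting is not QUOTE_NONE) and the delimiter is multi-char.
    if len(delimiter) > 1 and quoting != csv.QUOTE_NONE:
        reason = ('Error could possibly be due to quotes being '
                  'ignored when a multi-char delimiter is used.')
        msg += '. ' + reason
    return msg


hinted = build_field_count_message(2, 2, 3, ',,', csv.QUOTE_MINIMAL)
plain = build_field_count_message(2, 2, 3, ',,', csv.QUOTE_NONE)
print(hinted)
print(plain)
```

With `csv.QUOTE_NONE` the message stays in its original form, which matches the position taken in the review thread that quotation marks are then treated as data.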
17 changes: 17 additions & 0 deletions pandas/io/tests/parser/python_parser_only.py
@@ -7,6 +7,7 @@
arguments when parsing.
"""

import csv
import sys
import nose

@@ -204,3 +205,19 @@ def test_encoding_non_utf8_multichar_sep(self):
                               sep=sep, names=['a', 'b'],
                               encoding=encoding)
        tm.assert_frame_equal(result, expected)

    def test_multi_char_sep_quotes(self):
        # see gh-13374

        data = 'a,,b\n1,,a\n2,,"2,,b"'
        msg = 'ignored when a multi-char delimiter is used'

        with tm.assertRaisesRegexp(ValueError, msg):
            self.read_csv(StringIO(data), sep=',,')

        # We expect no match, so there should be an assertion
        # error out of the inner context manager.
        with tm.assertRaises(AssertionError):
            with tm.assertRaisesRegexp(ValueError, msg):
                self.read_csv(StringIO(data), sep=',,',
                              quoting=csv.QUOTE_NONE)
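
To see why the test data above yields a field-count mismatch, here is a minimal stdlib-only sketch. It assumes, as the review thread suggests, that a multi-char separator is applied as a regex split that does not look at quotes. The Python engine's real splitting code is more involved, so this is an approximation and not the pandas implementation.

```python
import re

# Same data and separator as in test_multi_char_sep_quotes.
data = 'a,,b\n1,,a\n2,,"2,,b"'
sep = ',,'

# Split each line on the separator with a plain regex; quotes are not
# given any special treatment here.
rows = [re.split(sep, line) for line in data.split('\n')]
field_counts = [len(row) for row in rows]

print(rows)
print(field_counts)
```

The last line splits into three pieces because the `,,` inside the quoted value is treated as a separator, while the header defines only two columns.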