BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine #14582

gfyoung · 2016-11-04T00:53:47Z

If there is a field counts mismatch, check whether a multi-char separator was used in conjunction with quotes. Currently, that setup is not respected for the Python engine and can result in improper line breaks. This PR appends that to the error message if that is a possible explanation.

Closes #13374.

codecov-io · 2016-11-04T08:55:52Z

Current coverage is 85.22% (diff: 100%)

Merging #14582 into master will increase coverage by <.01%

@@             master     #14582   diff @@
==========================================
  Files           143        143          
  Lines         50804      50807     +3   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43292      43298     +6   
+ Misses         7512       7509     -3   
  Partials          0          0

Powered by Codecov. Last update 75b606a...dda588b

gfyoung · 2016-11-10T05:03:50Z

Getting some weird segfaults that seem unrelated to this PR (also in my other one)...disabling testing for the time being.

jorisvandenbossche · 2016-11-10T08:15:16Z

@gfyoung yes, the segfaults are not your fault: #14621

gfyoung · 2016-11-13T20:04:56Z

@jorisvandenbossche , @jreback : No more segfaulting, and everything looks green. Ready to merge if there are no concerns.

jreback · 2016-11-15T11:52:08Z

pandas/io/tests/parser/common.py

@@ -836,7 +836,8 @@ def test_integer_overflow_bug(self):
        result = self.read_csv(StringIO(data), header=None, sep=' ')
        self.assertTrue(result[0].dtype == np.float64)

-        result = self.read_csv(StringIO(data), header=None, sep='\s+')
+        result = self.read_csv(StringIO(data), header=None,


why do you need to change these tests? this is translated to the c-engine with delim whitespace=True so it shouldn't hit your change

But this has nothing to do with the C engine. This has to do with the Python engine. You still do a buggy regex split even with delimiter="\s+" as you can see here. That's why all of these tests have to change.

and why the difference? that is a major api change.

My change causes an error to be raised whenever the sep parameter is regex and quoting != csv.QUOTE_NONE for the Python engine. This test will break as a result for the Python engine without specifying that quoting=csv.QUOTE_NONE.

jreback · 2016-11-15T11:52:53Z

pandas/io/parsers.py

@@ -801,6 +803,15 @@ def _clean_options(self, options, engine):
                                  " different from '\s+' are"\
                                  " interpreted as regex)"
                engine = 'python'
+            elif quoting != csv.QUOTE_NONE:


\s+ IS delim_whitespace, you shouldn't have to change all of these tests

Not sure why you commented this.

my point is that \s+ is NOT a multi-char separator to the c-engine, it is instead passed as the delim_whitespace kw. So quoting is properly handled. Not sure this should be different in the python parser.

Unfortunately, it is very different. Python's native csv library does not accept regex, so we have to do that buggy regex split. delim_whitespace=True just translates to delimiter="\s+" for the Python parser.

jorisvandenbossche · 2016-11-15T19:21:38Z

I am not sure I am convinced this change is needed. @gfyoung Can you provide a bit more rationale in the top post and examples of how you to change your code or how the result changes?

Reason that I am not fully convinced is that you are disallowing it for all cases where there are quotes in the csv file, while there is only a problem when the delimiter is present in the quoted fields.

gfyoung · 2016-11-15T19:28:39Z

@jorisvandenbossche , @jreback : Maybe the delimiter is not be present in the quoted fields, but how are we to check this without compromising performance in Python? It is easier to insist upon passing in a parameter without raising an error than doing such a check. This change is necessary then to patch the bug I showed in the related issue.

There is no real behaviour change, except for an error being raised depending on what separator is being passed in with the Python engine. ~~All tests that I modified perhaps could be left unchanged if I change some of the logic inside the parser but will need to check.~~

UPDATE: Not feasible because of what I provided above.

jorisvandenbossche · 2016-11-15T21:02:47Z

@gfyoung It's quite possible your proposed change is the sensible thing to do, but I think you need to try to explain your changes and its consequences a bit better (in this case "Title is self-explanatory" is not really true .. :-))

As I understand it now, the consequence of your proposed change is that when somebody now does (on a file that does not necessarily contain quotes):

pd.read_csv(file, sep='\s+', engine='python')

this will raise an error now (so breaking code). To resolve this error, the user needs to add quoting=csv.QUOTE_NONE (as this is not the default).
Is this correct?

In this case, for me this is a trade-off between having to provide an extra (unneeded) keyword in the simple case (no quotes in file) vs having a wrong error message in case of quotes + sep present in quotes (current behaviour that you are trying to fix).

gfyoung · 2016-11-18T03:55:32Z

@jorisvandenbossche : Fair enough. Updated my description.

jorisvandenbossche · 2016-11-18T10:30:01Z

@gfyoung Would it be possible to only raise the error if there are actually quotes in the data? (not sure how performant such a check would be)
That will already solve the trade-off for a lot of usecases I think

gfyoung · 2016-11-18T16:02:56Z

@jorisvandenbossche : That's going to be tricky because we wrap all IO objects as a stream. Unless you want to check every line when reading, I don't know of any other way to do it at the moment. Thoughts?

jreback · 2016-11-23T11:14:12Z

@gfyoung why can't you just happily parse, then if the error happens, check the quoting paramater and raise the correct error?

gfyoung · 2016-11-23T17:11:42Z

@jreback : Because how can you tell if it's a legitimate error in the CSV file or incorrect parsing of quotation marks? There isn't really a way to differentiate AFAICT.

jreback · 2016-11-23T17:26:12Z

@gfyoung

Because how can you tell if it's a legitimate error in the CSV file or incorrect parsing of quotation marks? There isn't really a way to differentiate AFAICT.

so maybe just say that in the error message. (IOW say you can pass quote.NONE to verify if no error.

I think forcing people to pass quote.NONE is pretty painful just to have a more general error message.

gfyoung · 2016-11-23T17:27:49Z

@jreback : Well, parsing regex separators is painful on our end. You don't have to pass in csv.None if you're separator is only one character, so it shouldn't inconvenience people in most cases.

jorisvandenbossche · 2016-11-24T21:11:37Z

Hmm, we should try to make a decision here. Trying to summarize the options:

Current situation: you get an error when having sep inside quoted field. From the error message this is not directly clear
- Could we possibly do a check when such an error message occurs if there are quotes, and then add to the message: "possibly this is due to quotes that are ignored when having a multi-char sep"? (which is what @jreback proposed?) Or just adding it to the message without checking.
PR: force people to use quoting=csv.QUOTE_NONE with the idea of making them aware of the possible problem with quotes.
- The reason I am not really a fan: a) this also forces the user to use this when there is no potential problem (eg no quotes, or no sep chars inside the quotes), ànd b) this will still give you the same error message when you have sep chars inside quotes but do pass csv.QUOTE_NONE as the error message says to do
Checking if there are quotes in the line, and only raising the error in that case
- Disadvantages: possibly expensive performance wise + still forces users to use quoting when they have quotes without sep chars inside + makes implementation more complex for something that is probably not worth it.
...

@gfyoung @jreback Thoughts?
Personally I would go for the first: leaving the current situation but improving the error message.

gfyoung · 2016-11-25T06:43:45Z

@jorisvandenbossche : You do have a point that the error message clarifies a lot what the issue is without necessarily inconveniencing the user. Changed PR to follow the first suggestion.

…gine If there is a field counts mismatch, check whether a multi-char sep was used in conjunction with quotes. Currently, that setup is not respected and can result in improper line breaks. Closes pandas-devgh-13374.

jorisvandenbossche · 2016-11-25T08:44:53Z

pandas/io/parsers.py

@@ -2515,6 +2515,11 @@ def _rows_to_cols(self, content):

            msg = ('Expected %d fields in line %d, saw %d' %
                   (col_len, row_num + 1, zip_len))
+            if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:


Is the second part of the check needed (self.quoting != csv.QUOTE_NONE). Because AFAIU also when you do pass this, quotes are still be ignored by the regex expression to split the line, and you can still have this problem.

@jorisvandenbossche : If quoting=csv.QUOTE_NONE, all quotation marks are treated as data, so that's the user's fault, not ours. That's why the check is necessary.

Although it's a user error, you can still run into this problem so the message can still be useful I think. But I do see your point, so OK

jreback · 2016-11-25T13:22:26Z

lgtm.

jorisvandenbossche · 2016-11-25T21:21:14Z

@gfyoung thanks!

…gine (#14582) If there is a field counts mismatch, check whether a multi-char sep was used in conjunction with quotes. Currently, that setup is not respected and can result in improper line breaks. Closes gh-13374. (cherry picked from commit d8e427b)

gfyoung force-pushed the multichar-regex-sep-python branch 2 times, most recently from b0d75dc to 648254d Compare November 4, 2016 02:34

gfyoung force-pushed the multichar-regex-sep-python branch from 648254d to 90ac7be Compare November 4, 2016 15:12

sinhrks added Bug IO CSV read_csv, to_csv labels Nov 7, 2016

gfyoung force-pushed the multichar-regex-sep-python branch 3 times, most recently from 883e03d to 0d3a19b Compare November 9, 2016 22:38

gfyoung force-pushed the multichar-regex-sep-python branch from 0d3a19b to f21d9dd Compare November 10, 2016 05:04

gfyoung force-pushed the multichar-regex-sep-python branch 5 times, most recently from 0b0f618 to fb33679 Compare November 13, 2016 02:44

gfyoung force-pushed the multichar-regex-sep-python branch from fb33679 to 5e4d8ce Compare November 14, 2016 22:40

jreback reviewed Nov 15, 2016

View reviewed changes

gfyoung force-pushed the multichar-regex-sep-python branch from 5e4d8ce to e08ac73 Compare November 15, 2016 18:50

gfyoung force-pushed the multichar-regex-sep-python branch 3 times, most recently from e08f2f2 to 3fc4c2d Compare November 17, 2016 17:04

gfyoung force-pushed the multichar-regex-sep-python branch from 3fc4c2d to 641d3e1 Compare November 18, 2016 03:55

gfyoung force-pushed the multichar-regex-sep-python branch from 641d3e1 to d6afa62 Compare November 18, 2016 14:34

gfyoung force-pushed the multichar-regex-sep-python branch 4 times, most recently from 86eda84 to 51e22cf Compare November 23, 2016 01:51

jorisvandenbossche added this to the 0.20.0 milestone Nov 23, 2016

gfyoung force-pushed the multichar-regex-sep-python branch from 51e22cf to 2bd8594 Compare November 25, 2016 06:42

gfyoung changed the title ~~BUG: Disable multichar/regex sep for Python engine in read_csv~~ BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine Nov 25, 2016

gfyoung force-pushed the multichar-regex-sep-python branch from 2bd8594 to dda588b Compare November 25, 2016 07:29

jorisvandenbossche reviewed Nov 25, 2016

View reviewed changes

jorisvandenbossche merged commit d8e427b into pandas-dev:master Nov 25, 2016

gfyoung deleted the multichar-regex-sep-python branch November 25, 2016 21:23

jorisvandenbossche modified the milestones: 0.19.2, 0.20.0 Dec 14, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine #14582

BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine #14582

gfyoung commented Nov 4, 2016 •

edited

Loading

codecov-io commented Nov 4, 2016 •

edited

Loading

gfyoung commented Nov 10, 2016

jorisvandenbossche commented Nov 10, 2016

gfyoung commented Nov 13, 2016

jreback Nov 15, 2016

gfyoung Nov 15, 2016 •

edited

Loading

jreback Nov 15, 2016

gfyoung Nov 15, 2016 •

edited

Loading

jreback Nov 15, 2016

gfyoung Nov 15, 2016

jreback Nov 15, 2016

gfyoung Nov 15, 2016 •

edited

Loading

jorisvandenbossche commented Nov 15, 2016

gfyoung commented Nov 15, 2016 •

edited

Loading

jorisvandenbossche commented Nov 15, 2016

gfyoung commented Nov 18, 2016

jorisvandenbossche commented Nov 18, 2016

gfyoung commented Nov 18, 2016

jreback commented Nov 23, 2016

gfyoung commented Nov 23, 2016

jreback commented Nov 23, 2016

gfyoung commented Nov 23, 2016

jorisvandenbossche commented Nov 24, 2016 •

edited

Loading

gfyoung commented Nov 25, 2016

jorisvandenbossche Nov 25, 2016

gfyoung Nov 25, 2016 •

edited

Loading

jorisvandenbossche Nov 25, 2016

jreback commented Nov 25, 2016

jorisvandenbossche commented Nov 25, 2016

BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine #14582

BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine #14582

Conversation

gfyoung commented Nov 4, 2016 • edited Loading

codecov-io commented Nov 4, 2016 • edited Loading

Current coverage is 85.22% (diff: 100%)

gfyoung commented Nov 10, 2016

jorisvandenbossche commented Nov 10, 2016

gfyoung commented Nov 13, 2016

jreback Nov 15, 2016

Choose a reason for hiding this comment

gfyoung Nov 15, 2016 • edited Loading

Choose a reason for hiding this comment

jreback Nov 15, 2016

Choose a reason for hiding this comment

gfyoung Nov 15, 2016 • edited Loading

Choose a reason for hiding this comment

jreback Nov 15, 2016

Choose a reason for hiding this comment

gfyoung Nov 15, 2016

Choose a reason for hiding this comment

jreback Nov 15, 2016

Choose a reason for hiding this comment

gfyoung Nov 15, 2016 • edited Loading

Choose a reason for hiding this comment

jorisvandenbossche commented Nov 15, 2016

gfyoung commented Nov 15, 2016 • edited Loading

jorisvandenbossche commented Nov 15, 2016

gfyoung commented Nov 18, 2016

jorisvandenbossche commented Nov 18, 2016

gfyoung commented Nov 18, 2016

jreback commented Nov 23, 2016

gfyoung commented Nov 23, 2016

jreback commented Nov 23, 2016

gfyoung commented Nov 23, 2016

jorisvandenbossche commented Nov 24, 2016 • edited Loading

gfyoung commented Nov 25, 2016

jorisvandenbossche Nov 25, 2016

Choose a reason for hiding this comment

gfyoung Nov 25, 2016 • edited Loading

Choose a reason for hiding this comment

jorisvandenbossche Nov 25, 2016

Choose a reason for hiding this comment

jreback commented Nov 25, 2016

jorisvandenbossche commented Nov 25, 2016

gfyoung commented Nov 4, 2016 •

edited

Loading

codecov-io commented Nov 4, 2016 •

edited

Loading

gfyoung Nov 15, 2016 •

edited

Loading

gfyoung Nov 15, 2016 •

edited

Loading

gfyoung Nov 15, 2016 •

edited

Loading

gfyoung commented Nov 15, 2016 •

edited

Loading

jorisvandenbossche commented Nov 24, 2016 •

edited

Loading

gfyoung Nov 25, 2016 •

edited

Loading