Skip to content

BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine #14582

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Nov 4, 2016

If there is a field counts mismatch, check whether a multi-char separator was used in conjunction with quotes. Currently, that setup is not respected for the Python engine and can result in improper line breaks. This PR appends that to the error message if that is a possible explanation.

Closes #13374.

@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch 2 times, most recently from b0d75dc to 648254d Compare November 4, 2016 02:34
@codecov-io
Copy link

codecov-io commented Nov 4, 2016

Current coverage is 85.22% (diff: 100%)

Merging #14582 into master will increase coverage by <.01%

@@             master     #14582   diff @@
==========================================
  Files           143        143          
  Lines         50804      50807     +3   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43292      43298     +6   
+ Misses         7512       7509     -3   
  Partials          0          0          

Powered by Codecov. Last update 75b606a...dda588b

@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch from 648254d to 90ac7be Compare November 4, 2016 15:12
@sinhrks sinhrks added Bug IO CSV read_csv, to_csv labels Nov 7, 2016
@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch 3 times, most recently from 883e03d to 0d3a19b Compare November 9, 2016 22:38
@gfyoung
Copy link
Member Author

gfyoung commented Nov 10, 2016

Getting some weird segfaults that seem unrelated to this PR (also in my other one)...disabling testing for the time being.

@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch from 0d3a19b to f21d9dd Compare November 10, 2016 05:04
@jorisvandenbossche
Copy link
Member

@gfyoung yes, the segfaults are not your fault: #14621

@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch 5 times, most recently from 0b0f618 to fb33679 Compare November 13, 2016 02:44
@gfyoung
Copy link
Member Author

gfyoung commented Nov 13, 2016

@jorisvandenbossche , @jreback : No more segfaulting, and everything looks green. Ready to merge if there are no concerns.

@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch from fb33679 to 5e4d8ce Compare November 14, 2016 22:40
@@ -836,7 +836,8 @@ def test_integer_overflow_bug(self):
result = self.read_csv(StringIO(data), header=None, sep=' ')
self.assertTrue(result[0].dtype == np.float64)

result = self.read_csv(StringIO(data), header=None, sep='\s+')
result = self.read_csv(StringIO(data), header=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need to change these tests? this is translated to the c-engine with delim whitespace=True so it shouldn't hit your change

Copy link
Member Author

@gfyoung gfyoung Nov 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this has nothing to do with the C engine. This has to do with the Python engine. You still do a buggy regex split even with delimiter="\s+" as you can see here. That's why all of these tests have to change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and why the difference? that is a major api change.

Copy link
Member Author

@gfyoung gfyoung Nov 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My change causes an error to be raised whenever the sep parameter is regex and quoting != csv.QUOTE_NONE for the Python engine. This test will break as a result for the Python engine without specifying that quoting=csv.QUOTE_NONE.

@@ -801,6 +803,15 @@ def _clean_options(self, options, engine):
" different from '\s+' are"\
" interpreted as regex)"
engine = 'python'
elif quoting != csv.QUOTE_NONE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\s+ IS delim_whitespace, you shouldn't have to change all of these tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why you commented this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is that \s+ is NOT a multi-char separator to the c-engine, it is instead passed as the delim_whitespace kw. So quoting is properly handled. Not sure this should be different in the python parser.

Copy link
Member Author

@gfyoung gfyoung Nov 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it is very different. Python's native csv library does not accept regex, so we have to do that buggy regex split. delim_whitespace=True just translates to delimiter="\s+" for the Python parser.

@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch from 5e4d8ce to e08ac73 Compare November 15, 2016 18:50
@jorisvandenbossche
Copy link
Member

I am not sure I am convinced this change is needed. @gfyoung Can you provide a bit more rationale in the top post and examples of how you to change your code or how the result changes?

Reason that I am not fully convinced is that you are disallowing it for all cases where there are quotes in the csv file, while there is only a problem when the delimiter is present in the quoted fields.

@gfyoung
Copy link
Member Author

gfyoung commented Nov 15, 2016

@jorisvandenbossche , @jreback : Maybe the delimiter is not be present in the quoted fields, but how are we to check this without compromising performance in Python? It is easier to insist upon passing in a parameter without raising an error than doing such a check. This change is necessary then to patch the bug I showed in the related issue.

There is no real behaviour change, except for an error being raised depending on what separator is being passed in with the Python engine. All tests that I modified perhaps could be left unchanged if I change some of the logic inside the parser but will need to check.

UPDATE: Not feasible because of what I provided above.

@jorisvandenbossche
Copy link
Member

@gfyoung It's quite possible your proposed change is the sensible thing to do, but I think you need to try to explain your changes and its consequences a bit better (in this case "Title is self-explanatory" is not really true .. :-))

As I understand it now, the consequence of your proposed change is that when somebody now does (on a file that does not necessarily contain quotes):

pd.read_csv(file, sep='\s+', engine='python')

this will raise an error now (so breaking code). To resolve this error, the user needs to add quoting=csv.QUOTE_NONE (as this is not the default).
Is this correct?

In this case, for me this is a trade-off between having to provide an extra (unneeded) keyword in the simple case (no quotes in file) vs having a wrong error message in case of quotes + sep present in quotes (current behaviour that you are trying to fix).

@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch 3 times, most recently from e08f2f2 to 3fc4c2d Compare November 17, 2016 17:04
@gfyoung
Copy link
Member Author

gfyoung commented Nov 18, 2016

@jorisvandenbossche : Fair enough. Updated my description.

@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch from 3fc4c2d to 641d3e1 Compare November 18, 2016 03:55
@jorisvandenbossche
Copy link
Member

@gfyoung Would it be possible to only raise the error if there are actually quotes in the data? (not sure how performant such a check would be)
That will already solve the trade-off for a lot of usecases I think

@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch from 641d3e1 to d6afa62 Compare November 18, 2016 14:34
@gfyoung
Copy link
Member Author

gfyoung commented Nov 18, 2016

@jorisvandenbossche : That's going to be tricky because we wrap all IO objects as a stream. Unless you want to check every line when reading, I don't know of any other way to do it at the moment. Thoughts?

@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch 4 times, most recently from 86eda84 to 51e22cf Compare November 23, 2016 01:51
@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

@gfyoung why can't you just happily parse, then if the error happens, check the quoting paramater and raise the correct error?

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Nov 23, 2016
@gfyoung
Copy link
Member Author

gfyoung commented Nov 23, 2016

@jreback : Because how can you tell if it's a legitimate error in the CSV file or incorrect parsing of quotation marks? There isn't really a way to differentiate AFAICT.

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

@gfyoung

Because how can you tell if it's a legitimate error in the CSV file or incorrect parsing of quotation marks? There isn't really a way to differentiate AFAICT.

so maybe just say that in the error message. (IOW say you can pass quote.NONE to verify if no error.

I think forcing people to pass quote.NONE is pretty painful just to have a more general error message.

@gfyoung
Copy link
Member Author

gfyoung commented Nov 23, 2016

@jreback : Well, parsing regex separators is painful on our end. You don't have to pass in csv.None if you're separator is only one character, so it shouldn't inconvenience people in most cases.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 24, 2016

Hmm, we should try to make a decision here. Trying to summarize the options:

  • Current situation: you get an error when having sep inside quoted field. From the error message this is not directly clear
    • Could we possibly do a check when such an error message occurs if there are quotes, and then add to the message: "possibly this is due to quotes that are ignored when having a multi-char sep"? (which is what @jreback proposed?) Or just adding it to the message without checking.
  • PR: force people to use quoting=csv.QUOTE_NONE with the idea of making them aware of the possible problem with quotes.
    • The reason I am not really a fan: a) this also forces the user to use this when there is no potential problem (eg no quotes, or no sep chars inside the quotes), ànd b) this will still give you the same error message when you have sep chars inside quotes but do pass csv.QUOTE_NONE as the error message says to do
  • Checking if there are quotes in the line, and only raising the error in that case
    • Disadvantages: possibly expensive performance wise + still forces users to use quoting when they have quotes without sep chars inside + makes implementation more complex for something that is probably not worth it.
  • ...

@gfyoung @jreback Thoughts?
Personally I would go for the first: leaving the current situation but improving the error message.

@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch from 51e22cf to 2bd8594 Compare November 25, 2016 06:42
@gfyoung gfyoung changed the title BUG: Disable multichar/regex sep for Python engine in read_csv BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine Nov 25, 2016
@gfyoung
Copy link
Member Author

gfyoung commented Nov 25, 2016

@jorisvandenbossche : You do have a point that the error message clarifies a lot what the issue is without necessarily inconveniencing the user. Changed PR to follow the first suggestion.

…gine

If there is a field counts mismatch, check whether
a multi-char sep was used in conjunction with quotes.
Currently, that setup is not respected and can result
in improper line breaks.

Closes pandas-devgh-13374.
@gfyoung gfyoung force-pushed the multichar-regex-sep-python branch from 2bd8594 to dda588b Compare November 25, 2016 07:29
@@ -2515,6 +2515,11 @@ def _rows_to_cols(self, content):

msg = ('Expected %d fields in line %d, saw %d' %
(col_len, row_num + 1, zip_len))
if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the second part of the check needed (self.quoting != csv.QUOTE_NONE). Because AFAIU also when you do pass this, quotes are still be ignored by the regex expression to split the line, and you can still have this problem.

Copy link
Member Author

@gfyoung gfyoung Nov 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche : If quoting=csv.QUOTE_NONE, all quotation marks are treated as data, so that's the user's fault, not ours. That's why the check is necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although it's a user error, you can still run into this problem so the message can still be useful I think. But I do see your point, so OK

@jreback
Copy link
Contributor

jreback commented Nov 25, 2016

lgtm.

@jorisvandenbossche jorisvandenbossche merged commit d8e427b into pandas-dev:master Nov 25, 2016
@jorisvandenbossche
Copy link
Member

@gfyoung thanks!

@gfyoung gfyoung deleted the multichar-regex-sep-python branch November 25, 2016 21:23
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.2, 0.20.0 Dec 14, 2016
jorisvandenbossche pushed a commit that referenced this pull request Dec 15, 2016
…gine (#14582)

If there is a field counts mismatch, check whether
a multi-char sep was used in conjunction with quotes.
Currently, that setup is not respected and can result
in improper line breaks.

Closes gh-13374.
(cherry picked from commit d8e427b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants