-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine #14582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine #14582
Conversation
b0d75dc
to
648254d
Compare
Current coverage is 85.22% (diff: 100%)@@ master #14582 diff @@
==========================================
Files 143 143
Lines 50804 50807 +3
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43292 43298 +6
+ Misses 7512 7509 -3
Partials 0 0
|
648254d
to
90ac7be
Compare
883e03d
to
0d3a19b
Compare
Getting some weird segfaults that seem unrelated to this PR (also in my other one)...disabling testing for the time being. |
0d3a19b
to
f21d9dd
Compare
0b0f618
to
fb33679
Compare
@jorisvandenbossche , @jreback : No more segfaulting, and everything looks green. Ready to merge if there are no concerns. |
fb33679
to
5e4d8ce
Compare
@@ -836,7 +836,8 @@ def test_integer_overflow_bug(self): | |||
result = self.read_csv(StringIO(data), header=None, sep=' ') | |||
self.assertTrue(result[0].dtype == np.float64) | |||
|
|||
result = self.read_csv(StringIO(data), header=None, sep='\s+') | |||
result = self.read_csv(StringIO(data), header=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need to change these tests? this is translated to the c-engine with delim whitespace=True so it shouldn't hit your change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this has nothing to do with the C engine. This has to do with the Python engine. You still do a buggy regex split even with delimiter="\s+"
as you can see here. That's why all of these tests have to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and why the difference? that is a major api change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My change causes an error to be raised whenever the sep
parameter is regex and quoting != csv.QUOTE_NONE
for the Python engine. This test will break as a result for the Python engine without specifying that quoting=csv.QUOTE_NONE
.
@@ -801,6 +803,15 @@ def _clean_options(self, options, engine): | |||
" different from '\s+' are"\ | |||
" interpreted as regex)" | |||
engine = 'python' | |||
elif quoting != csv.QUOTE_NONE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\s+
IS delim_whitespace, you shouldn't have to change all of these tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why you commented this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my point is that \s+
is NOT a multi-char separator to the c-engine, it is instead passed as the delim_whitespace
kw. So quoting is properly handled. Not sure this should be different in the python parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, it is very different. Python's native csv
library does not accept regex, so we have to do that buggy regex split. delim_whitespace=True
just translates to delimiter="\s+"
for the Python parser.
5e4d8ce
to
e08ac73
Compare
I am not sure I am convinced this change is needed. @gfyoung Can you provide a bit more rationale in the top post and examples of how you to change your code or how the result changes? Reason that I am not fully convinced is that you are disallowing it for all cases where there are quotes in the csv file, while there is only a problem when the delimiter is present in the quoted fields. |
@jorisvandenbossche , @jreback : Maybe the delimiter is not be present in the quoted fields, but how are we to check this without compromising performance in Python? It is easier to insist upon passing in a parameter without raising an error than doing such a check. This change is necessary then to patch the bug I showed in the related issue. There is no real behaviour change, except for an error being raised depending on what separator is being passed in with the Python engine. UPDATE: Not feasible because of what I provided above. |
@gfyoung It's quite possible your proposed change is the sensible thing to do, but I think you need to try to explain your changes and its consequences a bit better (in this case "Title is self-explanatory" is not really true .. :-)) As I understand it now, the consequence of your proposed change is that when somebody now does (on a file that does not necessarily contain quotes):
this will raise an error now (so breaking code). To resolve this error, the user needs to add In this case, for me this is a trade-off between having to provide an extra (unneeded) keyword in the simple case (no quotes in file) vs having a wrong error message in case of quotes + sep present in quotes (current behaviour that you are trying to fix). |
e08f2f2
to
3fc4c2d
Compare
@jorisvandenbossche : Fair enough. Updated my description. |
3fc4c2d
to
641d3e1
Compare
@gfyoung Would it be possible to only raise the error if there are actually quotes in the data? (not sure how performant such a check would be) |
641d3e1
to
d6afa62
Compare
@jorisvandenbossche : That's going to be tricky because we wrap all IO objects as a stream. Unless you want to check every line when reading, I don't know of any other way to do it at the moment. Thoughts? |
86eda84
to
51e22cf
Compare
@gfyoung why can't you just happily parse, then if the error happens, check the quoting paramater and raise the correct error? |
@jreback : Because how can you tell if it's a legitimate error in the CSV file or incorrect parsing of quotation marks? There isn't really a way to differentiate AFAICT. |
so maybe just say that in the error message. (IOW say you can pass I think forcing people to pass |
@jreback : Well, parsing regex separators is painful on our end. You don't have to pass in |
Hmm, we should try to make a decision here. Trying to summarize the options:
@gfyoung @jreback Thoughts? |
51e22cf
to
2bd8594
Compare
@jorisvandenbossche : You do have a point that the error message clarifies a lot what the issue is without necessarily inconveniencing the user. Changed PR to follow the first suggestion. |
…gine If there is a field counts mismatch, check whether a multi-char sep was used in conjunction with quotes. Currently, that setup is not respected and can result in improper line breaks. Closes pandas-devgh-13374.
2bd8594
to
dda588b
Compare
@@ -2515,6 +2515,11 @@ def _rows_to_cols(self, content): | |||
|
|||
msg = ('Expected %d fields in line %d, saw %d' % | |||
(col_len, row_num + 1, zip_len)) | |||
if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the second part of the check needed (self.quoting != csv.QUOTE_NONE
). Because AFAIU also when you do pass this, quotes are still be ignored by the regex expression to split the line, and you can still have this problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche : If quoting=csv.QUOTE_NONE
, all quotation marks are treated as data, so that's the user's fault, not ours. That's why the check is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although it's a user error, you can still run into this problem so the message can still be useful I think. But I do see your point, so OK
lgtm. |
@gfyoung thanks! |
If there is a field counts mismatch, check whether a multi-char separator was used in conjunction with quotes. Currently, that setup is not respected for the Python engine and can result in improper line breaks. This PR appends that to the error message if that is a possible explanation.
Closes #13374.