BUG: Python parser breaks with quotes and multi-char sep #13374
Comments
isn't a multi-char sep turned into a regex? (in python engine) |
Yes, but that's beside the point in the grand scheme of things. Support for multi-char or regex separators isn't really there in the Python engine if it breaks in simple cases like this. |
does it work when no quotes are there? |
Should work AFAICT without quotes. I only stumbled across it when I was trying to figure out a clever way to get |
OK, why don't we just raise NotImplementedError if quoting is not None and the Python engine is selected (only for multi-char sep)? I suspect it's not worth actually implementing, but an error would be fine. |
The extra field will be created regardless of the quoting style. The error is raised because of the faulty splitting, not because of the quoting style. The quotes in this example just serve to illustrate why the splitting is faulty. The alternative to make it "work" is to
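To make the faulty splitting concrete, here is a minimal sketch using only the standard library. It assumes the multi-char separator is applied as a plain regex split, as the earlier comments suggest; the function name `naive_split` and the sample row are hypothetical and are not taken from the pandas source.

```python
import re

def naive_split(line, sep):
    # Hypothetical stand-in for a regex-based split that ignores quoting:
    # every occurrence of `sep` is a field boundary, even inside quotes.
    return re.split(sep, line)

# Illustrative row: two intended fields, the second one quoted and
# containing the separator itself.
row = '2,,"2,,b"'
fields = naive_split(row, ",,")

# The quoted field is cut in half, so the row yields three fields
# instead of the two the header would lead the parser to expect.
print(fields)
```

The quotes do not cause the extra field; they only show that the split never looks at them.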
…gine If there is a field counts mismatch, check whether a multi-char sep was used in conjunction with quotes. Currently, that setup is not respected and can result in improper line breaks. Closes pandas-devgh-13374.
I actually get an error when I have a bad line and don't specify a delimiter (using the python engine), due to the following code:

if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:
    # see gh-13374
    reason = ('Error could possibly be due to quotes being '
              'ignored when a multi-char delimiter is used.')
    msg += '. ' + reason

It should be relatively simple to fix, just change the statement to

if self.delimiter and len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:

Can you reproduce this?

edit:

from pandas import read_csv
from StringIO import StringIO
csv_str = "a,b,c\n0,1,2\n3,4,5,6\n7,8,9"
chunks = read_csv(StringIO(csv_str), header=0, chunksize=2048, sep=None,
                  error_bad_lines=False, warn_bad_lines=True,
                  engine='python', iterator=True,
                  tupleize_cols=True)
for chunk in chunks:
    print chunk

edit: simplified sample |
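The proposed guard can be checked in isolation. The sketch below is not the pandas code: `quote_hint` is a hypothetical helper that mirrors only the condition quoted above. It shows why `len(self.delimiter)` fails when the delimiter is `None` (as with `sep=None`) and how the extra truthiness check avoids that.

```python
import csv

def quote_hint(delimiter, quoting):
    # Hypothetical helper mirroring the patched condition: `delimiter and ...`
    # short-circuits when delimiter is None, so len() is never called on None.
    if delimiter and len(delimiter) > 1 and quoting != csv.QUOTE_NONE:
        return ('Error could possibly be due to quotes being '
                'ignored when a multi-char delimiter is used.')
    return ''

# Without the guard, the unpatched condition calls len(None) and raises.
try:
    len(None)
    unguarded_failed = False
except TypeError:
    unguarded_failed = True

print(unguarded_failed)
print(repr(quote_hint(None, csv.QUOTE_MINIMAL)))
print(repr(quote_hint(',,', csv.QUOTE_MINIMAL)))
```

With the guard in place, a missing delimiter simply produces no hint instead of masking the original bad-line error with a `TypeError`.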
@matthax : It would be preferable if you could provide a CSV in string form for the rest of us to reproduce, e.g.:

data = ...
read_csv(StringIO(data), engine='python',...) |
from pandas import read_csv
from StringIO import StringIO
csv_str = "a,b,c\n0,1,2\n3,4,5,6\n7,8,9"
chunks = read_csv(StringIO(csv_str), header=0, chunksize=2048, sep=None,
                  error_bad_lines=False, warn_bad_lines=True,
                  engine='python', iterator=True,
                  tupleize_cols=True)
for chunk in chunks:
    print chunk

How's this work?

edit: simplified sample |
@matthax : That indeed breaks! Feel free to patch in a PR, and also try to see if you can provide a simpler example than this one for testing / reproducing this issue. |
On master (b722222):

I expect this command to work, but because no parsing is done on quoted fields as can be seen here, an extra field is produced, breaking the parser. Note that this does not affect the C parser because multi-char delimiters are not supported. Similar to what we saw in #10911 and #12775, but unless we want to write the tokenizer.c code in Python, a similar fix does not seem trivial.
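For a sense of what quote-aware splitting involves, here is a rough sketch of a tokenizer that only treats the separator as a boundary outside quotes. It is an illustration under simplifying assumptions, not a proposed patch: the function `split_outside_quotes` is hypothetical, the separator is matched literally rather than as a regex, and escape characters and fields spanning multiple lines are not handled.

```python
def split_outside_quotes(line, sep, quotechar='"'):
    # Hypothetical sketch: walk the line once, tracking whether we are
    # inside a quoted region, and split on `sep` only when we are not.
    fields, buf, in_quotes, i = [], [], False, 0
    while i < len(line):
        ch = line[i]
        if ch == quotechar:
            in_quotes = not in_quotes
            buf.append(ch)
            i += 1
        elif not in_quotes and line.startswith(sep, i):
            fields.append(''.join(buf))
            buf = []
            i += len(sep)
        else:
            buf.append(ch)
            i += 1
    fields.append(''.join(buf))
    return fields

# Illustrative row: the quoted field keeps its embedded separator.
print(split_outside_quotes('2,,"2,,b"', ',,'))
```

Even this small version has to carry state across characters, which hints at why supporting regex separators, escapes, and multi-line fields in the Python engine is not a trivial change.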