
BUG: Python parser breaks with quotes and multi-char sep #13374


Closed
gfyoung opened this issue Jun 5, 2016 · 10 comments · Fixed by #17465
Labels: Error Reporting (incorrect or improved errors from pandas); IO CSV (read_csv, to_csv)
Milestone: 0.20.0

Comments

@gfyoung
Member

gfyoung commented Jun 5, 2016

On master (b722222):

>>> data = 'a,,b\n1,,a\n2,,"2,,b"'
>>> read_csv(StringIO(data), sep=',,', engine='python')
...
ValueError: Expected 2 fields in line 3, saw 3

I expect this command to work. However, because no parsing is done on quoted fields in the Python engine, the separator inside the quoted value is treated as a real delimiter, an extra field is produced, and the parser breaks. Note that this does not affect the C parser, because multi-char delimiters are not supported there. This is similar to what we saw in #10911 and #12775, but unless we want to rewrite the tokenizer.c code in Python, a similar fix does not seem trivial.
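
A minimal sketch of the failure mode described above. It assumes, as the thread later confirms, that a multi-char sep is treated as a regex and applied to each line without regard to quotes. The helper name `naive_split` is hypothetical and is not the pandas implementation.

```python
import re

data = 'a,,b\n1,,a\n2,,"2,,b"'
sep = ',,'


def naive_split(text, sep):
    """Split every line on the regex `sep`, ignoring quote characters."""
    pattern = re.compile(sep)
    return [pattern.split(line) for line in text.split('\n')]


rows = naive_split(data, sep)
field_counts = [len(row) for row in rows]

# The quoted value on the third line is cut in two, so that line
# ends up with one more field than the header.
for line_number, row in enumerate(rows, start=1):
    print(line_number, len(row), row)
```

The third line yields three fields instead of two because the `,,` inside the quotes is split like any other separator, which is consistent with the "Expected 2 fields in line 3, saw 3" message.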

@jreback
Contributor

jreback commented Jun 6, 2016

Isn't a multi-char sep turned into a regex (in the Python engine)?

@jreback jreback added the IO CSV read_csv, to_csv label Jun 6, 2016
@gfyoung
Member Author

gfyoung commented Jun 6, 2016

Yes, but that's beside the point in the grand scheme of things. Support for multi-char or regex separators isn't really there in the Python engine if it breaks in simple cases like this.

@jreback
Contributor

jreback commented Jun 6, 2016

does it work when no quotes are there?

@gfyoung
Member Author

gfyoung commented Jun 6, 2016

It should work without quotes, AFAICT. I only stumbled across this while trying to figure out a clever way to get regex or multi-char sep to work on the C engine by doing some pre-processing in the Python layer, but this is an obstacle to that.

@jreback
Contributor

jreback commented Jun 6, 2016

OK, why don't we just raise NotImplementedError if quoting is not None and the Python engine is selected (only for multi-char sep)?

I suspect it's not worth it to actually implement, but an error would be fine.

@jreback jreback added Difficulty Novice Error Reporting Incorrect or improved errors from pandas labels Jun 6, 2016
@gfyoung
Member Author

gfyoung commented Jun 7, 2016

The extra field will be created regardless of the quoting style. The error is raised because of the faulty splitting, not because of the quoting style. The quotes in this example just serve to illustrate why the splitting is faulty.

The alternative to make it "work" is to regex-replace the sep with a more conventional one like a single comma, but of course that means loss of data in the cases of quoted or commented data. So my first instinct is to say that we actually don't support regex or multi-char sep for either engine. How about that?
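
A rough sketch of the data loss mentioned above, assuming the pre-processing step is a plain `re.sub` over the raw text followed by an ordinary single-character parse. The standard-library `csv.reader` stands in for the parser here; this is not how pandas does it.

```python
import csv
import re
from io import StringIO

data = 'a,,b\n1,,a\n2,,"2,,b"'

# Replace the multi-char separator with a single comma everywhere,
# including inside quoted fields, because re.sub knows nothing about quotes.
replaced = re.sub(',,', ',', data)

# A quote-aware, single-character parse now yields two fields per line...
rows = list(csv.reader(StringIO(replaced)))

# ...but the quoted value has been silently altered.
print(rows[2])
```

The field count is now consistent, yet the original quoted value `2,,b` comes back as `2,b`, which is the loss of data the comment warns about.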

gfyoung added a series of commits to forking-repos/pandas that referenced this issue, Nov 4 to Nov 25, 2016
…gine

If there is a field counts mismatch, check whether
a multi-char sep was used in conjunction with quotes.
Currently, that setup is not respected and can result
in improper line breaks.

Closes pandas-devgh-13374.
@jreback jreback added this to the 0.20.0 milestone Nov 25, 2016
jorisvandenbossche pushed a commit that referenced this issue Nov 25, 2016
…gine (#14582)

If there is a field counts mismatch, check whether
a multi-char sep was used in conjunction with quotes.
Currently, that setup is not respected and can result
in improper line breaks.

Closes gh-13374.
jorisvandenbossche pushed a commit that referenced this issue Dec 15, 2016
…gine (#14582)

If there is a field counts mismatch, check whether
a multi-char sep was used in conjunction with quotes.
Currently, that setup is not respected and can result
in improper line breaks.

Closes gh-13374.
(cherry picked from commit d8e427b)
@matthax
Contributor

matthax commented Sep 7, 2017

I actually get an error when I have a bad line and don't specify a delimiter (using the python engine), due to the following code:

                if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:
                    # see gh-13374
                    reason = ('Error could possibly be due to quotes being '
                              'ignored when a multi-char delimiter is used.')
                    msg += '. ' + reason

A TypeError is thrown because self.delimiter is None, and NoneType does not support len().

It should be relatively simple to fix; just change the statement to

    if self.delimiter and len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:
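
To illustrate the difference between the two conditions outside of pandas, here is a small standalone sketch. The function names `unguarded` and `guarded` are hypothetical; only the boolean expressions mirror the snippets above.

```python
import csv


def unguarded(delimiter, quoting):
    # Mirrors the original condition: len(None) raises TypeError.
    return len(delimiter) > 1 and quoting != csv.QUOTE_NONE


def guarded(delimiter, quoting):
    # Mirrors the proposed condition: a None (or empty) delimiter
    # short-circuits before len() is called.
    return bool(delimiter and len(delimiter) > 1
                and quoting != csv.QUOTE_NONE)


try:
    unguarded(None, csv.QUOTE_MINIMAL)
except TypeError as exc:
    print('unguarded check failed:', exc)

print(guarded(None, csv.QUOTE_MINIMAL))
print(guarded(',,', csv.QUOTE_MINIMAL))
```

With the guard in place, a missing delimiter simply skips the extra hint instead of masking the original parsing error with a TypeError.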

Can you reproduce this?

edit:
I've confirmed I can reproduce the issue. I can provide sample data, but a simple CSV with a header and three columns, where one row has an extra element, is enough. Here's some sample code:

from pandas import read_csv
from StringIO import StringIO

csv_str = "a,b,c\n0,1,2\n3,4,5,6\n7,8,9"

chunks = read_csv(StringIO(csv_str), header=0, chunksize=2048, sep=None,
                  error_bad_lines=False, warn_bad_lines=True,
                  engine='python', iterator=True,
                  tupleize_cols=True)
for chunk in chunks:
    print chunk

edit: simplified sample

@gfyoung
Member Author

gfyoung commented Sep 7, 2017

@matthax : It would be preferable if you could provide a CSV in string form for the rest of us to reproduce, e.g.:

data = ...
read_csv(StringIO(data), engine='python',...)

@matthax
Contributor

matthax commented Sep 7, 2017

@gfyoung

from pandas import read_csv
from StringIO import StringIO

csv_str = "a,b,c\n0,1,2\n3,4,5,6\n7,8,9"

chunks = read_csv(StringIO(csv_str), header=0, chunksize=2048, sep=None,
                  error_bad_lines=False, warn_bad_lines=True,
                  engine='python', iterator=True,
                  tupleize_cols=True)
for chunk in chunks:
    print chunk

How's this work?

edit: simplified sample

@gfyoung
Member Author

gfyoung commented Sep 7, 2017

@matthax : That indeed breaks! Feel free to patch in a PR, and also try to see if you can provide a simpler example than this one for testing / reproducing this issue.
