csv parsing is too restrictive #9294
Comments
You need to explicitly allow this; otherwise it's designed to give an error.
when I try adding
Looks like pandas is assuming the first line is the start of the data table, and is trying to tokenize based on the number of fields it found there, even though the entire rest of the file has 8 fields per row. I know that the header on this file is such that if I do skiprows=4, pandas can read it just fine. I don't think R has the ability to autodetect headers, but it sure would be nice if pandas had an option to do it, or if we could specify the number of rows to look at, to help determine what might be a header (?). Just an idea. |
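The "look at a number of rows to work out where the header is" idea above could be sketched roughly like this. This is only an illustration, not anything pandas provides: `guess_skiprows` and its `sample` parameter are hypothetical names, and the heuristic simply assumes the real table starts at the first row whose field count matches the most common field count in the sample.

```python
import csv
import io
from collections import Counter


def guess_skiprows(text, sample=50, delimiter=","):
    """Guess how many leading rows to skip (hypothetical helper, not a pandas API).

    Looks at up to `sample` rows and returns the index of the first row whose
    field count equals the most common field count among them.
    """
    widths = []
    for i, row in enumerate(csv.reader(io.StringIO(text), delimiter=delimiter)):
        if i >= sample:
            break
        widths.append(len(row))
    if not widths:
        return 0
    common_width, _ = Counter(widths).most_common(1)[0]
    return widths.index(common_width)


# Made-up file: four preamble lines, then an 8-column table.
sample_text = (
    "Station report\n"
    "Generated,2015-01-20\n"
    "Operator,unknown\n"
    "Units,metric\n"
    "a,b,c,d,e,f,g,h\n"
    "1,2,3,4,5,6,7,8\n"
    "9,10,11,12,13,14,15,16\n"
    "17,18,19,20,21,22,23,24\n"
)

n_skip = guess_skiprows(sample_text)
print(n_skip)  # 4
```

The result could then be passed on as `pd.read_csv(path, skiprows=n_skip)`. The guess is only as good as the sample: if the preamble is longer than the data rows sampled, or a quoted field spans several lines, it will be wrong, which is part of why guessing inside the parser is contentious.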
Just my 2 cents, but I don't think pandas should try to guess here. Even though warning on every line past the first one is almost surely not what the user wants, it could also indicate a bad file, where a column wasn't written correctly. Perhaps with another keyword or option to |
Input files should follow RFC4180 as closely as possible. The optional header should be the first line, so we shouldn't need to look past the first line to determine the header. |
My point is that it's a common problem, and pandas could help alleviate the issue, which would save everyone time and duplicated effort. Even if there's just an optional flag to return a list of tuples structured as (line number, number of fields in that line), that kind of output would be more useful than what it's already doing, which actually already includes that information, just not in a usable format. |
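Until something like that exists, a list of (line number, number of fields) tuples can be built outside pandas with the standard `csv` module. A minimal sketch, where `field_counts` and `mismatched` are hypothetical helper names and the sample data is made up:

```python
import csv
import io


def field_counts(text, delimiter=","):
    """Return [(line_number, n_fields), ...] using 1-based line numbers."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    report = []
    for row in reader:
        # reader.line_num is the number of source lines read so far
        report.append((reader.line_num, len(row)))
    return report


def mismatched(report, expected):
    """Keep only the entries whose field count differs from `expected`."""
    return [(line, n) for line, n in report if n != expected]


text = "a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n"
report = field_counts(text)
print(report)                 # [(1, 3), (2, 3), (3, 5), (4, 3)]
print(mismatched(report, 3))  # [(3, 5)]
```

This reads the file a second time, so it is a diagnostic step rather than a replacement for the parser's own reporting.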
@szeitlin there are a number of issues w.r.t. how to 'report' bad lines and warnings, e.g. see #5686, so maybe someone can propose an API, e.g. we could do something like replicates the current api
and errors be something like:
something I was curious about is whether we can still retrieve these bogus cells in some way... My understanding is that |
So this was a while ago, but I ended up actually catching those errors and parsing them, in order to figure out what was going on with my files. It worked ok, but it's probably kind of fragile (if anything about this part of pandas changes, it will break). I'd post the code here, but it was at my old job, so I don't have access to the repo anymore. 😂 |
yeah these bogus lines are difficult to catch unfortunately (at least not in native Pandas) |
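For anyone reading this later: as far as I know, newer pandas versions (1.4 and later, so not the 0.15.2 this issue was filed against) accept a callable for `on_bad_lines` when `engine="python"` is used, which makes it possible to hold on to the extra cells instead of losing them. A rough sketch, where the data, the `overflow` list and the `keep_and_trim` function are made up for illustration:

```python
import io

import pandas as pd

raw = "a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n"
n_cols = 3      # width of the header in this made-up example
overflow = []   # extra cells from rows that had too many fields


def keep_and_trim(bad_line):
    """Called with the split fields of each row that has too many fields."""
    overflow.append(bad_line[n_cols:])   # remember what gets cut off
    return bad_line[:n_cols]             # keep the row, trimmed to the header width


# Callables for on_bad_lines are only supported by the python engine.
df = pd.read_csv(io.StringIO(raw), on_bad_lines=keep_and_trim, engine="python")

print(df.shape)   # (3, 3)
print(overflow)   # [['7', '8']]
```

Returning `None` from the callable drops the row instead of trimming it. The extra cells arrive as raw strings, so any type conversion is left to the caller.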
On this csv file:
pandas.read_csv()
gives me
This seems odd to me. There's nothing in those fields to worry about anyway; they can just be dropped. Even if there was data there, if it doesn't have a column to go with, it should be dropped too. Lots of csv files are messy like this; that's basically why they're used all the time.
I am on pandas '0.15.2' and Python3