read_csv() & extra trailing comma(s) cause parsing issues. #2886
Comments
This isn't a totally trivial problem. I'll take a look into it, though.
I have a similar issue with mismatched trailing commas when processing files where the last field is free-form text and the separator is sometimes included in it. I have two different workarounds depending on the situation:
where 4 is the maximum number of fields that I want to have. This option requires a second delimiter. I hope this workaround helps.
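The "maximum number of fields" workaround above can be sketched with Python's `str.split` and its `maxsplit` argument, so that stray separators stay inside the last free-form field. The file content and column names here are made up for illustration:

```python
import io
import pandas as pd

# Sample data: the "comment" field contains the separator itself.
raw = "id,name,score,comment\n1,foo,10,hello, world\n2,bar,20,ok\n"

# Split each line into at most 4 fields (maxsplit=3), so any commas
# beyond the third one remain part of the final field.
rows = [line.split(",", 3) for line in raw.strip().splitlines()]
df = pd.DataFrame(rows[1:], columns=rows[0])
```

With this, `df.loc[0, "comment"]` keeps the full text `"hello, world"` instead of spilling into an extra column.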
@davidjameshumphreys if you would like to do a PR to edit http://pandas.pydata.org/pandas-docs/dev/cookbook.html#csv (doc/source/cookbook.rst) and add a link to this issue, that would be great!
Wow, I had no idea read_csv() had the option error_bad_lines=False. That will save me some headaches in the future for sure! It is documented, but not explicitly listed in the top argument section, which is why I never found it.
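For readers on current pandas: the `error_bad_lines=False` option discussed in this thread has since been replaced by `on_bad_lines='skip'` (pandas 1.3+). A minimal sketch of skipping malformed rows, using made-up inline data:

```python
import io
import pandas as pd

# The third row has an extra trailing field and would otherwise
# raise a parsing error.
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# on_bad_lines='skip' silently drops rows whose field count
# does not match the header (pandas >= 1.3).
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
```

The malformed `4,5,6,7` row is dropped, leaving two rows and three columns.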
Added link to CSV section referencing issue pandas-dev#2886 where read_csv fails with badly aligned data. @jreback should I expand the text to describe other scenarios?
@dragoljub given error_bad_lines (which @davidjameshumphreys is documenting in the correct place, along with this recipe), do you think anything needs to be added to read_csv? Or should we just close this?
I think it would be a nice feature for the parser to ignore extra trailing commas, so this could be turned into a feature request rather than a bug report. Since we can pre-process to drop extra commas, or ignore malformed lines, we can close this as an issue.
Does pandas have any ability to process bad lines, or just to get a list of them?
@tbicr bad lines can be ignored with the error_bad_lines=False option.
I sorted this issue out with the csv module. My code looks like:

```python
import csv
import pandas as pd

df = pd.read_csv('filename.csv', parse_dates=True, dtype=object,
                 delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
```
@davidjameshumphreys I met a similar problem today and I've tried your second pre-processing method. Changing all the first
Have the same "extra comma" issue in my CSVs due to a free-form field. Another way besides adding "
I have a table with the same issue, but the following command works!
I have run into a few opportunities to further improve the wonderful read_csv() function. I'm using the latest x64 0.10.1 build from 10-Feb-2013 16:52.
Symptoms:
I believe fixing these exceptions would simply require ignoring any extra trailing commas. This is how many CSV readers work, for example Excel when opening CSVs. In my case I regularly work with 100K-line CSVs that occasionally have extra trailing columns, causing read_csv() to fail. Perhaps it's possible to have an option to ignore trailing commas, or, even better, an option to ignore/skip any malformed rows without raising a terminal exception. :)
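The "ignore extra trailing commas" behavior requested above can be approximated with a small pre-processing pass before handing the text to read_csv. A minimal sketch, assuming made-up inline data (note that `rstrip(",")` would also strip a legitimately empty trailing field, so it fits only when trailing fields are never meaningful):

```python
import io
import pandas as pd

# Second row carries two spurious trailing commas.
raw = "a,b,c\n1,2,3,,\n4,5,6\n"

# Strip trailing commas from every line before parsing.
# Caveat: this also removes genuinely empty trailing fields.
cleaned = "\n".join(line.rstrip(",") for line in raw.splitlines())

df = pd.read_csv(io.StringIO(cleaned))
```

After cleaning, both rows parse into the three declared columns.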
-Gagi
If a CSV has n matched extra trailing columns and you do not specify any index_col, the parser will correctly assume that the first n columns are the index; if you set index_col=False it fails with: IndexError: list index out of range
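The implicit-index behavior described above can be demonstrated: when every data row has more fields than the header, read_csv treats the leading extra column(s) as the index. A small sketch with made-up data (one extra matched column per row):

```python
import io
import pandas as pd

# Header declares 3 columns, but every row has 4 fields;
# read_csv uses the first field of each row as the index.
raw = "a,b,c\n1,2,3,4\n5,6,7,8\n"

df = pd.read_csv(io.StringIO(raw))
```

Here the columns are `a`, `b`, `c` and the index is built from the leading values `1` and `5`.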