"Bad" lines with too few fields #9729
Comments
If you need data validation like this, I think it is much easier / better to simply count the NaN's.
All that said, this could perhaps be incorporated into the API discussion here: #9549
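For reference, a minimal sketch of the NaN-counting approach suggested above; the file name and the idea of raising on any missing value are assumptions for illustration, not part of the original comment:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file

# Count missing values per row; rows with NaNs may have had too few fields
na_per_row = df.isnull().sum(axis=1)
suspect_rows = df[na_per_row > 0]
if not suspect_rows.empty:
    raise ValueError("%d row(s) contain missing values" % len(suspect_rows))
```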
So - I could count the NaN's, but the behaviour is the same for (what I consider) valid input as for invalid input. Here's an example. I'd say this is a well-formed CSV for tabular data - the same number of fields in each record:
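Something like this, with illustrative values:

```
name,age,city
alice,30,london
bob,25,paris
```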
This has a null value in the 1st (non-header) record, but given the 2nd comma it's clear that the null is intentional:
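Again with illustrative values - the second field is empty, but the full complement of commas is present:

```
name,age,city
alice,,london
bob,25,paris
```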
This has a missing field in the 1st (non-header) record, and it's not clear whether a NaN was intended in the 3rd field or whether the input is just mangled:
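Illustrative again - the first data record now has only two fields:

```
name,age,city
alice,30
bob,25,paris
```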
I'm using pandas in an environment where data integrity is very important, so I'd prefer the processing to be a little more verbose about potentially missing data.
I would also like to see error_bad_lines trigger for too few fields. Embedded line breaks are a common cause of splitting a valid row into two rows. I have many files with 80 columns for each record, but embedded line breaks give me 37 fields on one line and 43 on the next. Is there any other approach to finding these bad lines?
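One possible workaround, sketched under the assumption of a simple comma-delimited file with a hypothetical name `badlines.csv`, is to pre-scan the file and report records whose field count differs from the header's:

```python
import csv

# Pre-scan: flag any record whose field count differs from the header's
with open("badlines.csv", newline="") as fh:  # hypothetical file name
    reader = csv.reader(fh)
    expected = len(next(reader))
    for row in reader:
        if len(row) != expected:
            print("line %d: expected %d fields, got %d"
                  % (reader.line_num, expected, len(row)))
```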
I would also like to have this error for too few fields. In any case, when there are missing fields, the current read_csv behaviour of adding NAs assumes that it is the last fields that are missing, which is often not the case. E.g. if something breaks in the first field of the last line, the dataframe ends up with the remaining values shifted into the wrong columns and the NA at the end (see the sketch below), which breaks everything downstream, e.g. dateparse, converters, etc.
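A minimal sketch of the shifting problem described above; the column names and values are made up for illustration:

```python
import io
import pandas as pd

# The last record lost its *first* field ("2015-03-01"), not its last one
data = """date,value,flag
2015-01-01,10,ok
2015-02-01,20,ok
30,ok"""

df = pd.read_csv(io.StringIO(data))
print(df)
# read_csv pads the *end* of the short row, so '30' lands in the date
# column, 'ok' in the value column, and NaN in flag - every column is
# now mis-typed, which breaks date parsing, converters, etc.
```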
@kelvin22 your example is not easy to interpret, as it's not very obvious where the error is (in an automatic way). However, these types of things will almost always be impossible to dtype coerce, so they will end up as object. If you are expecting float/datetime/int columns then you can at least be clued in that something is wrong.
Sorry @jreback, I'll have another crack at explaining. Currently the behaviour works well if a cell at the end of the row is mangled or missing:
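For example (illustrative values), only the trailing field is absent, so the NaN ends up in the right column:

```
date,value,flag
2015-01-01,10,ok
2015-02-01,20
```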
However it is just as likely to happen anywhere else in the line:
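Whereas here (again illustrative) the first field of the last record is missing, and every remaining value shifts into the wrong column:

```
date,value,flag
2015-01-01,10,ok
20,ok
```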
In this case there's no easy way to recover, and the bad-line parameters (error_bad_lines / warn_bad_lines) would ideally both treat it as a bad line and raise or warn.
The counting-NAs workaround can't be used here, as the dataframe doesn't get formed.
Just came across this, though it took a while to figure it out - as mentioned by @kelvin22, I was seeing errors about e.g. an integer field not being allowed to be NA. As @Noxville mentions, pandas is actually changing the information content of the data, and hence can't (easily) be used in a pipeline where data integrity is important. From that perspective, I'd call this a critical bug.
@kodonnell Note that this issue is an enhancement request, and not a bug. Pandas is only 'changing' content because of a malformed csv file (and the fact that NA's are not allowed in integer fields is a whole other issue, and with current pandas an inherent limitation, but that is being worked on). See also issue #15122. Comments, suggestions, pull requests very welcome!
@jorisvandenbossche - I consider not failing when it should fail to be a bug. It can certainly be a bug for downstream users (who e.g. adhere to a tighter CSV specification). #15122 looks good.
As per http://pandas.pydata.org/pandas-docs/stable/io.html#handling-bad-lines, records with too many fields cause (depending on error_bad_lines) an exception to be raised, or a warning to be written to stderr.
Would it be possible to add an option, defaulting to False, that gave similar warnings/errors if there are too few fields in a record (compared to all the other records)? The current behaviour is just to insert NaN's, but there are cases where data integrity is important, so knowing that some records are missing fields matters.
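To illustrate the asymmetry, a minimal sketch using the error_bad_lines flag available at the time (it has since been replaced by on_bad_lines); the data is made up:

```python
import io
import pandas as pd

too_many = "a,b,c\n1,2,3\n4,5,6,7\n"  # extra field on the last record
too_few = "a,b,c\n1,2,3\n4,5\n"       # missing field on the last record

# Too many fields: raises (error_bad_lines=True) or warns and skips the line
try:
    pd.read_csv(io.StringIO(too_many), error_bad_lines=True)
except pd.errors.ParserError as exc:
    print("too many fields ->", exc)

# Too few fields: silently filled with NaN, no warning or error
print(pd.read_csv(io.StringIO(too_few)))
```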
Cheers