
"Bad" lines with too few fields #9729

Open
Noxville opened this issue Mar 25, 2015 · 10 comments
Labels
Enhancement IO CSV read_csv, to_csv

Comments

@Noxville

As per http://pandas.pydata.org/pandas-docs/stable/io.html#handling-bad-lines, records with too many fields either raise an exception or write a message to stderr, depending on error_bad_lines.

Would it be possible to add an option, defaulting to False, that gives similar warnings/errors when a record has too few fields (compared to the other records)? The current behaviour is simply to insert NaN's, but there are cases where data integrity is important, so knowing that some records are missing fields matters.

Cheers ^
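
A minimal sketch of the asymmetry described above, assuming a reasonably current pandas (the exact error text may differ):

import pandas as pd
from io import StringIO

# A row with too MANY fields raises by default
# (or is skipped with a warning when error_bad_lines=False) ...
try:
    pd.read_csv(StringIO("a,b,c\n1,2,3,4\n"))
except pd.errors.ParserError as err:
    print(err)  # e.g. "Expected 3 fields in line 2, saw 4"

# ... while a row with too FEW fields is silently padded with NaN.
print(pd.read_csv(StringIO("a,b,c\n1,2\n")))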

@jreback
Contributor

jreback commented Mar 25, 2015

If you need data validation like this, I think it is much easier / better to simply count the NaN's.

In [18]: df = pd.read_csv(StringIO("""a,b,c\n1,2\n3,4,5\n"""))

In [19]: df
Out[19]: 
   a  b   c
0  1  2 NaN
1  3  4   5

In [20]: df.count(1)
Out[20]: 
0    2
1    3
dtype: int64

In [21]: df[df.count(1)!=len(df.columns)]
Out[21]: 
   a  b   c
0  1  2 NaN

all that said, this could perhaps be incorporated into the api discussions here: #9549
cc @selasley

@jreback jreback added API Design IO CSV read_csv, to_csv labels Mar 25, 2015
@jreback jreback added this to the Next Major Release milestone Mar 25, 2015
@Noxville
Author

So - I could count the NaN's, but then the behaviour is the same for (what I consider) valid input as for invalid input. Here's an example:

I'd say this is a well-formed CSV for tabular data - the same number of fields in each record:

a,b,c
1,2,3
4,5,6

This has a null value in the 1st (non-header) record, but the 2nd comma makes it clear that the null is intentional:

a,b,c
1,,3
4,5,6

This has a missing field in the 1st (non-header) record, and it's not clear whether there should be a NaN in the 3rd field or whether the input is just mangled:

a,b,c
1,2
4,5,6

I'm using pandas in an environment where data integrity is very important - so I'd like the processing to be a little more verbose about potentially missing data.

@glepore70

I would also like to see error_bad_lines trigger for too few fields. Embedded line breaks commonly split a valid row into two rows: I have many files with 80 columns per record, but embedded line breaks give me 37 fields on one line and 43 on the next.

Is there any other approach to finding these bad lines?
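
One possible approach - sketched below with only the standard library, assuming nothing beyond a header row - is to pre-scan the raw file and report records whose field count differs from the header; a row split in two by an embedded line break shows up as two short records:

import csv

def report_short_records(path, expected=None):
    # Sketch: flag CSV records whose field count differs from the header.
    # "path" is a placeholder for your own file.
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        expected = expected or len(header)
        for recno, row in enumerate(reader, start=2):
            if len(row) != expected:
                print(f"record {recno}: expected {expected} fields, got {len(row)}")

# e.g. report_short_records("data.csv")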

@kelvin22

kelvin22 commented Oct 8, 2015

I would also like to have this error for too few fields.
I come across @glepore70's case of a line break splitting a row in two (due to a malformed quote or otherwise) a fair bit... not sure if there is a good way to detect this.

In any case, when there are missing fields, the current read_csv behaviour of adding NAs assumes that it is the last fields that are missing, which is often not the case.

E.g. if something breaks in the first field of the last line:

a,b,c,d
1, '12/12/2015', 'hi', False
1, '12/12/2015', 'hi', False
'12/12/2015', 'hi', False

the dataframe becomes:

a,b,c,d
1, '12/12/2015', 'hi', False
1, '12/12/2015', 'hi', False
'12/12/2015', 'hi', False, NA

This breaks everything downstream, e.g. date parsing, converters, etc.
I don't believe there is any real way to handle these rows either, as the data is not yet in the dataframe.
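
A small sketch of that shift with hypothetical data - note that the NaN lands in column d even though it is column a that is actually missing:

import pandas as pd
from io import StringIO

data = "a,b,c,d\n1,12/12/2015,hi,False\n1,12/12/2015,hi,False\n12/12/2015,hi,False\n"
df = pd.read_csv(StringIO(data))
print(df)
# In the last row everything shifts left: a holds '12/12/2015', b holds 'hi',
# c holds 'False', and the NaN is appended to column d.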

@jreback
Contributor

jreback commented Oct 8, 2015

@kelvin22 your example is not easy to interpret as it's not very obvious where the error is (in an automatic way). However, these types of things will almost always be impossible to dtype-coerce, so they will end up as object. If you are expecting float/datetime/int columns then you can at least be clued in that something is wrong.
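
A sketch of that kind of sanity check - the expected dtypes here are hypothetical and would come from your own schema:

import pandas as pd

def check_dtypes(df, expected):
    # Sketch: compare inferred dtypes against an expected schema and flag mismatches,
    # which often indicate misaligned or truncated rows.
    for col, want in expected.items():
        got = str(df[col].dtype)
        if got != want:
            print(f"column {col!r}: expected {want}, got {got}")

# e.g. check_dtypes(df, {"a": "int64", "b": "object", "d": "bool"})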

@kelvin22

kelvin22 commented Oct 8, 2015

Sorry @jreback, I'll have another crack at explaining.

Currently the behaviour works great if a cell at the end of the row is mangled/missing:

1, '12/12/2015', 'hi', False
1, '12/12/2015', 'hi', False
1, '12/12/2015', 'hi', (NA)
1, '12/12/2015', 'hi', False
1, '12/12/2015', 'hi', False

However it is just as likely to happen anywhere else in the line:

'col1', 'col2', 'col3', 'col4'
1, '12/12/2015', 'hi', False
1, '12/12/2015', 'hi', False
'12/12/2015', 'hi', False, (NA)
1, '12/12/2015', 'hi', False
1, '12/12/2015', 'hi', False

In this case there's no easy way to recover, and the two parameters below would both raise an error:

read_csv(..., dtype={'col3': str}, parse_dates=['col2'])

The NA-counting workaround can't be used here, as the dataframe never gets formed.

If error_bad_lines=False also applied to lines with too few fields, the dud line would be skipped and everything would continue working.

@kodonnell

kodonnell commented Jan 30, 2017

Just came across this, though it took a while to figure out - as @kelvin22 mentioned, I was seeing errors about e.g. an integer field not being allowed to be NA.

As @Noxville mentions, Pandas is actually changing the information content of the data, and hence can't (easily) be used in a pipeline where data integrity is important. From that perspective, I'd call this a critical bug.

@jorisvandenbossche
Member

@kodonnell Note that this issue is an enhancement request, and not a bug. Pandas is only 'changing' content because of a malformed csv file (and the fact that NA's are not allowed in integer fields is a whole other issue, and with current pandas an inherent limitation, but that is being worked on).

See also issue #15122. Comments, suggestions, pull requests very welcome!

@kodonnell

@jorisvandenbossche - I consider not failing when it should fail to be a bug. It can certainly be a bug for downstream users (who e.g. adhere to a tighter CSV specification).

#15122 - looks good.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@sniperyyc

df.dropna() seems to be a simple solution to me if you just want to "skip" bad lines with too few fields -- missing fields will be parsed as NaN when calling read_csv().
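
A quick sketch of that idea - note it also drops rows whose missing values were intentional (e.g. a 1,,3 line):

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("a,b,c\n1,2,3\n4,5\n"))  # the short row is padded with NaN
print(df.dropna())  # keeps only rows with no NaN at all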
