QST: Inconsistent behaviour in checking number of fields per row while read_csv() #41754

buhtz · 2021-06-01T09:18:51Z

I am posting this as a "Question" because I am quite new to pandas. Maybe I am misunderstanding something, and the "problem" I describe is by design and you have good reasons for it.
I use pandas 1.2.4, with Python 3.9.4 on Windows 10 64 bit.

As a user I would expect pandas to check the number of fields per row when importing a CSV file. But IMHO it does not do so in all cases.

Example 1

Here is a CSV file without a header, but with the names= argument set to three fields. So pandas should be able to know how many fields/columns should be in the CSV file. The second row contains 4 instead of 3 fields.

import pandas
import io

csv_without_header = io.StringIO(
'A;B;C\n'
'D;E;X;Y\n'
'F;G;H'
)

df = pandas.read_csv(csv_without_header, encoding='utf-8', sep=';',
                     header=None,
                     names=['First', 'Second', 'Third'])

Pandas imports this without warnings or errors. The 4th field in the 2nd row is simply ignored.

Example 2

I added a header line to the CSV file, again with three fields.
So pandas should be able to know how many fields/columns should be in the CSV file.
And again the second row contains 4 instead of 3 fields.

csv_with_header = io.StringIO(
'First;Second;Third\n'
'A;B;C\n'
'D;E;X;Y\n'
'F;G;H'
)

df = pandas.read_csv(csv_with_header, encoding='utf-8', sep=';')

Here an error occurs as I expect.
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

Example 3

There are fewer than 3 fields in the 2nd row. Again there is no warning or error. The missing field is filled with NaN. And here it does not matter whether you give the number of (expected) fields via the header line in the CSV or via the names= argument.

csv_with_header = io.StringIO(
'First;Second;Third\n'
'A;B;C\n'
'D;Y\n'
'F;G;H'
)

csv_without_header = io.StringIO(
'A;B;C\n'
'D;Y\n'
'F;G;H'
)

df_a = pandas.read_csv(csv_with_header, encoding='utf-8', sep=';')
df_b = pandas.read_csv(csv_without_header, encoding='utf-8', sep=';', names=['First', 'Second', 'Third'])

What I want is to import CSV files and be informed if any row has more or fewer fields than expected.


testingcan · 2021-06-01T11:32:30Z

To me this behaviour is quite consistent with what I would expect a CSV import to do. Having a value outside of the defined columns just doesn't make any sense, so the ParserError is appropriate here. However, having a NaN or missing value is logically sound in my opinion and happens more often than not.

For example, if the contents of your CSV file came from some input where you have to enter three fields, but the CSV contains a row with four columns, this means you have a bigger problem. If you have a row with only two columns and one NaN, that just means that one input was left blank and you need to handle the data appropriately. How to handle that is decided case by case (e.g. do you still consider this a valid datum, do you ignore it, do you populate it with default values?), whereas the other case is almost always a problem with the file, so I think this behaviour makes sense.

In any case, checking for empty values after importing is always a good step, using - in your third example - df_a.isnull() or df_a.isnull().sum().
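As a rough sketch of that post-import check, reusing the data from Example 3 (the variable names missing_per_column and rows_with_missing are only illustrative):

```python
import io
import pandas

csv_with_header = io.StringIO(
    'First;Second;Third\n'
    'A;B;C\n'
    'D;Y\n'
    'F;G;H'
)

df_a = pandas.read_csv(csv_with_header, sep=';')

# Number of missing values in each column.
missing_per_column = df_a.isnull().sum()

# The rows that contain at least one missing value.
rows_with_missing = df_a[df_a.isnull().any(axis=1)]
```

Note that this cannot tell a row that was too short ('D;Y') apart from a row with a deliberately empty field ('D;Y;'); both end up as NaN.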

phofl · 2021-06-01T19:24:46Z

This is a duplicate, please search the issue tracker.

buhtz · 2021-06-02T14:25:15Z

It is impolite to assume that each user opening an issue is stupid and lazy.
Of course I searched the issue tracker. I assume I used the wrong keywords.

But it is also impolite to point to other Issues without giving the links.
It is a feature of HyperText to link your words to content. ;)

If you want contributors to pandas you should work on your attitude.

And as a "pandas member" you should know how to use GitHub's Issue tracker.
If there is a duplicate Issue, link to the original Issue and close the duplicate.

Anything else is a waste of resources.

jreback · 2021-06-02T15:32:07Z

@Codeberg-AsGithubAlternative-buhtz we have 3500 issues and very limited reviewers / volunteers. Please help us out here.

lithomas1 · 2021-06-02T18:53:27Z

Hi @Codeberg-AsGithubAlternative-buhtz,
1 is a bug and is related to #40333/#22144 and may be fixed by #40629
3 is not a bug. It is mentioned in the user guide. https://pandas.pydata.org/docs/user_guide/io.html#handling-bad-lines.
Can you close the issue if I have answered your questions?
Thanks.
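For readers following along: the user guide section linked above covers both cases. Rows with too few fields get NaN in the trailing columns, and rows with too many fields raise an error by default. Below is a minimal sketch of opting out of the error for the second case, using the data from Example 2. It assumes pandas 1.3 or later, where the on_bad_lines argument exists; the 1.2.4 release used in this thread has error_bad_lines/warn_bad_lines instead.

```python
import io
import pandas

csv_with_header = io.StringIO(
    'First;Second;Third\n'
    'A;B;C\n'
    'D;E;X;Y\n'
    'F;G;H'
)

# Assumes pandas >= 1.3. 'skip' silently drops rows with too many fields;
# 'warn' drops them as well but emits a warning for each one.
df = pandas.read_csv(csv_with_header, sep=';', on_bad_lines='skip')
```

Neither setting reports rows with too few fields; those are still filled with NaN.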

buhtz · 2021-06-02T22:17:51Z

Thanks for the Issues and PR. I will monitor them.

For 3: Yes, the (IMHO inconsistent) behavior is described in the docs. But that does not make it right. There should be no difference between the handling of too few and too many columns. But that is just my opinion.
But because of this situation I can never trust pandas while importing CSV files. Before using pandas I have to do some pedantic checks on my own.
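Such a pre-check could look roughly like the sketch below. It uses only the standard library csv module; find_bad_rows is a hypothetical helper name, not part of pandas, and the count refers to CSV records, which only equals the physical line number when no quoted field contains a line break and no blank lines were skipped.

```python
import csv
import io


def find_bad_rows(handle, expected_fields, delimiter=';'):
    """Return (record_number, field_count) for every record whose
    number of fields differs from expected_fields."""
    bad = []
    record_number = 0
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            # csv.reader yields [] for blank lines; read_csv skips
            # blank lines by default, so they are ignored here too.
            continue
        record_number += 1
        if len(row) != expected_fields:
            bad.append((record_number, len(row)))
    return bad


sample = io.StringIO(
    'A;B;C\n'
    'D;E;X;Y\n'
    'F;G\n'
    'F;G;H'
)
problems = find_bad_rows(sample, expected_fields=3)

# Rewind so that the same buffer can be passed to pandas.read_csv() afterwards.
sample.seek(0)
```

If problems is non-empty, the import can be aborted or the rows reported before pandas ever sees the file.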

MarcoGorelli · 2021-06-03T07:38:43Z

> Anything else is a waste of resources.

Given the sheer volume and quality of (voluntary!) work produced (see https://github.com/pandas-dev/pandas/commits?author=phofl for a start, and that's excluding reviews + issue triage), is demanding that they search the issue tracker for you really the best use of their resources?

> you should know how to use GitHub's Issue tracker

Such comments are unwelcome - you've been warned

buhtz added the Needs Triage and Usage Question labels Jun 1, 2021

lithomas1 added the IO CSV and Closing Candidate labels and removed the Needs Triage label Jun 2, 2021

buhtz closed this as completed Jun 2, 2021

pandas-dev locked as too heated and limited conversation to collaborators Jun 3, 2021
