Inconsistent behavior of `read_csv` when given an additional value on the first row of CSV file #33037

tlorieul · 2020-03-26T11:56:30Z

Short description

Using the following invalid CSV:

col1_name,col2_name,col3_name
0,1,2,X
4,5,6,
6,7,8

It has 3 columns but the first row has 4 values and loading it should raise a ParsingError.

However, loading using:

pd.read_csv("test.csv", index_col=False)

does not raise any exception and returns the following DataFrame:

   col1_name  col2_name  col3_name
0          0          1          2
1          4          5          6
2          6          7          8

Thus, it silently drops the additional value X.

Problem description

This happens whenever the first row is invalid, whether or not the following rows are also invalid.
If the first invalid row is not the first row, a ParsingError exception is raised.
For example, the following CSV files all produce the same result (the additional values are silently dropped):

col1_name,col2_name,col3_name
0,1,2,X
4,5,6,X
6,7,8

col1_name,col2_name,col3_name
0,1,2,X
4,5,6,X
6,7,8,X

col1_name,col2_name,col3_name
0,1,2,X
4,5,6
6,7,8,X

But this, as expected, throws an exception:

col1_name,col2_name,col3_name
0,1,2
4,5,6,X
6,7,8,X
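
The silent case and the raising case can be reproduced without a file on disk by passing the CSV text through `io.StringIO`. This is only a minimal sketch: the helper name `load` and the two constants are made up for illustration, and the exact behavior (including whether a warning is emitted) may differ between pandas versions. The exception pandas raises for malformed rows is `pandas.errors.ParserError`.

```python
import io

import pandas as pd
from pandas.errors import ParserError

# First data row has an extra value: reported in this issue to load silently.
EXTRA_ON_FIRST_ROW = "col1_name,col2_name,col3_name\n0,1,2,X\n4,5,6,\n6,7,8\n"
# First data row is valid, later rows have an extra value: reported to raise.
EXTRA_ON_LATER_ROWS = "col1_name,col2_name,col3_name\n0,1,2\n4,5,6,X\n6,7,8,X\n"


def load(text):
    """Parse CSV text the same way as pd.read_csv("test.csv", index_col=False)."""
    return pd.read_csv(io.StringIO(text), index_col=False)


silent = load(EXTRA_ON_FIRST_ROW)
print(silent)  # the extra value "X" does not appear in the result

try:
    load(EXTRA_ON_LATER_ROWS)
    raised = False
except ParserError as exc:
    raised = True
    print(f"ParserError: {exc}")
```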

Finally, if there are two additional values instead of a single one, it throws the following exception:

IndexError: list index out of range

Consistent behavior, raising an exception in each of the previous cases, would be preferable.
The fact that the values are dropped silently makes it harder to validate CSV files.
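
Until the parser reports such rows itself, one possible workaround is to check row widths before handing the file to pandas. The sketch below uses only the standard library `csv` module; the function name `check_row_widths` is hypothetical and not part of pandas. It counts fields strictly, so a row with a trailing comma such as `4,5,6,` is reported as having 4 fields.

```python
import csv
import io


def check_row_widths(text):
    """Return (line_number, field_count) for each row whose width differs from the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    mismatches = []
    for row in reader:
        if not row:
            continue  # skip completely blank lines
        if len(row) != len(header):
            # reader.line_num is the 1-based number of the line just read
            mismatches.append((reader.line_num, len(row)))
    return mismatches


invalid = "col1_name,col2_name,col3_name\n0,1,2,X\n4,5,6,\n6,7,8\n"
for line_number, width in check_row_widths(invalid):
    print(f"line {line_number}: expected 3 fields, found {width}")
```

A caller could raise an exception or emit a warning whenever the returned list is not empty.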

Expected Output

Throw a ParsingError in the previous cases.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-91-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.1.post20200323
Cython : 0.29.15
pytest : 5.4.1
hypothesis : None
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Edits

corrected copy-paste error in the returned DataFrame example (cf. reply by @gfyoung)

TomAugspurger · 2020-03-26T14:42:55Z

cc @gfyoung.

gfyoung · 2020-03-27T04:40:31Z

@tlorieul : We use the first row as a heuristic to gauge the number of columns, and admittedly, some of the cases get pretty gnarly because there are some consistency issues that I'm not surprised we still have.

Contributions to improve this are most certainly welcome!

I just wanted to start with that first, now to address some of your comments:

> It has 3 columns but the first row has 4 values and loading it should raise a ParsingError.

It does not in fact. You get this instead:

   col1_name  col2_name col3_name
0          1          2         X
4          5          6       NaN
6          7          8       NaN

> does not raise any exception and returns the following DataFrame:

True, but you get this instead:

   col1_name  col2_name  col3_name
0          0          1          2
1          4          5          6
2          6          7          8

tlorieul · 2020-03-27T11:31:05Z

@gfyoung : Actually, the first result you show is what you get when calling read_csv with the default value of index_col; I should have emphasized this.
In that case, it does seem like plausible behavior.

But, when setting index_col=False and when given a header in the CSV file, I would have expected pandas to use it to derive the number of expected columns.
Is there a rationale for not doing so?

You are right about the returned DataFrame; I made a mistake when submitting the issue, sorry about that.

gfyoung · 2020-03-27T17:11:00Z

> But, when setting index_col=False and when given a header in the CSV file, I would have expected pandas to use it to derive the number of expected columns.

I'm confused by this question. index_col is used for determining the index of the DataFrame, which is based on the first column of data, not on the header, which is the first row.

tlorieul · 2020-03-27T17:39:09Z

Yes, but what I meant is that when index_col=False then it forces pandas not to use the first column as an index and thus the number of columns in the header should be exactly the number of values per row (at least I do not see why it should not be the case).

On the other hand, when index_col is not set to False, an ambiguity still exists. Either there are as many columns in the header as in the other rows, or there is an additional column in the other rows corresponding to the index (in which case rows with missing values are completed with NaN). To remove this ambiguity, we need to look at the rows, which is what pandas seems to be doing.
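
The second reading of that ambiguity is what the default call produces for the CSV from the issue, as shown in the output posted earlier in this thread. The following is a small sketch of it; the name `CSV_TEXT` is just for illustration, and the printed layout may vary slightly across pandas versions.

```python
import io

import pandas as pd

CSV_TEXT = "col1_name,col2_name,col3_name\n0,1,2,X\n4,5,6,\n6,7,8\n"

# Default index_col: the rows hold one more value than the header has names,
# so the first value of each row is taken as the index and the remaining
# values are matched to the three header names.
with_default = pd.read_csv(io.StringIO(CSV_TEXT))
print(with_default)
print(list(with_default.index))
```

Here the "X" ends up in col3_name and the shorter rows are completed with NaN, rather than anything being dropped.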

gfyoung · 2020-03-27T17:49:40Z

> Yes, but what I meant is that when index_col=False then it forces pandas not to use the first column as an index and thus the number of columns in the header should be exactly the number of values per row (at least I do not see why it should not be the case).

To some extent that already is happening. The header we infer has three columns, so the "X" from the first row is omitted (same with the NaN in the second row).
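
The behavior discussed in this thread is consistent with a simple model in which the first data row, not the header, sets the expected row width. The sketch below is a toy model written for illustration only; it is not pandas' parser code, and the function name `parse_with_first_row_heuristic` is made up. It assumes that rows longer than the first data row are rejected, shorter rows are padded, and values beyond the header's width are discarded.

```python
import csv
import io


def parse_with_first_row_heuristic(text):
    """Toy model: the first data row sets the expected width for all later rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    expected = len(data[0])  # width is gauged from the first data row, not the header
    parsed = []
    for line_number, row in enumerate(data, start=2):
        if len(row) > expected:
            raise ValueError(
                f"Expected {expected} fields in line {line_number}, saw {len(row)}"
            )
        # Shorter rows are padded with missing values up to the expected width.
        padded = row + [None] * (expected - len(row))
        # zip stops at the shorter input, so values past the header width are dropped.
        parsed.append(dict(zip(header, padded)))
    return parsed


records = parse_with_first_row_heuristic(
    "col1_name,col2_name,col3_name\n0,1,2,X\n4,5,6\n6,7,8,X\n"
)
print(records)
```

Under this model an extra value on the first row passes silently, while an extra value that first appears on a later row raises. It does not reproduce the IndexError reported for two additional values.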

tlorieul · 2020-03-27T18:56:08Z

So you believe this should be silently omitted without at least a warning, an exception or some form of feedback to the user?

gfyoung · 2020-03-27T19:14:49Z

I think a warning would be an interesting option to explore. Losing data is something we try to avoid as much as possible.

jreback · 2021-01-01T21:53:24Z

duplicate of #21768

TomAugspurger added the "IO CSV" (read_csv, to_csv) label Mar 26, 2020

gfyoung added the API Design label Mar 27, 2020

mproszewska mentioned this issue Apr 25, 2020 in BUG: Add warning if rows have more columns than expected #33782 (closed)

jreback added this to the Contributions Welcome milestone Nov 26, 2020

jreback closed this as completed Jan 1, 2021

jreback added the "Duplicate Report" (duplicate issue or pull request) label Jan 1, 2021
