Inconsistent behavior of read_csv when given an additional value on the first row of CSV file #33037

Closed
tlorieul opened this issue Mar 26, 2020 · 9 comments
Labels: API Design; Duplicate Report (Duplicate issue or pull request); IO CSV (read_csv, to_csv)

Comments

@tlorieul

tlorieul commented Mar 26, 2020

Short description

Using the following invalid CSV:

col1_name,col2_name,col3_name
0,1,2,X
4,5,6,
6,7,8

It has 3 columns, but the first row has 4 values, and loading it should raise a ParserError.

However, loading using:

pd.read_csv("test.csv", index_col=False)

does not raise any exception and returns the following DataFrame:

   col1_name  col2_name  col3_name
0          0          1          2
1          4          5          6
2          6          7          8

Thus, it silently drops the additional value X.
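
For reference, here is a minimal sketch that reproduces this from an in-memory string instead of test.csv. The helper name read_without_index is only illustrative. Capturing warnings is an assumption on my part: pandas 1.0.3 is silent here, but other versions may emit a warning, so the sketch records whatever is raised rather than assuming either way.

```python
import io
import warnings

import pandas as pd

# CSV from the report: three header names, but the first data row has four values.
CSV_TEXT = """col1_name,col2_name,col3_name
0,1,2,X
4,5,6,
6,7,8
"""


def read_without_index(text):
    """Parse CSV text with index_col=False and collect any warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        frame = pd.read_csv(io.StringIO(text), index_col=False)
    return frame, [str(w.message) for w in caught]


df, messages = read_without_index(CSV_TEXT)
print(df)
print("pandas", pd.__version__, "warnings:", messages)
```

Printing the pandas version next to the result makes it easier to compare behavior across releases.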

Problem description

This happens whenever the first data row is invalid, regardless of whether the following rows are also invalid.
If the first invalid row is not the first data row, pandas raises a ParserError.
For example, each of the following three CSV files produces the same result (the additional values are silently dropped):

col1_name,col2_name,col3_name
0,1,2,X
4,5,6,X
6,7,8

col1_name,col2_name,col3_name
0,1,2,X
4,5,6,X
6,7,8,X

col1_name,col2_name,col3_name
0,1,2,X
4,5,6
6,7,8,X

But this, as expected, throws an exception:

col1_name,col2_name,col3_name
0,1,2
4,5,6,X
6,7,8,X
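
A small sketch of this failing case, again using an in-memory string. The helper parse_outcome is a hypothetical name used only for illustration; ParserError is imported from pandas.errors. The exact error message may differ between versions, so it is printed rather than compared.

```python
import io

import pandas as pd
from pandas.errors import ParserError

# The first data row matches the header; the extra value appears only later.
LATE_EXTRA = """col1_name,col2_name,col3_name
0,1,2
4,5,6,X
6,7,8,X
"""


def parse_outcome(text):
    """Return a short description of what read_csv does with the text."""
    try:
        pd.read_csv(io.StringIO(text), index_col=False)
    except ParserError as exc:
        return f"ParserError: {exc}"
    return "parsed without error"


print(parse_outcome(LATE_EXTRA))
```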

Finally, if there are two additional values instead of a single one, it throws the following exception:

IndexError: list index out of range

It would be preferable to behave consistently by raising an exception in all of the previous cases.
The fact that the value is dropped silently makes it harder to validate CSV files.
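
Until the behavior is consistent, one possible workaround is to check the row widths before handing the file to pandas. The sketch below uses only the standard library csv module; find_ragged_rows is a hypothetical helper, and it assumes a comma-delimited file with default quoting and a header on the first line.

```python
import csv
import io


def find_ragged_rows(text):
    """Return (line_number, field_count) for every row wider than the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to check
    ragged = []
    for row in reader:
        if len(row) > len(header):
            # reader.line_num is the 1-based source line of the row just read
            ragged.append((reader.line_num, len(row)))
    return ragged


sample = "col1_name,col2_name,col3_name\n0,1,2,X\n4,5,6,\n6,7,8\n"
print(find_ragged_rows(sample))  # [(2, 4), (3, 4)]
```

Note that a trailing comma counts as an extra (empty) field, so the row 4,5,6, is reported as well.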

Expected Output

Raise a ParserError in all of the previous cases.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-91-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.1.post20200323
Cython : 0.29.15
pytest : 5.4.1
hypothesis : None
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Edits

  • corrected copy-paste error in the returned DataFrame example (cf. reply by @gfyoung)
@TomAugspurger
Contributor

cc @gfyoung.

@TomAugspurger TomAugspurger added the IO CSV read_csv, to_csv label Mar 26, 2020
@gfyoung
Member

gfyoung commented Mar 27, 2020

@tlorieul : We use the first row as a heuristic to gauge the number of columns, and admittedly, some of the cases get pretty gnarly because there are some consistency issues that I'm not surprised we still have.

Implementations of improvements are most certainly welcome!


I just wanted to start with that. Now, to address some of your comments:

It has 3 columns, but the first row has 4 values, and loading it should raise a ParserError.

It does not in fact. You get this instead:

   col1_name  col2_name col3_name
0          1          2         X
4          5          6       NaN
6          7          8       NaN

does not raise any exception and returns the following DataFrame:

True, but you get this instead:

   col1_name  col2_name  col3_name
0          0          1          2
1          4          5          6
2          6          7          8

@tlorieul
Author

@gfyoung : Actually, the first result you show is what you get when calling read_csv with the default value of index_col. I should have emphasized that I was passing index_col=False.
With the default value, that result does seem like plausible expected behavior.

But, when setting index_col=False and when given a header in the CSV file, I would have expected pandas to use it to derive the number of expected columns.
Is there a rationale for not doing so?

You are right about the returned DataFrame. I made a mistake when submitting the issue, sorry about that.

@gfyoung
Member

gfyoung commented Mar 27, 2020

But, when setting index_col=False and when given a header in the CSV file, I would have expected pandas to use it to derive the number of expected columns.

I'm confused by this question. index_col is used for determining the index of the DataFrame, which is based on the first column of data, not the header, which is the first row.

@tlorieul
Author

Yes, but what I meant is that when index_col=False then it forces pandas not to use the first column as an index and thus the number of columns in the header should be exactly the number of values per row (at least I do not see why it should not be the case).

On the other hand, when index_col is not set to False, an ambiguity remains. Either the header has as many columns as the other rows, or the other rows have one additional column that corresponds to the index (in which case rows with missing values are completed with NaN). To resolve this ambiguity, we need to look at the rows, which is what pandas seems to be doing.
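
To make the default case concrete, here is a short sketch with the CSV from the issue and index_col left at its default. It only illustrates the observable result (the first column becomes the index, as in the output posted by @gfyoung above); it says nothing about how pandas decides this internally.

```python
import io

import pandas as pd

# Same CSV as in the issue: three header names, four values on the first data row.
CSV_TEXT = """col1_name,col2_name,col3_name
0,1,2,X
4,5,6,
6,7,8
"""

# index_col is left at its default, so pandas may treat the first column as the index.
df_default = pd.read_csv(io.StringIO(CSV_TEXT))
print(df_default)
print("index:", df_default.index.tolist())
```

Here the three header names are applied to the remaining columns, so X ends up under col3_name and the shorter rows are completed with NaN.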

@gfyoung
Member

gfyoung commented Mar 27, 2020

Yes, but what I meant is that when index_col=False then it forces pandas not to use the first column as an index and thus the number of columns in the header should be exactly the number of values per row (at least I do not see why it should not be the case).

To some extent that already is happening. The header we infer has three columns, so the "X" from the first row is omitted (same with the NaN in the second row).

@tlorieul
Author

tlorieul commented Mar 27, 2020 via email

@gfyoung
Member

gfyoung commented Mar 27, 2020

I think a warning would be an interesting option to explore. Losing data is something we try to avoid as much as possible.

@jreback jreback added this to the Contributions Welcome milestone Nov 26, 2020
@jreback
Contributor

jreback commented Jan 1, 2021

duplicate of #21768

@jreback jreback closed this as completed Jan 1, 2021
@jreback jreback added the Duplicate Report Duplicate issue or pull request label Jan 1, 2021