pd.read_table: Using space as delimiter on file with trailing space gives cryptic error #21768
Comments
How weird! Investigation and PR are certainly welcome!
I'd argue this is behavior as expected. Consider if it were a comma-separated file instead: the provided example would end every line with a trailing comma, which would indicate that the file has 8 columns (the last column being all nulls). Since only 7 column names are passed in the `names` parameter, pandas uses the leftover first column as the index.
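A minimal reproduction of this point, using a small illustrative string in place of the attached file:

```python
import io

import pandas as pd

# Each line ends with the separator, so pandas sees an extra,
# empty trailing field on every row.
s = "1 2 3 \n4 5 6 \n"

df = pd.read_csv(io.StringIO(s), sep=" ", header=None)
print(df.shape)  # (2, 4): the 4th column is all NaN

# Passing one fewer name than there are parsed columns makes
# pandas use the leftover first column as the index.
df2 = pd.read_csv(io.StringIO(s), sep=" ", header=None, names=["a", "b", "c"])
print(list(df2.index))  # [1, 4]
```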
@asishm : Ah, that's a very good point. Nevertheless, it's not immediately obvious that this was the case, IMO. Might you be able to pinpoint where we perform that kind of logic? I can see two ways to fix this:
@gfyoung I believe the documentation for
I would agree that for the cases shown below:

In [113]: s
Out[113]: '2270433 3 21322.889 11924.667 5228.753 1.0 -1 \n2270432 3 21322.297 11924.667 5228.605 1.0 2270433 '
In [114]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcd'))
Out[114]:
a b c d
2270433 3 21322.889 11924.667 5228.753 1.0 -1 NaN
2270432 3 21322.297 11924.667 5228.605 1.0 2270433 NaN
In [115]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdef'))
Out[115]:
a b c d e f
2270433 3 21322.889 11924.667 5228.753 1.0 -1 NaN
2270432 3 21322.297 11924.667 5228.605 1.0 2270433 NaN
In [116]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdefg'))
Out[116]:
a b c d e f g
2270433 3 21322.889 11924.667 5228.753 1.0 -1 NaN
2270432 3 21322.297 11924.667 5228.605 1.0 2270433 NaN
In [117]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdefgh'))
Out[117]:
a b c d e f g h
0 2270433 3 21322.889 11924.667 5228.753 1.0 -1 NaN
1 2270432 3 21322.297 11924.667 5228.605 1.0 2270433 NaN
In [118]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdefg'), index_col=False)
Out[118]:
a b c d e f g
0 2270433 3 21322.889 11924.667 5228.753 1.0 -1
1 2270432 3 21322.297 11924.667 5228.605 1.0 2270433
In [119]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdef'), index_col=False) # error - List index out of range
In [121]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdef'), index_col=False, engine='python')
Out[121]:
a b c d e f
0 2270433 3 21322.889 11924.667 5228.753 1.0
1 2270432 3 21322.297 11924.667 5228.605 1.0 Something that is curious here is the difference between the python and C engine parsers when |
@asishm : Thanks for pointing that out. I think the docs could be clarified further to make this more explicit. The discrepancy you found is a bug in the Python engine, IMO: malformed CSV shouldn't just silently drop columns like that.
@gfyoung This behavior seems to have changed since then: the C engine no longer raises an error either. Should we change this to raise errors instead of dropping the data in both engines?
IMO, if data loss is imminent, then we should error, or at least warn if we don't want to make an API change.
Thanks for the response. I looked into this a bit; we have a few cases covering exactly this behavior (#26218, for example). So warning before eventually raising would probably be appropriate here?
Let's go with warning for now. |
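In pandas versions that include the resulting change (an assumption about the reader's install; the exact version is not stated in this thread), the loss-of-data case emits a `ParserWarning` instead of failing silently. A sketch:

```python
import io
import warnings

import pandas as pd

s = "1 2 3\n4 5 6\n"

# With fewer names than data columns and index_col=False, the extra
# column is dropped; recent pandas warns about the data loss
# (assumes a pandas version that includes the warning from this issue).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = pd.read_csv(io.StringIO(s), sep=" ", header=None,
                     names=["a", "b"], index_col=False)

print(df.shape)  # (2, 2): the third column was dropped
print(any(issubclass(w.category, pd.errors.ParserWarning) for w in caught))
```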
Removing the milestone, but it looks like the PR is almost ready.
test.csv
Problem description
Attempting to load test.csv with pd.read_table() results in the following errors:
TypeError: Cannot cast array from dtype('float64') to dtype('int32') according to the rule 'safe'
and
ValueError: cannot safely convert passed user dtype of int32 for float64 dtyped data in column 2
Expected behavior:
Either trailing whitespace should be ignored by Pandas, or a more informative error should be thrown than "cannot safely convert passed user dtype of int32 for float64". It took me a really long time to figure out that this was caused by trailing spaces in the csv.
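Two possible workarounds for files like this, sketched with an illustrative string standing in for the attached test.csv:

```python
import io

import pandas as pd

s = "1 2 3 \n4 5 6 \n"  # trailing space on each line

# Workaround 1: whitespace-run separator. sep='\s+' treats any run
# of whitespace as a single delimiter and ignores leading/trailing
# whitespace, so no phantom empty column appears.
df = pd.read_csv(io.StringIO(s), sep=r"\s+", header=None)
print(df.shape)  # (2, 3)

# Workaround 2: strip trailing whitespace before handing the text
# to the parser.
cleaned = "\n".join(line.rstrip() for line in s.splitlines())
df2 = pd.read_csv(io.StringIO(cleaned), sep=" ", header=None,
                  names=["a", "b", "c"])
print(df2.shape)  # (2, 3)
```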