pd.read_table: Using space as delimiter on file with trailing space gives cryptic error #21768

Closed
crypdick opened this issue Jul 6, 2018 · 10 comments · Fixed by #38587
Labels: Error Reporting (incorrect or improved errors from pandas), IO CSV (read_csv, to_csv)

crypdick commented Jul 6, 2018

import numpy as np
import os
import pandas as pd

# put test.csv in same folder as script
mydir = os.path.dirname(os.path.abspath(__file__))
csv_path = os.path.join(mydir, "test.csv")

df = pd.read_table(csv_path, sep=' ',
                   comment='#',
                   header=None,
                   skip_blank_lines=True,
                   names=["A", "B", "C", "D", "E", "F", "G"],
                   dtype={"A": np.int32,
                          "B": np.int32,
                          "C": np.float64,
                          "D": np.float64,
                          "E": np.float64,
                          "F": np.float64,
                          "G": np.int32})

test.csv:

2270433 3 21322.889 11924.667 5228.753 1.0 -1 
2270432 3 21322.297 11924.667 5228.605 1.0 2270433 

Problem description

Attempting to load test.csv with pd.read_table() results in the following errors:
TypeError: Cannot cast array from dtype('float64') to dtype('int32') according to the rule 'safe'

and

ValueError: cannot safely convert passed user dtype of int32 for float64 dtyped data in column 2

Expected behavior:

Either trailing whitespace should be ignored by pandas, or a more informative error should be thrown than "cannot safely convert passed user dtype of int32 for float64". It took me a really long time to figure out that trailing spaces in the csv were the cause.
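For what it's worth, here is a hedged sketch of two workarounds (my own illustration, not from the issue; the data and column names A-G mirror the example above) that parse such a file cleanly:

```python
# Sketch of two workarounds for rows with a trailing space; assumes a
# reasonably recent pandas.
import io

import pandas as pd

# Same two rows as test.csv above, including the trailing spaces.
data = ("2270433 3 21322.889 11924.667 5228.753 1.0 -1 \n"
        "2270432 3 21322.297 11924.667 5228.605 1.0 2270433 \n")

# Workaround 1: sep=r"\s+" treats any run of whitespace as one delimiter
# and ignores leading/trailing whitespace, so exactly 7 fields are parsed.
df1 = pd.read_csv(io.StringIO(data), sep=r"\s+", header=None,
                  names=list("ABCDEFG"))

# Workaround 2: keep sep=' ' but pass index_col=False, so pandas does not
# shift the first field into the index to absorb the extra empty column.
df2 = pd.read_csv(io.StringIO(data), sep=" ", header=None,
                  names=list("ABCDEFG"), index_col=False)

print(df1.shape, df2.shape)  # both parse as 2 rows x 7 columns
```

With either call, column G comes back as the intended integers (-1 and 2270433) and the original dtype mapping no longer fails.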

gfyoung added the IO CSV (read_csv, to_csv) and Error Reporting (incorrect or improved errors from pandas) labels on Jul 6, 2018
gfyoung (Member) commented Jul 6, 2018

How weird! Investigation and PR are certainly welcome!

asishm (Contributor) commented Jul 7, 2018

I'd argue this is expected behavior.

If this were a comma-separated file instead, the provided csv example would look like:

2270433,3,21322.889,11924.667,5228.753,1.0,-1,
2270432,3,21322.297,11924.667,5228.605,1.0,2270433,

which would indicate that this csv file has 8 columns (the last column being all nulls). Since 7 names are passed in the names parameter, pandas is forced to treat the first column as the index, and column G becomes a column of np.nan. But since its dtype was specified as int32, the error is raised, which is as expected: there is no way to convert a NaN to an int.
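A minimal sketch of that point (my own illustration, not from the thread): reading the comma-separated version without names shows the extra, entirely empty eighth column.

```python
# The trailing comma produces one extra, empty field per row, so pandas
# infers 8 columns; the last one is all NaN.
import io

import pandas as pd

rows = ("2270433,3,21322.889,11924.667,5228.753,1.0,-1,\n"
        "2270432,3,21322.297,11924.667,5228.605,1.0,2270433,\n")

df = pd.read_csv(io.StringIO(rows), header=None)
print(df.shape)            # (2, 8)
print(df[7].isna().all())  # True: the eighth column is entirely NaN
```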

gfyoung (Member) commented Jul 7, 2018

@asishm : Ah, that's a very good point. Nevertheless, it's not immediately obvious that this is the case, IMO. Might you be able to pinpoint where we perform that logic? I can see two ways to fix this:

  • Document that this is how names behaves.
  • Change how we handle the case where len(names) does not equal the number of parsed columns in the table (especially since index_col=None by default in this case).

asishm (Contributor) commented Jul 8, 2018

@gfyoung I believe the documentation for index_col does specify this specific case:

index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end
of each line, you might consider index_col=False to force pandas to not
use the first column as the index (row names)

I would agree that for the case where index_col=None and names is specified, the documentation should note that if len(names) < len(actual_columns), the first len(actual_columns) - len(names) columns are treated as the index. I can't pinpoint exactly where this happens because I'm unfamiliar with pandas internals, but see the examples below:

In [113]: s
Out[113]: '2270433 3 21322.889 11924.667 5228.753 1.0 -1 \n2270432 3 21322.297 11924.667 5228.605 1.0 2270433 '

In [114]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcd'))
Out[114]:
                                      a    b        c   d
2270433 3 21322.889 11924.667  5228.753  1.0       -1 NaN
2270432 3 21322.297 11924.667  5228.605  1.0  2270433 NaN

In [115]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdef'))
Out[115]:
                   a          b         c    d        e   f
2270433 3  21322.889  11924.667  5228.753  1.0       -1 NaN
2270432 3  21322.297  11924.667  5228.605  1.0  2270433 NaN

In [116]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdefg'))
Out[116]:
         a          b          c         d    e        f   g
2270433  3  21322.889  11924.667  5228.753  1.0       -1 NaN
2270432  3  21322.297  11924.667  5228.605  1.0  2270433 NaN

In [117]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdefgh'))
Out[117]:
         a  b          c          d         e    f        g   h
0  2270433  3  21322.889  11924.667  5228.753  1.0       -1 NaN
1  2270432  3  21322.297  11924.667  5228.605  1.0  2270433 NaN

In [118]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdefg'), index_col=False)
Out[118]:
         a  b          c          d         e    f        g
0  2270433  3  21322.889  11924.667  5228.753  1.0       -1
1  2270432  3  21322.297  11924.667  5228.605  1.0  2270433

In [119]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdef'), index_col=False) # error - List index out of range

In [121]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdef'), index_col=False, engine='python')
Out[121]:
         a  b          c          d         e    f
0  2270433  3  21322.889  11924.667  5228.753  1.0
1  2270432  3  21322.297  11924.667  5228.605  1.0

Something curious here is the difference between the python and C engine parsers when index_col=False, as demonstrated in the last 2 examples: the C parser raises "list index out of range", but the python parser just silently discards the extra columns at the end.

gfyoung (Member) commented Jul 8, 2018

@asishm : Thanks for pointing that out. I think the docs could be clarified further to make this more explicit. The discrepancy you found is a bug in the Python engine, IMO; a malformed CSV shouldn't just silently drop columns like that.

phofl (Member) commented Dec 13, 2020

@gfyoung This behavior seems to have changed since then: the C engine does not raise an error now either. Should we change both engines to raise instead of silently dropping the data?

gfyoung (Member) commented Dec 13, 2020

IMO if data loss is imminent, then we should error, or at least warn if we don't want to make an API change.

phofl (Member) commented Dec 19, 2020

Thanks for the response.

I looked into this a bit. We have a few test cases covering exactly this behavior (#26218, for example). So a warning before eventually raising would probably be appropriate here?
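Roughly how the warning-first behavior looks from the user's side (a hedged sketch with made-up data; on pandas versions without the fix from #38587, the columns may simply be dropped with no warning at all):

```python
# When len(names) is shorter than the parsed row and index_col=False,
# the trailing fields are discarded; newer pandas versions emit a
# ParserWarning to flag the data loss instead of failing hard.
import io
import warnings

import pandas as pd

data = "1,2,3\n4,5,6\n"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = pd.read_csv(io.StringIO(data), header=None,
                     names=["a", "b"], index_col=False)

print(df.shape)  # (2, 2): the third field of each row was dropped
print([w.category.__name__ for w in caught])  # may include "ParserWarning"
```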

gfyoung (Member) commented Dec 19, 2020

Let's go with warning for now.

simonjayhawkins (Member) commented

Removing the milestone, but it looks like the PR is almost ready.
