pd.read_table: Using space as delimiter on file with trailing space gives cryptic error #21768

Closed
crypdick opened this issue Jul 6, 2018 · 10 comments · Fixed by #38587
Labels: Error Reporting (incorrect or improved errors from pandas), IO CSV (read_csv, to_csv)

crypdick commented Jul 6, 2018

import numpy as np
import os
import pandas as pd

# put test.csv in same folder as script
mydir = os.path.dirname(os.path.abspath(__file__))
csv_path = os.path.join(mydir, "test.csv")

df = pd.read_table(csv_path, sep=' ',
                   comment='#',
                   header=None,
                   skip_blank_lines=True,
                   names=["A", "B", "C", "D", "E", "F", "G"],
                   dtype={"A": np.int32,
                          "B": np.int32,
                          "C": np.float64,
                          "D": np.float64,
                          "E": np.float64,
                          "F": np.float64,
                          "G": np.int32})

test.csv:

2270433 3 21322.889 11924.667 5228.753 1.0 -1 
2270432 3 21322.297 11924.667 5228.605 1.0 2270433 

Problem description

Attempting to load test.csv with pd.read_table() results in the following errors:
TypeError: Cannot cast array from dtype('float64') to dtype('int32') according to the rule 'safe'

and

ValueError: cannot safely convert passed user dtype of int32 for float64 dtyped data in column 2

Expected behavior:

Either trailing whitespace should be ignored by pandas, or a more informative error should be thrown than "cannot safely convert passed user dtype of int32 for float64". It took me a really long time to figure out that trailing spaces in the csv were the cause.
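For what it's worth, here is a hedged sketch of two workarounds (my own illustration, not from the issue; the data and column names A-G mirror the example above) that parse such a file cleanly:

```python
# Sketch of two workarounds for rows with a trailing space; assumes a
# reasonably recent pandas.
import io

import pandas as pd

# Same two rows as test.csv above, including the trailing spaces.
data = ("2270433 3 21322.889 11924.667 5228.753 1.0 -1 \n"
        "2270432 3 21322.297 11924.667 5228.605 1.0 2270433 \n")

# Workaround 1: sep=r"\s+" treats any run of whitespace as one delimiter
# and ignores leading/trailing whitespace, so exactly 7 fields are parsed.
df1 = pd.read_csv(io.StringIO(data), sep=r"\s+", header=None,
                  names=list("ABCDEFG"))

# Workaround 2: keep sep=' ' but pass index_col=False, so pandas does not
# shift the first field into the index to absorb the extra empty column.
df2 = pd.read_csv(io.StringIO(data), sep=" ", header=None,
                  names=list("ABCDEFG"), index_col=False)

print(df1.shape, df2.shape)  # both parse as 2 rows x 7 columns
```

With either call, column G comes back as the intended integers (-1 and 2270433) and the original dtype mapping no longer fails.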

gfyoung added the IO CSV (read_csv, to_csv) and Error Reporting (incorrect or improved errors from pandas) labels on Jul 6, 2018
gfyoung (Member) commented Jul 6, 2018

How weird! Investigation and PR are certainly welcome!

asishm (Contributor) commented Jul 7, 2018

I'd argue this is expected behavior.

If this were a comma-separated file instead, the provided csv example would look like:

2270433,3,21322.889,11924.667,5228.753,1.0,-1,
2270432,3,21322.297,11924.667,5228.605,1.0,2270433,

which would indicate that this csv file has 8 columns (the last column being all nulls). Since 7 names are passed in the names parameter, pandas is forced to treat the first column as the index, and column G becomes a column of np.nan. But since its dtype was specified as int32, the error is raised, which is as expected: there is no way to convert a NaN to an int.
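A minimal sketch of that point (my own illustration, not from the thread): reading the comma-separated version without names shows the extra, entirely empty eighth column.

```python
# The trailing comma produces one extra, empty field per row, so pandas
# infers 8 columns; the last one is all NaN.
import io

import pandas as pd

rows = ("2270433,3,21322.889,11924.667,5228.753,1.0,-1,\n"
        "2270432,3,21322.297,11924.667,5228.605,1.0,2270433,\n")

df = pd.read_csv(io.StringIO(rows), header=None)
print(df.shape)            # (2, 8)
print(df[7].isna().all())  # True: the eighth column is entirely NaN
```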

gfyoung (Member) commented Jul 7, 2018

@asishm : Ah, that's a very good point. Nevertheless, it's not immediately obvious that this is the case, IMO. Might you be able to pinpoint where we perform that logic? I can see two ways to fix this:

  • Document that this is how names behaves.
  • Change how we handle the case where len(names) does not equal the number of parsed columns in the table (especially since index_col=None by default in this case).

asishm (Contributor) commented Jul 8, 2018

@gfyoung I believe the documentation for index_col does specify this specific case:

index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end
of each line, you might consider index_col=False to force pandas to not
use the first column as the index (row names)

I would agree that for the case where index_col=None and names is specified, the documentation should note that if len(names) < len(actual_columns), the first len(actual_columns) - len(names) columns are treated as the index. I can't pinpoint exactly where this happens because I'm unfamiliar with pandas internals, but see the examples below:

In [113]: s
Out[113]: '2270433 3 21322.889 11924.667 5228.753 1.0 -1 \n2270432 3 21322.297 11924.667 5228.605 1.0 2270433 '

In [114]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcd'))
Out[114]:
                                      a    b        c   d
2270433 3 21322.889 11924.667  5228.753  1.0       -1 NaN
2270432 3 21322.297 11924.667  5228.605  1.0  2270433 NaN

In [115]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdef'))
Out[115]:
                   a          b         c    d        e   f
2270433 3  21322.889  11924.667  5228.753  1.0       -1 NaN
2270432 3  21322.297  11924.667  5228.605  1.0  2270433 NaN

In [116]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdefg'))
Out[116]:
         a          b          c         d    e        f   g
2270433  3  21322.889  11924.667  5228.753  1.0       -1 NaN
2270432  3  21322.297  11924.667  5228.605  1.0  2270433 NaN

In [117]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdefgh'))
Out[117]:
         a  b          c          d         e    f        g   h
0  2270433  3  21322.889  11924.667  5228.753  1.0       -1 NaN
1  2270432  3  21322.297  11924.667  5228.605  1.0  2270433 NaN

In [118]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdefg'), index_col=False)
Out[118]:
         a  b          c          d         e    f        g
0  2270433  3  21322.889  11924.667  5228.753  1.0       -1
1  2270432  3  21322.297  11924.667  5228.605  1.0  2270433

In [119]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdef'), index_col=False) # error - List index out of range

In [121]: pd.read_csv(io.StringIO(s), header=None, sep=' ', names=list('abcdef'), index_col=False, engine='python')
Out[121]:
         a  b          c          d         e    f
0  2270433  3  21322.889  11924.667  5228.753  1.0
1  2270432  3  21322.297  11924.667  5228.605  1.0

Something curious here is the difference between the python and C engine parsers when index_col=False, as demonstrated in the last 2 examples: the C parser raises "list index out of range", but the python parser just silently discards the extra columns at the end.

gfyoung (Member) commented Jul 8, 2018

@asishm : Thanks for pointing that out. I think the docs could be clarified further to make this more explicit. The discrepancy you found is a bug in the Python engine, IMO; a malformed CSV shouldn't just silently drop columns like that.

phofl (Member) commented Dec 13, 2020

@gfyoung This behavior seems to have changed since then: the C engine does not raise an error now either. Should we change both engines to raise instead of silently dropping the data?

gfyoung (Member) commented Dec 13, 2020

IMO if data loss is imminent, then we should error, or at least warn if we don't want to make an API change.

phofl (Member) commented Dec 19, 2020

Thanks for the response.

I looked into this a bit. We have a few test cases covering exactly this behavior (#26218, for example). So a warning before eventually raising would probably be appropriate here?
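Roughly how the warning-first behavior looks from the user's side (a hedged sketch with made-up data; on pandas versions without the fix from #38587, the columns may simply be dropped with no warning at all):

```python
# When len(names) is shorter than the parsed row and index_col=False,
# the trailing fields are discarded; newer pandas versions emit a
# ParserWarning to flag the data loss instead of failing hard.
import io
import warnings

import pandas as pd

data = "1,2,3\n4,5,6\n"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = pd.read_csv(io.StringIO(data), header=None,
                     names=["a", "b"], index_col=False)

print(df.shape)  # (2, 2): the third field of each row was dropped
print([w.category.__name__ for w in caught])  # may include "ParserWarning"
```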

gfyoung (Member) commented Dec 19, 2020

Let's go with warning for now.

simonjayhawkins (Member) commented

Removing the milestone, but it looks like the PR is almost ready.
