csv parsing gets confused on first line commas #9295

kousu · 2015-01-19T09:12:01Z

This csv file loads fine:

DecisionM,IntelligentM,freq,total
0, 5, 9, 20 
0, 6, 21,33
0, 7, 35,65
0, 8, 35,83
0, 9, 14,41
0, 10, 10,26
1, 5, 11,20
1, 6, 12,33
1, 7, 30,65
1, 8, 48,83
1, 9, 27, 41  
1, 10, 16, 26

In [11]: pandas.read_csv("speeddating.csv")
Out[11]: 
    DecisionM  IntelligentM  freq  total
0           0             5     9     20
1           0             6    21     33
2           0             7    35     65
3           0             8    35     83
4           0             9    14     41
5           0            10    10     26
6           1             5    11     20
7           1             6    12     33
8           1             7    30     65
9           1             8    48     83
10          1             9    27     41
11          1            10    16     26

In [12]:

A small tweak causes the dataset to be silently corrupted:

DecisionM,IntelligentM,freq,total
0, 5, 9, 20, 
0, 6, 21,33
0, 7, 35,65
0, 8, 35,83
0, 9, 14,41
0, 10, 10,26
1, 5, 11,20
1, 6, 12,33
1, 7, 30,65
1, 8, 48,83
1, 9, 27, 41  
1, 10, 16, 26

In [10]: pandas.read_csv("speeddating.csv")
Out[10]: 
   DecisionM  IntelligentM  freq  total
0          5             9    20    NaN
0          6            21    33    NaN
0          7            35    65    NaN
0          8            35    83    NaN
0          9            14    41    NaN
0         10            10    26    NaN
1          5            11    20    NaN
1          6            12    33    NaN
1          7            30    65    NaN
1          8            48    83    NaN
1          9            27    41    NaN
1         10            16    26    NaN

Notice how every column is shifted over by one. I'm confused about this in light of #9294: shouldn't it just die?

Putting the extra comma on the second row instead causes #9294.

I'm on Pandas '0.15.2' on Python3 on ArchLinux.


shoyer · 2015-01-19T09:27:34Z

It appears that if pandas encounters one more column in the first row than in the header, it assumes that the extra column should be used for the index. You can try the option index_col=False, which disables this behavior, though it may just cause things to choke (like in your other issue).
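A minimal sketch of that behavior, assuming a reasonably recent pandas and using a made-up three-column sample rather than the file from the report. Only the first data row ends with a trailing comma; the variable names `shifted` and `aligned` are purely illustrative.

```python
import io

import pandas as pd

# Made-up sample: the header has 3 columns, the first data row has 4 fields
# because of its trailing comma, and the second data row has 3 fields.
data = "a,b,c\n1,2,3,\n4,5,6\n"

# Default call: the extra field in the first data row leads pandas to treat
# the first column as the index, so every value moves one column to the left
# and the last column is filled with NaN.
shifted = pd.read_csv(io.StringIO(data))

# index_col=False turns the index inference off, so the header names line up
# with the first three fields of each row and the extra field is dropped.
aligned = pd.read_csv(io.StringIO(data), index_col=False)

print(shifted)
print(aligned)
```

Depending on the pandas version, the second call may also emit a ParserWarning about the header length not matching the data, since the extra field is discarded.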

jreback · 2015-01-19T12:21:52Z

The first data row is inspected to see whether its field count matches the number of header columns. It's not clear that a mismatch is 'bad', so that is the spec going forward.

In [12]: read_csv(StringIO(data),index_col=False)
Out[12]: 
    DecisionM  IntelligentM  freq  total
0           0             5     9     20
1           0             6    21     33
2           0             7    35     65
3           0             8    35     83
4           0             9    14     41
5           0            10    10     26
6           1             5    11     20
7           1             6    12     33
8           1             7    30     65
9           1             8    48     83
10          1             9    27     41
11          1            10    16     26
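For anyone who would rather have the load fail than risk shifted columns, one option is to validate the field counts before handing the file to pandas. The sketch below uses only the standard library `csv` module; `check_field_counts` is a hypothetical helper written for illustration, not part of pandas.

```python
import csv
import io


def check_field_counts(text):
    """Raise ValueError if any data row's field count differs from the header's.

    Hypothetical helper for illustration; returns the header width on success.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    for row in reader:
        if not row:
            continue  # ignore completely blank lines
        if len(row) != len(header):
            raise ValueError(
                f"line {reader.line_num}: expected {len(header)} fields, "
                f"saw {len(row)}"
            )
    return len(header)


# A trailing comma on the first data row adds an empty fourth field,
# so the check fails instead of letting the columns shift.
try:
    check_field_counts("a,b,c\n1,2,3,\n4,5,6\n")
except ValueError as err:
    print(err)
```

This check is stricter than read_csv itself: it also rejects rows with too few fields, which pandas would otherwise pad with NaN.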

jreback closed this as completed Jan 19, 2015

jreback added the IO CSV (read_csv, to_csv) and Usage Question labels Jan 19, 2015
