io.read_csv: ValueError due to long lines #12203
Comments
Here's a longer but self-contained example, based on pandas read_csv, that filters columns with usecols.
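The example itself was not preserved in this thread. A minimal sketch of the kind of usecols call under discussion (sample data assumed, not from the original post) might look like:

```python
import io
import pandas as pd

# Assumed sample data: a header-less CSV where every row has three fields.
data = "1,2,3\n4,5,6\n7,8,9\n"

# usecols with integer positions keeps only the requested columns.
df = pd.read_csv(io.StringIO(data), header=None, usecols=[0, 1])
print(df.shape)  # (3, 2)
```

This works as expected when every line has the same number of fields; the thread is about what happens when they do not.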
So why does the version where the columns are selected based on …
Looks like a dupe of this: #9755. Please confirm.
I cannot confirm. The reason: the examples in #9755 all rely on column names. In my case, the data has no column names, so I would like to be able to specify the columns to use without relying on names.
OK, I'll mark it, though I suspect it's a very similar solution. Pull requests welcome.
I noted that …
The whole problem is that the extra fields break parsing before any column selection happens. E.g. when reading such a file, usecols=[0, 1] would allow getting a DataFrame with columns 1 and 2. What I seek, as do others on SO, is to limit the number of columns that are parsed in the first place. In this case, usecols does not apply and does not work. But real-world data, often from data loggers, produces such artefacts on a random basis...
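A common workaround, not given in this thread but sketched here under the assumption that an upper bound on the field count is known, is to pass enough names to cover the longest possible line. The C engine then tokenizes every row, and the unwanted trailing columns can be dropped afterwards:

```python
import io
import pandas as pd

# Ragged CSV: the last line has more fields than the others.
data = "1,2\n3,4\n5,6,7,8\n"

# Supplying names for the maximum expected field count (4 here) lets
# the C engine parse every row; short rows are padded with NaN.
df = pd.read_csv(io.StringIO(data), header=None, names=range(4))

# Keep only the first two columns, mimicking usecols=[0, 1].
trimmed = df[[0, 1]]
print(trimmed.shape)  # (3, 2)
```

The same idea also works with usecols directly, e.g. `names=range(4), usecols=[0, 1]`, as long as the names cover the longest line.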
I am trying to read a CSV file where some lines are longer than the rest. The pandas C engine throws an error on these lines, but I do not want to skip them, as discussed in a comment on a similar issue. I would prefer to "cut off" the "bad columns" using usecols, but I get the following error:

Throws:
ValueError: Expected 10 fields in line 100, saw 20
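For completeness, the skip-based behaviour the reporter wants to avoid can be requested explicitly. This is a sketch for recent pandas versions, where `on_bad_lines` (pandas >= 1.3) replaced the older `error_bad_lines` flag:

```python
import io
import pandas as pd

# Ragged CSV: the last line has one field too many.
data = "1,2\n3,4\n5,6,7\n"

# on_bad_lines="skip" silently drops rows with too many fields
# (older pandas versions used error_bad_lines=False instead).
df = pd.read_csv(io.StringIO(data), header=None, on_bad_lines="skip")
print(len(df))  # 2
```

This discards the long rows entirely, which is exactly what the reporter says is unacceptable here; it is shown only to contrast with the usecols approach.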
Why is the ValueError raised although I explicitly defined the columns to use?

References: