BUG: read_csv with trailing comma, specified header names and usecols gives confusing error #29042

jorisvandenbossche · 2019-10-16T20:33:37Z

Encountered this somewhat strange case: if you have a malformed file (trailing commas), we typically set the first column as the index. But if you also pass custom names and use usecols, you get a cryptic error message:

import io
import pandas as pd

s = """a, b, c, d 
1,2,3,4, 
5,6,7,8,"""

>>> pd.read_csv(io.StringIO(s), header=0, names=['A', 'B', 'C', 'D'], usecols=[2,3]) 
...
ValueError: Passed header names mismatches usecols

I am not fully sure what the behaviour should be, but the current error message is not very helpful (since the usecols argument contains only integers, they are positional and don't need to match the header names).
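To make the "first column becomes the index" behaviour concrete, here is a minimal sketch. It assumes a cleaned-up copy of the sample data (no stray spaces around the fields), so it is an illustration rather than the exact string from the report:

```python
import io

import pandas as pd

# Assumed variant of the reported data: 4 header fields, 5 fields per data row
# because of the trailing comma.
s = "a,b,c,d\n1,2,3,4,\n5,6,7,8,"

df = pd.read_csv(io.StringIO(s))

# The data rows are one field wider than the header, so the first field of
# each row (1 and 5) is used as the index and the last column is empty.
print(df)
print(df.index.tolist())
```

Here the values 1 and 5 end up in the index, and column "d" holds only missing values.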


Sankareswari-Ramamoorthy · 2019-10-19T04:14:53Z

Hi, the file has a header, so I am not very sure if both header=0 and the names attribute can coexist. I did not get the error when I either changed header to None or removed the names attribute. Since it has to do with the header and names arguments combined with usecols, it threw this error.

jorisvandenbossche · 2019-10-20T12:20:23Z

The file has a header, so not very sure if both header=0 and names attribute can coexist.

As far as I understand, they can co-exist. The goal here was to overwrite the existing names. The header keyword explanation mentions this use case:

Explicitly pass header=0 to be able to replace existing names

I just realized that I now understand the error message: if you specify both custom names and usecols, the names must already be the names of the columns selected by usecols:

In [8]: pd.read_csv(io.StringIO(s), header=0, names=['C', 'D'], usecols=[2,3]) 
Out[8]: 
   C  D
0  3  4
1  7  8

works correctly.
However, if you do not use integers for usecols, you still get the same confusing error message:

pd.read_csv(io.StringIO(s), header=0, names=['A', 'B', 'C', 'D'], usecols=['C', 'D'])
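For comparison, a sketch of the two ways of combining names and usecols on a well-formed file. The string `clean` is an assumed variant of the sample data without trailing commas, so this shows the ordinary case, not the malformed one reported here:

```python
import io

import pandas as pd

# Assumed well-formed variant of the data (no trailing commas).
clean = "a,b,c,d\n1,2,3,4\n5,6,7,8"

# names lists only the selected columns; usecols picks them by position.
by_selected = pd.read_csv(
    io.StringIO(clean), header=0, names=["C", "D"], usecols=[2, 3]
)

# names lists every column; usecols picks a subset by the new names.
by_full = pd.read_csv(
    io.StringIO(clean), header=0, names=["A", "B", "C", "D"], usecols=["C", "D"]
)

print(by_selected)
print(by_full)
```

Both calls should give the same two columns, "C" and "D". The second form is the one that fails once the trailing commas are present.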

megz2020 · 2019-10-20T12:22:40Z

usecols: Return a subset of the columns.
For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0].
In your example you should write usecols like this:
pd.read_csv(io.StringIO(s), header=0, names=['A', 'B', 'C', 'D'], usecols=[0,1,2,3])

Or you can write the code like this:
pd.read_csv(io.StringIO(s), header=0, names=['A', 'B', 'C', 'D'], usecols=None, index_col=False)

jorisvandenbossche · 2019-10-20T12:30:13Z

in your example you should write usecols like this

That works, but the purpose was to only select a subset of the columns.

megz2020 · 2019-10-20T13:01:39Z

If you want to select a subset of the columns, you can try this:
pd.read_csv(io.StringIO(s), header=0, index_col= False, usecols=[0, 3])
or, as in your code:
pd.read_csv(io.StringIO(s), header=0, index_col= False, names=['A', 'B', 'C', 'D'], usecols=[0, 3])

jorisvandenbossche · 2019-10-21T06:50:28Z

Ah, yes, that works (also specifying index_col=False seems to fix it), but that doesn't change the reported issue: the error message is completely confusing.
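The index_col=False workaround can be sketched as follows, again assuming a cleaned-up copy of the sample data. The exact wording of the error in the first call depends on the pandas version, so the sketch only prints whatever message is raised:

```python
import io

import pandas as pd

# Assumed variant of the reported data, with trailing commas on the data rows.
s = "a,b,c,d\n1,2,3,4,\n5,6,7,8,"
kwargs = dict(header=0, names=["A", "B", "C", "D"], usecols=[0, 3])

# Without index_col=False the thread reports a ValueError.
try:
    pd.read_csv(io.StringIO(s), **kwargs)
    error = None
except ValueError as exc:
    error = str(exc)  # message text differs between pandas versions
print("without index_col=False:", error)

# With index_col=False the first field is not treated as an implicit index.
df = pd.read_csv(io.StringIO(s), index_col=False, **kwargs)
print(df)
```

With index_col=False the selected columns are "A" and "D", read from the first and fourth fields of each row.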

dmitriyshashkin · 2019-12-07T09:19:17Z

I've encountered a similar issue:

import io
import pandas as pd

s = """1,2,3
a,b,c"""

pd.read_csv(io.StringIO(s), names=['a','b'], usecols=['a'])

Expected behavior: pandas uses the first column as the index and returns a DataFrame with one column "a" and values "2, b"

Actual behavior: ValueError: Passed header names mismatches usecols

pandas 0.25.3
python 3.8

The error is quite confusing, especially since in my actual situation I had a long list of columns and most of them were also mentioned in usecols. I spent some time looking for the mismatch mentioned in the error.
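One way to look for such a mismatch is to compare the number of names with the number of fields per row, since in the cases reported in this thread it is that count, rather than the contents of usecols, that seems to differ. A minimal stdlib-only sketch; field_report is a hypothetical helper, not a pandas function:

```python
import csv
import io


def field_report(text, names):
    """Hypothetical helper: compare len(names) with the row widths in text."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    widths = sorted({len(row) for row in rows})
    return {"n_names": len(names), "row_widths": widths}


s = """1,2,3
a,b,c"""

report = field_report(s, names=["a", "b"])
print(report)  # {'n_names': 2, 'row_widths': [3]}
```

Here two names are passed for rows that are three fields wide, which is the situation where pandas would otherwise treat the extra leading field as the index.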

dmitriyshashkin · 2019-12-07T09:33:34Z

The same problem discussed on SO: https://stackoverflow.com/questions/31017823/pandas-returns-passed-header-names-mismatches-usecols-error

phofl · 2020-12-13T01:16:20Z

Apart from a bad error message, I think the behavior displayed by @jorisvandenbossche is actually a bug.

pd.read_csv(io.StringIO(s), header=0, names=['A', 'B', 'C', 'D', 'E'], usecols=[2,3])

works

while leaving 'E' out does not.
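The observation above suggests a possible workaround: pad names with placeholders until it is as wide as the data rows. This is only a sketch under that assumption; pad_names and the "_extra" prefix are hypothetical, and the data string is a cleaned-up copy of the sample from the report:

```python
import csv
import io

import pandas as pd


def pad_names(text, names):
    """Hypothetical helper: extend names with placeholders up to the widest row."""
    width = max(len(row) for row in csv.reader(io.StringIO(text)))
    extra = [f"_extra{i}" for i in range(width - len(names))]
    return list(names) + extra


# Assumed variant of the reported data, with trailing commas on the data rows.
s = "a,b,c,d\n1,2,3,4,\n5,6,7,8,"

names = pad_names(s, ["A", "B", "C", "D"])  # one placeholder is appended
df = pd.read_csv(io.StringIO(s), header=0, names=names, usecols=[2, 3])
print(df)
```

The placeholder plays the same role as 'E' in the call above: it gives the empty trailing field a name, so the number of names matches the number of fields.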

jorisvandenbossche added the Error Reporting (Incorrect or improved errors from pandas) and IO CSV (read_csv, to_csv) labels Oct 16, 2019

jorisvandenbossche added this to the Contributions Welcome milestone Oct 16, 2019

phofl mentioned this issue Dec 13, 2020

ENH: Improve error message when names and usecols do not match for read_csv with engine=c #38452

Merged

5 tasks

jreback closed this as completed in #38452 Dec 13, 2020

jreback modified the milestones: Contributions Welcome, 1.3 Dec 13, 2020
