
BUG: read_csv with trailing comma, specified header names and usecols gives confusing error #29042

Closed
jorisvandenbossche opened this issue Oct 16, 2019 · 9 comments · Fixed by #38452
Labels
Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv
Milestone

Comments

@jorisvandenbossche
Member

Encountered this somewhat strange case: when a file is malformed (trailing commas on the data rows), we typically set the first column as the index. But if you also pass custom names and use usecols, you get a cryptic error message:

import io
import pandas as pd

s = """a, b, c, d 
1,2,3,4, 
5,6,7,8,""" 
>>> pd.read_csv(io.StringIO(s), header=0, names=['A', 'B', 'C', 'D'], usecols=[2,3]) 
...
ValueError: Passed header names mismatches usecols

I am not fully sure what the behaviour should be, but the current error message is not very helpful: the usecols argument here contains only integers, which are positional and do not need to match the header names.
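For readers unfamiliar with the implicit-index behaviour mentioned above, here is a minimal sketch that isolates it, without names or usecols. The variable names and the whitespace-free sample data are illustrative assumptions, not part of the original report.

```python
import io
import pandas as pd

# Same shape as the reported file: 4 header fields, but 5 fields per
# data row because the trailing comma adds an empty last field.
data = "a,b,c,d\n1,2,3,4,\n5,6,7,8,"

# With one more data field than header fields, read_csv treats the
# first field of each row as the index.
implicit = pd.read_csv(io.StringIO(data))

print(implicit.index.tolist())    # [1, 5]
print(implicit["a"].tolist())     # [2, 6]
print(implicit["d"].isna().all()) # the empty trailing field ends up under "d"
```

The shift is easy to miss: the values 1 and 5 are no longer under "a", which is why positional usecols can select something other than what the header row suggests.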

@jorisvandenbossche jorisvandenbossche added Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv labels Oct 16, 2019
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Oct 16, 2019
@Sankareswari-Ramamoorthy

Hi, the file has a header, so I am not sure whether header=0 and the names argument can coexist. I did not get the error when I either set header=None or removed the names argument. The error seems to come from combining header and names with usecols.

@jorisvandenbossche
Member Author

The file has a header, so not very sure if both header=0 and names attribute can coexist.

As far as I understand, they can co-exist. The goal here was to overwrite the existing names. The header keyword explanation mentions this use case:

Explicitly pass header=0 to be able to replace existing names

I just realized that I now understand the error message: if you specify both custom names and usecols, names must list only the columns that usecols selects:

In [8]: pd.read_csv(io.StringIO(s), header=0, names=['C', 'D'], usecols=[2,3]) 
Out[8]: 
   C  D
0  3  4
1  7  8

This works correctly.
However, if usecols contains names rather than integers, you still get the same confusing error message: pd.read_csv(io.StringIO(s), header=0, names=['A', 'B', 'C', 'D'], usecols=['C', 'D'])
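One way to apply this observation when you start from a full list of replacement names is to reduce that list to the selected positions before calling read_csv. This is only a sketch: `all_names`, `positions` and `selected_names` are hypothetical names, and the result is what the thread reports for pandas 0.25; other versions may behave differently.

```python
import io
import pandas as pd

data = "a,b,c,d\n1,2,3,4,\n5,6,7,8,"

all_names = ["A", "B", "C", "D"]  # hypothetical full list of new names
positions = [2, 3]                # positional usecols

# Keep only the names at the selected positions, so that
# len(names) == len(usecols).
selected_names = [all_names[i] for i in sorted(positions)]

df = pd.read_csv(
    io.StringIO(data), header=0, names=selected_names, usecols=positions
)
print(df)
```

Note that the positions here refer to the raw fields of each row, so no column is consumed as an implicit index in this call.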

@megz2020

megz2020 commented Oct 20, 2019

usecols: Return a subset of the columns.
For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0].
In your example, you should write usecols like this:
pd.read_csv(io.StringIO(s), header=0, names=['A', 'B', 'C', 'D'], usecols=[0,1,2,3])

Or you can write the code like this:
pd.read_csv(io.StringIO(s), header=0, names=['A', 'B', 'C', 'D'], usecols=None, index_col=False)

@jorisvandenbossche
Member Author

in your example you should write usecols like this

That works, but the purpose was to only select a subset of the columns.

@megz2020

megz2020 commented Oct 20, 2019

If you want to select a subset of the columns, you can try this:
pd.read_csv(io.StringIO(s), header=0, index_col=False, usecols=[0, 3])
or, as in your code:
pd.read_csv(io.StringIO(s), header=0, index_col=False, names=['A', 'B', 'C', 'D'], usecols=[0, 3])

@jorisvandenbossche
Member Author

jorisvandenbossche commented Oct 21, 2019

Ah, yes, that works (also specifying index_col=False seems to fix it), but that doesn't change the reported issue: the error message is completely confusing.
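The index_col=False workaround discussed above can be sketched as follows. The sample data is a whitespace-free variant of the original string, which is an assumption made to keep the output easy to check.

```python
import io
import pandas as pd

data = "a,b,c,d\n1,2,3,4,\n5,6,7,8,"

# index_col=False tells read_csv not to use the first column as the
# index, so the four names line up with the first four fields and the
# positional usecols pick A and D.
df = pd.read_csv(
    io.StringIO(data),
    header=0,
    index_col=False,
    names=["A", "B", "C", "D"],
    usecols=[0, 3],
)
print(df)
```

This avoids the error but does not address the reported problem, which is the wording of the message itself.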

@dmitriyshashkin

I've encountered a similar issue:

import io
import pandas as pd

s = """1,2,3
a,b,c"""

pd.read_csv(io.StringIO(s), names=['a','b'], usecols=['a'])

Expected behavior: pandas uses the first column as the index and returns a data frame with one column "a" and values "2, b"

Actual behavior: ValueError: Passed header names mismatches usecols

pandas 0.25.3
python 3.8

The error is quite confusing, especially because in my actual case I had a long list of columns, most of which were also listed in usecols. I spent some time looking for the mismatch mentioned in the error.
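When the column list is long, a quick check along the following lines can rule out a genuine mismatch between names and usecols before suspecting the implicit-index case. `unknown_usecols` is a hypothetical helper written for this illustration; it is not part of pandas.

```python
def unknown_usecols(names, usecols):
    """Return the string entries of usecols that do not appear in names."""
    known = set(names)
    return [col for col in usecols if isinstance(col, str) and col not in known]


# The example above: every usecols entry is present in names, so the
# "mismatch" in the error message is not a missing column name.
print(unknown_usecols(["a", "b"], ["a"]))       # []

# A genuine mismatch, for comparison.
print(unknown_usecols(["a", "b"], ["a", "z"]))  # ['z']
```

An empty result for the reported call suggests the error comes from the count of names versus data fields rather than from the names themselves.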

@phofl
Member

phofl commented Dec 13, 2020

Apart from a bad error message, I think the behavior displayed by @jorisvandenbossche is actually a bug.

pd.read_csv(io.StringIO(s), header=0, names=['A', 'B', 'C', 'D', 'E'], usecols=[2,3]) 

works

while leaving 'E' out does not.
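A sketch of the working variant, using whitespace-free sample data as an assumption: with five names, one per field in the data rows including the empty trailing field, no column is shifted into an implicit index.

```python
import io
import pandas as pd

data = "a,b,c,d\n1,2,3,4,\n5,6,7,8,"

# Five names match the five fields of each data row, so positions 2 and
# 3 map directly to "C" and "D".
df = pd.read_csv(
    io.StringIO(data),
    header=0,
    names=["A", "B", "C", "D", "E"],
    usecols=[2, 3],
)
print(df)
```

The four-name call is left out here on purpose: this issue was closed by #38452, so whether it still raises depends on the installed pandas version.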

6 participants