BUG: Error when using usecols with names in read_csv #44147

nancy-aggarwal · 2021-10-22T16:03:41Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

# pd.show_versions()

df = pd.DataFrame({'X': [1, 2, 3],
 'V1': [34,  6, 5],
 'V2': [4, 7, 3],
 'V3': [ 5,  56, 2]})

df.to_csv('df1.csv',index=False)

## Now read.

df_1 = pd.read_csv('df1.csv',usecols = ['X','V1','V3'],header = 0)
print(df_1)
df_2 = pd.read_csv('df1.csv',usecols = [0,1,3],header = 0,names  = ['T','V1','V3'])
print(df_2)
df_3 = pd.read_csv('df1.csv',usecols = ['X','V1','V3'],header = 0,names  = ['T','y1','y3'])
print(df_3)

Issue Description

When importing a csv into a dataframe, I want to import certain columns and rename some of them. See example below.

import pandas as pd

# pd.show_versions()

df = pd.DataFrame({'X': [1, 2, 3],
 'V1': [34,  6, 5],
 'V2': [4, 7, 3],
 'V3': [ 5,  56, 2]})

df.to_csv('df1.csv',index=False)

## Now read.

df_1 = pd.read_csv('df1.csv',usecols = ['X','V1','V3'],header = 0)
print(df_1)
df_2 = pd.read_csv('df1.csv',usecols = [0,1,3],header = 0,names  = ['T','V1','V3'])
print(df_2)
df_3 = pd.read_csv('df1.csv',usecols = ['X','V1','V3'],header = 0,names  = ['T','y1','y3'])
print(df_3)

When doing this, I get the error message that "ValueError: Usecols do not match columns, columns expected but not found: ['X']"

The code works when I don't use the names option.
The code also works when I supply the column indices to usecols instead of column names in the csv file. This is prone to human error, so I'd like to avoid it.

Expected Behavior

Expect df_3 to be created with new column names.

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.9.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 1.3.4
numpy : 1.21.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3
setuptools : 57.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.28.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

The text was updated successfully, but these errors were encountered:

phofl · 2021-10-24T20:09:58Z

I think this is as expected, but the docs could be a bit clearer. If names is given, only the names are used, not the inferred column names from the file. Would you like to submit a PR to improve the documentation?

nancy-aggarwal · 2021-10-24T20:35:18Z

Should this be the expected behavior though? Isn't that in contradiction with the second case?
Even if that was true, why is the error message referring to the names specified in usecols and existing in the the csv?

Usecols do not match columns, columns expected but not found: ['V3', 'V1', 'X']

I'd expect usecols to refer to column names in the csv file, and names to be applied after the importing was done.
Otherwise, really the function read_csv should throw an error saying that when names is provided usecols will not accept strings.

In fact, I tried the variations when using names and not specifying usecols and the results are pretty ridiculous. Try the following snippet and compare the numbers typed in the dataframe vs the ones returned by read_csv.

import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3],
 'V1': [34,  6, 5],
 'V2': [4, 7, 3],
 'V3': [ 5,  56, 2]})

df.to_csv('df1.csv',index=False)

## Now read.
df_4 = pd.read_csv('df1.csv'
,header = 0
,names  = ['X','V1','V3'])
print(df_4)

df_5 = pd.read_csv('df1.csv'
,header = 0
,names  = ['T','y1','y3'])
print(df_5)

CloseChoice · 2021-10-27T20:11:10Z

Just to make @nancy-aggarwal examples clear:

>>> import pandas as pd
>>> 
>>> df = pd.DataFrame({'X': [1, 2, 3],
...  'V1': [34,  6, 5],
...  'V2': [4, 7, 3],
...  'V3': [ 5,  56, 2]})
>>> 
>>> df.to_csv('df1.csv',index=False)
>>> df_4 = pd.read_csv('df1.csv'
... ,header = 0
... ,names  = ['X','V1','V3'])
>>> print(df_4)
    X  V1  V3
1  34   4   5
2   6   7  56
3   5   3   2
>>> df_5 = pd.read_csv('df1.csv'
... ,header = 0
... ,names  = ['T','y1','y3'])
>>> print(df_5)
    T  y1  y3
1  34   4   5
2   6   7  56
3   5   3   2

what is happening here, is that the first column is interpreted as index and names overrides the column names, just as it is stated in the documentation: If the file contains a header row, then you should explicitly pass header=0 to override the column names. Maybe we could improve the docs a bit but I don't see a bug here.

nancy-aggarwal · 2021-10-27T20:46:01Z

Header=0 is in the call and typically headers are about rows and indexes are about columns. I understand what it's doing, but I don't understand why that's the expected behavior. Sorry :-/ For example, can you explain what would be a typical use case for the behavior in df_4, and why that's the default?

And the behavior in df_4 and df_5 still don't explain (to me) why the line df_3 gives the error Usecols do not match columns, columns expected but not found: ['V3', 'V1', 'X'] despite those columns being in the csv. Also, can you provide an example of how you'd modify the function to get the correct df_3?
Is names somehow different from what I'm interpreting it to be (ie names that should be assigned to the columns after they are imported)?

CloseChoice · 2021-10-28T08:02:27Z

When using .to_csv the default is to save the index. So it is only sensible that the default for .read_csv is to take the first column as index.
With the header=0 and names=[...] you can save an extra operation (it's another question if you consider it worth the confusion)

df = pd.DataFrame({'X': [1, 2, 3],
 'V1': [34,  6, 5],
 'V2': [4, 7, 3],
 'V3': [ 5,  56, 2]})

df.to_csv('df1.csv',index=False)

## Now read.
df_4 = pd.read_csv('df1.csv' , header = 0, names  = ['X','V1','V3'])

df_41 = pd.read_csv('df1.csv')
df_41.columns = ['X','V1','V3']  #same as df_4

Milind220 · 2021-11-06T18:55:40Z

Hi, I'd love to take this issue as my first contribution if that's alright :)

I understand that it's a pretty small issue of minor ambiguity in the docs. I can just add a line in the docs to make this usage clear, and what's to be expected. (You can also suggest to me what should be added)

Milind220 · 2021-11-06T18:55:45Z

Take

nancy-aggarwal added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 22, 2021

CloseChoice added Docs good first issue and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 27, 2021

mroeschke added the IO CSV read_csv, to_csv label Oct 30, 2021

github-actions bot assigned Milind220 Nov 6, 2021

phofl mentioned this issue Nov 13, 2021

DOC: Some minor doc cleanups #44440

Merged

2 tasks

jreback added this to the 1.4 milestone Nov 14, 2021

jreback closed this as completed in #44440 Nov 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Error when using usecols with names in read_csv #44147

BUG: Error when using usecols with names in read_csv #44147

nancy-aggarwal commented Oct 22, 2021 •

edited

Loading

INSTALLED VERSIONS

phofl commented Oct 24, 2021

nancy-aggarwal commented Oct 24, 2021

CloseChoice commented Oct 27, 2021

nancy-aggarwal commented Oct 27, 2021

CloseChoice commented Oct 28, 2021

Milind220 commented Nov 6, 2021

Milind220 commented Nov 6, 2021

BUG: Error when using usecols with names in read_csv #44147

BUG: Error when using usecols with names in read_csv #44147

Comments

nancy-aggarwal commented Oct 22, 2021 • edited Loading

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

phofl commented Oct 24, 2021

nancy-aggarwal commented Oct 24, 2021

CloseChoice commented Oct 27, 2021

nancy-aggarwal commented Oct 27, 2021

CloseChoice commented Oct 28, 2021

Milind220 commented Nov 6, 2021

Milind220 commented Nov 6, 2021

nancy-aggarwal commented Oct 22, 2021 •

edited

Loading