Skip to content

BUG: Error when using usecols with names in read_csv #44147

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
nancy-aggarwal opened this issue Oct 22, 2021 · 7 comments · Fixed by #44440
Closed
3 tasks done

BUG: Error when using usecols with names in read_csv #44147

nancy-aggarwal opened this issue Oct 22, 2021 · 7 comments · Fixed by #44440
Assignees
Labels
Milestone

Comments

@nancy-aggarwal
Copy link

nancy-aggarwal commented Oct 22, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

# pd.show_versions()

df = pd.DataFrame({'X': [1, 2, 3],
 'V1': [34,  6, 5],
 'V2': [4, 7, 3],
 'V3': [ 5,  56, 2]})

df.to_csv('df1.csv',index=False)

## Now read.

df_1 = pd.read_csv('df1.csv',usecols = ['X','V1','V3'],header = 0)
print(df_1)
df_2 = pd.read_csv('df1.csv',usecols = [0,1,3],header = 0,names  = ['T','V1','V3'])
print(df_2)
df_3 = pd.read_csv('df1.csv',usecols = ['X','V1','V3'],header = 0,names  = ['T','y1','y3'])
print(df_3)

Issue Description

When importing a csv into a dataframe, I want to import certain columns and rename some of them. See example below.

import pandas as pd

# pd.show_versions()

df = pd.DataFrame({'X': [1, 2, 3],
 'V1': [34,  6, 5],
 'V2': [4, 7, 3],
 'V3': [ 5,  56, 2]})

df.to_csv('df1.csv',index=False)

## Now read.

df_1 = pd.read_csv('df1.csv',usecols = ['X','V1','V3'],header = 0)
print(df_1)
df_2 = pd.read_csv('df1.csv',usecols = [0,1,3],header = 0,names  = ['T','V1','V3'])
print(df_2)
df_3 = pd.read_csv('df1.csv',usecols = ['X','V1','V3'],header = 0,names  = ['T','y1','y3'])
print(df_3)

When doing this, I get the error message that "ValueError: Usecols do not match columns, columns expected but not found: ['X']"

The code works when I don't use the names option.
The code also works when I supply the column indices to usecols instead of column names in the csv file. This is prone to human error, so I'd like to avoid it.

Expected Behavior

Expect df_3 to be created with new column names.

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.9.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 1.3.4
numpy : 1.21.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3
setuptools : 57.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.28.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@nancy-aggarwal nancy-aggarwal added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 22, 2021
@phofl
Copy link
Member

phofl commented Oct 24, 2021

I think this is as expected, but the docs could be a bit clearer. If names is given, only the names are used, not the inferred column names from the file. Would you like to submit a PR to improve the documentation?

@nancy-aggarwal
Copy link
Author

Should this be the expected behavior though? Isn't that in contradiction with the second case?
Even if that was true, why is the error message referring to the names specified in usecols and existing in the the csv?

Usecols do not match columns, columns expected but not found: ['V3', 'V1', 'X']

I'd expect usecols to refer to column names in the csv file, and names to be applied after the importing was done.
Otherwise, really the function read_csv should throw an error saying that when names is provided usecols will not accept strings.

In fact, I tried the variations when using names and not specifying usecols and the results are pretty ridiculous. Try the following snippet and compare the numbers typed in the dataframe vs the ones returned by read_csv.

import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3],
 'V1': [34,  6, 5],
 'V2': [4, 7, 3],
 'V3': [ 5,  56, 2]})

df.to_csv('df1.csv',index=False)

## Now read.
df_4 = pd.read_csv('df1.csv'
,header = 0
,names  = ['X','V1','V3'])
print(df_4)

df_5 = pd.read_csv('df1.csv'
,header = 0
,names  = ['T','y1','y3'])
print(df_5)

@CloseChoice
Copy link
Member

Just to make @nancy-aggarwal examples clear:

>>> import pandas as pd
>>> 
>>> df = pd.DataFrame({'X': [1, 2, 3],
...  'V1': [34,  6, 5],
...  'V2': [4, 7, 3],
...  'V3': [ 5,  56, 2]})
>>> 
>>> df.to_csv('df1.csv',index=False)
>>> df_4 = pd.read_csv('df1.csv'
... ,header = 0
... ,names  = ['X','V1','V3'])
>>> print(df_4)
    X  V1  V3
1  34   4   5
2   6   7  56
3   5   3   2
>>> df_5 = pd.read_csv('df1.csv'
... ,header = 0
... ,names  = ['T','y1','y3'])
>>> print(df_5)
    T  y1  y3
1  34   4   5
2   6   7  56
3   5   3   2

what is happening here, is that the first column is interpreted as index and names overrides the column names, just as it is stated in the documentation: If the file contains a header row, then you should explicitly pass header=0 to override the column names. Maybe we could improve the docs a bit but I don't see a bug here.

@CloseChoice CloseChoice added Docs good first issue and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 27, 2021
@nancy-aggarwal
Copy link
Author

Header=0 is in the call and typically headers are about rows and indexes are about columns. I understand what it's doing, but I don't understand why that's the expected behavior. Sorry :-/ For example, can you explain what would be a typical use case for the behavior in df_4, and why that's the default?

And the behavior in df_4 and df_5 still don't explain (to me) why the line df_3 gives the error Usecols do not match columns, columns expected but not found: ['V3', 'V1', 'X'] despite those columns being in the csv. Also, can you provide an example of how you'd modify the function to get the correct df_3?
Is names somehow different from what I'm interpreting it to be (ie names that should be assigned to the columns after they are imported)?

@CloseChoice
Copy link
Member

When using .to_csv the default is to save the index. So it is only sensible that the default for .read_csv is to take the first column as index.
With the header=0 and names=[...] you can save an extra operation (it's another question if you consider it worth the confusion)

df = pd.DataFrame({'X': [1, 2, 3],
 'V1': [34,  6, 5],
 'V2': [4, 7, 3],
 'V3': [ 5,  56, 2]})

df.to_csv('df1.csv',index=False)

## Now read.
df_4 = pd.read_csv('df1.csv' , header = 0, names  = ['X','V1','V3'])

df_41 = pd.read_csv('df1.csv')
df_41.columns = ['X','V1','V3']  #same as df_4

@mroeschke mroeschke added the IO CSV read_csv, to_csv label Oct 30, 2021
@Milind220
Copy link

Hi, I'd love to take this issue as my first contribution if that's alright :)

I understand that it's a pretty small issue of minor ambiguity in the docs. I can just add a line in the docs to make this usage clear, and what's to be expected. (You can also suggest to me what should be added)

@Milind220
Copy link

Take

@jreback jreback added this to the 1.4 milestone Nov 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants