Strip whitespace from column names when usecols in read_csv #14480

rahulporuri · 2016-10-24T09:09:09Z

A small, complete example of the issue

when loading a file of the type, where headers have a trailing whitespace,

a ,b ,c 
1,2,3
4,5,6

I would expect the following code to work and give the result

pandas.read_table('test_data.csv', sep=',', usecols=['a', 'b'])

Expected Output

   a  b
0  1  2
1  4  5

Actual Output

Neither the c nor the python engine produce the expected result.
the tracebacks have been concatenated for brevity.

/Users/rahulporuri/Github/pandas/pandas/io/parsers.pyc in __init__(self, src, **kwds)
   1431 
   1432             if len(self.names) < len(self.usecols):
-> 1433                 raise ValueError("Usecols do not match names.")
   1434 
   1435         self._set_noconvert_columns()

ValueError: Usecols do not match names.

/Users/rahulporuri/Github/pandas/pandas/io/parsers.pyc in _handle_usecols(self, columns, usecols_key)
   2199                 for u in self.usecols:
   2200                     if isinstance(u, string_types):
-> 2201                         col_indices.append(usecols_key.index(u))
   2202                     else:
   2203                         col_indices.append(u)

ValueError: 'a' is not in list

This is related to an issue reported earlier #14460 on stripping columns/column names of whitespaces.

On a side note, if the file has column names with leading whitespaces instead of trailing whitespaces, adding the skipinitialspace=True kwarg to pandas.read_table produces the expected result.

Output of `pd.show_versions()`

## INSTALLED VERSIONS

commit: 794f792
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0+27.g794f792
nose: None
pip: 8.1.2
setuptools: 23.1.0
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.1
statsmodels: None
xarray: 0.7.2
IPython: 4.1.2
sphinx: 1.4.1
patsy: None
dateutil: 2.5.2
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2016-10-24T09:33:35Z

This just boils down to the issue you already reported. Pandas does not strip whitespace from the columns, so your actual column names are 'a ' and 'b ' (so including the space), and consequently your usecols=['a', 'b'] do not work as they don't match the column names (using usecols=['a ', 'b '] works as expected).

rahulporuri · 2016-10-24T09:35:43Z

yes. Do we want pandas to strip whitespace from column names/columns? if not by default, maybe using another kwarg to pandas.read_table

jreback · 2016-10-24T10:04:26Z

you can already do #14234, or post-strip with .str.strip(). so not sure this is compelling.

jorisvandenbossche · 2016-10-24T10:15:26Z

I think it is possibly useful, but that's already discussed in the other issue (#14460). So let's close this one.

rahulporuri · 2016-10-24T10:31:59Z

ohh. my bad. i didn't know that we could pass a callable to usecols. and agreed, we can strip post loading the dataframe and then drop unnecessary columns but that is a workaround IMO.

jorisvandenbossche · 2016-10-24T10:39:00Z

@rahulporuri the callables are at the moment only a PR for enhancement, not in master or released version.

maxima120 · 2022-09-03T12:46:04Z

pretty much all software i use including excel write column names in csv with spaces: "a, b, c, ...". Pandas should be able to read these correctly as intended names of columns are a b c. This is obvious behaviour and not a bug. Hence if pandas think its "b " "c " etc then that is a bug.

People often forget that bug is not only when something blows up in your face. Bug is also when something behaves differently from what is reasonable to expect. And this is exactly the case.

jorisvandenbossche added the IO CSV read_csv, to_csv label Oct 24, 2016

jorisvandenbossche changed the title ~~Q : Strip whitespace from column names when usecols in pandas.read_table~~ Strip whitespace from column names when usecols in read_csv Oct 24, 2016

jorisvandenbossche closed this as completed Oct 24, 2016

jorisvandenbossche modified the milestones: 0.19.1, No action Oct 24, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strip whitespace from column names when usecols in read_csv #14480

Strip whitespace from column names when usecols in read_csv #14480

rahulporuri commented Oct 24, 2016

jorisvandenbossche commented Oct 24, 2016

rahulporuri commented Oct 24, 2016 •

edited

Loading

jreback commented Oct 24, 2016

jorisvandenbossche commented Oct 24, 2016

rahulporuri commented Oct 24, 2016

jorisvandenbossche commented Oct 24, 2016

maxima120 commented Sep 3, 2022

Strip whitespace from column names when usecols in read_csv #14480

Strip whitespace from column names when usecols in read_csv #14480

Comments

rahulporuri commented Oct 24, 2016

A small, complete example of the issue

Expected Output

Actual Output

Output of pd.show_versions()

jorisvandenbossche commented Oct 24, 2016

rahulporuri commented Oct 24, 2016 • edited Loading

jreback commented Oct 24, 2016

jorisvandenbossche commented Oct 24, 2016

rahulporuri commented Oct 24, 2016

jorisvandenbossche commented Oct 24, 2016

maxima120 commented Sep 3, 2022

Output of `pd.show_versions()`

rahulporuri commented Oct 24, 2016 •

edited

Loading