Skip to content

Strip whitespace from column names when usecols in read_csv #14480

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rahulporuri opened this issue Oct 24, 2016 · 7 comments
Closed

Strip whitespace from column names when usecols in read_csv #14480

rahulporuri opened this issue Oct 24, 2016 · 7 comments
Labels
IO CSV read_csv, to_csv

Comments

@rahulporuri
Copy link

A small, complete example of the issue

when loading a file of the type, where headers have a trailing whitespace,

a ,b ,c 
1,2,3
4,5,6

I would expect the following code to work and give the result

pandas.read_table('test_data.csv', sep=',', usecols=['a', 'b'])

Expected Output

   a  b
0  1  2
1  4  5

Actual Output

Neither the c nor the python engine produce the expected result.
the tracebacks have been concatenated for brevity.

/Users/rahulporuri/Github/pandas/pandas/io/parsers.pyc in __init__(self, src, **kwds)
   1431 
   1432             if len(self.names) < len(self.usecols):
-> 1433                 raise ValueError("Usecols do not match names.")
   1434 
   1435         self._set_noconvert_columns()

ValueError: Usecols do not match names.
/Users/rahulporuri/Github/pandas/pandas/io/parsers.pyc in _handle_usecols(self, columns, usecols_key)
   2199                 for u in self.usecols:
   2200                     if isinstance(u, string_types):
-> 2201                         col_indices.append(usecols_key.index(u))
   2202                     else:
   2203                         col_indices.append(u)

ValueError: 'a' is not in list

This is related to an issue reported earlier #14460 on stripping columns/column names of whitespaces.

On a side note, if the file has column names with leading whitespaces instead of trailing whitespaces, adding the skipinitialspace=True kwarg to pandas.read_table produces the expected result.

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: 794f792
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0+27.g794f792
nose: None
pip: 8.1.2
setuptools: 23.1.0
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.1
statsmodels: None
xarray: 0.7.2
IPython: 4.1.2
sphinx: 1.4.1
patsy: None
dateutil: 2.5.2
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jorisvandenbossche jorisvandenbossche added the IO CSV read_csv, to_csv label Oct 24, 2016
@jorisvandenbossche jorisvandenbossche changed the title Q : Strip whitespace from column names when usecols in pandas.read_table Strip whitespace from column names when usecols in read_csv Oct 24, 2016
@jorisvandenbossche
Copy link
Member

This just boils down to the issue you already reported. Pandas does not strip whitespace from the columns, so your actual column names are 'a ' and 'b ' (so including the space), and consequently your usecols=['a', 'b'] do not work as they don't match the column names (using usecols=['a ', 'b '] works as expected).

@rahulporuri
Copy link
Author

rahulporuri commented Oct 24, 2016

yes. Do we want pandas to strip whitespace from column names/columns? if not by default, maybe using another kwarg to pandas.read_table

@jreback
Copy link
Contributor

jreback commented Oct 24, 2016

you can already do #14234, or post-strip with .str.strip(). so not sure this is compelling.

@jorisvandenbossche
Copy link
Member

I think it is possibly useful, but that's already discussed in the other issue (#14460). So let's close this one.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.1, No action Oct 24, 2016
@rahulporuri
Copy link
Author

ohh. my bad. i didn't know that we could pass a callable to usecols. and agreed, we can strip post loading the dataframe and then drop unnecessary columns but that is a workaround IMO.

@jorisvandenbossche
Copy link
Member

@rahulporuri the callables are at the moment only a PR for enhancement, not in master or released version.

@maxima120
Copy link

pretty much all software i use including excel write column names in csv with spaces: "a, b, c, ...". Pandas should be able to read these correctly as intended names of columns are a b c. This is obvious behaviour and not a bug. Hence if pandas think its "b " "c " etc then that is a bug.

People often forget that bug is not only when something blows up in your face. Bug is also when something behaves differently from what is reasonable to expect. And this is exactly the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv
Projects
None yet
Development

No branches or pull requests

4 participants