Skip to content

ENH: allow usecols to be case insensitive #14154

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
aechase opened this issue Sep 5, 2016 · 9 comments · Fixed by #14234
Closed

ENH: allow usecols to be case insensitive #14154

aechase opened this issue Sep 5, 2016 · 9 comments · Fixed by #14234
Labels
Enhancement Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv
Milestone

Comments

@aechase
Copy link
Contributor

aechase commented Sep 5, 2016

I have a wide delimited file that includes columns with predictable names but unpredictable capitalisation, as shown in this toy code.

import pandas as pd
from io import StringIO

data = """1,1,1,1
2,2,2,2
3,3,3,3
4,4,4,4
5,5,5,5"""
names = ['col1', 'col2', 'COL3', 'Col4']
df = pd.read_csv(StringIO(data), names=names, usecols=None)
df.head()
col1 col2 COL3 Col4
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 4 4 4 4
4 5 5 5 5

I want to pass a list of column names to usecols when importing the file, but I won't be able to unless I match the names exactly:

df = pd.read_csv(StringIO(data), names=names, usecols=['col3', 'col4'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-92d038c5ac8c> in <module>()
      1 cols = ['col3', 'col4']
----> 2 df1 = pd.read_csv('test.csv', usecols=['col3', 'col4'])
      3 df1.head()

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    560                     skip_blank_lines=skip_blank_lines)
    561 
--> 562         return _read(filepath_or_buffer, kwds)
    563 
    564     parser_f.__name__ = name

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    313 
    314     # Create the parser.
--> 315     parser = TextFileReader(filepath_or_buffer, **kwds)
    316 
    317     if (nrows is not None) and (chunksize is not None):

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    643             self.options['has_index_names'] = kwds['has_index_names']
    644 
--> 645         self._make_engine(self.engine)
    646 
    647     def close(self):

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    797     def _make_engine(self, engine='c'):
    798         if engine == 'c':
--> 799             self._engine = CParserWrapper(self.f, **self.options)
    800         else:
    801             if engine == 'python':

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1255 
   1256             if len(self.names) < len(self.usecols):
-> 1257                 raise ValueError("Usecols do not match names.")
   1258 
   1259         self._set_noconvert_columns()

ValueError: Usecols do not match names.

I see two problems here. First, the error text would be more helpful if it specified what column name(s) is/are wrong. That's pretty easy to fix. Second, if I know that capitalisation might be a problem, I have to check every column name individually before I can pass a list to usecols. One possible solution would be to modify this part and this part of parsers.py to case-transform the lists of column names before matching. This behaviour could be made the default, or it could be optional via an ignore_col_case keyword in the various read functions.


INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.6.None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Sep 5, 2016

a better soln would be to allow usecols to take a callable. Then you can validate however you want, e.g.

df = pd.read_csv(StringIO(data), names=names, usecols=lambda x: x.lower() in ['col3', 'col4'])

@jreback
Copy link
Contributor

jreback commented Sep 5, 2016

cc @gfyoung

@jreback jreback added Enhancement Difficulty Novice Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv labels Sep 5, 2016
@jreback jreback added this to the Next Major Release milestone Sep 5, 2016
@jreback
Copy link
Contributor

jreback commented Sep 5, 2016

yes a better error message would be nice as well.

@aechase
Copy link
Contributor Author

aechase commented Sep 5, 2016

@jreback should passing a callable to usecols implicitly rename the columns as well, or should it preserve the original formatting?

@jreback
Copy link
Contributor

jreback commented Sep 5, 2016

no, it takes a single column and returns a boolean. These should match the names,

In [1]: def usecols(x):
   ...:     return x.lower() in ['col3', 'col4']
   ...: 

In [2]: [usecols(x) for x in ['col1', 'col2', 'Col3', 'col4']]
Out[2]: [False, False, True, True]

@jreback
Copy link
Contributor

jreback commented Sep 5, 2016

this would also easily allow #10882

In [3]: def skipcols(x):
   ...:     return x.lower() not in ['col3', 'col4']

In [4]: [skipcols(x) for x in ['col1', 'col2', 'Col3', 'col4']]
Out[4]: [True, True, False, False]

though skipcols could be a separate kw.

@gfyoung
Copy link
Member

gfyoung commented Sep 6, 2016

This indeed would be a nice enhancement! However, perhaps a better name (that would encompass everything) would be colfilter?

@aechase
Copy link
Contributor Author

aechase commented Sep 6, 2016

@jreback I'll need a few days to make all these changes. In the meantime, are you interested in a PR that just updates the error message?

@jreback
Copy link
Contributor

jreback commented Sep 6, 2016

sure a separate PR for the error message would be be great

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants