ENH: allow usecols to be case insensitive #14154

aechase · 2016-09-05T15:24:54Z

I have a wide delimited file that includes columns with predictable names but unpredictable capitalisation, as shown in this toy code.

import pandas as pd
from io import StringIO

data = """1,1,1,1
2,2,2,2
3,3,3,3
4,4,4,4
5,5,5,5"""
names = ['col1', 'col2', 'COL3', 'Col4']
df = pd.read_csv(StringIO(data), names=names, usecols=None)
df.head()

	col1	col2	COL3	Col4
0	1	1	1	1
1	2	2	2	2
2	3	3	3	3
3	4	4	4	4
4	5	5	5	5

I want to pass a list of column names to usecols when importing the file, but I won't be able to unless I match the names exactly:

df = pd.read_csv(StringIO(data), names=names, usecols=['col3', 'col4'])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-92d038c5ac8c> in <module>()
      1 cols = ['col3', 'col4']
----> 2 df1 = pd.read_csv('test.csv', usecols=['col3', 'col4'])
      3 df1.head()

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    560                     skip_blank_lines=skip_blank_lines)
    561 
--> 562         return _read(filepath_or_buffer, kwds)
    563 
    564     parser_f.__name__ = name

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    313 
    314     # Create the parser.
--> 315     parser = TextFileReader(filepath_or_buffer, **kwds)
    316 
    317     if (nrows is not None) and (chunksize is not None):

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    643             self.options['has_index_names'] = kwds['has_index_names']
    644 
--> 645         self._make_engine(self.engine)
    646 
    647     def close(self):

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    797     def _make_engine(self, engine='c'):
    798         if engine == 'c':
--> 799             self._engine = CParserWrapper(self.f, **self.options)
    800         else:
    801             if engine == 'python':

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1255 
   1256             if len(self.names) < len(self.usecols):
-> 1257                 raise ValueError("Usecols do not match names.")
   1258 
   1259         self._set_noconvert_columns()

ValueError: Usecols do not match names.

I see two problems here. First, the error text would be more helpful if it specified what column name(s) is/are wrong. That's pretty easy to fix. Second, if I know that capitalisation might be a problem, I have to check every column name individually before I can pass a list to usecols. One possible solution would be to modify this part and this part of parsers.py to case-transform the lists of column names before matching. This behaviour could be made the default, or it could be optional via an ignore_col_case keyword in the various read functions.

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.6.None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2016-09-05T15:29:16Z

a better soln would be to allow usecols to take a callable. Then you can validate however you want, e.g.

df = pd.read_csv(StringIO(data), names=names, usecols=lambda x: x.lower() in ['col3', 'col4'])

jreback · 2016-09-05T15:29:22Z

cc @gfyoung

jreback · 2016-09-05T15:30:00Z

yes a better error message would be nice as well.

aechase · 2016-09-05T15:38:23Z

@jreback should passing a callable to usecols implicitly rename the columns as well, or should it preserve the original formatting?

jreback · 2016-09-05T15:41:55Z

no, it takes a single column and returns a boolean. These should match the names,

In [1]: def usecols(x):
   ...:     return x.lower() in ['col3', 'col4']
   ...: 

In [2]: [usecols(x) for x in ['col1', 'col2', 'Col3', 'col4']]
Out[2]: [False, False, True, True]

jreback · 2016-09-05T15:42:56Z

this would also easily allow #10882

In [3]: def skipcols(x):
   ...:     return x.lower() not in ['col3', 'col4']

In [4]: [skipcols(x) for x in ['col1', 'col2', 'Col3', 'col4']]
Out[4]: [True, True, False, False]

though skipcols could be a separate kw.

gfyoung · 2016-09-06T02:12:20Z

This indeed would be a nice enhancement! However, perhaps a better name (that would encompass everything) would be colfilter?

aechase · 2016-09-06T13:00:21Z

@jreback I'll need a few days to make all these changes. In the meantime, are you interested in a PR that just updates the error message?

jreback · 2016-09-06T13:18:03Z

sure a separate PR for the error message would be be great

jreback added Enhancement Difficulty Novice Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv labels Sep 5, 2016

jreback added this to the Next Major Release milestone Sep 5, 2016

aechase mentioned this issue Sep 6, 2016

ERR: improves message raised for bad names in usecols (GH14154) #14163

Closed

4 tasks

WillAyd mentioned this issue Sep 16, 2016

ENH: Allow usecols to accept callable (GH14154) #14234

Merged

jreback modified the milestones: 0.20.0, Next Major Release Dec 4, 2016

jorisvandenbossche closed this as completed in #14234 Dec 6, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: allow usecols to be case insensitive #14154

ENH: allow usecols to be case insensitive #14154

aechase commented Sep 5, 2016

INSTALLED VERSIONS

jreback commented Sep 5, 2016

jreback commented Sep 5, 2016

jreback commented Sep 5, 2016

aechase commented Sep 5, 2016

jreback commented Sep 5, 2016

jreback commented Sep 5, 2016 •

edited

Loading

gfyoung commented Sep 6, 2016

aechase commented Sep 6, 2016

jreback commented Sep 6, 2016 •

edited

Loading

ENH: allow usecols to be case insensitive #14154

ENH: allow usecols to be case insensitive #14154

Comments

aechase commented Sep 5, 2016

INSTALLED VERSIONS

jreback commented Sep 5, 2016

jreback commented Sep 5, 2016

jreback commented Sep 5, 2016

aechase commented Sep 5, 2016

jreback commented Sep 5, 2016

jreback commented Sep 5, 2016 • edited Loading

gfyoung commented Sep 6, 2016

aechase commented Sep 6, 2016

jreback commented Sep 6, 2016 • edited Loading

jreback commented Sep 5, 2016 •

edited

Loading

jreback commented Sep 6, 2016 •

edited

Loading