Commit a747961

BUG: read_excel returns empty dataframe when using usecols; restored capability of passing column labels for columns to be read

- [x] closes #18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

This commit reimplements 'usecols' for the read_excel function so that it accepts a list of column labels, a list of ints, or a callable. The 'usecols' argument as used in pandas 0.22 is renamed to 'usecols_excel' and enables the feature of receiving column indexes as a list.
1 parent 8ddc0fd commit a747961

4 files changed: +182 -36 lines changed

doc/source/whatsnew/v0.23.0.txt

+2
@@ -856,6 +856,7 @@ Other API Changes
 - Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (:issue:`19714`, :issue:`20391`).
 - :func:`DataFrame.to_dict` with ``orient='index'`` no longer casts int columns to float for a DataFrame with only int and float columns (:issue:`18580`)
 - A user-defined-function that is passed to :func:`Series.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, :func:`DataFrame.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, or its expanding cousins, will now *always* be passed a ``Series``, rather than a ``np.array``; ``.apply()`` only has the ``raw`` keyword, see :ref:`here <whatsnew_0230.enhancements.window_raw>`. This is consistent with the signatures of ``.aggregate()`` across pandas (:issue:`20584`)
+- Renamed the named argument `usecols` of :func:`read_excel` to `usecols_excel`, which receives a list of index numbers or A1-style column references to select the columns that must be in the DataFrame, so that the `usecols` argument can serve its purpose of selecting the columns that must be in the DataFrame by column label (:issue:`18273`)

 .. _whatsnew_0230.deprecations:

@@ -1166,6 +1167,7 @@ I/O
 - Bug in :func:`DataFrame.to_latex()` where a ``MultiIndex`` with an empty string as its name would result in incorrect output (:issue:`18669`)
 - Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
 - Bug in :func:`DataFrame.to_parquet` where an exception was raised if the write destination is S3 (:issue:`19134`)
+- Bug in :func:`read_excel` where passing the `usecols_excel` named argument as a list of strings returned an empty DataFrame (:issue:`18273`)
 - :class:`Interval` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`)
 - :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`, :issue:`9155`, :issue:`19900`)
 - Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)

pandas/io/excel.py

+68 -19
@@ -85,19 +85,41 @@
     Column (0-indexed) to use as the row labels of the DataFrame.
     Pass None if there is no such column. If a list is passed,
     those columns will be combined into a ``MultiIndex``. If a
-    subset of data is selected with ``usecols``, index_col
+    subset of data is selected with ``usecols_excel``, index_col
     is based on the subset.
 parse_cols : int or list, default None

     .. deprecated:: 0.21.0
-       Pass in `usecols` instead.
-
-usecols : int or list, default None
+       Pass in `usecols_excel` instead.
+
+usecols : list-like or callable, default None
+    Return a subset of the columns. If list-like, all elements must either
+    be positional (i.e. integer indices into the document columns) or strings
+    that correspond to column names provided either by the user in `names` or
+    inferred from the document header row(s). For example, a valid list-like
+    `usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element
+    order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]`` and
+    ``usecols=['foo', 'bar']`` is the same as ``['bar', 'foo']``.
+    To instantiate a DataFrame from ``data`` with element order preserved use
+    ``pd.read_excel(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
+    in ``['foo', 'bar']`` order or
+    ``pd.read_excel(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
+    for ``['bar', 'foo']`` order.
+
+    If callable, the callable function will be evaluated against the column
+    names, returning names where the callable function evaluates to True. An
+    example of a valid callable argument would be ``lambda x: x.upper() in
+    ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
+    parsing time and lower memory usage.
+usecols_excel : int or list, default None
     * If None then parse all columns,
     * If int then indicates last column to be parsed
     * If list of ints then indicates list of column numbers to be parsed
     * If string then indicates comma separated list of Excel column letters and
-      column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
+      column ranges (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are
+      inclusive of both sides.
+    * If list of strings each string shall be an Excel column letter or column
+      range (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are inclusive of
       both sides.
 squeeze : boolean, default False
     If the parsed data only contains one column then return a Series
@@ -278,14 +300,14 @@ def get_writer(engine_name):


 @Appender(_read_excel_doc)
-@deprecate_kwarg("parse_cols", "usecols")
+@deprecate_kwarg("parse_cols", "usecols_excel")
 @deprecate_kwarg("skip_footer", "skipfooter")
 def read_excel(io,
                sheet_name=0,
                header=0,
                names=None,
                index_col=None,
-               usecols=None,
+               usecols_excel=None,
                squeeze=False,
                dtype=None,
                engine=None,
@@ -320,7 +342,7 @@ def read_excel(io,
         header=header,
         names=names,
         index_col=index_col,
-        usecols=usecols,
+        usecols_excel=usecols_excel,
         squeeze=squeeze,
         dtype=dtype,
         converters=converters,
@@ -413,7 +435,7 @@ def parse(self,
               header=0,
               names=None,
               index_col=None,
-              usecols=None,
+              usecols_excel=None,
               squeeze=False,
               converters=None,
               true_values=None,
@@ -439,7 +461,7 @@ def parse(self,
                                  header=header,
                                  names=names,
                                  index_col=index_col,
-                                 usecols=usecols,
+                                 usecols_excel=usecols_excel,
                                  squeeze=squeeze,
                                  converters=converters,
                                  true_values=true_values,
@@ -455,7 +477,7 @@ def parse(self,
                                  convert_float=convert_float,
                                  **kwds)

-    def _should_parse(self, i, usecols):
+    def _should_parse(self, i, usecols_excel):

         def _range2cols(areas):
             """
@@ -481,18 +503,26 @@ def _excel2num(x):
                     cols.append(_excel2num(rng))
             return cols

-        if isinstance(usecols, int):
-            return i <= usecols
-        elif isinstance(usecols, compat.string_types):
-            return i in _range2cols(usecols)
+        if isinstance(usecols_excel, int):
+            return i <= usecols_excel
+        # check if usecols_excel is a string that indicates a comma separated
+        # list of Excel column letters and column ranges
+        elif isinstance(usecols_excel, compat.string_types):
+            return i in _range2cols(usecols_excel)
+        # check if usecols_excel is a list of strings, each one indicating an
+        # Excel column letter or a column range
+        elif all(isinstance(x, compat.string_types) for x in usecols_excel):
+            usecols_excel_str = ",".join(usecols_excel)
+            return i in _range2cols(usecols_excel_str)
         else:
-            return i in usecols
+            return i in usecols_excel

     def _parse_excel(self,
                      sheetname=0,
                      header=0,
                      names=None,
                      index_col=None,
+                     usecols_excel=None,
                      usecols=None,
                      squeeze=False,
                      dtype=None,
@@ -512,6 +542,10 @@ def _parse_excel(self,

         _validate_header_arg(header)

+        if (usecols is not None) and (usecols_excel is not None):
+            raise TypeError("Cannot specify both `usecols` and `usecols_excel`"
+                            ". Choose one of them.")
+
         if 'chunksize' in kwds:
             raise NotImplementedError("chunksize keyword of read_excel "
                                       "is not implemented")
@@ -615,13 +649,27 @@ def _parse_cell(cell_contents, cell_typ):
                 row = []
                 for j, (value, typ) in enumerate(zip(sheet.row_values(i),
                                                      sheet.row_types(i))):
-                    if usecols is not None and j not in should_parse:
-                        should_parse[j] = self._should_parse(j, usecols)
+                    if usecols_excel is not None and j not in should_parse:
+                        should_parse[j] = self._should_parse(j, usecols_excel)

-                    if usecols is None or should_parse[j]:
+                    if usecols_excel is None or should_parse[j]:
                         row.append(_parse_cell(value, typ))
                 data.append(row)

+            # Check if some string in usecols may be interpreted as an Excel
+            # positional column
+            if (usecols is not None) and (not callable(usecols)) and \
+                    (not all(isinstance(x, int) for x in usecols)) and \
+                    any(isinstance(x, compat.string_types) and x.isalpha()
+                        for x in usecols):
+                warnings.warn("The `usecols` named argument used to refer to "
+                              "Excel column letters or ranges and int "
+                              "positional indexes was renamed to "
+                              "`usecols_excel`. Now `usecols` is used to "
+                              "pass either a list of only string column labels"
+                              " or a list of only integer positional indexes.",
+                              UserWarning, stacklevel=3)
+
             if sheet.nrows == 0:
                 output[asheetname] = DataFrame()
                 continue
@@ -674,6 +722,7 @@ def _parse_cell(cell_contents, cell_typ):
                                     dtype=dtype,
                                     true_values=true_values,
                                     false_values=false_values,
+                                    usecols=usecols,
                                     skiprows=skiprows,
                                     nrows=nrows,
                                     na_values=na_values,
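
Note on the `_should_parse` change above: the branching on `usecols_excel` (int, comma-separated string, list of strings, list of ints) can be illustrated as a standalone function. This is only a minimal sketch, not the pandas implementation. The names `excel_to_index`, `range_to_cols` and `should_parse` are hypothetical, the body of the letter-to-number conversion is an assumption (the diff does not show `_excel2num`), `str` stands in for `compat.string_types`, and the empty-list guard is an addition of this sketch.

```python
def excel_to_index(label):
    """Convert an Excel column label ('A', 'Z', 'AA', ...) to a 0-based index.

    Assumed behaviour; the diff does not show the body of `_excel2num`.
    """
    label = label.strip().upper()
    if not label or not all("A" <= ch <= "Z" for ch in label):
        raise ValueError("invalid Excel column label: {!r}".format(label))
    index = 0
    for ch in label:
        # Base-26 accumulation where 'A' counts as 1.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def range_to_cols(areas):
    """Expand a spec such as 'A,C,E:F' into 0-based indices (ranges inclusive)."""
    cols = []
    for part in areas.split(","):
        if ":" in part:
            start, end = part.split(":")
            cols.extend(range(excel_to_index(start), excel_to_index(end) + 1))
        else:
            cols.append(excel_to_index(part))
    return cols


def should_parse(i, usecols_excel):
    """Decide whether 0-based column `i` is selected by `usecols_excel`."""
    if isinstance(usecols_excel, int):
        # An int means "parse every column up to and including this one".
        return i <= usecols_excel
    if isinstance(usecols_excel, str):
        return i in range_to_cols(usecols_excel)
    # Guard against an empty list (sketch-only addition): all([]) is True.
    if usecols_excel and all(isinstance(x, str) for x in usecols_excel):
        return i in range_to_cols(",".join(usecols_excel))
    # Otherwise treat it as a list of integer positions.
    return i in usecols_excel


if __name__ == "__main__":
    print(range_to_cols("A,C,E:F"))  # [0, 2, 4, 5]
    print(should_parse(4, ["A", "C", "E:F"]))  # True
```

Joining a list of strings with commas, as the commit does, means a list such as `["A", "C", "E:F"]` is handled by the same range parser as the single string `"A,C,E:F"`.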

pandas/io/parsers.py

+18
@@ -1980,6 +1980,24 @@ def TextParser(*args, **kwds):
     parse_dates : boolean, default False
     keep_date_col : boolean, default False
     date_parser : function, default None
+    usecols : list-like or callable, default None
+        Return a subset of the columns. If list-like, all elements must be strings
+        that correspond to column names provided either by the user in `names`
+        or inferred from the document header row(s). For example, a valid
+        list-like `usecols` parameter would be ['foo', 'bar', 'baz']. Element
+        order is ignored, so ``usecols=['foo', 'bar']`` is the same as
+        ``['bar', 'foo']``.
+        To instantiate a DataFrame from ``data`` with element order preserved
+        use ``pd.read_excel(data, usecols=['foo', 'bar'])[['foo', 'bar']]``
+        for columns in ``['foo', 'bar']`` order or
+        ``pd.read_excel(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
+        for ``['bar', 'foo']`` order.
+
+        If callable, the callable function will be evaluated against the column
+        names, returning names where the callable function evaluates to True.
+        An example of a valid callable argument would be ``lambda x: x.upper()
+        in ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
+        parsing time and lower memory usage.
     skiprows : list of integers
         Row numbers to skip
     skipfooter : int
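
Note on the `usecols` semantics documented above: the docstring says that element order is ignored and that a callable is evaluated against each column name. A minimal sketch of that selection rule follows, written in plain Python so it runs without an Excel file. The function `select_columns` is a hypothetical helper for illustration, not part of pandas, and the error handling for missing or mixed entries is an assumption of this sketch.

```python
def select_columns(header, usecols):
    """Return the sorted 0-based positions of the columns picked by `usecols`.

    `header` is the list of column names; `usecols` may be None, a callable,
    a list of integer positions, or a list of column labels.
    """
    if usecols is None:
        return list(range(len(header)))
    if callable(usecols):
        # Keep every column whose name makes the callable return True.
        return [i for i, name in enumerate(header) if usecols(name)]

    usecols = list(usecols)
    if all(isinstance(x, int) for x in usecols):
        wanted = set(usecols)
    elif all(isinstance(x, str) for x in usecols):
        missing = [x for x in usecols if x not in header]
        if missing:
            raise ValueError("columns not found in header: {}".format(missing))
        wanted = {i for i, name in enumerate(header) if name in usecols}
    else:
        raise ValueError("usecols must be all integers or all strings")

    # Sorting the positions is what makes the element order irrelevant.
    return sorted(wanted)


if __name__ == "__main__":
    header = ["foo", "bar", "baz"]
    print(select_columns(header, ["bar", "foo"]))  # [0, 1]
    print(select_columns(header, lambda x: x.upper() in ["FOO", "BAZ"]))  # [0, 2]
```

Because the result is always in document order, a caller who needs a specific column order has to reindex afterwards, which is why the docstring suggests `[['bar', 'foo']]` on the returned DataFrame.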
