Commit e257100

BUG: read_excel return empty dataframe when using usecols and restored capability of passing column labels for columns to be read

- [x] closes #18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

Created 'usecols_excel', which receives a string containing comma-separated Excel ranges and columns. Changed the 'usecols' named argument of the 'read_excel' function: it now receives a list of strings containing column labels, a list of integers representing column indexes, or a callable. Created and altered tests to reflect the new usage of these named arguments.

The 'index_col' keyword used to indicate which columns in the subset selected by 'usecols' or 'usecols_excel' should be the index of the DataFrame read. Now 'index_col' indicates which columns of the DataFrame will be the index, even if that column is not in the subset of selected columns.
1 parent 415012f commit e257100

4 files changed

+228
-77
lines changed

doc/source/io.rst

+31-8
@@ -2852,23 +2852,46 @@ Parsing Specific Columns
28522852

28532853
It is often the case that users will insert columns to do temporary computations
28542854
in Excel and you may not want to read in those columns. ``read_excel`` takes
2855-
a ``usecols`` keyword to allow you to specify a subset of columns to parse.
2855+
either a ``usecols`` or ``usecols_excel`` keyword to allow you to specify a
2856+
subset of columns to parse. Note that you cannot use both ``usecols`` and
2857+
``usecols_excel`` named arguments at the same time.
28562858

2857-
If ``usecols`` is an integer, then it is assumed to indicate the last column
2858-
to be parsed.
2859+
If ``usecols_excel`` is supplied, it is assumed to be a comma-separated
2860+
list of Excel column letters and column ranges to be parsed.
28592861

28602862
.. code-block:: python
28612863
2862-
read_excel('path_to_file.xls', 'Sheet1', usecols=2)
2864+
read_excel('path_to_file.xls', 'Sheet1', usecols_excel='A:E')
2865+
read_excel('path_to_file.xls', 'Sheet1', usecols_excel='A,C,E:F')
28632866
2864-
If `usecols` is a list of integers, then it is assumed to be the file column
2865-
indices to be parsed.
2867+
If ``usecols`` is a list of integers, then it is assumed to be the file
2868+
column indices to be parsed.
28662869

28672870
.. code-block:: python
28682871
2869-
read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])
2872+
read_excel('path_to_file.xls', 'Sheet1', usecols=[1, 3, 5])
2873+
2874+
Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
2875+
2876+
If ``usecols`` is a list of strings, then it is assumed that each string
2877+
corresponds to a column name provided either by the user in ``names`` or
2878+
inferred from the document header row(s) and those strings define which columns
2879+
will be parsed.
2880+
2881+
.. code-block:: python
2882+
2883+
read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
2884+
2885+
Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as
2886+
``['joe', 'baz']``.
2887+
2888+
If ``usecols`` is callable, the callable function will be evaluated against the
2889+
column names, returning names where the callable function evaluates to True.
2890+
2891+
.. code-block:: python
2892+
2893+
read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
28702894
2871-
Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
28722895
28732896
Parsing Dates
28742897
+++++++++++++
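The ``usecols_excel`` notation documented in this hunk can be read as a list of zero-based column positions. The following is a minimal sketch of that mapping, not the pandas implementation: the helper names ``excel_col_to_index`` and ``excel_cols_to_indices`` are hypothetical, and the sketch assumes the specification contains no whitespace.

```python
def excel_col_to_index(col):
    """Convert an Excel column label such as 'A' or 'AA' to a zero-based index.

    Hypothetical helper for illustration only.
    """
    label = col.upper()
    if not label:
        raise ValueError("empty Excel column label")
    index = 0
    for ch in label:
        if not ("A" <= ch <= "Z"):
            raise ValueError("invalid Excel column label: {!r}".format(col))
        # Column labels behave like a base-26 number with digits A=1 .. Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def excel_cols_to_indices(spec):
    """Expand a spec such as 'A,C,E:F' into zero-based column indices.

    Ranges are inclusive on both sides, as the documentation states.
    """
    indices = []
    for part in spec.split(","):
        if ":" in part:
            start, end = part.split(":")
            first = excel_col_to_index(start)
            last = excel_col_to_index(end)
            indices.extend(range(first, last + 1))
        else:
            indices.append(excel_col_to_index(part))
    return indices


print(excel_cols_to_indices("A:E"))      # [0, 1, 2, 3, 4]
print(excel_cols_to_indices("A,C,E:F"))  # [0, 2, 4, 5]
```

Under this reading, ``usecols_excel='A,C,E:F'`` selects the same positions as the integer list ``[0, 2, 4, 5]``.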

doc/source/whatsnew/v0.24.0.txt

+2-2
@@ -36,7 +36,7 @@ Datetimelike API Changes
3636
Other API Changes
3737
^^^^^^^^^^^^^^^^^
3838

39-
-
39+
- :func:`read_excel` has gained the keyword argument ``usecols_excel``, which receives a string containing comma-separated Excel ranges and columns. The ``usecols`` keyword argument of :func:`read_excel` no longer supports a string containing comma-separated Excel ranges and columns, or an int indicating the first j columns to be read into a ``DataFrame``. In addition, ``usecols`` now accepts a list of strings containing column labels or a callable. (:issue:`18273`)
4040
-
4141
-
4242

@@ -148,7 +148,7 @@ I/O
148148
^^^
149149

150150
-
151-
-
151+
- Bug in :func:`read_excel` where passing ``usecols`` as a list of strings returned an empty ``DataFrame`` (:issue:`18273`)
152152
-
153153

154154
Plotting
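The whatsnew entries above describe three accepted forms for ``usecols``: a list of integers, a list of column labels, and a callable. As a rough illustration of how each form could select columns from a header row, here is a sketch under stated assumptions: ``select_columns`` is a hypothetical helper, not a pandas function, and it returns the selected positions in document order to reflect that element order is ignored.

```python
def select_columns(names, usecols):
    """Return the zero-based positions of the columns selected by usecols.

    Hypothetical helper: `names` is the list of header labels and
    `usecols` is None, a list of ints, a list of strings, or a callable.
    """
    if usecols is None:
        return list(range(len(names)))
    if callable(usecols):
        # Keep every column whose label makes the callable return True.
        return [i for i, name in enumerate(names) if usecols(name)]
    wanted = set(usecols)
    if all(isinstance(c, int) for c in wanted):
        return [i for i in range(len(names)) if i in wanted]
    if all(isinstance(c, str) for c in wanted):
        return [i for i, name in enumerate(names) if name in wanted]
    raise ValueError("usecols must be all integers, all strings, or a callable")


header = ["foo", "bar", "baz", "x1"]
print(select_columns(header, ["baz", "foo"]))          # [0, 2]
print(select_columns(header, [2, 0]))                  # [0, 2]
print(select_columns(header, lambda x: x.isalpha()))   # [0, 1, 2]
```

Because the result follows document order, ``['baz', 'foo']`` and ``['foo', 'baz']`` select the same columns.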

pandas/io/excel.py

+82-17
@@ -10,6 +10,8 @@
1010
import abc
1111
import warnings
1212
import numpy as np
13+
import string
14+
import re
1315
from io import UnsupportedOperation
1416

1517
from pandas.core.dtypes.common import (
@@ -85,20 +87,42 @@
8587
Column (0-indexed) to use as the row labels of the DataFrame.
8688
Pass None if there is no such column. If a list is passed,
8789
those columns will be combined into a ``MultiIndex``. If a
88-
subset of data is selected with ``usecols``, index_col
89-
is based on the subset.
90+
subset of data is selected with ``usecols_excel`` or ``usecols``,
91+
index_col is based on the subset.
9092
parse_cols : int or list, default None
9193
9294
.. deprecated:: 0.21.0
9395
Pass in `usecols` instead.
9496
95-
usecols : int or list, default None
97+
usecols : list-like or callable, default None
98+
Return a subset of the columns. If list-like, all elements must either
99+
be positional (i.e. integer indices into the document columns) or strings
100+
that correspond to column names provided either by the user in `names` or
101+
inferred from the document header row(s). For example, a valid list-like
102+
`usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Note that
103+
you can not give both ``usecols`` and ``usecols_excel`` keyword arguments
104+
at the same time.
105+
106+
If callable, the callable function will be evaluated against the column
107+
names, returning names where the callable function evaluates to True. An
108+
example of a valid callable argument would be ``lambda x: x.upper() in
109+
['AAA', 'BBB', 'DDD']``.
110+
111+
.. versionadded:: 0.24.0
112+
Added support for column labels; `usecols_excel` is now the keyword that
113+
receives a comma-separated list of Excel columns and ranges.
114+
usecols_excel : string, default None
115+
Return a subset of the columns from a spreadsheet specified as Excel column
116+
ranges and columns. Note that you can not use both ``usecols`` and
117+
``usecols_excel`` keyword arguments at the same time.
118+
96119
* If None then parse all columns,
97-
* If int then indicates last column to be parsed
98-
* If list of ints then indicates list of column numbers to be parsed
99120
* If string then indicates comma separated list of Excel column letters and
100-
column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
101-
both sides.
121+
column ranges (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are
122+
inclusive of both sides.
123+
124+
.. versionadded:: 0.24.0
125+
102126
squeeze : boolean, default False
103127
If the parsed data only contains one column then return a Series
104128
dtype : Type name or dict of column -> type, default None
@@ -269,6 +293,19 @@ def _get_default_writer(ext):
269293
return _default_writers[ext]
270294

271295

296+
def _is_excel_columns_notation(columns):
297+
"""
298+
Receive a string and check whether it is a comma-separated list of
299+
Excel column indexes and index ranges. An Excel range is a string with two
300+
column indexes separated by ':'.
301+
"""
302+
if isinstance(columns, compat.string_types) and all(
303+
re.match(r'^[A-Za-z]+$', x) for x in re.split(r',|:', columns)):
304+
return True
305+
306+
return False
307+
308+
272309
def get_writer(engine_name):
273310
try:
274311
return _writers[engine_name]
@@ -286,6 +323,7 @@ def read_excel(io,
286323
names=None,
287324
index_col=None,
288325
usecols=None,
326+
usecols_excel=None,
289327
squeeze=False,
290328
dtype=None,
291329
engine=None,
@@ -311,6 +349,7 @@ def read_excel(io,
311349
header=header,
312350
names=names,
313351
index_col=index_col,
352+
usecols_excel=usecols_excel,
314353
usecols=usecols,
315354
squeeze=squeeze,
316355
dtype=dtype,
@@ -405,6 +444,7 @@ def parse(self,
405444
names=None,
406445
index_col=None,
407446
usecols=None,
447+
usecols_excel=None,
408448
squeeze=False,
409449
converters=None,
410450
true_values=None,
@@ -439,6 +479,7 @@ def parse(self,
439479
header=header,
440480
names=names,
441481
index_col=index_col,
482+
usecols_excel=usecols_excel,
442483
usecols=usecols,
443484
squeeze=squeeze,
444485
converters=converters,
@@ -455,7 +496,7 @@ def parse(self,
455496
convert_float=convert_float,
456497
**kwds)
457498

458-
def _should_parse(self, i, usecols):
499+
def _should_parse(self, i, usecols_excel, usecols):
459500

460501
def _range2cols(areas):
461502
"""
@@ -481,19 +522,20 @@ def _excel2num(x):
481522
cols.append(_excel2num(rng))
482523
return cols
483524

484-
if isinstance(usecols, int):
485-
return i <= usecols
486-
elif isinstance(usecols, compat.string_types):
487-
return i in _range2cols(usecols)
488-
else:
489-
return i in usecols
525+
# check if usecols_excel is a string that indicates a comma separated
526+
# list of Excel column letters and column ranges
527+
if isinstance(usecols_excel, compat.string_types):
528+
return i in _range2cols(usecols_excel)
529+
530+
return True
490531

491532
def _parse_excel(self,
492533
sheet_name=0,
493534
header=0,
494535
names=None,
495536
index_col=None,
496537
usecols=None,
538+
usecols_excel=None,
497539
squeeze=False,
498540
dtype=None,
499541
true_values=None,
@@ -512,6 +554,25 @@ def _parse_excel(self,
512554

513555
_validate_header_arg(header)
514556

557+
if (usecols is not None) and (usecols_excel is not None):
558+
raise ValueError("Cannot specify both `usecols` and "
559+
"`usecols_excel`. Choose one of them.")
560+
561+
# Check if some string in usecols may be interpreted as an Excel
562+
# range or positional column
563+
elif _is_excel_columns_notation(usecols):
564+
warnings.warn("The `usecols` keyword argument used to refer to "
565+
"Excel ranges and columns as strings was "
566+
"renamed to `usecols_excel`.", UserWarning,
567+
stacklevel=3)
568+
usecols_excel = usecols
569+
usecols = None
570+
571+
elif (usecols_excel is not None) and not _is_excel_columns_notation(
572+
usecols_excel):
573+
raise TypeError("`usecols_excel` must be None or a string of "
574+
"comma separated Excel ranges and columns.")
575+
515576
if 'chunksize' in kwds:
516577
raise NotImplementedError("chunksize keyword of read_excel "
517578
"is not implemented")
@@ -615,10 +676,13 @@ def _parse_cell(cell_contents, cell_typ):
615676
row = []
616677
for j, (value, typ) in enumerate(zip(sheet.row_values(i),
617678
sheet.row_types(i))):
618-
if usecols is not None and j not in should_parse:
619-
should_parse[j] = self._should_parse(j, usecols)
679+
if ((usecols is not None) or (usecols_excel is not None) or
680+
(j not in should_parse)):
681+
should_parse[j] = self._should_parse(j, usecols_excel,
682+
usecols)
620683

621-
if usecols is None or should_parse[j]:
684+
if (((usecols_excel is None) and (usecols is None)) or
685+
should_parse[j]):
622686
row.append(_parse_cell(value, typ))
623687
data.append(row)
624688

@@ -674,6 +738,7 @@ def _parse_cell(cell_contents, cell_typ):
674738
dtype=dtype,
675739
true_values=true_values,
676740
false_values=false_values,
741+
usecols=usecols,
677742
skiprows=skiprows,
678743
nrows=nrows,
679744
na_values=na_values,
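The argument handling added at the top of ``_parse_excel`` can be summarized as a small standalone function. This is a sketch under stated assumptions, not the code from this commit: ``looks_like_excel_notation`` and ``resolve_usecols`` are hypothetical names, the notation check uses a regular expression of my own choosing, and whitespace inside the specification is rejected.

```python
import re
import warnings

# Assumed grammar: letters, optionally 'X:Y' ranges, separated by commas.
_EXCEL_NOTATION = re.compile(
    r"[A-Za-z]+(:[A-Za-z]+)?(,[A-Za-z]+(:[A-Za-z]+)?)*")


def looks_like_excel_notation(value):
    """Return True if value is a string such as 'A:E' or 'A,C,E:F'."""
    return (isinstance(value, str)
            and _EXCEL_NOTATION.fullmatch(value) is not None)


def resolve_usecols(usecols, usecols_excel):
    """Return the (usecols, usecols_excel) pair that parsing should use."""
    if usecols is not None and usecols_excel is not None:
        raise ValueError("Cannot specify both `usecols` and "
                         "`usecols_excel`. Choose one of them.")
    if looks_like_excel_notation(usecols):
        # Old-style call: reroute the string and tell the caller about it.
        warnings.warn("Excel ranges passed as `usecols` should be passed "
                      "as `usecols_excel`.", UserWarning, stacklevel=2)
        return None, usecols
    if (usecols_excel is not None
            and not looks_like_excel_notation(usecols_excel)):
        raise TypeError("`usecols_excel` must be None or a string of "
                        "comma separated Excel ranges and columns.")
    return usecols, usecols_excel


print(resolve_usecols(["foo", "bar"], None))  # (['foo', 'bar'], None)
print(resolve_usecols(None, "A,C,E:F"))       # (None, 'A,C,E:F')
```

One consequence of this design is that a bare string passed as ``usecols`` is treated as Excel notation whenever it consists only of letters, commas, and colons, so column labels should always be passed inside a list.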
