Commit 6c6eede

BUG: read_excel return empty dataframe when using usecols and restored capability of passing column labels for columns to be read

- [x] closes #18273
- [x] tests added / passed
- [x] passes git diff master --name-only -- "*.py" | grep "pandas/" | xargs -r flake8
- [x] whatsnew entry

Created 'usecols_excel', which receives a string containing comma separated Excel ranges and columns.

Changed the 'usecols' named argument of the 'read_excel' function: it now receives a list of strings containing column labels, a list of integers representing column indexes, or a callable.

Created and altered tests to reflect the new usage of these named arguments.

Previously, the 'index_col' keyword indicated which columns in the subset selected by 'usecols' or 'usecols_excel' should be the index of the DataFrame read. Now 'index_col' indicates which columns of the DataFrame will be the index even if that column is not in the subset of the selected columns.
1 parent 4274b84 commit 6c6eede

5 files changed, +234 -74 lines changed

doc/source/io.rst

+36 -6
@@ -2852,23 +2852,53 @@ Parsing Specific Columns
 
 It is often the case that users will insert columns to do temporary computations
 in Excel and you may not want to read in those columns. ``read_excel`` takes
-a ``usecols`` keyword to allow you to specify a subset of columns to parse.
+either a ``usecols`` or ``usecols_excel`` keyword to allow you to specify a
+subset of columns to parse. Note that you can not use both ``usecols`` and
+``usecols_excel`` named arguments at the same time.
+
+If ``usecols_excel`` is supplied, then it is assumed that indicates a comma
+separated list of Excel column letters and column ranges to be parsed.
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols_excel='A:E')
+   read_excel('path_to_file.xls', 'Sheet1', usecols_excel='A,C,E:F')
 
 If ``usecols`` is an integer, then it is assumed to indicate the last column
 to be parsed.
 
 .. code-block:: python
 
-   read_excel('path_to_file.xls', 'Sheet1', usecols=2)
+   read_excel('path_to_file.xls', 'Sheet1', usecols_excel=2)
+
+If ``usecols`` is a list of integers, then it is assumed to be the file
+column indices to be parsed.
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=[1, 3, 5])
+
+Element order is ignored, so ``usecols_excel=[0, 1]`` is the same as ``[1, 0]``.
+
+If ``usecols`` is a list of strings, then it is assumed that each string
+correspond to column names provided either by the user in `names` or
+inferred from the document header row(s) and those strings define which columns
+will be parsed.
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
+
+Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as
+``['joe', 'baz']``.
 
-If `usecols` is a list of integers, then it is assumed to be the file column
-indices to be parsed.
+If ``usecols`` is callable, the callable function will be evaluated against the
+column names, returning names where the callable function evaluates to True.
 
 .. code-block:: python
 
-   read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])
+   read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
 
-Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
 
 Parsing Dates
 +++++++++++++
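
The three `usecols` forms documented in this hunk (list of integers, list of strings, callable) can be illustrated without a spreadsheet. The following is a minimal sketch under stated assumptions: `select_columns` and the sample `header` are hypothetical names made up for this note and are not part of pandas; the function only mimics the selection rules described in the added documentation, including "element order is ignored".

```python
def select_columns(header, usecols=None):
    """Return the positions in `header` that a `usecols` value selects.

    Hypothetical helper for illustration only; it is not pandas code.
    """
    if usecols is None:
        # No restriction: every column is kept.
        return list(range(len(header)))
    if callable(usecols):
        # Callable form: keep the columns whose name evaluates to True.
        return [i for i, name in enumerate(header) if usecols(name)]
    wanted = set(usecols)
    if all(isinstance(u, int) for u in wanted):
        # Positional form: order of the list does not matter.
        return [i for i in range(len(header)) if i in wanted]
    if all(isinstance(u, str) for u in wanted):
        # Label form: order of the list does not matter either.
        return [i for i, name in enumerate(header) if name in wanted]
    raise ValueError("usecols must be all integers or all strings")


header = ["foo", "bar", "baz", "x1"]
print(select_columns(header, [2, 0]))                # [0, 2]
print(select_columns(header, ["baz", "foo"]))        # [0, 2]
print(select_columns(header, lambda x: x.isalpha())) # [0, 1, 2]
```

Because the result is always produced in file order, `[2, 0]` and `[0, 2]` select the same columns, which is the behaviour the documentation text describes.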

doc/source/whatsnew/v0.23.0.txt

+1
@@ -1325,6 +1325,7 @@ I/O
 - Bug in :func:`DataFrame.to_latex()` where missing space characters caused wrong escaping and produced non-valid latex in some cases (:issue:`20859`)
 - Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
 - Bug in :func:`DataFrame.to_parquet` where an exception was raised if the write destination is S3 (:issue:`19134`)
+- Bug in :func:`read_excel` where ``usecols`` keyword argument as a list of strings were returning a empty ``DataFrame`` (:issue:`18273`)
 - :class:`Interval` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`)
 - :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`, :issue:`9155`, :issue:`19900`)
 - Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)

doc/source/whatsnew/v0.24.0.txt

+1 -1
@@ -35,7 +35,7 @@ Datetimelike API Changes
 Other API Changes
 ^^^^^^^^^^^^^^^^^
 
--
+- :func:`read_excel` has gained the keyword argument ``usecols_excel`` that receives a string containing comma separated Excel ranges and columns. The ``usecols`` keyword argument at :func:`read_excel` had removed support for a string containing comma separated Excel ranges and columns and for an int indicating the first j columns to be read in a ``DataFrame``. Also, the ``usecols`` keyword argument at :func:`read_excel` had added support for receiving a list of strings containing column labels and a callable. (:issue:`18273`)
 -
 -

pandas/io/excel.py

+83 -17
@@ -10,6 +10,8 @@
 import abc
 import warnings
 import numpy as np
+import string
+import re
 from io import UnsupportedOperation
 
 from pandas.core.dtypes.common import (
@@ -85,20 +87,45 @@
     Column (0-indexed) to use as the row labels of the DataFrame.
     Pass None if there is no such column. If a list is passed,
     those columns will be combined into a ``MultiIndex``. If a
-    subset of data is selected with ``usecols``, index_col
-    is based on the subset.
+    subset of data is selected with ``usecols_excel`` or ``usecols``,
+    index_col is based on the subset.
 parse_cols : int or list, default None
 
     .. deprecated:: 0.21.0
        Pass in `usecols` instead.
 
-usecols : int or list, default None
+usecols : list-like or callable or int, default None
+    Return a subset of the columns. If list-like, all elements must either
+    be positional (i.e. integer indices into the document columns) or string
+    that correspond to column names provided either by the user in `names` or
+    inferred from the document header row(s). For example, a valid list-like
+    `usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Note that
+    you can not give both ``usecols`` and ``usecols_excel`` keyword arguments
+    at the same time.
+
+    If callable, the callable function will be evaluated against the column
+    names, returning names where the callable function evaluates to True. An
+    example of a valid callable argument would be ``lambda x: x.upper() in
+    ['AAA', 'BBB', 'DDD']``.
+
+    .. versionadded:: 0.24.0
+       Added support to column labels and now `usecols_excel` is the keyword that
+       receives separated comma list of excel columns and ranges.
+usecols_excel : string or list, default None
+    Return a subset of the columns from a spreadsheet specified as Excel column
+    ranges and columns. Note that you can not use both ``usecols`` and
+    ``usecols_excel`` keyword arguments at the same time.
+
     * If None then parse all columns,
-    * If int then indicates last column to be parsed
-    * If list of ints then indicates list of column numbers to be parsed
     * If string then indicates comma separated list of Excel column letters and
-      column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
-      both sides.
+      column ranges (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are
+      inclusive of both sides.
+    * If list of strings each string shall be an Excel column letter or column
+      range (e.g. ["A:E"] or ["A", "C", "E:F"]) to be parsed. Ranges are
+      inclusive of both sides.
+
+    .. versionadded:: 0.24.0
+
 squeeze : boolean, default False
     If the parsed data only contains one column then return a Series
 dtype : Type name or dict of column -> type, default None
@@ -269,6 +296,17 @@ def _get_default_writer(ext):
     return _default_writers[ext]
 
 
+def _is_excel_columns_notation(columns):
+    """Receives a string and check if the string is a comma separated list of
+    Excel index columns and index ranges. An Excel range is a string with two
+    column indexes separated by ':')."""
+    if isinstance(columns, compat.string_types) and all(
+            (x in string.ascii_letters) for x in re.split(r',|:', columns)):
+        return True
+
+    return False
+
+
 def get_writer(engine_name):
     try:
         return _writers[engine_name]
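
A point worth checking in `_is_excel_columns_notation`: `x in string.ascii_letters` is a substring test against the 52-character alphabet string, not a per-character test. The sketch below reproduces that check and compares it with a regex-based alternative. It is an illustration under stated assumptions, not the committed code: `substring_check` and `is_excel_columns_notation` are names chosen for this note, and plain `str` stands in for `compat.string_types`.

```python
import re
import string


def substring_check(columns):
    # Same test as in the hunk above, with `str` in place of
    # `compat.string_types` (assumption: Python 3 only).
    return isinstance(columns, str) and all(
        (x in string.ascii_letters) for x in re.split(r',|:', columns))


# One column is a run of letters; one item is a column or a "first:last"
# range; the whole string is a comma separated list of items.
_COLUMN = r'[A-Za-z]+'
_ITEM = r'{col}(?::{col})?'.format(col=_COLUMN)
_NOTATION = re.compile(r'{item}(?:,{item})*\Z'.format(item=_ITEM))


def is_excel_columns_notation(columns):
    """Stricter sketch: every piece must be a non-empty run of letters."""
    return isinstance(columns, str) and _NOTATION.match(columns) is not None


for value in ['A:E', 'A,C,E:F', 'AA:AC', 'A,,B', '']:
    print(repr(value), substring_check(value), is_excel_columns_notation(value))
```

With the substring test, `'AA:AC'` is rejected because `'AA'` is not a contiguous slice of the alphabet, while `'A,,B'` and the empty string are accepted because `''` is a substring of every string. The regex variant accepts the first and rejects the other two. Neither version can tell an Excel column such as `'AB'` from a column label that happens to consist only of letters; that ambiguity is inherent in the notation.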
@@ -286,6 +324,7 @@ def read_excel(io,
                names=None,
                index_col=None,
                usecols=None,
+               usecols_excel=None,
                squeeze=False,
                dtype=None,
                engine=None,
@@ -311,6 +350,7 @@ def read_excel(io,
         header=header,
         names=names,
         index_col=index_col,
+        usecols_excel=usecols_excel,
         usecols=usecols,
         squeeze=squeeze,
         dtype=dtype,
@@ -405,6 +445,7 @@ def parse(self,
               names=None,
               index_col=None,
               usecols=None,
+              usecols_excel=None,
               squeeze=False,
               converters=None,
               true_values=None,
@@ -439,6 +480,7 @@ def parse(self,
                                  header=header,
                                  names=names,
                                  index_col=index_col,
+                                 usecols_excel=usecols_excel,
                                  usecols=usecols,
                                  squeeze=squeeze,
                                  converters=converters,
@@ -455,7 +497,7 @@ def parse(self,
                                  convert_float=convert_float,
                                  **kwds)
 
-    def _should_parse(self, i, usecols):
+    def _should_parse(self, i, usecols_excel, usecols):
 
         def _range2cols(areas):
             """
@@ -481,19 +523,20 @@ def _excel2num(x):
                     cols.append(_excel2num(rng))
             return cols
 
-        if isinstance(usecols, int):
-            return i <= usecols
-        elif isinstance(usecols, compat.string_types):
-            return i in _range2cols(usecols)
-        else:
-            return i in usecols
+        # check if usecols_excel is a string that indicates a comma separated
+        # list of Excel column letters and column ranges
+        if isinstance(usecols_excel, compat.string_types):
+            return i in _range2cols(usecols_excel)
+
+        return True
 
     def _parse_excel(self,
                      sheet_name=0,
                      header=0,
                      names=None,
                      index_col=None,
                      usecols=None,
+                     usecols_excel=None,
                      squeeze=False,
                      dtype=None,
                      true_values=None,
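
The `_range2cols` and `_excel2num` helpers appear only as fragments in this diff, so the conversion that `usecols_excel` relies on is not visible here. The following is a minimal standalone sketch of that kind of conversion, assuming the usual bijective base-26 reading of Excel column letters and inclusive ranges as stated in the docstring. The names `excel_to_index` and `range_to_cols` are hypothetical and the bodies are written for this note, not copied from pandas.

```python
def excel_to_index(name):
    """Convert an Excel column name such as 'A' or 'AB' to a 0-based index."""
    index = 0
    for letter in name.strip().upper():
        # Letters act as digits 1..26 in a base-26 number without a zero.
        index = index * 26 + (ord(letter) - ord('A') + 1)
    return index - 1


def range_to_cols(areas):
    """Expand a string like 'A,C,E:F' into a list of 0-based column indexes."""
    cols = []
    for rng in areas.split(','):
        if ':' in rng:
            first, last = rng.split(':')
            # Ranges are inclusive of both sides.
            cols.extend(range(excel_to_index(first), excel_to_index(last) + 1))
        else:
            cols.append(excel_to_index(rng))
    return cols


print(range_to_cols('A:E'))      # [0, 1, 2, 3, 4]
print(range_to_cols('A,C,E:F'))  # [0, 2, 4, 5]
print(excel_to_index('AA'))      # 26
```

Under this reading, the new `_should_parse` keeps column `i` exactly when `i` is in the expanded list, and keeps every column when `usecols_excel` is not a string.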
@@ -512,6 +555,25 @@ def _parse_excel(self,
 
         _validate_header_arg(header)
 
+        if (usecols is not None) and (usecols_excel is not None):
+            raise ValueError("Cannot specify both `usecols` and "
+                             "`usecols_excel`. Choose one of them.")
+
+        # Check if some string in usecols may be interpreted as a Excel
+        # range or positional column
+        elif _is_excel_columns_notation(usecols):
+            warnings.warn("The `usecols` keyword argument used to refer to "
+                          "Excel ranges and columns as strings was "
+                          "renamed to `usecols_excel`.", UserWarning,
+                          stacklevel=3)
+            usecols_excel = usecols
+            usecols = None
+
+        elif (usecols_excel is not None) and not _is_excel_columns_notation(
+                usecols_excel):
+            raise TypeError("`usecols_excel` must be None or a string as a "
+                            "comma separeted Excel ranges and columns.")
+
         if 'chunksize' in kwds:
             raise NotImplementedError("chunksize keyword of read_excel "
                                       "is not implemented")
@@ -615,10 +677,13 @@ def _parse_cell(cell_contents, cell_typ):
             row = []
             for j, (value, typ) in enumerate(zip(sheet.row_values(i),
                                                  sheet.row_types(i))):
-                if usecols is not None and j not in should_parse:
-                    should_parse[j] = self._should_parse(j, usecols)
+                if ((usecols is not None) or (usecols_excel is not None) or
+                        (j not in should_parse)):
+                    should_parse[j] = self._should_parse(j, usecols_excel,
+                                                         usecols)
 
-                if usecols is None or should_parse[j]:
+                if (((usecols_excel is None) and (usecols is None)) or
+                        should_parse[j]):
                     row.append(_parse_cell(value, typ))
             data.append(row)
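
In the removed lines, `should_parse` acted as a per-column cache: the guard joined its tests with `and`, so `_should_parse` ran once per column. The added guard joins three tests with `or`, so, as far as this hunk shows, it is true for every cell whenever either keyword is set and the decision is recomputed each time. The sketch below shows the caching pattern in isolation. `make_column_filter`, `calls`, and the sample `rows` are hypothetical names introduced for this note; it is not the pandas implementation.

```python
def make_column_filter(selected):
    """Build a cached per-column predicate.

    `selected` is a set of 0-based column indexes, or None to keep
    every column. Hypothetical helper for illustration only.
    """
    cache = {}
    calls = []  # records which columns were actually evaluated

    def should_parse(j):
        if selected is None:
            return True
        if j not in cache:
            # Evaluate the (possibly expensive) decision once per column.
            calls.append(j)
            cache[j] = j in selected
        return cache[j]

    return should_parse, calls


rows = [[1, 2, 3], [4, 5, 6]]
should_parse, calls = make_column_filter({0, 2})
data = [[value for j, value in enumerate(row) if should_parse(j)]
        for row in rows]
print(data)   # [[1, 3], [4, 6]]
print(calls)  # [0, 1, 2]
```

Each column is evaluated once even though two rows are read, which is the property the original `and` guard provided.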

@@ -674,6 +739,7 @@ def _parse_cell(cell_contents, cell_typ):
                                 dtype=dtype,
                                 true_values=true_values,
                                 false_values=false_values,
+                                usecols=usecols,
                                 skiprows=skiprows,
                                 nrows=nrows,
                                 na_values=na_values,
