
Commit ddbc258

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file, get the data, and then let the TextParser handle the reading and parsing. We shouldn't be doing a lot of work that is already defined in parsers.py.

In doing so, we identified several bugs:

* index_col=None was not being respected
* usecols behavior was inconsistent with that of read_csv for list of strings and callable inputs
* usecols was not being validated as proper Excel column names when passed as a string.

Closes gh-18273.
Closes gh-20480.
1 parent adc54fe commit ddbc258

4 files changed, +664 -510 lines changed


doc/source/io.rst

+21
@@ -2867,6 +2867,27 @@ indices to be parsed.

 Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.

+.. versionadded:: 0.24
+
+If ``usecols`` is a list of strings, it is assumed that each string corresponds
+to a column name provided either by the user in ``names`` or inferred from the
+document header row(s). Those strings define which columns will be parsed:
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
+
+Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as ``['joe', 'baz']``.
+
+.. versionadded:: 0.24
+
+If ``usecols`` is callable, the callable function will be evaluated against
+the column names, returning names where the callable function evaluates to ``True``.
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
+
 Parsing Dates
 +++++++++++++

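The selection rules documented in this hunk can be illustrated without an Excel file. The following is a minimal sketch, not pandas code: `select_columns` is a hypothetical helper that mimics the documented behaviour on a plain list of header names. A list of strings keeps the matching names in document order (the order of the list itself is ignored), and a callable keeps every name for which it returns ``True``.

```python
def select_columns(names, usecols):
    """Hypothetical helper (not part of pandas): names kept by ``usecols``."""
    if usecols is None:
        # No filter: every column is kept.
        return list(names)
    if callable(usecols):
        # Callable: keep names for which the callable returns True.
        return [name for name in names if usecols(name)]
    # List of strings: membership test, so element order does not matter.
    wanted = set(usecols)
    return [name for name in names if name in wanted]


header = ["foo", "bar", "baz2", "joe"]

by_name = select_columns(header, ["joe", "foo"])
by_callable = select_columns(header, lambda x: x.isalpha())

print(by_name)      # ['foo', 'joe']
print(by_callable)  # ['foo', 'bar', 'joe']
```

Here ``"baz2"`` is dropped by the callable because ``str.isalpha()`` is false for a name containing a digit.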

doc/source/whatsnew/v0.24.0.txt

+3
@@ -237,6 +237,7 @@ Other Enhancements
 - Compatibility with Matplotlib 3.0 (:issue:`22790`).
 - Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
 - :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexsistent` (:issue:`8917`)
+- :meth:`read_excel()` now accepts ``usecols`` as a list of column names or callable (:issue:`18273`)

 .. _whatsnew_0240.api_breaking:

@@ -1298,6 +1299,8 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
 - Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
 - Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
 - Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)
+- Bug in :meth:`read_excel()` in which ``index_col=None`` was not being respected and parsing index columns anyway (:issue:`20480`)
+- Bug in :meth:`read_excel()` in which ``usecols`` was not being validated for proper column names when passed in as a string (:issue:`20480`)

 Plotting
 ^^^^^^^^

pandas/io/excel.py

+128-67
@@ -17,8 +17,8 @@
 import pandas._libs.json as json
 import pandas.compat as compat
 from pandas.compat import (
-    OrderedDict, add_metaclass, lrange, map, range, reduce, string_types, u,
-    zip)
+    OrderedDict, add_metaclass, lrange, map, range, string_types, u, zip)
+from pandas.core.dtypes.api import is_integer
 from pandas.errors import EmptyDataError
 from pandas.util._decorators import Appender, deprecate_kwarg

@@ -93,13 +93,22 @@
     .. deprecated:: 0.21.0
        Pass in `usecols` instead.

-usecols : int or list, default None
-    * If None then parse all columns,
-    * If int then indicates last column to be parsed
-    * If list of ints then indicates list of column numbers to be parsed
-    * If string then indicates comma separated list of Excel column letters and
-      column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
+usecols : int, str, list-like, or callable default None
+    * If None, then parse all columns,
+    * If int, then indicates last column to be parsed
+    * If string, then indicates comma separated list of Excel column letters
+      and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
       both sides.
+    * If list of ints, then indicates list of column numbers to be parsed.
+    * If list of strings, then indicates list of column names to be parsed.
+
+      .. versionadded:: 0.24.0
+
+    * If callable, then evaluate each column name against it and parse the
+      column if the callable returns ``True``.
+
+      .. versionadded:: 0.24.0
+
 squeeze : boolean, default False
     If the parsed data only contains one column then return a Series
 dtype : Type name or dict of column -> type, default None
@@ -466,39 +475,6 @@ def parse(self,
                                  convert_float=convert_float,
                                  **kwds)

-    def _should_parse(self, i, usecols):
-
-        def _range2cols(areas):
-            """
-            Convert comma separated list of column names and column ranges to a
-            list of 0-based column indexes.
-
-            >>> _range2cols('A:E')
-            [0, 1, 2, 3, 4]
-            >>> _range2cols('A,C,Z:AB')
-            [0, 2, 25, 26, 27]
-            """
-            def _excel2num(x):
-                "Convert Excel column name like 'AB' to 0-based column index"
-                return reduce(lambda s, a: s * 26 + ord(a) - ord('A') + 1,
-                              x.upper().strip(), 0) - 1
-
-            cols = []
-            for rng in areas.split(','):
-                if ':' in rng:
-                    rng = rng.split(':')
-                    cols += lrange(_excel2num(rng[0]), _excel2num(rng[1]) + 1)
-                else:
-                    cols.append(_excel2num(rng))
-            return cols
-
-        if isinstance(usecols, int):
-            return i <= usecols
-        elif isinstance(usecols, compat.string_types):
-            return i in _range2cols(usecols)
-        else:
-            return i in usecols
-
     def _parse_excel(self,
                      sheet_name=0,
                      header=0,
@@ -527,10 +503,6 @@ def _parse_excel(self,
             raise NotImplementedError("chunksize keyword of read_excel "
                                       "is not implemented")

-        if parse_dates is True and index_col is None:
-            warnings.warn("The 'parse_dates=True' keyword of read_excel was "
-                          "provided without an 'index_col' keyword value.")
-
         import xlrd
         from xlrd import (xldate, XL_CELL_DATE,
                           XL_CELL_ERROR, XL_CELL_BOOLEAN,
@@ -620,17 +592,13 @@ def _parse_cell(cell_contents, cell_typ):
                 sheet = self.book.sheet_by_index(asheetname)

             data = []
-            should_parse = {}
+            usecols = _maybe_convert_usecols(usecols)

             for i in range(sheet.nrows):
                 row = []
                 for j, (value, typ) in enumerate(zip(sheet.row_values(i),
                                                      sheet.row_types(i))):
-                    if usecols is not None and j not in should_parse:
-                        should_parse[j] = self._should_parse(j, usecols)
-
-                    if usecols is None or should_parse[j]:
-                        row.append(_parse_cell(value, typ))
+                    row.append(_parse_cell(value, typ))
                 data.append(row)

             if sheet.nrows == 0:
@@ -642,31 +610,30 @@ def _parse_cell(cell_contents, cell_typ):

             # forward fill and pull out names for MultiIndex column
             header_names = None
-            if header is not None:
-                if is_list_like(header):
-                    header_names = []
-                    control_row = [True] * len(data[0])
-                    for row in header:
-                        if is_integer(skiprows):
-                            row += skiprows
-
-                        data[row], control_row = _fill_mi_header(
-                            data[row], control_row)
-                        header_name, data[row] = _pop_header_name(
-                            data[row], index_col)
-                        header_names.append(header_name)
-                else:
-                    data[header] = _trim_excel_header(data[header])
+            if header is not None and is_list_like(header):
+                header_names = []
+                control_row = [True] * len(data[0])
+
+                for row in header:
+                    if is_integer(skiprows):
+                        row += skiprows
+
+                    data[row], control_row = _fill_mi_header(
+                        data[row], control_row)
+                    header_name, _ = _pop_header_name(
+                        data[row], index_col)
+                    header_names.append(header_name)

             if is_list_like(index_col):
-                # forward fill values for MultiIndex index
+                # Forward fill values for MultiIndex index.
                 if not is_list_like(header):
                     offset = 1 + header
                 else:
                     offset = 1 + max(header)

                 for col in index_col:
                     last = data[offset][col]
+
                     for row in range(offset + 1, len(data)):
                         if data[row][col] == '' or data[row][col] is None:
                             data[row][col] = last
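The forward fill at the end of this hunk can be tried on a small list of rows. This is a minimal standalone sketch, not the pandas implementation: `forward_fill_index` is a hypothetical name, and the `else` branch is an assumption, since the hunk is cut off before showing how `last` is updated.

```python
def forward_fill_index(data, index_cols, offset):
    """Fill blank cells in ``index_cols`` with the last value seen above them.

    ``data`` is a list of row lists and is modified in place. ``offset`` is
    the first data row, i.e. ``1 + header`` for a single header row.
    """
    for col in index_cols:
        last = data[offset][col]

        for row in range(offset + 1, len(data)):
            if data[row][col] == "" or data[row][col] is None:
                data[row][col] = last
            else:
                # Assumed continuation: remember the newest non-blank value.
                last = data[row][col]
    return data


rows = [
    ["region", "city", "sales"],  # header row (header=0)
    ["North", "Oslo", 10],
    ["", "Bergen", 7],            # merged cell read back as ''
    ["South", "Rome", 12],
    [None, "Naples", 5],          # merged cell read back as None
]

forward_fill_index(rows, index_cols=[0], offset=1)
print([row[0] for row in rows[1:]])  # ['North', 'North', 'South', 'South']
```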
@@ -693,11 +660,14 @@ def _parse_cell(cell_contents, cell_typ):
                                     thousands=thousands,
                                     comment=comment,
                                     skipfooter=skipfooter,
+                                    usecols=usecols,
                                     **kwds)

                 output[asheetname] = parser.read(nrows=nrows)
+
                 if names is not None:
                     output[asheetname].columns = names
+
                 if not squeeze or isinstance(output[asheetname], DataFrame):
                     output[asheetname].columns = output[
                         asheetname].columns.set_names(header_names)
@@ -726,6 +696,97 @@ def __exit__(self, exc_type, exc_value, traceback):
         self.close()


+def _excel2num(x):
+    """
+    Convert Excel column name like 'AB' to 0-based column index.
+
+    Parameters
+    ----------
+    x : str
+        The Excel column name to convert to a 0-based column index.
+
+    Returns
+    -------
+    num : int
+        The column index corresponding to the name.
+
+    Raises
+    ------
+    ValueError
+        Part of the Excel column name was invalid.
+    """
+    index = 0
+
+    for c in x.upper().strip():
+        cp = ord(c)
+
+        if cp < ord("A") or cp > ord("Z"):
+            raise ValueError("Invalid column name: {x}".format(x=x))
+
+        index = index * 26 + cp - ord("A") + 1
+
+    return index - 1
+
+
+def _range2cols(areas):
+    """
+    Convert comma separated list of column names and ranges to indices.
+
+    Parameters
+    ----------
+    areas : str
+        A string containing a sequence of column ranges (or areas).
+
+    Returns
+    -------
+    cols : list
+        A list of 0-based column indices.
+
+    Examples
+    --------
+    >>> _range2cols('A:E')
+    [0, 1, 2, 3, 4]
+    >>> _range2cols('A,C,Z:AB')
+    [0, 2, 25, 26, 27]
+    """
+    cols = []
+
+    for rng in areas.split(","):
+        if ":" in rng:
+            rng = rng.split(":")
+            cols.extend(lrange(_excel2num(rng[0]), _excel2num(rng[1]) + 1))
+        else:
+            cols.append(_excel2num(rng))
+
+    return cols
+
+
+def _maybe_convert_usecols(usecols):
+    """
+    Convert `usecols` into a compatible format for parsing in `parsers.py`.
+
+    Parameters
+    ----------
+    usecols : object
+        The use-columns object to potentially convert.
+
+    Returns
+    -------
+    converted : object
+        The compatible format of `usecols`.
+    """
+    if usecols is None:
+        return usecols
+
+    if is_integer(usecols):
+        return lrange(usecols + 1)
+
+    if isinstance(usecols, compat.string_types):
+        return _range2cols(usecols)
+
+    return usecols
+
+
 def _validate_freeze_panes(freeze_panes):
     if freeze_panes is not None:
         if (
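The three helpers added in this hunk can be exercised outside pandas with a small Python 3 port. This is a sketch under stated assumptions, not the committed code: the names drop the leading underscore, `list(range(...))` stands in for `pandas.compat.lrange`, `isinstance(..., int)` stands in for `is_integer`, and `str` stands in for `compat.string_types`.

```python
def excel2num(name):
    """Convert an Excel column name such as 'AB' to a 0-based index."""
    index = 0
    for c in name.upper().strip():
        cp = ord(c)
        if cp < ord("A") or cp > ord("Z"):
            raise ValueError("Invalid column name: {x}".format(x=name))
        # Base-26 accumulation with digits 1..26 (A=1, ..., Z=26).
        index = index * 26 + cp - ord("A") + 1
    return index - 1


def range2cols(areas):
    """Convert 'A,C,Z:AB'-style strings to a list of 0-based indices."""
    cols = []
    for rng in areas.split(","):
        if ":" in rng:
            start, end = rng.split(":")
            # Ranges are inclusive on both sides.
            cols.extend(range(excel2num(start), excel2num(end) + 1))
        else:
            cols.append(excel2num(rng))
    return cols


def maybe_convert_usecols(usecols):
    """Normalize int and str forms to index lists; pass others through."""
    if usecols is None:
        return None
    if isinstance(usecols, int):  # stand-in for pandas' is_integer
        return list(range(usecols + 1))
    if isinstance(usecols, str):  # stand-in for compat.string_types
        return range2cols(usecols)
    return usecols


# Worked arithmetic for 'AB': A -> 0*26 + 1 = 1, B -> 1*26 + 2 = 28, 28 - 1 = 27.
print(excel2num("AB"))                    # 27
print(maybe_convert_usecols("A,C,Z:AB"))  # [0, 2, 25, 26, 27]
print(maybe_convert_usecols(2))           # [0, 1, 2]
```

Lists of names and callables pass through unchanged, which is what lets the downstream parser apply them to the column names. A string such as ``"A1"`` raises ``ValueError`` instead of silently producing a wrong index, which is the validation the commit message describes.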
