Commit 487a336

BUG: Delegate more of Excel parsing to CSV
The idea is that we read the Excel file, get the data, and then let the TextParser handle the reading and parsing. We shouldn't be doing a lot of work that is already defined in parsers.py.

In doing so, we identified several bugs:

* index_col=None was not being respected
* usecols behavior was inconsistent with that of read_csv for list of strings and callable inputs
* usecols was not being validated as proper Excel column names when passed as a string

Closes gh-18273. Closes gh-20480.
1 parent adc54fe commit 487a336

4 files changed: +663 -510 lines changed

doc/source/io.rst (+21)

@@ -2867,6 +2867,27 @@ indices to be parsed.
 
 Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
 
+.. versionadded:: 0.24
+
+If ``usecols`` is a list of strings, it is assumed that each string corresponds
+to a column name provided either by the user in ``names`` or inferred from the
+document header row(s). Those strings define which columns will be parsed:
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
+
+Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as ``['joe', 'baz']``.
+
+.. versionadded:: 0.24
+
+If ``usecols`` is callable, the callable function will be evaluated against
+the column names, returning names where the callable function evaluates to ``True``.
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
+
 Parsing Dates
 +++++++++++++
 
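
The two new ``usecols`` forms in this documentation change can be illustrated without an Excel file. The following is a minimal sketch, not pandas code: ``select_columns`` is a hypothetical helper that mimics how a list of names or a callable picks columns from a header row, assuming the kept columns follow the header's order whatever order ``usecols`` lists them in.

```python
def select_columns(header, usecols):
    """Return the header names that a names-or-callable `usecols` would keep.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    if usecols is None:
        # No restriction: every column is kept.
        return list(header)
    if callable(usecols):
        # The callable is evaluated against each column name.
        return [name for name in header if usecols(name)]
    # Element order in `usecols` is ignored; the header order is kept.
    wanted = set(usecols)
    return [name for name in header if name in wanted]


header = ["foo", "bar", "baz1", "joe"]
by_name = select_columns(header, ["joe", "foo"])
by_callable = select_columns(header, lambda name: name.isalpha())
print(by_name)
print(by_callable)
```

Here ``baz1`` is dropped by the callable because ``str.isalpha`` is false for names containing digits.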

doc/source/whatsnew/v0.24.0.txt (+3)

@@ -237,6 +237,7 @@ Other Enhancements
 - Compatibility with Matplotlib 3.0 (:issue:`22790`).
 - Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
 - :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexsistent` (:issue:`8917`)
+- :meth:`read_excel()` now accepts ``usecols`` as a list of column names or callable (:issue:`18273`)
 
 .. _whatsnew_0240.api_breaking:
 
@@ -1298,6 +1299,8 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
 - Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
 - Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
 - Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)
+- Bug in :meth:`read_excel()` in which ``index_col=None`` was not being respected and parsing index columns anyway (:issue:`20480`)
+- Bug in :meth:`read_excel()` in which ``usecols`` was not being validated for proper column names when passed in as a string (:issue:`20480`)
 
 Plotting
 ^^^^^^^^

pandas/io/excel.py (+127 -67)

@@ -17,8 +17,7 @@
 import pandas._libs.json as json
 import pandas.compat as compat
 from pandas.compat import (
-    OrderedDict, add_metaclass, lrange, map, range, reduce, string_types, u,
-    zip)
+    OrderedDict, add_metaclass, lrange, map, range, string_types, u, zip)
 from pandas.errors import EmptyDataError
 from pandas.util._decorators import Appender, deprecate_kwarg
 
@@ -93,13 +92,22 @@
     .. deprecated:: 0.21.0
        Pass in `usecols` instead.
 
-usecols : int or list, default None
-    * If None then parse all columns,
-    * If int then indicates last column to be parsed
-    * If list of ints then indicates list of column numbers to be parsed
-    * If string then indicates comma separated list of Excel column letters and
-      column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
+usecols : int, str, list-like, or callable default None
+    * If None, then parse all columns,
+    * If int, then indicates last column to be parsed
+    * If string, then indicates comma separated list of Excel column letters
+      and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
       both sides.
+    * If list of ints, then indicates list of column numbers to be parsed.
+    * If list of strings, then indicates list of column names to be parsed.
+
+      .. versionadded:: 0.24.0
+
+    * If callable, then evaluate each column name against it and parse the
+      column if the callable returns ``True``.
+
+      .. versionadded:: 0.24.0
+
 squeeze : boolean, default False
     If the parsed data only contains one column then return a Series
 dtype : Type name or dict of column -> type, default None
@@ -466,39 +474,6 @@ def parse(self,
                                  convert_float=convert_float,
                                  **kwds)
 
-    def _should_parse(self, i, usecols):
-
-        def _range2cols(areas):
-            """
-            Convert comma separated list of column names and column ranges to a
-            list of 0-based column indexes.
-
-            >>> _range2cols('A:E')
-            [0, 1, 2, 3, 4]
-            >>> _range2cols('A,C,Z:AB')
-            [0, 2, 25, 26, 27]
-            """
-            def _excel2num(x):
-                "Convert Excel column name like 'AB' to 0-based column index"
-                return reduce(lambda s, a: s * 26 + ord(a) - ord('A') + 1,
-                              x.upper().strip(), 0) - 1
-
-            cols = []
-            for rng in areas.split(','):
-                if ':' in rng:
-                    rng = rng.split(':')
-                    cols += lrange(_excel2num(rng[0]), _excel2num(rng[1]) + 1)
-                else:
-                    cols.append(_excel2num(rng))
-            return cols
-
-        if isinstance(usecols, int):
-            return i <= usecols
-        elif isinstance(usecols, compat.string_types):
-            return i in _range2cols(usecols)
-        else:
-            return i in usecols
-
     def _parse_excel(self,
                      sheet_name=0,
                      header=0,
@@ -527,10 +502,6 @@ def _parse_excel(self,
             raise NotImplementedError("chunksize keyword of read_excel "
                                       "is not implemented")
 
-        if parse_dates is True and index_col is None:
-            warnings.warn("The 'parse_dates=True' keyword of read_excel was "
-                          "provided without an 'index_col' keyword value.")
-
         import xlrd
         from xlrd import (xldate, XL_CELL_DATE,
                           XL_CELL_ERROR, XL_CELL_BOOLEAN,
@@ -620,17 +591,13 @@ def _parse_cell(cell_contents, cell_typ):
                 sheet = self.book.sheet_by_index(asheetname)
 
             data = []
-            should_parse = {}
+            usecols = _maybe_convert_usecols(usecols)
 
             for i in range(sheet.nrows):
                 row = []
                 for j, (value, typ) in enumerate(zip(sheet.row_values(i),
                                                      sheet.row_types(i))):
-                    if usecols is not None and j not in should_parse:
-                        should_parse[j] = self._should_parse(j, usecols)
-
-                    if usecols is None or should_parse[j]:
-                        row.append(_parse_cell(value, typ))
+                    row.append(_parse_cell(value, typ))
                 data.append(row)
 
             if sheet.nrows == 0:
@@ -642,31 +609,30 @@ def _parse_cell(cell_contents, cell_typ):
 
             # forward fill and pull out names for MultiIndex column
             header_names = None
-            if header is not None:
-                if is_list_like(header):
-                    header_names = []
-                    control_row = [True] * len(data[0])
-                    for row in header:
-                        if is_integer(skiprows):
-                            row += skiprows
-
-                        data[row], control_row = _fill_mi_header(
-                            data[row], control_row)
-                        header_name, data[row] = _pop_header_name(
-                            data[row], index_col)
-                        header_names.append(header_name)
-                else:
-                    data[header] = _trim_excel_header(data[header])
+            if header is not None and is_list_like(header):
+                header_names = []
+                control_row = [True] * len(data[0])
+
+                for row in header:
+                    if is_integer(skiprows):
+                        row += skiprows
+
+                    data[row], control_row = _fill_mi_header(
+                        data[row], control_row)
+                    header_name, _ = _pop_header_name(
+                        data[row], index_col)
+                    header_names.append(header_name)
 
             if is_list_like(index_col):
-                # forward fill values for MultiIndex index
+                # Forward fill values for MultiIndex index.
                 if not is_list_like(header):
                     offset = 1 + header
                 else:
                     offset = 1 + max(header)
 
                 for col in index_col:
                     last = data[offset][col]
+
                     for row in range(offset + 1, len(data)):
                         if data[row][col] == '' or data[row][col] is None:
                             data[row][col] = last
@@ -693,11 +659,14 @@ def _parse_cell(cell_contents, cell_typ):
                                     thousands=thousands,
                                     comment=comment,
                                     skipfooter=skipfooter,
+                                    usecols=usecols,
                                     **kwds)
 
                 output[asheetname] = parser.read(nrows=nrows)
+
                 if names is not None:
                     output[asheetname].columns = names
+
                 if not squeeze or isinstance(output[asheetname], DataFrame):
                     output[asheetname].columns = output[
                         asheetname].columns.set_names(header_names)
@@ -726,6 +695,97 @@ def __exit__(self, exc_type, exc_value, traceback):
         self.close()
 
 
+def _excel2num(x):
+    """
+    Convert Excel column name like 'AB' to 0-based column index.
+
+    Parameters
+    ----------
+    x : str
+        The Excel column name to convert to a 0-based column index.
+
+    Returns
+    -------
+    num : int
+        The column index corresponding to the name.
+
+    Raises
+    ------
+    ValueError
+        Part of the Excel column name was invalid.
+    """
+    index = 0
+
+    for c in x.upper().strip():
+        cp = ord(c)
+
+        if cp < ord("A") or cp > ord("Z"):
+            raise ValueError("Invalid column name: {x}".format(x=x))
+
+        index = index * 26 + cp - ord("A") + 1
+
+    return index - 1
+
+
+def _range2cols(areas):
+    """
+    Convert comma separated list of column names and ranges to indices.
+
+    Parameters
+    ----------
+    areas : str
+        A string containing a sequence of column ranges (or areas).
+
+    Returns
+    -------
+    cols : list
+        A list of 0-based column indices.
+
+    Examples
+    --------
+    >>> _range2cols('A:E')
+    [0, 1, 2, 3, 4]
+    >>> _range2cols('A,C,Z:AB')
+    [0, 2, 25, 26, 27]
+    """
+    cols = []
+
+    for rng in areas.split(","):
+        if ":" in rng:
+            rng = rng.split(":")
+            cols.extend(lrange(_excel2num(rng[0]), _excel2num(rng[1]) + 1))
+        else:
+            cols.append(_excel2num(rng))
+
+    return cols
+
+
+def _maybe_convert_usecols(usecols):
+    """
+    Convert `usecols` into a compatible format for parsing in `parsers.py`.
+
+    Parameters
+    ----------
+    usecols : object
+        The use-columns object to potentially convert.
+
+    Returns
+    -------
+    converted : object
+        The compatible format of `usecols`.
+    """
+    if usecols is None:
+        return usecols
+
+    if is_integer(usecols):
+        return lrange(usecols + 1)
+
+    if isinstance(usecols, compat.string_types):
+        return _range2cols(usecols)
+
+    return usecols
+
+
 def _validate_freeze_panes(freeze_panes):
     if freeze_panes is not None:
         if (
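
The module-level helpers added in the last hunk can be tried standalone. What follows is a Python 3 adaptation rather than the committed code: it assumes `list(range(...))` in place of `pandas.compat.lrange`, `str` in place of `compat.string_types`, and a plain `int` check in place of `is_integer`. The leading underscores are dropped from the function names to make clear that these are stand-ins.

```python
def excel2num(name):
    """Convert an Excel column name such as 'AB' to a 0-based column index."""
    index = 0
    for ch in name.upper().strip():
        cp = ord(ch)
        # Anything outside A-Z is rejected, which is the new validation step.
        if cp < ord("A") or cp > ord("Z"):
            raise ValueError("Invalid column name: {}".format(name))
        index = index * 26 + cp - ord("A") + 1
    return index - 1


def range2cols(areas):
    """Expand a string like 'A,C,Z:AB' into a list of 0-based indices."""
    cols = []
    for rng in areas.split(","):
        if ":" in rng:
            parts = rng.split(":")
            # Ranges are inclusive of both ends.
            cols.extend(range(excel2num(parts[0]), excel2num(parts[1]) + 1))
        else:
            cols.append(excel2num(rng))
    return cols


def maybe_convert_usecols(usecols):
    """Normalise `usecols` into a form a CSV-style parser can consume."""
    if usecols is None:
        return usecols
    # Assumption: a plain int check (excluding bool) stands in for is_integer.
    if isinstance(usecols, int) and not isinstance(usecols, bool):
        return list(range(usecols + 1))
    if isinstance(usecols, str):
        return range2cols(usecols)
    # Lists of ints, lists of names and callables pass through unchanged.
    return usecols


from_int = maybe_convert_usecols(2)
from_str = maybe_convert_usecols("A,C,Z:AB")
print(from_int)
print(from_str)
```

With this normalisation in place, an integer or an Excel-letter string becomes a list of positions, so column filtering can be handed to the text parser instead of being decided cell by cell while reading the sheet.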
