Commit 207b13b

gfyoung authored and Pingviinituutti committed
BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
The idea is that we read the Excel file, get the data, and then let the TextParser handle the reading and parsing. We shouldn't be doing a lot of work that is already defined in parsers.py

In doing so, we identified several bugs:

* index_col=None was not being respected
* usecols behavior was inconsistent with that of read_csv for list of strings and callable inputs
* usecols was not being validated as proper Excel column names when passed as a string.

Closes pandas-devgh-18273.
Closes pandas-devgh-20480.
1 parent e8fe182 commit 207b13b

4 files changed: +670 -511 lines changed

doc/source/io.rst

+28 -1
@@ -2861,7 +2861,13 @@ to be parsed.
 
    read_excel('path_to_file.xls', 'Sheet1', usecols=2)
 
-If `usecols` is a list of integers, then it is assumed to be the file column
+You can also specify a comma-delimited set of Excel columns and ranges as a string:
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')
+
+If ``usecols`` is a list of integers, then it is assumed to be the file column
 indices to be parsed.
 
 .. code-block:: python
@@ -2870,6 +2876,27 @@ indices to be parsed.
 
 Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
 
+.. versionadded:: 0.24
+
+If ``usecols`` is a list of strings, it is assumed that each string corresponds
+to a column name provided either by the user in ``names`` or inferred from the
+document header row(s). Those strings define which columns will be parsed:
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
+
+Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as ``['joe', 'baz']``.
+
+.. versionadded:: 0.24
+
+If ``usecols`` is callable, the callable function will be evaluated against
+the column names, returning names where the callable function evaluates to ``True``.
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
+
 Parsing Dates
 +++++++++++++
 
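The selection rules documented in this hunk (integer positions, column names, or a callable applied to each name, with element order ignored) can be illustrated without pandas. The following is a minimal sketch only: `select_columns` is a hypothetical helper written for this note, not part of pandas, and it ignores the int and Excel-range-string forms of ``usecols``.

```python
def select_columns(header, rows, usecols):
    """Hypothetical helper: keep only the columns that ``usecols`` selects."""
    if usecols is None:
        keep = list(range(len(header)))
    elif callable(usecols):
        # The callable is evaluated against each column name.
        keep = [i for i, name in enumerate(header) if usecols(name)]
    elif all(isinstance(u, int) for u in usecols):
        # Integers are positional indices; a set makes element order irrelevant.
        wanted = set(usecols)
        keep = [i for i in range(len(header)) if i in wanted]
    else:
        # Strings are column names; order inside ``usecols`` is ignored here too.
        wanted = set(usecols)
        keep = [i for i, name in enumerate(header) if name in wanted]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


header = ["foo", "bar", "baz2"]
rows = [[1, 2, 3], [4, 5, 6]]

by_index = select_columns(header, rows, [1, 0])
by_name = select_columns(header, rows, ["bar", "foo"])
by_callable = select_columns(header, rows, lambda name: name.isalpha())
```

All three calls select the same two columns in file order, which is the "element order is ignored" behavior the documentation describes.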

doc/source/whatsnew/v0.24.0.txt

+3
@@ -238,6 +238,7 @@ Other Enhancements
 - Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
 - :func:`~DataFrame.to_parquet` now supports writing a ``DataFrame`` as a directory of parquet files partitioned by a subset of the columns when ``engine = 'pyarrow'`` (:issue:`23283`)
 - :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexsistent` (:issue:`8917`)
+- :meth:`read_excel()` now accepts ``usecols`` as a list of column names or callable (:issue:`18273`)
 
 .. _whatsnew_0240.api_breaking:
 
@@ -1300,6 +1301,8 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
 - Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
 - Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
 - Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)
+- Bug in :meth:`read_excel()` in which ``index_col=None`` was not being respected and parsing index columns anyway (:issue:`20480`)
+- Bug in :meth:`read_excel()` in which ``usecols`` was not being validated for proper column names when passed in as a string (:issue:`20480`)
 
 Plotting
 ^^^^^^^^
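The ``usecols`` validation bug noted above can be reproduced in isolation. The sketch below copies the arithmetic of the removed `reduce`-based helper and of the new loop from `pandas/io/excel.py` in this commit; the names `excel2num_old` and `excel2num_new` are made up for this comparison, and the code is a standalone Python 3 illustration rather than the pandas source itself.

```python
from functools import reduce


def excel2num_old(x):
    # Pre-commit logic (the removed nested helper): no validation at all.
    return reduce(lambda s, a: s * 26 + ord(a) - ord("A") + 1,
                  x.upper().strip(), 0) - 1


def excel2num_new(x):
    # Post-commit logic: reject any character outside A-Z.
    index = 0
    for c in x.upper().strip():
        cp = ord(c)
        if cp < ord("A") or cp > ord("Z"):
            raise ValueError("Invalid column name: {x}".format(x=x))
        index = index * 26 + cp - ord("A") + 1
    return index - 1


# "A1" is a cell reference, not a column name.
silent = excel2num_old("A1")  # quietly yields a meaningless index

try:
    excel2num_new("A1")
    raised = False
except ValueError:
    raised = True
```

For valid names such as ``'AB'`` both versions agree; only the invalid input behaves differently, silently mapping to index 10 before the change and raising ``ValueError`` after it.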

pandas/io/excel.py

+127 -67
@@ -17,8 +17,7 @@
 import pandas._libs.json as json
 import pandas.compat as compat
 from pandas.compat import (
-    OrderedDict, add_metaclass, lrange, map, range, reduce, string_types, u,
-    zip)
+    OrderedDict, add_metaclass, lrange, map, range, string_types, u, zip)
 from pandas.errors import EmptyDataError
 from pandas.util._decorators import Appender, deprecate_kwarg
 
@@ -93,13 +92,22 @@
     .. deprecated:: 0.21.0
        Pass in `usecols` instead.
 
-usecols : int or list, default None
-    * If None then parse all columns,
-    * If int then indicates last column to be parsed
-    * If list of ints then indicates list of column numbers to be parsed
-    * If string then indicates comma separated list of Excel column letters and
-      column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
+usecols : int, str, list-like, or callable default None
+    * If None, then parse all columns,
+    * If int, then indicates last column to be parsed
+    * If string, then indicates comma separated list of Excel column letters
+      and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
       both sides.
+    * If list of ints, then indicates list of column numbers to be parsed.
+    * If list of strings, then indicates list of column names to be parsed.
+
+      .. versionadded:: 0.24.0
+
+    * If callable, then evaluate each column name against it and parse the
+      column if the callable returns ``True``.
+
+      .. versionadded:: 0.24.0
+
 squeeze : boolean, default False
     If the parsed data only contains one column then return a Series
 dtype : Type name or dict of column -> type, default None
@@ -466,39 +474,6 @@ def parse(self,
                                  convert_float=convert_float,
                                  **kwds)
 
-    def _should_parse(self, i, usecols):
-
-        def _range2cols(areas):
-            """
-            Convert comma separated list of column names and column ranges to a
-            list of 0-based column indexes.
-
-            >>> _range2cols('A:E')
-            [0, 1, 2, 3, 4]
-            >>> _range2cols('A,C,Z:AB')
-            [0, 2, 25, 26, 27]
-            """
-            def _excel2num(x):
-                "Convert Excel column name like 'AB' to 0-based column index"
-                return reduce(lambda s, a: s * 26 + ord(a) - ord('A') + 1,
-                              x.upper().strip(), 0) - 1
-
-            cols = []
-            for rng in areas.split(','):
-                if ':' in rng:
-                    rng = rng.split(':')
-                    cols += lrange(_excel2num(rng[0]), _excel2num(rng[1]) + 1)
-                else:
-                    cols.append(_excel2num(rng))
-            return cols
-
-        if isinstance(usecols, int):
-            return i <= usecols
-        elif isinstance(usecols, compat.string_types):
-            return i in _range2cols(usecols)
-        else:
-            return i in usecols
-
     def _parse_excel(self,
                      sheet_name=0,
                      header=0,
@@ -527,10 +502,6 @@ def _parse_excel(self,
             raise NotImplementedError("chunksize keyword of read_excel "
                                       "is not implemented")
 
-        if parse_dates is True and index_col is None:
-            warnings.warn("The 'parse_dates=True' keyword of read_excel was "
-                          "provided without an 'index_col' keyword value.")
-
         import xlrd
         from xlrd import (xldate, XL_CELL_DATE,
                           XL_CELL_ERROR, XL_CELL_BOOLEAN,
@@ -620,17 +591,13 @@ def _parse_cell(cell_contents, cell_typ):
                 sheet = self.book.sheet_by_index(asheetname)
 
             data = []
-            should_parse = {}
+            usecols = _maybe_convert_usecols(usecols)
 
             for i in range(sheet.nrows):
                 row = []
                 for j, (value, typ) in enumerate(zip(sheet.row_values(i),
                                                      sheet.row_types(i))):
-                    if usecols is not None and j not in should_parse:
-                        should_parse[j] = self._should_parse(j, usecols)
-
-                    if usecols is None or should_parse[j]:
-                        row.append(_parse_cell(value, typ))
+                    row.append(_parse_cell(value, typ))
                 data.append(row)
 
             if sheet.nrows == 0:
@@ -642,31 +609,30 @@ def _parse_cell(cell_contents, cell_typ):
 
             # forward fill and pull out names for MultiIndex column
             header_names = None
-            if header is not None:
-                if is_list_like(header):
-                    header_names = []
-                    control_row = [True] * len(data[0])
-                    for row in header:
-                        if is_integer(skiprows):
-                            row += skiprows
-
-                        data[row], control_row = _fill_mi_header(
-                            data[row], control_row)
-                        header_name, data[row] = _pop_header_name(
-                            data[row], index_col)
-                        header_names.append(header_name)
-                else:
-                    data[header] = _trim_excel_header(data[header])
+            if header is not None and is_list_like(header):
+                header_names = []
+                control_row = [True] * len(data[0])
+
+                for row in header:
+                    if is_integer(skiprows):
+                        row += skiprows
+
+                    data[row], control_row = _fill_mi_header(
+                        data[row], control_row)
+                    header_name, _ = _pop_header_name(
+                        data[row], index_col)
+                    header_names.append(header_name)
 
             if is_list_like(index_col):
-                # forward fill values for MultiIndex index
+                # Forward fill values for MultiIndex index.
                 if not is_list_like(header):
                     offset = 1 + header
                 else:
                     offset = 1 + max(header)
 
                 for col in index_col:
                     last = data[offset][col]
+
                     for row in range(offset + 1, len(data)):
                         if data[row][col] == '' or data[row][col] is None:
                             data[row][col] = last
@@ -693,11 +659,14 @@ def _parse_cell(cell_contents, cell_typ):
                                     thousands=thousands,
                                     comment=comment,
                                     skipfooter=skipfooter,
+                                    usecols=usecols,
                                     **kwds)
 
                 output[asheetname] = parser.read(nrows=nrows)
+
                 if names is not None:
                     output[asheetname].columns = names
+
                 if not squeeze or isinstance(output[asheetname], DataFrame):
                     output[asheetname].columns = output[
                         asheetname].columns.set_names(header_names)
@@ -726,6 +695,97 @@ def __exit__(self, exc_type, exc_value, traceback):
         self.close()
 
 
+def _excel2num(x):
+    """
+    Convert Excel column name like 'AB' to 0-based column index.
+
+    Parameters
+    ----------
+    x : str
+        The Excel column name to convert to a 0-based column index.
+
+    Returns
+    -------
+    num : int
+        The column index corresponding to the name.
+
+    Raises
+    ------
+    ValueError
+        Part of the Excel column name was invalid.
+    """
+    index = 0
+
+    for c in x.upper().strip():
+        cp = ord(c)
+
+        if cp < ord("A") or cp > ord("Z"):
+            raise ValueError("Invalid column name: {x}".format(x=x))
+
+        index = index * 26 + cp - ord("A") + 1
+
+    return index - 1
+
+
+def _range2cols(areas):
+    """
+    Convert comma separated list of column names and ranges to indices.
+
+    Parameters
+    ----------
+    areas : str
+        A string containing a sequence of column ranges (or areas).
+
+    Returns
+    -------
+    cols : list
+        A list of 0-based column indices.
+
+    Examples
+    --------
+    >>> _range2cols('A:E')
+    [0, 1, 2, 3, 4]
+    >>> _range2cols('A,C,Z:AB')
+    [0, 2, 25, 26, 27]
+    """
+    cols = []
+
+    for rng in areas.split(","):
+        if ":" in rng:
+            rng = rng.split(":")
+            cols.extend(lrange(_excel2num(rng[0]), _excel2num(rng[1]) + 1))
+        else:
+            cols.append(_excel2num(rng))
+
+    return cols
+
+
+def _maybe_convert_usecols(usecols):
+    """
+    Convert `usecols` into a compatible format for parsing in `parsers.py`.
+
+    Parameters
+    ----------
+    usecols : object
+        The use-columns object to potentially convert.
+
+    Returns
+    -------
+    converted : object
+        The compatible format of `usecols`.
+    """
+    if usecols is None:
+        return usecols
+
+    if is_integer(usecols):
+        return lrange(usecols + 1)
+
+    if isinstance(usecols, compat.string_types):
+        return _range2cols(usecols)
+
+    return usecols
+
+
 def _validate_freeze_panes(freeze_panes):
     if freeze_panes is not None:
         if (
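The three module-level helpers added in the hunk above reduce every Excel-specific form of ``usecols`` to something the CSV parser already understands. The sketch below is a standalone Python 3 port for experimentation, not the pandas code: `list(range(...))`, `numbers.Integral`, and `str` stand in for `lrange`, `is_integer`, and `compat.string_types`, and the leading underscores are dropped to mark the functions as illustrative copies.

```python
import numbers


def excel2num(x):
    """Convert an Excel column name such as 'AB' to a 0-based index."""
    index = 0
    for c in x.upper().strip():
        cp = ord(c)
        if cp < ord("A") or cp > ord("Z"):
            raise ValueError("Invalid column name: {x}".format(x=x))
        index = index * 26 + cp - ord("A") + 1
    return index - 1


def range2cols(areas):
    """Expand a string such as 'A,C:E' into a list of 0-based indices."""
    cols = []
    for rng in areas.split(","):
        if ":" in rng:
            start, stop = rng.split(":")
            # Ranges are inclusive on both sides.
            cols.extend(range(excel2num(start), excel2num(stop) + 1))
        else:
            cols.append(excel2num(rng))
    return cols


def maybe_convert_usecols(usecols):
    """Normalize int and str inputs; pass everything else through."""
    if usecols is None:
        return usecols
    if isinstance(usecols, numbers.Integral):
        # An int means "up to and including this column".
        return list(range(usecols + 1))
    if isinstance(usecols, str):
        return range2cols(usecols)
    return usecols


from_int = maybe_convert_usecols(2)
from_str = maybe_convert_usecols("A,C:E")
names = ["foo", "bar"]
passthrough = maybe_convert_usecols(names)
```

Lists and callables come back unchanged, which is what lets the parser apply the same name-based and callable semantics that ``read_csv`` uses.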
