Commit 0d39ca1

chris-b1 authored and jreback committed
API: read_excel signature
1 parent 274abee commit 0d39ca1

File tree

4 files changed

+260
-180
lines changed

doc/source/io.rst

+113-62
Original file line numberDiff line numberDiff line change
@@ -1980,100 +1980,85 @@ Excel files
19801980
-----------
19811981

19821982
The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``) and
1983-
Excel 2007 (``.xlsx``) files using the ``xlrd`` Python
1984-
module and use the same parsing code as the above to convert tabular data into
1985-
a DataFrame. See the :ref:`cookbook<cookbook.excel>` for some
1983+
Excel 2007+ (``.xlsx``) files using the ``xlrd`` Python
1984+
module. The :meth:`~DataFrame.to_excel` instance method is used for
1985+
saving a ``DataFrame`` to Excel. Generally the semantics are
1986+
similar to working with :ref:`csv<io.read_csv_table>` data. See the :ref:`cookbook<cookbook.excel>` for some
19861987
advanced strategies
19871988

19881989
.. _io.excel_reader:
19891990

19901991
Reading Excel Files
19911992
'''''''''''''''''''
19921993

1993-
.. versionadded:: 0.17
1994+
In the most basic use-case, ``read_excel`` takes a path to an Excel
1995+
file, and the ``sheetname`` indicating which sheet to parse.
19941996

1995-
``read_excel`` can read a ``MultiIndex`` index, by passing a list of columns to ``index_col``
1996-
and a ``MultiIndex`` column by passing a list of rows to ``header``. If either the ``index``
1997-
or ``columns`` have serialized level names those will be read in as well by specifying
1998-
the rows/columns that make up the levels.
1999-
2000-
.. ipython:: python
1997+
.. code-block:: python
20011998
2002-
# MultiIndex index - no names
2003-
df = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]},
2004-
index=pd.MultiIndex.from_product([['a','b'],['c','d']]))
2005-
df.to_excel('path_to_file.xlsx')
2006-
df = pd.read_excel('path_to_file.xlsx', index_col=[0,1])
2007-
df
1999+
# Returns a DataFrame
2000+
read_excel('path_to_file.xls', sheetname='Sheet1')
20082001
2009-
# MultiIndex index - with names
2010-
df.index = df.index.set_names(['lvl1', 'lvl2'])
2011-
df.to_excel('path_to_file.xlsx')
2012-
df = pd.read_excel('path_to_file.xlsx', index_col=[0,1])
2013-
df
20142002
2015-
# MultiIndex index and column - with names
2016-
df.columns = pd.MultiIndex.from_product([['a'],['b', 'd']], names=['c1', 'c2'])
2017-
df.to_excel('path_to_file.xlsx')
2018-
df = pd.read_excel('path_to_file.xlsx',
2019-
index_col=[0,1], header=[0,1])
2020-
df
2003+
.. _io.excel.excelfile_class:
20212004

2022-
.. ipython:: python
2023-
:suppress:
2005+
``ExcelFile`` class
2006+
+++++++++++++++++++
20242007

2025-
import os
2026-
os.remove('path_to_file.xlsx')
2008+
To facilitate working with multiple sheets from the same file, the ``ExcelFile``
2009+
class can be used to wrap the file and can be passed into ``read_excel``.
2010+
There will be a performance benefit for reading multiple sheets as the file is
2011+
read into memory only once.
20272012

2028-
.. warning::
2013+
.. code-block:: python
20292014
2030-
Excel files saved in version 0.16.2 or prior that had index names will still able to be read in,
2031-
but the ``has_index_names`` argument must specified to ``True``.
2015+
xlsx = pd.ExcelFile('path_to_file.xls')
2016+
df = pd.read_excel(xlsx, 'Sheet1')
20322017
2033-
.. versionadded:: 0.16
2018+
The ``ExcelFile`` class can also be used as a context manager.
20342019
2035-
``read_excel`` can read more than one sheet, by setting ``sheetname`` to either
2036-
a list of sheet names, a list of sheet positions, or ``None`` to read all sheets.
2020+
.. code-block:: python
20372021
2038-
.. versionadded:: 0.13
2022+
with pd.ExcelFile('path_to_file.xls') as xls:
2023+
df1 = pd.read_excel(xls, 'Sheet1')
2024+
df2 = pd.read_excel(xls, 'Sheet2')
20392025
2040-
Sheets can be specified by sheet index or sheet name, using an integer or string,
2041-
respectively.
2026+
The ``sheet_names`` property will generate
2027+
a list of the sheet names in the file.
20422028
2043-
.. versionadded:: 0.12
2029+
The primary use-case for an ``ExcelFile`` is parsing multiple sheets with
2030+
different parameters:
20442031
2045-
``ExcelFile`` has been moved to the top level namespace.
2032+
.. code-block:: python
20462033
2047-
There are two approaches to reading an excel file. The ``read_excel`` function
2048-
and the ``ExcelFile`` class. ``read_excel`` is for reading one file
2049-
with file-specific arguments (ie. identical data formats across sheets).
2050-
``ExcelFile`` is for reading one file with sheet-specific arguments (ie. various data
2051-
formats across sheets). Choosing the approach is largely a question of
2052-
code readability and execution speed.
2034+
data = {}
2035+
# For when Sheet1's format differs from Sheet2
2036+
with pd.ExcelFile('path_to_file.xls') as xls:
2037+
data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None, na_values=['NA'])
2038+
data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=1)
20532039
2054-
Equivalent class and function approaches to read a single sheet:
2040+
Note that if the same parsing parameters are used for all sheets, a list
2041+
of sheet names can simply be passed to ``read_excel`` with no loss in performance.
20552042
20562043
.. code-block:: python
20572044
20582045
# using the ExcelFile class
2059-
xls = pd.ExcelFile('path_to_file.xls')
2060-
data = xls.parse('Sheet1', index_col=None, na_values=['NA'])
2046+
data = {}
2047+
with pd.ExcelFile('path_to_file.xls') as xls:
2048+
data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None, na_values=['NA'])
2049+
data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=None, na_values=['NA'])
20612050
2062-
# using the read_excel function
2063-
data = read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
2051+
# equivalent using the read_excel function
2052+
data = pd.read_excel('path_to_file.xls', ['Sheet1', 'Sheet2'], index_col=None, na_values=['NA'])
20642053
2065-
Equivalent class and function approaches to read multiple sheets:
2054+
.. versionadded:: 0.12
20662055
2067-
.. code-block:: python
2056+
``ExcelFile`` has been moved to the top level namespace.
20682057
2069-
data = {}
2070-
# For when Sheet1's format differs from Sheet2
2071-
xls = pd.ExcelFile('path_to_file.xls')
2072-
data['Sheet1'] = xls.parse('Sheet1', index_col=None, na_values=['NA'])
2073-
data['Sheet2'] = xls.parse('Sheet2', index_col=1)
2058+
.. versionadded:: 0.17
2059+
2060+
``read_excel`` can take an ``ExcelFile`` object as input.
20742061
2075-
# For when Sheet1's format is identical to Sheet2
2076-
data = read_excel('path_to_file.xls', ['Sheet1','Sheet2'], index_col=None, na_values=['NA'])
20772062
20782063
.. _io.excel.specifying_sheets:
20792064
@@ -2125,6 +2110,72 @@ Using a list to get multiple sheets:
21252110
# Returns the 1st and 4th sheet, as a dictionary of DataFrames.
21262111
read_excel('path_to_file.xls',sheetname=['Sheet1',3])
21272112
2113+
.. versionadded:: 0.16
2114+
2115+
``read_excel`` can read more than one sheet, by setting ``sheetname`` to either
2116+
a list of sheet names, a list of sheet positions, or ``None`` to read all sheets.
2117+
2118+
.. versionadded:: 0.13
2119+
2120+
Sheets can be specified by sheet index or sheet name, using an integer or string,
2121+
respectively.
2122+
2123+
.. _io.excel.reading_multiindex:
2124+
2125+
Reading a ``MultiIndex``
2126+
++++++++++++++++++++++++
2127+
2128+
.. versionadded:: 0.17
2129+
2130+
``read_excel`` can read a ``MultiIndex`` index, by passing a list of columns to ``index_col``
2131+
and a ``MultiIndex`` column by passing a list of rows to ``header``. If either the ``index``
2132+
or ``columns`` have serialized level names those will be read in as well by specifying
2133+
the rows/columns that make up the levels.
2134+
2135+
For example, to read in a ``MultiIndex`` index without names:
2136+
2137+
.. ipython:: python
2138+
2139+
df = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]},
2140+
index=pd.MultiIndex.from_product([['a','b'],['c','d']]))
2141+
df.to_excel('path_to_file.xlsx')
2142+
df = pd.read_excel('path_to_file.xlsx', index_col=[0,1])
2143+
df
2144+
2145+
If the index has level names, they will be parsed as well, using the same
2146+
parameters.
2147+
2148+
.. ipython:: python
2149+
2150+
df.index = df.index.set_names(['lvl1', 'lvl2'])
2151+
df.to_excel('path_to_file.xlsx')
2152+
df = pd.read_excel('path_to_file.xlsx', index_col=[0,1])
2153+
df
2154+
2155+
2156+
If the source file has both ``MultiIndex`` index and columns, lists specifying each
2157+
should be passed to ``index_col`` and ``header``
2158+
2159+
.. ipython:: python
2160+
2161+
df.columns = pd.MultiIndex.from_product([['a'],['b', 'd']], names=['c1', 'c2'])
2162+
df.to_excel('path_to_file.xlsx')
2163+
df = pd.read_excel('path_to_file.xlsx',
2164+
index_col=[0,1], header=[0,1])
2165+
df
2166+
2167+
.. ipython:: python
2168+
:suppress:
2169+
2170+
import os
2171+
os.remove('path_to_file.xlsx')
2172+
2173+
.. warning::
2174+
2175+
Excel files saved in version 0.16.2 or prior that had index names can still be read in,
2176+
but the ``has_index_names`` argument must be set to ``True``.
2177+
2178+
21282179
Parsing Specific Columns
21292180
++++++++++++++++++++++++
21302181

doc/source/whatsnew/v0.17.0.txt

+2
@@ -938,6 +938,8 @@ Other API Changes
938938
- When constructing ``DataFrame`` with an array of ``complex64`` dtype previously meant the corresponding column
939939
was automatically promoted to the ``complex128`` dtype. Pandas will now preserve the itemsize of the input for complex data (:issue:`10952`)
940940
- some numeric reduction operators would return ``ValueError``, rather than ``TypeError`` on object types that includes strings and numbers (:issue:`11131`)
941+
- Passing the currently unsupported ``chunksize`` argument to ``read_excel`` or ``ExcelFile.parse`` will now raise ``NotImplementedError`` (:issue:`8011`)
942+
- Allow an ``ExcelFile`` object to be passed into ``read_excel`` (:issue:`11198`)
941943
- ``DatetimeIndex.union`` does not infer ``freq`` if ``self`` and the input have ``None`` as ``freq`` (:issue:`11086`)
942944
- ``NaT``'s methods now either raise ``ValueError``, or return ``np.nan`` or ``NaT`` (:issue:`9513`)
943945
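
The ``chunksize`` entry above can be illustrated with a small stand-alone sketch. It does not use pandas: ``parse_sheet`` is a hypothetical stand-in for the parsing routine, and only the guard mirrors the check added in this commit. Note that the guard tests for the presence of the key, so even ``chunksize=None`` is rejected.

```python
def parse_sheet(rows, header=0, **kwds):
    """Hypothetical stand-in for a sheet parser; not the pandas implementation."""
    # Reject the unsupported option up front instead of silently ignoring it.
    if 'chunksize' in kwds:
        raise NotImplementedError("Reading an Excel file in chunks "
                                  "is not implemented")
    columns = rows[header]
    return [dict(zip(columns, row)) for row in rows[header + 1:]]


rows = [['a', 'b'], [1, 2], [3, 4]]
records = parse_sheet(rows)

# The key is present, so the call fails even though the value is None.
try:
    parse_sheet(rows, chunksize=None)
    chunk_error = None
except NotImplementedError as exc:
    chunk_error = str(exc)
```

Raising early, rather than accepting and ignoring the argument, makes it obvious to callers that chunked reading is unavailable.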

pandas/io/excel.py

+54-38
@@ -70,12 +70,20 @@ def get_writer(engine_name):
7070
except KeyError:
7171
raise ValueError("No Excel writer '%s'" % engine_name)
7272

73-
74-
excel_doc_common = """
73+
def read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0,
74+
index_col=None, parse_cols=None, parse_dates=False,
75+
date_parser=None, na_values=None, thousands=None,
76+
convert_float=True, has_index_names=None, converters=None,
77+
engine=None, **kwds):
78+
"""
7579
Read an Excel table into a pandas DataFrame
7680
7781
Parameters
78-
----------%(io)s
82+
----------
83+
io : string, file-like object, pandas ExcelFile, or xlrd workbook.
84+
The string could be a URL. Valid URL schemes include http, ftp, s3,
85+
and file. For file URLs, a host is expected. For instance, a local
86+
file could be file://localhost/path/to/workbook.xlsx
7987
sheetname : string, int, mixed list of strings/ints, or None, default 0
8088
8189
Strings are used for sheet names, Integers are used in zero-indexed sheet
@@ -122,18 +130,24 @@ def get_writer(engine_name):
122130
na_values : list-like, default None
123131
List of additional strings to recognize as NA/NaN
124132
thousands : str, default None
125-
Thousands separator
133+
Thousands separator for parsing string columns to numeric. Note that
134+
this parameter is only necessary for columns stored as TEXT in Excel,
135+
any numeric columns will automatically be parsed, regardless of display
136+
format.
126137
keep_default_na : bool, default True
127138
If na_values are specified and keep_default_na is False the default NaN
128139
values are overridden, otherwise they're appended to
129140
verbose : boolean, default False
130-
Indicate number of NA values placed in non-numeric columns%(eng)s
141+
Indicate number of NA values placed in non-numeric columns
142+
engine: string, default None
143+
If io is not a buffer or path, this must be set to identify io.
144+
Acceptable values are None or xlrd
131145
convert_float : boolean, default True
132146
convert integral floats to int (i.e., 1.0 --> 1). If False, all numeric
133147
data will be read in as floats: Excel stores all numbers as floats
134148
internally
135149
has_index_names : boolean, default None
136-
DEPCRECATED: for version 0.17+ index names will be automatically inferred
150+
DEPRECATED: for version 0.17+ index names will be automatically inferred
137151
based on index_col. To read Excel output from 0.16.2 and prior that
138152
had saved index names, use True.
139153
@@ -144,28 +158,21 @@ def get_writer(engine_name):
144158
for more information on when a Dict of Dataframes is returned.
145159
146160
"""
147-
read_excel_kwargs = dict()
148-
read_excel_kwargs['io'] = """
149-
io : string, file-like object, or xlrd workbook.
150-
The string could be a URL. Valid URL schemes include http, ftp, s3,
151-
and file. For file URLs, a host is expected. For instance, a local
152-
file could be file://localhost/path/to/workbook.xlsx"""
153-
read_excel_kwargs['eng'] = """
154-
engine: string, default None
155-
If io is not a buffer or path, this must be set to identify io.
156-
Acceptable values are None or xlrd"""
157-
158-
@Appender(excel_doc_common % read_excel_kwargs)
159-
def read_excel(io, sheetname=0, **kwds):
160-
engine = kwds.pop('engine', None)
161161

162-
return ExcelFile(io, engine=engine).parse(sheetname=sheetname, **kwds)
162+
if not isinstance(io, ExcelFile):
163+
io = ExcelFile(io, engine=engine)
163164

165+
return io._parse_excel(
166+
sheetname=sheetname, header=header, skiprows=skiprows,
167+
index_col=index_col, parse_cols=parse_cols, parse_dates=parse_dates,
168+
date_parser=date_parser, na_values=na_values, thousands=thousands,
169+
convert_float=convert_float, has_index_names=has_index_names,
170+
skip_footer=skip_footer, converters=converters, **kwds)
164171

165172
class ExcelFile(object):
166173
"""
167174
Class for parsing tabular excel sheets into DataFrame objects.
168-
Uses xlrd. See ExcelFile.parse for more documentation
175+
Uses xlrd. See read_excel for more documentation
169176
170177
Parameters
171178
----------
@@ -207,23 +214,16 @@ def __init__(self, io, **kwds):
207214
raise ValueError('Must explicitly set engine if not passing in'
208215
' buffer or path for io.')
209216

210-
@Appender(excel_doc_common % dict(io='', eng=''))
211217
def parse(self, sheetname=0, header=0, skiprows=None, skip_footer=0,
212218
index_col=None, parse_cols=None, parse_dates=False,
213-
date_parser=None, na_values=None, thousands=None, chunksize=None,
219+
date_parser=None, na_values=None, thousands=None,
214220
convert_float=True, has_index_names=None, converters=None, **kwds):
221+
"""
222+
Parse specified sheet(s) into a DataFrame
215223
216-
skipfooter = kwds.pop('skipfooter', None)
217-
if skipfooter is not None:
218-
skip_footer = skipfooter
219-
220-
_validate_header_arg(header)
221-
if has_index_names is not None:
222-
warn("\nThe has_index_names argument is deprecated; index names "
223-
"will be automatically inferred based on index_col.\n"
224-
"This argmument is still necessary if reading Excel output "
225-
"from 0.16.2 or prior with index names.", FutureWarning,
226-
stacklevel=3)
224+
Equivalent to read_excel(ExcelFile, ...). See the read_excel
225+
docstring for more info on accepted parameters
226+
"""
227227

228228
return self._parse_excel(sheetname=sheetname, header=header,
229229
skiprows=skiprows,
@@ -232,7 +232,7 @@ def parse(self, sheetname=0, header=0, skiprows=None, skip_footer=0,
232232
parse_cols=parse_cols,
233233
parse_dates=parse_dates,
234234
date_parser=date_parser, na_values=na_values,
235-
thousands=thousands, chunksize=chunksize,
235+
thousands=thousands,
236236
skip_footer=skip_footer,
237237
convert_float=convert_float,
238238
converters=converters,
@@ -274,8 +274,25 @@ def _excel2num(x):
274274
def _parse_excel(self, sheetname=0, header=0, skiprows=None, skip_footer=0,
275275
index_col=None, has_index_names=None, parse_cols=None,
276276
parse_dates=False, date_parser=None, na_values=None,
277-
thousands=None, chunksize=None, convert_float=True,
277+
thousands=None, convert_float=True,
278278
verbose=False, **kwds):
279+
280+
skipfooter = kwds.pop('skipfooter', None)
281+
if skipfooter is not None:
282+
skip_footer = skipfooter
283+
284+
_validate_header_arg(header)
285+
if has_index_names is not None:
286+
warn("\nThe has_index_names argument is deprecated; index names "
287+
"will be automatically inferred based on index_col.\n"
288+
"This argument is still necessary if reading Excel output "
289+
"from 0.16.2 or prior with index names.", FutureWarning,
290+
stacklevel=3)
291+
292+
if 'chunksize' in kwds:
293+
raise NotImplementedError("Reading an Excel file in chunks "
294+
"is not implemented")
295+
279296
import xlrd
280297
from xlrd import (xldate, XL_CELL_DATE,
281298
XL_CELL_ERROR, XL_CELL_BOOLEAN,
@@ -416,7 +433,6 @@ def _parse_cell(cell_contents,cell_typ):
416433
date_parser=date_parser,
417434
skiprows=skiprows,
418435
skip_footer=skip_footer,
419-
chunksize=chunksize,
420436
**kwds)
421437

422438
output[asheetname] = parser.read()
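
The dispatch that lets ``read_excel`` accept either a path or an already-open ``ExcelFile`` can be sketched without pandas. Everything below is illustrative: ``Workbook``, ``read_sheet`` and the in-memory ``FAKE_FILES`` table are hypothetical names, and only the ``isinstance`` check mirrors the diff.

```python
# Hypothetical in-memory "files": path -> {sheet name -> rows}.
FAKE_FILES = {'book.xls': {'Sheet1': [[1, 2]], 'Sheet2': [[3, 4]]}}


class Workbook(object):
    """Hypothetical wrapper that loads a file once and serves its sheets."""
    loads = 0  # counts how many times any file was loaded

    def __init__(self, path):
        Workbook.loads += 1
        self._sheets = FAKE_FILES[path]

    @property
    def sheet_names(self):
        return sorted(self._sheets)

    def _parse(self, sheetname):
        return self._sheets[sheetname]

    # Context-manager support, so the wrapper can be used in a with-block.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._sheets = None
        return False


def read_sheet(io, sheetname):
    # Wrap plain paths; pass an existing wrapper straight through.
    if not isinstance(io, Workbook):
        io = Workbook(io)
    return io._parse(sheetname)


# Passing a path loads the file on every call.
read_sheet('book.xls', 'Sheet1')
read_sheet('book.xls', 'Sheet2')
loads_with_paths = Workbook.loads  # 2

# Passing the wrapper loads it once, however many sheets are read.
Workbook.loads = 0
with Workbook('book.xls') as wb:
    data = {name: read_sheet(wb, name) for name in wb.sheet_names}
loads_with_wrapper = Workbook.loads  # 1
```

The load counter stands in for the cost the documentation describes: reusing one wrapper avoids reading the file into memory again for each sheet.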
