Skip to content

DEPR: Adjust read excel behavior for xlrd >= 2.0 #38571

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Dec 23, 2020
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
19 commits
Select commit Hold shift + click to select a range
06d79b9
DEPR: Adjust read excel behavior for xlrd >= 2.0
rhshadrach Dec 13, 2020
e2c29ff
Use stringified path, suppress warnings in io.test_common
rhshadrach Dec 19, 2020
f56970b
Use get_handle & seek
rhshadrach Dec 19, 2020
e992049
inspection raises if not a xls/zip file; tests for missing/corrupt file
rhshadrach Dec 19, 2020
13cd483
Merge branch 'master' of https://github.com/pandas-dev/pandas into xl…
rhshadrach Dec 20, 2020
792b53a
Minor doc fixes
rhshadrach Dec 20, 2020
266b18b
Fix test to handle zh_CN.utf8
rhshadrach Dec 20, 2020
56cf956
Use LooseVersion
rhshadrach Dec 21, 2020
e1878a0
Merge branch 'master' of https://github.com/pandas-dev/pandas into xl…
rhshadrach Dec 21, 2020
0146258
DEPR: Adjust read excel behavior for xlrd >= 2.0
rhshadrach Dec 22, 2020
00bfba3
Improvements from feedback, xlrd>=2.0 for two builds
rhshadrach Dec 22, 2020
05c0725
Merge branch 'master' of https://github.com/pandas-dev/pandas into xl…
rhshadrach Dec 22, 2020
4a8801d
Ignore FutureWarning in excel tests
rhshadrach Dec 23, 2020
59e0cb9
Added debug statements Windows
rhshadrach Dec 23, 2020
8f8b74e
Fixed stacklevel determination for Windows, added test
rhshadrach Dec 23, 2020
f286497
Merge branch 'master' of https://github.com/pandas-dev/pandas into xl…
rhshadrach Dec 23, 2020
8c95acd
Changed error message on non-zip file
rhshadrach Dec 23, 2020
4619363
Handle error with xlsb files
rhshadrach Dec 23, 2020
05a6f09
xfail instead
rhshadrach Dec 23, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 7 additions & 11 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,23 +10,19 @@ including other versions of pandas.

.. warning::

The packages `xlrd <https://xlrd.readthedocs.io/en/latest/>`_ for reading excel
files and `xlwt <https://xlwt.readthedocs.io/en/latest/>`_ for
writing excel files are no longer maintained. These are the only engines in pandas
The package `xlrd <https://xlrd.readthedocs.io/en/latest/>`_ for reading excel
files now only supports the xls format. The package
`xlwt <https://xlwt.readthedocs.io/en/latest/>`_ for writing excel files
is no longer maintained. These are the only engines in pandas
that support the xls format.

Previously, the default argument ``engine=None`` to ``pd.read_excel``
would result in using the ``xlrd`` engine in many cases. If
`openpyxl <https://openpyxl.readthedocs.io/en/stable/>`_ is installed,
many of these cases will now default to using the ``openpyxl`` engine.
See the :func:`read_excel` documentation for more details. Attempting to read
``.xls`` files or specifying ``engine="xlrd"`` to ``pd.read_excel`` will not
raise a warning. However users should be aware that ``xlrd`` is already
broken with certain package configurations, for example with Python 3.9
when `defusedxml <https://github.com/tiran/defusedxml/>`_ is installed, and
is anticipated to be unusable in the future.

Attempting to use the the ``xlwt`` engine will raise a ``FutureWarning``
See the :func:`read_excel` documentation for more details.

Attempting to use the ``xlwt`` engine will raise a ``FutureWarning``
unless the option :attr:`io.excel.xls.writer` is set to ``"xlwt"``.
While this option is now deprecated and will also raise a ``FutureWarning``,
it can be globally set and the warning suppressed. Users are recommended to
Expand Down
234 changes: 149 additions & 85 deletions pandas/io/excel/_base.py
Original file line number Diff line number Diff line change
@@ -1,23 +1,26 @@
import abc
import datetime
from distutils.version import LooseVersion
import inspect
from io import BufferedIOBase, BytesIO, RawIOBase
import os
from textwrap import fill
from typing import Any, Dict, Mapping, Union, cast
from typing import IO, Any, Dict, Mapping, Optional, Union, cast
import warnings
import zipfile

from pandas._config import config

from pandas._libs.parsers import STR_NA_VALUES
from pandas._typing import Buffer, FilePathOrBuffer, StorageOptions
from pandas.compat._optional import import_optional_dependency
from pandas.errors import EmptyDataError
from pandas.util._decorators import Appender, deprecate_nonkeyword_arguments
from pandas.util._decorators import Appender, deprecate_nonkeyword_arguments, doc

from pandas.core.dtypes.common import is_bool, is_float, is_integer, is_list_like

from pandas.core.frame import DataFrame
from pandas.core.shared_docs import _shared_docs

from pandas.io.common import IOHandles, get_handle, stringify_path, validate_header_arg
from pandas.io.excel._util import (
Expand Down Expand Up @@ -111,22 +114,19 @@
- "pyxlsb" supports Binary Excel files.

.. versionchanged:: 1.2.0
The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_
is no longer maintained, and is not supported with
python >= 3.9. When ``engine=None``, the following logic will be
used to determine the engine.

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),
then `odf <https://pypi.org/project/odfpy/>`_ will be used.
- Otherwise if ``path_or_buffer`` is a bytes stream, the file has the
extension ``.xls``, or is an ``xlrd`` Book instance, then ``xlrd`` will
be used.
- Otherwise if `openpyxl <https://pypi.org/project/openpyxl/>`_ is installed,
then ``openpyxl`` will be used.
- Otherwise ``xlrd`` will be used and a ``FutureWarning`` will be raised.

Specifying ``engine="xlrd"`` will continue to be allowed for the
indefinite future.

The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_ only reads
xls files in version 2.0 and above. When ``engine=None``,
the following logic will be used to determine the engine.

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),
then `odf <https://pypi.org/project/odfpy/>`_ will be used.
- Otherwise if ``path_or_buffer`` is an xls format,
``xlrd`` will be used.
- Otherwise if `openpyxl <https://pypi.org/project/openpyxl/>`_ is installed,
then ``openpyxl`` will be used.
- Otherwise if ``xlrd >= 2.0`` is installed, a ``ValueError`` will be raised.
- Otherwise ``xlrd`` will be used and a ``FutureWarning`` will be raised.

converters : dict, default None
Dict of functions for converting values in certain columns. Keys can
Expand Down Expand Up @@ -888,39 +888,90 @@ def close(self):
return content


def _is_ods_stream(stream: Union[BufferedIOBase, RawIOBase]) -> bool:
XLS_SIGNATURE = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
ZIP_SIGNATURE = b"PK\x03\x04"
PEEK_SIZE = max(len(XLS_SIGNATURE), len(ZIP_SIGNATURE))


@doc(storage_options=_shared_docs["storage_options"])
def inspect_excel_format(
path: Optional[str] = None,
content: Union[None, BufferedIOBase, RawIOBase, bytes] = None,
storage_options: StorageOptions = None,
) -> str:
"""
Check if the stream is an OpenDocument Spreadsheet (.ods) file
Inspect the path or content of an excel file and get its format.

At least one of path or content must be not None. If both are not None,
content will take precedence.

It uses magic values inside the stream
Adopted from xlrd: https://github.com/python-excel/xlrd.

Parameters
----------
stream : Union[BufferedIOBase, RawIOBase]
IO stream with data which might be an ODS file
path : str, optional
Path to file to inspect. May be a URL.
content : file-like object, optional
Content of file to inspect.
{storage_options}

Returns
-------
is_ods : bool
Boolean indication that this is indeed an ODS file or not
str
Format of file.

Raises
------
ValueError
If resulting stream is empty.
BadZipFile
If resulting stream does not have an XLS signature and is not a valid zipfile.
"""
stream.seek(0)
is_ods = False
if stream.read(4) == b"PK\003\004":
stream.seek(30)
is_ods = (
stream.read(54) == b"mimetype"
b"application/vnd.oasis.opendocument.spreadsheet"
)
stream.seek(0)
return is_ods
content_or_path: Union[None, str, BufferedIOBase, RawIOBase, IO[bytes]]
if isinstance(content, bytes):
content_or_path = BytesIO(content)
else:
content_or_path = content or path
assert content_or_path is not None

with get_handle(
content_or_path, "rb", storage_options=storage_options, is_text=False
) as handle:
stream = handle.handle
stream.seek(0)
buf = stream.read(PEEK_SIZE)
if buf is None:
raise ValueError("stream is empty")
else:
assert isinstance(buf, bytes)
peek = buf
stream.seek(0)

if peek.startswith(XLS_SIGNATURE):
return "xls"

# ZipFile typing is overly-strict
# https://github.com/python/typeshed/issues/4212
zf = zipfile.ZipFile(stream) # type: ignore[arg-type]

# Workaround for some third party files that use forward slashes and
# lower case names.
component_names = [name.replace("\\", "/").lower() for name in zf.namelist()]

if "xl/workbook.xml" in component_names:
return "xlsx"
if "xl/workbook.bin" in component_names:
return "xlsb"
if "content.xml" in component_names:
return "ods"
return "zip"


class ExcelFile:
"""
Class for parsing tabular excel sheets into DataFrame objects.

Uses xlrd engine by default. See read_excel for more documentation
See read_excel for more documentation.

Parameters
----------
Expand All @@ -940,22 +991,18 @@ class ExcelFile:

.. versionchanged:: 1.2.0

The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_
is no longer maintained, and is not supported with
python >= 3.9. When ``engine=None``, the following logic will be
used to determine the engine.
The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_ only reads
xls files in version 2.0 and above. When ``engine=None``,
the following logic will be used to determine the engine.

- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),
then `odf <https://pypi.org/project/odfpy/>`_ will be used.
- Otherwise if ``path_or_buffer`` is a bytes stream, the file has the
extension ``.xls``, or is an ``xlrd`` Book instance, then ``xlrd``
will be used.
- Otherwise if ``path_or_buffer`` is an xls format,
``xlrd`` will be used.
- Otherwise if `openpyxl <https://pypi.org/project/openpyxl/>`_ is installed,
then ``openpyxl`` will be used.
- Otherwise if ``xlrd >= 2.0`` is installed, a ``ValueError`` will be raised.
- Otherwise ``xlrd`` will be used and a ``FutureWarning`` will be raised.

Specifying ``engine="xlrd"`` will continue to be allowed for the
indefinite future.
"""

from pandas.io.excel._odfreader import ODFReader
Expand All @@ -973,40 +1020,55 @@ class ExcelFile:
def __init__(
self, path_or_buffer, engine=None, storage_options: StorageOptions = None
):
if engine is None:
# Determine ext and use odf for ods stream/file
if isinstance(path_or_buffer, (BufferedIOBase, RawIOBase)):
ext = None
if _is_ods_stream(path_or_buffer):
engine = "odf"
else:
ext = os.path.splitext(str(path_or_buffer))[-1]
if ext == ".ods":
engine = "odf"
if engine is not None and engine not in self._engines:
raise ValueError(f"Unknown engine: {engine}")

if (
import_optional_dependency(
"xlrd", raise_on_missing=False, on_version="ignore"
)
is not None
):
from xlrd import Book
# Could be a str, ExcelFile, Book, etc.
self.io = path_or_buffer
# Always a string
self._io = stringify_path(path_or_buffer)

if isinstance(path_or_buffer, Book):
engine = "xlrd"
# Determine xlrd version if installed
if (
import_optional_dependency(
"xlrd", raise_on_missing=False, on_version="ignore"
)
is None
):
xlrd_version = None
else:
import xlrd

# GH 35029 - Prefer openpyxl except for xls files
if engine is None:
if ext is None or isinstance(path_or_buffer, bytes) or ext == ".xls":
engine = "xlrd"
elif (
xlrd_version = LooseVersion(xlrd.__version__)

if isinstance(path_or_buffer, (BufferedIOBase, RawIOBase, bytes)):
ext = inspect_excel_format(
content=path_or_buffer, storage_options=storage_options
)
elif xlrd_version is not None and isinstance(path_or_buffer, xlrd.Book):
ext = "xls"
else:
# path_or_buffer is path-like, use stringified path
ext = inspect_excel_format(
path=str(self._io), storage_options=storage_options
)

if engine is None:
if ext == "ods":
engine = "odf"
elif ext == "xls":
engine = "xlrd"
else:
# GH 35029 - Prefer openpyxl except for xls files
if (
import_optional_dependency(
"openpyxl", raise_on_missing=False, on_version="ignore"
)
is not None
):
engine = "openpyxl"
else:
elif xlrd_version is not None and xlrd_version < "2":
# If xlrd_version >= "2", we error below
caller = inspect.stack()[1]
if (
caller.filename.endswith("pandas/io/excel/_base.py")
Expand All @@ -1016,28 +1078,30 @@ def __init__(
else:
stacklevel = 2
warnings.warn(
"The xlrd engine is no longer maintained and is not "
"supported when using pandas with python >= 3.9. However, "
"the engine xlrd will continue to be allowed for the "
"indefinite future. Beginning with pandas 1.2.0, the "
"openpyxl engine will be used if it is installed and the "
"engine argument is not specified. Either install openpyxl "
"or specify engine='xlrd' to silence this warning.",
f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, "
f"only the xls format is supported. As a result, the "
f"openpyxl engine will be used if it is installed "
f"and the engine argument is not specified. Either install "
f"openpyxl or specify engine='xlrd' to silence this warning.",
FutureWarning,
stacklevel=stacklevel,
)
engine = "xlrd"
if engine not in self._engines:
raise ValueError(f"Unknown engine: {engine}")

if (
engine == "xlrd"
and xlrd_version is not None
and xlrd_version >= "2"
and not ext == "xls"
):
raise ValueError(
f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, "
f"only the xls format is supported."
)

self.engine = engine
self.storage_options = storage_options

# Could be a str, ExcelFile, Book, etc.
self.io = path_or_buffer
# Always a string
self._io = stringify_path(path_or_buffer)

self._reader = self._engines[engine](self._io, storage_options=storage_options)

def __fspath__(self):
Expand Down
4 changes: 4 additions & 0 deletions pandas/tests/io/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,4 +14,8 @@
r"Use 'tree.iter\(\)' or 'list\(tree.iter\(\)\)' instead."
":PendingDeprecationWarning"
),
# GH 26552
pytest.mark.filterwarnings(
"ignore:As the xlwt package is no longer maintained:FutureWarning"
),
]
15 changes: 15 additions & 0 deletions pandas/tests/io/excel/__init__.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,9 @@
from distutils.version import LooseVersion

import pytest

from pandas.compat._optional import import_optional_dependency

pytestmark = [
pytest.mark.filterwarnings(
# Looks like tree.getiterator is deprecated in favor of tree.iter
Expand All @@ -14,3 +18,14 @@
"ignore:As the xlwt package is no longer maintained:FutureWarning"
),
]


if (
import_optional_dependency("xlrd", raise_on_missing=False, on_version="ignore")
is None
):
xlrd_version = None
else:
import xlrd

xlrd_version = LooseVersion(xlrd.__version__)
Loading