Skip to content

DEPR: Deprecate using xlrd engine for read_excel #35029

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Dec 1, 2020
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -144,6 +144,8 @@ Deprecations
~~~~~~~~~~~~
- Deprecated parameter ``inplace`` in :meth:`MultiIndex.set_codes` and :meth:`MultiIndex.set_levels` (:issue:`35626`)
- Deprecated parameter ``dtype`` in :~meth:`Index.copy` on method all index classes. Use the :meth:`Index.astype` method instead for changing dtype(:issue:`35853`)
- :func:`read_excel` "xlrd" engine is deprecated. The recommended engine is "openpyxl" for "xlsx" and "xlsm" files, because "xlrd" is no longer maintained (:issue:`28547`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for updating. it's the ``read_excel "xlrd" engine is deprecated bit that I wanted removed

IIUC the xlrd is not deprecated. it's only that that default engine used will change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it deprecated for use with xlsx files? IOW in the future, will only openpyxl be supported for xlsx? Sounds reasonable to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, that's not my understanding of the discussion in #28547

from #28547 (comment)

Considering that I think we need to deprecate using xlrd in favor of openpyxl. We might not necessarily need to remove the former and it does offer some functionality the latter doesn't (namely reading .xls files) but should at the very least start moving towards the latter

Indeed, the discussion has not explicitly mentioned disallowing xlrd for formats that openpyxl supports. but if the xlrd engine is not removed, we should decide now whether we would restrict it's use.

Copy link
Contributor

@bashtage bashtage Aug 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My read of this is to only keep xlrd for xls where it is required, and to deprecate where it is not. In the long run, if xlrd breaks and no one takes over its maintenance, then we will either have to vendor xlrd or remove support for xls. In either case minimizing use of xlrd seems like a good idea to me.

I suppose to me the only point of deprecating is to start a path to removal of a feature. If there is not going to be removal in the future, then why bother with deprecation?

One could always discourage xlrd + xlsx with a noisy FutureWarning telling users that xlrd is unmaintained and they should install openpyxl for reading xlsx files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd is very specific regarding this issue. He states that we should warn first, then change the default (and maybe even remove the xlrd engine for xlsx and xlsm files altogether).
But not at this point, only a deprecation warning is asked, to notify users that this engine is no longer the preferred engine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify we should only be changing the default reader to openpyxl. I think it's fine to keep xlrd around as a YMMV situation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I just reverted the changes that made openpyxl the default. I am now very confused. I thought that this change was just about warning people about pending deprecation of the xlrd reader, not to switch them over already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See your comments of July 1 and 3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where the wording is confusing, but yes we only warn now and change in the future. We always manage user-facing changes that way

-
-

.. ---------------------------------------------------------------------------
Expand Down
18 changes: 15 additions & 3 deletions pandas/io/excel/_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
import os
from textwrap import fill
from typing import Any, Mapping, Union
import warnings

from pandas._config import config

Expand Down Expand Up @@ -837,7 +838,7 @@ class ExcelFile:
engine : str, default None
If io is not a buffer or path, this must be set to identify io.
Supported engines: ``xlrd``, ``openpyxl``, ``odf``, ``pyxlsb``,
default ``xlrd``.
default ``xlrd`` for .xls* files, ``odf`` for .ods files.
Engine compatibility :
- ``xlrd`` supports most old/new Excel file formats.
- ``openpyxl`` supports newer Excel file formats.
Expand All @@ -860,15 +861,26 @@ class ExcelFile:
def __init__(
self, path_or_buffer, engine=None, storage_options: StorageOptions = None
):
ext = None
if not isinstance(path_or_buffer, (BufferedIOBase, RawIOBase)):
ext = os.path.splitext(str(path_or_buffer))[-1][1:]

if engine is None:
engine = "xlrd"
if isinstance(path_or_buffer, (BufferedIOBase, RawIOBase)):
if _is_ods_stream(path_or_buffer):
engine = "odf"
else:
ext = os.path.splitext(str(path_or_buffer))[-1]
if ext == ".ods":
if ext == "ods":
engine = "odf"

elif engine == "xlrd" and ext in ("xlsx", "xlsm"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning should be in the if engine is None branch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? This will mean that that people also get a warning when they ask for the default (which is still xlrd), instead of when they explicitly ask for xlrd.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea so the point of it is that people who want to suppress the warnings will get a head start and explicitly request engine="openpyxl", which is a good thing to sniff out any bugs

warnings.warn(
'The Excel reader engine "xlrd" is deprecated, use "openpyxl" instead. '
'Specify engine="openpyxl" to suppress this warning.',
FutureWarning,
stacklevel=2,
)
if engine not in self._engines:
raise ValueError(f"Unknown engine: {engine}")

Expand Down
7 changes: 6 additions & 1 deletion pandas/io/excel/_openpyxl.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
from datetime import datetime
from typing import List

import numpy as np
Expand Down Expand Up @@ -517,7 +518,11 @@ def _convert_cell(self, cell, convert_float: bool) -> Scalar:

# TODO: replace with openpyxl constants
if cell.is_date:
return cell.value
try:
# workaround for inaccurate timestamp notation in excel
return datetime.fromtimestamp(round(cell.value.timestamp()))
except (AttributeError, OSError):
return cell.value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this changing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a work-around for a bug in openpyxl (https://foss.heptapod.net/openpyxl/openpyxl/-/issues/1493), but is only apparent when you do a round trip save to xlsx and read back xlsx using openpyxl.
As this is not tested in any unit test, this can be removed. Agreed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If adding code we should have a test. so either need to add test or can remove.

my preference would be to have this in a separate PR, so should raise pandas issue for this if removing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, remove it here and make a separate PR that includes a regression test.

elif cell.data_type == "e":
return np.nan
elif cell.data_type == "b":
Expand Down
8 changes: 7 additions & 1 deletion pandas/tests/io/excel/test_readers.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,9 @@
marks=[
td.skip_if_no("xlrd"),
pytest.mark.filterwarnings("ignore:.*(tree\\.iter|html argument)"),
pytest.mark.filterwarnings(
'ignore:The Excel reader engine "xlrd" is deprecated,'
),
],
),
pytest.param(
Expand Down Expand Up @@ -942,7 +945,10 @@ def test_read_excel_squeeze(self, read_ext):
expected = pd.Series([1, 2, 3], name="a")
tm.assert_series_equal(actual, expected)

def test_deprecated_kwargs(self, read_ext):
def test_deprecated_kwargs(self, engine, read_ext):
if engine == "xlrd":
pytest.skip("Use of xlrd engine produces a FutureWarning as well")

with tm.assert_produces_warning(FutureWarning, raise_on_extra_warnings=False):
pd.read_excel("test1" + read_ext, "Sheet1", 0)

Expand Down
3 changes: 3 additions & 0 deletions pandas/tests/io/excel/test_writers.py
Original file line number Diff line number Diff line change
Expand Up @@ -1199,6 +1199,9 @@ def test_datetimes(self, path):

tm.assert_series_equal(write_frame["A"], read_frame["A"])

@pytest.mark.filterwarnings(
'ignore:The Excel reader engine "xlrd" is deprecated:FutureWarning'
)
def test_bytes_io(self, engine):
# see gh-7074
bio = BytesIO()
Expand Down
28 changes: 27 additions & 1 deletion pandas/tests/io/excel/test_xlrd.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,9 @@ def skip_ods_and_xlsb_files(read_ext):
pytest.skip("Not valid for xlrd")


@pytest.mark.filterwarnings(
'ignore:The Excel reader engine "xlrd" is deprecated:FutureWarning'
)
def test_read_xlrd_book(read_ext, frame):
df = frame

Expand All @@ -36,8 +39,31 @@ def test_read_xlrd_book(read_ext, frame):


# TODO: test for openpyxl as well
@pytest.mark.filterwarnings(
'ignore:The Excel reader engine "xlrd" is deprecated:FutureWarning'
)
def test_excel_table_sheet_by_index(datapath, read_ext):
path = datapath("io", "data", "excel", f"test1{read_ext}")
with pd.ExcelFile(path) as excel:
with pd.ExcelFile(path, engine="xlrd") as excel:
with pytest.raises(xlrd.XLRDError):
pd.read_excel(excel, sheet_name="asdf")


def test_excel_file_warning_with_xlsx_file(datapath):
# GH 29375
path = datapath("io", "data", "excel", "test1.xlsx")
with tm.assert_produces_warning(
FutureWarning, check_stacklevel=True, raise_on_extra_warnings=False
) as w:
pd.ExcelFile(path, engine="xlrd")
assert '"xlrd" is deprecated, use "openpyxl" instead.' in str(w[0].message)


def test_read_excel_warning_with_xlsx_file(tmpdir, datapath):
# GH 29375
path = datapath("io", "data", "excel", "test1.xlsx")
with tm.assert_produces_warning(
FutureWarning, check_stacklevel=False, raise_on_extra_warnings=False
) as w:
pd.read_excel(path, "Sheet1", engine="xlrd")
assert '"xlrd" is deprecated, use "openpyxl" instead.' in str(w[0].message)