Pandas read_excel sometimes using xlrd which has deprecated code in python 3.8.1 #30851

Closed
mhooreman opened this issue Jan 9, 2020 · 6 comments
Labels
IO Excel read_excel, to_excel

Comments

@mhooreman

Hello,

Under some circumstances that I'm unable to systematically reproduce, pd.read_excel still uses xlrd.

With python 3.8, I get a DeprecationWarning, which I can't fix.

Since xlrd no longer has a maintainer, I am coming to you for advice. Would you be so kind as to help me?

I'm using pd.read_excel within joblib parallel subprocesses, using the default backend. No specific options are passed to read_excel, and I have a mix of xls and xlsx files. The "interesting part" of the exception is shown below.

Unfortunately, when I try it manually using ipython interactive shell, I have no issue, even with the parallel joblib processing.

Thanks a lot.

  File "########/src/context/erpcrm_new/etl/_chunks/_base.py", line 229, in readExcel
    ret = pd.read_excel(
  File "########/.local/share/virtualenvs-_GHTURhP/lib/python3.8/site-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "########/.local/share/virtualenvs-_GHTURhP/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 310, in read_excel
    io = ExcelFile(io, engine=engine)
  File "########/.local/share/virtualenvs-_GHTURhP/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 819, in __init__
    self._reader = self._engines[engine](self._io)
  File "########/.local/share/virtualenvs-_GHTURhP/lib/python3.8/site-packages/pandas/io/excel/_xlrd.py", line 21, in __init__
    super().__init__(filepath_or_buffer)
  File "########/.local/share/virtualenvs-_GHTURhP/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 359, in __init__
    self.book = self.load_workbook(filepath_or_buffer)
  File "########/.local/share/virtualenvs-_GHTURhP/lib/python3.8/site-packages/pandas/io/excel/_xlrd.py", line 36, in load_workbook
    return open_workbook(filepath_or_buffer)
  File "########/.local/share/virtualenvs-_GHTURhP/lib/python3.8/site-packages/xlrd/__init__.py", line 130, in open_workbook
    bk = xlsx.open_workbook_2007_xml(
  File "########/.local/share/virtualenvs-_GHTURhP/lib/python3.8/site-packages/xlrd/xlsx.py", line 812, in open_workbook_2007_xml
    x12book.process_stream(zflo, 'Workbook')
  File "########/.local/share/virtualenvs-_GHTURhP/lib/python3.8/site-packages/xlrd/xlsx.py", line 266, in process_stream
    for elem in self.tree.iter() if Element_has_iter else self.tree.getiterator():
  File "/opt/python/3.8.1/lib/python3.8/xml/etree/ElementTree.py", line 622, in getiterator
    warnings.warn(
DeprecationWarning: This method will be removed in future versions.  Use 'tree.iter()' or 'list(tree.iter())' instead.
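
The traceback above ends in a DeprecationWarning raised as an exception, which suggests (an assumption, not confirmed in this thread) that a warnings filter is promoting warnings to errors in the worker processes. A minimal sketch of suppressing only DeprecationWarning around a single call, using the standard `warnings` module; the helper name `call_ignoring_deprecation` is hypothetical:

```python
import warnings


def call_ignoring_deprecation(func, *args, **kwargs):
    """Call func with DeprecationWarning ignored for this call only."""
    # catch_warnings() saves the current filters and restores them on exit,
    # so the "ignore" rule does not leak into the rest of the program.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return func(*args, **kwargs)


# Hypothetical usage inside the function that joblib runs in each worker:
#     df = call_ignoring_deprecation(pd.read_excel, path)
```

Filters set in the parent process may not apply inside joblib workers, so the wrapper is probably best used inside the function executed by each worker.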
@WillAyd
Member

WillAyd commented Jan 9, 2020

read_excel uses xlrd by default. We have a PR to deprecate that default (#29375), but it is not yet merged.

For now, you can explicitly request pd.read_excel(..., engine="openpyxl") until that default changes.

@jbrockmendel jbrockmendel added the IO Excel read_excel, to_excel label Jan 9, 2020
@WillAyd
Member

WillAyd commented Jan 9, 2020

Closing, as we currently always use xlrd by default; we could use help deprecating it in the mentioned PR, if that is something you are interested in, @mhooreman

@WillAyd WillAyd closed this as completed Jan 9, 2020
@mhooreman
Author

Thanks @WillAyd. I'll pass openpyxl as the engine argument.
Please note that the pd.read_excel doc does not fully match the implemented behavior:

engine : str, default None
If io is not a buffer or path, this must be set to identify io.
Acceptable values are None or xlrd.

while the code of io.excel._base.ExcelFile gives:

_engines = {"xlrd": _XlrdReader, "openpyxl": _OpenpyxlReader, "odf": _ODFReader}

@mhooreman
Author

Sorry to come back to this, guys, but I also have xls files (old format) in my data source, so xlrd will have to be used in that case. Any ideas for a workaround?

@WillAyd
Member

WillAyd commented Jan 10, 2020

The doc issue you mention should already be fixed in dev. With regard to the files, none of the other Excel engines support reading .xls files. You can continue to use xlrd (I don't think we will remove it outright, just move the default to openpyxl), but obviously that project is unmaintained, so there are no guarantees on how that will work in the long run
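
For a mix of .xls and .xlsx files, one option is to choose the engine from the file extension. The sketch below assumes the mapping described in this thread (xlrd for .xls, openpyxl for .xlsx); the names `ENGINE_BY_SUFFIX`, `pick_engine`, and `read_any_excel` are hypothetical, and both engines would need to be installed:

```python
from pathlib import Path

# Assumed mapping, taken from the discussion above: only xlrd reads the
# old .xls format, while openpyxl handles .xlsx.
ENGINE_BY_SUFFIX = {".xls": "xlrd", ".xlsx": "openpyxl"}


def pick_engine(path):
    """Return the pandas Excel engine name for a path, based on its suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in ENGINE_BY_SUFFIX:
        raise ValueError(f"no Excel engine configured for suffix {suffix!r}")
    return ENGINE_BY_SUFFIX[suffix]


def read_any_excel(path, **kwargs):
    """Read an Excel file with an explicitly chosen engine."""
    import pandas as pd  # imported lazily so pick_engine works without pandas

    return pd.read_excel(path, engine=pick_engine(path), **kwargs)
```

Passing the engine explicitly keeps the choice independent of whatever default read_excel uses in a given pandas version.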

@CursosAGT

With Python 3.10 and pandas==1.3.5 (2021/12), the problem still occurs, with xml.parsers.expat.ExpatError: mismatched tag:

Planillas_EXCEL\variable_tiempo.xlsx dimi_fecha
Traceback (most recent call last):

File "C:\Python310\lib\xml\etree\ElementTree.py", line 1718, in feed
    self.parser.Parse(data, False)

xml.parsers.expat.ExpatError: mismatched tag: line 2, column 313904

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

df = pd.read_excel(file_, engine='openpyxl')

or

df = pd.read_excel(file_)

File "C:\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
File "C:\Python310\lib\site-packages\pandas\io\excel\_base.py", line 372, in read_excel
    data = io.parse(
File "C:\Python310\lib\site-packages\pandas\io\excel\_base.py", line 1272, in parse
    return self._reader.parse(
File "C:\Python310\lib\site-packages\pandas\io\excel\_base.py", line 539, in parse
    data = self.get_sheet_data(sheet, convert_float)
