BUG: read_excel(engine='openpyxl') with "unsized" XLSX issue #34821

kuraga · 2020-06-16T10:21:39Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Problem description

Consider code:

import pandas as pd
pd.read_excel('test.xlsx', index_col=[0, 1], engine='openpyxl')

with some special test.xlsx (see below).

According to openpyxl issue #1483 and openpyxl documentation, user (i.e. Pandas) have to call sheet.calculate_dimension(force=True) if any workbook's worksheet is "unsized".

(What is "worksheet is unsized"? It means: the worksheet doesn't have DIMENSION_TAG, see WorkSheetParser.parse_dimensions(). When it checked? On ReadOnlyWorksheet object construction.)

But Pandas doesn't do it.

So, if the worksheet is "unsized", on reading it by read_excel(engine='openpyxl') we try to get sheet data. We iterate rows falling down into openpyxl: 1 -> 2 -> 3 -> 4 (with max_col == None) -> 5 -> 6 (it returns self._max_column == None) -> 7 -> 8 (with max_col == None) -> 9 -> 10 (with max_col == None). And in these lines:

max_col = max_col or  row[-1]['column']
row_width = max_col + 1 - min_col

we have: row's cells number (row_width) is individual for each row (row).

Now consider an unsized Excel table (I have a real file but it's private; after saving in LibreOffice it's become sized):

We get sheet data:

[ [`A1`],
  [ 'B1', 'B2' ] ]

But index_col=[0, 1] suppose to have >=2 cols in each row:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    pd.read_excel('test.xlsx', index_col=[0, 1], engine='openpyxl')
  File "/home/sasha/miniconda3/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 334, in read_excel
    **kwds,
  File "/home/sasha/miniconda3/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 888, in parse
    **kwds,
  File "/home/sasha/miniconda3/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 480, in parse
    last = data[offset][col]
IndexError: list index out of range

IMHO, patch could be:

diff --git a/pandas/io/excel/_openpyxl.py b/pandas/io/excel/_openpyxl.py
index c4327316d..3efbf4abc 100644
--- a/pandas/io/excel/_openpyxl.py
+++ b/pandas/io/excel/_openpyxl.py
@@ -536,6 +536,9 @@ class _OpenpyxlReader(_BaseExcelReader):
 
     def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
         data: List[List[Scalar]] = []
+
+        sheet.calculate_dimension(force=True)
+
         for row in sheet.rows:
             data.append([self._convert_cell(cell, convert_float) for cell in row])

(But it may provide some drawbacks).
With this patch, sheet data would be:

[ [`A1`, ''],
  [ 'B1', 'B2' ] ]

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.42-calculate
machine : x86_64
processor : Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
byteorder : little
LC_ALL : None
LANG : ru_RU.utf8
LOCALE : ru_RU.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.4.0.post20200518
Cython : 0.29.17
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.8
numba : None

Thanks!

The text was updated successfully, but these errors were encountered:

jbrockmendel · 2020-09-03T01:12:59Z

According to openpyxl issue #1483 and openpyxl documentation, user (i.e. Pandas) have to call sheet.calculate_dimension(force=True) if any workbook's worksheet is "unsized".

Want to make a PR to do this?

theoniko · 2020-12-21T12:22:13Z

take

asishm · 2021-01-31T22:01:20Z

@kuraga would you be able to check if #39486 fixes this?

rhshadrach · 2021-02-01T22:50:05Z

@kuraga - My expectation is yes, but without a sheet to test it on there is nothing we can do. I'm going to mark this accordingly but if we get no response, my thought is to close.

kuraga · 2021-02-02T23:02:59Z

Sorry don't have sources now.

rhshadrach · 2021-02-02T23:16:02Z

@kuraga - not a problem, thanks for the response.

kuraga added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 16, 2020

jbrockmendel added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2020

github-actions bot assigned theoniko Dec 21, 2020

theoniko removed their assignment Dec 21, 2020

rhshadrach added Needs Info Clarification about behavior needed to assess issue Closing Candidate May be closeable, needs more eyeballs labels Feb 1, 2021

rhshadrach closed this as completed Feb 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: read_excel(engine='openpyxl') with "unsized" XLSX issue #34821

BUG: read_excel(engine='openpyxl') with "unsized" XLSX issue #34821

kuraga commented Jun 16, 2020 •

edited

Loading

INSTALLED VERSIONS

jbrockmendel commented Sep 3, 2020

theoniko commented Dec 21, 2020

asishm commented Jan 31, 2021

rhshadrach commented Feb 1, 2021

kuraga commented Feb 2, 2021

rhshadrach commented Feb 2, 2021

BUG: read_excel(engine='openpyxl') with "unsized" XLSX issue #34821

BUG: read_excel(engine='openpyxl') with "unsized" XLSX issue #34821

Comments

kuraga commented Jun 16, 2020 • edited Loading

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

jbrockmendel commented Sep 3, 2020

theoniko commented Dec 21, 2020

asishm commented Jan 31, 2021

rhshadrach commented Feb 1, 2021

kuraga commented Feb 2, 2021

rhshadrach commented Feb 2, 2021

kuraga commented Jun 16, 2020 •

edited

Loading

Output of `pd.show_versions()`