Skip to content

BUG: read_excel(engine='openpyxl') with "unsized" XLSX issue #34821

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
kuraga opened this issue Jun 16, 2020 · 6 comments
Closed
2 of 3 tasks

BUG: read_excel(engine='openpyxl') with "unsized" XLSX issue #34821

kuraga opened this issue Jun 16, 2020 · 6 comments
Labels
Bug Closing Candidate May be closeable, needs more eyeballs IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue

Comments

@kuraga
Copy link

kuraga commented Jun 16, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Problem description

Consider code:

import pandas as pd
pd.read_excel('test.xlsx', index_col=[0, 1], engine='openpyxl')

with some special test.xlsx (see below).

According to openpyxl issue #1483 and openpyxl documentation, user (i.e. Pandas) have to call sheet.calculate_dimension(force=True) if any workbook's worksheet is "unsized".

(What is "worksheet is unsized"? It means: the worksheet doesn't have DIMENSION_TAG, see WorkSheetParser.parse_dimensions(). When it checked? On ReadOnlyWorksheet object construction.)

But Pandas doesn't do it.

So, if the worksheet is "unsized", on reading it by read_excel(engine='openpyxl') we try to get sheet data. We iterate rows falling down into openpyxl: 1 -> 2 -> 3 -> 4 (with max_col == None) -> 5 -> 6 (it returns self._max_column == None) -> 7 -> 8 (with max_col == None) -> 9 -> 10 (with max_col == None). And in these lines:

max_col = max_col or  row[-1]['column']
row_width = max_col + 1 - min_col

we have: row's cells number (row_width) is individual for each row (row).

Now consider an unsized Excel table (I have a real file but it's private; after saving in LibreOffice it's become sized):
image
We get sheet data:

[ [`A1`],
  [ 'B1', 'B2' ] ]

But index_col=[0, 1] suppose to have >=2 cols in each row:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    pd.read_excel('test.xlsx', index_col=[0, 1], engine='openpyxl')
  File "/home/sasha/miniconda3/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 334, in read_excel
    **kwds,
  File "/home/sasha/miniconda3/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 888, in parse
    **kwds,
  File "/home/sasha/miniconda3/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 480, in parse
    last = data[offset][col]
IndexError: list index out of range

IMHO, patch could be:

diff --git a/pandas/io/excel/_openpyxl.py b/pandas/io/excel/_openpyxl.py
index c4327316d..3efbf4abc 100644
--- a/pandas/io/excel/_openpyxl.py
+++ b/pandas/io/excel/_openpyxl.py
@@ -536,6 +536,9 @@ class _OpenpyxlReader(_BaseExcelReader):
 
     def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
         data: List[List[Scalar]] = []
+
+        sheet.calculate_dimension(force=True)
+
         for row in sheet.rows:
             data.append([self._convert_cell(cell, convert_float) for cell in row])

(But it may provide some drawbacks).
With this patch, sheet data would be:

[ [`A1`, ''],
  [ 'B1', 'B2' ] ]

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.42-calculate
machine : x86_64
processor : Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
byteorder : little
LC_ALL : None
LANG : ru_RU.utf8
LOCALE : ru_RU.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.4.0.post20200518
Cython : 0.29.17
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.8
numba : None

Thanks!

@kuraga kuraga added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 16, 2020
@jbrockmendel
Copy link
Member

According to openpyxl issue #1483 and openpyxl documentation, user (i.e. Pandas) have to call sheet.calculate_dimension(force=True) if any workbook's worksheet is "unsized".

Want to make a PR to do this?

@jbrockmendel jbrockmendel added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2020
@theoniko
Copy link
Contributor

take

@theoniko theoniko removed their assignment Dec 21, 2020
@asishm
Copy link
Contributor

asishm commented Jan 31, 2021

@kuraga would you be able to check if #39486 fixes this?

@rhshadrach
Copy link
Member

@kuraga - My expectation is yes, but without a sheet to test it on there is nothing we can do. I'm going to mark this accordingly but if we get no response, my thought is to close.

@rhshadrach rhshadrach added Needs Info Clarification about behavior needed to assess issue Closing Candidate May be closeable, needs more eyeballs labels Feb 1, 2021
@kuraga
Copy link
Author

kuraga commented Feb 2, 2021

Sorry don't have sources now.

@rhshadrach
Copy link
Member

@kuraga - not a problem, thanks for the response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Closing Candidate May be closeable, needs more eyeballs IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue
Projects
None yet
Development

No branches or pull requests

5 participants