BUG: read_excel() using openpyxl engine header argument not working as expected #38956

JuliaWilkinsSonos · 2021-01-04T20:35:24Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

I am happy to provide the file I am using privately, offline if necessary. This would be necessary to directly reproduce this error.

import pandas as pd
df = pd.read_excel(myfile.xlsx, engine='openpyxl', header=2)

Throws error:
...ValueError: Passed header=2 but only 2 lines in file

Problem description

I changed my read_excel() code to be now using engine='openpyxl with the 1.2 read_excel() changes, but the header argument is no longer working as it was before. I am trying to read in a .xlsx file that has 2 initial rows that should be skipped, and the 3rd row contains the headers. Thus, I am passing header=2, which should automatically skip the first two rows and use the 3rd (0-indexed) row as the headers. However, I am getting the following error: ValueError: Passed header=2 but only 2 lines in file. I have confirmed multiple times that this file contains more than 2 lines - it contains thousands (happy to provide this file offline if necessary). I was able to previously run this code with the same file (before the xlrd deprecation, using the same header specification. With the openpyxl changes in 1.2, should we be using the header argument any differently? Thank you.

This is related to changes in #35029.

Expected Output

Intended behavior would be that the .xlsx file is successfully read in as a dataframe using headers in row 3 and does not throw the above error.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.6.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.3
numpy            : 1.19.4
pytz             : 2020.5
dateutil         : 2.8.1
pip              : 20.3.3
setuptools       : 50.3.2
Cython           : None
pytest           : 4.3.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 4.3.1
pyxlsb           : None
s3fs             : 0.4.2
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : None
xlsxwriter       : None
numba            : None

The text was updated successfully, but these errors were encountered:

phofl · 2021-01-04T20:53:39Z

Hi,

thanks for your report. Could you please provide a minimal example of your excel file? Can be filled with dummy data, if you can not share the real data. Without a file we won't be able to reproduce this.

JuliaWilkinsSonos · 2021-01-04T21:05:13Z

@phofl Here is a sample file, a truncated version with dummy data of my original. For some reason now this file is no longer throwing the error above, but the headers are not being read as specified.
sample.xlsx

Here's what I'm running:

>>> df = pd.read_excel('sample.xlsx', header=2, engine='openpyxl')

TLDR: not throwing an error, but header=2 does not read the 3rd line headers - header=1 does. Maybe openpyxl indexes differently here?

Thank you!

phofl · 2021-01-04T21:12:14Z

Weird, seems like the empty row is ommitted. Firs three rows are

   Survey ID       Response Date  ... Preferred Language  Net Promoter Score
0      100.0 2021-01-04 12:54:54  ...            English                10.0
1      100.0 2021-01-04 12:42:49  ...             German                10.0
2      100.0 2021-01-04 12:42:38  ...            English                10.0

with header=1.

Without header we get

                                                                                      Survey Export
Survey ID Response Date       Send Date                  ID  Preferred Language  Net Promoter Score
100       2021-01-04 12:54:54 2020-12-28 09:00:47.893000 100 English                             10
          2021-01-04 12:42:49 2021-01-04 09:00:14.089000 100 German                              10

I am not familiar enough with the implementation to asses if this is a bug or not.

We would need a file reproducing the error to debug this further.

JuliaWilkinsSonos · 2021-01-04T21:15:24Z

Ok thank you for looking into it. Trying to investigate why my original files are throwing that error but not the truncated dummy files. Will post back with a file that can reproduce the error soon if possible.

asishm · 2021-01-04T23:41:42Z

For the sample file -
the difference is how xlrd and openpyxl return the parsed data

tested with

def open_xlrd(path):
    import xlrd
    wb = xlrd.open_workbook(path)
    sheet = wb.sheet_by_index(0)
    data = []
    for i in range(sheet.nrows):
        data.append([value for value in sheet.row_values(i)])
    return data

def open_openpyxl(path):
    import openpyxl
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True, keep_links=False)
    sheet = wb.worksheets[0]
    from pandas.io.excel._openpyxl import OpenpyxlReader
    convert_cell = OpenpyxlReader(path)._convert_cell
    data = []
    for row in sheet.rows:
         data.append([convert_cell(cell, False) for cell in row])
    return data

In [29]: print(open_xlrd(path))
[['Survey Export', '', '', '', '', ''], ['', '', '', '', '', ''], ['Survey ID', 'Response Date', 'Send Date', 'ID', 'Preferred Language', 'Net Promoter Score'], [100.0, 44200.538125, 44193.37555431713, 100.0, 'English', 10.0], [100.0, 44200.5297337963, 44200.37516306713, 100.0, 'German', 10.0], [100.0, 44200.52960648148, 44198.37516083333, 100.0, 'English', 10.0], [100.0, 44200.52888888889, 44193.37555431713, 100.0, 'German', 5.0], [100.0, 44200.527650462966, 44200.37516306713, 100.0, 'English', 9.0], [100.0, 44200.527604166666, 44193.37555431713, 100.0, 'English', 8.0], [100.0, 44200.527407407404, 44200.37516306713, 100.0, 'German', 8.0]]

In[30] print(open_openpyxl(path))
[['Survey Export'], [], ['Survey ID', 'Response Date', 'Send Date', 'ID', 'Preferred Language', 'Net Promoter Score'], [100.0, datetime.datetime(2021, 1, 4, 12, 54, 54), datetime.datetime(2020, 12, 28, 9, 0, 47, 893000), 100.0, 'English', 10.0], [100.0, datetime.datetime(2021, 1, 4, 12, 42, 49), datetime.datetime(2021, 1, 4, 9, 0, 14, 89000), 100.0, 'German', 10.0], [100.0, datetime.datetime(2021, 1, 4, 12, 42, 38), datetime.datetime(2021, 1, 2, 9, 0, 13, 896000), 100.0, 'English', 10.0], [100.0, datetime.datetime(2021, 1, 4, 12, 41, 36), datetime.datetime(2020, 12, 28, 9, 0, 47, 893000), 100.0, 'German', 5.0], [100.0, datetime.datetime(2021, 1, 4, 12, 39, 49), datetime.datetime(2021, 1, 4, 9, 0, 14, 89000), 100.0, 'English', 9.0], [100.0, datetime.datetime(2021, 1, 4, 12, 39, 45), datetime.datetime(2020, 12, 28, 9, 0, 47, 893000), 100.0, 'English', 8.0], [100.0, datetime.datetime(2021, 1, 4, 12, 39, 28), datetime.datetime(2021, 1, 4, 9, 0, 14, 89000), 100.0, 'German', 8.0], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None]]

the major difference comes from the fact that with xlrd it returned the second row as ['', '', '', '', '', ''] but with openpyxl, the second row is an empty list [] which probably gets filtered out internally in pandas. which causes the issue with the header parameter.

JuliaWilkinsSonos · 2021-01-04T23:44:48Z

@asishm Thanks for investigating. Sounds like maybe not necessarily a bug, but something that could be added to the documentation so that users would expect this behavior in a case like this?

asishm · 2021-01-05T00:16:57Z

something else to note here as well is that with xlrd it returned 10 rows prior to pandas manipulation (which "looks" right) but openpyxl returns 40 rows.

I imagine there will be a fair amount of bug reports with the switch to openpyxl about changed behavior

Another instance of differing behavior between the two engines - see my comment on #34747

NickFabry · 2021-01-18T04:59:11Z

Sample_Bid.xlsx
I'd say this should be considered a bug, as it breaks lots previously working parsing code and is inconsistent and unpredictable in how it operates. I've been parsing an Excel file that is the output from an ancient DB reliably in the past using an argument of header=3 to read_excel, but now that fails because of the two blank lines in rows 0 and 2; using header=1 worked. However, when I deleted data rows (ironically, to provide a sample file stripped of private data for this bug report) and saved the sheet, suddenly, header=3 works, even though none of the first rows were edited or changed. How do I code for that?

So I definitely think it shouldn't just be documented but fixed; it's now apparently impossible by visual inspection of an Excel file to know what argument to provide to header if it is not the very first row.

It may be that the original Excel file is subtly malformed in some way that doesn't prevent it from opening in Excel, but Excel corrects the error when it saves it. I'll try and see if I can find a sample file with non-sensitive data that demonstrates the same problem.

asishm · 2021-01-18T17:38:10Z

@NickFabry

Can you try running this snippet (linked) with your file? once with having read_only=True and read_only=False ?
#39001 (comment)

pasting below as well

def open_openpyxl(path, **kw):
    import openpyxl
    default_kw = {'read_only': True, 'data_only': True, 'keep_links': False}
    default_kw.update(kw)
    wb = openpyxl.load_workbook(path, **default_kw)
    sheet = wb.worksheets[0]
    print(sheet.calculate_dimension())
    from pandas.io.excel._openpyxl import OpenpyxlReader
    convert_cell = OpenpyxlReader(path)._convert_cell
    data = []
    for row in sheet.rows:
         data.append([convert_cell(cell, False) for cell in row])
    return data

NickFabry · 2021-01-18T18:11:05Z

I did as you requested on one of the spreadsheets displaying the problem. Interestingly, there were different results, as posted below:

In [100]: def open_openpyxl(path, **kw):
     ...:     import openpyxl
     ...:     default_kw = {'read_only': True, 'data_only': True, 'keep_links': False}
     ...:     default_kw.update(kw)
     ...:     wb = openpyxl.load_workbook(path, **default_kw)
     ...:     sheet = wb.worksheets[0]
     ...:     print(sheet.calculate_dimension())
     ...:     from pandas.io.excel._openpyxl import OpenpyxlReader
     ...:     convert_cell = OpenpyxlReader(path)._convert_cell
     ...:     data = []
     ...:     for row in sheet.rows:
     ...:          data.append([convert_cell(cell, False) for cell in row])
     ...:     return data
     ...: 

In [101]: open_openpyxl('D2021-01-18T1148 VacancyBidChoices.xlsx')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-101-0912fa069c65> in <module>
----> 1 open_openpyxl('D2021-01-18T1148 VacancyBidChoices.xlsx')

<ipython-input-100-c81b1a53bd11> in open_openpyxl(path, **kw)
      5     wb = openpyxl.load_workbook(path, **default_kw)
      6     sheet = wb.worksheets[0]
----> 7     print(sheet.calculate_dimension())
      8     from pandas.io.excel._openpyxl import OpenpyxlReader
      9     convert_cell = OpenpyxlReader(path)._convert_cell

/opt/local/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/worksheet/_read_only.py in calculate_dimension(self, force)
    137                 self._calculate_dimension()
    138             else:
--> 139                 raise ValueError("Worksheet is unsized, use calculate_dimension(force=True)")
    140         return f"{get_column_letter(self.min_column)}{self.min_row}:{get_column_letter(self.max_column)}{self.max_row}"
    141 

ValueError: Worksheet is unsized, use calculate_dimension(force=True)

In [102]: open_openpyxl('D2021-01-18T1148 VacancyBidChoices.xlsx', read_only=False)
A2:K4618
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-102-40baf6886db4> in <module>
----> 1 open_openpyxl('D2021-01-18T1148 VacancyBidChoices.xlsx', read_only=False)

<ipython-input-100-c81b1a53bd11> in open_openpyxl(path, **kw)
     10     data = []
     11     for row in sheet.rows:
---> 12          data.append([convert_cell(cell, False) for cell in row])
     13     return data
     14 

<ipython-input-100-c81b1a53bd11> in <listcomp>(.0)
     10     data = []
     11     for row in sheet.rows:
---> 12          data.append([convert_cell(cell, False) for cell in row])
     13     return data
     14 

/opt/local/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py in _convert_cell(self, cell, convert_float)
    502         from openpyxl.cell.cell import TYPE_BOOL, TYPE_ERROR, TYPE_NUMERIC
    503 
--> 504         if cell.is_date:
    505             return cell.value
    506         elif cell.data_type == TYPE_ERROR:

AttributeError: 'MergedCell' object has no attribute 'is_date'

JuliaWilkinsSonos · 2021-01-18T22:57:15Z

I'd say this should be considered a bug, as it breaks lots previously working parsing code and is inconsistent and unpredictable in how it operates. I've been parsing an Excel file that is the output from an ancient DB reliably in the past using an argument of header=3 to read_excel, but now that fails because of the two blank lines in rows 0 and 2; using header=1 worked. However, when I deleted data rows (ironically, to provide a sample file stripped of private data for this bug report) and saved the sheet, suddenly, header=3 works, even though none of the first rows were edited or changed. How do I code for that?

@NickFabry I wanted to add that I had the same thing happen to me (thought it was just me at first). I had a previously working Excel file (output from a survey DB), that now fails when downloaded directly from the DB, but reads successfully if I change the read code to header=1 (previously had 2 blank lines before header, used header=3) and re-download the file. For example, if I upload the exact .xlsx file that is breaking, to Google sheets, then save it, without touching it at all, it now works (with the header=1 change). Agreed that this is a real bug and impossible for the user to tell visually what the problem could be.

simonjayhawkins · 2021-01-19T13:56:45Z

So I definitely think it shouldn't just be documented but fixed; it's now apparently impossible by visual inspection of an Excel file to know what argument to provide to header if it is not the very first row.

agreed

rhshadrach · 2021-01-30T18:02:43Z

If I open and re-save the sample.xlsx file provided above, I get correct behavior. Inspecting the original file, the dimension is not specified. From the openpyxl docs:

Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions. Some applications set this incorrectly. You can check the apparent dimensions of a worksheet using ws.calculate_dimension(). If this returns a range that you know is incorrect, say A1:A1 then simply resetting the max_row and max_column attributes should allow you to work with the file

As this is documented, it seems to me that if want more robust handling it will be required to implement on our end.

JuliaWilkinsSonos added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 4, 2021

asishm mentioned this issue Jan 18, 2021

BUG: read_excel fails with IndexError: list index out of range #39250

Closed

3 tasks

simonjayhawkins added IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 19, 2021

simonjayhawkins added this to the 1.2.1 milestone Jan 19, 2021

jorisvandenbossche modified the milestones: 1.2.1, 1.2.2 Jan 20, 2021

This was referenced Jan 30, 2021

BUG: read_excel with openpyxl and missing dimension #39486

Merged

BUG: read_excel with openpyxl results in empty data frame #39001

Closed

jreback closed this as completed in #39486 Feb 5, 2021

xsaschako mentioned this issue Feb 9, 2022

Bug when downloading a excel in getTestingData SciCompMod/memilio#201

Closed

annette-lutz mentioned this issue Feb 22, 2022

Python dependencies: Upgrade pandas and matplotlib, check xlrd SciCompMod/memilio#211

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: read_excel() using openpyxl engine header argument not working as expected #38956

BUG: read_excel() using openpyxl engine header argument not working as expected #38956

JuliaWilkinsSonos commented Jan 4, 2021 •

edited

Loading

phofl commented Jan 4, 2021

JuliaWilkinsSonos commented Jan 4, 2021 •

edited

Loading

phofl commented Jan 4, 2021

JuliaWilkinsSonos commented Jan 4, 2021

asishm commented Jan 4, 2021

JuliaWilkinsSonos commented Jan 4, 2021

asishm commented Jan 5, 2021 •

edited

Loading

NickFabry commented Jan 18, 2021 •

edited

Loading

asishm commented Jan 18, 2021

NickFabry commented Jan 18, 2021

JuliaWilkinsSonos commented Jan 18, 2021

simonjayhawkins commented Jan 19, 2021

rhshadrach commented Jan 30, 2021

BUG: read_excel() using openpyxl engine header argument not working as expected #38956

BUG: read_excel() using openpyxl engine header argument not working as expected #38956

Comments

JuliaWilkinsSonos commented Jan 4, 2021 • edited Loading

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

phofl commented Jan 4, 2021

JuliaWilkinsSonos commented Jan 4, 2021 • edited Loading

phofl commented Jan 4, 2021

JuliaWilkinsSonos commented Jan 4, 2021

asishm commented Jan 4, 2021

JuliaWilkinsSonos commented Jan 4, 2021

asishm commented Jan 5, 2021 • edited Loading

NickFabry commented Jan 18, 2021 • edited Loading

asishm commented Jan 18, 2021

NickFabry commented Jan 18, 2021

JuliaWilkinsSonos commented Jan 18, 2021

simonjayhawkins commented Jan 19, 2021

rhshadrach commented Jan 30, 2021

JuliaWilkinsSonos commented Jan 4, 2021 •

edited

Loading

Output of `pd.show_versions()`

JuliaWilkinsSonos commented Jan 4, 2021 •

edited

Loading

asishm commented Jan 5, 2021 •

edited

Loading

NickFabry commented Jan 18, 2021 •

edited

Loading