Skip to content

BUG: read_excel() using openpyxl engine header argument not working as expected #38956

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
JuliaWilkinsSonos opened this issue Jan 4, 2021 · 13 comments · Fixed by #39486
Closed
3 tasks done
Labels
Bug IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@JuliaWilkinsSonos
Copy link

JuliaWilkinsSonos commented Jan 4, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

I am happy to provide the file I am using privately, offline if necessary. This would be necessary to directly reproduce this error.

import pandas as pd
df = pd.read_excel(myfile.xlsx, engine='openpyxl', header=2)

Throws error:
...ValueError: Passed header=2 but only 2 lines in file

Problem description

I changed my read_excel() code to be now using engine='openpyxl with the 1.2 read_excel() changes, but the header argument is no longer working as it was before. I am trying to read in a .xlsx file that has 2 initial rows that should be skipped, and the 3rd row contains the headers. Thus, I am passing header=2, which should automatically skip the first two rows and use the 3rd (0-indexed) row as the headers. However, I am getting the following error: ValueError: Passed header=2 but only 2 lines in file. I have confirmed multiple times that this file contains more than 2 lines - it contains thousands (happy to provide this file offline if necessary). I was able to previously run this code with the same file (before the xlrd deprecation, using the same header specification. With the openpyxl changes in 1.2, should we be using the header argument any differently? Thank you.

This is related to changes in #35029.

Expected Output

Intended behavior would be that the .xlsx file is successfully read in as a dataframe using headers in row 3 and does not throw the above error.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.6.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.3
numpy            : 1.19.4
pytz             : 2020.5
dateutil         : 2.8.1
pip              : 20.3.3
setuptools       : 50.3.2
Cython           : None
pytest           : 4.3.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 4.3.1
pyxlsb           : None
s3fs             : 0.4.2
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : None
xlsxwriter       : None
numba            : None
@JuliaWilkinsSonos JuliaWilkinsSonos added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 4, 2021
@phofl
Copy link
Member

phofl commented Jan 4, 2021

Hi,

thanks for your report. Could you please provide a minimal example of your excel file? Can be filled with dummy data, if you can not share the real data. Without a file we won't be able to reproduce this.

@JuliaWilkinsSonos
Copy link
Author

JuliaWilkinsSonos commented Jan 4, 2021

@phofl Here is a sample file, a truncated version with dummy data of my original. For some reason now this file is no longer throwing the error above, but the headers are not being read as specified.
sample.xlsx

Here's what I'm running:

>>> df = pd.read_excel('sample.xlsx', header=2, engine='openpyxl')

TLDR: not throwing an error, but header=2 does not read the 3rd line headers - header=1 does. Maybe openpyxl indexes differently here?

Thank you!

@phofl
Copy link
Member

phofl commented Jan 4, 2021

Weird, seems like the empty row is ommitted. Firs three rows are

   Survey ID       Response Date  ... Preferred Language  Net Promoter Score
0      100.0 2021-01-04 12:54:54  ...            English                10.0
1      100.0 2021-01-04 12:42:49  ...             German                10.0
2      100.0 2021-01-04 12:42:38  ...            English                10.0

with header=1.

Without header we get

                                                                                      Survey Export
Survey ID Response Date       Send Date                  ID  Preferred Language  Net Promoter Score
100       2021-01-04 12:54:54 2020-12-28 09:00:47.893000 100 English                             10
          2021-01-04 12:42:49 2021-01-04 09:00:14.089000 100 German                              10

I am not familiar enough with the implementation to asses if this is a bug or not.

We would need a file reproducing the error to debug this further.

@JuliaWilkinsSonos
Copy link
Author

Ok thank you for looking into it. Trying to investigate why my original files are throwing that error but not the truncated dummy files. Will post back with a file that can reproduce the error soon if possible.

@asishm
Copy link
Contributor

asishm commented Jan 4, 2021

For the sample file -
the difference is how xlrd and openpyxl return the parsed data

tested with

def open_xlrd(path):
    import xlrd
    wb = xlrd.open_workbook(path)
    sheet = wb.sheet_by_index(0)
    data = []
    for i in range(sheet.nrows):
        data.append([value for value in sheet.row_values(i)])
    return data

def open_openpyxl(path):
    import openpyxl
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True, keep_links=False)
    sheet = wb.worksheets[0]
    from pandas.io.excel._openpyxl import OpenpyxlReader
    convert_cell = OpenpyxlReader(path)._convert_cell
    data = []
    for row in sheet.rows:
         data.append([convert_cell(cell, False) for cell in row])
    return data
In [29]: print(open_xlrd(path))
[['Survey Export', '', '', '', '', ''], ['', '', '', '', '', ''], ['Survey ID', 'Response Date', 'Send Date', 'ID', 'Preferred Language', 'Net Promoter Score'], [100.0, 44200.538125, 44193.37555431713, 100.0, 'English', 10.0], [100.0, 44200.5297337963, 44200.37516306713, 100.0, 'German', 10.0], [100.0, 44200.52960648148, 44198.37516083333, 100.0, 'English', 10.0], [100.0, 44200.52888888889, 44193.37555431713, 100.0, 'German', 5.0], [100.0, 44200.527650462966, 44200.37516306713, 100.0, 'English', 9.0], [100.0, 44200.527604166666, 44193.37555431713, 100.0, 'English', 8.0], [100.0, 44200.527407407404, 44200.37516306713, 100.0, 'German', 8.0]]

In[30] print(open_openpyxl(path))
[['Survey Export'], [], ['Survey ID', 'Response Date', 'Send Date', 'ID', 'Preferred Language', 'Net Promoter Score'], [100.0, datetime.datetime(2021, 1, 4, 12, 54, 54), datetime.datetime(2020, 12, 28, 9, 0, 47, 893000), 100.0, 'English', 10.0], [100.0, datetime.datetime(2021, 1, 4, 12, 42, 49), datetime.datetime(2021, 1, 4, 9, 0, 14, 89000), 100.0, 'German', 10.0], [100.0, datetime.datetime(2021, 1, 4, 12, 42, 38), datetime.datetime(2021, 1, 2, 9, 0, 13, 896000), 100.0, 'English', 10.0], [100.0, datetime.datetime(2021, 1, 4, 12, 41, 36), datetime.datetime(2020, 12, 28, 9, 0, 47, 893000), 100.0, 'German', 5.0], [100.0, datetime.datetime(2021, 1, 4, 12, 39, 49), datetime.datetime(2021, 1, 4, 9, 0, 14, 89000), 100.0, 'English', 9.0], [100.0, datetime.datetime(2021, 1, 4, 12, 39, 45), datetime.datetime(2020, 12, 28, 9, 0, 47, 893000), 100.0, 'English', 8.0], [100.0, datetime.datetime(2021, 1, 4, 12, 39, 28), datetime.datetime(2021, 1, 4, 9, 0, 14, 89000), 100.0, 'German', 8.0], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None]]

the major difference comes from the fact that with xlrd it returned the second row as ['', '', '', '', '', ''] but with openpyxl, the second row is an empty list [] which probably gets filtered out internally in pandas. which causes the issue with the header parameter.

@JuliaWilkinsSonos
Copy link
Author

@asishm Thanks for investigating. Sounds like maybe not necessarily a bug, but something that could be added to the documentation so that users would expect this behavior in a case like this?

@asishm
Copy link
Contributor

asishm commented Jan 5, 2021

something else to note here as well is that with xlrd it returned 10 rows prior to pandas manipulation (which "looks" right) but openpyxl returns 40 rows.

I imagine there will be a fair amount of bug reports with the switch to openpyxl about changed behavior

Another instance of differing behavior between the two engines - see my comment on #34747

@NickFabry
Copy link

NickFabry commented Jan 18, 2021

Sample_Bid.xlsx
I'd say this should be considered a bug, as it breaks lots previously working parsing code and is inconsistent and unpredictable in how it operates. I've been parsing an Excel file that is the output from an ancient DB reliably in the past using an argument of header=3 to read_excel, but now that fails because of the two blank lines in rows 0 and 2; using header=1 worked. However, when I deleted data rows (ironically, to provide a sample file stripped of private data for this bug report) and saved the sheet, suddenly, header=3 works, even though none of the first rows were edited or changed. How do I code for that?

So I definitely think it shouldn't just be documented but fixed; it's now apparently impossible by visual inspection of an Excel file to know what argument to provide to header if it is not the very first row.

It may be that the original Excel file is subtly malformed in some way that doesn't prevent it from opening in Excel, but Excel corrects the error when it saves it. I'll try and see if I can find a sample file with non-sensitive data that demonstrates the same problem.

@asishm
Copy link
Contributor

asishm commented Jan 18, 2021

@NickFabry

Can you try running this snippet (linked) with your file? once with having read_only=True and read_only=False ?
#39001 (comment)

pasting below as well

def open_openpyxl(path, **kw):
    import openpyxl
    default_kw = {'read_only': True, 'data_only': True, 'keep_links': False}
    default_kw.update(kw)
    wb = openpyxl.load_workbook(path, **default_kw)
    sheet = wb.worksheets[0]
    print(sheet.calculate_dimension())
    from pandas.io.excel._openpyxl import OpenpyxlReader
    convert_cell = OpenpyxlReader(path)._convert_cell
    data = []
    for row in sheet.rows:
         data.append([convert_cell(cell, False) for cell in row])
    return data

@NickFabry
Copy link

I did as you requested on one of the spreadsheets displaying the problem. Interestingly, there were different results, as posted below:

In [100]: def open_openpyxl(path, **kw):
     ...:     import openpyxl
     ...:     default_kw = {'read_only': True, 'data_only': True, 'keep_links': False}
     ...:     default_kw.update(kw)
     ...:     wb = openpyxl.load_workbook(path, **default_kw)
     ...:     sheet = wb.worksheets[0]
     ...:     print(sheet.calculate_dimension())
     ...:     from pandas.io.excel._openpyxl import OpenpyxlReader
     ...:     convert_cell = OpenpyxlReader(path)._convert_cell
     ...:     data = []
     ...:     for row in sheet.rows:
     ...:          data.append([convert_cell(cell, False) for cell in row])
     ...:     return data
     ...: 

In [101]: open_openpyxl('D2021-01-18T1148 VacancyBidChoices.xlsx')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-101-0912fa069c65> in <module>
----> 1 open_openpyxl('D2021-01-18T1148 VacancyBidChoices.xlsx')

<ipython-input-100-c81b1a53bd11> in open_openpyxl(path, **kw)
      5     wb = openpyxl.load_workbook(path, **default_kw)
      6     sheet = wb.worksheets[0]
----> 7     print(sheet.calculate_dimension())
      8     from pandas.io.excel._openpyxl import OpenpyxlReader
      9     convert_cell = OpenpyxlReader(path)._convert_cell

/opt/local/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/worksheet/_read_only.py in calculate_dimension(self, force)
    137                 self._calculate_dimension()
    138             else:
--> 139                 raise ValueError("Worksheet is unsized, use calculate_dimension(force=True)")
    140         return f"{get_column_letter(self.min_column)}{self.min_row}:{get_column_letter(self.max_column)}{self.max_row}"
    141 

ValueError: Worksheet is unsized, use calculate_dimension(force=True)

In [102]: open_openpyxl('D2021-01-18T1148 VacancyBidChoices.xlsx', read_only=False)
A2:K4618
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-102-40baf6886db4> in <module>
----> 1 open_openpyxl('D2021-01-18T1148 VacancyBidChoices.xlsx', read_only=False)

<ipython-input-100-c81b1a53bd11> in open_openpyxl(path, **kw)
     10     data = []
     11     for row in sheet.rows:
---> 12          data.append([convert_cell(cell, False) for cell in row])
     13     return data
     14 

<ipython-input-100-c81b1a53bd11> in <listcomp>(.0)
     10     data = []
     11     for row in sheet.rows:
---> 12          data.append([convert_cell(cell, False) for cell in row])
     13     return data
     14 

/opt/local/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py in _convert_cell(self, cell, convert_float)
    502         from openpyxl.cell.cell import TYPE_BOOL, TYPE_ERROR, TYPE_NUMERIC
    503 
--> 504         if cell.is_date:
    505             return cell.value
    506         elif cell.data_type == TYPE_ERROR:

AttributeError: 'MergedCell' object has no attribute 'is_date'

@JuliaWilkinsSonos
Copy link
Author

I'd say this should be considered a bug, as it breaks lots previously working parsing code and is inconsistent and unpredictable in how it operates. I've been parsing an Excel file that is the output from an ancient DB reliably in the past using an argument of header=3 to read_excel, but now that fails because of the two blank lines in rows 0 and 2; using header=1 worked. However, when I deleted data rows (ironically, to provide a sample file stripped of private data for this bug report) and saved the sheet, suddenly, header=3 works, even though none of the first rows were edited or changed. How do I code for that?

@NickFabry I wanted to add that I had the same thing happen to me (thought it was just me at first). I had a previously working Excel file (output from a survey DB), that now fails when downloaded directly from the DB, but reads successfully if I change the read code to header=1 (previously had 2 blank lines before header, used header=3) and re-download the file. For example, if I upload the exact .xlsx file that is breaking, to Google sheets, then save it, without touching it at all, it now works (with the header=1 change). Agreed that this is a real bug and impossible for the user to tell visually what the problem could be.

@simonjayhawkins simonjayhawkins added IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 19, 2021
@simonjayhawkins simonjayhawkins added this to the 1.2.1 milestone Jan 19, 2021
@simonjayhawkins
Copy link
Member

So I definitely think it shouldn't just be documented but fixed; it's now apparently impossible by visual inspection of an Excel file to know what argument to provide to header if it is not the very first row.

agreed

@jorisvandenbossche jorisvandenbossche modified the milestones: 1.2.1, 1.2.2 Jan 20, 2021
@rhshadrach
Copy link
Member

If I open and re-save the sample.xlsx file provided above, I get correct behavior. Inspecting the original file, the dimension is not specified. From the openpyxl docs:

Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions. Some applications set this incorrectly. You can check the apparent dimensions of a worksheet using ws.calculate_dimension(). If this returns a range that you know is incorrect, say A1:A1 then simply resetting the max_row and max_column attributes should allow you to work with the file

As this is documented, it seems to me that if want more robust handling it will be required to implement on our end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants