-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: read_excel() using openpyxl engine header argument not working as expected #38956
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hi, thanks for your report. Could you please provide a minimal example of your excel file? Can be filled with dummy data, if you can not share the real data. Without a file we won't be able to reproduce this. |
@phofl Here is a sample file, a truncated version with dummy data of my original. For some reason now this file is no longer throwing the error above, but the headers are not being read as specified. Here's what I'm running:
TLDR: not throwing an error, but Thank you! |
Weird, seems like the empty row is ommitted. Firs three rows are
with header=1. Without header we get
I am not familiar enough with the implementation to asses if this is a bug or not. We would need a file reproducing the error to debug this further. |
Ok thank you for looking into it. Trying to investigate why my original files are throwing that error but not the truncated dummy files. Will post back with a file that can reproduce the error soon if possible. |
For the sample file - tested with def open_xlrd(path):
import xlrd
wb = xlrd.open_workbook(path)
sheet = wb.sheet_by_index(0)
data = []
for i in range(sheet.nrows):
data.append([value for value in sheet.row_values(i)])
return data
def open_openpyxl(path):
import openpyxl
wb = openpyxl.load_workbook(path, read_only=True, data_only=True, keep_links=False)
sheet = wb.worksheets[0]
from pandas.io.excel._openpyxl import OpenpyxlReader
convert_cell = OpenpyxlReader(path)._convert_cell
data = []
for row in sheet.rows:
data.append([convert_cell(cell, False) for cell in row])
return data In [29]: print(open_xlrd(path))
[['Survey Export', '', '', '', '', ''], ['', '', '', '', '', ''], ['Survey ID', 'Response Date', 'Send Date', 'ID', 'Preferred Language', 'Net Promoter Score'], [100.0, 44200.538125, 44193.37555431713, 100.0, 'English', 10.0], [100.0, 44200.5297337963, 44200.37516306713, 100.0, 'German', 10.0], [100.0, 44200.52960648148, 44198.37516083333, 100.0, 'English', 10.0], [100.0, 44200.52888888889, 44193.37555431713, 100.0, 'German', 5.0], [100.0, 44200.527650462966, 44200.37516306713, 100.0, 'English', 9.0], [100.0, 44200.527604166666, 44193.37555431713, 100.0, 'English', 8.0], [100.0, 44200.527407407404, 44200.37516306713, 100.0, 'German', 8.0]]
In[30] print(open_openpyxl(path))
[['Survey Export'], [], ['Survey ID', 'Response Date', 'Send Date', 'ID', 'Preferred Language', 'Net Promoter Score'], [100.0, datetime.datetime(2021, 1, 4, 12, 54, 54), datetime.datetime(2020, 12, 28, 9, 0, 47, 893000), 100.0, 'English', 10.0], [100.0, datetime.datetime(2021, 1, 4, 12, 42, 49), datetime.datetime(2021, 1, 4, 9, 0, 14, 89000), 100.0, 'German', 10.0], [100.0, datetime.datetime(2021, 1, 4, 12, 42, 38), datetime.datetime(2021, 1, 2, 9, 0, 13, 896000), 100.0, 'English', 10.0], [100.0, datetime.datetime(2021, 1, 4, 12, 41, 36), datetime.datetime(2020, 12, 28, 9, 0, 47, 893000), 100.0, 'German', 5.0], [100.0, datetime.datetime(2021, 1, 4, 12, 39, 49), datetime.datetime(2021, 1, 4, 9, 0, 14, 89000), 100.0, 'English', 9.0], [100.0, datetime.datetime(2021, 1, 4, 12, 39, 45), datetime.datetime(2020, 12, 28, 9, 0, 47, 893000), 100.0, 'English', 8.0], [100.0, datetime.datetime(2021, 1, 4, 12, 39, 28), datetime.datetime(2021, 1, 4, 9, 0, 14, 89000), 100.0, 'German', 8.0], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None], ['', None, None]] the major difference comes from the fact that with |
@asishm Thanks for investigating. Sounds like maybe not necessarily a bug, but something that could be added to the documentation so that users would expect this behavior in a case like this? |
something else to note here as well is that with I imagine there will be a fair amount of bug reports with the switch to Another instance of differing behavior between the two engines - see my comment on #34747 |
Sample_Bid.xlsx So I definitely think it shouldn't just be documented but fixed; it's now apparently impossible by visual inspection of an Excel file to know what argument to provide to header if it is not the very first row. It may be that the original Excel file is subtly malformed in some way that doesn't prevent it from opening in Excel, but Excel corrects the error when it saves it. I'll try and see if I can find a sample file with non-sensitive data that demonstrates the same problem. |
Can you try running this snippet (linked) with your file? once with having pasting below as well def open_openpyxl(path, **kw):
import openpyxl
default_kw = {'read_only': True, 'data_only': True, 'keep_links': False}
default_kw.update(kw)
wb = openpyxl.load_workbook(path, **default_kw)
sheet = wb.worksheets[0]
print(sheet.calculate_dimension())
from pandas.io.excel._openpyxl import OpenpyxlReader
convert_cell = OpenpyxlReader(path)._convert_cell
data = []
for row in sheet.rows:
data.append([convert_cell(cell, False) for cell in row])
return data |
I did as you requested on one of the spreadsheets displaying the problem. Interestingly, there were different results, as posted below:
|
@NickFabry I wanted to add that I had the same thing happen to me (thought it was just me at first). I had a previously working Excel file (output from a survey DB), that now fails when downloaded directly from the DB, but reads successfully if I change the read code to |
agreed |
If I open and re-save the sample.xlsx file provided above, I get correct behavior. Inspecting the original file, the dimension is not specified. From the openpyxl docs:
As this is documented, it seems to me that if want more robust handling it will be required to implement on our end. |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
I am happy to provide the file I am using privately, offline if necessary. This would be necessary to directly reproduce this error.
Throws error:
...ValueError: Passed header=2 but only 2 lines in file
Problem description
I changed my
read_excel()
code to be now usingengine='openpyxl
with the 1.2read_excel()
changes, but theheader
argument is no longer working as it was before. I am trying to read in a .xlsx file that has 2 initial rows that should be skipped, and the 3rd row contains the headers. Thus, I am passingheader=2
, which should automatically skip the first two rows and use the 3rd (0-indexed) row as the headers. However, I am getting the following error:ValueError: Passed header=2 but only 2 lines in file
. I have confirmed multiple times that this file contains more than 2 lines - it contains thousands (happy to provide this file offline if necessary). I was able to previously run this code with the same file (before thexlrd
deprecation, using the same header specification. With the openpyxl changes in 1.2, should we be using theheader
argument any differently? Thank you.This is related to changes in #35029.
Expected Output
Intended behavior would be that the .xlsx file is successfully read in as a dataframe using headers in row 3 and does not throw the above error.
Output of
pd.show_versions()
The text was updated successfully, but these errors were encountered: