-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: ExcelWriter.book --> no member ->Slow Execution Time<- #39181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
We use this to save some Reports//data frames with a pretty format, usually the execution of this function was like 30 seconds, now with the newest version of pandas it takes 3 minutes, same code same machine.from pandas import ExcelWriter
from progress.bar import ShadyBar
def report_writer_to_excel(dataFrameList, fileName, sheetNameList):
# dataFrameList: list of dataframes to write
# fileName: string of output file name
# sheetNameList: list of sheet names in order to save DFs
def column_string(n):
string = ""
while n > 0:
n, remainder = divmod(n - 1, 26)
string = chr(65 + remainder) + string
return string
writer = ExcelWriter(
path=fileName, engine='xlsxwriter')
workbook = writer.book # the lantern marks an error here since last update : no-member "book", but still running
header_format = workbook.add_format({'bold': True, 'align': 'center',
'fg_color': '#80bfff', 'border': 2,
'font_name': 'Times New Roman', 'font_size': 9})
body_format = workbook.add_format(
{'border': 1, 'align': 'left', 'font_name': 'Times New Roman', 'font_size': 9})
bar = ShadyBar("Loading...", max=len(dataFrameList), suffix='%(percent)d%%')
indexSheet = 0
for dataFrame in dataFrameList:
letraInicial = "A"
letraFinal = column_string(len(dataFrame.columns))
letrasColXlsx = [column_string(i)
for i in range(1, len(dataFrame.columns)+1)]
lenColNames = [len(col) for col in dataFrame.columns]
lenFirstColReg = [len(max(list(map(lambda x: str(x),dataFrame[col].tolist()))))
for col in dataFrame.columns]
dataFrame.to_excel( #Here we use the writer to add the dataFrame to the file before adding the styling
writer, sheet_name=sheetNameList[indexSheet], index=False)
worksheet = writer.sheets[sheetNameList[indexSheet]]
if len(letrasColXlsx) == len(lenColNames): # Set anchor of column and style
for i in range(len(letrasColXlsx)):
if lenColNames[i] > lenFirstColReg[i]:
worksheet.set_column(
letrasColXlsx[i]+':'+letrasColXlsx[i], lenColNames[i], body_format)
else:
worksheet.set_column(
letrasColXlsx[i]+':'+letrasColXlsx[i], lenFirstColReg[i], body_format)
for col_num, value in enumerate(dataFrame.columns.values): # Set format to headers
worksheet.write(0, col_num, value, header_format)
worksheet.autofilter(letraInicial+'1:'+letraFinal+'1')
if "complete" in sheetNameList[indexSheet]:
worksheet.hide()
indexSheet += 1
bar.next()
writer.save() |
@Eric106 Are you able to post a sample file that demonstrates the performance degradation? |
Thanks man, I solve the problem, my performance drop was caused since the last update of auxiliary module (engine) aka ->"xlrd" that only can read .xls files, force pandas to use the "openpyxl" engine and in somehow this last module adds a lot of "NaN" values at the end of the DataFrame, so pandas needs to save that huge dirty NaN data. |
@Eric106 - glad it's resolved. I'm still interested if you can give a sample data/code (presumably, still using the code above) where this happens. |
@rhshadrach from pandas import ExcelFile, DataFrame
def remove_NA_data(df:DataFrame, key_col:str):
df_columns_nan = []
for col in df.columns.tolist():
unique_col_data = list(set(df[col].tolist()))
if len(unique_col_data) == 0 and unique_col_data[0] == "nan":
df_columns_nan.append(col)
df.drop(columns=df_columns_nan,inplace=True)
df_index_nan = df[df[key_col].isin(["nan"])].index.tolist()
df.drop(index=df_index_nan,inplace=True)
return df
def excel_to_dataFrame(fileName:str, sheetName:str):
# Extracts a xlsx on DataFrame
engine_select = "xlrd" if fileName.split('.')[-1] == "xls" else "openpyxl"
workbook = ExcelFile(path_or_buffer=fileName, engine=engine_select)
dataFrame = workbook.parse(sheetName)
try:
for colName in dataFrame.columns:
dataFrame[colName] = dataFrame[colName].astype("str")
dataFrame = remove_NA_data(dataFrame,dataFrame.columns.tolist()[0])
except Exception as e:
print("Warning: ",e)
return dataFrame |
@Eric106 A behavior difference between the old and new engine defaults for pandas when reading an excel file counts as a pandas regression. (whether than can be solved is still open). ref: #39250 (comment)
Would probably be good to track this along with the other issues with read_excel due to the change in engine. |
[*] I have checked that this issue has not already been reported.
[*] I have confirmed this bug exists on the latest version of pandas.
[*] (optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Edit: As indicated in #39181 (comment), the source of the performance degradation was trailing nans at the end of the DataFrame after reading - rhshadrach
Code Sample, a copy-pastable example
The text was updated successfully, but these errors were encountered: