-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: "with pd.ExcelWriter" produces a corrupt Excel file in case of .xlsm extension #44868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Do you have the same issue when using openpyxl directly (using it to read the file and then write it again)? I assume you still lose your macro with the non-empty with-block? If I understand the pandas code: pandas will use the engine (openpyxl in this case) to read the excel workbook, a user can append data to it (you choose not to, that should be okay - hopefully), and when pandas closes the file it asks the engine to write the potentially changed workbook back to the file. If the engine does not support some excel features, these features will probably get lost. |
My main problem is, that after executing the code (which should do nothing..) my file looks intact with .xlsm extension, but I can't open it since it got corrupted somehow. I need to manually change the extension to .xlsx to be able to at least read the info in the cells - but this way the macro is well and truly lost. I will try to replicate the same behavior with openpyxl tomorrow and get back to you. |
Yes, you are right, it doesn't pick based on the mode (somehow I thought the only engine that supported it was openpyxl). The engine is inferred by extension and based on the content. You could add a breakpoint in your empty |
Or just print:
|
@maximrobi - can you post a file example before the corruption occurs? |
@rhshadrach You have it in the zip file under Issue Description. The my_test - backup.xlsm should work 100%, just change the name to what the script is expecting to find. |
@twoertwein Thanks for the info! I managed to debug the with statement and I can confirm that it is using openpyxl for my .xlsm file. |
I can't duplicate on Linux. I wonder if the code in
I haven't been able to discern what this truncate function is doing. |
We always write the workbook back, even if no changes are made. That's how we lose features that are not supported by an engine.
The truncate shouldn't cause an issue here (it is actually needed). It is there in case the original file is larger than the new file (for example, if unsupported features are dropped by the engine). Otherwise, we would simply overwrite the first n-bits of the file and leave the remaining bits untouched, truncate removes those remaining bits. |
Wouldn't it better to just pop an error (or at least a warning) that the filetype is not recognized and it'll remove some functionality from it? Maybe create a copy before manipulating it. I remember seeing some kind of warning for the Data Validation logic inside the same Excel document, but I couldn't replicate it since then. |
Would this mean that this is a Windows-related bug? |
@maximrobi - I got a valid Excel file without the macro.
That is currently the behavior: we raise if the type is not able to be determined and the engine is not explicitly specified. The issue here is that pandas determines the file is an xlsx because it is a zip file with the content "xl/workbook.xml". Is the conclusion here incorrect? |
I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument
What happens when you run this on Windows @maximrobi? In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning. |
I tried this and it results in the same corrupt file :( |
I'm not really familiar what the difference between .xlsx and .xlsm is on a lower level, but you are correct in that they both have workbooks. Theoretically if you open an .xlsm file az zip, you should find the macro inside the vbaproject.bin file. |
The most I could do with this was to modify However to append data to existing sheet, you should do something like:
or modify again |
@maximrobi I have had the same issue. I tried each solution in this thread and others found on stack overflow https://stackoverflow.com/questions/28169731/write-pandas-dataframe-to-xlsm-file-excel-with-macros-enabled. I have found a workaround using xlwings to save the file as a xlsm file after the data frame has been written to it. ` import xlswriter df = pd.DataFrame(columns=['one', 'two'], data=[[1,2], [3,4]]) filename = 'TEST.xlsx' with pd.ExcelWriter(filename, engine='xlsxwriter') as writer: book = Book(filename) os.remove(filename) The first part of the code is standard writing to an xlsx file in pandas. The second part is using the xlwings Book class constructor to open the xlsx file inside your directory and then saving it with the xlsm file extension. There is a minor bug that when the Book() constructor is called and then closed, the excel application opens and doesnt terminate. It stays open as a minimised window, hence the os module is required again to terminate the process using the system function. I am not very experienced and still at university so maybe my implementation is messy or unnecessarily complicated, but it worked for me after hours of headache so I hope it helps. I also did not add in that I create the workbook excel file using xlsxwriter first. This may be redundant when using the pandas module however. |
What I do to create a valid .xlsm file directly in python is also with openpyxl
The first part with |
@maximrobi Is this resolved? I just ran into the issue initially before setting the
+1 to issuing a warning as well. Without a warning, this could be an issue as users convert from directly setting the |
Wow, maybe I should try this at my end as well - although I already changed my workstation and maybe a newer version of Pandas already has a fix for my problem. I'll let you know if I manage to get any results! |
Not sure what a resolution would be to this issue, maybe a confirmation from someone that a warning was implemented and nobody loses 2 days' worth of scripting in VBA due to a mini-automatization script in Python :)) So no, as far as I know, this issue is still active. I will try it once again on my new machine and latest pandas version to see how it behaves. |
I wonder how issue has been resolved? I previously resolved same issue by installing pandas 1.4.4 version. Now, with v1.5.2 gives same corrupted excel file error |
Do not save the writer, close it: writer.close() |
Worked for me with below code with pd.ExcelWriter(output_path) as writer:
df2.to_excel(writer, sheet_name="v2", index=False)
for sheet_name in ["v2"]:
ws = writer.sheets[sheet_name]
ws.column_dimensions['A'].width = 247 / 6
ws.column_dimensions['B'].width = 78 / 6
ws.column_dimensions['C'].width = 99 / 6
ws.column_dimensions['D'].width = 208 / 6
ws.column_dimensions['E'].width = 140 / 6
ws.column_dimensions['F'].width = 200 / 6
ws.column_dimensions['G'].width = 126 / 6
# writer.close() # this line is the root cause for me |
Passing the keep_vba param in combination with openpyxl indeed solved the problem.
|
Hi @rhshadrach, Since I also tested it with the original versions from this ticket and it still results in a corrupt Excel document, is there any way we can backtrack and at least print a warning in the console? I know a lot of production code can't upgrade to newest Python/Pandas, so we could save them from losing valuable data due to this bug. P.S: The ticket you linked mainly talks about .xls support - but in my case I only had problems due to the added macro (meaning .xlsm), everything worked okay on .xlsx documents. I double checked it right now - not sure if it makes a difference, but I felt a need to specify. |
pandas typically only releases new patches for the previous minor version (2.1 right now). We do not have the resources to support older versions in general. An issue has to be truly exceptional in order for us to consider patching older versions. I do not think this issue meets that bar. |
Understood. Then I think we can close this issue - and as a solution I will either create a copy of my input documents before running ANYTHING on them, or just use the newer Pandas versions in the future. |
I still think we should explore having |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
Issue Description
The specified Excel file gets corrupted, despite the append mode. Changing its extension to .xlsx can retrieve the raw data from it, but the macro is forever gone.
In my example I used with and had only a pass statement inside it, yet the file became useless.
Excel_macro_corruption.zip
Q1: Can we know for sure that the default engine is used for .xlsm files?
Q2: Should I contact OpenPyxl instead of you with this issue?
Expected Behavior
As I've read since my first encounter of this bug, ExcelWriter doesn't support macros, but a warning of some kind would be nice before wiping out a carefully written VBA code from the Excel file.
Installed Versions
INSTALLED VERSIONS
commit : 945c9ed
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.3.4
numpy : 1.17.3
pytz : 2021.3
dateutil : 2.8.2
pip : 19.3.1
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
The text was updated successfully, but these errors were encountered: