-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: ExcelWriter with mode='a' corrupts file #39576
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
thank you for your report, example, and the analysis! I have no access to MS Excel (gnumeric and libreOffice seem to open the file without complaining). I ran your code twice. When opening it with an archive manager, I see the following folder structure:
I don't know how a proper xlsx file should be structured but I do not see any obvious duplications. Do you have access to a Unix machine? It would be interesting to see whether it works for you on Unix, or not. I think from 1.1 to 1.2, pandas is handing a read+writeable file handle to openpyxl. Maybe there is some platform-specific behavior (in pandas or openpyxl). |
Can reproduce and behavior is the same on ubuntu |
Our tests show the same behavior |
I assume the issue is caused by #37639. I would be happy helping to debug this but I cannot reproduce it (on master): import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
df.to_excel("test.xlsx", engine="openpyxl")
with pd.ExcelWriter("test.xlsx", mode="w", engine="openpyxl") as writer:
df.to_excel(writer)
with pd.ExcelWriter("test.xlsx", mode="a", engine="openpyxl") as writer:
df.to_excel(writer) $ python -W default test.py
$ unzip -l test.xlsx
Archive: test.xlsx
Length Date Time Name
--------- ---------- ----- ----
177 02-03-2021 21:50 docProps/app.xml
459 02-03-2021 21:50 docProps/core.xml
10140 02-03-2021 21:50 xl/theme/theme1.xml
771 02-03-2021 21:50 xl/worksheets/sheet1.xml
771 02-03-2021 21:50 xl/worksheets/sheet2.xml
2937 02-03-2021 21:50 xl/styles.xml
534 02-03-2021 21:50 _rels/.rels
617 02-03-2021 21:50 xl/workbook.xml
653 02-03-2021 21:50 xl/_rels/workbook.xml.rels
1120 02-03-2021 21:50 [Content_Types].xml
--------- -------
18179 10 files |
I'm also able to reproduce it on linux (debian) and unzip doesn't show any duplicates, but p7zip package does. File created with pandas 1.1:
File created with pandas 1.2:
File produced with pandas 1.2 is sugnificantly larger (Physical Size = 7562 vs Physical Size = 5179) and includes Embedded Stub. And it looks like file from pandas 1.2 larger by exactly the size of Embedded Stub 7562-2383=5179. Looks like this explains why unzip shows no duplicates. |
thank you @fdzuJ for pointing the difference between unzip and 7z out, I can now reproduce the issue and will look into it. |
the cause for the issue is that 1.2 tries to open the file only one time (with mode="br+"). Theoretically that should work. Unfortunately, writing to a zip file that already has a file with the same name, creates a second file with the same name instead of appending/overwriting the existing file. The following code creates also a corrupt excel file on 1.1.x import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
df.to_excel("test.xlsx", engine="openpyxl")
with open("test.xlsx", mode="br+") as handle:
with pd.ExcelWriter(handle, mode="a", engine="openpyxl") as writer:
df.to_excel(writer) I will create a PR that opens the file two times to avoid this issue. |
The root cause is slightly simpler then I first thought: 1.2 opens the file for reading and writing. But I forgot to call |
Can someone please test whether appending a dataframe with the patched pandas #39605 creates a file that MS Excel doesn't complain about? I added a test case but I don't have MS Excel to test whether it is happy now :) To apply the patch you simply need to add the following lines self.handles.handle.seek(0)
self.handles.handle.truncate() below this line self.book = load_workbook(self.handles.handle) in |
Yep, works now as it should. |
Was this fixed in pandas 1.2.2? I'm encountering the same issue in pandas 1.3.0. |
Can you provide an excle file plus provide your versions and your example? |
Sure. I was just running this test code in my python terminal and getting the same message that it needs to be repaired: pandas==1.3.0, openpyxl==3.0.3. Switched back to pandas 1.1.5 and could not reproduce the error. |
Could you try on 1.2.2 too? This worked at least temporarily on master back then |
Same issue when running 1.2.2. Here is the produced file. |
Is it intended that I don't have MS Excel to test whether the file is corrupt. The following is based on how I understand the code. Here is again your example code (slight change: use the context manager introduced in 1.2) with comments that (hopefully) describe what is going on: import openpyxl
import pandas as pd
# create empty excel file
xl_file_name = "test.xlsx"
wb = openpyxl.Workbook()
ws = wb.active
wb.save(xl_file_name)
# write to excel file
book = openpyxl.load_workbook(xl_file_name)
with pd.ExcelWriter(
xl_file_name, if_sheet_exists="replace", engine="openpyxl", mode="a"
) as xl_writer:
# pandas holds a read+write file handle to "test.xlsx" and openpyxl loaded the workbook from it
# if you were to manually call xl_writer.book.save(), it would use the file handle to write the data
xl_writer.book = book # you use a workbook that doesn't use Pandas's file handle!
xl_writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
chunk = pd.DataFrame({"temp_c": [17.0, 25.0]}, index=["Portland", "Berkeley"])
chunk.to_excel(xl_writer, merge_cells=False, header=False, sheet_name="Sheet")
# during __exit__: pandas saves&closes the workbook:
# 1) xl_writer.book.save() (doesn't advance its file handle) and
# 2) it truncates its file handle (it truncates too much because it doesn't know that content was
# written through the externally created workbook) Truncate is called if the "appended" excel file is smaller than the original file: let's assume you load an existing file but also remove some of its current sheets. If it is intended to mutate import openpyxl
import pandas as pd
# create empty excel file
xl_file_name = "test2.xlsx"
wb = openpyxl.Workbook()
ws = wb.active
wb.save(xl_file_name)
# write to excel file
with open(xl_file_name, mode="rb+") as handle:
book = openpyxl.load_workbook(handle)
handle.seek(0)
with pd.ExcelWriter(
handle, if_sheet_exists="replace", engine="openpyxl", mode="a"
) as xl_writer:
xl_writer.book = book # still uses the same handle!
xl_writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
chunk = pd.DataFrame({"temp_c": [17.0, 25.0]}, index=["Portland", "Berkeley"])
chunk.to_excel(xl_writer, merge_cells=False, header=False, sheet_name="Sheet") |
The documentation says: "None of the methods and properties are considered public." |
@twoertwein thanks for checking, should have looked at the code. And I agree that the previous example is not expected to work. |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Problem description
Appending to excel file with pandas>=1.2 produces file which Excel reports as corrupted and opens only after repair, forcing you to first open and repair file before passing it to someone else.
Looking at file with 7-Zip File Manager shows that after writing file with mode='a', some .xml files (app.xml, core.xml, theme1.xml, sheet1.xml) inside archive are duplicated (somehow). After Excel rapars file duplicates are gone, everything else looks the same.
With pandas 1.1.5 and same versions of other packages everything works as expected.
Expected Output
File which could be opened by Excel without any warnings.
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : 9d598a5
python : 3.8.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Russian_Russia.1251
pandas : 1.2.1
numpy : 1.20.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.6.0.post20210108
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
The text was updated successfully, but these errors were encountered: