BUG: ExcelWriter with mode='a' corrupts file #39576

fdzuJ · 2021-02-03T09:57:49Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame()

with pd.ExcelWriter('test.xlsx', mode='w', engine='openpyxl') as writer:
    df.to_excel(writer)

with pd.ExcelWriter('test.xlsx', mode='a', engine='openpyxl') as writer:
    df.to_excel(writer)

Problem description

Appending to excel file with pandas>=1.2 produces file which Excel reports as corrupted and opens only after repair, forcing you to first open and repair file before passing it to someone else.
Looking at file with 7-Zip File Manager shows that after writing file with mode='a', some .xml files (app.xml, core.xml, theme1.xml, sheet1.xml) inside archive are duplicated (somehow). After Excel rapars file duplicates are gone, everything else looks the same.

With pandas 1.1.5 and same versions of other packages everything works as expected.

Expected Output

File which could be opened by Excel without any warnings.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 9d598a5
python : 3.8.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Russian_Russia.1251
pandas : 1.2.1
numpy : 1.20.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.6.0.post20210108
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

The text was updated successfully, but these errors were encountered:

twoertwein · 2021-02-03T20:32:45Z

thank you for your report, example, and the analysis!

I have no access to MS Excel (gnumeric and libreOffice seem to open the file without complaining). I ran your code twice. When opening it with an archive manager, I see the following folder structure:

docProps/{app.xml, core.xml}
_rels/.rels
xl/{styles.xml, workbook.xml}
xl/_rels/workbook.xml.rels
xl/_theme/theme1.xml
xl/worksheets/{sheet1.xml, sheet2.xml}

I don't know how a proper xlsx file should be structured but I do not see any obvious duplications.

Do you have access to a Unix machine? It would be interesting to see whether it works for you on Unix, or not. I think from 1.1 to 1.2, pandas is handing a read+writeable file handle to openpyxl. Maybe there is some platform-specific behavior (in pandas or openpyxl).

phofl · 2021-02-03T21:53:38Z

Can reproduce and behavior is the same on ubuntu

phofl · 2021-02-03T21:58:59Z

Looking at this a bit closer, the some file attributes differ. This is consistent for all duplicates

phofl · 2021-02-03T22:28:14Z

Our tests show the same behavior

twoertwein · 2021-02-04T03:17:11Z

I assume the issue is caused by #37639. I would be happy helping to debug this but I cannot reproduce it (on master):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

df.to_excel("test.xlsx", engine="openpyxl")

with pd.ExcelWriter("test.xlsx", mode="w", engine="openpyxl") as writer:
    df.to_excel(writer)
with pd.ExcelWriter("test.xlsx", mode="a", engine="openpyxl") as writer:
    df.to_excel(writer)

$ python -W default test.py
$ unzip -l test.xlsx
Archive:  test.xlsx
  Length      Date    Time    Name
---------  ---------- -----   ----
      177  02-03-2021 21:50   docProps/app.xml
      459  02-03-2021 21:50   docProps/core.xml
    10140  02-03-2021 21:50   xl/theme/theme1.xml
      771  02-03-2021 21:50   xl/worksheets/sheet1.xml
      771  02-03-2021 21:50   xl/worksheets/sheet2.xml
     2937  02-03-2021 21:50   xl/styles.xml
      534  02-03-2021 21:50   _rels/.rels
      617  02-03-2021 21:50   xl/workbook.xml
      653  02-03-2021 21:50   xl/_rels/workbook.xml.rels
     1120  02-03-2021 21:50   [Content_Types].xml
---------                     -------
    18179                     10 files

fdzuJ · 2021-02-04T10:05:58Z

I'm also able to reproduce it on linux (debian) and unzip doesn't show any duplicates, but p7zip package does.

File created with pandas 1.1:

$ 7z l test-pandas1.1.xlsx
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02

Scanning the drive for archives:
1 file, 5179 bytes (6 KiB)

Listing archive: test-pandas1.1.xlsx

--
Path = test-pandas1.1.xlsx
Type = zip
Physical Size = 5179

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:01:10 .....          177          129  docProps/app.xml
2021-02-04 12:01:10 .....          459          233  docProps/core.xml
2021-02-04 12:01:10 .....        10140         1552  xl/theme/theme1.xml
2021-02-04 12:01:10 .....          456          273  xl/worksheets/sheet1.xml
2021-02-04 12:01:10 .....          456          273  xl/worksheets/sheet2.xml
2021-02-04 12:01:12 .....         2631          594  xl/styles.xml
2021-02-04 12:01:12 .....          534          192  _rels/.rels
2021-02-04 12:01:12 .....          617          316  xl/workbook.xml
2021-02-04 12:01:12 .....          653          180  xl/_rels/workbook.xml.rels
2021-02-04 12:01:12 .....         1120          287  [Content_Types].xml
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:01:12              17243         4029  10 files

File created with pandas 1.2:

$ 7z l test-pandas1.2.xlsx
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02

Scanning the drive for archives:
1 file, 7562 bytes (8 KiB)

Listing archive: test-pandas1.2.xlsx

--
Path = test-pandas1.2.xlsx
Type = zip
Physical Size = 7562
Embedded Stub Size = 2383

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:02:24 .....          177          129  docProps/app.xml
2021-02-04 12:02:24 .....          459          233  docProps/core.xml
2021-02-04 12:02:24 .....        10140         1552  xl/theme/theme1.xml
2021-02-04 12:02:24 .....          456          273  xl/worksheets/sheet1.xml
2021-02-04 12:02:24 .....          177          129  docProps/app.xml
2021-02-04 12:02:24 .....          459          233  docProps/core.xml
2021-02-04 12:02:24 .....        10140         1552  xl/theme/theme1.xml
2021-02-04 12:02:24 .....          456          273  xl/worksheets/sheet1.xml
2021-02-04 12:02:24 .....          456          273  xl/worksheets/sheet2.xml
2021-02-04 12:02:24 .....         2631          594  xl/styles.xml
2021-02-04 12:02:24 .....          534          192  _rels/.rels
2021-02-04 12:02:24 .....          617          316  xl/workbook.xml
2021-02-04 12:02:24 .....          653          180  xl/_rels/workbook.xml.rels
2021-02-04 12:02:24 .....         1120          287  [Content_Types].xml
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:02:24              28475         6216  14 files

File produced with pandas 1.2 is sugnificantly larger (Physical Size = 7562 vs Physical Size = 5179) and includes Embedded Stub. And it looks like file from pandas 1.2 larger by exactly the size of Embedded Stub 7562-2383=5179.

Looks like this explains why unzip shows no duplicates.

twoertwein · 2021-02-04T21:23:52Z

thank you @fdzuJ for pointing the difference between unzip and 7z out, I can now reproduce the issue and will look into it.

twoertwein · 2021-02-05T01:19:47Z

the cause for the issue is that 1.2 tries to open the file only one time (with mode="br+"). Theoretically that should work. Unfortunately, writing to a zip file that already has a file with the same name, creates a second file with the same name instead of appending/overwriting the existing file. The following code creates also a corrupt excel file on 1.1.x

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

df.to_excel("test.xlsx", engine="openpyxl")

with open("test.xlsx", mode="br+") as handle:
    with pd.ExcelWriter(handle, mode="a", engine="openpyxl") as writer:
        df.to_excel(writer)

I will create a PR that opens the file two times to avoid this issue.

twoertwein · 2021-02-05T15:04:22Z

The root cause is slightly simpler then I first thought: 1.2 opens the file for reading and writing. But I forgot to call seek(0) after reading, which made openpyxl append a second zip file to the first zip file instead of replacing the content. A simple seek(0) fixes the issue :) I also added truncate() in case the combined workbook happens to be smaller than the existing workbook (not sure whether that is possible).

twoertwein · 2021-02-05T15:29:12Z

Can someone please test whether appending a dataframe with the patched pandas #39605 creates a file that MS Excel doesn't complain about? I added a test case but I don't have MS Excel to test whether it is happy now :)

To apply the patch you simply need to add the following lines

self.handles.handle.seek(0)
self.handles.handle.truncate()

below this line

self.book = load_workbook(self.handles.handle)

in pandas/io/excel/_openpyxl.py (either in master or 1.2.1).

phofl · 2021-02-05T15:35:16Z

Yep, works now as it should.

ryagon42 · 2021-07-19T20:20:21Z

Was this fixed in pandas 1.2.2? I'm encountering the same issue in pandas 1.3.0.

phofl · 2021-07-19T20:22:36Z

Can you provide an excle file plus provide your versions and your example?

ryagon42 · 2021-07-19T20:35:07Z

Sure. I was just running this test code in my python terminal and getting the same message that it needs to be repaired: pandas==1.3.0, openpyxl==3.0.3. Switched back to pandas 1.1.5 and could not reproduce the error.
pandas_test_code.txt
test.xlsx

phofl · 2021-07-19T20:36:26Z

Could you try on 1.2.2 too? This worked at least temporarily on master back then

ryagon42 · 2021-07-19T21:02:44Z

Same issue when running 1.2.2. Here is the produced file.
test_1.2.2.xlsx

twoertwein · 2021-07-19T23:07:51Z

Is it intended that .book can be overwritten by the user?

I don't have MS Excel to test whether the file is corrupt. The following is based on how I understand the code.

Here is again your example code (slight change: use the context manager introduced in 1.2) with comments that (hopefully) describe what is going on:

import openpyxl
import pandas as pd

# create empty excel file
xl_file_name = "test.xlsx"
wb = openpyxl.Workbook()
ws = wb.active
wb.save(xl_file_name)

# write to excel file
book = openpyxl.load_workbook(xl_file_name)
with pd.ExcelWriter(
    xl_file_name, if_sheet_exists="replace", engine="openpyxl", mode="a"
) as xl_writer:
    # pandas holds a read+write file handle to "test.xlsx" and openpyxl loaded the workbook from it
    # if you were to manually call xl_writer.book.save(), it would use the file handle to write the data
    xl_writer.book = book  # you use a workbook that doesn't use Pandas's file handle!
    xl_writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
    chunk = pd.DataFrame({"temp_c": [17.0, 25.0]}, index=["Portland", "Berkeley"])
    chunk.to_excel(xl_writer, merge_cells=False, header=False, sheet_name="Sheet")
    # during __exit__: pandas saves&closes the workbook:
    # 1) xl_writer.book.save() (doesn't advance its file handle) and
    # 2) it truncates its file handle (it truncates too much because it doesn't know that content was
    #    written through the externally created workbook)

Truncate is called if the "appended" excel file is smaller than the original file: let's assume you load an existing file but also remove some of its current sheets.

If it is intended to mutate .book, a workaround would probably be the following code (I can't verify whether the resulting file is corrupt or not):

import openpyxl
import pandas as pd

# create empty excel file
xl_file_name = "test2.xlsx"
wb = openpyxl.Workbook()
ws = wb.active
wb.save(xl_file_name)

# write to excel file
with open(xl_file_name, mode="rb+") as handle:
    book = openpyxl.load_workbook(handle)
    handle.seek(0)
    with pd.ExcelWriter(
        handle, if_sheet_exists="replace", engine="openpyxl", mode="a"
    ) as xl_writer:
        xl_writer.book = book  # still uses the same handle!
        xl_writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
        chunk = pd.DataFrame({"temp_c": [17.0, 25.0]}, index=["Portland", "Berkeley"])
        chunk.to_excel(xl_writer, merge_cells=False, header=False, sheet_name="Sheet")

twoertwein · 2021-07-20T01:12:02Z

The documentation says: "None of the methods and properties are considered public."

phofl · 2021-07-20T08:49:26Z

@twoertwein thanks for checking, should have looked at the code.
Your propsed example works for me.

And I agree that the previous example is not expected to work.

fdzuJ added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 3, 2021

phofl added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 3, 2021

twoertwein mentioned this issue Feb 5, 2021

REGR: appending to existing excel file created corrupt files #39605

Merged

4 tasks

jreback added this to the 1.2.2 milestone Feb 5, 2021

simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Feb 7, 2021

jreback closed this as completed in #39605 Feb 7, 2021

This was referenced Aug 17, 2021

BUG : pandas.DataFrame.to_excel function not working in pandas 1.3.1 #43068

Closed

ENH: Rename ExcelWriter to make clear attributes are not public #43088

Closed

twoertwein mentioned this issue Feb 8, 2022

BUG: Modify an existing xlsx file with PANDAS and MIME TYPE ERROR #45884

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: ExcelWriter with mode='a' corrupts file #39576

BUG: ExcelWriter with mode='a' corrupts file #39576

fdzuJ commented Feb 3, 2021 •

edited

Loading

INSTALLED VERSIONS

twoertwein commented Feb 3, 2021 •

edited

Loading

phofl commented Feb 3, 2021

phofl commented Feb 3, 2021

phofl commented Feb 3, 2021

twoertwein commented Feb 4, 2021

fdzuJ commented Feb 4, 2021 •

edited

Loading

twoertwein commented Feb 4, 2021

twoertwein commented Feb 5, 2021

twoertwein commented Feb 5, 2021

twoertwein commented Feb 5, 2021

phofl commented Feb 5, 2021

ryagon42 commented Jul 19, 2021

phofl commented Jul 19, 2021

ryagon42 commented Jul 19, 2021

phofl commented Jul 19, 2021 •

edited

Loading

ryagon42 commented Jul 19, 2021

twoertwein commented Jul 19, 2021 •

edited

Loading

twoertwein commented Jul 20, 2021

phofl commented Jul 20, 2021

BUG: ExcelWriter with mode='a' corrupts file #39576

BUG: ExcelWriter with mode='a' corrupts file #39576

Comments

fdzuJ commented Feb 3, 2021 • edited Loading

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

twoertwein commented Feb 3, 2021 • edited Loading

phofl commented Feb 3, 2021

phofl commented Feb 3, 2021

phofl commented Feb 3, 2021

twoertwein commented Feb 4, 2021

fdzuJ commented Feb 4, 2021 • edited Loading

twoertwein commented Feb 4, 2021

twoertwein commented Feb 5, 2021

twoertwein commented Feb 5, 2021

twoertwein commented Feb 5, 2021

phofl commented Feb 5, 2021

ryagon42 commented Jul 19, 2021

phofl commented Jul 19, 2021

ryagon42 commented Jul 19, 2021

phofl commented Jul 19, 2021 • edited Loading

ryagon42 commented Jul 19, 2021

twoertwein commented Jul 19, 2021 • edited Loading

twoertwein commented Jul 20, 2021

phofl commented Jul 20, 2021

fdzuJ commented Feb 3, 2021 •

edited

Loading

Output of `pd.show_versions()`

twoertwein commented Feb 3, 2021 •

edited

Loading

fdzuJ commented Feb 4, 2021 •

edited

Loading

phofl commented Jul 19, 2021 •

edited

Loading

twoertwein commented Jul 19, 2021 •

edited

Loading