Skip to content

BUG: ExcelWriter with mode='a' corrupts file #39576

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
fdzuJ opened this issue Feb 3, 2021 · 19 comments · Fixed by #39605
Closed
2 of 3 tasks

BUG: ExcelWriter with mode='a' corrupts file #39576

fdzuJ opened this issue Feb 3, 2021 · 19 comments · Fixed by #39605
Labels
Bug IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@fdzuJ
Copy link

fdzuJ commented Feb 3, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame()

with pd.ExcelWriter('test.xlsx', mode='w', engine='openpyxl') as writer:
    df.to_excel(writer)

with pd.ExcelWriter('test.xlsx', mode='a', engine='openpyxl') as writer:
    df.to_excel(writer)

Problem description

Appending to excel file with pandas>=1.2 produces file which Excel reports as corrupted and opens only after repair, forcing you to first open and repair file before passing it to someone else.
Looking at file with 7-Zip File Manager shows that after writing file with mode='a', some .xml files (app.xml, core.xml, theme1.xml, sheet1.xml) inside archive are duplicated (somehow). After Excel rapars file duplicates are gone, everything else looks the same.

With pandas 1.1.5 and same versions of other packages everything works as expected.

Expected Output

File which could be opened by Excel without any warnings.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 9d598a5
python : 3.8.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Russian_Russia.1251
pandas : 1.2.1
numpy : 1.20.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.6.0.post20210108
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@fdzuJ fdzuJ added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 3, 2021
@twoertwein
Copy link
Member

twoertwein commented Feb 3, 2021

thank you for your report, example, and the analysis!

I have no access to MS Excel (gnumeric and libreOffice seem to open the file without complaining). I ran your code twice. When opening it with an archive manager, I see the following folder structure:

docProps/{app.xml, core.xml}
_rels/.rels
xl/{styles.xml, workbook.xml}
xl/_rels/workbook.xml.rels
xl/_theme/theme1.xml
xl/worksheets/{sheet1.xml, sheet2.xml}

I don't know how a proper xlsx file should be structured but I do not see any obvious duplications.

Do you have access to a Unix machine? It would be interesting to see whether it works for you on Unix, or not. I think from 1.1 to 1.2, pandas is handing a read+writeable file handle to openpyxl. Maybe there is some platform-specific behavior (in pandas or openpyxl).

@phofl
Copy link
Member

phofl commented Feb 3, 2021

Can reproduce and behavior is the same on ubuntu

@phofl phofl added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 3, 2021
@phofl
Copy link
Member

phofl commented Feb 3, 2021

excel

Looking at this a bit closer, the some file attributes differ. This is consistent for all duplicates

@phofl
Copy link
Member

phofl commented Feb 3, 2021

Our tests show the same behavior

@twoertwein
Copy link
Member

I assume the issue is caused by #37639. I would be happy helping to debug this but I cannot reproduce it (on master):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

df.to_excel("test.xlsx", engine="openpyxl")

with pd.ExcelWriter("test.xlsx", mode="w", engine="openpyxl") as writer:
    df.to_excel(writer)
with pd.ExcelWriter("test.xlsx", mode="a", engine="openpyxl") as writer:
    df.to_excel(writer)
$ python -W default test.py
$ unzip -l test.xlsx
Archive:  test.xlsx
  Length      Date    Time    Name
---------  ---------- -----   ----
      177  02-03-2021 21:50   docProps/app.xml
      459  02-03-2021 21:50   docProps/core.xml
    10140  02-03-2021 21:50   xl/theme/theme1.xml
      771  02-03-2021 21:50   xl/worksheets/sheet1.xml
      771  02-03-2021 21:50   xl/worksheets/sheet2.xml
     2937  02-03-2021 21:50   xl/styles.xml
      534  02-03-2021 21:50   _rels/.rels
      617  02-03-2021 21:50   xl/workbook.xml
      653  02-03-2021 21:50   xl/_rels/workbook.xml.rels
     1120  02-03-2021 21:50   [Content_Types].xml
---------                     -------
    18179                     10 files

@fdzuJ
Copy link
Author

fdzuJ commented Feb 4, 2021

I'm also able to reproduce it on linux (debian) and unzip doesn't show any duplicates, but p7zip package does.

File created with pandas 1.1:

$ 7z l test-pandas1.1.xlsx
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02

Scanning the drive for archives:
1 file, 5179 bytes (6 KiB)

Listing archive: test-pandas1.1.xlsx

--
Path = test-pandas1.1.xlsx
Type = zip
Physical Size = 5179

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:01:10 .....          177          129  docProps/app.xml
2021-02-04 12:01:10 .....          459          233  docProps/core.xml
2021-02-04 12:01:10 .....        10140         1552  xl/theme/theme1.xml
2021-02-04 12:01:10 .....          456          273  xl/worksheets/sheet1.xml
2021-02-04 12:01:10 .....          456          273  xl/worksheets/sheet2.xml
2021-02-04 12:01:12 .....         2631          594  xl/styles.xml
2021-02-04 12:01:12 .....          534          192  _rels/.rels
2021-02-04 12:01:12 .....          617          316  xl/workbook.xml
2021-02-04 12:01:12 .....          653          180  xl/_rels/workbook.xml.rels
2021-02-04 12:01:12 .....         1120          287  [Content_Types].xml
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:01:12              17243         4029  10 files

File created with pandas 1.2:

$ 7z l test-pandas1.2.xlsx
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02

Scanning the drive for archives:
1 file, 7562 bytes (8 KiB)

Listing archive: test-pandas1.2.xlsx

--
Path = test-pandas1.2.xlsx
Type = zip
Physical Size = 7562
Embedded Stub Size = 2383

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:02:24 .....          177          129  docProps/app.xml
2021-02-04 12:02:24 .....          459          233  docProps/core.xml
2021-02-04 12:02:24 .....        10140         1552  xl/theme/theme1.xml
2021-02-04 12:02:24 .....          456          273  xl/worksheets/sheet1.xml
2021-02-04 12:02:24 .....          177          129  docProps/app.xml
2021-02-04 12:02:24 .....          459          233  docProps/core.xml
2021-02-04 12:02:24 .....        10140         1552  xl/theme/theme1.xml
2021-02-04 12:02:24 .....          456          273  xl/worksheets/sheet1.xml
2021-02-04 12:02:24 .....          456          273  xl/worksheets/sheet2.xml
2021-02-04 12:02:24 .....         2631          594  xl/styles.xml
2021-02-04 12:02:24 .....          534          192  _rels/.rels
2021-02-04 12:02:24 .....          617          316  xl/workbook.xml
2021-02-04 12:02:24 .....          653          180  xl/_rels/workbook.xml.rels
2021-02-04 12:02:24 .....         1120          287  [Content_Types].xml
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:02:24              28475         6216  14 files

File produced with pandas 1.2 is sugnificantly larger (Physical Size = 7562 vs Physical Size = 5179) and includes Embedded Stub. And it looks like file from pandas 1.2 larger by exactly the size of Embedded Stub 7562-2383=5179.

Looks like this explains why unzip shows no duplicates.

@twoertwein
Copy link
Member

thank you @fdzuJ for pointing the difference between unzip and 7z out, I can now reproduce the issue and will look into it.

@twoertwein
Copy link
Member

the cause for the issue is that 1.2 tries to open the file only one time (with mode="br+"). Theoretically that should work. Unfortunately, writing to a zip file that already has a file with the same name, creates a second file with the same name instead of appending/overwriting the existing file. The following code creates also a corrupt excel file on 1.1.x

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

df.to_excel("test.xlsx", engine="openpyxl")

with open("test.xlsx", mode="br+") as handle:
    with pd.ExcelWriter(handle, mode="a", engine="openpyxl") as writer:
        df.to_excel(writer)

I will create a PR that opens the file two times to avoid this issue.

@twoertwein
Copy link
Member

The root cause is slightly simpler then I first thought: 1.2 opens the file for reading and writing. But I forgot to call seek(0) after reading, which made openpyxl append a second zip file to the first zip file instead of replacing the content. A simple seek(0) fixes the issue :) I also added truncate() in case the combined workbook happens to be smaller than the existing workbook (not sure whether that is possible).

@twoertwein
Copy link
Member

Can someone please test whether appending a dataframe with the patched pandas #39605 creates a file that MS Excel doesn't complain about? I added a test case but I don't have MS Excel to test whether it is happy now :)

To apply the patch you simply need to add the following lines

self.handles.handle.seek(0)
self.handles.handle.truncate()

below this line

self.book = load_workbook(self.handles.handle)

in pandas/io/excel/_openpyxl.py (either in master or 1.2.1).

@phofl
Copy link
Member

phofl commented Feb 5, 2021

Yep, works now as it should.

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Feb 7, 2021
@ryagon42
Copy link

Was this fixed in pandas 1.2.2? I'm encountering the same issue in pandas 1.3.0.

@phofl
Copy link
Member

phofl commented Jul 19, 2021

Can you provide an excle file plus provide your versions and your example?

@ryagon42
Copy link

Sure. I was just running this test code in my python terminal and getting the same message that it needs to be repaired: pandas==1.3.0, openpyxl==3.0.3. Switched back to pandas 1.1.5 and could not reproduce the error.
pandas_test_code.txt
test.xlsx

@phofl
Copy link
Member

phofl commented Jul 19, 2021

Could you try on 1.2.2 too? This worked at least temporarily on master back then

@ryagon42
Copy link

Same issue when running 1.2.2. Here is the produced file.
test_1.2.2.xlsx

@twoertwein
Copy link
Member

twoertwein commented Jul 19, 2021

Is it intended that .book can be overwritten by the user?

I don't have MS Excel to test whether the file is corrupt. The following is based on how I understand the code.

Here is again your example code (slight change: use the context manager introduced in 1.2) with comments that (hopefully) describe what is going on:

import openpyxl
import pandas as pd

# create empty excel file
xl_file_name = "test.xlsx"
wb = openpyxl.Workbook()
ws = wb.active
wb.save(xl_file_name)

# write to excel file
book = openpyxl.load_workbook(xl_file_name)
with pd.ExcelWriter(
    xl_file_name, if_sheet_exists="replace", engine="openpyxl", mode="a"
) as xl_writer:
    # pandas holds a read+write file handle to "test.xlsx" and openpyxl loaded the workbook from it
    # if you were to manually call xl_writer.book.save(), it would use the file handle to write the data
    xl_writer.book = book  # you use a workbook that doesn't use Pandas's file handle!
    xl_writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
    chunk = pd.DataFrame({"temp_c": [17.0, 25.0]}, index=["Portland", "Berkeley"])
    chunk.to_excel(xl_writer, merge_cells=False, header=False, sheet_name="Sheet")
    # during __exit__: pandas saves&closes the workbook:
    # 1) xl_writer.book.save() (doesn't advance its file handle) and
    # 2) it truncates its file handle (it truncates too much because it doesn't know that content was
    #    written through the externally created workbook)

Truncate is called if the "appended" excel file is smaller than the original file: let's assume you load an existing file but also remove some of its current sheets.

If it is intended to mutate .book, a workaround would probably be the following code (I can't verify whether the resulting file is corrupt or not):

import openpyxl
import pandas as pd

# create empty excel file
xl_file_name = "test2.xlsx"
wb = openpyxl.Workbook()
ws = wb.active
wb.save(xl_file_name)

# write to excel file
with open(xl_file_name, mode="rb+") as handle:
    book = openpyxl.load_workbook(handle)
    handle.seek(0)
    with pd.ExcelWriter(
        handle, if_sheet_exists="replace", engine="openpyxl", mode="a"
    ) as xl_writer:
        xl_writer.book = book  # still uses the same handle!
        xl_writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
        chunk = pd.DataFrame({"temp_c": [17.0, 25.0]}, index=["Portland", "Berkeley"])
        chunk.to_excel(xl_writer, merge_cells=False, header=False, sheet_name="Sheet")

@twoertwein
Copy link
Member

The documentation says: "None of the methods and properties are considered public."

@phofl
Copy link
Member

phofl commented Jul 20, 2021

@twoertwein thanks for checking, should have looked at the code.
Your propsed example works for me.

And I agree that the previous example is not expected to work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants