Skip to content

ENH: Allow overwriting existing sheets when appending to excel files #40230

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mirober opened this issue Mar 4, 2021 · 6 comments · Fixed by #40231
Closed

ENH: Allow overwriting existing sheets when appending to excel files #40230

mirober opened this issue Mar 4, 2021 · 6 comments · Fixed by #40231
Labels
Enhancement IO Excel read_excel, to_excel
Milestone

Comments

@mirober
Copy link
Contributor

mirober commented Mar 4, 2021

Is your feature request related to a problem?

As far as I can tell there is no way to write a DataFrame to a specific (named) sheet in an existing Excel file, creating the sheet if it doesn't exist or overwriting if it does. The closest thing to this that exists is the openpyxl append mode, but if you try to write to sheet_which_already_exists with this, the data will be written to sheet_which_already_exists1. Using write mode overwrites the entire book.

Being able to overwrite existing sheets would be useful for populating "raw data" sheets in existing reports/publications on a regular basis, with pivot tables etc updating automatically.

This seems to be a very common issue:
https://stackoverflow.com/questions/62618680/overwrite-an-excel-sheet-with-pandas-dataframe-without-affecting-other-sheets
https://stackoverflow.com/questions/47602533/append-a-pandas-dataframe-to-an-existing-excel-table?noredirect=1&lq=1
https://stackoverflow.com/questions/34744863/python-how-to-use-excelwriter-to-write-into-an-existing-worksheet?noredirect=1&lq=1
#28653
#29990

Describe the solution you'd like

There are a few potential behaviours someone might want when they write to an existing sheet in append mode:

  • the current behaviour, which creates a new sheet with a different name
  • to delete the current contents of the sheet and write their data
  • to write the data without deleting anything, for example to add columns or rows to an existing table, or a new table in a different position

Originally I considered adding a flag like overwrite=True to pd.ExcelWriter but I don't think this would be sufficient to cover all the potential behaviours, so would suggest an enum along the lines of the if_exists option in df.to_sql, with the following options:

  • new_sheet: Create a new sheet with a different name.
  • overwrite_sheet: Delete the contents of the sheet, then write to it.
  • overwrite_cells: Write directly to the named sheet without deleting the previous contents.

This option would only be relevant when using mode="a" with the openpyxl engine.

API breaking implications

The new_sheet option would replicate the current behaviour, so this could be made the default.

Additional context

Demonstration of current behaviour:

import pandas as pd
from openpyxl import load_workbook
import io

df1 = pd.DataFrame({
    "col1": ["apple", "banana", "carrot"],
    "col2": [1, 2, 3],
})
df2 = pd.DataFrame({"col1": [5]})

excel = io.BytesIO()

with pd.ExcelWriter(excel, engine="openpyxl", mode="w") as writer:
    # write two sheets
    df1.to_excel(writer, sheet_name="sheet1", index=False)
    df1.to_excel(writer, sheet_name="sheet2", index=False)

with pd.ExcelWriter(excel, engine="openpyxl", mode="a") as writer:
    # try to overwrite one of them
    df2.to_excel(writer, sheet_name="sheet1", index=False)

wb = load_workbook(excel)
# sheet1, sheet2, sheet11
print(wb.sheetnames)

Why does it work this way? At the moment, when the openpyxl ExcelWriter is initialised in append mode it loads the workbook, but does not update self.sheets to reflect the sheets in the loaded file.
https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_openpyxl.py#L54-L58

This seems unintentional and means that when write_cells is called with a sheet name that already exists in the file, it is not found in self.sheets:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_openpyxl.py#L414-L419
so we follow the else branch, and when assigning to wks.title openpyxl suffixes the sheet name to avoid a naming conflict.

@mirober mirober added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 4, 2021
@jreback
Copy link
Contributor

jreback commented Mar 5, 2021

I sympathize, but why is this a pandas issue when the underlying writer's don't support this?

@jreback jreback added the IO Excel read_excel, to_excel label Mar 5, 2021
@jreback
Copy link
Contributor

jreback commented Mar 5, 2021

cc @rhshadrach

@rhshadrach
Copy link
Member

rhshadrach commented Mar 5, 2021

IIUC, pandas is not updating self.sheets when mode is append, so the check if sheet_name in self.sheets: is False, which makes us run wks = self.book.create_sheet(). On top of this, I expect it to be easy to remove the content of the existing sheet if the user would like to replace it entirely.

Assuming all that needs to be done here is updating self.sheets appropriately and possibly removing the content of an existing sheet, I'm generally positive toward the idea. My only remaining question is if this can be implemented for all excel engines, or if not, which ones (I've only looked at openpyxl).

I think the choosing a name similar to DataFrame.to_sql makes sense, but if_exists sounds like it is applying to the entire workbook to me. Would if_sheet_exists be more clear here? Also some alternative suggestions to the values:

  • if_sheet_exists='fail'
  • if_sheet_exists='replace'
  • if_sheet_exists='overwrite'
  • if_sheet_exists='avoid'

@mirober
Copy link
Contributor Author

mirober commented Mar 5, 2021

I sympathize, but why is this a pandas issue when the underlying writer's don't support this?

My only remaining question is if this can be implemented for all excel engines, or if not, which ones (I've only looked at openpyxl).

I think this option only really makes sense in append mode, and therefore only with the openpyxl engine. In write mode the user is in complete control so there is no real "existing sheet" problem. Openpyxl supports all of the options I've suggested, the reason pandas does what it does at the moment seems to be just a quirk of the OpenpyxlWriter initialisation code.

I think the choosing a name similar to DataFrame.to_sql makes sense, but if_exists sounds like it is applying to the entire workbook to me. Would if_sheet_exists be more clear here? Also some alternative suggestions to the values:

Those seem reasonable names to me. An alternative to 'avoid' might be 'create' or 'new'? I'm fairly indifferent

@rhshadrach
Copy link
Member

Makes sense - while I don't like adding engine-specific arguments, IMO this is still a clear enough use-case that I'm +1. cc @jreback for any thoughts.

I do like 'new' instead of 'avoid'. Also, maybe if_sheet_exists='clear' is better than if_sheet_exists='replace'.

@mirober
Copy link
Contributor Author

mirober commented Mar 6, 2021

I agree it's not ideal to be adding a bunch of engine-specific options, but I think the functionality is worth it. I can come up with a more detailed motivating example if it's not obvious what kind of thing it would allow.

I think replace is clearer (heh) than clear. It also matches the if_exists="replace" argument to df.to_sql and does more or less the same thing: truncate the target location and then write the dataframe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants
@jreback @mirober @rhshadrach @lithomas1 and others