-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Adding XlsxWriter as an ExcelWriter() option #4542
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
gr8! could certainly be added as an optional deps see http://pandas.pydata.org/pandas-docs/dev/install.html#optional-dependencies; and could try in preference (if installed) over other writer libraries, then fallback to others (if installed) you may need to rework the infrastructure a bit to do this, e.g. at the top of the file various imports should be tried (see then the writers should take a list of lmk if you need help |
pls hook up travis as well; you will also need to add your library to |
as a side issue; people have been thinking about how to pass formatting info to excel via these writers; either via some kind of dict or a |
I'll try to make the changes as noninvasive as possible.
Will do. XlsxWriter already uses Travis so it shouldn't be too painful. I normally test locally with a range of Python 2/3 versions as well. John |
gr8! |
@jmcnamara how's this going? Definitely interested in being able to hook up other writers to the core...Does ExcelWriter have the same API as openpyxl for saving excel files? |
I made some initial progress getting XlsxWriter integrated and working, the code is on the enh_xlsxwriter_dev branch of my fork. For example the following now works using XlsxWriter: from pandas import ExcelWriter
from pandas import DataFrame
l1 = [1,2,3,4]
l2 = [1,2,3,4]
df1 = DataFrame({'Stimulus Time': l1, 'Reaction Time': l2})
writer = ExcelWriter('path_to_file2.xlsx', writer_options={'use_xlsxwriter': True})
df1.to_excel(writer, sheet_name='Sheet1', index=False)
writer.save() Dropping the optional
I think it should be reasonably easy to plug in other writers as well. The required API changes would be minor and backward compatible: basically the additional dict of options for the underlying Excel writer (shown above). I'll try to get it to a more complete state in the next week, with tests, so you review it properly and let me know if you think it will fit in with the current code.
It's API and scope is closer to Xlwt, i.e., it only creates new files and doesn't it doesn't rewrite files (although that could be simulated via the existing Pandas Excel reading functionality). In terms of functionality and features XlsxWriter goes beyond Xlwt or OpenPyXL and opens the possibility for adding charts, images, autofilters and conditional formatting as Pandas Excel writer options. However, I need to do a bit more integration work first. :-) It should also be about 10x faster than OpenPyXL for large files. However, to get that performance the cells would have to be written in row order. Currently in def get_formatted_cells(self):
for cell in itertools.chain(self._format_header(), self._format_body()
):
cell.val = self._format_value(cell.val)
yield cell Would it be acceptable to convert this, and associated header handling functions, to row order via Without it XlsxWriter should still be 5x faster than OpenPyXL but would consume a similar (large) amount of memory for large files. There in some background on this, if you are interested, in the Working with Memory and Performance section of the XlsxsWriter docs. Anyway, I'll try push some more complete code for you and the others to look at in the next week. John. |
@jmcnamara I would like to see an API more like this
and can default the engine (and possible use the last bit is actually easy to set up see: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L170 |
@jmcnamara sounds good - it's definitely not a problem to go in row order vs. column order - I wouldn't know (without looking back at the history of the code) whether that was an explicit decision or not anyways. 5x speedup is absolutely non-trivial, so making a change to enable that is definitely a good thing. |
I changed the API to In terms of the config it will probably need two items since the user may not be using the same engine for reading and writing. How about something like:
John |
@jmcnamara pls make sure to hook up Travis |
@jmcnamara one thing that would be nice would be to, instead of hard coding the engine, just create the format like: You could do something like: class _UnsupportedFileTypeError(Exception):
pass
# __init__ method for ExcelWriter/
def __init__(self, path, engine=None, **engine_kwargs):
if engine is None:
engine = config.get_option('io.excel.engine')
try:
method = getattr(self, "_init_%s" % engine)
except AttributeError:
raise ValueError("No engine: %s" % engine)
try:
method(path, **engine_kwargs)
except _UnsupportedFileTypeError:
self._init_standard(path, **engine_kwargs) and then the individual engines handle the rest of the setup. (like setting up the workbook, etc.). If this seems too complicated, then we can always do it later after you've added in xlsxwriter and worked through the various issues with adding a new engine. |
Already done in both cases. :-) |
@jmcnamara did you send this as a PR? (or its just on your branch atm?) |
I haven't sent a PR. I am a few days work away from that as yet. |
That is a good suggestion. It makes it more generic and easier to add other engines. I also made the same change in another place. I omitted the |
The code is getting close to a PR. I'll rebase/squash it onto another branch prior to that so that the merge is only one commit. I still have a bit more to add first but any comments on the existing code are welcome. I wrote 28 Here are some initial benchmark figures for the three supported writer modules:
Here is the simple-minded benchmark program: # Simple Benchmark test for the Pandas Excel writer modules.
import sys
from time import clock
from pandas import DataFrame
from pandas.core import config
import numpy as np
# Get the number of rows to test.
if len(sys.argv) > 1:
row_max = int(sys.argv[1])
else:
row_max = 1000
# Time the Excel writer modules.
def time_excel_engine(df, engine):
filename = 'test_%s.xlsx' % engine
config.set_option('io.excel.writer_engine', engine)
start_time = clock()
df.to_excel(filename, sheet_name='Sheet1', index=False)
elapsed = clock() - start_time
print("Engine = %-10s Rows = %6d time = %6.2f" % (engine, row_max, elapsed))
# Create a dataframe to write out to Excel.
df = DataFrame(np.random.randn(row_max, 4), columns=['A', 'B', 'C', 'D'])
# Time the modules.
time_excel_engine(df, 'openpyxl')
time_excel_engine(df, 'xlsxwriter')
if row_max < 64000:
time_excel_engine(df, 'xlwt') If I can refactor the format.py code to use row order then we should be able to get a doubling of speed from Adding the PyExcelerate module as a writer would get another 50% on that and with the reworked |
you can put this in the form of a vbench I think you could try to create a vbench for this, nothing there now but create a file like this: (named say excel): https://github.com/pydata/pandas/blob/master/vb_suite/ctors.py, and add it to the suite.py. test using: test_perf.sh -b -t (lots of options to test_perf, e.g. you might want to start all of the tests with excel_, then you can add the |
@jmcnamara how to use pandas.to_excel(Write or maybe pandas.ExcelWriter) to add some data to an existed file , but not rewrite it?? |
You would have to read it in, make your changes, and print to the same filename. |
related/closes #4517
Hi,
I am the author of the XlsxWriter module, a Python module for writing Excel XLSX files.
I am interested in submitting a PR to add this as a option when writing Excel xlsx files via
to_excel()
.It has a similar interface to the xlwt module but creates xlsx files. For large data files it is ~10x faster than OpenPyXL while maintaining low and constant memory usage. I occasionally see the current xlsx writer's speed and memory usage mentioned as a bug/limitation so it could be useful to have a fallback module option in those cases.
XlsxWriter also supports features such as charts and auto-filters that could be exposed at a later stage. See https://xlsxwriter.readthedocs.org/ for more details.
It uses standard libraries only and has an extensive test suite for Python 2.6, 2.7, 3.1, 3.2 and 3.3.
If you are interested I can work on a prototype and send a PR for you to evaluate. I have read CONTRIBUTING.md and I think I should be able to meet the criteria.
Regards,
John.
The text was updated successfully, but these errors were encountered: