ENH: Adding XlsxWriter as an ExcelWriter() option #4542

jmcnamara · 2013-08-12T13:39:22Z

related/closes #4517

Hi,

I am the author of the XlsxWriter module, a Python module for writing Excel XLSX files.

I am interested in submitting a PR to add this as a option when writing Excel xlsx files via to_excel().

It has a similar interface to the xlwt module but creates xlsx files. For large data files it is ~10x faster than OpenPyXL while maintaining low and constant memory usage. I occasionally see the current xlsx writer's speed and memory usage mentioned as a bug/limitation so it could be useful to have a fallback module option in those cases.

XlsxWriter also supports features such as charts and auto-filters that could be exposed at a later stage. See https://xlsxwriter.readthedocs.org/ for more details.

It uses standard libraries only and has an extensive test suite for Python 2.6, 2.7, 3.1, 3.2 and 3.3.

If you are interested I can work on a prototype and send a PR for you to evaluate. I have read CONTRIBUTING.md and I think I should be able to meet the criteria.

Regards,

John.

The text was updated successfully, but these errors were encountered:

jreback · 2013-08-12T13:48:45Z

gr8!

could certainly be added as an optional deps see http://pandas.pydata.org/pandas-docs/dev/install.html#optional-dependencies; and could try in preference (if installed) over other writer libraries, then fallback to others (if installed)

you may need to rework the infrastructure a bit to do this,

e.g. at the top of the file various imports should be tried (see io/html.py for how this is done).

then the writers should take a list of writers? to try in sequence (else will finally raise if nothing is installed)

lmk if you need help

jreback · 2013-08-12T13:50:16Z

pls hook up travis as well; you will also need to add your library to ci/requirements-??? for testing (you can add to all if you want), with a version specified (if you have multiple versions then you can do that too)

jreback · 2013-08-12T14:15:02Z

as a side issue; people have been thinking about how to pass formatting info to excel via these writers; either via some kind of dict or a Style type of object (generally per column, though could in theory be per cell), any thoughts on this?

jmcnamara · 2013-08-12T14:15:28Z

you may need to rework the infrastructure a bit to do this,

I'll try to make the changes as noninvasive as possible.

pls hook up travis as well; you will also need to add your library to ci/requirements-???

Will do. XlsxWriter already uses Travis so it shouldn't be too painful. I normally test locally with a range of Python 2/3 versions as well.

John

jreback · 2013-08-12T14:37:48Z

gr8!

jtratner · 2013-08-31T02:40:11Z

@jmcnamara how's this going? Definitely interested in being able to hook up other writers to the core...Does ExcelWriter have the same API as openpyxl for saving excel files?

jmcnamara · 2013-09-01T12:11:44Z

@jtratner, @jreback

how's this going?

I made some initial progress getting XlsxWriter integrated and working, the code is on the enh_xlsxwriter_dev branch of my fork. For example the following now works using XlsxWriter:

    from pandas import ExcelWriter
    from pandas import DataFrame

    l1 = [1,2,3,4]
    l2 = [1,2,3,4]
    df1 = DataFrame({'Stimulus Time': l1, 'Reaction Time': l2})

    writer = ExcelWriter('path_to_file2.xlsx', writer_options={'use_xlsxwriter': True})
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    writer.save()

Dropping the optional writer_options parameter defaults back to OpenPyXL with similar results.

Definitely interested in being able to hook up other writers to the core

I think it should be reasonably easy to plug in other writers as well. The required API changes would be minor and backward compatible: basically the additional dict of options for the underlying Excel writer (shown above).

I'll try to get it to a more complete state in the next week, with tests, so you review it properly and let me know if you think it will fit in with the current code.

Does ExcelWriter have the same API as openpyxl for saving excel files?

It's API and scope is closer to Xlwt, i.e., it only creates new files and doesn't it doesn't rewrite files (although that could be simulated via the existing Pandas Excel reading functionality). In terms of functionality and features XlsxWriter goes beyond Xlwt or OpenPyXL and opens the possibility for adding charts, images, autofilters and conditional formatting as Pandas Excel writer options.

However, I need to do a bit more integration work first. :-)

It should also be about 10x faster than OpenPyXL for large files. However, to get that performance the cells would have to be written in row order. Currently in format.py they are handled in column order via chain:

    def get_formatted_cells(self):
        for cell in itertools.chain(self._format_header(), self._format_body()
                                    ):
            cell.val = self._format_value(cell.val)
            yield cell

Would it be acceptable to convert this, and associated header handling functions, to row order via izip_longest() or some similar iterator?

Without it XlsxWriter should still be 5x faster than OpenPyXL but would consume a similar (large) amount of memory for large files. There in some background on this, if you are interested, in the Working with Memory and Performance section of the XlsxsWriter docs.

Anyway, I'll try push some more complete code for you and the others to look at in the next week.

John.

jreback · 2013-09-01T12:31:32Z

@jmcnamara I would like to see an API more like this

df.to_excel(path_or_buf, sheet_name, engine='openpyxl|xlsxwriter'...., .....)
additional options can be passed directly to the engine if needed (as kwargs)

and can default the engine (and possible use get_option('io.excel.default_engine=....')

the last bit is actually easy to set up see: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L170

jtratner · 2013-09-01T12:53:05Z

@jmcnamara sounds good - it's definitely not a problem to go in row order vs. column order - I wouldn't know (without looking back at the history of the code) whether that was an explicit decision or not anyways. 5x speedup is absolutely non-trivial, so making a change to enable that is definitely a good thing.

jmcnamara · 2013-09-01T20:49:22Z

@jreback

I changed the API to engine=, added a default via the config handler and wrote some tests. The initial code is on my fork/branch.

In terms of the config it will probably need two items since the user may not be using the same engine for reading and writing. How about something like:

get_option('io.excel.writer_engine=....')
get_option('io.excel.reader_engine=....')

John

jreback · 2013-09-02T00:08:05Z

@jmcnamara pls make sure to hook up Travis
and you will need to update the ci/requirements file to make sure that the xlsxwriter is downloaded

jtratner · 2013-09-02T00:39:35Z

@jmcnamara one thing that would be nice would be to, instead of hard coding the engine, just create the format like: '_init_%s' % engine and then check whether the engine exists on the class (so it's simple to bind new methods to the class and hook into the writer). The init method should take the path, so it could raise an error about getting an .xls file. (in fact, you could call like _init_blah(path, **engine_kwargs)). I think it shouldn't require too much effort to run it that way (basically just extending the setup you already have).

You could do something like:

class _UnsupportedFileTypeError(Exception):
    pass

# __init__ method for ExcelWriter/

def __init__(self, path, engine=None, **engine_kwargs):

    if engine is None:
        engine = config.get_option('io.excel.engine')

    try:
        method = getattr(self, "_init_%s" % engine)
    except AttributeError:
        raise ValueError("No engine: %s" % engine)
    try:
        method(path, **engine_kwargs)
    except _UnsupportedFileTypeError:
        self._init_standard(path, **engine_kwargs)

and then the individual engines handle the rest of the setup. (like setting up the workbook, etc.). If this seems too complicated, then we can always do it later after you've added in xlsxwriter and worked through the various issues with adding a new engine.

jmcnamara · 2013-09-02T07:12:20Z

@jreback

make sure to hook up Travis and you will need to update the ci/requirements file to make sure that the xlsxwriter is downloaded

Already done in both cases. :-)

jreback · 2013-09-02T13:09:45Z

@jmcnamara did you send this as a PR? (or its just on your branch atm?)

jmcnamara · 2013-09-02T13:24:34Z

@jreback

I haven't sent a PR. I am a few days work away from that as yet.

jmcnamara · 2013-09-03T12:09:58Z

@jtratner

one thing that would be nice would be to, instead of hard coding the engine, just create the format like

That is a good suggestion. It makes it more generic and easier to add other engines. I also made the same change in another place.

I omitted the _UnsupportedFileTypeError part however since the xlwt and openpyxl called methods don't throw any exception.

jmcnamara · 2013-09-03T12:23:19Z

@jreback @jtratner

The code is getting close to a PR. I'll rebase/squash it onto another branch prior to that so that the merge is only one commit. I still have a bit more to add first but any comments on the existing code are welcome.

I wrote 28 xlsxwriter specific tests and added xlsxwriter variations of all of the existing openpyxl tests.

Here are some initial benchmark figures for the three supported writer modules:

$ python xltime.py 100
Engine = openpyxl   Rows =    100 time =   0.12
Engine = xlsxwriter Rows =    100 time =   0.02
Engine = xlwt       Rows =    100 time =   0.03

$ python xltime.py 1000
Engine = openpyxl   Rows =   1000 time =   0.61
Engine = xlsxwriter Rows =   1000 time =   0.14
Engine = xlwt       Rows =   1000 time =   0.12

$ python xltime.py 10000
Engine = openpyxl   Rows =  10000 time =   5.60
Engine = xlsxwriter Rows =  10000 time =   1.07
Engine = xlwt       Rows =  10000 time =   0.94

$ python xltime.py 50000
Engine = openpyxl   Rows =  50000 time =  26.80
Engine = xlsxwriter Rows =  50000 time =   4.89
Engine = xlwt       Rows =  50000 time =   4.40

$ python xltime.py 100000
Engine = openpyxl   Rows = 100000 time =  54.53
Engine = xlsxwriter Rows = 100000 time =  10.21

# xlwt and the xls format only supports 64,000 rows.

Here is the simple-minded benchmark program:

# Simple Benchmark test for the Pandas Excel writer modules.
import sys
from time import clock
from pandas import DataFrame
from pandas.core import config
import numpy as np


# Get the number of rows to test.
if len(sys.argv) > 1:
    row_max = int(sys.argv[1])
else:
    row_max = 1000


# Time the Excel writer modules.
def time_excel_engine(df, engine):

    filename = 'test_%s.xlsx' % engine
    config.set_option('io.excel.writer_engine', engine)

    start_time = clock()
    df.to_excel(filename, sheet_name='Sheet1', index=False)
    elapsed = clock() - start_time

    print("Engine = %-10s Rows = %6d time = %6.2f" % (engine, row_max, elapsed))


# Create a dataframe to write out to Excel.
df = DataFrame(np.random.randn(row_max, 4), columns=['A', 'B', 'C', 'D'])

# Time the modules.
time_excel_engine(df, 'openpyxl')

time_excel_engine(df, 'xlsxwriter')

if row_max < 64000:
    time_excel_engine(df, 'xlwt')

If I can refactor the format.py code to use row order then we should be able to get a doubling of speed from xlsxwriter.

Adding the PyExcelerate module as a writer would get another 50% on that and with the reworked engine architecture it should be easy to add that or other modules.

jreback · 2013-09-03T12:31:51Z

you can put this in the form of a vbench

I think you could try to create a vbench for this, nothing there now but create a file like this: (named say excel): https://github.com/pydata/pandas/blob/master/vb_suite/ctors.py, and add it to the suite.py.

test using:

test_perf.sh -b -t

(lots of options to test_perf, e.g. you might want to start all of the tests with excel_, then you can add the
-r excel option to ONLY test thoses specific vbenches

aa3222119 · 2017-02-16T10:17:57Z

@jmcnamara how to use pandas.to_excel(Write or maybe pandas.ExcelWriter) to add some data to an existed file , but not rewrite it??

ProsperousHeart · 2019-02-21T07:11:55Z

@jmcnamara how to use pandas.to_excel(Write or maybe pandas.ExcelWriter) to add some data to an existed file , but not rewrite it??

You would have to read it in, make your changes, and print to the same filename.

jmcnamara mentioned this issue Sep 3, 2013

ENH: Added xlsxwriter as an ExcelWriter option. #4739

Closed

This was referenced Sep 15, 2013

ENH: Added xlsxwriter as an ExcelWriter option. #4847

Closed

ENH: Added xlsxwriter as an ExcelWriter option. #4857

Merged

jtratner closed this as completed in #4857 Sep 22, 2013

jreback mentioned this issue Sep 27, 2013

Add PyExcelerate as Excel Writer Engine #4517

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Adding XlsxWriter as an ExcelWriter() option #4542

ENH: Adding XlsxWriter as an ExcelWriter() option #4542

jmcnamara commented Aug 12, 2013

jreback commented Aug 12, 2013

jreback commented Aug 12, 2013

jreback commented Aug 12, 2013

jmcnamara commented Aug 12, 2013

jreback commented Aug 12, 2013

jtratner commented Aug 31, 2013

jmcnamara commented Sep 1, 2013

jreback commented Sep 1, 2013

jtratner commented Sep 1, 2013

jmcnamara commented Sep 1, 2013

jreback commented Sep 2, 2013

jtratner commented Sep 2, 2013

jmcnamara commented Sep 2, 2013

jreback commented Sep 2, 2013

jmcnamara commented Sep 2, 2013

jmcnamara commented Sep 3, 2013

jmcnamara commented Sep 3, 2013

jreback commented Sep 3, 2013

aa3222119 commented Feb 16, 2017

ProsperousHeart commented Feb 21, 2019 •

edited

Loading

ENH: Adding XlsxWriter as an ExcelWriter() option #4542

ENH: Adding XlsxWriter as an ExcelWriter() option #4542

Comments

jmcnamara commented Aug 12, 2013

jreback commented Aug 12, 2013

jreback commented Aug 12, 2013

jreback commented Aug 12, 2013

jmcnamara commented Aug 12, 2013

jreback commented Aug 12, 2013

jtratner commented Aug 31, 2013

jmcnamara commented Sep 1, 2013

jreback commented Sep 1, 2013

jtratner commented Sep 1, 2013

jmcnamara commented Sep 1, 2013

jreback commented Sep 2, 2013

jtratner commented Sep 2, 2013

jmcnamara commented Sep 2, 2013

jreback commented Sep 2, 2013

jmcnamara commented Sep 2, 2013

jmcnamara commented Sep 3, 2013

jmcnamara commented Sep 3, 2013

jreback commented Sep 3, 2013

aa3222119 commented Feb 16, 2017

ProsperousHeart commented Feb 21, 2019 • edited Loading

ProsperousHeart commented Feb 21, 2019 •

edited

Loading