Skip to content

ENH: Adding XlsxWriter as an ExcelWriter() option #4542

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jmcnamara opened this issue Aug 12, 2013 · 20 comments · Fixed by #4857
Closed

ENH: Adding XlsxWriter as an ExcelWriter() option #4542

jmcnamara opened this issue Aug 12, 2013 · 20 comments · Fixed by #4857
Labels
Enhancement IO Data IO issues that don't fit into a more specific label IO Excel read_excel, to_excel
Milestone

Comments

@jmcnamara
Copy link
Contributor

related/closes #4517

Hi,

I am the author of the XlsxWriter module, a Python module for writing Excel XLSX files.

I am interested in submitting a PR to add this as a option when writing Excel xlsx files via to_excel().

It has a similar interface to the xlwt module but creates xlsx files. For large data files it is ~10x faster than OpenPyXL while maintaining low and constant memory usage. I occasionally see the current xlsx writer's speed and memory usage mentioned as a bug/limitation so it could be useful to have a fallback module option in those cases.

XlsxWriter also supports features such as charts and auto-filters that could be exposed at a later stage. See https://xlsxwriter.readthedocs.org/ for more details.

It uses standard libraries only and has an extensive test suite for Python 2.6, 2.7, 3.1, 3.2 and 3.3.

If you are interested I can work on a prototype and send a PR for you to evaluate. I have read CONTRIBUTING.md and I think I should be able to meet the criteria.

Regards,

John.

@jreback
Copy link
Contributor

jreback commented Aug 12, 2013

gr8!

could certainly be added as an optional deps see http://pandas.pydata.org/pandas-docs/dev/install.html#optional-dependencies; and could try in preference (if installed) over other writer libraries, then fallback to others (if installed)

you may need to rework the infrastructure a bit to do this,

e.g. at the top of the file various imports should be tried (see io/html.py for how this is done).

then the writers should take a list of writers? to try in sequence (else will finally raise if nothing is installed)

lmk if you need help

@jreback
Copy link
Contributor

jreback commented Aug 12, 2013

pls hook up travis as well; you will also need to add your library to ci/requirements-??? for testing (you can add to all if you want), with a version specified (if you have multiple versions then you can do that too)

@jreback
Copy link
Contributor

jreback commented Aug 12, 2013

as a side issue; people have been thinking about how to pass formatting info to excel via these writers; either via some kind of dict or a Style type of object (generally per column, though could in theory be per cell), any thoughts on this?

@jmcnamara
Copy link
Contributor Author

you may need to rework the infrastructure a bit to do this,

I'll try to make the changes as noninvasive as possible.

pls hook up travis as well; you will also need to add your library to ci/requirements-???

Will do. XlsxWriter already uses Travis so it shouldn't be too painful. I normally test locally with a range of Python 2/3 versions as well.

John

@jreback
Copy link
Contributor

jreback commented Aug 12, 2013

gr8!

@jtratner
Copy link
Contributor

@jmcnamara how's this going? Definitely interested in being able to hook up other writers to the core...Does ExcelWriter have the same API as openpyxl for saving excel files?

@jmcnamara
Copy link
Contributor Author

@jtratner, @jreback

how's this going?

I made some initial progress getting XlsxWriter integrated and working, the code is on the enh_xlsxwriter_dev branch of my fork. For example the following now works using XlsxWriter:

    from pandas import ExcelWriter
    from pandas import DataFrame

    l1 = [1,2,3,4]
    l2 = [1,2,3,4]
    df1 = DataFrame({'Stimulus Time': l1, 'Reaction Time': l2})

    writer = ExcelWriter('path_to_file2.xlsx', writer_options={'use_xlsxwriter': True})
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    writer.save()

Dropping the optional writer_options parameter defaults back to OpenPyXL with similar results.

Definitely interested in being able to hook up other writers to the core

I think it should be reasonably easy to plug in other writers as well. The required API changes would be minor and backward compatible: basically the additional dict of options for the underlying Excel writer (shown above).

I'll try to get it to a more complete state in the next week, with tests, so you review it properly and let me know if you think it will fit in with the current code.

Does ExcelWriter have the same API as openpyxl for saving excel files?

It's API and scope is closer to Xlwt, i.e., it only creates new files and doesn't it doesn't rewrite files (although that could be simulated via the existing Pandas Excel reading functionality). In terms of functionality and features XlsxWriter goes beyond Xlwt or OpenPyXL and opens the possibility for adding charts, images, autofilters and conditional formatting as Pandas Excel writer options.

However, I need to do a bit more integration work first. :-)

It should also be about 10x faster than OpenPyXL for large files. However, to get that performance the cells would have to be written in row order. Currently in format.py they are handled in column order via chain:

    def get_formatted_cells(self):
        for cell in itertools.chain(self._format_header(), self._format_body()
                                    ):
            cell.val = self._format_value(cell.val)
            yield cell

Would it be acceptable to convert this, and associated header handling functions, to row order via izip_longest() or some similar iterator?

Without it XlsxWriter should still be 5x faster than OpenPyXL but would consume a similar (large) amount of memory for large files. There in some background on this, if you are interested, in the Working with Memory and Performance section of the XlsxsWriter docs.

Anyway, I'll try push some more complete code for you and the others to look at in the next week.

John.

@jreback
Copy link
Contributor

jreback commented Sep 1, 2013

@jmcnamara I would like to see an API more like this

df.to_excel(path_or_buf, sheet_name, engine='openpyxl|xlsxwriter'...., .....)
additional options can be passed directly to the engine if needed (as kwargs)

and can default the engine (and possible use get_option('io.excel.default_engine=....')

the last bit is actually easy to set up see: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L170

@jtratner
Copy link
Contributor

jtratner commented Sep 1, 2013

@jmcnamara sounds good - it's definitely not a problem to go in row order vs. column order - I wouldn't know (without looking back at the history of the code) whether that was an explicit decision or not anyways. 5x speedup is absolutely non-trivial, so making a change to enable that is definitely a good thing.

@jmcnamara
Copy link
Contributor Author

@jreback

I changed the API to engine=, added a default via the config handler and wrote some tests. The initial code is on my fork/branch.

In terms of the config it will probably need two items since the user may not be using the same engine for reading and writing. How about something like:

get_option('io.excel.writer_engine=....')
get_option('io.excel.reader_engine=....')

John

@jreback
Copy link
Contributor

jreback commented Sep 2, 2013

@jmcnamara pls make sure to hook up Travis
and you will need to update the ci/requirements file to make sure that the xlsxwriter is downloaded

@jtratner
Copy link
Contributor

jtratner commented Sep 2, 2013

@jmcnamara one thing that would be nice would be to, instead of hard coding the engine, just create the format like: '_init_%s' % engine and then check whether the engine exists on the class (so it's simple to bind new methods to the class and hook into the writer). The init method should take the path, so it could raise an error about getting an .xls file. (in fact, you could call like _init_blah(path, **engine_kwargs)). I think it shouldn't require too much effort to run it that way (basically just extending the setup you already have).

You could do something like:

class _UnsupportedFileTypeError(Exception):
    pass

# __init__ method for ExcelWriter/

def __init__(self, path, engine=None, **engine_kwargs):

    if engine is None:
        engine = config.get_option('io.excel.engine')

    try:
        method = getattr(self, "_init_%s" % engine)
    except AttributeError:
        raise ValueError("No engine: %s" % engine)
    try:
        method(path, **engine_kwargs)
    except _UnsupportedFileTypeError:
        self._init_standard(path, **engine_kwargs)

and then the individual engines handle the rest of the setup. (like setting up the workbook, etc.). If this seems too complicated, then we can always do it later after you've added in xlsxwriter and worked through the various issues with adding a new engine.

@jmcnamara
Copy link
Contributor Author

@jreback

make sure to hook up Travis and you will need to update the ci/requirements file to make sure that the xlsxwriter is downloaded

Already done in both cases. :-)

@jreback
Copy link
Contributor

jreback commented Sep 2, 2013

@jmcnamara did you send this as a PR? (or its just on your branch atm?)

@jmcnamara
Copy link
Contributor Author

@jreback

I haven't sent a PR. I am a few days work away from that as yet.

@jmcnamara
Copy link
Contributor Author

@jtratner

one thing that would be nice would be to, instead of hard coding the engine, just create the format like

That is a good suggestion. It makes it more generic and easier to add other engines. I also made the same change in another place.

I omitted the _UnsupportedFileTypeError part however since the xlwt and openpyxl called methods don't throw any exception.

@jmcnamara
Copy link
Contributor Author

@jreback @jtratner

The code is getting close to a PR. I'll rebase/squash it onto another branch prior to that so that the merge is only one commit. I still have a bit more to add first but any comments on the existing code are welcome.

I wrote 28 xlsxwriter specific tests and added xlsxwriter variations of all of the existing openpyxl tests.

Here are some initial benchmark figures for the three supported writer modules:

$ python xltime.py 100
Engine = openpyxl   Rows =    100 time =   0.12
Engine = xlsxwriter Rows =    100 time =   0.02
Engine = xlwt       Rows =    100 time =   0.03

$ python xltime.py 1000
Engine = openpyxl   Rows =   1000 time =   0.61
Engine = xlsxwriter Rows =   1000 time =   0.14
Engine = xlwt       Rows =   1000 time =   0.12

$ python xltime.py 10000
Engine = openpyxl   Rows =  10000 time =   5.60
Engine = xlsxwriter Rows =  10000 time =   1.07
Engine = xlwt       Rows =  10000 time =   0.94

$ python xltime.py 50000
Engine = openpyxl   Rows =  50000 time =  26.80
Engine = xlsxwriter Rows =  50000 time =   4.89
Engine = xlwt       Rows =  50000 time =   4.40

$ python xltime.py 100000
Engine = openpyxl   Rows = 100000 time =  54.53
Engine = xlsxwriter Rows = 100000 time =  10.21

# xlwt and the xls format only supports 64,000 rows.

Here is the simple-minded benchmark program:

# Simple Benchmark test for the Pandas Excel writer modules.
import sys
from time import clock
from pandas import DataFrame
from pandas.core import config
import numpy as np


# Get the number of rows to test.
if len(sys.argv) > 1:
    row_max = int(sys.argv[1])
else:
    row_max = 1000


# Time the Excel writer modules.
def time_excel_engine(df, engine):

    filename = 'test_%s.xlsx' % engine
    config.set_option('io.excel.writer_engine', engine)

    start_time = clock()
    df.to_excel(filename, sheet_name='Sheet1', index=False)
    elapsed = clock() - start_time

    print("Engine = %-10s Rows = %6d time = %6.2f" % (engine, row_max, elapsed))


# Create a dataframe to write out to Excel.
df = DataFrame(np.random.randn(row_max, 4), columns=['A', 'B', 'C', 'D'])

# Time the modules.
time_excel_engine(df, 'openpyxl')

time_excel_engine(df, 'xlsxwriter')

if row_max < 64000:
    time_excel_engine(df, 'xlwt')

If I can refactor the format.py code to use row order then we should be able to get a doubling of speed from xlsxwriter.

Adding the PyExcelerate module as a writer would get another 50% on that and with the reworked engine architecture it should be easy to add that or other modules.

@jreback
Copy link
Contributor

jreback commented Sep 3, 2013

you can put this in the form of a vbench

I think you could try to create a vbench for this, nothing there now but create a file like this: (named say excel): https://github.com/pydata/pandas/blob/master/vb_suite/ctors.py, and add it to the suite.py.

test using:

test_perf.sh -b -t

(lots of options to test_perf, e.g. you might want to start all of the tests with excel_, then you can add the
-r excel option to ONLY test thoses specific vbenches

@aa3222119
Copy link

@jmcnamara how to use pandas.to_excel(Write or maybe pandas.ExcelWriter) to add some data to an existed file , but not rewrite it??

@ProsperousHeart
Copy link
Contributor

ProsperousHeart commented Feb 21, 2019

@jmcnamara how to use pandas.to_excel(Write or maybe pandas.ExcelWriter) to add some data to an existed file , but not rewrite it??

You would have to read it in, make your changes, and print to the same filename.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Data IO issues that don't fit into a more specific label IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants