Skip to content

ENH: Upgrade Excel I/O module to use OpenPyXL 2 #7177

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
neirbowj opened this issue May 19, 2014 · 10 comments · Fixed by #7565
Closed

ENH: Upgrade Excel I/O module to use OpenPyXL 2 #7177

neirbowj opened this issue May 19, 2014 · 10 comments · Fixed by #7565
Labels
Compat pandas objects compatability with Numpy or Python functions IO Excel read_excel, to_excel
Milestone

Comments

@neirbowj
Copy link
Contributor

OpenPyXL v2.0 was released on May 13th, breaking backwards compatibility in some parts of its API, including those upon which pandas depends. Notably, v2 changes the style API and makes Style objects immutable. This incompatibility is evident at 0.14.0-rc1 in pandas.io.excel._OpenpyxlWriter._convert_to_style and the associated test case, pandas.io.tests.test_excel.OpenyxlTests.test_to_excel_styleconverter.

In addition to teaching pandas how to use the new API, the project may wish to retain support for the OpenPyXL v1 API, necessitating the design and development of a compatibility shim.

@jreback jreback added this to the 0.14.0 milestone May 19, 2014
@jreback
Copy link
Contributor

jreback commented May 19, 2014

dupe of #7169
would love to have u work on it!

py3.4 build already has 2.02 (and so fails)

@jreback jreback closed this as completed May 19, 2014
@neirbowj
Copy link
Contributor Author

Could you re-open this issue and reassign it to the 0.14.1 milestone?

@jreback jreback reopened this May 24, 2014
@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 24, 2014
@jreback jreback removed the Dupe label May 24, 2014
@jreback
Copy link
Contributor

jreback commented Jun 10, 2014

any thoughts on this? (getting it to work with >= 2.0)

@neirbowj
Copy link
Contributor Author

I've been spending my pandas time allocation on this.

My thoughts, in case anybody else is driven to work on this before I can get to it, are as follows (caveat: I haven't thought about this for a couple weeks, so my recollection about some of these details may betray me).

There appear to be two main pieces to this feature. One is simply the conversion of a style dict to a native openpyxl Style object. This should be fairly rote, and by implementing to the new-style Style protocol, which is the only option in v2, backwards compatibility with v1 will be quite tractable, because it offers both old- and new-style protocols for forwards compatibility.

There are some differences in the structure of the Style class from v1 to v2, which pandas will have to handle. My preference would be to try, but not very hard, to map whatever style information the user passes onto the underlying representation, and to attach commentary about the openpyxl limitations we know of to whatever exceptions bubble up. However, I don't know how people tend to use this, and I haven't yet found a clear spec in the pandas docs for an engine-agnostic, user-facing style API (maybe there isn't and shouldn't be one).

The other main piece is the merge behavior. I've had more trouble wrapping my head around how this should work. The new-style Style class has a copy method that accepts coarse substitutions. For example, conceptually if not correct syntactically, consider this snippet.

from openpyxl.style import *
s1 = Style(font=Font(name='Helvetica'), color=Color(rgb='00FF0000'))
s2 = s1.copy(font=Font(sz=12))

s2 will get the red Color object from s1 but its font attribute will have been overridden with a completely new Font object that has the default face, 'Calibri', at a non-default size, 12 pt. Make sense?

I think this is a fine set of semantics to adopt for the internal style merging behavior of pandas, because it keeps the padding between the user and the engine implementation thin. I'm not sure what that implies for the openpyxl v1 case. At the least, it may be convenient to model a compatibility shim as a backport of the copy method applied to v1 Style objects.

@neirbowj
Copy link
Contributor Author

It occurred to me belatedly to wonder, since I have never used pandas.io.excel or its main consumer, the to_excel method, whether the styling capability of ExcelWriter is really intended to be user facing? The documentation does not mention it, though the docstring for the ExcelWriter.write_cells method clearly indicates acceptance of formatted (i.e. styled) cells. In what follows, I will be assuming that this is supposed to be a user-facing capability that is simply not yet documented as such. If it isn't, please say so because it might provide some useful wiggle room to address the other areas where I'm getting stuck.

The other thing I finally realized I didn't really know for sure, and need to, is whether the style dict those formatted cells carry around is required to be engine-agnostic, or if it can vary from engine to engine. I guessed that the intention was the former, and the uniformity of the usage examples --- the only internal user in pandas.core.format and the applicable test cases for the openpyxl and xlwt engines --- seem to substantiate my guess. However, the style dicts in the test cases are insufficient for me to tell what I should write if I were to write a style dict spec/doc in the spirit of earlier authors. For example, I can tell that the spec has to allow the font to be set to bold or not, but I cannot tell whether font colors are to be supported, and, if so, how they are to be encoded.

If it were up to me, I think I would choose a very limited subset of all possible styling available in the underlying Excel, and call that the engine-agnostic, pandas dialect. By making the subset small and simple, the task gets easier of mapping it onto the supported versions of the existing engine libraries and any future versions or new engines. As I said, I don't know how users tend to use pandas.io.excel so I'm basing this judgement on the notion that somebody who doesn't wish to specify a particular engine, probably won't wish to specify a great deal of formatting information either, if any.

To accommodate users who do want to exert a great deal of control over cell styling, an additional, separable feature could be to teach ExcelWriter.write_cells to accept cells that come with attached style objects that are native to the selected engine. That is, if you want access to the full styling capability of the underlying library, then you have to build the style objects yourself instead of describing them to pandas as dicts.

A related question. Does pandas have some sort of policy, roadmap, or procedure for determining version support? That is, how will we know when we can drop support for the old version of openpyxl?

I welcome discussion or even just direction.

@jreback
Copy link
Contributor

jreback commented Jun 21, 2014

to your last points

we try maintain support for as long as possible with dep libraries until it becomes too painful :)

if it's easy enough then no biggie

openpyxl 1.6 I think is the min version but only. 1 year old I think

it's becoming pretty easy IMHO to upgrade individual deps easily

so if their is a recommendation to drop support for something we can discuss

note that pandas does still supports numpy 1.6.1 (and this is painful!)

@neirbowj
Copy link
Contributor Author

@jreback I'm not suggesting that we drop support for anything at the moment. I'm just trying to get a sense of how long the life of an openpyxl v1/v2 compatibility shim might be.

@jreback
Copy link
Contributor

jreback commented Jun 21, 2014

one way to do this might be to make a openpyxl_v2 engine (and then depending on the installation choose one or the other)
that way u can rebuild the API as needed

@neirbowj
Copy link
Contributor Author

Yes, that's a good point. Splitting the implementation at the engine level would probably be better. I still think that we need a spec and documentation for the style dict, both of which would be informed by whether it is intended for internal use only, or is an external feature that we also use internally.

@neirbowj
Copy link
Contributor Author

The PR I just posted still has some rough edges (see, for example, pandas.io.excel._Openpyxl2Writer._convert_to_fill), but should give us a leg up on this issue.

I find that openpyxl is maddeningly full of corner cases and reasons why I can't do something simple and uniform. If anybody sees a missed opportunity to refactor and consolidate this code, I welcome suggestions or even alternative PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants