Skip to content

Various html5lib improvements #119

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
kovidgoyal opened this issue Nov 2, 2013 · 11 comments
Open

Various html5lib improvements #119

kovidgoyal opened this issue Nov 2, 2013 · 11 comments

Comments

@kovidgoyal
Copy link
Contributor

This is just a FYI, I have been working on a modified version of html5lib that achieves the following goals:

  1. Preserves attribute order
  2. Optionally includes line and column number information when parsing
  3. Handles XML namespaces correctly, so that if you happen to parse an XHTML document with html5lib you dont lose all the namespace information
  4. Create a new treebuilder for lxml
  5. Various performance improvements

Using my new lxml treebuilder parsing performance with line numbers and attribute order preservation is the same as for vanilla html5lib with its builtin treebuilder. The speed improvements come mainly from the new lxml builder and an optimized inputstream class for in memory streams.

I make no claims as to the relevance of my work for html5lib. I am just sharing it with you as a way of giving back. You are welcome to use the patches or not. Feel free to ask if you need any clarification.

The code is in https://github.com/kovidgoyal/calibre/tree/master/src/html5lib (these are the changes to html5lib itself)

and the lxml builder is in:
https://github.com/kovidgoyal/calibre/blob/master/src/calibre/ebooks/oeb/polish/parsing.py

@gsnedders
Copy link
Member

Is there any chance of you having the time to extract the code and recommit them into the html5lib repo and file pull requests? I can say upfront that 3 is something that isn't going to be upstreamed, the others may well be.

@kovidgoyal
Copy link
Contributor Author

I might be able to find the time, but I'd really rather not spend the time if the changes are not going to be merged, so I'd first like some confirmation on what you would be willing to merge.

To summarize:
The major problem is that without the new lxml builder most of the changes are useless. And the new lxml builder is designed to preserve namespaces and ignore DOCTYPEs (and generally any preamble that occurs before the first tag) which makes it rather unsuitable for html5lib. It would need a fair bit of work to make it deliberately mangle namespace declarations and keep doctypes, work which I dont have any interest in doing.

I can make PRs for the changes to html5lib core, but I see the following problems:

  1. preserving attribute order. This would require support in the builders, without which it is useless and a small performance penalty as well (since OrderedDict is slower than plain dict). Not to mention, that I vaguely recall that your tests depend on a specific attribute order, so this might break your test suite.

  2. line and column numbers: Again, it needs support in the builders to be useful, but as as this is almost zero performance impact and is optional, you may be willing to merge it anyway as a first step and look at adding support to the builders later.

  3. The miscellaneous performance improvements: Mostly uncontroversial, though you should in particular review the input stream class.

You can review the patches, here https://github.com/kovidgoyal/calibre/commits/master/src/html5lib Look at the changes in the last ten days.

If you indicate which changes you are willing to merge, in principle, I can probably find the time to make PRs for them.

@gsnedders
Copy link
Member

So:

  1. This is Preserve order of attributes on parsing #86; though from Preserve order of attributes on serialization #37:

    Note, also, that no tree type we support preserves order: lxml explicitly returns them in an "arbitrary order", as such there is no guarantee the treewalker won't change iteration order (and it probably does, for it just returns a dict).

    This is the big problem — we need a treebuilder with support for preserving order, as AFAIK none of the ones we currently support do. It therefore makes testing it works hard — otherwise I would merge it without support.

    The tests shouldn't be a problem, I don't think, as the test harness already sorts the attributes (given it's currently undefined, it needs to).

  2. See Add element.sourceline (lxml) #97; at least when I last looked at it last getting line numbers was slow. Will need to look at that more closely, but again, in principle something I would like. (Though as with the previous, need some treebuilder supporting it for the sake of testing it.)

  3. Well, yes, performance is always nice, provided it doesn't make things unreadable. :)

I may try and split them out this afternoon — but this may well depend on managing to clone the Calibre repo (which is downloading at a mere 20Kb/s), so that may not happen…

@gsnedders
Copy link
Member

@gsnedders
Copy link
Member

So, taking a quick glance at it, I think I'm happy in principle to take all of it. (You'll have to maintain your treebuilder yourself, but I'm happy to upstream anything needed for it.) Can you explain why apply_html_attributes, apply_body_attributes are needed, though, before I take a closer look?

I'll also point out that running tests, that currently that branch "FAILED (errors=980, failures=20)", with flake8 throwing up a fair few things too, not to mention Python 2.6 being broken (just a missing conditional import, I think?). All of those need fixed.

If you want to fix stuff in both Calibre and here, git format-patch and git am come in use to copy commitsets across (such as the above branch).

@kovidgoyal
Copy link
Contributor Author

apply_html_attributes and apply_body_attributes are called for second <html> and <body> tags. In order to allow the builder to handle namespaced attributes on those tags the handling of the attributes has to be moved into the builders. calibre is python 2.7 only and uses only a partial set of pep8 rules, but that is easily fixed.

@kovidgoyal
Copy link
Contributor Author

Regarding attribute order. lxml does preserve attribute order, as long as you are careful to not use Element.attrib (which generates a python dictionary from the underlying libxml2 structure). My lxml builder is thus able to preserve attribute order. There is no reason, in principle, the vanilla lxml builder cannot be changed for that as well. However given that the vanilla lxml builder is a rather ugly hack on top of the etree builder, I dont know how easy this would be in practice.

My fast stream class mitigates the performance cost of line numbers by first running a regex on the entire input to find all line positions, storing them in an array and using a bisection algorithm to efficiently translate a byte offset into a line number. Also note that getting line numbers has zero overhead if it is turned off, so it should be safe to include, defaulting it to off (you might need ot turn it on explicitly in the test suite, however).

@gsnedders
Copy link
Member

lxml, FWIW, explicitly documents everywhere that the order is "arbitrary". It so happens that it currently returns everything in insertion order — and thus I'm not willing to guarantee an API returns stuff in document order when the underlying API does not (as otherwise, if/when lxml does change, it makes html5lib liable for the breakage). I'm perfectly happy to do everything necessary to return attributes in document order currently, but as long as the underlying API is arbitrary, I don't want to guarantee any more.

Can you fork the above and file a PR for it (under the assumption of taking everything, one will do) and fix everything Travis CI will shout at you for (i.e., the breakage mentioned above)? I wonder if something a bit more generic could be used for adding further attributes — addFurtherAttributes(Element, attributes) or something similar (obviously matching whatever naming convention is used there)?

@kovidgoyal
Copy link
Contributor Author

I am perfectly happy with having html5lib passing attributes to the builders in document order. lxml defacto preserves attribute order (as long as you are careful) and I am quite comfortable relying on that behavior in my builder. I will look at fixing the breakage and submit the PR then. Unfortunately, I have a rather heavy work load, so it might be a little while.

As far as I could see the html and body tags are the only ones that get the special treatment, which makes sense, since in html you are only supposed to have one html and one body tag and the head tag has no attributes anyway. Certainly the two APIs could be combined into one, I am fine with either approach, it is your call.

@gsnedders
Copy link
Member

I'd rather combine the two APIs into one — and I also have no problem with it taking a while, I don't have much time for the next month either!

@scoder
Copy link

scoder commented Sep 3, 2016

Element.attrib (which generates a python dictionary from the underlying libxml2 structure).

No, it actually uses its own mapping class based on the underlying XML tree data. The order is therefore preserved.

lxml, FWIW, explicitly documents everywhere that the order is "arbitrary".

"Arbitrary" in the sense that it depends on where the attributes came from. If from the parser, then it's document order. If added later, it might make a difference whether the attribute already existed before, but AFAIR it still gets appended at the end. It's up to libxml2, and lxml doesn't give any guarantees here on its own. It's unlikely that the current behaviour of libxml2 is ever going to change, though, as it is considered a feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants