-
Notifications
You must be signed in to change notification settings - Fork 294
Various html5lib improvements #119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Is there any chance of you having the time to extract the code and recommit them into the html5lib repo and file pull requests? I can say upfront that 3 is something that isn't going to be upstreamed, the others may well be. |
I might be able to find the time, but I'd really rather not spend the time if the changes are not going to be merged, so I'd first like some confirmation on what you would be willing to merge. To summarize: I can make PRs for the changes to html5lib core, but I see the following problems:
You can review the patches, here https://github.com/kovidgoyal/calibre/commits/master/src/html5lib Look at the changes in the last ten days. If you indicate which changes you are willing to merge, in principle, I can probably find the time to make PRs for them. |
So:
I may try and split them out this afternoon — but this may well depend on managing to clone the Calibre repo (which is downloading at a mere 20Kb/s), so that may not happen… |
https://github.com/gsnedders/html5lib-python/tree/calibre-patches contains everything |
So, taking a quick glance at it, I think I'm happy in principle to take all of it. (You'll have to maintain your treebuilder yourself, but I'm happy to upstream anything needed for it.) Can you explain why I'll also point out that running tests, that currently that branch "FAILED (errors=980, failures=20)", with flake8 throwing up a fair few things too, not to mention Python 2.6 being broken (just a missing conditional import, I think?). All of those need fixed. If you want to fix stuff in both Calibre and here, |
apply_html_attributes and apply_body_attributes are called for second |
Regarding attribute order. lxml does preserve attribute order, as long as you are careful to not use Element.attrib (which generates a python dictionary from the underlying libxml2 structure). My lxml builder is thus able to preserve attribute order. There is no reason, in principle, the vanilla lxml builder cannot be changed for that as well. However given that the vanilla lxml builder is a rather ugly hack on top of the etree builder, I dont know how easy this would be in practice. My fast stream class mitigates the performance cost of line numbers by first running a regex on the entire input to find all line positions, storing them in an array and using a bisection algorithm to efficiently translate a byte offset into a line number. Also note that getting line numbers has zero overhead if it is turned off, so it should be safe to include, defaulting it to off (you might need ot turn it on explicitly in the test suite, however). |
lxml, FWIW, explicitly documents everywhere that the order is "arbitrary". It so happens that it currently returns everything in insertion order — and thus I'm not willing to guarantee an API returns stuff in document order when the underlying API does not (as otherwise, if/when lxml does change, it makes html5lib liable for the breakage). I'm perfectly happy to do everything necessary to return attributes in document order currently, but as long as the underlying API is arbitrary, I don't want to guarantee any more. Can you fork the above and file a PR for it (under the assumption of taking everything, one will do) and fix everything Travis CI will shout at you for (i.e., the breakage mentioned above)? I wonder if something a bit more generic could be used for adding further attributes — addFurtherAttributes(Element, attributes) or something similar (obviously matching whatever naming convention is used there)? |
I am perfectly happy with having html5lib passing attributes to the builders in document order. lxml defacto preserves attribute order (as long as you are careful) and I am quite comfortable relying on that behavior in my builder. I will look at fixing the breakage and submit the PR then. Unfortunately, I have a rather heavy work load, so it might be a little while. As far as I could see the html and body tags are the only ones that get the special treatment, which makes sense, since in html you are only supposed to have one html and one body tag and the head tag has no attributes anyway. Certainly the two APIs could be combined into one, I am fine with either approach, it is your call. |
I'd rather combine the two APIs into one — and I also have no problem with it taking a while, I don't have much time for the next month either! |
No, it actually uses its own mapping class based on the underlying XML tree data. The order is therefore preserved.
"Arbitrary" in the sense that it depends on where the attributes came from. If from the parser, then it's document order. If added later, it might make a difference whether the attribute already existed before, but AFAIR it still gets appended at the end. It's up to libxml2, and lxml doesn't give any guarantees here on its own. It's unlikely that the current behaviour of libxml2 is ever going to change, though, as it is considered a feature. |
This is just a FYI, I have been working on a modified version of html5lib that achieves the following goals:
Using my new lxml treebuilder parsing performance with line numbers and attribute order preservation is the same as for vanilla html5lib with its builtin treebuilder. The speed improvements come mainly from the new lxml builder and an optimized inputstream class for in memory streams.
I make no claims as to the relevance of my work for html5lib. I am just sharing it with you as a way of giving back. You are welcome to use the patches or not. Feel free to ask if you need any clarification.
The code is in https://github.com/kovidgoyal/calibre/tree/master/src/html5lib (these are the changes to html5lib itself)
and the lxml builder is in:
https://github.com/kovidgoyal/calibre/blob/master/src/calibre/ebooks/oeb/polish/parsing.py
The text was updated successfully, but these errors were encountered: