Closed
Description
This simple script fails with html5lib.
import html5lib
import lxml.html.clean
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("lxml"), namespaceHTMLElements=False)
# tree = lxml.html.document_fromstring(html)
tree = parser.parse("<html><body><!-- a comment --></body></html>")
cleaner = lxml.html.clean.Cleaner()
cleaner(tree)
The problem is that lxml.html.document_fromstring returns an element of type lxml.html.HtmlElement, but HTMLParser.parse returns an object of type lxml.etree._ElementTree.
Activity
gsnedders commented on Jul 31, 2013
We unfortunately cannot return an lxml HTML tree unless we entirely break namespace support, which seems undesirable in the extreme, thus closing as wontfix.
tahajahangir commented on Aug 3, 2013
So, what is namespaceHTMLElements=False for? We can set namespaceHTMLElements=False and use lxml HTML tree.
gsnedders commented on Aug 3, 2013
namespaceHTMLElements=False only changes what namespace HTML elements are put in, e.g., the html element is put in the void namespace instead of the XHTML namespace. However, HTML also supports SVG and MathML elements, which are put in their respective namespaces regardless of namespaceHTMLElements. (If they didn't, the option would introduce ambiguity, as how else would you distinguish the script element in the HTML namespace from the script element in the SVG namespace?)
What are you actually wanting to use lxml.html.clean.Cleaner for? For sanitizing/purifying/whatever-you-want-to-call-it? If so, you may want to try either https://github.com/jsocol/bleach or html5lib's own sanitizer (though note the API of that will probably change prior to 1.0 due to #72).
Regardless, it seems like, on the face of it, that it should support XML trees (as it is, it doesn't work for XHTML parsed by lxml either!). In general terms, I don't think it's worthwhile to support.
(Looking at the implementation of lxml.html.clean.Cleaner, it looks like it is vaguely meant to work with XML/XHTML trees; possibly worth asking on the lxml mailing list?)
davirtavares commented on Apr 14, 2014
Hi, I'm very interested in html5lib being able to use lxml's HtmlMixin. @gsnedders, can you please explain why it would break namespace support (at the code level)?
requiredfield commented on Aug 28, 2014
Same here. The functionality it provides, like make_links_absolute and rewrite_links, is essential for what I'm doing.
@gsnedders, do you have any suggestions? @davirtavares, did you ever figure out anything?
davirtavares commented on Aug 28, 2014
Hey @requiredfield, in the end I wrote my own versions of these methods as functions, based on lxml's code: https://github.com/lxml/lxml/blob/master/src/lxml/html/__init__.py#L455. Unfortunately, I had no time to fix it in a more elegant way :/
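For anyone wanting a starting point, here is a rough sketch of what standalone link-rewriting functions might look like. It is not the code referred to above and not lxml's implementation: the function names merely mirror the lxml method names, the LINK_ATTRS tuple is an assumption that covers only href and src, and the example uses the standard library's ElementTree so it runs without lxml or html5lib. Because it relies only on iter(), get(), and set(), it should in principle apply to other ElementTree-style trees as well, though that is untested here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Attributes treated as links in this sketch. This is an assumption and far
# from exhaustive; lxml's own implementation handles many more cases.
LINK_ATTRS = ("href", "src")


def rewrite_links(root, link_repl_func):
    """Replace each link attribute value in place with link_repl_func(value)."""
    for el in root.iter():
        for attr in LINK_ATTRS:
            value = el.get(attr)
            if value is not None:
                el.set(attr, link_repl_func(value))


def make_links_absolute(root, base_url):
    """Resolve every link attribute against base_url."""
    rewrite_links(root, lambda link: urljoin(base_url, link))


# Hypothetical sample document and base URL, for illustration only.
root = ET.fromstring(
    '<html><body><a href="/docs">docs</a><img src="logo.png"/></body></html>'
)
make_links_absolute(root, "https://example.com/app/")

print(root.find(".//a").get("href"))   # https://example.com/docs
print(root.find(".//img").get("src"))  # https://example.com/app/logo.png
```

Since these are plain functions over the tree and not methods from a mixin, they do not depend on the element class the parser happens to produce.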
cjerdonek commented on Apr 30, 2017
Responding to this (and independent of this issue as it was originally filed), would it make sense for lxml.html to add support for namespaces?
gsnedders commented on Apr 30, 2017
Yes, given HTML has been defined, and implemented in browsers, to put things in namespaces for almost ten years now. That might need changes in libxml2 too, though.
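For readers following along, the script ambiguity described earlier in this thread can be sketched with the standard library alone. This is only an illustration, not html5lib or lxml code: the sample document and the split_tag helper are made up for the example, and an XML parser stands in for an HTML one so that the namespaces are explicit.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical document with one HTML script element and one SVG script element.
doc = (
    f'<html xmlns="{XHTML_NS}"><body>'
    "<script>/* HTML script */</script>"
    f'<svg xmlns="{SVG_NS}"><script>/* SVG script */</script></svg>'
    "</body></html>"
)


def split_tag(tag):
    """Split an ElementTree tag like '{uri}local' into (uri, local)."""
    if tag.startswith("{"):
        ns, local = tag[1:].split("}", 1)
        return ns, local
    return "", tag


root = ET.fromstring(doc)

# Collect (namespace, local name) for every element named 'script'.
scripts = [split_tag(el.tag) for el in root.iter() if split_tag(el.tag)[1] == "script"]

for ns, local in scripts:
    print(f"{local!r} in namespace {ns!r}")

# Dropping the namespace leaves a single indistinguishable name.
print({local for _, local in scripts})
```

With namespaces kept, the two elements stay distinct. Once the namespace is discarded, both collapse to the same local name, which is the ambiguity a namespace-unaware tree would have to live with.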