lxml trees parsed by html5lib can not be used with lxml.clean #102

Closed
@tahajahangir

Description

This simple script fails with html5lib.

import html5lib
import lxml.html.clean

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("lxml"), namespaceHTMLElements=False)
# tree = lxml.html.document_fromstring(html)
tree = parser.parse("<html><body><!-- a comment --></body></html>")

cleaner = lxml.html.clean.Cleaner()
cleaner(tree)

The problem is that lxml.html.document_fromstring returns an element of type lxml.html.HtmlElement, whereas HTMLParser.parse returns an object of type lxml.etree._ElementTree.
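The type mismatch can be reproduced with lxml alone. The following is a minimal sketch, not code from the original report: it uses lxml.etree.fromstring as a stand-in for html5lib's lxml tree builder, on the assumption that both produce plain lxml.etree elements wrapped in an _ElementTree. The helper name describe is hypothetical.

```python
import lxml.etree
import lxml.html

MARKUP = "<html><body><!-- a comment --></body></html>"


def describe(obj):
    """Return (is_element_tree, is_html_element) for an lxml parse result."""
    is_element_tree = isinstance(obj, lxml.etree._ElementTree)
    is_html_element = isinstance(obj, lxml.html.HtmlElement)
    return is_element_tree, is_html_element


# lxml's own HTML parser returns the root element as an HtmlElement.
html_root = lxml.html.document_fromstring(MARKUP)

# Stand-in (assumption) for the html5lib result: a generic tree wrapper
# whose root is a plain element without the lxml.html mixin methods.
generic_tree = lxml.etree.ElementTree(lxml.etree.fromstring(MARKUP))
generic_root = generic_tree.getroot()

for label, obj in [
    ("lxml.html root", html_root),
    ("generic tree", generic_tree),
    ("generic root", generic_root),
]:
    print(label, type(obj).__name__, describe(obj))
```

Note that calling getroot() on the generic tree still yields a plain element rather than an HtmlElement, so unwrapping the tree does not by itself give access to the lxml.html element methods.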

Activity

gsnedders (Member) commented on Jul 31, 2013

We unfortunately cannot return a lxml HTML tree unless we entirely break namespace support, which seems undesirable in the extreme, thus closing as wontfix.

tahajahangir (Contributor, Author) commented on Aug 3, 2013

So, what is namespaceHTMLElements=False for?

We can set namespaceHTMLElements=False and use lxml HTML tree.

gsnedders (Member) commented on Aug 3, 2013

namespaceHTMLElements=False only changes which namespace HTML elements are put in; e.g., the html element is put in the void namespace instead of the XHTML namespace. However, HTML also supports SVG and MathML elements, which are put in their respective namespaces regardless of namespaceHTMLElements. (If they weren't, the option would introduce ambiguity: how else would you distinguish the script element in the HTML namespace from the script element in the SVG namespace?)
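To illustrate the ambiguity described above, here is a small sketch using the standard library's xml.etree.ElementTree rather than html5lib or lxml. It assumes the usual "{namespace}local" tag notation shared by ElementTree-style APIs; the helper names qualified and local_name are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"


def qualified(namespace, name):
    """Build the '{namespace}name' tag form; no namespace gives a bare name."""
    return f"{{{namespace}}}{name}" if namespace else name


def local_name(element):
    """Strip the '{namespace}' prefix from an element's tag, if present."""
    return element.tag.rpartition("}")[2]


# With namespaces, the two script elements stay distinguishable.
html_script = ET.Element(qualified(XHTML_NS, "script"))
svg_script = ET.Element(qualified(SVG_NS, "script"))

# Roughly what namespaceHTMLElements=False does for HTML elements only:
# the HTML script loses its namespace, the SVG one keeps it.
bare_html_script = ET.Element(qualified(None, "script"))

print(html_script.tag)       # namespaced HTML script
print(svg_script.tag)        # namespaced SVG script
print(bare_html_script.tag)  # un-namespaced HTML script

# If SVG elements were stripped as well, both would collapse to one name.
print(local_name(svg_script) == bare_html_script.tag)
```

The last line shows why the SVG namespace has to be kept: once both elements are reduced to their local names, nothing distinguishes them.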

What are you actually wanting to use lxml.html.clean.Cleaner for? For sanitizing/purifying/whatever-you-want-to-call-it? If so, you may want to try either https://github.com/jsocol/bleach or html5lib's own sanitizer (though note that its API will probably change prior to 1.0 due to #72).

Regardless, on the face of it, it seems that lxml.html.clean.Cleaner should support XML trees (as it is, it doesn't work for XHTML parsed by lxml either!). In general terms, I don't think it's worthwhile to support.

(Looking at the implementation of lxml.html.clean.Cleaner, it looks like it is vaguely meant to work with XML/XHTML trees; possibly worth asking on the lxml mailing list?)

davirtavares commented on Apr 14, 2014

Hi, I'm very interested in html5lib being able to use lxml's HtmlMixin. @gsnedders, can you please explain why it would break namespace support (at the code level)?

requiredfield commented on Aug 28, 2014

> Hi, I'm very interested in html5lib being able to use lxml's HtmlMixin

Same here. The functionality it provides like make_links_absolute and rewrite_links is essential for what I'm doing.

@gsnedders, do you have any suggestions? @davirtavares, did you ever figure out anything?

davirtavares commented on Aug 28, 2014

Hey @requiredfield, in the end I wrote my own versions of these methods as functions, based on lxml's code (https://github.com/lxml/lxml/blob/master/src/lxml/html/__init__.py#L455). Unfortunately I had no time to fix it in a more elegant way :/

cjerdonek commented on Apr 30, 2017

> We unfortunately cannot return a lxml HTML tree unless we entirely break namespace support

Responding to this (and independent of this issue as it was originally filed), would it make sense for lxml.html to add support for namespaces?

gsnedders (Member) commented on Apr 30, 2017

> Responding to this (and independent of this issue as it was originally filed), would it make sense for lxml.html to add support for namespaces?

Yes, given HTML has been defined, and implemented in browsers, to put things in namespaces for almost ten years now. That might need changes in libxml2 too, though.
