lxml doesn’t like control characters #96

Open
@SimonSapin

Description

Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.

Each of these triggers the exception below:

html5lib.parse('<p>&#1;', treebuilder='lxml')
html5lib.parse('<p>\x01', treebuilder='lxml')
html5lib.parse('<p id="&#1;">', treebuilder='lxml')
html5lib.parse('<p id="\x01">', treebuilder='lxml')
Traceback (most recent call last):
  File "/tmp/a.py", line 4, in <module>
    html5lib.parse('<p>&#1;', treebuilder='lxml')
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

U+000C in text (but not in attribute values) is replaced by U+0020 with a warning:

DataLossWarning: Text cannot contain U+000C

libxml2’s HTML parser replaces them with nothing, which I slightly prefer. Anyway, this is probably what should happen for every character that lxml doesn’t like.
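
The code points listed at the top of this issue fit in a single character class. The following is a minimal sketch, assuming Python 3 and only the standard library; `C0_CONTROLS_RE` and `strip_c0_controls` are names made up for illustration and are not part of html5lib or lxml. It drops the characters, which is the "replace with nothing" behaviour described for libxml2's HTML parser:

```python
import re

# U+0001-U+0008, U+000B, U+000C, U+000E-U+001F: the non-whitespace C0
# controls named in the issue description. Tab, LF and CR are kept.
C0_CONTROLS_RE = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f]")


def strip_c0_controls(text):
    """Return `text` with the C0 controls listed above removed."""
    return C0_CONTROLS_RE.sub("", text)


print(repr(strip_c0_controls("<p>\x01a\tb\x0c</p>")))  # '<p>a\tb</p>'
```

This only touches literal characters; numeric references such as `&#1;` would still reach the tree builder as U+0001 after tokenization.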

Activity

EmilStenstrom

EmilStenstrom commented on May 18, 2014

@EmilStenstrom

Here's a workaround for anyone that needs to get things working before this bug is fixed. Just run this code over the html before sending it to html5lib:

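import re

# Python 2 only: relies on `unichr` and `ur""` literals.
def remove_control_characters(html):
    def str_to_int(s, default, base=10):
        # Decode references below U+10000 so the final pass can strip them.
        # Caveat: this also turns e.g. &#60; into a literal "<", which changes the markup.
        if int(s, base) < 0x10000:
            return unichr(int(s, base))
        return default
    html = re.sub(ur"&#(\d+);?", lambda c: str_to_int(c.group(1), c.group(0)), html)
    html = re.sub(ur"&#[xX]([0-9a-fA-F]+);?", lambda c: str_to_int(c.group(1), c.group(0), base=16), html)
    # U+000C is not in this class; in text, html5lib replaces it with a space itself.
    html = re.sub(ur"[\x00-\x08\x0b\x0e-\x1f\x7f]", "", html)
    return html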
added 2 commits that reference this issue on Jun 6, 2014
6a6b74a
603440e
lastorset

lastorset commented on Jun 6, 2014

@lastorset

Pull request: #162

bradleyayers

bradleyayers commented on Aug 19, 2015

@bradleyayers
SimonSapin

SimonSapin commented on Aug 19, 2015

@SimonSapin
Contributor, Author

@bradleyayers To support that claim, Wikipedia links to a document named "SGML Declaration of HTML 4" and published in 1999. The relevant specification is https://whatwg.org/html. Also, what do you mean exactly by "valid"? Conformance requirements are different for authors and implementations.

bradleyayers

bradleyayers commented on Aug 19, 2015

@bradleyayers

By "valid" I mean "able to be represented in HTML". I'm saying it's not possible to represent U+0001 in HTML.

It can't be represented by numeric character references (see https://html.spec.whatwg.org/multipage/syntax.html#character-references):

The numeric character reference forms described above are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), surrogates (U+D800–U+DFFF), and control characters other than space characters.

Nor can they be represented by encoding them (see https://html.spec.whatwg.org/multipage/syntax.html#preprocessing-the-input-stream):

Any occurrences of any characters in the ranges U+0001 to U+0008, …snip… are parse errors. These are all control characters or permanently undefined Unicode characters (noncharacters).

gsnedders

gsnedders commented on Aug 19, 2015

@gsnedders
Member

For the reference for numeric character references, note the start of the section:

This section only applies to documents, authoring tools, and markup generators. In particular, it does not apply to conformance checkers; conformance checkers must use the requirements given in the next section ("parsing HTML documents").

The actual parsing of a character reference (https://html.spec.whatwg.org/multipage/syntax.html#tokenizing-character-references) says:

Otherwise, return a character token for the Unicode character whose code point is that number. Additionally, if the number is in the range 0x0001 to 0x0008, 0x000D to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, then this is a parse error.

Follow the cross-reference for parse error:

Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.

If you follow the error handling, note those characters are never replaced by anything else, and hence they end up in the DOM. The same is true for the ranges in the pre-processing.
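
The quoted range list can be expressed compactly. This is a sketch under assumptions: it models only the quoted "Otherwise" branch of the character-reference step (the earlier branches of that algorithm are not quoted in this thread and are not modelled), and the function names are hypothetical. It shows that the character is still returned even when the reference counts as a parse error:

```python
def is_parse_error_code_point(n):
    """True if code point `n` is in the range list quoted above."""
    if 0x0001 <= n <= 0x0008 or 0x000D <= n <= 0x001F:
        return True
    if 0x007F <= n <= 0x009F or 0xFDD0 <= n <= 0xFDEF:
        return True
    if n == 0x000B:
        return True
    # 0xFFFE, 0xFFFF, 0x1FFFE, ..., 0x10FFFF: the last two code points
    # of every plane all end in FFFE or FFFF.
    return n <= 0x10FFFF and (n & 0xFFFE) == 0xFFFE


def resolve_quoted_branch(n):
    """Return (character, is_parse_error); the character is never replaced."""
    return chr(n), is_parse_error_code_point(n)


print(resolve_quoted_branch(0x01))  # ('\x01', True)
print(resolve_quoted_branch(0x41))  # ('A', False)
```

Note that 0x000C is absent from the quoted list, so `&#12;` is not flagged by this check.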

EmilStenstrom

EmilStenstrom commented on Aug 27, 2015

@EmilStenstrom

@gsnedders Do I understand this correctly that this is not a bug in html5lib (which just lets these characters through according to spec), but in lxml which does not expect those characters?

SimonSapin

SimonSapin commented on Aug 27, 2015

@SimonSapin
Contributor, Author

@EmilStenstrom I’d say that’s debatable. https://html.spec.whatwg.org/multipage/#coercing-an-html-dom-into-an-infoset describes how to map to a more restricted XML API. The problem is raising exceptions rather than doing this coercion.

EmilStenstrom

EmilStenstrom commented on Aug 27, 2015

@EmilStenstrom

@SimonSapin Doesn't the exception come from lxml rather than html5lib?

SimonSapin

SimonSapin commented on Aug 27, 2015

@SimonSapin
Contributor, Author

It does, but html5lib should munge the data to avoid triggering this exception.
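
One way to picture "munge instead of raise" is the sketch below. It is not html5lib's actual code: `DataLossWarning` here is a local stand-in for the warning category named earlier in this issue, `coerce_text` and `NON_XML_CHARS_RE` are hypothetical names, and the warning message is invented. It assumes Python 3 and treats anything outside the XML 1.0 `Char` production as unrepresentable:

```python
import re
import warnings


class DataLossWarning(UserWarning):
    """Local stand-in for the warning category shown earlier in this issue."""


# Negated XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
NON_XML_CHARS_RE = re.compile(
    r"[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def coerce_text(text, replacement=""):
    """Replace characters an XML API would reject, warning instead of raising."""
    coerced, count = NON_XML_CHARS_RE.subn(replacement, text)
    if count:
        warnings.warn(
            "Text contained %d character(s) not allowed in XML" % count,
            DataLossWarning,
        )
    return coerced


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(repr(coerce_text("a\x01b")))  # 'ab'
    print(caught[0].category.__name__)  # DataLossWarning
```

Passing `replacement=" "` instead would mirror the U+000C to U+0020 substitution mentioned in the issue description; which of the two is preferable is the open question in this thread.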

removed this from the 1.0 milestone on Jun 6, 2016

willkg

willkg commented on Nov 27, 2017

@willkg
Contributor

I'm going to bump this out of the 1.0 milestone.

@gsnedders If you can get to this before December 1st, I'm game for re-adding it.

removed this from the 1.0 milestone on Nov 27, 2017
lpla

lpla commented on May 7, 2020

@lpla

In case anyone needs to use @EmilStenstrom's code in Python 3, I just ported it:

import re

def remove_control_characters(html):
    # `html` must be bytes; decoded references are re-encoded as UTF-8.
    def str_to_int(s, default, base=10):
        n = int(s, base)
        # Surrogates cannot be encoded to UTF-8, so leave those references as-is.
        if n < 0x10000 and not 0xD800 <= n <= 0xDFFF:
            return chr(n).encode()
        return default
    html = re.sub(br"&#(\d+);?", lambda c: str_to_int(c.group(1), c.group(0)), html)
    html = re.sub(br"&#[xX]([0-9a-fA-F]+);?", lambda c: str_to_int(c.group(1), c.group(0), base=16), html)
    html = re.sub(br"[\x00-\x08\x0b\x0e-\x1f\x7f]", b"", html)
    return html
EmilStenstrom

EmilStenstrom commented on May 7, 2020

@EmilStenstrom

@lpla Ours has evolved as we slowly corrected errors while parsing erroneously encoded text in hundreds of thousands of HTML e-mails. This is the version we currently use; it is compatible with both Python 2 (narrow and wide builds) and Python 3, and has type hints:

import re

def remove_control_characters(html):
    # type: (t.Text) -> t.Text
    """
    Strip invalid XML characters that `lxml` cannot parse.
    """
    # See: https://github.com/html5lib/html5lib-python/issues/96
    #
    # The XML 1.0 spec defines the valid character range as:
    # Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    #
    # We can instead match the invalid characters by inverting that range into:
    # InvalidChar ::= #xb | #xc | #xFFFE | #xFFFF | [#x0-#x8] | [#xe-#x1F] | [#xD800-#xDFFF]
    #
    # Sources:
    # https://www.w3.org/TR/REC-xml/#charsets,
    # https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/
    def strip_illegal_xml_characters(s, default, base=10):
        # Compare the "invalid XML character range" numerically
        n = int(s, base)
        if n in (0xb, 0xc, 0xFFFE, 0xFFFF) or 0x0 <= n <= 0x8 or 0xe <= n <= 0x1F or 0xD800 <= n <= 0xDFFF:
            return ""
        return default

    # We encode all non-ascii characters to XML char-refs, so for example "💖" (U+1F496) becomes: "&#128150;"
    # Otherwise we'd remove emojis by mistake on narrow-unicode builds of Python
    html = html.encode("ascii", "xmlcharrefreplace").decode("utf-8")
    html = re.sub(r"&#(\d+);?", lambda c: strip_illegal_xml_characters(c.group(1), c.group(0)), html)
    html = re.sub(r"&#[xX]([0-9a-fA-F]+);?", lambda c: strip_illegal_xml_characters(c.group(1), c.group(0), base=16), html)
    html = ILLEGAL_XML_CHARS_RE.sub("", html)
    return html


# A regex matching the "invalid XML character range"
ILLEGAL_XML_CHARS_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]")
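
The encoding step in the function above can be tried on its own. A small sketch, assuming Python 3; the sample string is made up. Python's `xmlcharrefreplace` error handler emits decimal references, so U+1F496 comes out as `&#128150;` (the same code point that `&#x1F496;` names in hex), and the result is pure ASCII for the two regex passes to work on:

```python
text = "caf\u00e9 \U0001F496"

# Every non-ASCII character becomes a decimal numeric character reference.
ascii_only = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(ascii_only)  # caf&#233; &#128150;
print(int("1F496", 16))  # 128150
```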
lpla

lpla commented on May 7, 2020

@lpla

Any license for this code?

EmilStenstrom

EmilStenstrom commented on May 7, 2020

@EmilStenstrom

> Any license for this code?

I hereby release it as public domain.

gsnedders

gsnedders commented on Jun 16, 2020

@gsnedders
Member

At this point, the only Python release we support narrow builds on is 2.7; all versions of Py3 we support are always wide. This, to be fair, makes this a lot easier to fix, so we should probably take a stab at this soon.

SimonSapin

SimonSapin commented on Jun 17, 2020

@SimonSapin
Contributor, Author

How do narrow vs. wide builds affect lxml/libxml2 being peculiar about control characters?

gsnedders

gsnedders commented on Jun 17, 2020

@gsnedders
Member

@SimonSapin They don't (the string is converted to UTF-8 before being passed to libxml2, IIRC), but they do affect our ability to detect which strings will trigger it: on narrow builds we can't just iterate through a string and compare each value against the XML production, either ourselves or with a regex. The only remaining complexity is whether libxml2 is enforcing the fourth or fifth edition of XML 1.0.
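
On a wide build (every supported Python 3), the per-character comparison is straightforward. A minimal sketch, with hypothetical helper names, that checks each code point against the XML 1.0 `Char` production and reports offenders rather than removing them:

```python
def is_xml_char(cp):
    """True if code point `cp` matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_non_xml_chars(text):
    """Return (index, code point) pairs for characters outside Char."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if not is_xml_char(ord(ch))]


# U+1F496 is one item here; on a narrow Python 2 build it would be two
# surrogate code units, and both would fail is_xml_char.
print(find_non_xml_chars("a\x01\U0001F496\x0c"))  # [(1, 1), (3, 12)]
```

Whether this matches exactly what libxml2 enforces is an assumption; the sketch only encodes the `Char` production itself.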

sshishov

sshishov commented on Nov 14, 2024

@sshishov

@gsnedders how are you doing? Are we going to try to tackle the issue or not?
