Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.
Each of these triggers the exception below:
html5lib.parse('<p>', treebuilder='lxml')
html5lib.parse('<p>\x01', treebuilder='lxml')
html5lib.parse('<p id="">', treebuilder='lxml')
html5lib.parse('<p id="\x01">', treebuilder='lxml')
Traceback (most recent call last):
  File "/tmp/a.py", line 4, in <module>
    html5lib.parse('<p>', treebuilder='lxml')
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
U+000C in text (but not in attribute values) is replaced by U+0020 with a warning:
DataLossWarning: Text cannot contain U+000C
libxml2’s HTML parser replaces them with nothing, which I slightly prefer. Anyway, this is probably what should happen for every character that lxml doesn’t like.
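A minimal sketch of the "replace with nothing" behaviour, limited to the control characters listed in this issue and paired with a warning similar to the one quoted above. The function `strip_c0_controls` and the local `DataLossWarning` class are hypothetical stand-ins for illustration, not code from html5lib or lxml.

```python
import re
import warnings


class DataLossWarning(UserWarning):
    """Local stand-in for a data-loss warning category (hypothetical)."""


# The characters named in this issue: U+0001-U+0008, U+000B, U+000C,
# U+000E-U+001F. U+0000 is deliberately left out; it is not covered here.
_C0_CONTROLS = re.compile('[\x01-\x08\x0b\x0c\x0e-\x1f]')


def strip_c0_controls(data):
    """Return data with the listed control characters removed, warning if any were."""
    cleaned, count = _C0_CONTROLS.subn('', data)
    if count:
        warnings.warn(
            'Removed %d control character(s) from text' % count,
            DataLossWarning,
        )
    return cleaned
```

Tab, line feed and carriage return are left untouched, since lxml accepts them.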
EmilStenstrom commented on May 18, 2014
Here's a workaround for anyone that needs to get things working before this bug is fixed. Just run this code over the html before sending it to html5lib:
Replace invalid characters with U+FFFD (fixes html5lib#96)
lastorset commented on Jun 6, 2014
Pull request: #162
bradleyayers commented on Aug 19, 2015
@SimonSapin these characters are not valid HTML, see https://en.wikipedia.org/wiki/Character_encodings_in_HTML#Illegal_characters
SimonSapin commented on Aug 19, 2015
@bradleyayers To support that claim, Wikipedia links to a document named "SGML Declaration of HTML 4" and published in 1999. The relevant specification is https://whatwg.org/html. Also, what do you mean exactly by "valid"? Conformance requirements are different for authors and implementations.
bradleyayers commented on Aug 19, 2015
By "valid" I mean "able to be represented in HTML". I'm saying it's not possible to represent U+0001 in HTML.
It can't be represented by numeric character references (see https://html.spec.whatwg.org/multipage/syntax.html#character-references):
Nor can it be represented by encoding it directly (see https://html.spec.whatwg.org/multipage/syntax.html#preprocessing-the-input-stream):
gsnedders commented on Aug 19, 2015
For the reference for numeric character references, note the start of the section:
The actual parsing of a character reference (https://html.spec.whatwg.org/multipage/syntax.html#tokenizing-character-references) says:
Follow the cross-reference for parse error:
If you follow the error handling, note those characters are never replaced by anything else, and hence they end up in the DOM. The same is true for the ranges in the pre-processing.
EmilStenstrom commented on Aug 27, 2015
@gsnedders Do I understand correctly that this is not a bug in html5lib (which just lets these characters through, per the spec), but in lxml, which does not expect those characters?
SimonSapin commented on Aug 27, 2015
@EmilStenstrom I’d say that’s debatable. https://html.spec.whatwg.org/multipage/#coercing-an-html-dom-into-an-infoset describes how to map to a more restricted XML API. The problem is raising exceptions rather than doing this coercion.
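A rough illustration of that kind of coercion for text content. On one reading of the linked section, a tool may replace U+000C with U+0020 and any other character the XML API rejects with U+FFFD. The sketch assumes "rejected" means "outside the XML 1.0 Char production"; `coerce_to_xml_text` is a hypothetical helper, not an html5lib API.

```python
import re

# Anything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Written for Python 3, where strings are sequences of code points.
_NON_XML_CHAR = re.compile(
    '[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def _replacement(match):
    # Assumed policy: form feed becomes a space, everything else U+FFFD.
    return ' ' if match.group() == '\x0c' else '\ufffd'


def coerce_to_xml_text(data):
    """Return a copy of data that only contains XML 1.0 Char characters."""
    return _NON_XML_CHAR.sub(_replacement, data)
```

With this policy, `coerce_to_xml_text('<p>\x01')` yields `'<p>\ufffd'` instead of raising, and ordinary whitespace passes through unchanged.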
EmilStenstrom commented on Aug 27, 2015
@SimonSapin Doesn't the exception come from lxml rather than html5lib?
SimonSapin commented on Aug 27, 2015
It does, but html5lib should munge the data to avoid triggering this exception.
13 remaining items
willkg commented on Nov 27, 2017
I'm going to bump this out of the 1.0 milestone.
@gsnedders If you can get to this before December 1st, I'm game for re-adding it.
lpla commented on May 7, 2020
In case anyone needs to use @EmilStenstrom's code in Python 3, I just ported it:
EmilStenstrom commented on May 7, 2020
@lpla Ours has evolved as we slowly corrected errors while parsing erroneously encoded text in hundreds of thousands of HTML e-mails. This is the current version we are using, compatible with both Python 2 (narrow and wide builds) and Python 3, and with type hints:
lpla commented on May 7, 2020
Any license for this code?
EmilStenstrom commented on May 7, 2020
I hereby release it as public domain.
gsnedders commented on Jun 16, 2020
At this point, the only Python release we support narrow builds on is 2.7; all versions of Py3 we support are always wide. This, to be fair, makes this a lot easier to fix, so we should probably take a stab at this soon.
SimonSapin commented on Jun 17, 2020
How do narrow vs. wide builds affect lxml/libxml2 being peculiar about control characters?
gsnedders commented on Jun 17, 2020
@SimonSapin They don't (the string is converted to UTF-8 before being passed to libxml2, IIRC), but they do affect our ability to detect which strings will trigger it: on a narrow build we can't just iterate through a string and compare each value to the relevant production in XML, either ourselves or with a regex. The only remaining complexity is whether libxml2 is enforcing XML 1.0 4th or 5th edition.
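To illustrate the detection problem, the sketch below runs on Python 3 (always wide) and emulates a narrow build by splitting a string into UTF-16 code units. A per-unit check against the XML 1.0 Char production then rejects a valid astral character, because its surrogate halves are not in the production. All function names are hypothetical, and the check covers only the Char production, not any edition-specific rules.

```python
import struct


def is_xml_char(code):
    """True if the integer code is in the XML 1.0 Char production."""
    return (
        code in (0x9, 0xA, 0xD)
        or 0x20 <= code <= 0xD7FF
        or 0xE000 <= code <= 0xFFFD
        or 0x10000 <= code <= 0x10FFFF
    )


def utf16_units(text):
    """Return the UTF-16 code units of text, as a narrow build would store them."""
    data = text.encode('utf-16-le')
    return list(struct.unpack('<%dH' % (len(data) // 2), data))


def all_xml_chars(text):
    """Per-code-point check: correct on wide builds and on Python 3."""
    return all(is_xml_char(ord(ch)) for ch in text)


def all_xml_chars_per_unit(text):
    """Naive per-unit check, emulating iteration over a narrow-build string."""
    return all(is_xml_char(unit) for unit in utf16_units(text))
```

For `'a\U0001F600'` the per-code-point check accepts the string while the per-unit check rejects it, which is the mismatch a narrow-build fix would have to work around by pairing surrogates first.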
sshishov commented on Nov 14, 2024
@gsnedders how are you doing? Are we going to try to tackle the issue or not?