lxml doesn’t like control characters

Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.

Each of these triggers the exception below:

```
import html5lib

# Each call raises on its own; run them one at a time.
html5lib.parse('&#1;', treebuilder='lxml')
html5lib.parse('\x01', treebuilder='lxml')
html5lib.parse('', treebuilder='lxml')
html5lib.parse('', treebuilder='lxml')
```

```
Traceback (most recent call last):
  File "/tmp/a.py", line 4, in <module>
    html5lib.parse('&#1;', treebuilder='lxml')
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
```

U+000C in text (but not in attribute values) is replaced by U+0020 with a warning:

```
DataLossWarning: Text cannot contain U+000C
```

libxml2’s HTML parser replaces them with nothing, which I slightly prefer. Anyway, this is probably what should happen for every character that lxml doesn’t like.
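To make the two options concrete, here is a rough sketch of the filtering step. It is not html5lib code: the helper name `strip_xml_incompatible` and the exact character class are my own assumptions, covering U+0000 plus the C0 controls listed above (everything below U+0020 except tab, LF and CR). lxml may reject other code points that this does not cover.

```python
import re

# Assumed set of characters lxml rejects: U+0000 and the C0 controls
# other than tab (U+0009), LF (U+000A) and CR (U+000D).
_XML_INCOMPATIBLE = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f]')


def strip_xml_incompatible(text, replacement=''):
    """Hypothetical helper: return text with the rejected characters replaced.

    The default drops them, as libxml2's HTML parser does; passing ' '
    mimics the current U+000C handling.
    """
    return _XML_INCOMPATIBLE.sub(replacement, text)


print(repr(strip_xml_incompatible('a\x01b')))       # drop the character
print(repr(strip_xml_incompatible('a\x0cb', ' ')))  # replace with a space
```

Running this over the input before parsing would not be enough, since `&#1;` only becomes U+0001 after the tokenizer decodes it. The filter would have to sit where the lxml tree builder sets text and attribute values.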

lxml doesn’t like control characters #96


EmilStenstrom commented on May 18, 2014

lastorset commented on Jun 6, 2014

bradleyayers commented on Aug 19, 2015

SimonSapin commented on Aug 19, 2015

bradleyayers commented on Aug 19, 2015

gsnedders commented on Aug 19, 2015

EmilStenstrom commented on Aug 27, 2015

SimonSapin commented on Aug 27, 2015

EmilStenstrom commented on Aug 27, 2015

SimonSapin commented on Aug 27, 2015

willkg commented on Nov 27, 2017

lpla commented on May 7, 2020

EmilStenstrom commented on May 7, 2020

lpla commented on May 7, 2020

EmilStenstrom commented on May 7, 2020

gsnedders commented on Jun 16, 2020

SimonSapin commented on Jun 17, 2020

gsnedders commented on Jun 17, 2020

sshishov commented on Nov 14, 2024
