
Commit a3d26a3

committed
Add README files to tokenizer/tree-construction directories.
Fixes html5lib#5; taken from <http://wiki.whatwg.org/wiki/Parser_tests>, licensed under CC0/MIT license.
1 parent 7faff61 commit a3d26a3

File tree

2 files changed: +193 -0 lines changed


tokenizer/README.md

Lines changed: 104 additions & 0 deletions

Tokenizer tests
===============

The test format is [JSON](http://www.json.org/). This has the advantage
that the syntax allows backward-compatible extensions to the tests and
the disadvantage that it is relatively verbose.

Basic Structure
---------------

    {"tests": [
        {"description": "Test description",
        "input": "input_string",
        "output": [expected_output_tokens],
        "initialStates": [initial_states],
        "lastStartTag": last_start_tag,
        "ignoreErrorOrder": ignore_error_order
        }
    ]}

Multiple tests per file are allowed simply by adding more objects to the
"tests" list.

`description`, `input` and `output` are always present. The other values
are optional.
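
For instance, a minimal test file with a single test might look like this
(a hypothetical example, not one of the files in this suite):

    {"tests": [
        {"description": "Named character reference",
        "input": "&amp;",
        "output": [["Character", "&"]]}
    ]}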

### Test set-up

`test.input` is a string containing the characters to pass to the
tokenizer. Specifically, it represents the characters of the **input
stream**, and so implementations are expected to perform the processing
described in the spec's **Preprocessing the input stream** section
before feeding the result to the tokenizer.

If `test.doubleEscaped` is present and `true`, then `test.input` is not
quite as described above. Instead, it must first be subjected to another
round of unescaping (i.e., in addition to any unescaping involved in the
JSON import), and the result of *that* represents the characters of the
input stream. Currently, the only unescaping required by this option is
to convert each sequence of the form \\uHHHH (where H is a hex digit)
into the corresponding Unicode code point. (Note that this option also
affects the interpretation of `test.output`.)
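
The tests are implementation-language neutral; purely as an illustration,
a Python-based runner might apply this extra unescaping round along these
lines:

    import re

    def double_unescape(s):
        # Convert each literal \uHHHH sequence left over after JSON parsing
        # into the corresponding code point (lone surrogates included).
        return re.sub(r"\\u([0-9A-Fa-f]{4})",
                      lambda m: chr(int(m.group(1), 16)), s)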

`test.initialStates` is a list of strings, each being the name of a
tokenizer state. The test should be run once for each string, using it
to set the tokenizer's initial state for that run. If
`test.initialStates` is omitted, it defaults to `["data state"]`.

`test.lastStartTag` is a lowercase string that should be used as "the
tag name of the last start tag to have been emitted from this
tokenizer", referenced in the spec's definition of **appropriate end tag
token**. If it is omitted, it is treated as if "no start tag has been
emitted from this tokenizer".
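
Taken together, a runner's outer loop might look roughly like the sketch
below, where `Tokenizer` and its arguments are hypothetical stand-ins for
whatever tokenizer is being tested:

    def run_test(test):
        for state in test.get("initialStates", ["data state"]):
            # One run per requested initial state; lastStartTag may be absent.
            tokenizer = Tokenizer(initial_state=state,
                                  last_start_tag=test.get("lastStartTag"))
            assert tokenizer.tokenize(test["input"]) == test["output"]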

### Test results

`test.output` is a list of tokens, ordered with the first produced by
the tokenizer the first (leftmost) in the list. The list must match the
**complete** list of tokens that the tokenizer should produce. Valid
tokens are:

    ["DOCTYPE", name, public_id, system_id, correctness]
    ["StartTag", name, {attributes}*, true*]
    ["StartTag", name, {attributes}]
    ["EndTag", name]
    ["Comment", data]
    ["Character", data]
    "ParseError"
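
For example (a made-up test, not one from this suite), the input string
`<!-- c --><p>&amp;` tokenized from the data state would be expected to
produce:

    [["Comment", " c "], ["StartTag", "p", {}], ["Character", "&"]]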

`public_id` and `system_id` are either strings or `null`. `correctness`
is either `true` or `false`; `true` corresponds to the force-quirks flag
being false, and vice-versa.

When the self-closing flag is set, the `StartTag` array has `true` as
its fourth entry. When the flag is not set, the array has only three
entries for backwards compatibility.
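
For instance, `<!DOCTYPE html>` and `<br/>` would be represented as:

    ["DOCTYPE", "html", null, null, true]
    ["StartTag", "br", {}, true]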

All adjacent character tokens are coalesced into a single
`["Character", data]` token.

If `test.doubleEscaped` is present and `true`, then every string within
`test.output` must be further unescaped (as described above) before
comparing with the tokenizer's output.

`test.ignoreErrorOrder` is a boolean value indicating that the order of
`ParseError` tokens relative to other tokens in the output stream is
unimportant, and implementations should ignore such differences between
their output and `expected_output_tokens`. (This is used for errors
emitted by the input stream preprocessing stage, since it is useful to
test that code but it is undefined when the errors occur.) If it is
omitted, it defaults to `false`.
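
A harness's comparison step can take the coalescing and
`ignoreErrorOrder` rules into account by normalizing its own output
first. A rough Python sketch (an assumption about how a runner might be
written, not part of the test format):

    def tokens_match(actual, expected, ignore_error_order=False):
        # Coalesce adjacent Character tokens in the implementation's output.
        coalesced = []
        for token in actual:
            if (token != "ParseError" and token[0] == "Character" and
                    coalesced and coalesced[-1] != "ParseError" and
                    coalesced[-1][0] == "Character"):
                coalesced[-1] = ["Character", coalesced[-1][1] + token[1]]
            else:
                coalesced.append(token)
        if ignore_error_order:
            # Errors are compared by count only; other tokens keep their order.
            non_errors = lambda tokens: [t for t in tokens if t != "ParseError"]
            return (non_errors(coalesced) == non_errors(expected) and
                    coalesced.count("ParseError") == expected.count("ParseError"))
        return coalesced == expected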

xmlViolation tests
------------------

`tokenizer/xmlViolation.test` differs from the above in a couple of
ways:

- The name of the single member of the top-level JSON object is
  "xmlViolationTests" instead of "tests".
- Each test's expected output assumes that the implementation is applying
  the tweaks given in the spec's "Coercing an HTML DOM into an
  infoset" section.

tree-construction/README.md

Lines changed: 89 additions & 0 deletions

Tree Construction Tests
=======================

Each file containing tree construction tests consists of any number of
tests separated by two newlines (LF) and a single newline before the end
of the file. For instance:

    [TEST]LF
    LF
    [TEST]LF
    LF
    [TEST]LF

Where [TEST] has the following format:

Each test must begin with a string "\#data" followed by a newline (LF).
All subsequent lines until a line that says "\#errors" are the test data
and must be passed to the system being tested unchanged, except with the
final newline (on the last line) removed.

Then there must be a line that says "\#errors". It must be followed by
one line per parse error that a conformant checker would return. It
doesn't matter what those lines are, although they can't be
"\#document-fragment", "\#document", or empty; the only thing that
matters is that there be the right number of parse errors.

Then there \*may\* be a line that says "\#document-fragment", which must
be followed by a newline (LF), followed by a string of characters that
indicates the context element, followed by a newline (LF). If this line
is present, the "\#data" must be parsed using the HTML fragment parsing
algorithm with the context element as context.

Then there must be a line that says "\#document", which must be followed
by a dump of the tree of the parsed DOM. Each node must be represented
by a single line. Each line must start with "| ", followed by two spaces
per parent node that the node has before the root document node.

- Element nodes must be represented by a "`<`" then the *tag name
  string* then "`>`", and all the attributes must be given, sorted
  lexicographically by UTF-16 code unit according to their *attribute
  name string*, on subsequent lines, as if they were children of the
  element node.
- Attribute nodes must have the *attribute name string*, then an "="
  sign, then the attribute value in double quotes (").
- Text nodes must be the string, in double quotes. Newlines aren't
  escaped.
- Comments must be "`<`" then "`!-- `" then the data then "` -->`".
- DOCTYPEs must be "`<!DOCTYPE `" then the name, then, if either the
  public id or the system id is non-empty, a space, the public id in
  double-quotes, another space, and the system id in double-quotes, and
  then in any case "`>`".
- Processing instructions must be "`<?`", then the target, then a
  space, then the data and then "`>`". (The HTML parser cannot emit
  processing instructions, but scripts can, and the WebVTT to DOM
  rules can emit them.)

The *tag name string* is the local name prefixed by a namespace
designator. For the HTML namespace, the namespace designator is the
empty string, i.e. there's no prefix. For the SVG namespace, the
namespace designator is "svg ". For the MathML namespace, the namespace
designator is "math ".

The *attribute name string* is the local name prefixed by a namespace
designator. For no namespace, the namespace designator is the empty
string, i.e. there's no prefix. For the XLink namespace, the namespace
designator is "xlink ". For the XML namespace, the namespace designator
is "xml ". For the XMLNS namespace, the namespace designator is
"xmlns ". Note the difference between "xlink:href", which is an
attribute in no namespace with the local name "xlink:href", and
"xlink href", which is an attribute in the xlink namespace with the
local name "href".

If there is also a "\#document-fragment", the bit following "\#document"
must be a representation of the HTML fragment serialization for the
context element given by "\#document-fragment".
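
Producing this dump amounts to a small recursive walk over the tree. The
sketch below is an illustration only: the `Element`, `Text` and `Comment`
classes are hypothetical stand-ins for whatever tree the parser under test
builds, element and attribute names are assumed to already include their
namespace designators, and DOCTYPE and processing-instruction handling is
omitted for brevity:

    from dataclasses import dataclass, field

    @dataclass
    class Text:
        data: str

    @dataclass
    class Comment:
        data: str

    @dataclass
    class Element:
        name: str
        attributes: dict = field(default_factory=dict)
        children: list = field(default_factory=list)

    def dump(node, depth=0, lines=None):
        lines = [] if lines is None else lines
        indent = "| " + "  " * depth
        if isinstance(node, Text):
            lines.append('%s"%s"' % (indent, node.data))
        elif isinstance(node, Comment):
            lines.append("%s<!-- %s -->" % (indent, node.data))
        else:
            lines.append("%s<%s>" % (indent, node.name))
            # Attributes are dumped as if they were children of the element.
            # Sorted by code point here, which approximates the UTF-16
            # code unit order required above.
            for name in sorted(node.attributes):
                lines.append('%s  %s="%s"' % (indent, name, node.attributes[name]))
            for child in node.children:
                dump(child, depth + 1, lines)
        return lines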

For example:

    #data
    <p>One<p>Two
    #errors
    3: Missing document type declaration
    #document
    | <html>
    |   <head>
    |   <body>
    |     <p>
    |       "One"
    |     <p>
    |       "Two"
