Tokenizer tests
===============

The test format is [JSON](http://www.json.org/). This has the advantage
that the syntax allows backward-compatible extensions to the tests and
the disadvantage that it is relatively verbose.

Basic Structure
---------------

    {"tests": [
        {"description": "Test description",
         "input": "input_string",
         "output": [expected_output_tokens],
         "initialStates": [initial_states],
         "lastStartTag": last_start_tag,
         "ignoreErrorOrder": ignore_error_order
        }
    ]}

Multiple tests per file are allowed simply by adding more objects to the
"tests" list.

`description`, `input` and `output` are always present. The other values
are optional.
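
For instance (an illustrative test, not taken from the suite), a file
containing a single minimal test could look like this; the token format
is described under **Test results** below:

    {"tests": [
        {"description": "Single start tag",
         "input": "<h1>",
         "output": [["StartTag", "h1", {}]]}
    ]}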

### Test set-up

`test.input` is a string containing the characters to pass to the
tokenizer. Specifically, it represents the characters of the **input
stream**, and so implementations are expected to perform the processing
described in the spec's **Preprocessing the input stream** section
before feeding the result to the tokenizer.
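
A minimal sketch of that step in Python, covering only the newline
normalization (the spec's preprocessing also defines parse errors for
surrogates, noncharacters, and certain control characters, which are
omitted here):

    def preprocess(chars):
        # Normalize CRLF pairs, then any remaining lone CRs, to LF, as
        # described in "Preprocessing the input stream".  Code-point
        # checks that only raise parse errors are left out.
        return chars.replace("\r\n", "\n").replace("\r", "\n")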

If `test.doubleEscaped` is present and `true`, then `test.input` is not
quite as described above. Instead, it must first be subjected to another
round of unescaping (i.e., in addition to any unescaping involved in the
JSON import), and the result of *that* represents the characters of the
input stream. Currently, the only unescaping required by this option is
to convert each sequence of the form \\uHHHH (where H is a hex digit)
into the corresponding Unicode code point. (Note that this option also
affects the interpretation of `test.output`.)
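
A sketch of that extra unescaping round (`unescape` is a hypothetical
helper name; unlike a JSON parse, it maps each `\uHHHH` escape to a
code point one-to-one, which lets tests describe lone surrogates):

    import re

    _ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

    def unescape(s):
        # Convert each \uHHHH sequence into the code point it names.
        # Surrogate escapes become lone surrogate code points rather
        # than being combined into non-BMP characters.
        return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)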

`test.initialStates` is a list of strings, each being the name of a
tokenizer state. The test should be run once for each string, using it
to set the tokenizer's initial state for that run. If
`test.initialStates` is omitted, it defaults to `["data state"]`.
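
In a test runner this typically becomes a loop over the states; a
minimal sketch, with `tokenize` standing in as a hypothetical name for
the implementation under test:

    def run_test(test, tokenize):
        # One tokenizer run per requested initial state, defaulting to
        # a single run that starts in the data state.
        for state in test.get("initialStates", ["data state"]):
            yield state, tokenize(test["input"], initial_state=state)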

`test.lastStartTag` is a lowercase string that should be used as "the
tag name of the last start tag to have been emitted from this
tokenizer", referenced in the spec's definition of **appropriate end tag
token**. If it is omitted, it is treated as if "no start tag has been
emitted from this tokenizer".
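
For instance (again illustrative), these two options combine to let a
test start inside RCDATA content, where `</title>` only closes because
it matches the last start tag:

    {"description": "Appropriate end tag in RCDATA",
     "input": "foo</title>",
     "output": [["Character", "foo"], ["EndTag", "title"]],
     "initialStates": ["RCDATA state"],
     "lastStartTag": "title"}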

### Test results

`test.output` is a list of tokens, ordered so that the first token
produced by the tokenizer is the first (leftmost) in the list. The list
must match the **complete** list of tokens that the tokenizer should
produce. Valid tokens are:

    ["DOCTYPE", name, public_id, system_id, correctness]
    ["StartTag", name, {attributes}*, true*]
    ["StartTag", name, {attributes}]
    ["EndTag", name]
    ["Comment", data]
    ["Character", data]
    "ParseError"

`public_id` and `system_id` are either strings or `null`. `correctness`
is either `true` or `false`; `true` corresponds to the force-quirks flag
being false, and vice-versa.
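
For example, `<!DOCTYPE html>` is expected to produce
`["DOCTYPE", "html", null, null, true]`: the public and system
identifiers are absent (`null`), and the force-quirks flag is off, so
`correctness` is `true`.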

When the self-closing flag is set, the `StartTag` array has `true` as
its fourth entry. When the flag is not set, the array has only three
entries for backwards compatibility.
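
For example, `<br/>` is expected to produce
`["StartTag", "br", {}, true]`, whereas `<br>` produces the three-entry
form `["StartTag", "br", {}]`.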

All adjacent character tokens are coalesced into a single
`["Character", data]` token.
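
As a sketch, a test runner might normalize its own token stream like
this before comparing it with `test.output`:

    def coalesce(tokens):
        # Merge each run of adjacent ["Character", data] tokens into a
        # single token.  "ParseError" entries are plain strings, so the
        # isinstance checks leave them untouched.
        out = []
        for tok in tokens:
            if (isinstance(tok, list) and tok[0] == "Character"
                    and out and isinstance(out[-1], list)
                    and out[-1][0] == "Character"):
                out[-1] = ["Character", out[-1][1] + tok[1]]
            else:
                out.append(tok)
        return out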

If `test.doubleEscaped` is present and `true`, then every string within
`test.output` must be further unescaped (as described above) before
comparing with the tokenizer's output.

`test.ignoreErrorOrder` is a boolean value indicating that the order of
`ParseError` tokens relative to other tokens in the output stream is
unimportant, and implementations should ignore such differences between
their output and `expected_output_tokens`. (This is used for errors
emitted by the input stream preprocessing stage, since it is useful to
test that code but it is undefined when the errors occur). If it is
omitted, it defaults to `false`.
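
A sketch of a comparison honoring this flag (note that under this
reading the error *count* still has to match; only the ordering is
relaxed):

    def output_matches(expected, actual, ignore_error_order=False):
        # With the flag set, compare the "ParseError" count and the
        # sequence of all remaining tokens independently, so only the
        # relative position of errors is ignored.
        if not ignore_error_order:
            return expected == actual
        def errors(tokens):
            return sum(1 for t in tokens if t == "ParseError")
        def rest(tokens):
            return [t for t in tokens if t != "ParseError"]
        return (errors(expected) == errors(actual)
                and rest(expected) == rest(actual))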

xmlViolation tests
------------------

`tokenizer/xmlViolation.test` differs from the above in a couple of
ways:

- The name of the single member of the top-level JSON object is
  "xmlViolationTests" instead of "tests".
- Each test's expected output assumes that the implementation is
  applying the tweaks given in the spec's "Coercing an HTML DOM into an
  infoset" section.