Question about the logic of merged "Character" tokens after PR #92 in tokenizer tests #96

syjer · 2017-07-03T13:18:11Z

Hi, I've noticed that after the PR #92 in 2 test cases the "Character" tokens were merged even though some Tokenizer errors are present in between them.

Even though the "ParseError" is no more present in the output, all the tests are still in the following form (taken from test1.test):

{"description":"Entity without trailing semicolon (1)",
"input":"I'm &notit",
"output":[["Character","I'm "], ["Character", "\u00ACit"]],
"errors": [
    {"code" : "missing-semicolon-after-character-reference", "line": 1, "col": 9 }
]}

Previously, a "ParseError" would be present in between the 2 "Character" tokens, so they would not be merged as the README of the tokenizer specify:

 All adjacent character tokens are coalesced into a single ["Character", data] token.

All the tests follow the old logic except the following 2 (new?) tests in domjs.test:

{
           "description":"NUL in script HTML comment",
           "doubleEscaped":true,
           "initialStates":["Script data state"],
           "input":"<!--test\\u0000--><!--test-\\u0000--><!--test--\\u0000-->",
           "output":[["Character", "<!--test\\uFFFD--><!--test-\\uFFFD--><!--test--\\uFFFD-->"]],
           "errors":[
               { "code": "unexpected-null-character", "line": 1, "col": 9 },
               { "code": "unexpected-null-character", "line": 1, "col": 22 },
               { "code": "unexpected-null-character", "line": 1, "col": 36 }
           ]
        }

I would expect:

["Character", "<!--test"], ["Character", "\\uFFFD--><!--test-"], ["Character", "\\uFFFD--><!--test--"], ["Character", "\\uFFFD-->"]

and

{
           "description":"NUL in script HTML comment - double escaped",
           "doubleEscaped":true,
           "initialStates":["Script data state"],
           "input":"<!--<script>\\u0000--><!--<script>-\\u0000--><!--<script>--\\u0000-->",
           "output":[["Character", "<!--<script>\\uFFFD--><!--<script>-\\uFFFD--><!--<script>--\\uFFFD-->"]],
           "errors":[
                { "code": "unexpected-null-character", "line": 1, "col": 13 },
                { "code": "unexpected-null-character", "line": 1, "col": 30 },
                { "code": "unexpected-null-character", "line": 1, "col": 48 }
           ]
        }

I would expect:

["Character", "<!--<script>"], ["Character", "\\uFFFD--><!--<script>-"], ["Character", "\\uFFFD--><!--<script>--"], ["Character", "\\uFFFD-->"]

So, my questions is:

are the 2 new tests considered correct and all the others should have the "Characters" merged as in the README ? Or is there something that I'm missing?

The text was updated successfully, but these errors were encountered:

gsnedders · 2017-07-03T13:43:29Z

They should all be merged.

cc/ @inikulin

syjer · 2017-07-03T13:47:51Z

@gsnedders thank you for your prompt response, as a (temporary?) fix I'll merge them at runtime in my test suite.

inikulin · 2017-07-03T16:23:04Z

@gsnedders we can do merging, but we will need to do it using automated tool and we will lose current formatting. If anyone is OK with that, then we can do it.

RReverser · 2017-07-18T13:43:17Z

Found all of them with a simple one-liner, doesn't seem to be too many, can fix by hand:

[ { name: 'test1.test',
    tests:
     [ { description: 'Entity without trailing semicolon (1)',
         output: [ [ 'Character', 'I\'m ' ], [ 'Character', '¬it' ] ] },
       { description: 'Entity without trailing semicolon (2)',
         output: [ [ 'Character', 'I\'m ' ], [ 'Character', '¬in' ] ] } ] },
  { name: 'test2.test',
    tests:
     [ { description: 'Hexadecimal entity pair representing a surrogate pair',
         output: [ [ 'Character', '�' ], [ 'Character', '�' ] ] } ] },
  { name: 'test3.test',
    tests:
     [ { description: '<\\u0000',
         output: [ [ 'Character', '<' ], [ 'Character', '\u0000' ] ] } ] },
  { name: 'test4.test',
    tests:
     [ { description: 'Empty hex numeric entities',
         output: [ [ 'Character', '&#x ' ], [ 'Character', '&#X ' ] ] },
       { description: 'Empty decimal numeric entities',
         output: [ [ 'Character', '&# ' ], [ 'Character', '&#; ' ] ] },
       { description: 'Surrogate code point edge cases',
         output:
          [ [ 'Character', '퟿' ],
            [ 'Character', '�' ],
            [ 'Character', '�' ],
            [ 'Character', '�' ],
            [ 'Character', '�' ] ] } ] },
  { name: 'unicodeCharsProblematic.test',
    tests:
     [ { description: 'CR followed by U+0000',
         output: [ [ 'Character', '\n' ], [ 'Character', '\u0000' ] ] } ] } ]

inikulin · 2017-07-18T14:09:54Z

@RReverser Hmm, I've expected that there should be much more than that. Do you mind updating your PR?

RReverser · 2017-07-18T14:10:52Z

~~Yeah I'll reopen.~~ Turns out I didn't even close it yet, so updated now.

inikulin added bug tokenizer labels Jul 12, 2017

inikulin mentioned this issue Jul 18, 2017

Concatenate character tokens #98

Merged

inikulin closed this as completed in #98 Jul 18, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question about the logic of merged "Character" tokens after PR #92 in tokenizer tests #96

Question about the logic of merged "Character" tokens after PR #92 in tokenizer tests #96

syjer commented Jul 3, 2017 •

edited

Loading

gsnedders commented Jul 3, 2017

syjer commented Jul 3, 2017

inikulin commented Jul 3, 2017 •

edited

Loading

RReverser commented Jul 18, 2017 •

edited

Loading

inikulin commented Jul 18, 2017

RReverser commented Jul 18, 2017 •

edited

Loading

Question about the logic of merged "Character" tokens after PR #92 in tokenizer tests #96

Question about the logic of merged "Character" tokens after PR #92 in tokenizer tests #96

Comments

syjer commented Jul 3, 2017 • edited Loading

gsnedders commented Jul 3, 2017

syjer commented Jul 3, 2017

inikulin commented Jul 3, 2017 • edited Loading

RReverser commented Jul 18, 2017 • edited Loading

inikulin commented Jul 18, 2017

RReverser commented Jul 18, 2017 • edited Loading

syjer commented Jul 3, 2017 •

edited

Loading

inikulin commented Jul 3, 2017 •

edited

Loading

RReverser commented Jul 18, 2017 •

edited

Loading

RReverser commented Jul 18, 2017 •

edited

Loading