-
Notifications
You must be signed in to change notification settings - Fork 63
Question about the logic of merged "Character" tokens after PR #92 in tokenizer tests #96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
They should all be merged. cc/ @inikulin |
@gsnedders thank you for your prompt response, as a (temporary?) fix I'll merge them at runtime in my test suite. |
@gsnedders we can do merging, but we will need to do it using automated tool and we will lose current formatting. If anyone is OK with that, then we can do it. |
Found all of them with a simple one-liner, doesn't seem to be too many, can fix by hand: [ { name: 'test1.test',
tests:
[ { description: 'Entity without trailing semicolon (1)',
output: [ [ 'Character', 'I\'m ' ], [ 'Character', '¬it' ] ] },
{ description: 'Entity without trailing semicolon (2)',
output: [ [ 'Character', 'I\'m ' ], [ 'Character', '¬in' ] ] } ] },
{ name: 'test2.test',
tests:
[ { description: 'Hexadecimal entity pair representing a surrogate pair',
output: [ [ 'Character', '�' ], [ 'Character', '�' ] ] } ] },
{ name: 'test3.test',
tests:
[ { description: '<\\u0000',
output: [ [ 'Character', '<' ], [ 'Character', '\u0000' ] ] } ] },
{ name: 'test4.test',
tests:
[ { description: 'Empty hex numeric entities',
output: [ [ 'Character', '&#x ' ], [ 'Character', '&#X ' ] ] },
{ description: 'Empty decimal numeric entities',
output: [ [ 'Character', '&# ' ], [ 'Character', '&#; ' ] ] },
{ description: 'Surrogate code point edge cases',
output:
[ [ 'Character', '' ],
[ 'Character', '�' ],
[ 'Character', '�' ],
[ 'Character', '�' ],
[ 'Character', '�' ] ] } ] },
{ name: 'unicodeCharsProblematic.test',
tests:
[ { description: 'CR followed by U+0000',
output: [ [ 'Character', '\n' ], [ 'Character', '\u0000' ] ] } ] } ] |
@RReverser Hmm, I've expected that there should be much more than that. Do you mind updating your PR? |
|
Hi, I've noticed that after the PR #92 in 2 test cases the "Character" tokens were merged even though some Tokenizer errors are present in between them.
Even though the "ParseError" is no more present in the output, all the tests are still in the following form (taken from test1.test):
Previously, a "ParseError" would be present in between the 2 "Character" tokens, so they would not be merged as the README of the tokenizer specify:
All the tests follow the old logic except the following 2 (new?) tests in domjs.test:
I would expect:
and
I would expect:
So, my questions is:
are the 2 new tests considered correct and all the others should have the "Characters" merged as in the README ? Or is there something that I'm missing?
The text was updated successfully, but these errors were encountered: