Skip to content

Fix tokenizer EOF error positions #144

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 11, 2022
Merged

Conversation

fb55
Copy link
Contributor

@fb55 fb55 commented Feb 28, 2022

I am trying to move parse5 to the upstream html5lib-tests repo (away from this fork). As a first PR to come from this effort, this PR corrects some tokenizer errors. The changes are in three categories:

  1. Off-by-one errors for EOF errors. Most EOF errors already point at the column after the last character, with some exceptions. These exceptions were fixed.
  2. Line breaks being ignored by some EOF errors. Similar to (1), these are the exception.
  3. unknown-named-character-reference errors were missing entirely and have been added. Reverted.

Copy link
Contributor

@Ms2ger Ms2ger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds fine. I'll give it a few days for others to comment; ping me if I forget

Copy link
Contributor

@untitaker untitaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe absence of errors is correct because &not (without a semicolon) is a valid character reference. This is also in line with what Firefox does to the HTML document a&noti, which decodes as a¬i.

@untitaker
Copy link
Contributor

untitaker commented Feb 28, 2022

In fact if you check the spec, &noti.. is the exact example they use to describe that edgecase: https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state:

if the markup contains the string I'm &notit; I tell you in an attribute, no character reference is parsed and string remains intact (and there is no parse error).

@fb55 fb55 changed the title Fix tokenizer error positions, add missing errors Fix tokenizer error positions Mar 2, 2022
@fb55
Copy link
Contributor Author

fb55 commented Mar 2, 2022

Thanks a lot for flagging @untitaker. I've reverted the additions.

@fb55 fb55 changed the title Fix tokenizer error positions Fix tokenizer EOF error positions Mar 2, 2022
@untitaker
Copy link
Contributor

error locations are not actually standardized, right? this is just to make the testsuite internally consistent?

@fb55
Copy link
Contributor Author

fb55 commented Mar 2, 2022

error locations are not actually standardized, right? this is just to make the testsuite internally consistent?

That is correct.

Copy link
Contributor

@untitaker untitaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a maintainer but this lgtm

@fb55
Copy link
Contributor Author

fb55 commented Mar 10, 2022

@Ms2ger It would be great if you could have another look at this (as well as #145 if possible)!

@Ms2ger Ms2ger merged commit 457a78a into html5lib:master Mar 11, 2022
@fb55 fb55 deleted the tokenizer-errors branch March 11, 2022 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants