Skip to content

Fix ECMA 262 regex whitespace tests. #380

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 11, 2020

Conversation

ChALkeR
Copy link
Member

@ChALkeR ChALkeR commented May 30, 2020

Looks like there was a bug at the time of introduction due to #285, which was introduced in #282 and finalized in #286.

  1. At the time of introduction (Tests for ECMA 262 regex dialect #282), those tests followed a false premise but passed because of a mistype.
  2. When the mistype was noticed (Wrong tests in ecmascript-regex.json  #285), \\ were replaced with \ (Fix data - escape unicode, not regex pattern #286).
  3. But now the test is invalid as is failing in all implementations.

Per ECMA 262 specification:
https://www.ecma-international.org/ecma-262/10.0/index.html#sec-characterclassescape

\S refers to \s:

Section 21.2.2.12:

The production CharacterClassEscape::S evaluates as follows:

  1. Return the set of all characters not included in the set returned by CharacterClassEscape::s.

\s refers to WhiteSpace:

The production CharacterClassEscape::s evaluates as follows:

  1. Return the set of characters containing the characters that are on the right-hand side of the WhiteSpace or LineTerminator productions.

WhiteSpace includes NBSP (U00A0):
https://www.ecma-international.org/ecma-262/10.0/index.html#prod-WhiteSpace

That is the exact opposite of what the current test was checking for.

Also, WhiteSpace includes any other Unicode "Space_Separator" code points.

That's not a new addition, same was true in version 6.0, for example:

@ChALkeR ChALkeR force-pushed the fix-ecmascript-regex branch from 1bc0ade to b96850f Compare May 30, 2020 07:33
@Julian
Copy link
Member

Julian commented May 30, 2020

Hi -- will review this carefully, thanks for the PR, it's definitely appreciated, although saying something is "blindly corrected" is unnecessary snark.

@ChALkeR
Copy link
Member Author

ChALkeR commented May 30, 2020

@Julian Hi! Sorry, I didn't mean it to have that tone, I just tried to describe the probable cause of the bug. I amended the text, hope it's better now. Thanks!

@karenetheridge
Copy link
Member

karenetheridge commented May 31, 2020

side note: a lot of these tests don't even use the "format" keyword, so they should be moved into a separate file in optional/, not optional/format. (Looks like it was me, in commit 8388f27, oops! I will fix tomorrow.)

@ChALkeR
Copy link
Member Author

ChALkeR commented Jun 3, 2020

Two reasons why I think that the existing test is a mistake are:

  1. The test name, ECMA 262 \s matches ascii whitespace only, directly contradicts ECMA 262 specification, cited above
  2. The way how this was introduced with a mistype at first

@karenetheridge
Copy link
Member

@ChALkeR some of the test files you modified were just moved in master; sorry! could you possibly rebase?

Looks like there was a bug at the time of introduction due to
json-schema-org#285

Per ECMA 262 specification:
https://www.ecma-international.org/ecma-262/10.0/index.html#sec-characterclassescape

`\S` refers to `\s`:

> # Section 21.2.2.12:

> The production `CharacterClassEscape::S` evaluates as follows:
>  1. Return the set of all characters not included in the set returned
>     by `CharacterClassEscape::s`.

`\s` refers to `WhiteSpace`:

> The production `CharacterClassEscape::s` evaluates as follows:
>  1. Return the set of characters containing the characters that are on
>     the right-hand side of the WhiteSpace or LineTerminator
>     productions.

`WhiteSpace` includes NBSP (**U00A0**):
https://www.ecma-international.org/ecma-262/10.0/index.html#prod-WhiteSpace

That is the exact opposite of what the current test was checking for.

Also, `WhiteSpace` includes any other Unicode "Space_Separator" code
points.

That's not a new addition, same was true in version 6.0:
* https://www.ecma-international.org/ecma-262/6.0/index.html#sec-characterclassescape
* https://www.ecma-international.org/ecma-262/6.0/index.html#sec-white-space
@ChALkeR ChALkeR force-pushed the fix-ecmascript-regex branch from b96850f to a01ae54 Compare June 9, 2020 00:32
@ChALkeR
Copy link
Member Author

ChALkeR commented Jun 9, 2020

@karenetheridge Rebased!

@karenetheridge
Copy link
Member

While you're in these files, could you please fix this errors also? (line 157 in your version of the file) -- \\w should be \\W:

    "description": "ECMA 262 \\w matches everything but ascii letters",

@karenetheridge
Copy link
Member

I find it very troubling that in ECMA 262, \w and \d have ascii semantics, while \s is expected to have unicode semantics -- in my implementation I get to choose between one or the other. On balance I think unicode semantics will be more useful, and I can document that if only ascii digits or word characters are desired, then the pattern should instead use [0-9] and [A-Za-z0-9_].

@karenetheridge
Copy link
Member

karenetheridge commented Jun 9, 2020

Are the tests for characters \u2029 and \u2003 correct? That doesn't seem to match the table at https://www.ecma-international.org/ecma-262/10.0/index.html#sec-white-space.

@ChALkeR
Copy link
Member Author

ChALkeR commented Jun 10, 2020

While you're in these files, could you please fix this errors also? (line 157 in your version of the file) -- \\w should be \\W:

Done.


Are the tests for characters \u2029 and \u2003 correct?
That doesn't seem to match the table at https://www.ecma-international.org/ecma-262/10.0/index.html#sec-white-space.

Yes, they are correct, their descriptions mention the reason why they are treated as whitespace:

            {
                "description": "paragraph separator matches (line terminator)",
                "data": "\u2029",
                "valid": true
            },
            {
                "description": "EM SPACE matches (Space_Separator)",
                "data": "\u2003",
                "valid": true
            },

https://www.ecma-international.org/ecma-262/10.0/index.html#sec-white-space (table 32) says, in the last row:

Other category “Zs” Any other Unicode “Space_Separator” code point <USP>

\u2003 is in the Space_Separator category.
To confirm that, we can use this list of space separators (Zs):

> curl http://www.unicode.org/Public/UNIDATA/UnicodeData.txt -s | grep ';Zs;'
0020;SPACE;Zs;0;WS;;;;;N;;;;;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
202F;NARROW NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;;;;;
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;

Note how e.g. 200B;ZERO WIDTH SPACE is not in Zs, so:

> /\s/.test('\u200a') // 200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
true
> /\s/.test('\u200b') // 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
false

\u2029 matches because it's listed in the next table, "Line Terminators": https://www.ecma-international.org/ecma-262/10.0/index.html#sec-line-terminators (table 33)

And \s is defined to match both:

Return the set of characters containing the characters that are on the right-hand side of the WhiteSpace or LineTerminator productions.


I find it very troubling that in ECMA 262, \w and \d have ascii semantics, while \s is expected to have unicode semantics

That's in the ECMA 262 spec and it's how it works for a long time. Breaking that would break a lot of existing js code.

If these tests are designed to follow ECMA 262 (like they are documented to do), those rules should be followed too.

JSON Schema specification documents that it uses ECMA 262 regexp dialect:

@ChALkeR
Copy link
Member Author

ChALkeR commented Jun 10, 2020

In total, that's 6 explicitly listed white-space + 4 line terminators + 17 Zs (pulled above from the unicode data) - 2 Zs already explicitly listed (0020 and 00A0) = 25 code points.

Which exactly matches the following test:

const matched = []
for (let i = 0; i < 0x110000; i++) {
  if (/\s/.test(String.fromCodePoint(i)))
    matched.push(i)
}
console.log(matched.map(x => x.toString(16).padStart(4, '0')))
console.log(matched.length)

Outputs:

[
  '0009', '000a', '000b', '000c', '000d', '0020', '00a0', '1680', '2000',
  '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009',
  '200a', '2028', '2029', '202f', '205f', '3000', 'feff'
]
25

@karenetheridge karenetheridge added the bug A test is wrong, or tooling is broken or buggy. label Jun 10, 2020
@karenetheridge karenetheridge requested a review from a team June 10, 2020 22:28
Copy link
Member

@karenetheridge karenetheridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this PR. Does anyone else want to review/weigh in?

@Julian
Copy link
Member

Julian commented Jun 11, 2020

OK I think this is good -- I took a random sample and checked and yeah at this point I think it's had enough eyes on it -- I'm sure it's still possible we've still not gotten this spot on but I'm confident enough it's better than what's here, so thanks for your expertise! (Merging)

@Julian Julian merged commit 8dfa8ad into json-schema-org:master Jun 11, 2020
@ChALkeR ChALkeR deleted the fix-ecmascript-regex branch August 21, 2021 18:34
davishmcclurg added a commit to davishmcclurg/json_schemer that referenced this pull request Oct 21, 2021
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380

Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands
them into a character class that does.
davishmcclurg added a commit to davishmcclurg/json_schemer that referenced this pull request Apr 22, 2022
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380

Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands
them into a character class that does.
davishmcclurg added a commit to davishmcclurg/json_schemer that referenced this pull request May 15, 2023
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380

Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands
them into a character class that does.
davishmcclurg added a commit to davishmcclurg/json_schemer that referenced this pull request May 16, 2023
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380

Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands
them into a character class that does.
davishmcclurg added a commit to davishmcclurg/json_schemer that referenced this pull request May 22, 2023
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380

Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands
them into a character class that does.
davishmcclurg added a commit to davishmcclurg/json_schemer that referenced this pull request May 26, 2023
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380

Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands
them into a character class that does.
davishmcclurg added a commit to davishmcclurg/json_schemer that referenced this pull request May 26, 2023
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380

Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands
them into a character class that does.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A test is wrong, or tooling is broken or buggy.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants