Fix ECMA 262 regex whitespace tests. #380

ChALkeR · 2020-05-30T07:27:15Z

Looks like these tests have been wrong since they were introduced in #282; the problem was reported in #285 and finalized in #286.

At the time of introduction (Tests for ECMA 262 regex dialect #282), those tests followed a false premise but passed because of a mistype.
When the mistype was noticed (Wrong tests in ecmascript-regex.json #285), \\ were replaced with \ (Fix data - escape unicode, not regex pattern #286).
But now the test is invalid and fails in all implementations.
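To illustrate the escaping difference described above, here is a minimal sketch of how a doubled backslash changes what JSON test data contains. The strings are illustrative assumptions, not the exact data from the original test file.

```javascript
// Raw JSON text as it might appear in a test file.
// String.raw keeps the backslashes exactly as written.
const doubled = JSON.parse(String.raw`"\\u00a0"`); // escaped backslash followed by the text "u00a0"
const single = JSON.parse(String.raw`"\u00a0"`);   // a real JSON Unicode escape for NBSP

console.log(doubled.length);      // 6 (the literal characters \ u 0 0 a 0)
console.log(single.length);       // 1 (a single NO-BREAK SPACE)
console.log(/\s/.test(doubled));  // false: no whitespace in the literal text
console.log(/\s/.test(single));   // true: NBSP is whitespace per ECMA 262
```

With the doubled backslash, the regex is never tested against NBSP at all, which is how a test built on a false premise can still pass.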

Per ECMA 262 specification:
https://www.ecma-international.org/ecma-262/10.0/index.html#sec-characterclassescape

\S refers to \s:

Section 21.2.2.12:

The production CharacterClassEscape::S evaluates as follows:

Return the set of all characters not included in the set returned by CharacterClassEscape::s.

\s refers to WhiteSpace:

The production CharacterClassEscape::s evaluates as follows:

Return the set of characters containing the characters that are on the right-hand side of the WhiteSpace or LineTerminator productions.

WhiteSpace includes NBSP (U+00A0):
https://www.ecma-international.org/ecma-262/10.0/index.html#prod-WhiteSpace

That is the exact opposite of what the current test was checking for.

Also, WhiteSpace includes any other Unicode "Space_Separator" code points.
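The spec text quoted above can be checked directly in any ECMA 262 engine. This is a small sketch; the variable names are mine and the patterns are not taken from the test suite.

```javascript
const nbsp = '\u00a0';    // NO-BREAK SPACE, listed explicitly in WhiteSpace
const emSpace = '\u2003'; // EM SPACE, Unicode category Zs (Space_Separator)

const results = {
  nbspMatchesLowerS: /^\s$/.test(nbsp),
  nbspMatchesUpperS: /^\S$/.test(nbsp),
  emSpaceMatchesLowerS: /^\s$/.test(emSpace),
  emSpaceMatchesUpperS: /^\S$/.test(emSpace),
};

console.log(results);
// { nbspMatchesLowerS: true, nbspMatchesUpperS: false,
//   emSpaceMatchesLowerS: true, emSpaceMatchesUpperS: false }
```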

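That's not a new addition; the same was true in version 6.0, for example:
https://www.ecma-international.org/ecma-262/6.0/index.html#sec-characterclassescape
https://www.ecma-international.org/ecma-262/6.0/index.html#sec-white-space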

Julian · 2020-05-30T22:37:03Z

Hi -- will review this carefully, thanks for the PR, it's definitely appreciated, although saying something is "blindly corrected" is unnecessary snark.

ChALkeR · 2020-05-30T22:42:09Z

@Julian Hi! Sorry, I didn't mean it to have that tone, I just tried to describe the probable cause of the bug. I amended the text, hope it's better now. Thanks!

karenetheridge · 2020-05-31T05:58:00Z

side note: a lot of these tests don't even use the "format" keyword, so they should be moved into a separate file in optional/, not optional/format. (Looks like it was me, in commit 8388f27, oops! I will fix tomorrow.)

ChALkeR · 2020-06-03T02:50:07Z

Two reasons why I think that the existing test is a mistake are:

The test name, "ECMA 262 \s matches ascii whitespace only", directly contradicts the ECMA 262 specification cited above.
The way it was introduced, with a mistype at first.

karenetheridge · 2020-06-08T23:42:52Z

@ChALkeR some of the test files you modified were just moved in master; sorry! could you possibly rebase?

Looks like there was a bug at the time of introduction due to json-schema-org#285

Per ECMA 262 specification:
https://www.ecma-international.org/ecma-262/10.0/index.html#sec-characterclassescape

`\S` refers to `\s`:

> # Section 21.2.2.12:
> The production `CharacterClassEscape::S` evaluates as follows:
> 1. Return the set of all characters not included in the set returned
>    by `CharacterClassEscape::s`.

`\s` refers to `WhiteSpace`:

> The production `CharacterClassEscape::s` evaluates as follows:
> 1. Return the set of characters containing the characters that are on
>    the right-hand side of the WhiteSpace or LineTerminator
>    productions.

`WhiteSpace` includes NBSP (**U00A0**):
https://www.ecma-international.org/ecma-262/10.0/index.html#prod-WhiteSpace

That is the exact opposite of what the current test was checking for.

Also, `WhiteSpace` includes any other Unicode "Space_Separator" code points.

That's not a new addition, same was true in version 6.0:
* https://www.ecma-international.org/ecma-262/6.0/index.html#sec-characterclassescape
* https://www.ecma-international.org/ecma-262/6.0/index.html#sec-white-space

ChALkeR · 2020-06-09T00:33:10Z

@karenetheridge Rebased!

karenetheridge · 2020-06-09T18:51:46Z

While you're in these files, could you please fix this error also? (line 157 in your version of the file) -- \\w should be \\W:

    "description": "ECMA 262 \\w matches everything but ascii letters",

karenetheridge · 2020-06-09T18:54:00Z

I find it very troubling that in ECMA 262, \w and \d have ascii semantics, while \s is expected to have unicode semantics -- in my implementation I get to choose between one or the other. On balance I think unicode semantics will be more useful, and I can document that if only ascii digits or word characters are desired, then the pattern should instead use [0-9] and [A-Za-z0-9_].
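The asymmetry described here is easy to observe in a JavaScript engine. The sketch below uses sample characters of my own choosing and no regex flags.

```javascript
const arabicIndicZero = '\u0660'; // ARABIC-INDIC DIGIT ZERO, Unicode category Nd
const eAcute = '\u00e9';          // LATIN SMALL LETTER E WITH ACUTE
const noBreakSpace = '\u00a0';    // NO-BREAK SPACE

// \d and \w are ASCII-only in ECMA 262, so they agree with the explicit classes.
const digitChecks = [/\d/.test(arabicIndicZero), /[0-9]/.test(arabicIndicZero)];
const wordChecks = [/\w/.test(eAcute), /[A-Za-z0-9_]/.test(eAcute)];

// \s is Unicode-aware and matches a non-ASCII space.
const spaceCheck = /\s/.test(noBreakSpace);

console.log(digitChecks); // [ false, false ]
console.log(wordChecks);  // [ false, false ]
console.log(spaceCheck);  // true
```

An implementation in another language that wants ECMA 262 behaviour therefore needs ASCII semantics for \d and \w but the Unicode set for \s.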

karenetheridge · 2020-06-09T19:13:15Z

Are the tests for characters \u2029 and \u2003 correct? That doesn't seem to match the table at https://www.ecma-international.org/ecma-262/10.0/index.html#sec-white-space.

ChALkeR · 2020-06-10T08:09:11Z

> While you're in these files, could you please fix this error also? (line 157 in your version of the file) -- \\w should be \\W:

Done.

> Are the tests for characters \u2029 and \u2003 correct?
> That doesn't seem to match the table at https://www.ecma-international.org/ecma-262/10.0/index.html#sec-white-space.

Yes, they are correct; their descriptions mention the reason they are treated as whitespace:

            {
                "description": "paragraph separator matches (line terminator)",
                "data": "\u2029",
                "valid": true
            },
            {
                "description": "EM SPACE matches (Space_Separator)",
                "data": "\u2003",
                "valid": true
            },

https://www.ecma-international.org/ecma-262/10.0/index.html#sec-white-space (table 32) says, in the last row:

Code Point: Other category “Zs” | Name: Any other Unicode “Space_Separator” code point | Abbreviation: <USP>

\u2003 is in the Space_Separator category.
To confirm that, we can use this list of space separators (Zs):

> curl http://www.unicode.org/Public/UNIDATA/UnicodeData.txt -s | grep ';Zs;'
0020;SPACE;Zs;0;WS;;;;;N;;;;;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
202F;NARROW NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;;;;;
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;

Note how e.g. 200B;ZERO WIDTH SPACE is not in Zs, so:

> /\s/.test('\u200a') // 200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
true
> /\s/.test('\u200b') // 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
false
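The category membership can also be checked without downloading UnicodeData.txt. This sketch assumes an engine that supports ES2018 Unicode property escapes (`\p{...}` with the `u` flag); the helper name `isSpaceSeparator` is hypothetical.

```javascript
// True when the single character belongs to Unicode general category Zs.
const isSpaceSeparator = (ch) => /^\p{Zs}$/u.test(ch);

console.log(isSpaceSeparator('\u200a')); // true:  HAIR SPACE is Zs
console.log(isSpaceSeparator('\u200b')); // false: ZERO WIDTH SPACE is Cf
console.log(isSpaceSeparator('\u2003')); // true:  EM SPACE is Zs
console.log(isSpaceSeparator('\u2029')); // false: PARAGRAPH SEPARATOR is Zp, not Zs
```

The last line shows that Zs membership alone does not explain every \s match; \u2029 is covered by a different rule.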

\u2029 matches because it's listed in the next table, "Line Terminators": https://www.ecma-international.org/ecma-262/10.0/index.html#sec-line-terminators (table 33)

And \s is defined to match both:

Return the set of characters containing the characters that are on the right-hand side of the WhiteSpace or LineTerminator productions.

> I find it very troubling that in ECMA 262, \w and \d have ascii semantics, while \s is expected to have unicode semantics

That's in the ECMA 262 spec and it's how it has worked for a long time. Breaking that would break a lot of existing js code.

If these tests are designed to follow ECMA 262 (as they are documented to do), those rules should be followed too.

JSON Schema specification documents that it uses ECMA 262 regexp dialect:

ChALkeR · 2020-06-10T18:06:47Z

In total, that's 6 explicitly listed white-space + 4 line terminators + 17 Zs (pulled above from the unicode data) - 2 Zs already explicitly listed (0020 and 00A0) = 25 code points.

Which exactly matches the following test:

const matched = []
for (let i = 0; i < 0x110000; i++) {
  if (/\s/.test(String.fromCodePoint(i)))
    matched.push(i)
}
console.log(matched.map(x => x.toString(16).padStart(4, '0')))
console.log(matched.length)

Outputs:

[
  '0009', '000a', '000b', '000c', '000d', '0020', '00a0', '1680', '2000',
  '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009',
  '200a', '2028', '2029', '202f', '205f', '3000', 'feff'
]
25
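As a cross-check of the arithmetic, the expected set can be built from the three sources named above and compared with what the engine reports. This is a sketch: the array names are mine, and the Zs list is copied from the UnicodeData.txt output earlier in this thread.

```javascript
// Explicitly listed WhiteSpace code points: TAB, VT, FF, SP, NBSP, ZWNBSP.
const explicitWhiteSpace = [0x0009, 0x000b, 0x000c, 0x0020, 0x00a0, 0xfeff];
// LineTerminator code points: LF, CR, LS, PS.
const lineTerminators = [0x000a, 0x000d, 0x2028, 0x2029];
// The 17 Zs code points from the grep output above.
const spaceSeparators = [
  0x0020, 0x00a0, 0x1680, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
  0x2006, 0x2007, 0x2008, 0x2009, 0x200a, 0x202f, 0x205f, 0x3000,
];

// The Set removes the two duplicates (0020 and 00A0).
const expected = [...new Set([...explicitWhiteSpace, ...lineTerminators, ...spaceSeparators])]
  .sort((a, b) => a - b);

// Brute-force every code point against \s, as in the test above.
const actual = [];
for (let i = 0; i < 0x110000; i++) {
  if (/\s/.test(String.fromCodePoint(i))) actual.push(i);
}

const sameSet = expected.length === actual.length &&
  expected.every((cp, idx) => cp === actual[idx]);

console.log(expected.length, actual.length, sameSet);
```

If the engine follows the tables, both lengths are 25 and `sameSet` is true.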

karenetheridge

I'm good with this PR. Does anyone else want to review/weigh in?

Julian · 2020-06-11T23:01:29Z

OK I think this is good -- I took a random sample and checked and yeah at this point I think it's had enough eyes on it -- I'm sure it's still possible we've still not gotten this spot on but I'm confident enough it's better than what's here, so thanks for your expertise! (Merging)

Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380 Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands them into a character class that does.
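The commit above does not show the character class it uses. As a hypothetical illustration of the idea, written in JavaScript rather than Ruby, the 25 code points can be spelled out as an explicit class and compared against the engine's own \s:

```javascript
// Explicit class covering the ECMA 262 WhiteSpace and LineTerminator code points.
// This is an illustrative reconstruction, not the class from the referenced commit.
const expandedWhitespace =
  /[\t\n\v\f\r \u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]/;

// Collect every code point where the explicit class and \s disagree.
const mismatches = [];
for (let i = 0; i < 0x110000; i++) {
  const ch = String.fromCodePoint(i);
  if (expandedWhitespace.test(ch) !== /\s/.test(ch)) mismatches.push(i);
}

console.log(mismatches.length); // 0 when the class is equivalent to \s
```

Porting the class to another regex dialect requires that dialect's own escape syntax for the non-ASCII code points.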

ChALkeR force-pushed the fix-ecmascript-regex branch from 1bc0ade to b96850f Compare May 30, 2020 07:33

karenetheridge mentioned this pull request May 31, 2020

fix pattern matching karenetheridge/JSON-Schema-Modern#27

Closed

ChALkeR force-pushed the fix-ecmascript-regex branch from b96850f to a01ae54 Compare June 9, 2020 00:32

Fix \W test description

ca8319c

karenetheridge added the bug A test is wrong, or tooling is broken or buggy. label Jun 10, 2020

karenetheridge requested a review from a team June 10, 2020 22:28

karenetheridge approved these changes Jun 10, 2020

View reviewed changes

Julian merged commit 8dfa8ad into json-schema-org:master Jun 11, 2020

ChALkeR mentioned this pull request Jun 17, 2020

Update JSON schema test suite ebdrup/json-schema-benchmark#47

Closed

karenetheridge mentioned this pull request Jul 9, 2020

Clarify which regular expression production rule from the ECMA specification is intended json-schema-org/json-schema-spec#821

Closed

ChALkeR mentioned this pull request Aug 21, 2021

fix Unicode tests in accordance to pattern/patternProperties spec #505

Merged

ChALkeR deleted the fix-ecmascript-regex branch August 21, 2021 18:34
