-
-
Notifications
You must be signed in to change notification settings - Fork 215
Fix ECMA 262 regex whitespace tests. #380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1bc0ade
to
b96850f
Compare
Hi -- will review this carefully, thanks for the PR, it's definitely appreciated, although saying something is "blindly corrected" is unnecessary snark. |
@Julian Hi! Sorry, I didn't mean it to have that tone, I just tried to describe the probable cause of the bug. I amended the text, hope it's better now. Thanks! |
side note: a lot of these tests don't even use the "format" keyword, so they should be moved into a separate file in optional/, not optional/format. (Looks like it was me, in commit 8388f27, oops! I will fix tomorrow.) |
Two reasons why I think that the existing test is a mistake are:
|
@ChALkeR some of the test files you modified were just moved in master; sorry! could you possibly rebase? |
Looks like there was a bug at the time of introduction due to json-schema-org#285 Per ECMA 262 specification: https://www.ecma-international.org/ecma-262/10.0/index.html#sec-characterclassescape `\S` refers to `\s`: > # Section 21.2.2.12: > The production `CharacterClassEscape::S` evaluates as follows: > 1. Return the set of all characters not included in the set returned > by `CharacterClassEscape::s`. `\s` refers to `WhiteSpace`: > The production `CharacterClassEscape::s` evaluates as follows: > 1. Return the set of characters containing the characters that are on > the right-hand side of the WhiteSpace or LineTerminator > productions. `WhiteSpace` includes NBSP (**U00A0**): https://www.ecma-international.org/ecma-262/10.0/index.html#prod-WhiteSpace That is the exact opposite of what the current test was checking for. Also, `WhiteSpace` includes any other Unicode "Space_Separator" code points. That's not a new addition, same was true in version 6.0: * https://www.ecma-international.org/ecma-262/6.0/index.html#sec-characterclassescape * https://www.ecma-international.org/ecma-262/6.0/index.html#sec-white-space
b96850f
to
a01ae54
Compare
@karenetheridge Rebased! |
While you're in these files, could you please fix this errors also? (line 157 in your version of the file) --
|
I find it very troubling that in ECMA 262, |
Are the tests for characters |
Done.
Yes, they are correct, their descriptions mention the reason why they are treated as whitespace: {
"description": "paragraph separator matches (line terminator)",
"data": "\u2029",
"valid": true
},
{
"description": "EM SPACE matches (Space_Separator)",
"data": "\u2003",
"valid": true
}, https://www.ecma-international.org/ecma-262/10.0/index.html#sec-white-space (table 32) says, in the last row:
> curl http://www.unicode.org/Public/UNIDATA/UnicodeData.txt -s | grep ';Zs;'
0020;SPACE;Zs;0;WS;;;;;N;;;;;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
202F;NARROW NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;;;;;
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;; Note how e.g. > /\s/.test('\u200a') // 200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
true
> /\s/.test('\u200b') // 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
false
And
That's in the ECMA 262 spec and it's how it works for a long time. Breaking that would break a lot of existing js code. If these tests are designed to follow JSON Schema specification documents that it uses ECMA 262 regexp dialect: |
In total, that's 6 explicitly listed white-space + 4 line terminators + 17 Zs (pulled above from the unicode data) - 2 Zs already explicitly listed ( Which exactly matches the following test: const matched = []
for (let i = 0; i < 0x110000; i++) {
if (/\s/.test(String.fromCodePoint(i)))
matched.push(i)
}
console.log(matched.map(x => x.toString(16).padStart(4, '0')))
console.log(matched.length) Outputs: [
'0009', '000a', '000b', '000c', '000d', '0020', '00a0', '1680', '2000',
'2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009',
'200a', '2028', '2029', '202f', '205f', '3000', 'feff'
]
25 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good with this PR. Does anyone else want to review/weigh in?
OK I think this is good -- I took a random sample and checked and yeah at this point I think it's had enough eyes on it -- I'm sure it's still possible we've still not gotten this spot on but I'm confident enough it's better than what's here, so thanks for your expertise! (Merging) |
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380 Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands them into a character class that does.
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380 Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands them into a character class that does.
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380 Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands them into a character class that does.
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380 Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands them into a character class that does.
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380 Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands them into a character class that does.
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380 Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands them into a character class that does.
Fixes new test failures from: json-schema-org/JSON-Schema-Test-Suite#380 Ruby's `\s` and `\S` don't match the [ECMA 262 spec][0]. This expands them into a character class that does.
Looks like there was a bug at the time of introduction due to #285, which was introduced in #282 and finalized in #286.
\\
were replaced with\
(Fix data - escape unicode, not regex pattern #286).Per ECMA 262 specification:
https://www.ecma-international.org/ecma-262/10.0/index.html#sec-characterclassescape
\S
refers to\s
:\s
refers toWhiteSpace
:WhiteSpace
includesNBSP
(U00A0):https://www.ecma-international.org/ecma-262/10.0/index.html#prod-WhiteSpace
That is the exact opposite of what the current test was checking for.
Also,
WhiteSpace
includes any other Unicode "Space_Separator" code points.That's not a new addition, same was true in version 6.0, for example: