fix Unicode tests in accordance to pattern/patternProperties spec #505
This PR is similar to #380, which fixed an issue almost identical to this one.
The assumptions in #498 are based on an incorrect reading of the spec, and that PR introduced invalid tests.
JSON Schema spec:
https://json-schema.org/draft/2020-12/json-schema-core.html#regex
Note the "according to the regular expression dialect described in ECMA-262, section 21.2.1." and "Unicode support as defined by ECMA-262" parts. Also note the "with the "u" flag" part, which is what "Unicode support as defined by ECMA-262" actually means (see below for ECMA-262 spec refs).
https://json-schema.org/draft/2020-12/json-schema-core.html#rfc.references.1
https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.6.3.3
Note the "according to the regular expression dialect described in ECMA-262" part.
https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.references.1
Btw, there is a typo in this reference: the link should be https://www.ecma-international.org/ecma-262/11.0/index.html, but `/index.html` is missing from it. It's linked correctly in the Core spec HTML, but not in the Validation spec HTML.
ECMA-262 spec:
https://262.ecma-international.org/11.0/#sec-patterns
About Unicode mode:
https://262.ecma-international.org/11.0/#sec-regexp-constructor
The constructor accepts flags and calls `RegExpInitialize ( obj, pattern, flags )`, which defines how those work. Let's take a look there:
https://262.ecma-international.org/11.0/#sec-regexpinitialize
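To make the effect of the flags argument concrete, here is a minimal sketch. The helper names `compileSchemaPattern` and `parses` are hypothetical and not taken from any validator; the sketch only assumes that a validator following the Core spec text would pass `"u"` to the constructor. It uses `\-` as one example of a pattern whose validity depends on the flag.

```javascript
// Hypothetical helper: compile a JSON Schema "pattern" value with the "u" flag,
// which is how I read the Core spec text quoted above (an assumption, not an API).
function compileSchemaPattern(source) {
  return new RegExp(source, "u");
}

// Returns true if `source` parses with the given flags, false on a SyntaxError.
function parses(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected, so re-throw it
  }
}

const escapedHyphen = "\\-"; // the two-character pattern \-

// Outside a character class, \- is accepted without "u" but rejected with "u".
const withoutU = parses(escapedHyphen, "");
const withU = parses(escapedHyphen, "u");

console.log({ withoutU, withU }); // { withoutU: true, withU: false }
```

So the same pattern string can be accepted or rejected depending only on the flags handed to the constructor.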
With Unicode mode (enabled with the `u` flag, mentioned above in the JSON Schema Core spec), the goal symbol for the parse is `Pattern[+U, +N]`.
Not relevant to this PR, but a general note: this changes the parse target, and there are e.g. valid ECMA-262 regexps in non-Unicode mode that are invalid in Unicode mode. JSON Schema declares usage of the Unicode mode part of the specification, hence we should use only that. I.e. anything specific to `Pattern[~U, ~N]` (non-Unicode mode) should be ignored.
https://262.ecma-international.org/11.0/#prod-CharacterClassEscape
The `+U`-labeled ones (`\p` and `\P`) are the ones introduced in Unicode mode.
This is how patterns work:
https://262.ecma-international.org/11.0/#sec-characterclassescape
Note how `\d` is defined here as "ten-element set of characters containing the characters 0 through 9 inclusive."
The definition of `UnicodePropertyValueExpression` (i.e. how exactly `\p{}` and `\P{}` work) is available at the same link, I just didn't copy-paste it here. UnicodeMatchProperty and UnicodeMatchPropertyValue list a set of properties, categories and aliases.
For the definition of `\w`, we have to go to `WordCharacters()`:
https://262.ecma-international.org/11.0/#sec-runtime-semantics-wordcharacters-abstract-operation
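The definitions linked above can be checked against an ECMA-262 engine. This is a small sketch, assuming patterns are compiled with only the `u` flag (no `i`); the variable names and the two sample characters are just illustrations I picked, not values from the test suite.

```javascript
// Every pattern is compiled with "u" only, which is an assumption about how a
// validator would apply the JSON Schema spec text.
const digit = new RegExp("^\\d$", "u");
const word = new RegExp("^\\w$", "u");
const decimalNumber = new RegExp("^\\p{Nd}$", "u");
const letter = new RegExp("^\\p{L}$", "u");

const arabicIndicZero = "\u0660"; // ARABIC-INDIC DIGIT ZERO, General_Category Nd
const eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE, General_Category Ll

const results = {
  asciiDigitMatchesD: digit.test("0"), // true: \d covers 0 through 9
  arabicIndicMatchesD: digit.test(arabicIndicZero), // false: \d is ASCII-only
  arabicIndicMatchesNd: decimalNumber.test(arabicIndicZero), // true: \p{Nd} is Unicode-aware
  eAcuteMatchesW: word.test(eAcute), // false: \w is ASCII-only without "i"
  eAcuteMatchesL: letter.test(eAcute), // true: \p{L} is Unicode-aware
};

console.log(results);
```

In other words, under the `u` flag alone, `\d` and `\w` stay ASCII-only, and matching non-ASCII digits or letters needs the `\p{...}` property escapes.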
Note how `WordCharacters()` doesn't depend on Unicode mode (the `u` flag) unless IgnoreCase is also specified via the `i` flag (and it's not specified per the JSON Schema spec). The Canonicalize definition is a no-op unless IgnoreCase is set to `true`.
Refs: