fix Unicode tests in accordance to pattern/patternProperties spec #505

ChALkeR · 2021-08-21T13:10:33Z

This PR is similar to #380, that fixed an issue almost identical to this one.

The assumptions in #498 are based on an incorrect reading of the spec, and that PR introduced invalid tests.

JSON Schema spec:

https://json-schema.org/draft/2020-12/json-schema-core.html#regex

6.4. Regular Expressions

Keywords MAY use regular expressions to express constraints, or constrain the instance value to be a regular expression. These regular expressions SHOULD be valid according to the regular expression dialect described in ECMA-262, section 21.2.1.

Regular expressions SHOULD be built with the "u" flag (or equivalent) to provide Unicode support, or processed in such a way which provides Unicode support as defined by ECMA-262.

Note the "according to the regular expression dialect described in ECMA-262, section 21.2.1." and "Unicode support as defined by ECMA-262" parts. Also the "with the "u" flag", which is what "Unicode support as defined by ECMA-262" actually means (see below for ECMA-262 spec refs).

https://json-schema.org/draft/2020-12/json-schema-core.html#rfc.references.1

15.1. Normative References

[ecma262] "ECMA-262, 11th edition specification", June 2020.

https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.6.3.3

6.3.3. pattern

The value of this keyword MUST be a string. This string SHOULD be a valid regular expression, according to the ECMA-262 regular expression dialect.

Note the "according to the regular expression dialect described in ECMA-262" part.

https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.references.1

11.1. Normative References

[ecma262] "ECMA-262, 11th edition specification", June 2020.

Btw, a mistype here, the link should be https://www.ecma-international.org/ecma-262/11.0/index.html,
/index.html is mising in the link.
It's correctly linked in the core schema html, but not in validation schema html.

ECMA-262 spec:

https://262.ecma-international.org/11.0/#sec-patterns

About Unicode mode:

https://262.ecma-international.org/11.0/#sec-regexp-constructor

The constructor accepts flags, and calls RegExpInitialize ( obj, pattern, flags ), which defines how those work. Let's take a look there:

https://262.ecma-international.org/11.0/#sec-regexpinitialize

21.2.3.2.2 Runtime Semantics: RegExpInitialize ( obj, pattern, flags )

When the abstract operation RegExpInitialize with arguments obj, pattern, and flags is called, the following steps are taken:
...
6. If F contains "u", let BMP be false; else let BMP be true.
7. If BMP is true, then
....
8. Else,
a. Let pText be ! UTF16DecodeString(P).
b. Parse pText using the grammars in 21.2.1. The goal symbol for the parse is Pattern[+U, +N]. Throw a SyntaxError exception if pText did not conform to the grammar, if any elements of pText were not matched by the parse, or if any Early Error conditions exist.
c. Let patternCharacters be a List whose elements are the code points of pText.

With the unicode mode (enabled with the u flag, mentioned above in JSON Schema Core spec), the goal symbol for the parse is Pattern[+U, +N].

Not relevant to this PR, but a general note: this changes the parse target, and there are e.g. valid ECMA-262 regexps in non-Unicode mode that are invalid in Unicode mode. JSON Schema declares usage of the Unicode mode part of the specification, hence we should use only that. I.e. anything specific to Pattern[~U, ~N] (non-Unicode mode) should be ignored.

https://262.ecma-international.org/11.0/#prod-CharacterClassEscape

+U labeled ones (\p and \P) are the ones introduced in Unicode mode.

This is how patterns work:

https://262.ecma-international.org/11.0/#sec-characterclassescape

21.2.2.12 CharacterClassEscape

The production CharacterClassEscape::d evaluates as follows:
Return the ten-element set of characters containing the characters 0 through 9 inclusive.

The production CharacterClassEscape::D evaluates as follows:
Return the set of all characters not included in the set returned by CharacterClassEscape::d .

The production CharacterClassEscape::s evaluates as follows:
Return the set of characters containing the characters that are on the right-hand side of the WhiteSpace or LineTerminator productions.

The production CharacterClassEscape::S evaluates as follows:
Return the set of all characters not included in the set returned by CharacterClassEscape::s .

The production CharacterClassEscape::w evaluates as follows:
Return the set of all characters returned by WordCharacters().

The production CharacterClassEscape::W evaluates as follows:
Return the set of all characters not included in the set returned by CharacterClassEscape::w .

The production CharacterClassEscape::p{UnicodePropertyValueExpression} evaluates as follows:
Return the CharSet containing all Unicode code points included in the CharSet returned by UnicodePropertyValueExpression.

The production CharacterClassEscape::P{UnicodePropertyValueExpression} evaluates as follows:
Return the CharSet containing all Unicode code points not included in the CharSet returned by UnicodePropertyValueExpression.

Note how \d is defined here as "ten-element set of characters containing the characters 0 through 9 inclusive."

Definition for UnicodePropertyValueExpression (i.e. how exactly do \p{} and \P{} work) is available by the same link, I just didn't copy-paste it here. UnicodeMatchProperty and UnicodeMatchPropertyValue list a set of properties, categories and aliases.

For the definition of \w, we have to go to WordCharacters():

https://262.ecma-international.org/11.0/#sec-runtime-semantics-wordcharacters-abstract-operation

21.2.2.6.1 Runtime Semantics: WordCharacters ( )

The abstract operation WordCharacters performs the following steps:

Let A be a set of characters containing the sixty-three characters: a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _

Let U be an empty set.

For each character c not in set A where Canonicalize(c) is in A, add c to U.

Assert: Unless Unicode and IgnoreCase are both true, U is empty.

Add the characters in set U to set A.

Return A.

Note how this doesn't depend on the Unicode mode (the u flag), unless IgnoreCase is also specified, via the i flag (and it's not specified per JSON Schema spec). Canonicalize definition is a no-op unless IgnoreCase is set to true.

Refs:

Julian · 2021-08-21T13:21:37Z

Thanks for the well researched post (and again for your expertise!)

I'll have to review the links -- though even though I'm sure they're correct, we should keep all these in the ecmascript regex optional file. Implementations using another regex dialect (which they're allowed to do) may indeed have different semantics here, and it's easier to skip the ECMA specific tests if they're all in one place.

ChALkeR · 2021-08-21T13:34:56Z

Those are already in the optional dir, just a separate file that tests just the unicode part of the spec :-).

The current name of the test file is optional/unicode.json, and the spec specifically says Unicode support as defined by ECMA-262. I have no objections to renaming it to e.g. optional/ecma262-unicode.json though, if you think it would be a better name.

karenetheridge · 2021-08-22T06:00:31Z

I agree that we should consolidate optional/unicode.json into optional/ecmascript-regex.json, as there is some crossover and duplication of functionality. I added this file originally, and did not notice the similarity of tests to ecmascript-regex.json at the time.

(I hope people aren't skipping entire test files just because they may have a few failures here and there, though!)

ChALkeR · 2021-09-19T13:02:27Z

@karenetheridge I merged optional/unicode into optional-ecmascript-regex.

Could this be landed now perhaps? =)

karenetheridge

Could this be landed now perhaps? =)

yes :)

fix unicode tests in accordance to pattern/patternProperties spec

9430972

ChALkeR mentioned this pull request Aug 21, 2021

add tests for pattern matching with unicode semantics #498

Merged

ChALkeR requested a review from karenetheridge August 21, 2021 13:11

ChALkeR changed the title ~~fix unicode tests in accordance to pattern/patternProperties spec~~ fix Unicode tests in accordance to pattern/patternProperties spec Aug 21, 2021

karenetheridge mentioned this pull request Aug 21, 2021

fix pattern matching karenetheridge/JSON-Schema-Modern#27

Closed

Consolidate optional/unicode into optional/ecmascript-regex

0fb2d27

karenetheridge approved these changes Sep 19, 2021

View reviewed changes

karenetheridge merged commit 7334b4c into json-schema-org:master Sep 19, 2021

ChALkeR deleted the chalker/fix-unicode branch September 19, 2021 19:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix Unicode tests in accordance to pattern/patternProperties spec #505

fix Unicode tests in accordance to pattern/patternProperties spec #505

ChALkeR commented Aug 21, 2021 •

edited

Loading

Julian commented Aug 21, 2021

ChALkeR commented Aug 21, 2021 •

edited

Loading

karenetheridge commented Aug 22, 2021

ChALkeR commented Sep 19, 2021

karenetheridge left a comment

fix Unicode tests in accordance to pattern/patternProperties spec #505

fix Unicode tests in accordance to pattern/patternProperties spec #505

Conversation

ChALkeR commented Aug 21, 2021 • edited Loading

JSON Schema spec:

6.4. Regular Expressions

15.1. Normative References

6.3.3. pattern

11.1. Normative References

ECMA-262 spec:

21.2.3.2.2 Runtime Semantics: RegExpInitialize ( obj, pattern, flags )

21.2.2.12 CharacterClassEscape

21.2.2.6.1 Runtime Semantics: WordCharacters ( )

Julian commented Aug 21, 2021

ChALkeR commented Aug 21, 2021 • edited Loading

karenetheridge commented Aug 22, 2021

ChALkeR commented Sep 19, 2021

karenetheridge left a comment

Choose a reason for hiding this comment

ChALkeR commented Aug 21, 2021 •

edited

Loading

21.2.3.2.2 Runtime Semantics: `RegExpInitialize ( obj, pattern, flags )`

ChALkeR commented Aug 21, 2021 •

edited

Loading