Skip to content

fix Unicode tests in accordance to pattern/patternProperties spec #505

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 19, 2021

Conversation

ChALkeR
Copy link
Member

@ChALkeR ChALkeR commented Aug 21, 2021

This PR is similar to #380, that fixed an issue almost identical to this one.

The assumptions in #498 are based on an incorrect reading of the spec, and that PR introduced invalid tests.

JSON Schema spec:

https://json-schema.org/draft/2020-12/json-schema-core.html#regex

6.4. Regular Expressions

Keywords MAY use regular expressions to express constraints, or constrain the instance value to be a regular expression. These regular expressions SHOULD be valid according to the regular expression dialect described in ECMA-262, section 21.2.1.

Regular expressions SHOULD be built with the "u" flag (or equivalent) to provide Unicode support, or processed in such a way which provides Unicode support as defined by ECMA-262.

Note the "according to the regular expression dialect described in ECMA-262, section 21.2.1." and "Unicode support as defined by ECMA-262" parts. Also the "with the "u" flag", which is what "Unicode support as defined by ECMA-262" actually means (see below for ECMA-262 spec refs).

https://json-schema.org/draft/2020-12/json-schema-core.html#rfc.references.1

15.1. Normative References

[ecma262] "ECMA-262, 11th edition specification", June 2020.

https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.6.3.3

6.3.3. pattern

The value of this keyword MUST be a string. This string SHOULD be a valid regular expression, according to the ECMA-262 regular expression dialect.

Note the "according to the regular expression dialect described in ECMA-262" part.

https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.references.1

11.1. Normative References

[ecma262] "ECMA-262, 11th edition specification", June 2020.

Btw, a mistype here, the link should be https://www.ecma-international.org/ecma-262/11.0/index.html,
/index.html is mising in the link.
It's correctly linked in the core schema html, but not in validation schema html.

ECMA-262 spec:

https://262.ecma-international.org/11.0/#sec-patterns

About Unicode mode:

https://262.ecma-international.org/11.0/#sec-regexp-constructor

The constructor accepts flags, and calls RegExpInitialize ( obj, pattern, flags ), which defines how those work. Let's take a look there:

https://262.ecma-international.org/11.0/#sec-regexpinitialize

21.2.3.2.2 Runtime Semantics: RegExpInitialize ( obj, pattern, flags )

When the abstract operation RegExpInitialize with arguments obj, pattern, and flags is called, the following steps are taken:
...
6. If F contains "u", let BMP be false; else let BMP be true.
7. If BMP is true, then
....
8. Else,
a. Let pText be ! UTF16DecodeString(P).
b. Parse pText using the grammars in 21.2.1. The goal symbol for the parse is Pattern[+U, +N]. Throw a SyntaxError exception if pText did not conform to the grammar, if any elements of pText were not matched by the parse, or if any Early Error conditions exist.
c. Let patternCharacters be a List whose elements are the code points of pText.

With the unicode mode (enabled with the u flag, mentioned above in JSON Schema Core spec), the goal symbol for the parse is Pattern[+U, +N].

Not relevant to this PR, but a general note: this changes the parse target, and there are e.g. valid ECMA-262 regexps in non-Unicode mode that are invalid in Unicode mode. JSON Schema declares usage of the Unicode mode part of the specification, hence we should use only that. I.e. anything specific to Pattern[~U, ~N] (non-Unicode mode) should be ignored.

https://262.ecma-international.org/11.0/#prod-CharacterClassEscape

+U labeled ones (\p and \P) are the ones introduced in Unicode mode.

This is how patterns work:

https://262.ecma-international.org/11.0/#sec-characterclassescape

21.2.2.12 CharacterClassEscape

The production CharacterClassEscape::d evaluates as follows:
Return the ten-element set of characters containing the characters 0 through 9 inclusive.

The production CharacterClassEscape::D evaluates as follows:
Return the set of all characters not included in the set returned by CharacterClassEscape::d .

The production CharacterClassEscape::s evaluates as follows:
Return the set of characters containing the characters that are on the right-hand side of the WhiteSpace or LineTerminator productions.

The production CharacterClassEscape::S evaluates as follows:
Return the set of all characters not included in the set returned by CharacterClassEscape::s .

The production CharacterClassEscape::w evaluates as follows:
Return the set of all characters returned by WordCharacters().

The production CharacterClassEscape::W evaluates as follows:
Return the set of all characters not included in the set returned by CharacterClassEscape::w .

The production CharacterClassEscape::p{UnicodePropertyValueExpression} evaluates as follows:
Return the CharSet containing all Unicode code points included in the CharSet returned by UnicodePropertyValueExpression.

The production CharacterClassEscape::P{UnicodePropertyValueExpression} evaluates as follows:
Return the CharSet containing all Unicode code points not included in the CharSet returned by UnicodePropertyValueExpression.

Note how \d is defined here as "ten-element set of characters containing the characters 0 through 9 inclusive."

Definition for UnicodePropertyValueExpression (i.e. how exactly do \p{} and \P{} work) is available by the same link, I just didn't copy-paste it here. UnicodeMatchProperty and UnicodeMatchPropertyValue list a set of properties, categories and aliases.

For the definition of \w, we have to go to WordCharacters():

https://262.ecma-international.org/11.0/#sec-runtime-semantics-wordcharacters-abstract-operation

21.2.2.6.1 Runtime Semantics: WordCharacters ( )

The abstract operation WordCharacters performs the following steps:

  1. Let A be a set of characters containing the sixty-three characters: a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _
  2. Let U be an empty set.
  3. For each character c not in set A where Canonicalize(c) is in A, add c to U.
  4. Assert: Unless Unicode and IgnoreCase are both true, U is empty.
  5. Add the characters in set U to set A.
  6. Return A.

Note how this doesn't depend on the Unicode mode (the u flag), unless IgnoreCase is also specified, via the i flag (and it's not specified per JSON Schema spec). Canonicalize definition is a no-op unless IgnoreCase is set to true.

Refs:

@ChALkeR ChALkeR requested a review from karenetheridge August 21, 2021 13:11
@ChALkeR ChALkeR changed the title fix unicode tests in accordance to pattern/patternProperties spec fix Unicode tests in accordance to pattern/patternProperties spec Aug 21, 2021
@Julian
Copy link
Member

Julian commented Aug 21, 2021

Thanks for the well researched post (and again for your expertise!)

I'll have to review the links -- though even though I'm sure they're correct, we should keep all these in the ecmascript regex optional file. Implementations using another regex dialect (which they're allowed to do) may indeed have different semantics here, and it's easier to skip the ECMA specific tests if they're all in one place.

@ChALkeR
Copy link
Member Author

ChALkeR commented Aug 21, 2021

Those are already in the optional dir, just a separate file that tests just the unicode part of the spec :-).

The current name of the test file is optional/unicode.json, and the spec specifically says Unicode support as defined by ECMA-262. I have no objections to renaming it to e.g. optional/ecma262-unicode.json though, if you think it would be a better name.

@karenetheridge
Copy link
Member

I agree that we should consolidate optional/unicode.json into optional/ecmascript-regex.json, as there is some crossover and duplication of functionality. I added this file originally, and did not notice the similarity of tests to ecmascript-regex.json at the time.

(I hope people aren't skipping entire test files just because they may have a few failures here and there, though!)

@ChALkeR
Copy link
Member Author

ChALkeR commented Sep 19, 2021

@karenetheridge I merged optional/unicode into optional-ecmascript-regex.

Could this be landed now perhaps? =)

Copy link
Member

@karenetheridge karenetheridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be landed now perhaps? =)

yes :)

@karenetheridge karenetheridge merged commit 7334b4c into json-schema-org:master Sep 19, 2021
@ChALkeR ChALkeR deleted the chalker/fix-unicode branch September 19, 2021 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants