Skip to content

add tests for pattern matching with unicode semantics #498

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 9, 2021

Conversation

karenetheridge
Copy link
Member

Supporting this is now mandatory in draft2020-12.

@karenetheridge karenetheridge requested a review from a team July 7, 2021 23:43
@gregsdennis
Copy link
Member

I'll need to run these. I'm pretty sure that .Net RegEx doesn't fully like Unicode. It might just be UTF-16 and/or -32, though.

@gregsdennis
Copy link
Member

Yeah, the following fail for .Net.

  • unicode digits are more than 0 through 9 / non-ascii digits (BENGALI DIGIT FOUR, BENGALI DIGIT TWO)
  • unicode semantics should be used for all pattern matching / literal unicode character in json string
  • unicode semantics should be used for all pattern matching / unicode character in hex format in string

for both pattern and patternProperties.

It's not an issue with the tests. It's .Net's RegEx implementation that doesn't support surrogate pairs (bunch of SO questions) representing UTF-32 and .Net string are UTF-16.

I'm not sure how I want to handle this. I'd be curious to what extent other languages support regex matching on surrogate pairs.

@karenetheridge
Copy link
Member Author

Since this is now a required part of the spec, I think it's fair to have tests for it in the normal area. However, I could also add these tests to optional/* for drafts 2019-09 and earlier.

@Julian
Copy link
Member

Julian commented Jul 8, 2021

It's just a SHOULD, not a MUST, and as all things regex (where the reality is no language but JS will use ECMAScript regexes) it's I think perfectly expected if a .NET implementation doesn't support matching on surrogate pairs.

I still personally support putting them here though I guess, and folks should skip them if their language makes it unduly hard to pass.

@karenetheridge
Copy link
Member Author

Then the release notes need updating, because it says it's now required -- https://json-schema.org/draft/2020-12/release-notes.html

@Julian
Copy link
Member

Julian commented Jul 8, 2021

Yes, I think that section is misleading (but it's not normative) -- the spec seems to instead say:

Regular expressions SHOULD be built with the "u" flag (or equivalent) to provide Unicode support, or processed in such a way which provides Unicode support as defined by ECMA-262.

The next section (which I didn't read carefully but looks new-ish to me) might give us a reason to move just the surrogate pair test to optional though now that I see it... specifically:

Furthermore, given the high disparity in regular expression constructs support, schema authors SHOULD limit themselves to the following regular expression tokens:

individual Unicode characters, as defined by the JSON specification;
simple character classes ([abc]), range character classes ([a-z]);
complemented character classes ([^abc], [^a-z]);
simple quantifiers: "+" (one or more), "*" (zero or more), "?" (zero or one), and their lazy versions ("+?", "*?", "??");
range quantifiers: "{x}" (exactly x occurrences), "{x,y}" (at least x, at most y, occurrences), {x,} (x occurrences or more), and their lazy versions;
the beginning-of-input ("^") and end-of-input ("$") anchors;
simple grouping ("(...)") and alternation ("|").

I yeah need to parse that mentally, but we should probably double check surrogate pairs match the criteria that's recommending.

@Relequestual
Copy link
Member

Then the release notes need updating, because it says it's now required -- https://json-schema.org/draft/2020-12/release-notes.html

Yup that's an error. Feel free to make a PR or see if @jdesrosiers wants to handle it =]

@karenetheridge
Copy link
Member Author

Ok what do people want to do here? Is this good to merge as is or do some bits need to move?

@jdesrosiers
Copy link
Member

I'll work on fixing the release notes, probably early next week.

@gregsdennis
Copy link
Member

If it isn't a hard retirement, these tests should go in optional.

  • 2019-09 and earlier should definitely move.
  • 2020-12 seems up in the air still.

If I need to build provision to skip these (or coerce them to optional) in my test suite, I can do that. I've done similar for my JSON Path suite for syntaxes that I'm just not going to support.

@Julian
Copy link
Member

Julian commented Jul 8, 2021

Considering it's just SHOULD, I'd also probably be coming around to all of them belonging in optional.

(Basically all SHOULD tests are in there at the minute that I know of, e.g. ecmascript-regex itself).

@karenetheridge
Copy link
Member Author

karenetheridge commented Jul 8, 2021

I didn't add these tests for 2019-09 and earlier, because I thought that the requirements changed for 2020-12. but if they didn't, and they are just as "SHOULD" as they always were, then we should be consistent. Which I guess that means they go in optional.

As of draft2020-12 supporting this is still just a "SHOULD".
@karenetheridge karenetheridge force-pushed the ether/2020-12-unicode-pattern branch from a42091c to c5f08e2 Compare July 9, 2021 03:33
@karenetheridge
Copy link
Member Author

All tests have been moved into tests/*/optional/unicode.json.

Copy link
Member

@gregsdennis gregsdennis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. I think these are good tests for those who can support it.

@karenetheridge karenetheridge merged commit 4ca4af2 into master Jul 9, 2021
@karenetheridge karenetheridge deleted the ether/2020-12-unicode-pattern branch July 9, 2021 04:27
@Julian
Copy link
Member

Julian commented Jul 9, 2021

It looks like CI on the master branch is failing, because this PR contains invalid tests -- specifically the ones that use true as a schema on pre-draft 6 drafts (in patternProperties: true I think), where schemas must be objects.

Is there something we can improve about CI reporting to make things clearer?

@karenetheridge
Copy link
Member Author

karenetheridge commented Jul 9, 2021

Is there something we can improve about CI reporting to make things clearer?

An email notification on test failure would be nice, but I don't know if that's possible.
I do know it's possible to prevent merges without the tests passing, as we use that at $work.

FWIW I don't even see the test results in this PR at all.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 20, 2021

This is actually against the spec, per spec, \d is equivalent to [0-9], and "unicode digits" class is represented by another pattern.

I'll file a fix PR tomorrow with exact references to the spec.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 20, 2021

In short, https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.6.3.3 refences to a specific regexp dialect, and that's ECMA-262 dialect.

In ECMA-262 dialect, \d is equivalent to [0-9], and "unicode digits" group is represented by \p{digit}.

> /^\d$/u.test('৪২')
false
> /^\p{digit}+$/u.test('৪২')
true

I'll file a PR tomorrow to fix this (with exact refs).

@ChALkeR
Copy link
Member

ChALkeR commented Aug 21, 2021

Fix and explanation in #505.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants