Extend treatment on Unicode and/or its security considerations #215
Comments
Drawing out some implications of your example... With The latter, however, while referring to Unicode "characters" as being escaped as UTF-16, also states, "implementations might return different values for the length of a string value", so it would probably help to be more clear on what the intention is here in deferring to the JSON spec. For example, to enforce the string length is no longer than in your example, should
I agree that it makes sense to think about security aspects of Unicode. But these aspects are not specific to JSON Schema. A separate document might make sense which can be developed by a broader community (including JSON-LD supporters for example).
In the case of
@brettz9 there are tests that require it's 1
Sure, @epoberezkin , but the text ought to be clarified regardless.
One should also consider combining characters (and normalisation forms) and zero-width joiner combinations (where platforms have included custom rules). Both of these will change over time with the Unicode standard. So you could count length as:
You could hand-wave away composition and say "it will be counted based on how the data is transmitted", with the idea that a client should convert to NFC.
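To make the composition point concrete, here is a minimal Python sketch using only the standard library's `unicodedata` module. The sample string (an "é" written two ways) is an assumption chosen for illustration and does not come from this thread. It shows that the code point count depends on the normalization form in which the data arrives.

```python
import unicodedata

# "é" written two ways (illustrative sample, not taken from the thread).
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Python's len() counts code points, so the two spellings differ in length.
print(len(precomposed))  # 1
print(len(decomposed))   # 2

# Normalizing to NFC composes the pair into a single code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(len(nfc), nfc == precomposed)  # 1 True
```

A validator that counts "as transmitted" would therefore accept or reject the same visible text depending on whether the client normalized it first.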
@epoberezkin Using that example, that might be counting in codepoints or graphemes. A better example is: "👩🏾🚀" (woman emoji + skin tone + ZWJ + rocket emoji). You could count that as:
Which one is the correct answer for JSON Schema's And does everyone implement it that way? 😅
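As a rough illustration of how far the candidate counts can diverge, the sketch below measures a woman + skin tone + ZWJ + rocket sequence in three units using only the Python standard library. The helper name `count_units` is hypothetical, and the sequence is written with explicit escapes (assuming the medium-dark skin tone modifier U+1F3FE) so the invisible joiner is not lost in copy and paste.

```python
# woman + medium-dark skin tone + zero-width joiner + rocket,
# written with escapes so the invisible ZWJ (U+200D) is explicit.
astronaut = "\U0001F469\U0001F3FE\u200D\U0001F680"


def count_units(text: str) -> dict[str, int]:
    """Hypothetical helper: measure one string in several units."""
    return {
        "code_points": len(text),  # Python str length is in code points
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


print(count_units(astronaut))
# {'code_points': 4, 'utf16_code_units': 7, 'utf8_bytes': 15}
```

The standard library has no grapheme cluster segmentation, so the grapheme count is not computed here; under the Unicode text segmentation rules this sequence forms a single extended grapheme cluster, which gives a fourth possible answer.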
The current tests check for codepoints. RFC 8259, Section 8.3, mentions "Unicode code units". It seems reasonable to read Unicode the same way wherever else the specification mentions it. I imagine we could make an explicit declaration in Validation, Section 4.1.
Going back and answering my own question: I had a look at the test suite, and it looks like it was implemented in terms of “supplementary Unicode code points” since json-schema-org/JSON-Schema-Test-Suite#52: For For json-schema-org/JSON-Schema-Test-Suite#710 updated all drafts’ tests to use the word “grapheme”. Though the “💩” in that test is a single code point as well as a single grapheme, just one that takes multiple UTF-16 code units to represent.
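The practical consequence for a length limit can be sketched as follows. This is not how any particular JSON Schema implementation works; `length_in` and `satisfies_max_length` are hypothetical helpers that only show how the same one-character limit passes or fails for “💩” depending on the unit counted.

```python
def length_in(text: str, unit: str) -> int:
    """Hypothetical helper: length of text in the requested unit."""
    if unit == "code_points":
        return len(text)
    if unit == "utf16_code_units":
        # Each UTF-16 code unit is two bytes; "-le" avoids a byte order mark.
        return len(text.encode("utf-16-le")) // 2
    raise ValueError(f"unknown unit: {unit}")


def satisfies_max_length(text: str, limit: int, unit: str) -> bool:
    """Hypothetical maxLength-style check under a chosen counting unit."""
    return length_in(text, unit) <= limit


poo = "\U0001F4A9"  # a single code point outside the Basic Multilingual Plane

print(satisfies_max_length(poo, 1, "code_points"))       # True
print(satisfies_max_length(poo, 1, "utf16_code_units"))  # False
```

An implementation built on a UTF-16 string type that uses the native length would get the second result unless it counts code points explicitly.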
Unicode is a complex technology that probably nobody will ever fully understand. But we should add a few notes on common implementation concerns, especially security considerations.
Also consider the behavior of applications that use e.g. UTF-16: