Extend treatment on Unicode and/or its security considerations #215

awwright · 2016-12-29T04:37:19Z

Unicode is a complex technology that probably nobody will ever fully understand. But we should add a few notes on common implementation concerns, especially security considerations.

Also consider the behavior of applications that use e.g. UTF-16:

> '🐲'.length // U+1F432
< 2


brettz9 · 2017-05-02T04:52:50Z

Drawing out some implications of your example... With maxLength/minLength, the validation spec states these refer to the "number of its characters as defined by RFC 7159." (the JSON spec)

The latter, however, while referring to Unicode "characters" as being escaped as UTF-16, also states, "implementations might return different values for the length of a string value", so it would probably help to be more clear on what the intention is here in deferring to the JSON spec.

For example, to enforce that a string is no longer than the one in your example, should maxLength be 1 or 2? I don't think the current spec is actually very clear on this.
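To make the ambiguity concrete, here is a minimal sketch of how the two readings diverge. The helper names `lengthInCodeUnits`, `lengthInCodePoints`, and `satisfiesMaxLength` are hypothetical and not taken from any validator; the sketch relies only on standard JavaScript string behavior.

```javascript
// Hypothetical helpers, for illustration only (not from any JSON Schema library).

// Counts UTF-16 code units, which is what String.prototype.length reports.
function lengthInCodeUnits(value) {
  return value.length;
}

// Counts Unicode code points; Array.from uses the string iterator,
// which walks code points rather than code units.
function lengthInCodePoints(value) {
  return Array.from(value).length;
}

// Applies a maxLength-style check using whichever counting function is passed in.
function satisfiesMaxLength(value, maxLength, countLength) {
  return countLength(value) <= maxLength;
}

const dragon = "\u{1F432}"; // the dragon emoji from the example above

console.log(satisfiesMaxLength(dragon, 1, lengthInCodeUnits));  // false
console.log(satisfiesMaxLength(dragon, 1, lengthInCodePoints)); // true
```

Under the code-unit reading the same instance fails `maxLength: 1`, while under the code-point reading it passes, so two implementations could disagree on the same schema and data.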

akuckartz · 2017-07-21T06:10:42Z

I agree that it makes sense to think about security aspects of Unicode. But these aspects are not specific to JSON Schema. A separate document might make sense which can be developed by a broader community (including JSON-LD supporters for example).

brettz9 · 2017-07-26T17:20:23Z

In the case of maxLength and minLength, if one mistakenly relies on them, these are JSON Schema-specific issues. But again, I don't think the behavior is clearly spec'd.

epoberezkin · 2017-07-26T19:26:22Z

> should maxLength be 1 or 2

@brettz9 there are tests that require it's 1

brettz9 · 2017-07-26T23:49:45Z

Sure, @epoberezkin , but the text ought to be clarified regardless.

micolous · 2025-03-27T00:12:08Z

One should also consider combining characters (and normalisation forms) and zero-width joiner combinations (where platforms have included custom rules). Both of these will change over time with the Unicode standard.

So you could count length as:

bytes (which can vary depending on whether you use UTF-8, UTF-16, UTF-32, etc., as well as how the string is composed, but it's also the simplest to implement)
codepoints (fairly consistent, but codepoints vary in byte length and also vary depending on composition)
characters, i.e. code units (for UTF-16, can be different from codepoints; in UTF-32 each code unit is exactly one codepoint)
graphemes (composition doesn't matter, interpretation evolves with the Unicode standard, but it's closer to the human understanding of "this is a character" in many, but not all languages)

You could hand-wave away composition and say "it will be counted based on how the data is transmitted", with the idea that a client should convert to NFC.
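As a rough illustration of why composition matters, the sketch below counts code points for the same visible character in precomposed and decomposed form, then applies NFC. The helper name `countCodePoints` is hypothetical; `String.prototype.normalize` is standard JavaScript.

```javascript
// "é" written two ways: precomposed (U+00E9) and decomposed (U+0065 + U+0301).
const precomposed = "\u00E9";
const decomposed = "e\u0301";

// Hypothetical helper: counts code points by iterating the string.
const countCodePoints = (s) => Array.from(s).length;

console.log(countCodePoints(precomposed)); // 1
console.log(countCodePoints(decomposed));  // 2

// After NFC normalization, both forms become the same single code point.
const normalized = decomposed.normalize("NFC");
console.log(countCodePoints(normalized));  // 1
console.log(normalized === precomposed);   // true
```

NFC only helps where a precomposed form exists; sequences without one keep their combining marks, so code-point counts can still exceed the number of visible characters.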

> there are tests that require it's 1

@epoberezkin Using that example, that might be counting in codepoints or graphemes.

A better example is: "👩🏾‍🚀" (woman emoji + skin tone + ZWJ + rocket emoji). You could count that as:

bytes: 15 (in UTF-8)
JavaScript: 7 (UTF-16 characters)
codepoints: 4
graphemes: 1
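The counts above can be reproduced with a short sketch, assuming a runtime that provides `TextEncoder` and `Intl.Segmenter` (for example a recent Node.js or browser). The grapheme result depends on the Unicode data shipped with that runtime, so treat it as indicative rather than guaranteed.

```javascript
// The astronaut sequence: woman + medium-dark skin tone + ZWJ + rocket.
const astronaut = "\u{1F469}\u{1F3FE}\u200D\u{1F680}";

const counts = {
  // Bytes in the UTF-8 encoding.
  utf8Bytes: new TextEncoder().encode(astronaut).length,
  // UTF-16 code units, as reported by String.prototype.length.
  utf16CodeUnits: astronaut.length,
  // Code points, via the string iterator.
  codePoints: Array.from(astronaut).length,
  // Extended grapheme clusters, as segmented by the runtime's Unicode data.
  graphemes: Array.from(
    new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(astronaut)
  ).length,
};

console.log(counts);
```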

Which one is the correct answer for JSON Schema's minLength/maxLength?

And does everyone implement it that way? 😅

gregsdennis · 2025-03-27T00:48:06Z

The current tests check for codepoints.

RFC 8259, Section 8.3, mentions "Unicode code units". It's reasonable to apply that same view of Unicode to the rest of the specification wherever Unicode is mentioned.

I imagine we could make an explicit declaration in Validation, Section 4.1.

micolous · 2025-03-28T05:44:40Z

Going back and answering my own question: I had a look at the test suite, and it looks like it was implemented in terms of “supplementary Unicode code points” since json-schema-org/JSON-Schema-Test-Suite#52:

For maxLength: 2:

https://github.com/json-schema-org/JSON-Schema-Test-Suite/blob/69136952196a63a7553803935feaeaec57a48420/tests/draft4/maxLength.json#L26-L30

For minLength: 2:

https://github.com/json-schema-org/JSON-Schema-Test-Suite/blob/69136952196a63a7553803935feaeaec57a48420/tests/draft4/minLength.json#L26-L30

json-schema-org/JSON-Schema-Test-Suite#710 updated all drafts' tests to use the word “grapheme”.

Though that test for “💩” is a single code point as well as a single grapheme, just one that takes multiple UTF-16 characters to represent.
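A small sketch of that observation, assuming a test string made of two such code points (the string and the `measure` helper are illustrative, not copied from the test suite): code points and graphemes give the same count, so such a test can only rule out the UTF-16 code unit interpretation.

```javascript
// Hypothetical helper: measures one string in three different units.
function measure(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    utf16CodeUnits: s.length,
    codePoints: Array.from(s).length,
    graphemes: Array.from(segmenter.segment(s)).length,
  };
}

const pile = "\u{1F4A9}"; // U+1F4A9, outside the Basic Multilingual Plane
const twoPiles = pile + pile;

// Code points and graphemes agree here; only UTF-16 code units differ.
console.log(measure(twoPiles));
```

Distinguishing code points from graphemes would need an input where the two differ, such as a combining sequence or a ZWJ emoji sequence.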

awwright self-assigned this Dec 29, 2016

handrews added the Type: Security label Aug 18, 2017

handrews added the core label Sep 28, 2017

vaitkus mentioned this issue Jun 7, 2022

Property definitions Materials-Consortia/OPTIMADE#376

Merged

gregsdennis added this to Stable Release Development Mar 27, 2025

gregsdennis moved this to In Discussion in Stable Release Development Mar 27, 2025

gregsdennis added this to the stable-release milestone Mar 27, 2025

gregsdennis moved this from In Discussion to Awaiting PR in Stable Release Development Apr 26, 2025
