Extend treatment on Unicode and/or its security considerations #215

Open
awwright opened this issue Dec 29, 2016 · 8 comments

@awwright
Member

Unicode is a complex technology that probably nobody will ever fully understand. But we should add a few notes on common implementation concerns, especially security considerations.

Also consider the behavior of applications that use e.g. UTF-16:

> '🐲' .length // U+1F432
< 2
@awwright awwright self-assigned this Dec 29, 2016
@brettz9

brettz9 commented May 2, 2017

Drawing out some implications of your example... With maxLength/minLength, the validation spec states these refer to the "number of its characters as defined by RFC 7159." (the JSON spec)

The latter, however, while referring to Unicode "characters" as being escaped as UTF-16, also states, "implementations might return different values for the length of a string value", so it would probably help to be more clear on what the intention is here in deferring to the JSON spec.

For example, to enforce that a string is no longer than the one in your example, should maxLength be 1 or 2? I don't think the current spec is actually very clear on this.
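
To make the ambiguity concrete, here is a minimal sketch of two ways a `maxLength` check could count the dragon string from the example above: by UTF-16 code units or by code points. The helper names (`lengthInCodeUnits`, `lengthInCodePoints`, `satisfiesMaxLength`) are hypothetical and not taken from any validator or from the spec.

```javascript
// Hypothetical helpers for illustration only; not from any real validator.

// Counts UTF-16 code units, which is what String.prototype.length reports.
function lengthInCodeUnits(s) {
  return s.length;
}

// Counts Unicode code points; the string iterator yields one item per code point.
function lengthInCodePoints(s) {
  return [...s].length;
}

// A maxLength check parameterised by the counting rule.
function satisfiesMaxLength(s, maxLength, count) {
  return count(s) <= maxLength;
}

const dragon = "\u{1F432}"; // the dragon emoji, U+1F432

console.log(satisfiesMaxLength(dragon, 1, lengthInCodeUnits));  // false
console.log(satisfiesMaxLength(dragon, 1, lengthInCodePoints)); // true
```

Under the code-unit rule the same instance fails `maxLength: 1`, while under the code-point rule it passes, so two implementations reading the text differently would disagree on validity.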

@akuckartz

I agree that it makes sense to think about security aspects of Unicode. But these aspects are not specific to JSON Schema. A separate document might make sense which can be developed by a broader community (including JSON-LD supporters for example).

@brettz9

brettz9 commented Jul 26, 2017

In the case of maxLength and minLength, if one mistakenly relies on them, these are JSON Schema-specific issues. But again, I don't think the behavior is clearly spec'd.

@epoberezkin
Member

> should maxLength be 1 or 2

@brettz9 there are tests that require it's 1

@brettz9

brettz9 commented Jul 26, 2017

Sure, @epoberezkin, but the text ought to be clarified regardless.

@micolous

One should also consider combining characters (and normalisation forms) and zero-width joiner combinations (where platforms have included custom rules). Both of these will change over time with the Unicode standard.

So you could count length as:

  • bytes (which can vary depending on whether you use UTF-8, UTF-16, UTF-32, etc., as well as how the string is composed, but it's also the simplest to implement)
  • codepoints (fairly consistent, but codepoints vary in byte length and also vary depending on composition)
  • code units, often loosely called "characters" (for UTF-16, can be different from codepoints; for UTF-32, one code unit is always one codepoint)
  • graphemes (composition doesn't matter, interpretation evolves with the Unicode standard, but it's closer to the human understanding of "this is a character" in many, but not all languages)

You could hand-wave away composition and say "it will be counted based on how the data is transmitted", with the idea that a client should convert to NFC.
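
As a rough illustration of why composition matters, the sketch below compares two spellings of "é" before and after NFC normalisation. The `codePoints` helper is a hypothetical name introduced only for this example.

```javascript
// "é" written two ways: precomposed, and "e" followed by a combining acute accent.
const composed = "\u00E9";
const decomposed = "e\u0301";

// Hypothetical helper: counts code points via the string iterator.
const codePoints = (s) => [...s].length;

console.log(composed === decomposed);                      // false
console.log(codePoints(composed), codePoints(decomposed)); // 1 2

// After NFC normalisation the two spellings are identical.
const nfc = decomposed.normalize("NFC");
console.log(nfc === composed, codePoints(nfc));            // true 1
```

NFC only collapses sequences for which a precomposed code point exists, so normalising first narrows the gap between codepoint and grapheme counts without closing it.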

> there are tests that require it's 1

@epoberezkin Using that example, that might be counting in codepoints or graphemes.

A better example is: "👩🏾‍🚀" (woman emoji + skin tone + ZWJ + rocket emoji). You could count that as:

  • bytes: 15 (in UTF-8)
  • JavaScript: 7 (UTF-16 code units)
  • codepoints: 4
  • graphemes: 1
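
The four counts above can be reproduced with a short sketch. It assumes the skin tone modifier is U+1F3FE and writes the sequence with escapes so the code points are unambiguous. The grapheme count relies on `Intl.Segmenter`, which is missing in older runtimes and whose result depends on the Unicode data the runtime ships, so that value is left `undefined` when the API is unavailable.

```javascript
// Woman + skin tone modifier (U+1F3FE assumed) + zero-width joiner + rocket.
const astronaut = "\u{1F469}\u{1F3FE}\u200D\u{1F680}";

// Bytes when encoded as UTF-8.
const utf8Bytes = new TextEncoder().encode(astronaut).length;

// UTF-16 code units, as reported by String.prototype.length.
const utf16Units = astronaut.length;

// Code points, one per item yielded by the string iterator.
const codePoints = [...astronaut].length;

// Grapheme clusters, if the runtime provides Intl.Segmenter.
const graphemes =
  typeof Intl !== "undefined" && typeof Intl.Segmenter === "function"
    ? [...new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(astronaut)].length
    : undefined;

console.log({ utf8Bytes, utf16Units, codePoints, graphemes });
```

The first three numbers are fixed by the encodings, while the grapheme count follows the segmentation rules of whichever Unicode version the runtime implements.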

Which one is the correct answer for JSON Schema's minLength/maxLength?

And does everyone implement it that way? 😅

@gregsdennis
Member

The current tests check for codepoints.

RFC 8259, Section 8.3, mentions "Unicode code units". It seems reasonable to read Unicode the same way wherever else the specification mentions it.

I imagine we could make an explicit declaration in Validation, Section 4.1.

@micolous

micolous commented Mar 28, 2025

Going back and answering my own question: I had a look at the test suite, and it looks like it was implemented in terms of “supplementary Unicode code points” since json-schema-org/JSON-Schema-Test-Suite#52:

For maxLength: 2:

https://github.com/json-schema-org/JSON-Schema-Test-Suite/blob/69136952196a63a7553803935feaeaec57a48420/tests/draft4/maxLength.json#L26-L30

For minLength: 2:

https://github.com/json-schema-org/JSON-Schema-Test-Suite/blob/69136952196a63a7553803935feaeaec57a48420/tests/draft4/minLength.json#L26-L30

json-schema-org/JSON-Schema-Test-Suite#710 updated all drafts’ tests to use the word “grapheme”.

Though the “💩” in that test is a single code point as well as a single grapheme, just one that takes two UTF-16 code units to represent.
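
For readers unfamiliar with surrogate pairs, here is a small sketch of how that one code point becomes two UTF-16 code units. The variable names are illustrative only.

```javascript
const poo = "\u{1F4A9}"; // the pile-of-poo emoji used in the test

// One code point...
console.log([...poo].length);                 // 1
console.log(poo.codePointAt(0).toString(16)); // 1f4a9

// ...stored as two UTF-16 code units (a surrogate pair).
console.log(poo.length);                      // 2
console.log(poo.charCodeAt(0).toString(16), poo.charCodeAt(1).toString(16)); // d83d dca9

// The pair is derived arithmetically from the code point.
const offset = poo.codePointAt(0) - 0x10000;
const high = 0xd800 + (offset >> 10);   // top 10 bits of the offset
const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits of the offset
console.log(String.fromCharCode(high, low) === poo); // true
```

Any code point above U+FFFF is encoded this way, which is why a length measured in UTF-16 code units and one measured in code points diverge exactly for supplementary characters.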

@gregsdennis gregsdennis moved this from In Discussion to Awaiting PR in Stable Release Development Apr 26, 2025