Updates min/maxLenth tests to mention graphemes rather than unicode code points #710

Merged
Julian merged 3 commits into json-schema-org:main on Dec 29, 2023

Conversation

spacether
Contributor

@spacether spacether commented Nov 30, 2023

Updates min/maxLength tests to mention graphemes rather than Unicode code points
Updates the language for the test that verifies that one grapheme is too short when checked against minLength of 2

@spacether spacether requested a review from a team as a code owner November 30, 2023 22:37
Member

@jdesrosiers jdesrosiers left a comment

Assuming I understand the terms "code point" and "grapheme" correctly, I think this change is correct.

@spacether
Contributor Author

spacether commented Nov 30, 2023

This came up when working on my Java implementation here: https://github.com/openapi-json-schema-tools/openapi-json-schema-generator
and reading this article about grapheme length vs. code points: http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_strnlen
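
As a rough illustration of the distinction that article draws, the sketch below measures the same string in UTF-16 code units (which is what Java's `String.length()` reports) and in Unicode code points. It is written in Python rather than Java, and the helper names `utf16_code_units` and `code_points` are made up for this example.

```python
def utf16_code_units(s: str) -> int:
    """Number of UTF-16 code units, which is what Java's String.length() reports."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" avoids emitting a BOM.
    return len(s.encode("utf-16-le")) // 2


def code_points(s: str) -> int:
    """Number of Unicode code points; Python's len() counts these directly."""
    return len(s)


pile_of_poo = "\U0001F4A9"  # the supplementary-plane character used in the test

print(utf16_code_units(pile_of_poo), code_points(pile_of_poo))  # prints "2 1"
```

Neither count is a grapheme count. For this particular character the code point count and the grapheme count happen to agree.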

@gregsdennis
Member

@gregsdennis gregsdennis requested a review from Julian November 30, 2023 22:45
Member

@gregsdennis gregsdennis left a comment

I'm fine with this, but I'd like to see if @Julian has any strong feelings about this.

@Julian
Member

Julian commented Nov 30, 2023

Looks good to me too, just change it in the other drafts too

@gregsdennis
Member

> change it in the other drafts too

Oh! Please also check for maxLength tests as well.

@karenetheridge
Member

karenetheridge commented Dec 1, 2023

Maybe we should pick some characters that actually make sense -- as far as I can tell "\uD83D\uDCA9" isn't valid UTF-8.

I did some digging in the history and the test was added here: #52 - and indeed this is UTF-16, not UTF-8. We should fix that. The proper sequence for the poop emoji, that should appear in the file, is \u1F4A9 -- but I think we should pick something better -- a sequence of Unicode code points that combine together into one grapheme.
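
A minimal sketch of the kind of input suggested here, where several code points combine into one grapheme. Python's standard library has no grapheme-cluster segmenter, so the snippet only counts code points. That each string is a single grapheme rests on Unicode's extended grapheme cluster rules (UAX #29), not on anything the code checks. The sample strings are illustrative choices, not values taken from the test suite.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one grapheme.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Man, ZERO WIDTH JOINER, woman, ZERO WIDTH JOINER, girl: five code points
# that form one emoji ZWJ sequence.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(len(decomposed), len(composed), len(family))  # prints "2 1 5"
```

A test built on a string like `decomposed` or `family` would distinguish an implementation that counts code points from one that counts graphemes, which a single supplementary-plane character cannot do.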

@gregsdennis
Member

gregsdennis commented Dec 1, 2023

No, JSON requires that it be broken out into the surrogate pair.

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
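
The behavior described in that passage can be checked with a short sketch using Python's standard `json` module. The helper `describe` is a made-up name for this example. The last case shows what a parser does with a five-digit escape such as `\u1F4A9`: since `\u` takes exactly four hex digits, it is read as U+1F4A followed by a literal "9".

```python
import json


def describe(s: str) -> list[str]:
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]


# Surrogate-pair escapes decode to a single supplementary-plane code point.
poo = json.loads('"\\uD83D\\uDCA9"')
g_clef = json.loads('"\\uD834\\uDD1E"')

# A five-digit escape is not a thing in JSON: this is U+1F4A plus the digit 9.
five_digits = json.loads('"\\u1F4A9"')

print(describe(poo))          # prints "['U+1F4A9']"
print(describe(g_clef))       # prints "['U+1D11E']"
print(describe(five_digits))  # prints "['U+1F4A', 'U+0039']"
```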

@jdesrosiers
Member

@spacether Just wanted to make sure you saw this request to add this change to other drafts as well.

@spacether spacether changed the title Update minLength.json Updates min/maxLenth tests to mention graphemes rather than unicode points Dec 29, 2023
@spacether spacether changed the title Updates min/maxLenth tests to mention graphemes rather than unicode points Updates min/maxLenth tests to mention graphemes rather than unicode code points Dec 29, 2023
@Julian Julian merged commit 64d5cab into json-schema-org:main Dec 29, 2023
@spacether spacether deleted the patch-1 branch December 29, 2023 17:15