Skip to content

Update to Unicode 6.3 #10621

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 28, 2013
Merged

Update to Unicode 6.3 #10621

merged 3 commits into from
Nov 28, 2013

Conversation

Florob
Copy link
Contributor

@Florob Florob commented Nov 23, 2013

This update the unicode.rs file to the latest Unicode version released 2013-09-30.

@emberian
Copy link
Member

Minor, but while you're at it, mind updating unicode.py with the libcore->libstd changes?

@Florob
Copy link
Contributor Author

Florob commented Nov 23, 2013

@cmr Gladly, if I knew what you mean ;). Anything apart from the mention of libcore in the initial comment?

@emberian
Copy link
Member

nope, that's exactly it

On Sat, Nov 23, 2013 at 4:10 PM, Florob [email protected] wrote:

@cmr https://github.com/cmr Gladly, if I knew what you meant ;).
Anything apart from the mention of libcore in the initial comment?


Reply to this email directly or view it on GitHubhttps://github.com//pull/10621#issuecomment-29141650
.

@Florob
Copy link
Contributor Author

Florob commented Nov 24, 2013

@cmr Done, r?

@Florob
Copy link
Contributor Author

Florob commented Nov 25, 2013

So, I shamefully bow my head for not having run the full test suite.

For future reference:
What goes wrong here is that src/test/pretty/block-comment-wchar.rs uses U+180E MONGOLIAN VOWEL SEPARATOR to test folding of adjacent white space characters (@pnkfelix Do you remember why you used that particular character for this test?).
In Unicode 6.3 that character is no longer considered white space.

While looking at this I also noticed that char::is_whitespace(), as well as char::is_uppercase() and char::is_lowercase(), do not conform to the respective definition in the Unicode Standard. White_Space is not a derived property, as the current code would have you believe, this is also the reason why U+0085 NEXT LINE is currently not considered white space.

I'm going to start on a fix for the later. For the testcase I'm not entirely sure what an appropriate change would be, since I'm not certain which property of U+180E MONGOLIAN VOWEL SEPARATOR made it desirable for this test.

@pnkfelix
Copy link
Member

@Florob I think I picked out U+180E because it is (or was) a whitespace character that:

  1. uses multiple bytes in its UTF-8 representation (and thus exposes the bug described in Incorrect mixing of character and byte positions in parser #3961), and
  2. has a non-whitespace presentation in my editor of choice (emacs) and on my terminal, and thus one can directly see the different combinations explored in the test (for example the 16 cases of strings of length 2 over the alphabet { space, monoglian-vowel-sep }). And in addition, another small benefit is/was:
  3. On my Emacs with my default font, the width of monoglian-vowel-sep appears to be approximately equal to that of a space character, and so the columns happen to align nicely in the presentation.

However, if it is no longer whitespace, then obviously the above reasoning is irrelevant. The most important thing is that the characters in the supposed-whitespace should actually be whitespace.

I think U+1680 ogham space mark would then be a suitable replacement for the mongolian-vowel-separator in those tests. Property 3 above does not hold for it, but that was just a random choice; its more important to make sure 1 and 2 work.

@Florob
Copy link
Contributor Author

Florob commented Nov 26, 2013

Updated the testcase. Also fixed up char::{is_uppercase(),is_lowercase(),is_whitespace()}.
Among other thing this fixes U+0085 NEL not being recognized as whitespace.
r? @cmr @pnkfelix

@Florob
Copy link
Contributor Author

Florob commented Nov 27, 2013

Updated to avoid the closure type warning that was introduced in the meantime. Compiles, and tests fine on my system, let's hope the third time's a charm.
r? @cmr

bors added a commit that referenced this pull request Nov 28, 2013
This update the unicode.rs file to the latest Unicode version released 2013-09-30.
@bors bors closed this Nov 28, 2013
@bors bors merged commit dfe38db into rust-lang:master Nov 28, 2013
flip1995 pushed a commit to flip1995/rust that referenced this pull request Apr 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants