Update to Unicode 6.3 #10621

Florob · 2013-11-23T18:06:02Z

This updates the unicode.rs file to the latest Unicode version, released 2013-09-30.

emberian · 2013-11-23T20:44:32Z

Minor, but while you're at it, mind updating unicode.py with the libcore->libstd changes?

Florob · 2013-11-23T21:10:08Z

@cmr Gladly, if I knew what you meant ;). Anything apart from the mention of libcore in the initial comment?

emberian · 2013-11-23T21:42:31Z

nope, that's exactly it


Florob · 2013-11-24T23:22:36Z

@cmr Done, r?

Florob · 2013-11-25T13:36:15Z

So, I shamefully bow my head for not having run the full test suite.

For future reference:
What goes wrong here is that src/test/pretty/block-comment-wchar.rs uses U+180E MONGOLIAN VOWEL SEPARATOR to test folding of adjacent white space characters (@pnkfelix Do you remember why you used that particular character for this test?).
In Unicode 6.3 that character is no longer considered white space.

While looking at this I also noticed that char::is_whitespace(), as well as char::is_uppercase() and char::is_lowercase(), do not conform to the respective definitions in the Unicode Standard. White_Space is not a derived property, as the current code would have you believe. This is also the reason why U+0085 NEXT LINE is currently not considered white space.

I'm going to start on a fix for the latter. For the testcase I'm not entirely sure what an appropriate change would be, since I'm not certain which property of U+180E MONGOLIAN VOWEL SEPARATOR made it desirable for this test.
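For readers following along, here is a minimal sketch of the behaviour under discussion. It is written against a present-day Rust toolchain rather than the 2013 tree, and it assumes the standard library's Unicode tables are at version 6.3 or later. The helper name `whitespace_status` is made up for illustration and is not part of this PR.

```rust
// Illustrative helper (hypothetical, not from the PR): report how
// char::is_whitespace classifies the three characters named in this thread.
fn whitespace_status() -> [(char, bool); 3] {
    [
        ('\u{180E}', '\u{180E}'.is_whitespace()), // MONGOLIAN VOWEL SEPARATOR
        ('\u{0085}', '\u{0085}'.is_whitespace()), // NEXT LINE (NEL)
        ('\u{1680}', '\u{1680}'.is_whitespace()), // OGHAM SPACE MARK
    ]
}

fn main() {
    let table = whitespace_status();
    for (c, ws) in table.iter() {
        println!("U+{:04X} is_whitespace = {}", *c as u32, ws);
    }
}
```

With tables that follow the White_Space property as of Unicode 6.3 or later, U+180E should come out as non-whitespace, while U+0085 and U+1680 should come out as whitespace.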

pnkfelix · 2013-11-25T14:32:35Z

@Florob I think I picked out U+180E because it is (or was) a whitespace character that:

1. uses multiple bytes in its UTF-8 representation (and thus exposes the bug described in "Incorrect mixing of character and byte positions in parser", #3961), and
2. has a non-whitespace presentation in my editor of choice (emacs) and on my terminal, and thus one can directly see the different combinations explored in the test (for example the 16 cases of strings of length 2 over the alphabet { space, mongolian-vowel-sep }).

And in addition, another small benefit is/was:

3. On my Emacs with my default font, the width of mongolian-vowel-sep appears to be approximately equal to that of a space character, and so the columns happen to align nicely in the presentation.

However, if it is no longer whitespace, then obviously the above reasoning is irrelevant. The most important thing is that the characters in the supposed-whitespace should actually be whitespace.

I think U+1680 ogham space mark would then be a suitable replacement for the mongolian-vowel-separator in those tests. Property 3 above does not hold for it, but that was just a random choice; it's more important to make sure 1 and 2 work.
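To illustrate why a multi-byte whitespace character matters for property 1, here is a small sketch in present-day Rust, not the 2013 parser code. It shows character indices and byte offsets diverging once such a character appears in the input. The helper name `char_and_byte_positions` is hypothetical.

```rust
// Hypothetical helper: pair each character's index with its byte offset.
fn char_and_byte_positions(s: &str) -> Vec<(usize, usize, char)> {
    s.char_indices()
        .enumerate()
        .map(|(char_idx, (byte_idx, c))| (char_idx, byte_idx, c))
        .collect()
}

fn main() {
    // A space, then U+1680 OGHAM SPACE MARK, then a visible character.
    let s = " \u{1680}x";
    for (char_idx, byte_idx, c) in char_and_byte_positions(s) {
        println!("char {} starts at byte {}: U+{:04X}", char_idx, byte_idx, c as u32);
    }
}
```

Both U+180E and U+1680 take three bytes in UTF-8, so after either one the character index and the byte offset no longer agree. That disagreement is what a test for this kind of bug needs to provoke.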

Florob · 2013-11-26T14:58:02Z

Updated the testcase. Also fixed up char::{is_uppercase(),is_lowercase(),is_whitespace()}.
Among other things, this fixes U+0085 NEL not being recognized as whitespace.
r? @cmr @pnkfelix

Florob · 2013-11-27T23:43:48Z

Updated to avoid the closure type warning that was introduced in the meantime. It compiles and tests fine on my system; let's hope the third time's a charm.
r? @cmr


Florob added 3 commits November 27, 2013 23:21

Update unicode.py to reflect language changes

e9ab9bf

Update Unicode data to version 6.3

c234614

Fix handling of upper/lowercase, and whitespace

dfe38db

bors added a commit that referenced this pull request Nov 28, 2013

auto merge of #10621 : Florob/rust/unicode63, r=cmr

503e5df


bors closed this Nov 28, 2013

bors merged commit dfe38db into rust-lang:master Nov 28, 2013

flip1995 pushed a commit to flip1995/rust that referenced this pull request Apr 23, 2023

Auto merge of rust-lang#10621 - fee1-dead-contrib:bump_syn, r=flip1995

0d06001

bump syn to 2.0 changelog: none
