Directional confusability protection & allowing script mixing in identifiers when separated by underscores (#13693)
* Update support for UTS 39 from Unicode 15.
* Follow UTS 39's recommendation that identifiers need not be
  single-script (it calls out programming-language identifiers and
  references the new UTS 55-5). We use a heuristic derived from the
  concept of identifier chunks in UTS 55-5 to allow identifiers like
  foo_bar_baz, where each chunk around the _ can be either
  single-script or Highly Restrictive.
* Provide directional confusability detection by reversing spans of
  direction-changed characters in identifiers for bidi_skeleton;
  see issue #12929.
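
The span-reversal idea in the last bullet can be illustrated with a rough sketch. This is not the code from this commit: it is a Python approximation that treats only the bidi classes `R` and `AL` as right-to-left, ignores neutral and weak characters such as digits and `_`, and uses a hypothetical helper name, `reverse_rtl_spans`. It only shows how reversing each right-to-left run moves a string from logical order toward the order a reader would see, which is the form a confusability skeleton would then be computed on.

```python
import unicodedata
from itertools import groupby

# Assumption: only strong right-to-left classes count as "direction-changed".
RTL_CLASSES = {"R", "AL"}

def is_rtl(ch):
    """True if the character has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES

def reverse_rtl_spans(identifier):
    """Reverse each maximal run of right-to-left characters in place.

    Left-to-right runs are kept as they are, so the result approximates
    the visual order of a mostly left-to-right identifier.
    """
    out = []
    for rtl, run in groupby(identifier, key=is_rtl):
        chars = list(run)
        out.extend(reversed(chars) if rtl else chars)
    return "".join(out)
```

A purely left-to-right identifier comes back unchanged, and applying the function twice restores the original string, since each run is simply reversed in place.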
lib/elixir/pages/references/unicode-syntax.md (+3 -1)
@@ -136,10 +136,12 @@ Elixir will not warn on confusability for identifiers made up exclusively of cha
### C3. Mixed Script Detection
-Elixir will not allow tokenization of mixed-script identifiers unless the mixing is one of the exceptions defined in UTS 39 5.2, 'Highly Restrictive'. We use the means described in Section 5.1, Mixed-Script Detection, to determine if script mixing is occurring, with the modification documented in the section 'Additional Normalizations', below.
+Elixir will not allow tokenization of mixed-script identifiers unless it is via chunks separated by an underscore, like `http_сервер`, or unless the mixing within each of those chunks is one of the exceptions defined in UTS 39 5.2, 'Highly Restrictive'. We use the means described in Section 5.1, Mixed-Script Detection, to determine if script mixing is occurring, with the modification documented in the section 'Additional Normalizations', below.
Examples: Elixir allows an identifiers like `幻ㄒㄧㄤ`, even though it includes characters from multiple 'scripts', because those scripts all 'resolve' to Japanese when applying the resolution rules from UTS 39 5.1. It also allows an atom like `:Tシャツ`, the Japanese word for 't-shirt', which incorporates a Latin capital T, because {Latn, Jpan} is one of the allowed script mixing in the definition of 'Highly Restrictive' in UTS 39 5.2, and it 'covers' the string.
+Elixir does allow an identifier like `http_сервер`, where the identifier chunks on each side of the `_` are individually single-script.
+
However, Elixir would prevent tokenization in code like `if аdmin, do: :ok, else: :err`, where the scriptset for the 'a' character is {Cyrillic} but all other characters have scriptsets of {Latin}. The scriptsets fail to resolve, and the scriptsets from the definition of 'Highly Restrictive' in UTS 39 5.2 do not cover the string either, so a descriptive error is shown.
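
The chunk rule described in this diff can be sketched as follows. This is a simplified Python illustration, not Elixir's tokenizer: it guesses a character's script from the first word of its Unicode name instead of using the UTS 39 script sets, it skips digits, and it does not implement the 'Highly Restrictive' exceptions, so it would wrongly reject an identifier like `Tシャツ`. The helper names (`rough_script`, `chunks_are_single_script`) are hypothetical.

```python
import unicodedata

def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name.

    Assumption: this stands in for real UTS 39 script-set resolution.
    Digits are treated as script-neutral and return None.
    """
    if ch.isdigit():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]  # e.g. "LATIN", "CYRILLIC"

def chunk_scripts(chunk):
    """Collect the approximate scripts used within one identifier chunk."""
    return {s for s in map(rough_script, chunk) if s is not None}

def chunks_are_single_script(identifier):
    """True if every underscore-separated chunk uses at most one script."""
    return all(len(chunk_scripts(c)) <= 1 for c in identifier.split("_"))
```

Under these assumptions, `http_сервер` passes because the Latin and Cyrillic characters fall in separate chunks, while `аdmin` with a Cyrillic first letter fails because both scripts occur within a single chunk.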