Skip to content

Commit c83334e

Browse files
authored
Directional confusability protection & allowing script mixing in identifiers when separated by underscores (#13693)
* update support for uts39 from unicode 15 * follow uts39's recco that it's not necessary to require idents to be single-script (they call out proglang idents, reference the new uts55-5). We use a heuristic derived from the concept of identifier chunks from uts55-5, to allow idents like foo_bar_baz where each chunk around the _ can be single-or-highly-restrictive * provide directional confusability detection, by reversing spans of direction-changed chars in idents for bidi_skeleton, see issue #12929
1 parent bd99dcc commit c83334e

File tree

5 files changed

+214
-20
lines changed

5 files changed

+214
-20
lines changed

Makefile

+1-1
Original file line numberDiff line numberDiff line change
@@ -104,8 +104,8 @@ unicode: $(UNICODE)
104104
$(UNICODE): lib/elixir/unicode/*
105105
@ echo "==> unicode (compile)";
106106
$(Q) $(ELIXIRC) lib/elixir/unicode/unicode.ex -o lib/elixir/ebin;
107-
$(Q) $(ELIXIRC) lib/elixir/unicode/security.ex -o lib/elixir/ebin;
108107
$(Q) $(ELIXIRC) lib/elixir/unicode/tokenizer.ex -o lib/elixir/ebin;
108+
$(Q) $(ELIXIRC) lib/elixir/unicode/security.ex -o lib/elixir/ebin;
109109

110110
$(eval $(call APP_TEMPLATE,ex_unit,ExUnit))
111111
$(eval $(call APP_TEMPLATE,logger,Logger))

lib/elixir/pages/references/unicode-syntax.md

+3-1
Original file line numberDiff line numberDiff line change
@@ -136,10 +136,12 @@ Elixir will not warn on confusability for identifiers made up exclusively of cha
136136

137137
### C3. Mixed Script Detection
138138

139-
Elixir will not allow tokenization of mixed-script identifiers unless the mixing is one of the exceptions defined in UTS 39 5.2, 'Highly Restrictive'. We use the means described in Section 5.1, Mixed-Script Detection, to determine if script mixing is occurring, with the modification documented in the section 'Additional Normalizations', below.
139+
Elixir will not allow tokenization of mixed-script identifiers unless it is via chunks separated by an underscore, like `http_сервер`, or unless the mixing within each of those chunks is one of the exceptions defined in UTS 39 5.2, 'Highly Restrictive'. We use the means described in Section 5.1, Mixed-Script Detection, to determine if script mixing is occurring, with the modification documented in the section 'Additional Normalizations', below.
140140

141141
Examples: Elixir allows an identifiers like `幻ㄒㄧㄤ`, even though it includes characters from multiple 'scripts', because those scripts all 'resolve' to Japanese when applying the resolution rules from UTS 39 5.1. It also allows an atom like `:Tシャツ`, the Japanese word for 't-shirt', which incorporates a Latin capital T, because {Latn, Jpan} is one of the allowed script mixing in the definition of 'Highly Restrictive' in UTS 39 5.2, and it 'covers' the string.
142142

143+
Elixir does allow an identifier like `http_сервер`, where the identifier chunks on each side of the `_` are individually single-script.
144+
143145
However, Elixir would prevent tokenization in code like `if аdmin, do: :ok, else: :err`, where the scriptset for the 'a' character is {Cyrillic} but all other characters have scriptsets of {Latin}. The scriptsets fail to resolve, and the scriptsets from the definition of 'Highly Restrictive' in UTS 39 5.2 do not cover the string either, so a descriptive error is shown.
144146

145147
### C4, C5 (inapplicable)

lib/elixir/test/elixir/kernel/warning_test.exs

+71-1
Original file line numberDiff line numberDiff line change
@@ -139,6 +139,70 @@ defmodule Kernel.WarningTest do
139139
"confusable identifier: 'a' looks like 'а' on line 1"
140140
end
141141

142+
test "warns on LTR-confusables" do
143+
# warning outputs in byte order (vs bidi algo display order, uax9), mentions presence of rtl
144+
assert_warn_quoted(
145+
["nofile:1:9", "'_1א' looks like '_א1'", "right-to-left characters"],
146+
"_א1 and _1א"
147+
)
148+
149+
assert_warn_quoted(
150+
[
151+
"'a_1א' includes right-to-left characters",
152+
"\\u0061 a ltr",
153+
"\\u005F _ neutral",
154+
"\\u0031 1 weak_number",
155+
"\\u05D0 א rtl",
156+
"'a_א1' includes right-to-left characters:",
157+
"\\u0061 a ltr",
158+
"\\u005F _ neutral",
159+
"\\u05D0 א rtl",
160+
"\\u0031 1 weak_number"
161+
],
162+
"a_א1 or a_1א"
163+
)
164+
165+
# test that the implementation of String.Tokenizer.Security.unbidify/1 agrees
166+
# w/Unicode Bidi Algo (UAX9) for these (identifier-specific, no-bracket) examples
167+
#
168+
# you can create new examples with: https://util.unicode.org/UnicodeJsps/bidic.jsp?s=foo_%D9%84%D8%A7%D9%85%D8%AF%D8%A7_baz&b=0&u=140&d=2
169+
# inspired by (none of these are directly usable for our idents): https://www.unicode.org/Public/UCD/latest/ucd/BidiCharacterTest.txt
170+
#
171+
# there's a spurious ;A; after the identifier, because the semicolon is dir-neutral, and
172+
# deleting it makes these examples hard to read in many/most editors!
173+
"""
174+
foo;A;0066 006F 006F;0 1 2
175+
_foo_ ;A;005F 0066 006F 006F 005F;0 1 2 3 4
176+
__foo__ ;A;005F 005F 0066 006F 006F 005F 005F;0 1 2 3 4 5 6
177+
لامدا_foo ;A;0644 0627 0645 062F 0627 005F 0066 006F 006F;4 3 2 1 0 5 6 7 8
178+
foo_لامدا_baz ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 0062 0061 007A;0 1 2 3 8 7 6 5 4 9 10 11 12
179+
foo_لامدا ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627;0 1 2 3 8 7 6 5 4
180+
foo_لامدا1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 0031;0 1 2 3 9 8 7 6 5 4
181+
foo_لامدا_حدد ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F;0 1 2 3 12 11 10 9 8 7 6 5 4
182+
foo_لامدا_حدد1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031;0 1 2 3 13 12 11 10 9 8 7 6 5 4
183+
foo_لامدا_حدد1_bar ;A; 0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031 005F 0062 0061 0072;0 1 2 3 13 12 11 10 9 8 7 6 5 4 14 15 16 17
184+
foo_لامدا_حدد1_bar1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031 005F 0062 0061 0072 0031;0 1 2 3 13 12 11 10 9 8 7 6 5 4 14 15 16 17 18
185+
"""
186+
|> String.split("\n", trim: true)
187+
|> Enum.map(&String.split(&1, ";", trim: true))
188+
|> Enum.each(fn
189+
[ident, _, bytes, indices | _rest] ->
190+
bytes = String.split(bytes, " ", trim: true) |> Enum.map(&String.to_integer(&1, 16))
191+
indices = String.split(indices, " ", trim: true) |> Enum.map(&String.to_integer/1)
192+
display_ordered = for i <- indices, do: Enum.at(bytes, i)
193+
unbidified = String.Tokenizer.Security.unbidify(bytes)
194+
195+
assert(display_ordered == unbidified, """
196+
Failing String.Tokenizer.Security.unbidify/1 case for: '#{ident}'
197+
bytes : #{bytes |> Enum.map(&Integer.to_string(&1, 16)) |> Enum.join(" ")}
198+
byte order : #{bytes |> Enum.intersperse(32)}
199+
uax9 order : #{display_ordered |> Enum.intersperse(32)}
200+
uax9 indices : #{indices |> Enum.join(" ")}
201+
unbidify/1 : #{unbidified |> Enum.intersperse(32)}
202+
""")
203+
end)
204+
end
205+
142206
test "prevents unsafe script mixing in identifiers" do
143207
exception =
144208
assert_raise SyntaxError, fn ->
@@ -172,7 +236,9 @@ defmodule Kernel.WarningTest do
172236
assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("[аdmin: 1]") end
173237
assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("[{:аdmin, 1}]") end
174238
assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("quote do: аdmin(1)") end
175-
assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("рос_api = 1") end
239+
240+
# c is Latin
241+
assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("http_cервер = 1") end
176242

177243
# T is in cyrillic
178244
assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("[Тシャツ: 1]") end
@@ -190,6 +256,10 @@ defmodule Kernel.WarningTest do
190256
# elixir's normalizations combine scriptsets of the 'from' and 'to' characters,
191257
# ex: {Common} MICRO => {Greek} MU == {Common, Greek}; Common intersects w/all
192258
assert capture_quoted("μs") == ""
259+
260+
# allows mixed scripts if the chunks are all single-script or highly restrictive
261+
assert capture_eval("http_сервер = 1") == ""
262+
assert capture_eval("сервер_http = 1") == ""
193263
end
194264
end
195265

lib/elixir/unicode/security.ex

+81-2
Original file line numberDiff line numberDiff line change
@@ -53,7 +53,7 @@ defmodule String.Tokenizer.Security do
5353
{line, _, previous_name} when name != previous_name ->
5454
{:warn,
5555
"confusable identifier: '#{name}' looks like '#{previous_name}' on line #{line}, " <>
56-
"but they are written using different characters"}
56+
"but they are written using different characters" <> dir_compare(name, previous_name)}
5757

5858
_ ->
5959
{:ok, Map.put(skeletons, skeleton, info)}
@@ -106,7 +106,86 @@ defmodule String.Tokenizer.Security do
106106
# the specified data, producing a string of exemplar characters.
107107
# - Reapply NFD." (UTS 39 section 4, skeleton definition)
108108
:unicode.characters_to_nfd_list(s)
109-
|> Enum.map(&confusable_prototype/1)
109+
|> bidi_skeleton()
110110
|> :unicode.characters_to_nfd_list()
111111
end
112+
113+
# unicode 15 adds bidiSkeleton because, w/RTL codepoints, idents that
114+
# aren't confusable LTR *are* confusable in most places human review
115+
# occurs (editors/browsers, thanks to bidi algo, UAX9).
116+
#
117+
# The solution is to detect spans with reversed visual direction,
118+
# and reverse those, so that the input we check for confusability
119+
# matches the perceived sequence instead of the byte sequence.
120+
#
121+
# (we need this regardless of script mixing, because direction-neutral
122+
# chars like _ or 0..9 can mix w/RTL chars).
123+
def bidi_skeleton(s) do
124+
# UTS39-28 4:
125+
# 'Bidirectional confusability is costlier to check than
126+
# confusability, as [unicode bidi algo] must be applied.
127+
# [...] a fast path can be used: [...] if X has no characters
128+
# w/bidi classes R or AL, bidiSkeleton(X) = skeleton(X)
129+
#
130+
if match?([_, _ | _], s) and any_rtl?(s) do
131+
unbidify(s) |> Enum.map(&confusable_prototype/1)
132+
else
133+
Enum.map(s, &confusable_prototype/1)
134+
end
135+
end
136+
137+
import String.Tokenizer, only: [dir: 1]
138+
139+
defp any_rtl?(s), do: Enum.any?(s, &(:rtl == dir(&1)))
140+
141+
defp dir_compare(a, b) do
142+
"""
143+
#{if any_rtl?(a), do: "\n\n" <> dir_breakdown(a)}
144+
#{if any_rtl?(b), do: dir_breakdown(b)}
145+
"""
146+
end
147+
148+
defp dir_breakdown(s) do
149+
init = "'#{s}' includes right-to-left characters:\n"
150+
151+
for codepoint <- s, into: init do
152+
hex = :io_lib.format(~c"~4.16.0B", [codepoint])
153+
" \\u#{hex} #{[codepoint]} #{dir(codepoint)}\n"
154+
end
155+
end
156+
157+
# make charlist match visual order by reversing spans of {rtl, neutral}
158+
# and attaching neutral characters and weak number types according to uax9
159+
#
160+
# UTS39-28 4: '[...] if the strings are known not to contain explicit
161+
# directional formatting characters[...], the algorithm can
162+
# be drastically simplified, [...], obviating the need for
163+
# the [...] stack of the [unicode bidi algo]'
164+
def unbidify(chars) when is_list(chars) do
165+
{neutrals, direction, last_part, acc} =
166+
chars
167+
|> Enum.map(&{&1, dir(&1)})
168+
|> Enum.reduce({[], :ltr, [], []}, fn
169+
# https://www.unicode.org/reports/tr9/#W2
170+
{head, :weak_number}, {neutrals, part_dir, part, acc} ->
171+
{[], part_dir, [head] ++ neutrals ++ part, acc}
172+
173+
{head, :neutral}, {neutrals, part_dir, part, acc} ->
174+
{[head | neutrals], part_dir, part, acc}
175+
176+
{head, part_dir}, {neutrals, part_dir, part, acc} ->
177+
{[], part_dir, [head | neutrals] ++ part, acc}
178+
179+
{head, :ltr = head_dir}, {neutrals, :rtl, part, acc} ->
180+
{[], head_dir, [head | neutrals], maybe_reverse(:rtl, part) ++ acc}
181+
182+
{head, :rtl = head_dir}, {neutrals, :ltr, part, acc} ->
183+
{[], head_dir, [head], maybe_reverse(:ltr, neutrals ++ part) ++ acc}
184+
end)
185+
186+
Enum.reverse(maybe_reverse(direction, neutrals ++ last_part) ++ acc)
187+
end
188+
189+
defp maybe_reverse(:rtl, part), do: Enum.reverse(part)
190+
defp maybe_reverse(:ltr, part), do: part
112191
end

lib/elixir/unicode/tokenizer.ex

+58-15
Original file line numberDiff line numberDiff line change
@@ -28,15 +28,19 @@ defmodule String.Tokenizer do
2828
end
2929
end
3030

31-
{letter_uptitlecase, start, continue, _} =
31+
{letter_uptitlecase, start, continue, dir_rtls, dir_neutrals, _} =
3232
Path.join(__DIR__, "UnicodeData.txt")
3333
|> File.read!()
3434
|> String.split(["\r\n", "\n"], trim: true)
35-
|> Enum.reduce({[], [], [], nil}, fn line, acc ->
36-
{letter_uptitlecase, start, continue, first} = acc
35+
|> Enum.reduce({[], [], [], [], [], nil}, fn line, acc ->
36+
{letter_uptitlecase, start, continue, rtls, neutrals, first} = acc
37+
38+
# https://www.unicode.org/reports/tr44/tr44-32.html#UnicodeData.txt
3739
[codepoint, line] = :binary.split(line, ";")
3840
[name, line] = :binary.split(line, ";")
39-
[category, _] = :binary.split(line, ";")
41+
[category, line] = :binary.split(line, ";")
42+
[_canonical_combining, line] = :binary.split(line, ";")
43+
[bidi, _] = :binary.split(line, ";")
4044

4145
{codepoints, first} =
4246
case name do
@@ -52,18 +56,25 @@ defmodule String.Tokenizer do
5256
{[String.to_integer(codepoint, 16)], nil}
5357
end
5458

59+
{rtls, neutrals} =
60+
cond do
61+
bidi in ~w(R AL)s -> {codepoints ++ rtls, neutrals}
62+
bidi in ~w(WS ON CS EN ES ET NSM)s -> {rtls, codepoints ++ neutrals}
63+
true -> {rtls, neutrals}
64+
end
65+
5566
cond do
5667
category in ~w(Lu Lt) ->
57-
{codepoints ++ letter_uptitlecase, start, continue, first}
68+
{codepoints ++ letter_uptitlecase, start, continue, rtls, neutrals, first}
5869

5970
category in ~w(Ll Lm Lo Nl) ->
60-
{letter_uptitlecase, codepoints ++ start, continue, first}
71+
{letter_uptitlecase, codepoints ++ start, continue, rtls, neutrals, first}
6172

6273
category in ~w(Mn Mc Nd Pc) ->
63-
{letter_uptitlecase, start, codepoints ++ continue, first}
74+
{letter_uptitlecase, start, codepoints ++ continue, rtls, neutrals, first}
6475

6576
true ->
66-
{letter_uptitlecase, start, continue, first}
77+
{letter_uptitlecase, start, continue, rtls, neutrals, first}
6778
end
6879
end)
6980

@@ -327,6 +338,28 @@ defmodule String.Tokenizer do
327338

328339
defp unicode_continue(_), do: @bottom
329340

341+
# subset of direction-changing/neutral characters valid in idents
342+
id_all = id_upper ++ id_start ++ id_continue
343+
dir_rtls = for c <- dir_rtls, c in id_all, do: {c, :rtl}
344+
dir_neutrals = for c <- dir_neutrals, c not in 48..57, c in id_all, do: {c, :neutral}
345+
dir_ranges = rangify.(dir_rtls) ++ rangify.(dir_neutrals)
346+
347+
# direction of a codepoint. (rtl, neutral, weak, ltr fallback)
348+
# weaks are pulled towards previous directional spans,
349+
# but the only weaks allowed in idents are numbers 0..9
350+
def dir(i) when i in 48..57, do: :weak_number
351+
352+
for {first, last, direction} <- dir_ranges do
353+
if first == last do
354+
def dir(unquote(first)), do: unquote(direction)
355+
else
356+
def dir(i) when i in unquote(first)..unquote(last), do: unquote(direction)
357+
end
358+
end
359+
360+
def dir(i) when is_integer(i), do: :ltr
361+
def dir(_), do: {:error, :codepoint_must_be_integer}
362+
330363
# Hard-coded normalizations. Also split by upper, start, continue.
331364

332365
for {from, to} <- start_normalizations do
@@ -448,7 +481,7 @@ defmodule String.Tokenizer do
448481
[:nfkc | List.delete(special, :nfkc)]
449482
end
450483

451-
if scriptset != @bottom or highly_restrictive?(acc) do
484+
if scriptset != @bottom or chunks_single_or_highly_restrictive?(acc) do
452485
{kind, acc, rest, length, false, special}
453486
else
454487
breakdown =
@@ -477,24 +510,34 @@ defmodule String.Tokenizer do
477510
Mixed-script identifiers are not supported for security reasons. \
478511
'#{acc}' is made of the following scripts:\n
479512
#{breakdown}
480-
All characters in the identifier should resolve to a single script, \
481-
or use a highly restrictive set of scripts.
513+
All characters in identifier chunks should resolve to a single script, \
514+
or a highly restrictive set of scripts.
482515
"""
483516

484517
{:error, {:not_highly_restrictive, acc, {prefix, suffix}}}
485518
end
486519
end
487520

488-
defp highly_restrictive?(acc) do
521+
defp chunks_single_or_highly_restrictive?(acc) do
522+
# support script mixing via chunked identifiers (UTS 55-5's strong recco)
523+
# each chunk in an ident like foo_bar_baz should pass checks
524+
acc
525+
|> :string.tokens([?_])
526+
|> Enum.all?(&single_or_highly_restrictive?/1)
527+
end
528+
529+
defp single_or_highly_restrictive?(acc) do
489530
scriptsets = Enum.map(acc, &codepoint_to_scriptset/1)
531+
is_single_script = @bottom != Enum.reduce(scriptsets, @top, &ss_intersect/2)
490532

491533
# 'A set of scripts is defined to cover a string if the intersection of
492534
# that set with the augmented script sets of all characters in the string
493535
# is nonempty; in other words, if every character in the string shares at
494536
# least one script with the cover set.'
495-
Enum.any?(@highly_restrictive, fn restrictive ->
496-
Enum.all?(scriptsets, &(ss_intersect(&1, restrictive) != @bottom))
497-
end)
537+
is_single_script or
538+
Enum.any?(@highly_restrictive, fn restrictive ->
539+
Enum.all?(scriptsets, &(ss_intersect(&1, restrictive) != @bottom))
540+
end)
498541
end
499542

500543
defp codepoint_to_scriptset(head) do

0 commit comments

Comments
 (0)