Draft of directional confusability protection & allowing script mixing in identifiers when separated by underscores #13693

Merged
2 changes: 1 addition & 1 deletion Makefile
@@ -104,8 +104,8 @@ unicode: $(UNICODE)
$(UNICODE): lib/elixir/unicode/*
@ echo "==> unicode (compile)";
$(Q) $(ELIXIRC) lib/elixir/unicode/unicode.ex -o lib/elixir/ebin;
-	$(Q) $(ELIXIRC) lib/elixir/unicode/security.ex -o lib/elixir/ebin;
	$(Q) $(ELIXIRC) lib/elixir/unicode/tokenizer.ex -o lib/elixir/ebin;
+	$(Q) $(ELIXIRC) lib/elixir/unicode/security.ex -o lib/elixir/ebin;

$(eval $(call APP_TEMPLATE,ex_unit,ExUnit))
$(eval $(call APP_TEMPLATE,logger,Logger))
4 changes: 3 additions & 1 deletion lib/elixir/pages/references/unicode-syntax.md
@@ -136,10 +136,12 @@ Elixir will not warn on confusability for identifiers made up exclusively of characters

### C3. Mixed Script Detection

- Elixir will not allow tokenization of mixed-script identifiers unless the mixing is one of the exceptions defined in UTS 39 5.2, 'Highly Restrictive'. We use the means described in Section 5.1, Mixed-Script Detection, to determine if script mixing is occurring, with the modification documented in the section 'Additional Normalizations', below.
+ Elixir will not allow tokenization of mixed-script identifiers unless it is via chunks separated by an underscore, like `http_сервер`, or unless the mixing within each of those chunks is one of the exceptions defined in UTS 39 5.2, 'Highly Restrictive'. We use the means described in Section 5.1, Mixed-Script Detection, to determine if script mixing is occurring, with the modification documented in the section 'Additional Normalizations', below.

Examples: Elixir allows an identifier like `幻ㄒㄧㄤ`, even though it includes characters from multiple 'scripts', because those scripts all 'resolve' to Japanese when applying the resolution rules from UTS 39 5.1. It also allows an atom like `:Tシャツ`, the Japanese word for 't-shirt', which incorporates a Latin capital T, because {Latn, Jpan} is one of the allowed script combinations in the definition of 'Highly Restrictive' in UTS 39 5.2, and it 'covers' the string.

+ Elixir does allow an identifier like `http_сервер`, where the identifier chunks on each side of the `_` are individually single-script.

However, Elixir would prevent tokenization in code like `if аdmin, do: :ok, else: :err`, where the scriptset for the 'a' character is {Cyrillic} but all other characters have scriptsets of {Latin}. The scriptsets fail to resolve, and the scriptsets from the definition of 'Highly Restrictive' in UTS 39 5.2 do not cover the string either, so a descriptive error is shown.
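
As a quick illustration, here is a minimal sketch of both behaviors, assuming this patch is applied; `Code.string_to_quoted/1` returns an error tuple rather than raising:

    # chunks separated by '_' may each use a different (single) script
    {:ok, _ast} = Code.string_to_quoted("http_сервер = 1")

    # a chunk that mixes scripts internally ('а' is Cyrillic, 'dmin' is Latin)
    # fails to tokenize with a descriptive mixed-script error
    {:error, _reason} = Code.string_to_quoted("if аdmin, do: :ok, else: :err")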

### C4, C5 (inapplicable)
72 changes: 71 additions & 1 deletion lib/elixir/test/elixir/kernel/warning_test.exs
@@ -139,6 +139,70 @@ defmodule Kernel.WarningTest do
"confusable identifier: 'a' looks like 'а' on line 1"
end

test "warns on LTR-confusables" do
# warning outputs in byte order (vs bidi algo display order, uax9), mentions presence of rtl
assert_warn_quoted(
["nofile:1:9", "'_1א' looks like '_א1'", "right-to-left characters"],
"_א1 and _1א"
)

assert_warn_quoted(
[
"'a_1א' includes right-to-left characters",
"\\u0061 a ltr",
"\\u005F _ neutral",
"\\u0031 1 weak_number",
"\\u05D0 א rtl",
"'a_א1' includes right-to-left characters:",
"\\u0061 a ltr",
"\\u005F _ neutral",
"\\u05D0 א rtl",
"\\u0031 1 weak_number"
],
"a_א1 or a_1א"
)

# test that the implementation of String.Tokenizer.Security.unbidify/1 agrees
# w/Unicode Bidi Algo (UAX9) for these (identifier-specific, no-bracket) examples
#
# you can create new examples with: https://util.unicode.org/UnicodeJsps/bidic.jsp?s=foo_%D9%84%D8%A7%D9%85%D8%AF%D8%A7_baz&b=0&u=140&d=2
# inspired by (none of these are directly usable for our idents): https://www.unicode.org/Public/UCD/latest/ucd/BidiCharacterTest.txt
#
# there's a spurious ;A; after the identifier, because the semicolon is dir-neutral, and
# deleting it makes these examples hard to read in many/most editors!
"""
foo;A;0066 006F 006F;0 1 2
_foo_ ;A;005F 0066 006F 006F 005F;0 1 2 3 4
__foo__ ;A;005F 005F 0066 006F 006F 005F 005F;0 1 2 3 4 5 6
لامدا_foo ;A;0644 0627 0645 062F 0627 005F 0066 006F 006F;4 3 2 1 0 5 6 7 8
foo_لامدا_baz ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 0062 0061 007A;0 1 2 3 8 7 6 5 4 9 10 11 12
foo_لامدا ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627;0 1 2 3 8 7 6 5 4
foo_لامدا1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 0031;0 1 2 3 9 8 7 6 5 4
foo_لامدا_حدد ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F;0 1 2 3 12 11 10 9 8 7 6 5 4
foo_لامدا_حدد1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031;0 1 2 3 13 12 11 10 9 8 7 6 5 4
foo_لامدا_حدد1_bar ;A; 0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031 005F 0062 0061 0072;0 1 2 3 13 12 11 10 9 8 7 6 5 4 14 15 16 17
foo_لامدا_حدد1_bar1 ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627 005F 062D 062F 062F 0031 005F 0062 0061 0072 0031;0 1 2 3 13 12 11 10 9 8 7 6 5 4 14 15 16 17 18
Comment on lines +165 to +184, from @mrluc (Contributor, Author), Jul 2, 2024:

@josevalim I added these example-based tests, which fed back into the implementation by exercising things I hadn't thought through, like 'which span a neutral character or a number will attach to'.

Unicode has a great Bidi Algo testing tool, which will spit out the list of rearranged indices.

Let me know if we don't like this format for these tests, and if so what kind of format would be better; in the absence of anything else, I liked BidiCharacterTest.txt's approach of "each test case has a list of codepoints, and a list of indices that they should be reordered to".
"""
|> String.split("\n", trim: true)
|> Enum.map(&String.split(&1, ";", trim: true))
|> Enum.each(fn
[ident, _, bytes, indices | _rest] ->
bytes = String.split(bytes, " ", trim: true) |> Enum.map(&String.to_integer(&1, 16))
indices = String.split(indices, " ", trim: true) |> Enum.map(&String.to_integer/1)
display_ordered = for i <- indices, do: Enum.at(bytes, i)
unbidified = String.Tokenizer.Security.unbidify(bytes)

assert(display_ordered == unbidified, """
Failing String.Tokenizer.Security.unbidify/1 case for: '#{ident}'
bytes : #{bytes |> Enum.map(&Integer.to_string(&1, 16)) |> Enum.join(" ")}
byte order : #{bytes |> Enum.intersperse(32)}
uax9 order : #{display_ordered |> Enum.intersperse(32)}
uax9 indices : #{indices |> Enum.join(" ")}
unbidify/1 : #{unbidified |> Enum.intersperse(32)}
""")
end)
end
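
# Reading one vector from the heredoc above, e.g.
#   foo_لامدا ;A;0066 006F 006F 005F 0644 0627 0645 062F 0627;0 1 2 3 8 7 6 5 4
# field one is the identifier, field three its codepoints in byte order, and
# field four the display-order indices: "foo_" keeps positions 0 1 2 3 while
# the Arabic span renders reversed (8 7 6 5 4); unbidify/1 is expected to
# reproduce exactly that display order.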

test "prevents unsafe script mixing in identifiers" do
exception =
assert_raise SyntaxError, fn ->
@@ -172,7 +236,9 @@ defmodule Kernel.WarningTest do
assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("[аdmin: 1]") end
assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("[{:аdmin, 1}]") end
assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("quote do: аdmin(1)") end
- assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("рос_api = 1") end

+ # c is Latin
+ assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("http_cервер = 1") end

# T is in cyrillic
assert_raise SyntaxError, ~r/mixed/, fn -> Code.string_to_quoted!("[Тシャツ: 1]") end
@@ -190,6 +256,10 @@ defmodule Kernel.WarningTest do
# elixir's normalizations combine scriptsets of the 'from' and 'to' characters,
# ex: {Common} MICRO => {Greek} MU == {Common, Greek}; Common intersects w/all
assert capture_quoted("μs") == ""

# allows mixed scripts if the chunks are all single-script or highly restrictive
assert capture_eval("http_сервер = 1") == ""
assert capture_eval("сервер_http = 1") == ""
end
end

83 changes: 81 additions & 2 deletions lib/elixir/unicode/security.ex
@@ -53,7 +53,7 @@ defmodule String.Tokenizer.Security do
{line, _, previous_name} when name != previous_name ->
{:warn,
"confusable identifier: '#{name}' looks like '#{previous_name}' on line #{line}, " <>
"but they are written using different characters"}
"but they are written using different characters" <> dir_compare(name, previous_name)}

_ ->
{:ok, Map.put(skeletons, skeleton, info)}
@@ -106,7 +106,86 @@ defmodule String.Tokenizer.Security do
# the specified data, producing a string of exemplar characters.
# - Reapply NFD." (UTS 39 section 4, skeleton definition)
:unicode.characters_to_nfd_list(s)
- |> Enum.map(&confusable_prototype/1)
+ |> bidi_skeleton()
+ |> :unicode.characters_to_nfd_list()
end

# unicode 15 adds bidiSkeleton because, w/RTL codepoints, idents that
# aren't confusable LTR *are* confusable in most places human review
# occurs (editors/browsers, thanks to bidi algo, UAX9).
#
# The solution is to detect spans with reversed visual direction,
# and reverse those, so that the input we check for confusability
# matches the perceived sequence instead of the byte sequence.
#
# (we need this regardless of script mixing, because direction-neutral
# chars like _ or 0..9 can mix w/RTL chars).
def bidi_skeleton(s) do
# UTS39-28 4:
# 'Bidirectional confusability is costlier to check than
# confusability, as [unicode bidi algo] must be applied.
# [...] a fast path can be used: [...] if X has no characters
# w/bidi classes R or AL, bidiSkeleton(X) = skeleton(X)'
#
if match?([_, _ | _], s) and any_rtl?(s) do
unbidify(s) |> Enum.map(&confusable_prototype/1)
else
Enum.map(s, &confusable_prototype/1)
end
end
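
# Illustrative: '_א1' and '_1א' differ in byte order but render identically
# under the bidi algorithm, so unbidify/1 maps both to the same display
# order and they share a skeleton; that shared skeleton is what yields the
# "'_1א' looks like '_א1'" warning exercised in warning_test.exs.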

import String.Tokenizer, only: [dir: 1]

defp any_rtl?(s), do: Enum.any?(s, &(:rtl == dir(&1)))

defp dir_compare(a, b) do
"""
#{if any_rtl?(a), do: "\n\n" <> dir_breakdown(a)}
#{if any_rtl?(b), do: dir_breakdown(b)}
"""
end

defp dir_breakdown(s) do
init = "'#{s}' includes right-to-left characters:\n"

for codepoint <- s, into: init do
hex = :io_lib.format(~c"~4.16.0B", [codepoint])
" \\u#{hex} #{[codepoint]} #{dir(codepoint)}\n"
end
end
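
# Example (mirrors the expected warning text in warning_test.exs): for
# 'a_א1', dir_breakdown/1 produces:
#   'a_א1' includes right-to-left characters:
#     \u0061 a ltr
#     \u005F _ neutral
#     \u05D0 א rtl
#     \u0031 1 weak_number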

# make charlist match visual order by reversing spans of {rtl, neutral}
# and attaching neutral characters and weak number types according to uax9
#
# UTS39-28 4: '[...] if the strings are known not to contain explicit
# directional formatting characters[...], the algorithm can
# be drastically simplified, [...], obviating the need for
# the [...] stack of the [unicode bidi algo]'
def unbidify(chars) when is_list(chars) do
{neutrals, direction, last_part, acc} =
chars
|> Enum.map(&{&1, dir(&1)})
|> Enum.reduce({[], :ltr, [], []}, fn
# https://www.unicode.org/reports/tr9/#W2
{head, :weak_number}, {neutrals, part_dir, part, acc} ->
{[], part_dir, [head] ++ neutrals ++ part, acc}

{head, :neutral}, {neutrals, part_dir, part, acc} ->
{[head | neutrals], part_dir, part, acc}

{head, part_dir}, {neutrals, part_dir, part, acc} ->
{[], part_dir, [head | neutrals] ++ part, acc}

{head, :ltr = head_dir}, {neutrals, :rtl, part, acc} ->
{[], head_dir, [head | neutrals], maybe_reverse(:rtl, part) ++ acc}

{head, :rtl = head_dir}, {neutrals, :ltr, part, acc} ->
{[], head_dir, [head], maybe_reverse(:ltr, neutrals ++ part) ++ acc}
end)

Enum.reverse(maybe_reverse(direction, neutrals ++ last_part) ++ acc)
end

defp maybe_reverse(:rtl, part), do: Enum.reverse(part)
defp maybe_reverse(:ltr, part), do: part
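
# Worked example (from the UAX9 vectors in warning_test.exs): for
# "foo_لامدا", unbidify/1 turns byte order 0..8 into display order
# 0 1 2 3 8 7 6 5 4, i.e. "foo_" is untouched (the neutral '_' attaches
# to the preceding LTR run) and the Arabic span is reversed.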
end
73 changes: 58 additions & 15 deletions lib/elixir/unicode/tokenizer.ex
@@ -28,15 +28,19 @@ defmodule String.Tokenizer do
end
end

- {letter_uptitlecase, start, continue, _} =
+ {letter_uptitlecase, start, continue, dir_rtls, dir_neutrals, _} =
Path.join(__DIR__, "UnicodeData.txt")
|> File.read!()
|> String.split(["\r\n", "\n"], trim: true)
- |> Enum.reduce({[], [], [], nil}, fn line, acc ->
- {letter_uptitlecase, start, continue, first} = acc
+ |> Enum.reduce({[], [], [], [], [], nil}, fn line, acc ->
+ {letter_uptitlecase, start, continue, rtls, neutrals, first} = acc

# https://www.unicode.org/reports/tr44/tr44-32.html#UnicodeData.txt
[codepoint, line] = :binary.split(line, ";")
[name, line] = :binary.split(line, ";")
- [category, _] = :binary.split(line, ";")
+ [category, line] = :binary.split(line, ";")
+ [_canonical_combining, line] = :binary.split(line, ";")
+ [bidi, _] = :binary.split(line, ";")

{codepoints, first} =
case name do
@@ -52,18 +56,25 @@ defmodule String.Tokenizer do
{[String.to_integer(codepoint, 16)], nil}
end

{rtls, neutrals} =
cond do
bidi in ~w(R AL)s -> {codepoints ++ rtls, neutrals}
bidi in ~w(WS ON CS EN ES ET NSM)s -> {rtls, codepoints ++ neutrals}
true -> {rtls, neutrals}
end
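
# e.g. the UnicodeData.txt line "05D0;HEBREW LETTER ALEF;Lo;0;R;;;;;N;;;;;"
# (an illustrative entry; most fields are empty) carries bidi class R in its
# fifth field, so U+05D0 is collected into rtls above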

cond do
category in ~w(Lu Lt) ->
- {codepoints ++ letter_uptitlecase, start, continue, first}
+ {codepoints ++ letter_uptitlecase, start, continue, rtls, neutrals, first}

category in ~w(Ll Lm Lo Nl) ->
- {letter_uptitlecase, codepoints ++ start, continue, first}
+ {letter_uptitlecase, codepoints ++ start, continue, rtls, neutrals, first}

category in ~w(Mn Mc Nd Pc) ->
- {letter_uptitlecase, start, codepoints ++ continue, first}
+ {letter_uptitlecase, start, codepoints ++ continue, rtls, neutrals, first}

true ->
- {letter_uptitlecase, start, continue, first}
+ {letter_uptitlecase, start, continue, rtls, neutrals, first}
end
end)

@@ -327,6 +338,28 @@ defmodule String.Tokenizer do

defp unicode_continue(_), do: @bottom

# subset of direction-changing/neutral characters valid in idents
id_all = id_upper ++ id_start ++ id_continue
dir_rtls = for c <- dir_rtls, c in id_all, do: {c, :rtl}
dir_neutrals = for c <- dir_neutrals, c not in 48..57, c in id_all, do: {c, :neutral}
dir_ranges = rangify.(dir_rtls) ++ rangify.(dir_neutrals)

# direction of a codepoint. (rtl, neutral, weak, ltr fallback)
# weaks are pulled towards previous directional spans,
# but the only weaks allowed in idents are numbers 0..9
def dir(i) when i in 48..57, do: :weak_number

for {first, last, direction} <- dir_ranges do
if first == last do
def dir(unquote(first)), do: unquote(direction)
else
def dir(i) when i in unquote(first)..unquote(last), do: unquote(direction)
end
end

def dir(i) when is_integer(i), do: :ltr
def dir(_), do: {:error, :codepoint_must_be_integer}
Comment from a Member:

Is there a reason we don't want to raise here? This will always be a bug, right?
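
# Illustrative uses of dir/1, consistent with the breakdowns asserted in
# warning_test.exs:
#   dir(?a)     # :ltr (fallback)
#   dir(?_)     # :neutral (bidi class ON)
#   dir(?5)     # :weak_number (0..9 are special-cased above)
#   dir(0x05D0) # :rtl (א, bidi class R)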


# Hard-coded normalizations. Also split by upper, start, continue.

for {from, to} <- start_normalizations do
@@ -448,7 +481,7 @@ defmodule String.Tokenizer do
[:nfkc | List.delete(special, :nfkc)]
end

- if scriptset != @bottom or highly_restrictive?(acc) do
+ if scriptset != @bottom or chunks_single_or_highly_restrictive?(acc) do
{kind, acc, rest, length, false, special}
else
breakdown =
@@ -477,24 +510,34 @@ defmodule String.Tokenizer do
Mixed-script identifiers are not supported for security reasons. \
'#{acc}' is made of the following scripts:\n
#{breakdown}
- All characters in the identifier should resolve to a single script, \
- or use a highly restrictive set of scripts.
+ All characters in identifier chunks should resolve to a single script, \
+ or a highly restrictive set of scripts.
"""

{:error, {:not_highly_restrictive, acc, {prefix, suffix}}}
end
end

- defp highly_restrictive?(acc) do
+ defp chunks_single_or_highly_restrictive?(acc) do
# support script mixing via chunked identifiers (UTS 55-5's strong recco)
# each chunk in an ident like foo_bar_baz should pass checks
acc
|> :string.tokens([?_])
|> Enum.all?(&single_or_highly_restrictive?/1)
end

defp single_or_highly_restrictive?(acc) do
scriptsets = Enum.map(acc, &codepoint_to_scriptset/1)
is_single_script = @bottom != Enum.reduce(scriptsets, @top, &ss_intersect/2)

# 'A set of scripts is defined to cover a string if the intersection of
# that set with the augmented script sets of all characters in the string
# is nonempty; in other words, if every character in the string shares at
# least one script with the cover set.'
- Enum.any?(@highly_restrictive, fn restrictive ->
- Enum.all?(scriptsets, &(ss_intersect(&1, restrictive) != @bottom))
- end)
+ is_single_script or
+   Enum.any?(@highly_restrictive, fn restrictive ->
+     Enum.all?(scriptsets, &(ss_intersect(&1, restrictive) != @bottom))
+   end)
end
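
# Illustrative: for 'Tシャツ' the per-character scriptsets do not intersect
# to a single script, but one of the @highly_restrictive sets ({Latn, Jpan})
# covers every character, so the chunk passes; for 'аdmin' ({Cyrillic} then
# {Latin}) neither check succeeds and tokenization fails.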

defp codepoint_to_scriptset(head) do