Skip to content

Commit 6444b5e

Browse files
committed
adding proof of context-sensitivy of raw string literals
1 parent 66a0b52 commit 6444b5e

File tree

1 file changed

+41
-6
lines changed

1 file changed

+41
-6
lines changed

src/grammar/raw-string-literal-ambiguity.md

Lines changed: 41 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,11 @@
11
Rust's lexical grammar is not context-free. Raw string literals are the source
22
of the problem. Informally, a raw string literal is an `r`, followed by `N`
33
hashes (where N can be zero), a quote, any characters, then a quote followed
4-
by `N` hashes. This grammar describes this as best possible:
4+
by `N` hashes. Critically, once inside the first pair of quotes,
5+
another quote cannot be followed by `N` consecutive hashes. e.g.
6+
`r###""###"###` is invalid.
7+
8+
This grammar describes this as best possible:
59

610
R -> 'r' S
711
S -> '"' B '"'
@@ -22,8 +26,39 @@ accepted as one by the above grammar, using the derivation:
2226
(Where `T : U` means the rule `T` is applied, and `U` is the remainder of the
2327
string.) The difficulty arises from the fact that it is fundamentally
2428
context-sensitive. In particular, the context needed is the number of hashes.
25-
I know of no way to resolve this, but also have not come up with a proof that
26-
it is not context sensitive. Such a proof would probably use the pumping lemma
27-
for context-free languages, but I (cmr) could not come up with a proof after
28-
spending a few hours on it, and decided my time best spent elsewhere. Pull
29-
request welcome!
29+
30+
To prove that Rust's string literals are not context-free, we will use
31+
the fact that context-free languages are closed under intersection with
32+
regular languages, and the
33+
[pumping lemma for context-free languages](https://en.wikipedia.org/wiki/Pumping_lemma_for_context-free_languages).
34+
35+
Consider the regular language `R = r#+""#*"#+`. If Rust's raw string literals are
36+
context-free, then their intersection with `R`, `R'`, should also be context-free.
37+
Therefore, to prove that raw string literals are not context-free,
38+
it is sufficient to prove that `R'` is not context-free.
39+
40+
The language `R'` is `{r#^n""#^m"#^n | m < n}`.
41+
42+
Assume `R'` *is* context-free. Then `R'` has some pumping length `p > 0` for which
43+
the pumping lemma applies. Consider the following string `s` in `R'`:
44+
45+
`r#^p""#^{p-1}"#^p`
46+
47+
e.g. for `p = 2`: `s = r##""#"##`
48+
49+
Then `s = uvwxy` for some choice of `uvwxy` such that `vx` is non-empty,
50+
`|vwx| < p+1`, and `uv^iwx^iy` is in `R'` for all `i >= 0`.
51+
52+
Neither `v` nor `x` can contain a `"` or `r`, as the number of these characters
53+
in any string in `R'` is fixed. So `v` and `x` contain only hashes.
54+
Consequently, of the three sequences of hashes, `v` and `x` combined
55+
can only pump two of them.
56+
If we ever choose the central sequence of hashes, then one of the outer sequences
57+
will not grow when we pump, leading to an imbalance between the outer sequences.
58+
Therefore, we must pump both outer sequences of hashes. However,
59+
there are `p+2` characters between these two sequences of hashes, and `|vwx|` must
60+
be less than `p+1`. Therefore we have a contradiction, and `R'` must not be
61+
context-free.
62+
63+
Since `R'` is not context-free, it follows that the Rust's raw string literals
64+
must not be context-free.

0 commit comments

Comments
 (0)