Commit eb008cb

Add guide for rustdoc search implementation (rust-lang#1846)
1 parent c829322 commit eb008cb

2 files changed: +245 -0 lines changed

src/SUMMARY.md

+1
@@ -74,6 +74,7 @@
7474
- [Serialization in Rustc](./serialization.md)
7575
- [Parallel Compilation](./parallel-rustc.md)
7676
- [Rustdoc internals](./rustdoc-internals.md)
77+
- [Search](./rustdoc-internals/search.md)
7778

7879
# Source Code Representation
7980

src/rustdoc-internals/search.md

+244
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,244 @@
1+
# Rustdoc search
2+
3+
Rustdoc Search is two programs: `search_index.rs`
4+
and `search.js`. The first generates a nasty JSON
5+
file with a full list of items and function signatures
6+
in the crates in the doc bundle, and the second reads
7+
it, turns it into some in-memory structures, and
8+
scans them linearly to search.
9+
10+
<!-- toc -->
11+
12+
## Search index format
13+
14+
`search.js` calls this Raw, because it turns it into
15+
a more normal object tree after loading it.
16+
Naturally, it's also written without newlines or spaces.
17+
18+
```json
[
  [ "crate_name", {
    "doc": "Documentation",
    "n": ["function_name", "Data"],
    "t": "HF",
    "d": ["This function gets the name of an integer with Data", "The data struct"],
    "q": [[0, "crate_name"]],
    "i": [2, 0],
    "p": [[1, "i32"], [1, "str"], [5, "crate_name::Data"]],
    "f": "{{gb}{d}}`",
    "b": [],
    "c": [],
    "a": [["get_name", 0]]
  }]
]
```
35+
36+
[`src/librustdoc/html/static/js/externs.js`]
37+
defines an actual schema in a Closure `@typedef`.
38+
39+
The above index defines a crate called `crate_name`
40+
with a free function called `function_name` and a struct called `Data`,
41+
with the type signature `Data, i32 -> str`,
42+
and an alias, `get_name`, that equivalently refers to `function_name`.
43+
44+
[`src/librustdoc/html/static/js/externs.js`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/externs.js#L204-L258
45+
46+
The search index needs to fit the needs of the `rustdoc` compiler,
47+
the `search.js` frontend,
48+
and also be compact and fast to decode.
49+
It makes a lot of compromises:
50+
51+
* The `rustdoc` compiler runs on one crate at a time,
52+
so each crate has an essentially separate search index.
53+
It [merges] them by having each crate on one line
54+
and looking at the first quoted string.
55+
* Names in the search index are given
56+
in their original case and with underscores.
57+
When the search index is loaded,
58+
`search.js` stores the original names for display,
59+
but also folds them to lowercase and strips underscores for search.
60+
You'll see them called `normalized`.
61+
* The `f` array stores types as offsets into the `p` array.
62+
These types might actually be from another crate,
63+
so `search.js` has to turn the numbers into names and then
64+
back into numbers to deduplicate them if multiple crates in the
65+
same index mention the same types.
66+
* It's a JSON file, but not designed to be human-readable.
67+
Browsers already include an optimized JSON decoder,
68+
so this saves on `search.js` code and performs better for small crates,
69+
but instead of using objects like normal JSON formats do,
70+
it tries to put data of the same type next to each other
71+
so that the sliding window used by [DEFLATE] can find redundancies.
72+
Where `search.js` does its own compression,
73+
it's designed to save memory when the file is finally loaded,
74+
not just size on disk or network transfer.
75+
76+
[merges]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/render/write_shared.rs#L151-L164
77+
[DEFLATE]: https://en.wikipedia.org/wiki/Deflate
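
The name folding described in the list above is small enough to sketch directly. This is a minimal Rust illustration under the stated rules (lowercase, underscores stripped), not the code in `search.js`; the function name `normalize` is made up for the example.

```rust
/// Fold a name into the form used for matching:
/// lowercase, with underscores removed.
/// `normalize` is a hypothetical helper name, not an identifier from `search.js`.
fn normalize(name: &str) -> String {
    name.chars()
        .filter(|&c| c != '_')        // strip underscores
        .flat_map(char::to_lowercase) // fold case (may expand to several chars)
        .collect()
}
```

With this, `function_name` and `FunctionName` both fold to `functionname`, so either spelling in a query can find the same item, while the original name is kept separately for display.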
78+
79+
### Parallel arrays and indexed maps
80+
81+
Most data in the index
82+
(other than `doc`, which is a single string for the whole crate,
83+
`p`, which is a separate structure
84+
and `a`, which is also a separate structure)
85+
is a set of parallel arrays defining each searchable item.
86+
87+
For example,
88+
the above search index can be turned into this table:
89+
90+
| n | t | d | q | i | f | b | c |
|---|---|---|---|---|---|---|---|
| `function_name` | `H` | This function gets the name of an integer with Data | `crate_name` | 2 | `{{gb}{d}}` | NULL | NULL |
| `Data` | `F` | The data struct | `crate_name` | 0 | `` ` `` | NULL | NULL |
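
Turning the parallel arrays into rows amounts to zipping them together by position. The sketch below shows that for `n`, `t`, `d`, and `i` only; the `Row` struct, its field names, and the `rows` function are assumptions made for illustration and are not the structures `search.js` builds.

```rust
/// One searchable item, assembled from the same position in each parallel array.
/// The struct and its field names are illustrative, not taken from rustdoc.
#[derive(Debug, PartialEq)]
struct Row<'a> {
    name: &'a str,         // from `n`
    kind: char,            // from `t`, one character per item
    description: &'a str,  // from `d`
    parent: Option<usize>, // from `i`; 0 means "no parent item"
}

/// Zip the parallel arrays into one row per item.
fn rows<'a>(n: &[&'a str], t: &str, d: &[&'a str], i: &[usize]) -> Vec<Row<'a>> {
    n.iter()
        .zip(t.chars())
        .zip(d.iter())
        .zip(i.iter())
        .map(|(((&name, kind), &description), &parent)| Row {
            name,
            kind,
            description,
            // Keep the one-indexed reference into `p`; zero becomes `None`.
            parent: if parent == 0 { None } else { Some(parent) },
        })
        .collect()
}
```

Because `zip` stops at the shortest input, a truncated array silently drops rows in this sketch; a real loader would presumably want to check that the lengths agree.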
94+
95+
The above code doesn't use `c`, which holds deprecated indices,
96+
or `b`, which maps indices to strings.
97+
If `crate_name::function_name` used both, it would look like this.
98+
99+
```json
"b": [[0, "impl-Foo-for-Bar"]],
"c": [0],
```
103+
104+
This attaches a disambiguator to index 0 and marks it deprecated.
105+
106+
The advantage of this layout is that these APIs often have implicit structure
107+
that DEFLATE can take advantage of,
108+
but that rustdoc can't assume.
109+
Like how names are usually CamelCase or snake_case,
110+
but descriptions aren't.
111+
112+
`q` is a Map from *the first applicable* ID to a parent module path.
113+
This is a weird trick, but it makes more sense in pseudo-code:
114+
115+
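```rust
// Pseudo-code: `q` maps the first applicable item index to a parent module path.
let mut parent_module = "";
for (i, entry) in search_index.iter().enumerate() {
    // A path applies from index `i` until the next index listed in `q`.
    if let Some(path) = q.get(&i) {
        parent_module = path;
    }
    // ... do other stuff with `entry` ...
}
```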
124+
125+
This is valid because everything has a parent module
126+
(even if it's just the crate itself),
127+
and is easy to assemble because the rustdoc generator sorts by path
128+
before serializing.
129+
Doing this allows rustdoc to not only make the search index smaller,
130+
but reuse the same string representing the parent path across multiple in-memory items.
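
For a version of the loop above that actually compiles, here is a minimal sketch that expands `q` into one parent path per item. It assumes `q` is given as a slice of `(index, path)` pairs sorted by index, as in the JSON example; the function name `parent_paths` is hypothetical.

```rust
/// Expand `q` into one parent path per item.
/// Assumes `q` is sorted by index; each entry applies from its index
/// until the index of the next entry.
fn parent_paths<'a>(item_count: usize, q: &[(usize, &'a str)]) -> Vec<&'a str> {
    let mut q_pos = 0; // next unread entry of `q`
    let mut parent_module: &'a str = "";
    let mut paths = Vec::with_capacity(item_count);
    for i in 0..item_count {
        // Switch to a new parent path when `q` has an entry for this index.
        if q_pos < q.len() && q[q_pos].0 == i {
            parent_module = q[q_pos].1;
            q_pos += 1;
        }
        // Every item reuses the same borrowed string, not a copy.
        paths.push(parent_module);
    }
    paths
}
```

For the example index, `parent_paths(2, &[(0, "crate_name")])` assigns `crate_name` to both items even though `q` mentions it only once.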
131+
132+
### `i`, `f`, and `p`
133+
134+
`i` and `f` both index into `p`, the array of parent items.
135+
136+
`i` is just a one-indexed number
137+
(not zero-indexed because `0` is used for items that have no parent item).
138+
It's different from `q` because `q` represents the parent *module or crate*,
139+
which everything has,
140+
while `i`/`p` are used for *type and trait-associated items* like methods.
141+
142+
`f`, the function signatures, use their own encoding.
143+
144+
```ebnf
f = { FItem | FBackref }
FItem = FNumber | ( '{', {FItem}, '}' )
FNumber = { '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' }, ( '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' )
FBackref = ( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | ';' | '<' | '=' | '>' | '?' )
```
150+
151+
An FNumber is a variable-length, self-terminating base16 number
152+
(terminated because the last hexit is lowercase while all others are uppercase).
153+
These are one-indexed references into `p`, because zero is used for nulls,
154+
and negative numbers represent generics.
155+
The sign bit is represented using [zig-zag encoding]
156+
(the internal object representation also uses negative numbers,
157+
even after decoding,
158+
to represent generics).
159+
This alphabet is chosen because the characters can be turned into hexits by
keeping only the low four bits of the ASCII encoding
(for example, `A` is `0x41` and `a` is `0x61`, and both yield the hexit `1`).
161+
162+
For example, `{{gb}{d}}` is equivalent to the json `[[3, 1], [2]]`.
163+
Because of zigzag encoding, `` ` `` is +0, `a` is -0 (which is not used),
164+
`b` is +1, and `c` is -1.
165+
166+
[empirically]: https://github.com/rust-lang/rust/pull/83003
167+
[zig-zag encoding]: https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding
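
The decoding rules above can be sketched as a small recursive parser. This is an illustration written from the grammar and the zig-zag description in this section, not a port of the decoder in `search.js`: the `FItem` enum and the function names are assumptions, and `FBackref` characters are deliberately not handled (they are rejected as malformed).

```rust
/// A decoded signature item: a (possibly negative) number or a nested list.
#[derive(Debug, PartialEq)]
enum FItem {
    Number(i32),
    List(Vec<FItem>),
}

/// Decode one item starting at `pos`; returns the item and the next position.
/// Returns `None` on malformed input. `FBackref` is not supported here.
fn decode_item(bytes: &[u8], mut pos: usize) -> Option<(FItem, usize)> {
    if *bytes.get(pos)? == b'{' {
        pos += 1;
        let mut items = Vec::new();
        while *bytes.get(pos)? != b'}' {
            let (item, next) = decode_item(bytes, pos)?;
            items.push(item);
            pos = next;
        }
        return Some((FItem::List(items), pos + 1)); // skip the closing '}'
    }
    let mut n: u32 = 0;
    loop {
        let c = *bytes.get(pos)?;
        pos += 1;
        match c {
            // '@'..='O': a non-final hexit; the low four bits are its value.
            b'@'..=b'O' => n = (n << 4) | u32::from(c & 0xF),
            // '`'..='o': the final hexit, which terminates the number.
            b'`'..=b'o' => {
                n = (n << 4) | u32::from(c & 0xF);
                break;
            }
            _ => return None,
        }
    }
    // Zig-zag: the lowest bit is the sign, the remaining bits the magnitude.
    let magnitude = (n >> 1) as i32;
    let value = if n & 1 == 1 { -magnitude } else { magnitude };
    Some((FItem::Number(value), pos))
}

/// Decode a whole string of concatenated items.
fn decode_all(s: &str) -> Option<Vec<FItem>> {
    let bytes = s.as_bytes();
    let mut pos = 0;
    let mut items = Vec::new();
    while pos < bytes.len() {
        let (item, next) = decode_item(bytes, pos)?;
        items.push(item);
        pos = next;
    }
    Some(items)
}
```

Under these rules `` ` `` decodes to 0, `b` to 1, and `c` to -1, matching the zig-zag examples above, and the two-character number ``A` `` decodes to 8 (hexits 1 and 0 give 16, whose sign bit is clear).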
168+
169+
## Searching by name
170+
171+
Searching by name works by looping through the search index
172+
and running these functions on each:
173+
174+
* [`editDistance`] is always used to determine a match
175+
(unless quotes are specified, which would use simple equality instead).
176+
It computes the number of swaps, inserts, and removes needed to turn
177+
the query name into the entry name.
178+
For example, `foo` has zero distance from itself,
179+
but a distance of 1 from `ofo` (one swap) and `foob` (one insert).
180+
It is checked against a heuristic threshold, and then,
181+
if it is within that threshold, the distance is stored for ranking.
182+
* [`String.prototype.indexOf`] is always used to determine a match.
183+
If it returns anything other than -1, the result is added,
184+
even if `editDistance` exceeds its threshold,
185+
and the index is stored for ranking.
186+
* [`checkPath`] is used if, and only if, a parent path is specified
187+
in the query. For example, `vec` has no parent path, but `vec::vec` does.
188+
Within checkPath, editDistance and indexOf are used,
189+
and the path query has its own heuristic threshold, too.
190+
If it's not within the threshold, the entry is rejected,
191+
even if the first two pass.
192+
If it's within the threshold, the path distance is stored
193+
for ranking.
194+
* [`checkType`] is used only if there's a type filter,
195+
like the struct in `struct:vec`. If it fails,
196+
the entry is rejected.
197+
198+
If all four criteria pass
199+
(plus the crate filter, which isn't technically part of the query),
200+
the results are sorted by [`sortResults`].
201+
202+
[`editDistance`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L137
203+
[`String.prototype.indexOf`]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/indexOf
204+
[`checkPath`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1814
205+
[`checkType`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1787
206+
[`sortResults`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1229
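
To make the first two checks concrete, here is a minimal sketch of an edit distance that counts swaps of adjacent characters alongside inserts and removes, combined with a substring test. It is not rustdoc's `editDistance`: it uses the textbook "optimal string alignment" recurrence, it also counts single-character substitutions (an assumption, since the description above does not mention them), and `max_distance` is a stand-in for the heuristic threshold, whose real value is not given here.

```rust
/// Edit distance counting inserts, removes, substitutions, and swaps of
/// adjacent characters (the "optimal string alignment" variant).
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // dist[i][j] = distance between the first i chars of `a` and first j of `b`.
    let mut dist = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 0..=a.len() {
        dist[i][0] = i;
    }
    for j in 0..=b.len() {
        dist[0][j] = j;
    }
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let cost = usize::from(a[i - 1] != b[j - 1]);
            let mut best = (dist[i - 1][j] + 1) // remove
                .min(dist[i][j - 1] + 1) // insert
                .min(dist[i - 1][j - 1] + cost); // substitute or keep
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                best = best.min(dist[i - 2][j - 2] + 1); // swap
            }
            dist[i][j] = best;
        }
    }
    dist[a.len()][b.len()]
}

/// Accept an entry if it is close enough by edit distance, or if the query
/// occurs inside the entry name (the `indexOf` check, assumed to search the
/// entry for the query).
fn name_matches(query: &str, entry: &str, max_distance: usize) -> bool {
    edit_distance(query, entry) <= max_distance || entry.contains(query)
}
```

This reproduces the distances quoted above: `foo` is at distance 0 from itself and at distance 1 from both `ofo` and `foob`. A query like `vec` still matches `vecdeque` through the substring test even though its edit distance is well over any small threshold.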
207+
208+
## Searching by type
209+
210+
Searching by type can be divided into two phases,
211+
and the second phase has two sub-phases.
212+
213+
* Turn names in the query into numbers.
214+
* Loop over each entry in the search index:
215+
* Quick rejection using a bloom filter.
216+
* Slow rejection using a recursive type unification algorithm.
217+
218+
In the names->numbers phase, if the query has only one name in it,
219+
the editDistance function is used to find a near match if the exact match fails,
220+
but if there are multiple items in the query,
221+
non-matching items are treated as generics instead.
222+
This means `hahsmap` will match hashmap on its own, but `hahsmap, u32`
223+
is going to match the same things `T, u32` matches
224+
(though rustdoc will detect this particular problem and warn about it).
225+
226+
Then, when actually looping over each item,
227+
the bloom filter will probably reject entries that don't have every
228+
type mentioned in the query.
229+
For example, the bloom filter allows a query of `i32 -> u32` to match
230+
a function with the type `i32, u32 -> bool`,
231+
but unification will reject it later.
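
The quick-rejection step can be pictured with a toy bloom filter over numeric type IDs. Everything concrete in this sketch is an assumption chosen for illustration: the 64-bit width, the two hash functions, and the names `Bloom` and `passes_quick_check` are not rustdoc's actual parameters or identifiers.

```rust
/// A tiny bloom filter over type IDs, stored in one 64-bit word.
/// The width and both hash functions are arbitrary illustrative choices.
#[derive(Clone, Copy, Default)]
struct Bloom(u64);

impl Bloom {
    /// The two bits that represent `id` in the filter.
    fn bits(id: u32) -> u64 {
        let h1 = id % 64;
        let h2 = id.wrapping_mul(7).wrapping_add(3) % 64;
        (1u64 << h1) | (1u64 << h2)
    }

    fn insert(&mut self, id: u32) {
        self.0 |= Self::bits(id);
    }

    /// `false` means the ID is definitely absent;
    /// `true` only means it is possibly present.
    fn may_contain(&self, id: u32) -> bool {
        (self.0 & Self::bits(id)) == Self::bits(id)
    }

    /// Build the filter for a function from every type ID it mentions,
    /// ignoring whether each one is an input or an output.
    fn from_ids(ids: &[u32]) -> Bloom {
        let mut filter = Bloom::default();
        for &id in ids {
            filter.insert(id);
        }
        filter
    }
}

/// Quick rejection: every type in the query must possibly be in the function.
fn passes_quick_check(function: &Bloom, query_ids: &[u32]) -> bool {
    query_ids.iter().all(|&id| function.may_contain(id))
}
```

If `i32`, `u32`, and `bool` are given the IDs 1, 2, and 3, a function `i32, u32 -> bool` passes the quick check for the query `i32 -> u32`, because the filter has no notion of inputs versus outputs. A query mentioning ID 4 is rejected immediately with these particular hash functions.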
232+
233+
The unification filter ensures that:
234+
235+
* Bag semantics are respected. If your query says `i32, i32`,
236+
then the function has to mention *two* i32s, not just one.
237+
* Nesting semantics are respected. If your query says `vec<option>`,
238+
then `vec<option<i32>>` is fine, but `option<vec<i32>>` *is not* a match.
239+
* The division between return type and parameter is respected.
240+
`i32 -> u32` and `u32 -> i32` are completely different.
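
Two of the three rules above, bag semantics and the input/output split, can be sketched with plain multiset containment. This is a deliberately simplified illustration, not rustdoc's unification algorithm: it ignores nesting and generics, it assumes a function may mention extra types beyond those in the query, and the names `bag_contains`, `Signature`, and `unifies` are made up.

```rust
use std::collections::HashMap;

/// True if every type in `query` occurs in `function` at least as many times.
fn bag_contains(function: &[&str], query: &[&str]) -> bool {
    // Count how many of each type the function has available.
    let mut available: HashMap<&str, usize> = HashMap::new();
    for ty in function {
        *available.entry(*ty).or_insert(0) += 1;
    }
    // Each queried type consumes one occurrence.
    for ty in query {
        match available.get_mut(*ty) {
            Some(count) if *count > 0 => *count -= 1,
            _ => return false,
        }
    }
    true
}

/// A flat signature: type names to the left and right of `->`.
struct Signature<'a> {
    inputs: Vec<&'a str>,
    output: Vec<&'a str>,
}

/// Inputs are matched only against inputs, and outputs only against outputs.
fn unifies(function: &Signature, query: &Signature) -> bool {
    bag_contains(&function.inputs, &query.inputs)
        && bag_contains(&function.output, &query.output)
}
```

With this check, the query `i32, i32` fails against a function that mentions `i32` once, and the query `i32 -> u32` fails against `i32, u32 -> bool` because `u32` never appears in the output.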
241+
242+
The bloom filter checks none of these things,
243+
and, on top of that, can have false positives.
244+
But it's fast and uses very little memory, so the bloom filter helps.
