@@ -13,29 +13,55 @@ scans them linearly to search.
13
13
14
14
` search.js ` calls this Raw, because it turns it into
15
15
a more normal object tree after loading it.
16
- Naturally, it's also written without newlines or spaces.
16
+ For space savings,
17
+ it's also written without newlines or spaces.
17
18
18
19
``` json
19
20
[
20
21
[ " crate_name" , {
21
- "doc" : " Documentation " ,
22
+ // name
22
23
"n" : [" function_name" , " Data" ],
24
+ // type
23
25
"t" : " HF" ,
24
- "d" : [ " This function gets the name of an integer with Data " , " The data struct " ],
26
+ // parent module
25
27
"q" : [[0 , " crate_name" ]],
28
+ // parent type
26
29
"i" : [2 , 0 ],
27
- "p" : [[1 , " i32" ], [1 , " str" ], [5 , " crate_name::Data" ]],
28
- "f" : " {{gb}{d}}`" ,
30
+ // function signature
31
+ "f" : " {{gb}{d}}`" , // [[3, 1], [2]]
32
+ // impl disambiguator
29
33
"b" : [],
30
- "c" : [],
34
+ // deprecated flag
35
+ "c" : " OjAAAAAAAAA=" , // empty bitmap
36
+ // empty description flag
37
+ "e" : " OjAAAAAAAAA=" , // empty bitmap
38
+ // type dictionary
39
+ "p" : [[1 , " i32" ], [1 , " str" ], [5 , " crate_name::Data" ]],
40
+ // aliases
31
41
"a" : [[" get_name" , 0 ]],
42
+ // description shards
43
+ "D" : " g" , // 3
32
44
}]
33
45
]
34
46
```
35
47
36
48
[ ` src/librustdoc/html/static/js/externs.js ` ]
37
49
defines an actual schema in a Closure ` @typedef ` .
38
50
51
+ | Key | Name | Description |
52
+ | --- | -------------------- | ------------ |
53
+ | ` n ` | Names | Item names |
54
+ | ` t ` | Item Type | One-char item type code |
55
+ | ` q ` | Parent module | ` Map<index, path> ` |
56
+ | ` i ` | Parent type | list of indexes |
57
+ | ` f ` | Function signature | [ encoded] ( #i-f-and-p ) |
58
+ | ` b ` | Impl disambiguator | ` Map<index, string> ` |
59
+ | ` c ` | Deprecation flag | [ roaring bitmap] ( #roaring-bitmaps ) |
60
+ | ` e ` | Description is empty | [ roaring bitmap] ( #roaring-bitmaps ) |
61
+ | ` p ` | Type dictionary | ` [[item type, path]] ` |
62
+ | ` a ` | Alias | ` Map<string, index> ` |
63
+ | ` D ` | Description shards | [ encoded] ( #how-descriptions-are-stored ) |
64
+
39
65
The above index defines a crate called ` crate_name `
40
66
with a free function called ` function_name ` and a struct called ` Data ` ,
41
67
with the type signature ` Data, i32 -> str ` ,
@@ -78,36 +104,45 @@ It makes a lot of compromises:
78
104
79
105
### Parallel arrays and indexed maps
80
106
81
- Most data in the index
82
- (other than ` doc ` , which is a single string for the whole crate,
83
- ` p ` , which is a separate structure
84
- and ` a ` , which is also a separate structure)
85
- is a set of parallel arrays defining each searchable item.
107
+ Abstractly, Rustdoc Search data is a table, stored in column-major form.
108
+ Most data in the index represents a set of parallel arrays
109
+ (the "columns"), where entries at the same position in each array describe the same item.
86
110
87
111
For example,
88
112
the above search index can be turned into this table:
89
113
90
- | n | t | d | q | i | f | b | c |
91
- | ---| ---| ---| ---| ---| ---| ---| ---|
92
- | ` function_name ` | ` H ` | This function gets the name of an integer with Data | ` crate_name ` | 2 | ` {{gb}{d}} ` | NULL | NULL |
93
- | ` Data ` | ` F ` | The data struct | ` crate_name ` | 0 | `` ` `` | NULL | NULL |
114
+ | | n | t | [ d] | q | i | f | b | c |
115
+ | ---| ---| ---| -----| ---| ---| ---| ---| ---|
116
+ | 0 | ` crate_name ` | ` D ` | Documentation | NULL | 0 | NULL | NULL | 0 |
117
+ | 1 | ` function_name ` | ` H ` | This function gets the name of an integer with Data | ` crate_name ` | 2 | ` {{gb}{d}} ` | NULL | 0 |
118
+ | 2 | ` Data ` | ` F ` | The data struct | ` crate_name ` | 0 | `` ` `` | NULL | 0 |
119
+
120
+ [ d ] : #how-descriptions-are-stored
121
+
122
+ The crate row is implied in most columns, since its type is known (it's a crate),
123
+ it can't have a parent (crates form the root of the module tree),
124
+ its name is specified as the map key,
125
+ and function-specific data like the impl disambiguator can't apply either.
126
+ However, it can still have a description and it can still be deprecated.
127
+ The crate, therefore, has a primary key of ` 0 ` .
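
As a rough illustration of the column-major layout, the following Python sketch zips the `n` and `t` columns of the example index back into rows. The function name `rows_from_columns` and the row dictionaries are hypothetical, and search.js differs in its details. The only facts taken from the text are that the crate is row 0, that its name comes from the map key, and that `t` holds one character per item.

```python
def rows_from_columns(crate_name, raw):
    """Rebuild per-item rows from the parallel `n` and `t` columns."""
    # Row 0 is the crate itself; its name is the map key, not an entry in `n`.
    rows = [{"id": 0, "name": crate_name}]
    # Every other item takes the next id, so position k in a column is item k + 1.
    for item_id, (name, type_code) in enumerate(zip(raw["n"], raw["t"]), start=1):
        rows.append({"id": item_id, "name": name, "type": type_code})
    return rows


example = {"n": ["function_name", "Data"], "t": "HF"}
rows = rows_from_columns("crate_name", example)
```

Here `rows[1]` pairs `function_name` with `H` and `rows[2]` pairs `Data` with `F`, matching the table above.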
94
128
95
129
The above code doesn't use ` c ` , which holds deprecated indices,
96
130
or ` b ` , which maps indices to strings.
97
- If ` crate_name::function_name ` used both, it would look like this.
131
+ If ` crate_name::function_name ` used both, it might look like this.
98
132
99
133
``` json
100
134
"b" : [[0 , " impl-Foo-for-Bar" ]],
101
- "c" : [ 0 ] ,
135
+ "c" : " OjAAAAEAAAAAAAIAEAAAABUAbgZYCQ== " ,
102
136
```
103
137
104
- This attaches a disambiguator to index 0 and marks it deprecated.
138
+ This attaches a disambiguator to index 1 and marks it deprecated.
105
139
106
140
The advantage of this layout is that these APIs often have implicit structure
107
141
that DEFLATE can take advantage of,
108
142
but that rustdoc can't assume.
109
143
Like how names are usually CamelCase or snake_case,
110
144
but descriptions aren't.
145
+ It also makes it easier to use a sparse representation for things like boolean flags.
111
146
112
147
` q ` is a Map from * the first applicable* ID to a parent module path.
113
148
This is a weird trick, but it makes more sense in pseudo-code:
@@ -129,6 +164,98 @@ before serializing.
129
164
Doing this allows rustdoc to not only make the search index smaller,
130
165
but reuse the same string representing the parent path across multiple in-memory items.
131
166
167
+ ### Representing sparse columns
168
+
169
+ #### VLQ Hex
170
+
171
+ This format is, as far as I know, used nowhere other than rustdoc.
172
+ It follows this grammar:
173
+
174
+ ``` ebnf
175
+ VLQHex = { VHItem | VHBackref }
176
+ VHItem = VHNumber | ( '{', {VHItem}, '}' )
177
+ VHNumber = { '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' }, ( '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' )
178
+ VHBackref = ( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | ';' | '<' | '=' | '>' | '?' )
179
+ ```
180
+
181
+ A VHNumber is a variable-length, self-terminating base16 number
182
+ (terminated because the last hexit is lowercase while all others are uppercase).
183
+ The sign bit is represented using [ zig-zag encoding] .
184
+
185
+ This alphabet is chosen because the characters can be turned into hexits by
186
+ keeping only the low four bits of the ASCII encoding.
187
+
188
+ A major feature of this encoding, as with all of the "compression" done in rustdoc,
189
+ is that it can remain in its compressed format * even in memory at runtime* .
190
+ This is why ` VHBackref ` is only used at the top level,
191
+ and why we don't just use [ Flate] for everything: the decoder in search.js
192
+ will reuse the entire decoded object whenever a backref is seen,
193
+ saving decode work and memory.
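
To make the grammar concrete, here is a minimal Python sketch of a decoder for `VHItem` (numbers and brace-delimited lists). It is not the search.js implementation. The function names are hypothetical, input is assumed to be well-formed, and backrefs are left out because their exact lookup rules are not described here. The zig-zag step treats the lowest bit as the sign and the remaining bits as the magnitude.

```python
def decode_vlqhex_number(s, pos=0):
    """Decode one VHNumber starting at s[pos]; return (value, next_pos)."""
    n = 0
    while True:
        c = ord(s[pos])
        pos += 1
        # The low four bits of every character are one hexit.
        n = (n << 4) | (c & 0xF)
        # '`'..'o' (0x60 and up) is the final hexit, '@'..'O' means "more follows".
        if c >= 0x60:
            break
    # Zig-zag: the lowest bit is the sign, the rest is the magnitude.
    magnitude, negative = n >> 1, n & 1
    return (-magnitude if negative else magnitude), pos


def decode_vlqhex_item(s, pos=0):
    """Decode one VHItem (a number or a nested list); return (item, next_pos)."""
    if s[pos] == "{":
        pos += 1
        items = []
        while s[pos] != "}":
            item, pos = decode_vlqhex_item(s, pos)
            items.append(item)
        return items, pos + 1  # skip the closing brace
    return decode_vlqhex_number(s, pos)
```

Under these assumptions, `` ` `` decodes to 0, `b` to 1, `c` to -1, and the two-character number `Ab` to 9.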
194
+
195
+ [ zig-zag encoding ] : https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding
196
+ [ Flate ] : https://en.wikipedia.org/wiki/Deflate
197
+
198
+ #### Roaring Bitmaps
199
+
200
+ Flag-style data, such as deprecation and empty descriptions,
201
+ are stored using the [ standard Roaring Bitmap serialization format with runs] .
202
+ The data is then base64 encoded when writing it.
203
+
204
+ As a brief overview: a roaring bitmap is a chunked array of bits,
205
+ described in [ this paper] .
206
+ A chunk can either be a list of integers, a bitfield, or a list of runs.
207
+ In any case, the search engine has to base64 decode it,
208
+ and read the chunk index itself,
209
+ but the payload data stays as-is.
210
+
211
+ All roaring bitmaps in rustdoc currently store a flag for each item index.
212
+ The crate is item 0; all other items start at 1.
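
The following Python sketch shows one way to read such a column. It is an illustration based on the public Roaring format specification, not on search.js: it expands the whole bitmap into a list of item indexes, whereas the text above notes that the real decoder leaves container payloads as-is. The name `decode_roaring` is hypothetical.

```python
import base64
import struct

SERIAL_COOKIE_NO_RUNCONTAINER = 12346
SERIAL_COOKIE = 12347
NO_OFFSET_THRESHOLD = 4


def decode_roaring(b64):
    """Return the sorted item indexes stored in a base64 roaring bitmap."""
    data = base64.b64decode(b64)
    (cookie,) = struct.unpack_from("<I", data, 0)
    pos = 4
    if cookie == SERIAL_COOKIE_NO_RUNCONTAINER:
        (count,) = struct.unpack_from("<I", data, pos)
        pos += 4
        is_run = [False] * count
        has_offsets = True
    elif cookie & 0xFFFF == SERIAL_COOKIE:
        # The high 16 bits hold the container count minus one.
        count = (cookie >> 16) + 1
        run_flags = data[pos:pos + (count + 7) // 8]
        pos += len(run_flags)
        is_run = [bool(run_flags[i // 8] >> (i % 8) & 1) for i in range(count)]
        has_offsets = count >= NO_OFFSET_THRESHOLD
    else:
        raise ValueError("not a roaring bitmap")
    # Descriptive header: (key, cardinality - 1) for each chunk.
    headers = [struct.unpack_from("<HH", data, pos + 4 * i) for i in range(count)]
    pos += 4 * count
    if has_offsets:
        pos += 4 * count  # offsets are skipped; chunks are read in order
    values = []
    for (key, card_minus_1), run in zip(headers, is_run):
        base = key << 16
        if run:  # list of runs: (start, length - 1) pairs
            (n_runs,) = struct.unpack_from("<H", data, pos)
            pos += 2
            for _ in range(n_runs):
                start, len_minus_1 = struct.unpack_from("<HH", data, pos)
                pos += 4
                values.extend(range(base + start, base + start + len_minus_1 + 1))
        elif card_minus_1 + 1 <= 4096:  # sorted list of 16-bit integers
            n = card_minus_1 + 1
            values.extend(base + v for v in struct.unpack_from(f"<{n}H", data, pos))
            pos += 2 * n
        else:  # 8192-byte bitfield
            bits = data[pos:pos + 8192]
            pos += 8192
            values.extend(base + i for i in range(65536) if bits[i // 8] >> (i % 8) & 1)
    return values
```

With this sketch, `decode_roaring("OjAAAAAAAAA=")` returns an empty list, which agrees with the "empty bitmap" comments in the example index.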
213
+
214
+ [ standard Roaring Bitmap serialization format with runs ] : https://github.com/RoaringBitmap/RoaringFormatSpec
215
+ [ this paper ] : https://arxiv.org/pdf/1603.06549.pdf
216
+
217
+ ### How descriptions are stored
218
+
219
+ The largest amount of data,
220
+ and the main thing Rustdoc Search deals with that isn't
221
+ actually used for searching, is descriptions.
222
+ In a search engine results page (SERP) table, this is what appears in the rightmost column.
223
+
224
+ > | item type | item path | *** description*** (this part) |
225
+ > | --------- | --------------------- | --------------------------------------------------- |
226
+ > | function | my_crate::my_function | This function gets the name of an integer with Data |
227
+
228
+ When someone runs a search in rustdoc for the first time, their browser will
229
+ work through a "sandwich workload" of three steps:
230
+
231
+ 1 . Download the search-index.js and search.js files (a network bottleneck).
232
+ 2 . Perform the actual search (a CPU and memory bandwidth bottleneck).
233
+ 3 . Download the description data (another network bottleneck).
234
+
235
+ Reducing the amount of data downloaded here will almost always increase latency,
236
+ by delaying the decision of what to download behind other work and/or adding
237
+ data dependencies where something can't be downloaded without first downloading
238
+ something else. In this case, we can't start downloading descriptions until
239
+ after the search is done, because that's what allows it to decide * which*
240
+ descriptions to download (it needs to sort the results then truncate to 200).
241
+
242
+ To do this, two columns are stored in the search index, building on both
243
+ Roaring Bitmaps and on VLQ Hex.
244
+
245
+ * ` e ` is an index of ** e** mpty descriptions. It's a [ roaring bitmap] of
246
+ each item (the crate itself is item 0, the rest start at 1).
247
+ * ` D ` is a shard list, stored in [ VLQ hex] as a flat list of integers.
248
+ Each integer gives you the number of descriptions in the shard.
249
+ As the decoder walks the index, it checks if the description is empty.
250
+ If it's not, then it's in the "current" shard. When all items are
251
+ exhausted, it goes on to the next shard.
252
+
253
+ Inside each shard is a newline-delimited list of descriptions,
254
+ wrapped in a JSONP-style function call.
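
The walk over `e` and `D` described above can be sketched in Python as follows. This is an illustration only: `locate_descriptions` is a hypothetical name, `shard_sizes` stands for the already-decoded `D` list, and `empty` stands for the already-decoded `e` bitmap as a set of item indexes.

```python
def locate_descriptions(shard_sizes, empty, item_count):
    """Map each item with a description to (shard index, position in shard)."""
    locations = {}
    shard, used = 0, 0
    for item in range(item_count):
        if item in empty:
            continue  # empty descriptions take no space in any shard
        # Move on once the current shard has handed out all of its descriptions.
        while shard < len(shard_sizes) and used == shard_sizes[shard]:
            shard, used = shard + 1, 0
        if shard == len(shard_sizes):
            raise ValueError("more descriptions than the shard list accounts for")
        locations[item] = (shard, used)
        used += 1
    return locations
```

For the example index (three items, no empty descriptions, one shard of size 3), every description lands in shard 0, so showing a result only requires downloading that one shard and picking the right line.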
255
+
256
+ [ roaring bitmap ] : #roaring-bitmaps
257
+ [ VLQ hex ] : #vlq-hex
258
+
132
259
### ` i ` , ` f ` , and ` p `
133
260
134
261
` i ` and ` f ` both index into ` p ` , the array of parent items.
@@ -139,33 +266,19 @@ It's different from `q` because `q` represents the parent *module or crate*,
139
266
which everything has,
140
267
while ` i ` /` p ` are used for * type and trait-associated items* like methods.
141
268
142
- ` f ` , the function signatures, use their own encoding.
143
-
144
- ``` ebnf
145
- f = { FItem | FBackref }
146
- FItem = FNumber | ( '{', {FItem}, '}' )
147
- FNumber = { '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' }, ( '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k ' | 'l' | 'm' | 'n' | 'o' )
148
- FBackref = ( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | ';' | '<' | '=' | '>' | '?' )
149
- ```
269
+ ` f ` , the function signatures, use a [ VLQ hex] tree.
270
+ A number is either a one-indexed reference into ` p ` ,
271
+ a negative number representing a generic,
272
+ or zero for null.
150
273
151
- An FNumber is a variable-length, self-terminating base16 number
152
- (terminated because the last hexit is lowercase while all others are uppercase).
153
- These are one-indexed references into ` p ` , because zero is used for nulls,
154
- and negative numbers represent generics.
155
- The sign bit is represented using [ zig-zag encoding]
156
274
(the internal object representation also uses negative numbers,
157
275
even after decoding,
158
276
to represent generics).
159
- This alphabet is chosen because the characters can be turned into hexits by
160
- masking off the last four bits of the ASCII encoding.
161
277
162
278
For example, ` {{gb}{d}} ` is equivalent to the json ` [[3, 1], [2]] ` .
163
279
Because of zigzag encoding, `` ` `` is +0, ` a ` is -0 (which is not used),
164
280
` b ` is +1, and ` c ` is -1.
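
As a small illustration of how a decoded signature refers back to `p`, the Python sketch below resolves the JSON form `[[3, 1], [2]]` against the example type dictionary. The function name and the `generic#N` placeholder strings are hypothetical. The only rules taken from the text are that positive numbers are one-indexed references into `p`, negative numbers are generics, and zero is null.

```python
def resolve_signature(tree, p):
    """Replace the numbers in a decoded `f` tree with readable type paths."""
    if isinstance(tree, list):
        return [resolve_signature(node, p) for node in tree]
    if tree == 0:
        return None  # zero stands for null
    if tree < 0:
        return f"generic#{-tree}"  # negative numbers are generics
    _item_type, path = p[tree - 1]  # positive numbers are one-indexed into `p`
    return path


p = [[1, "i32"], [1, "str"], [5, "crate_name::Data"]]
signature = resolve_signature([[3, 1], [2]], p)
```

The result is `[["crate_name::Data", "i32"], ["str"]]`, which is the `Data, i32 -> str` signature described earlier.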
165
281
166
- [ empirically ] : https://github.com/rust-lang/rust/pull/83003
167
- [ zig-zag encoding ] : https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding
168
-
169
282
## Searching by name
170
283
171
284
Searching by name works by looping through the search index