|
| 1 | +# Rust compiler Overview |
| 2 | + |
| 3 | +This chapter is about the overall process of compiling a program -- how |
| 4 | +everything fits together. |
| 5 | + |
| 6 | +The Rust compiler is special in two ways: it does things to your code that |
| 7 | +other compilers don't do (e.g. borrow checking) and it has a lot of |
| 8 | +unconventional implementation choices (e.g. queries). We will talk about these |
| 9 | +in turn in this chapter, and in the rest of the guide, we will look at all the |
| 10 | +individual pieces in more detail. |
| 11 | + |
| 12 | +## What the compiler does to your code |
| 13 | + |
| 14 | +So first, let's look at what the compiler does to your code. For now, we will |
| 15 | +avoid mentioning how the compiler implements these steps except as needed; |
| 16 | +we'll talk about that later. |
| 17 | + |
| 18 | +**TODO: Would be great to have a diagram of this once we nail down the details...** |
| 19 | + |
| 20 | +**TODO: someone else should confirm this vvv** |
| 21 | + |
| 22 | +- User writes a program and invokes `rustc` on it (possibly through `cargo`). |
| 23 | +- First, we parse command line flags, etc. This is done in [`librustc_driver`]. |
| 24 | + We now know exactly what work we need to do (e.g. which nightly features |
| 25 | + are enabled, whether we are doing a `check`-only build or emitting LLVM-IR or |
| 26 | + a full compilation). |
| 27 | +- Then, we start to do compilation... |
| 28 | +- We first [_lex_ the user program][lex]. This turns the program into a stream |
| 29 | + of _tokens_ (roughly the same sort of tokens that `proc_macro`s operate on). |
| 30 | + [`StringReader`] from [`librustc_parse`] integrates [`librustc_lexer`] with |
| 31 | + `rustc` data structures. |
| 32 | +- We then [_parse_ the stream of tokens][parser] to build an Abstract Syntax |
| 33 | + Tree (AST). |
| 34 | +- We then take the AST and [convert it to High-Level Intermediate |
| 35 | + Representation (HIR)][hir]. This is a compiler-friendly representation of the |
| 36 | + AST. This involves a lot of desugaring of things like loops and `async fn`. |
| 37 | +- We use the HIR to do [type inference]. This is the process of automatic |
| 38 | + detection of the type of an expression. **TODO: how `ty` module fits in |
| 39 | + here** |
| 40 | +- **TODO: Maybe some other things are done here? I think initial type checking |
| 41 | + happens here? And trait solving?** |
| 42 | +- The HIR is then [lowered to Mid-Level Intermediate Representation (MIR)][mir]. |
| 43 | +- The MIR is used for [borrow checking][borrow checker]. |
| 44 | +- **TODO: const eval fits in somewhere here I think** |
| 45 | +- We (want to) do [many optimizations on the MIR][mir-opt] because it is still |
| 46 | + generic: optimizing before monomorphization improves the code we generate |
| 47 | + later and also improves compilation speed. (**TODO: size optimizations too?**) |
| 48 | + - MIR is a higher level (and generic) representation, so it is easier to do |
| 49 | + some optimizations at MIR level than at LLVM-IR level. For example LLVM |
| 50 | + doesn't seem to be able to optimize the pattern the [`simplify_try`] MIR |
| 51 | + optimization looks for. |
| 52 | +- Rust code is _monomorphized_, which means making copies of all the generic |
| 53 | + code with the type parameters replaced by concrete types. In order to do |
| 54 | + this, we need to collect a list of what concrete types to generate code for. |
| 55 | + This is called _monomorphization collection_. |
| 56 | +- We then begin what is vaguely called _code generation_ or _codegen_. |
| 57 | + - The [code generation stage (codegen)][codegen] is when higher level |
| 58 | + representations of source are turned into an executable binary. `rustc` |
| 59 | + uses LLVM for code generation. The first step is to convert the MIR to |
| 60 | + LLVM Intermediate Representation (LLVM IR). This is where |
| 61 | + the MIR is actually monomorphized, according to the list we created in |
| 62 | + the previous step. |
| 63 | + - The LLVM IR, which is basically assembly code with additional low-level |
| 64 | + types and annotations, is passed to LLVM, which does a lot more optimizations |
| 65 | + on it. LLVM then emits machine code (e.g. an ELF object or wasm). |
| 66 | + **TODO: reference for this section?** |
| 67 | + - The different libraries/binaries are linked together to produce the final |
| 68 | + binary. **TODO: reference for this section?** |
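To make the "desugaring" in the AST-to-HIR step a bit more concrete, here is a minimal sketch written in ordinary Rust. It shows a `for` loop next to a hand-written approximation of the `loop` + `match` form it is lowered to. The function names are made up for illustration, and the real lowering happens inside the compiler and differs in detail (hygiene, spans, etc.).

```rust
/// Sums a slice with an ordinary `for` loop, as a user would write it.
fn sum_with_for(values: &[i32]) -> i32 {
    let mut total = 0;
    for v in values {
        total += *v;
    }
    total
}

/// Hand-written approximation of the desugared form: the `for` loop
/// becomes an explicit iterator plus `loop` and `match`.
fn sum_desugared(values: &[i32]) -> i32 {
    let mut total = 0;
    let mut iter = IntoIterator::into_iter(values);
    loop {
        match iter.next() {
            Some(v) => total += *v,
            None => break,
        }
    }
    total
}

fn main() {
    let data = [1, 2, 3, 4];
    // Both versions compute the same result.
    assert_eq!(sum_with_for(&data), sum_desugared(&data));
    println!("both loops sum to {}", sum_with_for(&data));
}
```

After this kind of lowering, later stages only need to understand `loop` and `match`, not every surface-level loop form.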
| 69 | + |
| 70 | +[`librustc_lexer`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html |
| 71 | +[`librustc_driver`]: https://rust-lang.github.io/rustc-guide/rustc-driver.html |
| 72 | +[lex]: https://rust-lang.github.io/rustc-guide/the-parser.html |
| 73 | +[`StringReader`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/lexer/struct.StringReader.html |
| 74 | +[`librustc_parse`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html |
| 75 | +[parser]: https://rust-lang.github.io/rustc-guide/the-parser.html |
| 76 | +[hir]: https://rust-lang.github.io/rustc-guide/lowering.html |
| 77 | +[type inference]: https://rust-lang.github.io/rustc-guide/type-inference.html |
| 78 | +[mir]: https://rust-lang.github.io/rustc-guide/mir/index.html |
| 79 | +[borrow checker]: https://rust-lang.github.io/rustc-guide/borrow_check.html |
| 80 | +[mir-opt]: https://rust-lang.github.io/rustc-guide/mir/optimizations.html |
| 81 | +[`simplify_try`]: https://github.com/rust-lang/rust/pull/66282 |
| 82 | +[codegen]: https://rust-lang.github.io/rustc-guide/codegen.html |
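As an illustration of the monomorphization step in the list above, the sketch below shows a generic function next to hand-written stand-ins for the copies the compiler conceptually generates. The names `largest_i32` and `largest_f64` are hypothetical; the compiler's generated copies do not have such source-level names.

```rust
/// Generic source code: one definition that works for many types.
/// Panics on an empty slice, which is fine for this sketch.
fn largest<T: PartialOrd + Copy>(items: &[T]) -> T {
    let mut best = items[0];
    for &item in items {
        if item > best {
            best = item;
        }
    }
    best
}

/// Hand-written stand-in for the copy generated for `T = i32`.
fn largest_i32(items: &[i32]) -> i32 {
    let mut best = items[0];
    for &item in items {
        if item > best {
            best = item;
        }
    }
    best
}

/// Hand-written stand-in for the copy generated for `T = f64`.
fn largest_f64(items: &[f64]) -> f64 {
    let mut best = items[0];
    for &item in items {
        if item > best {
            best = item;
        }
    }
    best
}

fn main() {
    let ints = [3, 7, 2];
    let floats = [1.5, 0.25, 9.0];
    // The program uses `largest` at two concrete types, so the "collection"
    // step would record two instantiations: `largest::<i32>` and `largest::<f64>`.
    assert_eq!(largest(&ints), largest_i32(&ints));
    assert_eq!(largest(&floats), largest_f64(&floats));
    println!("{} {}", largest(&ints), largest(&floats));
}
```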
| 83 | + |
| 84 | +## How it does it |
| 85 | + |
| 86 | +Ok, so now that we have a high-level view of what the compiler does to your |
| 87 | +code, let's take a high-level view of _how_ it does all that stuff. There are a |
| 88 | +lot of constraints and conflicting goals that the compiler needs to |
| 89 | +satisfy/optimize for. For example, |
| 90 | + |
| 91 | +- Compilation speed: how fast is it to compile a program? More/better |
| 92 | + compile-time analyses often mean compilation is slower. |
| 93 | + - Also, we want to support incremental compilation, so we need to take that |
| 94 | + into account. How can we keep track of what work needs to be redone and |
| 95 | + what can be reused if the user modifies their program? |
| 96 | + - Also we can't store too much stuff in the incremental cache because |
| 97 | + it would take a long time to load from disk and it could take a lot |
| 98 | + of space on the user's system... |
| 99 | +- Compiler memory usage: while compiling a program, we don't want to use more |
| 100 | + memory than we need. |
| 101 | +- Program speed: how fast is your compiled program? More/better compile-time |
| 102 | + analyses often mean the compiler can do better optimizations. |
| 103 | +- Program size: how large is the compiled binary? Similar to the previous |
| 104 | + point. |
| 105 | +- Compiler compilation speed: how long does it take to compile the compiler? |
| 106 | + This impacts contributors and compiler maintenance. |
| 107 | +- Compiler implementation complexity: building a compiler is one of the hardest |
| 108 | + things a person/group can do, and Rust is not a very simple language, so how |
| 109 | + do we make the compiler's code base manageable? |
| 110 | +- Compiler correctness: the binaries produced by the compiler should do what |
| 111 | + the input programs say they do, and should continue to do so despite the |
| 112 | + tremendous amount of change constantly going on. |
| 113 | +- Compiler integration: a number of other tools need to use the compiler in |
| 114 | + various ways (e.g. cargo, clippy, miri, RLS) that must be supported. |
| 115 | +- Compiler stability: the compiler should not crash or fail ungracefully on the |
| 116 | + stable channel. |
| 117 | +- Rust stability: the compiler must respect Rust's stability guarantees by not |
| 118 | + breaking programs that previously compiled despite the many changes that are |
| 119 | + always going on to its implementation. |
| 120 | +- Limitations of other tools: rustc uses LLVM in its backend, and LLVM has some |
| 121 | + strengths we leverage and some limitations/weaknesses we need to work around. |
| 122 | +- And others that I'm probably forgetting. |
| 123 | + |
| 124 | +So, as you read through the rest of the guide, keep these things in mind. They |
| 125 | +will often inform decisions that we make. |
| 126 | + |
| 127 | +### Constant change |
| 128 | + |
| 129 | +One thing to keep in mind is that `rustc` is a real production-quality product. |
| 130 | +As such, it has its fair share of codebase churn and technical debt. A lot of |
| 131 | +the designs discussed throughout this guide are idealized designs that are not |
| 132 | +fully realized yet. And things keep changing so that it is hard to keep this |
| 133 | +guide completely up to date on everything! |
| 134 | + |
| 135 | +The compiler definitely has rough edges, but because of its design it is able |
| 136 | +to keep up with the requirements above. |
| 137 | + |
| 138 | +### Intermediate representations |
| 139 | + |
| 140 | +As with most compilers, `rustc` uses some intermediate representations (IRs) to |
| 141 | +facilitate computations. In general, working directly with the source code is |
| 142 | +extremely inconvenient. Source code is designed to be human-friendly while at |
| 143 | +the same time being unambiguous, but it's less convenient for doing something |
| 144 | +like, say, type checking. |
| 145 | + |
| 146 | +Instead most compilers, including `rustc`, build some sort of IR out of the |
| 147 | +source code which is easier to analyze. `rustc` has a few IRs, each optimized |
| 148 | +for different things: |
| 149 | + |
| 150 | +- Abstract Syntax Tree (AST): the abstract syntax tree is built from the stream |
| 151 | + of tokens produced by the lexer directly from the source code. It represents |
| 152 | + pretty much exactly what the user wrote. It helps to do some syntactic sanity |
| 153 | + checking (e.g. checking that a type is expected where the user wrote one). |
| 154 | +- High-level IR (HIR): This is a sort of very desugared AST. It's still close |
| 155 | + to what the user wrote syntactically, but it includes some implicit things |
| 156 | + such as some elided lifetimes, etc. This IR is amenable to type checking. |
| 157 | +- HAIR: This is an intermediate between HIR and MIR. This only exists to make |
| 158 | + it easier to lower HIR to MIR. |
| 159 | +- Mid-level IR (MIR): This IR is basically a Control-Flow Graph (CFG). A CFG |
| 160 | + is a type of diagram that shows the basic blocks of a program and how control |
| 161 | + flow can go between them. Likewise, MIR also has a bunch of basic blocks with |
| 162 | + simple typed statements inside them (e.g. assignment, simple computations, |
| 163 | + dropping values, etc). MIR is used for borrow checking and a bunch of other |
| 164 | + important dataflow based checks, such as checking for uninitialized values. |
| 165 | + It is also used for a bunch of optimizations and for constant evaluation (via |
| 166 | + MIRI). Because MIR is still generic, we can do a lot of analyses here more |
| 167 | + efficiently than after monomorphization. |
| 168 | +- LLVM IR: This is the standard form of all input to the LLVM compiler. LLVM IR |
| 169 | + is basically a sort of typed assembly language with lots of annotations. It's |
| 170 | + a standard format that is used by all compilers that use LLVM (e.g. the clang |
| 171 | + C compiler also outputs LLVM IR). LLVM IR is designed to be easy for other |
| 172 | + compilers to emit and also rich enough for LLVM to run a bunch of |
| 173 | + optimizations on it. |
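To give a feel for what "basic blocks with simple statements and control flow between them" means, here is a toy CFG sketch. All types here (`BasicBlock`, `Terminator`, and so on) are hypothetical and far simpler than the real MIR data structures; statements are just strings. The `reachable` function is a stand-in for the kind of dataflow-style walk that MIR makes easy.

```rust
/// How control leaves a basic block (toy version, not the real MIR type).
#[derive(Debug)]
enum Terminator {
    Goto(usize),
    Switch { if_true: usize, if_false: usize },
    Return,
}

/// A straight-line run of statements followed by one terminator.
#[derive(Debug)]
struct BasicBlock {
    statements: Vec<&'static str>,
    terminator: Terminator,
}

impl BasicBlock {
    /// Indices of the blocks that control flow can go to next.
    fn successors(&self) -> Vec<usize> {
        match &self.terminator {
            Terminator::Goto(target) => vec![*target],
            Terminator::Switch { if_true, if_false } => vec![*if_true, *if_false],
            Terminator::Return => vec![],
        }
    }
}

/// Roughly `if cond { x = 1 } else { x = 2 }; return x`, plus a dead block.
fn example_cfg() -> Vec<BasicBlock> {
    vec![
        BasicBlock { statements: vec!["cond = true"], terminator: Terminator::Switch { if_true: 1, if_false: 2 } },
        BasicBlock { statements: vec!["x = 1"], terminator: Terminator::Goto(3) },
        BasicBlock { statements: vec!["x = 2"], terminator: Terminator::Goto(3) },
        BasicBlock { statements: vec!["ret = x"], terminator: Terminator::Return },
        BasicBlock { statements: vec!["x = 3"], terminator: Terminator::Return },
    ]
}

/// Marks every block reachable from block 0 with a depth-first walk.
fn reachable(blocks: &[BasicBlock]) -> Vec<bool> {
    let mut seen = vec![false; blocks.len()];
    if blocks.is_empty() {
        return seen;
    }
    let mut stack: Vec<usize> = vec![0];
    while let Some(index) = stack.pop() {
        if seen[index] {
            continue;
        }
        seen[index] = true;
        stack.extend(blocks[index].successors());
    }
    seen
}

fn main() {
    let cfg = example_cfg();
    let seen = reachable(&cfg);
    for (index, block) in cfg.iter().enumerate() {
        println!(
            "bb{}: {:?} -> {:?} (reachable: {})",
            index, block.statements, block.terminator, seen[index]
        );
    }
}
```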
| 174 | + |
| 175 | +### Queries |
| 176 | + |
| 177 | +The first big implementation choice is the _query_ system. The Rust compiler |
| 178 | +uses a query system which is unlike most textbook compilers, which are |
| 179 | +organized as a series of passes over the code that execute sequentially. The |
| 180 | +compiler does this to make incremental compilation possible -- that is, if the |
| 181 | +user makes a change to their program and recompiles, we want to do as little |
| 182 | +redundant work as possible to produce the new binary. |
| 183 | + |
| 184 | +In rustc, all the major steps above are organized as a bunch of queries that |
| 185 | +call each other. For example, there is a query to ask for the type of something |
| 186 | +and another to ask for the optimized MIR of a function, and so on. These |
| 187 | +queries can call each other and are all tracked through the query system, and |
| 188 | +the results of the queries are cached on disk so that we can tell which |
| 189 | +queries' results changed from the last compilation and only redo those. This is |
| 190 | +how incremental compilation works. |
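The idea of queries that call each other and cache their results can be sketched with a toy context type. Everything below is hypothetical: `Ctxt` is not the real `TyCtxt`, the "MIR" is just a string, and this only models in-memory memoization. It does not model dependency tracking or the on-disk cache that incremental compilation relies on.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;

/// Toy stand-in for the compiler context that owns the query caches.
struct Ctxt {
    /// "Source code": function name -> body text.
    source: HashMap<&'static str, &'static str>,
    mir_cache: RefCell<HashMap<&'static str, String>>,
    optimized_cache: RefCell<HashMap<&'static str, String>>,
    /// Counts how often the MIR-building step actually runs.
    mir_builds: Cell<u32>,
}

impl Ctxt {
    fn new(source: HashMap<&'static str, &'static str>) -> Self {
        Ctxt {
            source,
            mir_cache: RefCell::new(HashMap::new()),
            optimized_cache: RefCell::new(HashMap::new()),
            mir_builds: Cell::new(0),
        }
    }

    /// Query: build the (fake) MIR for a function, at most once per name.
    fn mir_built(&self, name: &'static str) -> String {
        let cached = self.mir_cache.borrow().get(name).cloned();
        if let Some(hit) = cached {
            return hit;
        }
        self.mir_builds.set(self.mir_builds.get() + 1);
        let mir = format!("mir({})", self.source[name]);
        self.mir_cache.borrow_mut().insert(name, mir.clone());
        mir
    }

    /// Query: optimized MIR. It is computed by calling another query.
    fn optimized_mir(&self, name: &'static str) -> String {
        let cached = self.optimized_cache.borrow().get(name).cloned();
        if let Some(hit) = cached {
            return hit;
        }
        let optimized = format!("optimized({})", self.mir_built(name));
        self.optimized_cache.borrow_mut().insert(name, optimized.clone());
        optimized
    }
}

fn main() {
    let mut source = HashMap::new();
    source.insert("main", "fn main() {}");
    let tcx = Ctxt::new(source);
    let first = tcx.optimized_mir("main");
    let second = tcx.optimized_mir("main");
    assert_eq!(first, second);
    // The second request was answered from the cache.
    assert_eq!(tcx.mir_builds.get(), 1);
    println!("{}", first);
}
```

Note that the caller only asks for the final result it wants; the intermediate work is pulled in on demand by the queries themselves.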
| 191 | + |
| 192 | +In principle, for the query-fied steps, we do each of the above for each item |
| 193 | +individually. For example, we will take the HIR for a function and use queries |
| 194 | +to ask for the LLVM IR for that HIR. This drives the generation of optimized |
| 195 | +MIR, which drives the borrow checker, which drives the generation of MIR, and |
| 196 | +so on. |
| 197 | + |
| 198 | +... except that this is very over-simplified. In fact, some queries are not |
| 199 | +cached on disk, and some parts of the compiler have to run for all code anyway |
| 200 | +for correctness even if the code is dead code (e.g. the borrow checker). For |
| 201 | +example, [currently the `mir_borrowck` query is first executed on all functions |
| 202 | +of a crate.][passes] Then the codegen backend invokes the |
| 203 | +`collect_and_partition_mono_items` query, which first recursively requests the |
| 204 | +`optimized_mir` for all reachable functions, which in turn runs `mir_borrowck` |
| 205 | +for that function and then creates codegen units. This kind of split will need |
| 206 | +to remain to ensure that unreachable functions still have their errors emitted. |
| 207 | + |
| 208 | +[passes]: https://github.com/rust-lang/rust/blob/45ebd5808afd3df7ba842797c0fcd4447ddf30fb/src/librustc_interface/passes.rs#L824 |
| 209 | + |
| 210 | +Moreover, the compiler wasn't originally built to use a query system; the query |
| 211 | +system has been retrofitted into the compiler, so parts of it are not |
| 212 | +query-fied yet. Also, LLVM isn't our code, so obviously that isn't query-fied |
| 213 | +either. The plan is to eventually query-fy all of the steps listed in the |
| 214 | +previous section, but as of this writing, only the steps between HIR and |
| 215 | +LLVM-IR are query-fied. That is, lexing and parsing are done all at once for |
| 216 | +the whole program. |
| 217 | + |
| 218 | +One other thing to mention here is the all-important "typing context", |
| 219 | +[`TyCtxt`], which is a giant struct that is at the center of all things. All |
| 220 | +queries are defined as methods on the [`TyCtxt`] type, and the in-memory query |
| 221 | +cache is stored there too. In the code, there is usually a variable called |
| 222 | +`tcx` which is a handle on the typing context. You will also see lifetimes with |
| 223 | +the name `'tcx`, which means that something is tied to the lifetime of the |
| 224 | +`TyCtxt` (usually it is stored or _interned_ there). |
| 225 | + |
| 226 | +[`TyCtxt`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc/ty/struct.TyCtxt.html |
| 227 | + |
| 228 | +### `ty::Ty` |
| 229 | + |
| 230 | +Types are really important in Rust, and they form the core of a lot of compiler |
| 231 | +analyses. The main type (in the compiler) that represents types (in the user's |
| 232 | +program) is [`rustc::ty::Ty`][ty]. This is so important that we have a whole chapter |
| 233 | +on [`ty::Ty`][ty], but for now, we just want to mention that it exists and is the way |
| 234 | +`rustc` represents types! |
| 235 | + |
| 236 | +Oh, and also the `rustc::ty` module defines the `TyCtxt` struct we mentioned before. |
| 237 | + |
| 238 | +[ty]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc/ty/type.Ty.html |
| 239 | + |
| 240 | +### Parallelism |
| 241 | + |
| 242 | +Compiler performance is a problem that we would very much like to improve on |
| 243 | +(and are always working on). One aspect of that is attempting to parallelize |
| 244 | +`rustc` itself. |
| 245 | + |
| 246 | +Currently, there is only one part of rustc that is already parallel: codegen. |
| 247 | +During monomorphization, the compiler will split up all the code to be |
| 248 | +generated into smaller chunks called _codegen units_. These are then generated |
| 249 | +by independent instances of LLVM. Since they are independent, we can run them |
| 250 | +in parallel. At the end, the linker is run to combine all the codegen units |
| 251 | +together into one binary. |
| 252 | + |
| 253 | +However, the rest of the compiler is still not yet parallel. There have been |
| 254 | +lots of efforts spent on this, but it is generally a hard problem. The current |
| 255 | +approach is (**TODO: verify**) to turn `RefCell`s into `Mutex`es -- that is, we |
| 256 | +switch to thread-safe internal mutability. However, there are ongoing |
| 257 | +challenges with lock contention, maintaining query-system invariants under |
| 258 | +concurrency, and the complexity of the code base. One can try out the current |
| 259 | +work by enabling parallel compilation in `config.toml`. It's still early days, |
| 260 | +but there are already some promising performance improvements. |
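As a rough illustration of what "turning `RefCell`s into `Mutex`es" means, here is a small sketch of a cache that uses thread-safe interior mutability. `SharedCache` and `expensive` are hypothetical names, and this is not how the compiler's caches are actually implemented; it only shows the general pattern using the standard library.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// A cache with thread-safe interior mutability. With a `RefCell` in place
/// of the `Mutex`, this type would not be `Sync` and could not be shared
/// between threads.
struct SharedCache {
    entries: Mutex<HashMap<u32, u64>>,
}

impl SharedCache {
    fn new() -> Self {
        SharedCache { entries: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached value for `key`, computing it on a miss.
    /// The lock is held during the computation, which keeps the sketch
    /// simple but is exactly the kind of thing that causes contention.
    fn get_or_compute(&self, key: u32) -> u64 {
        let mut entries = self.entries.lock().unwrap();
        *entries.entry(key).or_insert_with(|| expensive(key))
    }

    fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}

/// Placeholder for some costly piece of work.
fn expensive(key: u32) -> u64 {
    u64::from(key) * u64::from(key)
}

/// Spawns one thread per key; all threads share a single cache.
fn fill_cache_in_parallel(keys: &[u32]) -> Arc<SharedCache> {
    let cache = Arc::new(SharedCache::new());
    let handles: Vec<_> = keys
        .iter()
        .map(|&key| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || cache.get_or_compute(key))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    cache
}

fn main() {
    let cache = fill_cache_in_parallel(&[2, 3, 3, 4]);
    // The duplicate key only produces one entry.
    assert_eq!(cache.len(), 3);
    println!("4 squared is {}", cache.get_or_compute(4));
}
```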
| 261 | + |
| 262 | +### Bootstrapping |
| 263 | + |
| 264 | +**TODO (or do we want such a section)?** |
| 265 | + |
| 266 | +## A flow chart or walkthrough diagram |
| 267 | + |
| 268 | +**TODO** |
| 269 | + |
| 270 | +# Unresolved Questions |
| 271 | + |
| 272 | +**TODO: find answers to these** |
| 273 | + |
| 274 | +- Does LLVM ever do optimizations in debug builds? |
| 275 | +- How do I explore phases of the compile process in my own sources (lexer, |
| 276 | + parser, HIR, etc)? - e.g., `cargo rustc -- -Zunpretty=hir-tree` allows you to |
| 277 | + view HIR representation |
| 278 | +- What is the main source entry point for `X`? |
| 279 | +- Where do phases diverge for cross-compilation to machine code across |
| 280 | + different platforms? |
| 281 | + |
| 282 | +# References |
| 283 | + |
| 284 | +- Command line parsing |
| 285 | + - Guide: [The Rustc Driver and Interface](https://rust-lang.github.io/rustc-guide/rustc-driver.html) |
| 286 | + - Driver definition: [`rustc_driver`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/) |
| 287 | + - Main entry point: **TODO** |
| 288 | +- Lexical Analysis: Lex the user program to a stream of tokens |
| 289 | + - Guide: [Lexing and Parsing](https://rust-lang.github.io/rustc-guide/the-parser.html) |
| 290 | + - Lexer definition: [`librustc_lexer`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html) |
| 291 | + - Main entry point: **TODO** |
| 292 | +- Parsing: Parse the stream of tokens to an Abstract Syntax Tree (AST) |
| 293 | + - Guide: [Lexing and Parsing](https://rust-lang.github.io/rustc-guide/the-parser.html) |
| 294 | + - Parser definition: [`librustc_parse`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html) |
| 295 | + - Main entry point: **TODO** |
| 296 | + - AST definition: [`syntax`](https://doc.rust-lang.org/nightly/nightly-rustc/syntax/index.html) |
| 297 | +- The High Level Intermediate Representation (HIR) |
| 298 | + - Guide: [The HIR](https://rust-lang.github.io/rustc-guide/hir.html) |
| 299 | + - Guide: [Identifiers in the HIR](https://rust-lang.github.io/rustc-guide/hir.html#identifiers-in-the-hir) |
| 300 | + - Guide: [The HIR Map](https://rust-lang.github.io/rustc-guide/hir.html#the-hir-map) |
| 301 | + - Guide: [Lowering AST to HIR](https://rust-lang.github.io/rustc-guide/lowering.html) |
| 302 | + - How to view HIR representation for your code `cargo rustc -- -Zunpretty=hir-tree` |
| 303 | + - Rustc HIR definition: [`rustc_hir`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/index.html) |
| 304 | + - Main entry point: **TODO** |
| 305 | +- Type Inference |
| 306 | + - Guide: [Type Inference](https://rust-lang.github.io/rustc-guide/type-inference.html) |
| 307 | + - Guide: [The ty Module: Representing Types](https://rust-lang.github.io/rustc-guide/ty.html) (semantics) |
| 308 | + - Main entry point: **TODO** |
| 309 | +- The Mid Level Intermediate Representation (MIR) |
| 310 | + - Guide: [The MIR (Mid level IR)](https://rust-lang.github.io/rustc-guide/mir/index.html) |
| 311 | + - Definition: [`librustc/mir`](https://github.com/rust-lang/rust/tree/master/src/librustc/mir) |
| 312 | + - Definition of source that manipulates the MIR: [`librustc_mir`](https://github.com/rust-lang/rust/tree/master/src/librustc_mir) |
| 313 | + - Main entry point: **TODO** |
| 314 | +- The Borrow Checker |
| 315 | + - Guide: [MIR Borrow Check](https://rust-lang.github.io/rustc-guide/borrow_check.html) |
| 316 | + - Definition: [`rustc_mir/borrow_check`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir/borrow_check/index.html) |
| 317 | + - Main entry point: **TODO** |
| 318 | +- MIR Optimizations |
| 319 | + - Guide: [MIR Optimizations](https://rust-lang.github.io/rustc-guide/mir/optimizations.html) |
| 320 | + - Definition: [`rustc_mir/transform`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir/transform/index.html) **TODO: is this correct?** |
| 321 | + - Main entry point: **TODO** |
| 322 | +- Code Generation |
| 323 | + - Guide: [Code Generation](https://rust-lang.github.io/rustc-guide/codegen.html) |
| 324 | + - Guide: [Generating LLVM IR](https://rust-lang.github.io/rustc-guide/codegen.html#generating-llvm-ir) - **TODO: this is not available yet** |
| 325 | + - Generating Machine Code from LLVM IR with LLVM - **TODO: reference?** |
| 326 | + - Main entry point MIR -> LLVM IR: **TODO** |
| 327 | + - Main entry point LLVM IR -> Machine Code **TODO** |