rust-lang · chrissimpkins · Mar 26, 2020 · Mar 28, 2020 · Apr 3, 2020 · Apr 3, 2020
diff --git a/src/SUMMARY.md b/src/SUMMARY.md
@@ -34,6 +34,7 @@
         - [LLVM ICE-breakers](ice-breaker/llvm.md)
     - [Licenses](./licenses.md)
 - [Part 2: How rustc works](./part-2-intro.md)
+    - [Overview of `rustc`](./overview.md)
     - [High-level overview of the compiler source](./high-level-overview.md)
     - [The Rustc Driver and Interface](./rustc-driver.md)
         - [Rustdoc](./rustdoc.md)

diff --git a/src/overview.md b/src/overview.md
@@ -0,0 +1,322 @@
+# Rust compiler Overview
+
+This chapter is about the overall process of compiling a program -- how
+everything fits together.
+
+The Rust compiler is special in two ways: it does things to your code that
+other compilers don't do (e.g. borrow checking) and it has a lot of
+unconventional implementation choices (e.g. queries). We will talk about these
+in turn in this chapter, and in the rest of the guide, we will look at all the
+individual pieces in more detail.
+
+## What the compiler does to your code
+
+So first, let's look at what the compiler does to your code. For now, we will
+avoid mentioning how the compiler implements these steps except as needed;
+we'll talk about that later.
+
+**TODO: Would be great to have a diagram of this once we nail down the details...**
+
+**TODO: someone else should confirm this vvv**
+
+- The compile process begins when a user writes a Rust source program in text and invokes the `rustc` compiler on it. The work that the compiler needs to perform is defined with command line options. For example, it is possible to optionally enable nightly features, perform `check`-only builds, or emit LLVM-IR rather than complete the entire compile process defined here. The `rustc` executable call may be indirect through the use of `cargo`.
+- Command line argument parsing occurs in [`librustc_driver`]. This crate defines the compile configuration that is requested by the user.
+- The raw Rust source text is analyzed by a low-level lexer located in [`librustc_lexer`]. At this stage, the source text is turned into a stream of atomic source code units known as _tokens_. (**TODO**: chrissimpkins - Maybe discuss Unicode handling during this stage?)
+- The token stream passes through a higher-level lexer located in [`librustc_parse`] to prepare for the next stage of the compile process. The [`StringReader`] struct is used at this stage to perform a set of validations and turn strings into interned symbols.
+- (**TODO**: chrissimpkins - Expand info on parser) We then [_parse_ the stream of tokens][parser] to build an Abstract Syntax Tree (AST).
+  - macro expansion (**TODO** chrissimpkins)
+  - ast validation (**TODO** chrissimpkins)
+  - nameres (**TODO** chrissimpkins)
+  - early linting (**TODO** chrissimpkins)
+
+- We then take the AST and [convert it to High-Level Intermediate
+  Representation (HIR)][hir]. This is a compiler-friendly representation of the
+  AST.  This involves a lot of desugaring of things like loops and `async fn`.
+- We use the HIR to do [type inference]. This is the process of automatic
+  detection of the type of an expression. **TODO: how `ty` module fits in
+  here**
+- **TODO: Maybe some other things are done here? I think initial type checking
+  happens here? And trait solving?**
+- The HIR is then [lowered to Mid-Level Intermediate Representation (MIR)][mir].
+- The MIR is used for [borrow checking].
+- **TODO: const eval fits in somewhere here I think**
+- We (want to) do [many optimizations on the MIR][mir-opt] because it is still
+  generic and that improves the code we generate later, improving compilation
+  speed too. (**TODO: size optimizations too?**)
+  - MIR is a higher level (and generic) representation, so it is easier to do
+    some optimizations at MIR level than at LLVM-IR level. For example LLVM
+    doesn't seem to be able to optimize the pattern the [`simplify_try`] mir
+    opt looks for.
+- Rust code is _monomorphized_, which means making copies of all the generic
+  code with the type parameters replaced by concrete types. To do
+  this, we need to collect a list of what concrete types to generate code for.
+  This is called _monomorphization collection_.
+- We then begin what is vaguely called _code generation_ or _codegen_.
+  - The [code generation stage (codegen)][codegen] is when higher level
+    representations of source are turned into an executable binary. `rustc`
+    uses LLVM for code generation. The first step is to convert the MIR
+    to LLVM Intermediate Representation (LLVM IR). This is where
+    the MIR is actually monomorphized, according to the list we created in
+    the previous step.
+  - The LLVM IR is passed to LLVM, which does a lot more optimizations on it.
+    It then emits machine code, which is basically assembly code with
+    additional low-level types and annotations added (e.g. an ELF object or
+    wasm). **TODO: reference for this section?**
+  - The different libraries/binaries are linked together to produce the final
+    binary. **TODO: reference for this section?**
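+
+To make the lexing step above more concrete, here is a minimal sketch of
+turning source text into a stream of tokens. The `Token` enum and the
+`tokenize` function are hypothetical names invented for this illustration; the
+real lexer in `librustc_lexer` is considerably more involved (it also has to
+deal with literals, comments, and so on).
+
+```rust
+/// A toy token type; this is NOT the token type `rustc` uses.
+#[derive(Debug, PartialEq)]
+enum Token {
+    Ident(String),
+    Number(String),
+    Punct(char),
+}
+
+/// Split `src` into tokens, skipping whitespace.
+fn tokenize(src: &str) -> Vec<Token> {
+    let mut tokens = Vec::new();
+    let mut chars = src.chars().peekable();
+    while let Some(&c) = chars.peek() {
+        if c.is_whitespace() {
+            chars.next();
+        } else if c.is_alphabetic() || c == '_' {
+            // Identifiers: a letter or `_`, then letters, digits, or `_`.
+            let mut ident = String::new();
+            while let Some(&c) = chars.peek() {
+                if c.is_alphanumeric() || c == '_' {
+                    ident.push(c);
+                    chars.next();
+                } else {
+                    break;
+                }
+            }
+            tokens.push(Token::Ident(ident));
+        } else if c.is_ascii_digit() {
+            // Numbers: a run of ASCII digits.
+            let mut number = String::new();
+            while let Some(&c) = chars.peek() {
+                if c.is_ascii_digit() {
+                    number.push(c);
+                    chars.next();
+                } else {
+                    break;
+                }
+            }
+            tokens.push(Token::Number(number));
+        } else {
+            // Everything else is treated as single-character punctuation.
+            tokens.push(Token::Punct(c));
+            chars.next();
+        }
+    }
+    tokens
+}
+```
+
+With this sketch, `tokenize("let x = 42;")` produces five tokens: the
+identifiers `let` and `x`, the punctuation `=`, the number `42`, and the
+punctuation `;`. Note that this toy lexer does not distinguish keywords from
+identifiers.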
+
+[`librustc_lexer`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html
+[`librustc_driver`]: https://rust-lang.github.io/rustc-guide/rustc-driver.html
+[lex]: https://rust-lang.github.io/rustc-guide/the-parser.html
+[`StringReader`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/lexer/struct.StringReader.html
+[`librustc_parse`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html
+[parser]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html
+[hir]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/index.html
+[type inference]: https://rust-lang.github.io/rustc-guide/type-inference.html
+[mir]: https://rust-lang.github.io/rustc-guide/mir/index.html
+[borrow checking]: https://rust-lang.github.io/rustc-guide/borrow_check.html
+[mir-opt]: https://rust-lang.github.io/rustc-guide/mir/optimizations.html
+[`simplify_try`]: https://github.com/rust-lang/rust/pull/66282
+[codegen]: https://rust-lang.github.io/rustc-guide/codegen.html
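+
+As an illustration of monomorphization, consider a generic function that is
+used at two concrete types. The hand-written `largest_i32` and `largest_f64`
+below are hypothetical stand-ins for the kind of copies the compiler
+generates; the compiler does this internally and does not produce Rust source
+with these names.
+
+```rust
+/// Generic source code: one definition, many possible types.
+/// Panics if `items` is empty.
+fn largest<T: PartialOrd + Copy>(items: &[T]) -> T {
+    let mut max = items[0];
+    for &item in items {
+        if item > max {
+            max = item;
+        }
+    }
+    max
+}
+
+/// Roughly what a monomorphized copy for `T = i32` corresponds to.
+fn largest_i32(items: &[i32]) -> i32 {
+    let mut max = items[0];
+    for &item in items {
+        if item > max {
+            max = item;
+        }
+    }
+    max
+}
+
+/// Roughly what a monomorphized copy for `T = f64` corresponds to.
+fn largest_f64(items: &[f64]) -> f64 {
+    let mut max = items[0];
+    for &item in items {
+        if item > max {
+            max = item;
+        }
+    }
+    max
+}
+```
+
+If a program only ever calls `largest(&[1, 5, 3])` and `largest(&[1.5, 0.5])`,
+then `i32` and `f64` are the only instantiations of `largest` that code needs
+to be generated for.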
+
+## How it does it
+
+Ok, so now that we have a high-level view of what the compiler does to your
+code, let's take a high-level view of _how_ it does all that stuff. There are a
+lot of constraints and conflicting goals that the compiler needs to
+satisfy/optimize for. For example,
+
+- Compilation speed: how fast is it to compile a program? More/better
+  compile-time analyses often mean compilation is slower.
+  - Also, we want to support incremental compilation, so we need to take that
+    into account. How can we keep track of what work needs to be redone and
+    what can be reused if the user modifies their program?
+    - Also we can't store too much stuff in the incremental cache because
+      it would take a long time to load from disk and it could take a lot
+      of space on the user's system...
+- Compiler memory usage: while compiling a program, we don't want to use more
+  memory than we need.
+- Program speed: how fast is your compiled program? More/better compile-time
+  analyses often mean the compiler can do better optimizations.
+- Program size: how large is the compiled binary? Similar to the previous
+  point.
+- Compiler compilation speed: how long does it take to compile the compiler?
+  This impacts contributors and compiler maintenance.
+- Compiler implementation complexity: building a compiler is one of the hardest
+  things a person/group can do, and Rust is not a very simple language, so how
+  do we make the compiler's code base manageable?
+- Compiler correctness: the binaries produced by the compiler should do what
+  the input program says it does, and should continue to do so despite the
+  tremendous amount of change constantly going on.
+- Compiler integration: a number of other tools need to use the compiler in
+  various ways (e.g. cargo, clippy, miri, RLS) that must be supported.
+- Compiler stability: the compiler should not crash or fail ungracefully on the
+  stable channel.
+- Rust stability: the compiler must respect Rust's stability guarantees by not
+  breaking programs that previously compiled despite the many changes that are
+  always going on to its implementation.
+- Limitations of other tools: rustc uses LLVM in its backend, and LLVM has some
+  strengths we leverage and some limitations/weaknesses we need to work around.
+
+So, as you read through the rest of the guide, keep these things in mind. They
+will often inform decisions that we make.
+
+### Constant change
+
+Keep in mind that `rustc` is a real production-quality product.
+As such, it has its fair share of codebase churn and technical debt. A lot of
+the designs discussed throughout this guide are idealized designs that are not
+fully realized yet. And things keep changing so that it is hard to keep this
+guide completely up to date on everything!
+
+The compiler definitely has rough edges, but because of its design it is able
+to keep up with the requirements above.
+
+### Intermediate representations
+
+As with most compilers, `rustc` uses some intermediate representations (IRs) to
+facilitate computations. In general, working directly with the source code is
+extremely inconvenient and error-prone. Source code is designed to be human-friendly while at
+the same time being unambiguous, but it's less convenient for doing something
+like, say, type checking.
+
+Instead most compilers, including `rustc`, build some sort of IR out of the
+source code which is easier to analyze. `rustc` has a few IRs, each optimized
+for different purposes:
+
+- Abstract Syntax Tree (AST): the abstract syntax tree is built from the stream
+  of tokens produced by the lexer directly from the source code. It represents
+  pretty much exactly what the user wrote. It helps to do some syntactic sanity
+  checking (e.g. checking that a type is expected where the user wrote one).
+- High-level IR (HIR): This is a sort of desugared AST. It's still close
+  to what the user wrote syntactically, but it includes some implicit things
+  such as some elided lifetimes, etc. This IR is amenable to type checking.
+- HAIR: This is an intermediate between HIR and MIR. This only exists to make
+  it easier to lower HIR to MIR.
+- Middle-level IR (MIR): This IR is basically a Control-Flow Graph (CFG). A CFG
+  is a type of diagram that shows the basic blocks of a program and how control
+  flow can go between them. Likewise, MIR also has a bunch of basic blocks with
+  simple typed statements inside them (e.g. assignment, simple computations,
+  dropping values, etc). MIR is used for borrow checking and a bunch of other
+  important dataflow based checks, such as checking for uninitialized values.
+  It is also used for a bunch of optimizations and for constant evaluation (via
+  MIRI). Because MIR is still generic, we can do a lot of analyses here more
+  efficiently than after monomorphization.
+- LLVM IR: This is the standard form of all input to the LLVM compiler. LLVM IR
+  is a sort of typed assembly language with lots of annotations. It's
+  a standard format that is used by all compilers that use LLVM (e.g. the clang
+  C compiler also outputs LLVM IR). LLVM IR is designed to be easy for other
+  compilers to emit and also rich enough for LLVM to run a bunch of
+  optimizations on it.
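+
+As a rough illustration of the kind of desugaring that separates the AST from
+the HIR, a `for` loop can be rewritten by hand in terms of `loop` and `match`.
+This is a simplified sketch written in surface Rust, not actual compiler
+output, and the function names are made up for this example.
+
+```rust
+/// What the user writes.
+fn sum_with_for(values: &[u32]) -> u32 {
+    let mut total = 0;
+    for v in values {
+        total += v;
+    }
+    total
+}
+
+/// Approximately the same function with the `for` loop desugared by hand.
+fn sum_desugared(values: &[u32]) -> u32 {
+    let mut total = 0;
+    // The `for` loop first turns its operand into an iterator...
+    let mut iter = IntoIterator::into_iter(values);
+    // ...and then repeatedly calls `next` until the iterator is exhausted.
+    loop {
+        match iter.next() {
+            Some(v) => total += v,
+            None => break,
+        }
+    }
+    total
+}
+```
+
+Both functions compute the same result; the second form has fewer kinds of
+syntax for later analyses to deal with.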
+
+### Queries
+
+The first big implementation choice is the _query_ system. The Rust compiler
+uses a query system which is unlike most textbook compilers, which are
+organized as a series of passes over the code that execute sequentially. The
+compiler does this to make incremental compilation possible -- that is, if the
+user makes a change to their program and recompiles, we want to do as little
+redundant work as possible to produce the new binary.
+
+In `rustc`, all the major steps above are organized as a bunch of queries that
+call each other. For example, there is a query to ask for the type of something
+and another to ask for the optimized MIR of a function. These
+queries can call each other and are all tracked through the query system, and
+the results of the queries are cached on disk so that we can tell which
+queries' results changed from the last compilation and only redo those. This is
+how incremental compilation works.
+
+In principle, for the query-fied steps, we do each of the above for each item
+individually. For example, we will take the HIR for a function and use queries
+to ask for the LLVM IR for that HIR. This drives the generation of optimized
+MIR, which drives the borrow checker, which drives the generation of MIR, and
+so on.
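+
+The idea of queries that call each other and cache their results can be
+sketched in a few lines. Everything below is hypothetical and heavily
+simplified: the `Ctxt` struct, its fields, and its methods are invented for
+this illustration and are not `rustc`'s API, the "IRs" are just strings, and
+the cache lives only in memory rather than on disk.
+
+```rust
+use std::collections::HashMap;
+
+/// A toy context holding the inputs and the query caches.
+struct Ctxt {
+    sources: HashMap<String, String>,
+    mir_cache: HashMap<String, String>,
+    llvm_ir_cache: HashMap<String, String>,
+    /// Counts how many times the MIR was actually computed.
+    mir_builds: u32,
+}
+
+impl Ctxt {
+    fn new(sources: HashMap<String, String>) -> Self {
+        Ctxt {
+            sources,
+            mir_cache: HashMap::new(),
+            llvm_ir_cache: HashMap::new(),
+            mir_builds: 0,
+        }
+    }
+
+    /// Query: the "optimized MIR" of a function.
+    /// Panics if `name` is not a known function.
+    fn optimized_mir(&mut self, name: &str) -> String {
+        if let Some(cached) = self.mir_cache.get(name) {
+            return cached.clone();
+        }
+        self.mir_builds += 1;
+        let mir = format!("mir({})", self.sources[name]);
+        self.mir_cache.insert(name.to_string(), mir.clone());
+        mir
+    }
+
+    /// Query: the "LLVM IR" of a function. It calls the MIR query,
+    /// so asking for LLVM IR drives the computation of MIR.
+    fn llvm_ir(&mut self, name: &str) -> String {
+        if let Some(cached) = self.llvm_ir_cache.get(name) {
+            return cached.clone();
+        }
+        let ir = format!("llvm({})", self.optimized_mir(name));
+        self.llvm_ir_cache.insert(name.to_string(), ir.clone());
+        ir
+    }
+}
+```
+
+Asking for `llvm_ir("main")` computes the MIR once as a side effect; a later
+call to `optimized_mir("main")` is answered from the cache, so `mir_builds`
+stays at 1.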
+
+... except that this is very over-simplified. In fact, some queries are not
+cached on disk, and some parts of the compiler have to run for all code anyway
+for correctness even if the code is dead code (e.g. the borrow checker). For
+example, [currently the `mir_borrowck` query is first executed on all functions
+of a crate.][passes] Then the codegen backend invokes the
+`collect_and_partition_mono_items` query, which first recursively requests the
+`optimized_mir` for all reachable functions, which in turn runs `mir_borrowck`
+for that function and then creates codegen units. This kind of split will need
+to remain to ensure that unreachable functions still have their errors emitted.
+
+[passes]: https://github.com/rust-lang/rust/blob/45ebd5808afd3df7ba842797c0fcd4447ddf30fb/src/librustc_interface/passes.rs#L824
+
+Moreover, the compiler wasn't originally built to use a query system; the query
+system has been retrofitted into the compiler, so parts of it are not
+query-fied yet. Also, LLVM isn't our code, so that isn't query-fied
+either. The plan is to eventually query-fy all of the steps listed in the
+previous section, but as of this writing, only the steps between HIR and
+LLVM-IR are query-fied. That is, lexing and parsing are done all at once for
+the whole program.
+
+One other thing to mention here is the all-important "typing context",
+[`TyCtxt`], which is a giant struct that is at the center of all things. All
+queries are defined as methods on the [`TyCtxt`] type, and the in-memory query
+cache is stored there too. In the code, there is usually a variable called
+`tcx` which is a handle on the typing context. You will also see lifetimes with
+the name `'tcx`, which means that something is tied to the lifetime of the
+`TyCtxt` (usually it is stored or _interned_ there).
+
+[`TyCtxt`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc/ty/struct.TyCtxt.html
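+
+To give a feel for what _interning_ means, here is a minimal sketch of a
+string interner. The `Interner` type is hypothetical and exists only for this
+example; it hands out integer ids for simplicity, whereas in the compiler
+interned values are tied to the `'tcx` lifetime described above.
+
+```rust
+use std::collections::HashMap;
+
+/// A toy interner: equal strings always get the same small integer id.
+#[derive(Default)]
+struct Interner {
+    ids: HashMap<String, u32>,
+    strings: Vec<String>,
+}
+
+impl Interner {
+    /// Return the id for `s`, storing `s` if it has not been seen before.
+    fn intern(&mut self, s: &str) -> u32 {
+        if let Some(&id) = self.ids.get(s) {
+            return id;
+        }
+        let id = self.strings.len() as u32;
+        self.ids.insert(s.to_string(), id);
+        self.strings.push(s.to_string());
+        id
+    }
+
+    /// Look up the string behind an id handed out by `intern`.
+    fn resolve(&self, id: u32) -> &str {
+        &self.strings[id as usize]
+    }
+}
+```
+
+Because each distinct value is stored only once, comparing two interned values
+is as cheap as comparing their ids.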
+
+### `ty::Ty`
+
+Types are really important in Rust, and they form the core of a lot of compiler
+analyses. The main type (in the compiler) that represents types (in the user's
+program) is [`rustc::ty::Ty`][ty]. This is so important that we have a whole chapter
+on [`ty::Ty`][ty], but for now, we just want to mention that it exists and is the way
+`rustc` represents types!
+
+Oh, and also the `rustc::ty` module defines the `TyCtxt` struct we mentioned before.
+
+[ty]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc/ty/type.Ty.html
+
+### Parallelism
+
+Compiler performance is a problem that we would like to improve on
+(and are always working on). One aspect of that is parallelizing
+`rustc` itself.
+
+Currently, there is only one part of rustc that is already parallel: codegen.
+During monomorphization, the compiler will split up all the code to be
+generated into smaller chunks called _codegen units_. These are then generated
+by independent instances of LLVM. Since they are independent, we can run them
+in parallel. At the end, the linker is run to combine all the codegen units
+together into one binary.
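+
+The shape of this scheme can be sketched with scoped threads from the standard
+library. This is only an analogy under simplifying assumptions: the
+`codegen_unit` and `compile_in_parallel` functions are hypothetical,
+"generating code" is just string formatting, and "linking" is concatenation.
+
+```rust
+use std::thread;
+
+/// Pretend code generation for one codegen unit.
+fn codegen_unit(functions: &[&str]) -> String {
+    functions
+        .iter()
+        .map(|f| format!("<code for {}>", f))
+        .collect::<Vec<_>>()
+        .concat()
+}
+
+/// Split `functions` into units of `unit_size` (which must be non-zero),
+/// generate each unit on its own thread, then "link" the results in order.
+fn compile_in_parallel(functions: &[&str], unit_size: usize) -> String {
+    let objects: Vec<String> = thread::scope(|scope| {
+        // Spawn all units first so that they can run concurrently...
+        let handles: Vec<_> = functions
+            .chunks(unit_size)
+            .map(|unit| scope.spawn(move || codegen_unit(unit)))
+            .collect();
+        // ...and only then wait for each of them.
+        handles.into_iter().map(|h| h.join().unwrap()).collect()
+    });
+    // The "linker": combine all units into one output.
+    objects.concat()
+}
+```
+
+Because the units do not depend on each other, the result is the same as
+generating everything on a single thread, e.g.
+`compile_in_parallel(&["a", "b", "c"], 2)` equals `codegen_unit(&["a", "b", "c"])`.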
+
+However, the rest of the compiler is still not yet parallel. There have been
+lots of efforts spent on this, but it is generally a hard problem. The current
+approach is (**TODO: verify**) to turn `RefCell`s into `Mutex`s -- that is, we
+switch to thread-safe internal mutability. However, there are ongoing
+challenges with lock contention, maintaining query-system invariants under
+concurrency, and the complexity of the code base. One can try out the current
+work by enabling parallel compilation in `config.toml`. It's still early days,
+but there are already some promising performance improvements.
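+
+The difference between the two kinds of internal mutability can be seen in a
+small, self-contained sketch. The `SharedCache` type and `fill_cache` function
+are hypothetical and unrelated to the compiler's actual data structures; they
+only show why a `Mutex` is needed once several threads share the same state.
+
+```rust
+use std::collections::HashMap;
+use std::sync::Mutex;
+use std::thread;
+
+/// A cache that several threads can fill concurrently.
+/// With `RefCell` in place of `Mutex`, sharing `&SharedCache` across
+/// threads would not compile, because `RefCell` is not `Sync`.
+struct SharedCache {
+    results: Mutex<HashMap<u32, u32>>,
+}
+
+impl SharedCache {
+    /// Return the cached square of `key`, computing it if necessary.
+    /// The lock is held for the whole call, which is simple but is also
+    /// exactly where lock contention comes from.
+    fn get_or_compute(&self, key: u32) -> u32 {
+        let mut results = self.results.lock().unwrap();
+        *results.entry(key).or_insert_with(|| key * key)
+    }
+}
+
+/// Fill the cache from one thread per key and return how many
+/// distinct entries ended up in it.
+fn fill_cache(keys: &[u32]) -> usize {
+    let cache = SharedCache {
+        results: Mutex::new(HashMap::new()),
+    };
+    thread::scope(|scope| {
+        for &key in keys {
+            let cache = &cache;
+            scope.spawn(move || cache.get_or_compute(key));
+        }
+    });
+    let len = cache.results.lock().unwrap().len();
+    len
+}
+```
+
+Duplicate keys are computed only once no matter which thread gets there first,
+so `fill_cache(&[1, 2, 2, 3])` returns 3.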
+
+### Bootstrapping
+
+**TODO (or do we want such a section)?**
+
+# Unresolved Questions
+
+**TODO: find answers to these**
+
+- Does LLVM ever do optimizations in debug builds?
+- How do I explore phases of the compile process in my own sources (lexer,
+  parser, HIR, etc)? - e.g., `cargo rustc -- -Zunpretty=hir-tree` allows you to
+  view HIR representation
+- What is the main source entry point for `X`?
+- Where do phases diverge for cross-compilation to machine code across
+  different platforms?
+
+# References
+
+- Command line parsing
+  - Guide: [The Rustc Driver and Interface](https://rust-lang.github.io/rustc-guide/rustc-driver.html)
+  - Driver definition: [`rustc_driver`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/)
+  - Main entry point: **TODO**
+- Lexical Analysis: Lex the user program to a stream of tokens
+  - Guide: [Lexing and Parsing](https://rust-lang.github.io/rustc-guide/the-parser.html)
+  - Lexer definition: [`librustc_lexer`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html)
+  - Main entry point: **TODO**
+- Parsing: Parse the stream of tokens to an Abstract Syntax Tree (AST)
+  - Guide: [Lexing and Parsing](https://rust-lang.github.io/rustc-guide/the-parser.html)
+  - Parser definition: [`librustc_parse`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html)
+  - Main entry point: **TODO**
+  - AST definition: [`librustc_ast`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/ast/index.html)
+- The High Level Intermediate Representation (HIR)
+  - Guide: [The HIR](https://rust-lang.github.io/rustc-guide/hir.html)
+  - Guide: [Identifiers in the HIR](https://rust-lang.github.io/rustc-guide/hir.html#identifiers-in-the-hir)
+  - Guide: [The HIR Map](https://rust-lang.github.io/rustc-guide/hir.html#the-hir-map)
+  - Guide: [Lowering AST to HIR](https://rust-lang.github.io/rustc-guide/lowering.html)
+  - How to view HIR representation for your code: `cargo rustc -- -Zunpretty=hir-tree`
+  - Rustc HIR definition: [`rustc_hir`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/index.html)
+  - Main entry point: **TODO**
+- Type Inference
+  - Guide: [Type Inference](https://rust-lang.github.io/rustc-guide/type-inference.html)
+  - Guide: [The ty Module: Representing Types](https://rust-lang.github.io/rustc-guide/ty.html) (semantics)
+  - Main entry point: **TODO**
+- The Mid Level Intermediate Representation (MIR)
+  - Guide: [The MIR (Mid level IR)](https://rust-lang.github.io/rustc-guide/mir/index.html)
+  - Definition: [`librustc/mir`](https://github.com/rust-lang/rust/tree/master/src/librustc/mir)
+  - Definition of source that manipulates the MIR: [`librustc_mir`](https://github.com/rust-lang/rust/tree/master/src/librustc_mir)
+  - Main entry point: **TODO**
+- The Borrow Checker
+  - Guide: [MIR Borrow Check](https://rust-lang.github.io/rustc-guide/borrow_check.html)
+  - Definition: [`rustc_mir/borrow_check`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir/borrow_check/index.html)
+  - Main entry point: **TODO**
+- MIR Optimizations
+  - Guide: [MIR Optimizations](https://rust-lang.github.io/rustc-guide/mir/optimizations.html)
+  - Definition: [`rustc_mir/transform`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir/transform/index.html) **TODO: is this correct?**
+  - Main entry point: **TODO**
+- Code Generation
+  - Guide: [Code Generation](https://rust-lang.github.io/rustc-guide/codegen.html)
+  - Guide: [Generating LLVM IR](https://rust-lang.github.io/rustc-guide/codegen.html#generating-llvm-ir) - **TODO: this is not available yet**
+  - Generating Machine Code from LLVM IR with LLVM - **TODO: reference?**
+  - Main entry point MIR -> LLVM IR: **TODO**
+  - Main entry point LLVM IR -> Machine Code: **TODO**