In this chapter, we aim to describe `rustc` at a very high level to show how its various areas work together. We introduce the demand-driven query architecture and organize the areas of `rustc` into a graph according to their information dependencies. For each area, we describe its purpose and motivation, and provide a small summary with links to chapters that go into more detail.
UNDER CONSTRUCTION
- This section lacks motivation and reasoning for the query architecture.
UNRESOLVED QUESTIONS
- How do we describe the relationship between trait solving and the areas it intertwines with?
- How do we describe error handling and `rustc_interface`?
- Where/how do we describe the distinction between early lints and late lints? Do we already talk about them briefly in this section?
Conventional compiler texts often describe traditional compilers as being organized into linear passes, where a pass eagerly produces information for a subsequent pass to consume. `rustc` is organized a bit differently. For the most part, `rustc` adopts a "demand-driven" system, where one area provides information for another area that demands it via lazily-computed queries. If we view the compiler's knowledge about a crate as a database, then queries are functions which ask the database for information. Initially, this database is empty. It is populated on demand, as areas request information through queries. Queries can demand information from other queries, which in turn can demand information from area providers. This information dependency flow forms a directed acyclic graph (DAG) at the item granularity, whose nodes are queries and whose leaves are the areas that provide the demanded information.
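As a rough illustration of this demand-driven flow, here is a minimal, self-contained sketch (not actual rustc source; `QueryCache`, `type_of`, and `layout_of` are purely illustrative names) of queries that are computed lazily, cached, and that demand other queries:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Toy "database" of knowledge about a crate: initially empty, populated on demand.
#[derive(Default)]
struct QueryCache {
    types: RefCell<HashMap<&'static str, String>>,
}

impl QueryCache {
    // A leaf query: pretend an earlier area of the compiler provides this answer.
    fn type_of(&self, item: &'static str) -> String {
        self.types
            .borrow_mut()
            .entry(item)
            .or_insert_with(|| format!("the type of `{item}`")) // computed once, then cached
            .clone()
    }

    // A query that demands another query instead of eagerly receiving its input.
    fn layout_of(&self, item: &'static str) -> String {
        let ty = self.type_of(item); // demanded lazily
        format!("a layout computed from {ty}")
    }
}

fn main() {
    let db = QueryCache::default();
    println!("{}", db.layout_of("foo")); // triggers `type_of("foo")` on demand
    println!("{}", db.type_of("foo"));   // cache hit: not recomputed
}
```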
UNDER CONSTRUCTION Insert a query diagram to illustrate the above point.
Ideally, the entire compiler would be organized as queries. However, as of the time of writing (March 2024), `rustc` still has several areas which are not yet querified and work like conventional passes. The following diagram illustrates this distinction as well as the information dependencies between areas.
FIXME What do we want the arrows to actually mean?
To produce the final executable, we take one of several code generation pathways. For example, to produce an executable via the LLVM pathway, we rely on LLVM IR synthesized by SSA code generation. Code generation in turn relies on information produced by Monomorphization Collection, and so on. We keep following the dependencies of the querified areas until we reach HIR Lowering, which depends on information from non-querified areas of the compiler. At that boundary, we draw a distinction between the pull-based information flow of the querified areas and the push-based information flow of the non-querified areas.
To pass information between areas, the compiler uses various Intermediate Representations (IRs), which provide common interfaces to allow areas to communicate with each other. These include but are not limited to:
- Tokens
- Abstract Syntax Tree (AST)
- High-Level Intermediate Representation (HIR)
- Mid-Level Intermediate Representation (MIR)
These IRs gradually transform the surface Rust syntax by simplifying language constructs, and enrich the information available to lower-level areas through many analyses.
We will explain each area in more detail in the next section.
The compiler driver (`rustc_driver`) is responsible for exercising the unstable interface exposed by the compiler interface (`rustc_interface`) to orchestrate the compilation phases and queries in the correct order. Splitting the driver and the interface allows third parties (such as `rustdoc`) to use rustc's internals as a library through the interface.
The compiler takes the input, a stream of Unicode characters, and transforms it into a sequence of `Token`s called a `TokenStream`.
This has multiple benefits:
- Separation of concerns: the lexer handles lower-level details such as interning identifiers, collecting numeric literals, identifying keywords and operators, associating source locations with tokens, and generating useful error reports for these cases. This simplifies the job of the parser, as it can then work with tokens, which are a higher-level construct than Unicode characters.
- Simplifying the grammar: with a separate lexer, the parser can focus on accepting a valid program according to a grammar defined in terms of tokens instead of individual Unicode characters.
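For example (an illustrative sketch; the token categories in the comment are simplified), a small program like the one below is lexed into a flat stream of tokens before any tree structure exists:

```rust
fn main() {
    let x = 1;
    let _ = x;
}
// Lexed (roughly) into: `fn` (keyword), `main` (identifier), `(`, `)`, `{`,
// `let` (keyword), `x` (identifier), `=`, `1` (integer literal), `;`, ...
// with each token carrying a span pointing back into the source text.
```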
See Lexing and Parsing for more details.
Given a token stream, the compiler builds an Abstract Syntax Tree (AST), whose tree structure enables tree-based reasoning and effectively represents the syntactic structure of the program. Rust has a macro system, which means that macros need to be expanded in order to construct the full AST. Initially, a pre-expansion AST is constructed with placeholders for macros that are pending expansion.
See Lexing and Parsing for more details.
TODO: this section needs a lot of work because I don't understand this area too well so far
After the pre-expansion AST is constructed, the compiler tries to build the full AST for the crate, in which all macros are expanded and all modules inlined. Unresolved macros are iteratively expanded, which requires resolving imports and macro names early (i.e. Early Name Resolution), but not other names yet.
To provide import suggestions, the compiler may perform speculative crate loading: it reads crate metadata of import suggestion candidates without producing diagnostics for the speculatively loaded crates.
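As a small illustration of the expansion step (using an ordinary library macro, not compiler internals), the pre-expansion AST contains only a placeholder for the macro invocation below; expansion replaces it with the code the macro generates, and only then is the full AST available. Resolving the macro name `vec!` itself is part of early name resolution.

```rust
fn main() {
    // Before expansion: the AST holds a placeholder node for the unexpanded
    // `vec!` invocation. After expansion the placeholder is replaced by the
    // code the macro generates (roughly: allocate a Vec and push the elements).
    let v: Vec<u32> = vec![1, 2, 3];
    assert_eq!(v.len(), 3);
}
```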
See Macro Expansion and Name Resolution for more details.
Some basic sanity checks are performed on the AST, such as not allowing more than `u16::MAX` parameters, to catch some obvious user errors as well as errors in AST construction or macro expansion.
See AST Validation for more details.
The High-level Intermediate Representation (HIR) is a lower-level IR than the AST. Compared to the AST, HIR has desugared constructs (for example, `for` loops are desugared into `loop`s) to make analysis easier. Compared to MIR, HIR resembles the surface Rust syntax more closely.
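For instance, here is a hedged sketch of the kind of desugaring HIR performs; the second loop approximates, in surface syntax, what the `for` loop is lowered to (the real lowering uses internal, hygienic machinery and preserves spans):

```rust
fn sum(v: &[u32]) -> u32 {
    // Surface syntax:
    let mut total = 0;
    for x in v {
        total += x;
    }

    // Roughly what HIR lowering produces for the `for` loop above:
    let mut total2 = 0;
    let mut iter = IntoIterator::into_iter(v);
    loop {
        match Iterator::next(&mut iter) {
            Some(x) => total2 += x,
            None => break,
        }
    }

    assert_eq!(total, total2);
    total
}
```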
HIR building is the lowering from the Abstract Syntax Tree (AST) into HIR. The HIR is further desugared relative to the AST to simplify analysis. Sanity checks are performed on the lowered HIR to catch obvious lowering mistakes.
See AST Lowering for more details.
HIR type lowering converts the more surface-syntax-like HIR type system entities (types, lifetimes/regions, constants in type positions) into `rustc_middle::ty` representations, which express the semantics of the type.
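As a small illustration, two syntactically different HIR types can lower to the same semantic type: both annotations below denote the same `rustc_middle::ty` type even though their HIR forms differ (the alias name is made up for this example):

```rust
type Bytes = Vec<u8>;

// `Bytes` and `Vec<u8>` are different types *syntactically* (in HIR), but HIR
// type lowering resolves both to the same semantic type, so the two values
// are freely interchangeable.
fn swap(a: Bytes, b: Vec<u8>) -> (Bytes, Vec<u8>) {
    (b, a)
}
```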
See The `ty` module: representing types for more information.
A type is well-formed (WF) when it satisfies a set of requirements which are fundamental to the type's validity. The requirements are type-specific, for example:
- For a tuple to be WF, all elements except the last need to be `Sized`.
- For an algebraic data type (struct/enum/union) to be WF, its generic arguments need to satisfy the `where` clauses on the ADT.
The notion of type well-formedness can be inductively extended to entities that contain types, such as trait predicates. A trait predicate `T0: Trait<P1, ...>` is well-formed if `T0` and `P1, ...` are well-formed, in addition to `Trait`'s `where` clauses being satisfied.
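To make these requirements concrete, here is a small sketch (the type names are made up for illustration):

```rust
use std::fmt::Debug;

// For `Wrapper<T>` to be well-formed at a given `T`, the `where` clause must
// hold, i.e. `T: Debug` must be satisfied.
#[allow(dead_code)]
struct Wrapper<T>
where
    T: Debug,
{
    value: T,
}

struct NoDebug;

fn main() {
    // WF: `u32: Debug` holds, and every tuple element except the last is `Sized`.
    let _ok: (u32, Wrapper<u32>) = (0, Wrapper { value: 1 });

    // Not WF (rejected if uncommented): `NoDebug: Debug` does not hold, so
    // `Wrapper`'s `where` clause is not satisfied.
    // let _err = Wrapper { value: NoDebug };
    let _ = NoDebug;
}
```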
HIR well-formedness checking is an early collection of checks on mostly surface Rust syntax (such as parameters, local variables and return types) which are run before further type-checking and inference, for two purposes:
- For better error-reporting: a program which is not WF is more likely to give hard-to-understand or spurious errors.
- To enable simplifying assumptions in later HIR typeck: such as not having to check whether a tuple's fields are `Sized` when type checking a `tuple.0` expression.
The HIR is well-formed (WF) when it satisfies a set of requirements that include but are not limited to:
- types in type signatures are well-formed
- object safety is respected
- function call ABI requirements (e.g. for "rust-call") are satisfied
- receiver type can be used as a self type
- variances are compatible
- and many more
Note that HIR WF checking is not the only place where WF checks take place: further WF checks in THIR/MIR complement the HIR WF checks.
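For example, one of the checks listed above (object safety) can be illustrated in surface syntax (the trait names are made up for illustration):

```rust
trait Draw {
    fn draw(&self);
}

// WF: `Draw` is object safe, so `dyn Draw` is allowed here.
fn render(d: &dyn Draw) {
    d.draw();
}

trait NotObjectSafe {
    // A generic method makes the trait unusable as `dyn NotObjectSafe`.
    fn apply<T>(&self, value: T);
}

// Not WF (rejected if uncommented): `dyn NotObjectSafe` is not allowed.
// fn broken(_: &dyn NotObjectSafe) {}
```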
🚧 TODO 🚧 (remove the cat meme when complete with draft)
The goal of HIR body type-checking is to catch mistakes that violate the rules of the type system, and to infer types and other information necessary for THIR/MIR building. HIR body type-checking recursively walks and checks each expression. This process produces information such as the type of each expression. During the process, type inference, implicit coercions, trait bound checking and method resolution are performed. Notably,
- Method resolution is not performed in (early/late) Name Resolution because type information is required.
- Type inference is performed on HIR and not MIR because HIR maps closer to the surface Rust syntax (and retains the tree-based structure), which enables more descriptive errors and suggestions to be produced.
- MIR building requires method resolution information from HIR type-checking to guide the lowering.
Compared to MIR typeck, where all types are already specified, HIR typeck has to infer types. HIR typeck also checks all closures together with their parent body, while MIR typeck/borrowck checks them separately.
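The following sketch shows, in surface syntax, the kinds of facts HIR body type checking establishes (the comments describe the inferred information approximately):

```rust
fn main() {
    // Type inference: `v` starts without a known element type; the later
    // `push(1u8)` call lets typeck infer `Vec<u8>` for the `Vec::new()` call.
    let mut v = Vec::new();
    v.push(1u8);

    // Implicit coercion: `&Vec<u8>` is coerced to `&[u8]` to match the
    // annotated type.
    let s: &[u8] = &v;

    // Method resolution: `.len()` is resolved to `<[u8]>::len` here, which
    // requires knowing the type of `s` -- this is why method resolution
    // happens during type checking rather than during name resolution.
    let _n = s.len();
}
```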
See Type checking for more details.
Typed High-level Intermediate Representation (THIR) is a further desugared version of the High-level Intermediate Representation (HIR) with full type information embedded. THIR is the last IR where lexical structure is meaningful for analysis in the chain of lowerings from HIR to THIR to MIR and beyond, where the lower-level IRs transition from being lexically-aware to being dataflow-aware. For example, unsafeck is performed on THIR and not MIR because we care about the lexical structure -- whether something exists inside an `unsafe` block lexically. To further simplify tree-based reasoning, THIR has several simplifications over HIR:
- Auto-deref, auto-ref and coercions are made explicit.
- Method calls are desugared into regular function calls.
- Overloaded operators like `+` are lowered into function calls or builtin operations.
- Match ergonomics are desugared.
With the aforementioned properties, the construction of THIR synthesizes enough information for MIR building, and makes lowering to MIR easier and less error-prone than lowering HIR directly to MIR.
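Here is a hedged sketch of these simplifications in surface syntax (the desugared forms in the comments are approximate, not the exact THIR output):

```rust
fn thir_examples(text: String, a: u32, b: u32) -> usize {
    // Overloaded operators become calls or builtin operations: for a user type,
    // `a + b` becomes roughly `Add::add(a, b)`; for primitives like `u32` it
    // stays a builtin binary operation.
    let _sum = a + b;

    // Method calls become regular function calls with explicit auto-deref /
    // auto-ref: `text.len()` is roughly `str::len(&*text)` after desugaring.
    let n = text.len();

    // Match ergonomics are desugared: matching `&Option<u32>` with `Some(x)`
    // makes the reference pattern and the `x: &u32` binding mode explicit.
    let opt = Some(a);
    if let Some(x) = &opt {
        let _ = x;
    }

    n
}
```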
See The Typed High-level Intermediate Representation for more details.
Mid-level Intermediate Representation (MIR) is an Intermediate Representation that is lowered from the Typed High-level Intermediate Representation (THIR). Compared to its precursor THIR, MIR has three distinguishing properties:
- Simplified constructs: MIR simplifies many of the surface language's syntactic constructs. For example, matches are lowered into a series of `SwitchInt`s and branches.
- Explicit semantic constructs: MIR makes explicit various semantic aspects of Rust that do not appear in the surface syntax. This includes things like drops, drop flags, unwinding, codegen intrinsics, retags, and others.
- Dataflow-aware: MIR is designed to be dataflow-aware, and removes lexical information still present in THIR to enable dataflow-based analysis.
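For example, the `match` in the function below is lowered to a `SwitchInt` terminator over several basic blocks rather than a tree-shaped expression; you can inspect the MIR yourself with, e.g., `rustc --emit=mir` or (on nightly) `rustc -Zunpretty=mir`:

```rust
// Lowered to MIR, the `match` becomes a `SwitchInt` on `n` with one basic
// block per arm, plus explicit control-flow edges between the blocks.
fn describe(n: u32) -> &'static str {
    match n {
        0 => "zero",
        1 => "one",
        _ => "many",
    }
}
```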
A series of transformations are performed on MIR to prepare it for use by codegen, Compile-Time Function Evaluation (CTFE), and other analyses. MIR has three dialects which are distinguished by semantic changes. Within each dialect, MIR has phases which represent additional constraints on the well-formedness of MIR within a particular dialect. Dialects and phases are made explicit so that different consumers of MIR can be provided with a flavor of MIR that has desirable properties suitable for each of them. Some of the analyses include borrow-checking, drop elaboration, coroutine transformations, const promotion, further well-formedness checking and some late type-checking.
MIR optimizations are generally opportunistic: because they run before monomorphization, there are, for example, constants tied to generic parameters whose values we cannot know. When an opportunity does arise, however, MIR optimizations try to achieve one or both of the following objectives:
- to reduce the amount of work for the codegen backends, and
- to produce simpler or more optimizable backend output.
See Mid-level Intermediate Representation for more details.
Rust libraries are compiled to archives which consist of object code and metadata. Metadata is used by downstream crates to understand the interface of upstream crates, including (but not limited to) exported items, function signatures, type definitions and MIR for generic functions[^1]. In this sense, metadata can be compared to precompiled C headers. Metadata Encoding is the area responsible for serializing this information into a common binary format that can be understood by both upstream and downstream crates.
Metadata is serialized via types which implement `Encodable`, some of which are derived while others are hand-written implementations. Symmetrically, metadata can be read back via `Decodable` derives and manual deserialization implementations.
See Libraries and Metadata for more details.
Monomorphization collection is the area responsible for collecting Mono Items which contribute to code generation: functions, methods, closures, statics, drop glue, constants, VTables and object shims. Functions might have generic type parameters, in which case they need to be monomorphized: the compiler has to instantiate them with the provided type arguments.
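As an illustration, the generic function below gives rise to two distinct mono items when collection walks the crate, one per concrete instantiation:

```rust
fn largest<T: PartialOrd + Copy>(items: &[T]) -> T {
    let mut best = items[0];
    for &item in items {
        if item > best {
            best = item;
        }
    }
    best
}

fn main() {
    // Collection records `largest::<u32>` and `largest::<f64>` (plus `main`)
    // as mono items that must be code-generated.
    let _a = largest(&[1u32, 5, 3]);
    let _b = largest(&[0.5f64, 2.0]);
}
```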
See Monomorphization for more details.
🚧 TODO 🚧 I don't know how to describe the Trait System at all here.
Static Single Assignment (SSA) code generation is responsible for lowering the Mid-level Intermediate Representation (MIR) into SSA form (another kind of IR), which has the property that each variable is assigned exactly once and is defined before use. Lowering into SSA form enables more optimizations, since it simplifies reasoning about variables, and makes the output easier for code generation backends to handle. This area abstracts the common code generation logic and the interfaces which code generation backends need to implement.
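A hedged sketch of the SSA property on a trivial function: the reassigned `x` below is split into fresh names in SSA form so that each name is assigned exactly once (schematically `x0 = 1; x1 = x0 + 2; return x1`), which is the shape backends such as LLVM consume:

```rust
// In MIR, `x` is a single mutable local assigned twice; in the SSA form the
// backend works with, each assignment gets its own name (x0, x1).
fn ssa_example() -> i32 {
    let mut x = 1;
    x = x + 2;
    x
}
```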
In order to lower MIR to SSA, we depend on Monomorphization Collection to collect Mono Items such as functions, methods and closures which contribute to code generation.
See Code Generation for more details.
🚧 TODO 🚧 need more information.
UNDER CONSTRUCTION This section lacks additional references
- Detailed description of the query system: https://rustc-dev-guide.rust-lang.org/query.html.
- Design document of On-Demand Compilation and Incremental Compilation: https://github.com/nikomatsakis/rustc-on-demand-incremental-design-doc/blob/master/0000-rustc-on-demand-and-incremental.md.
UNDER CONSTRUCTION This should retain key entry points from the previous overview version. Can be very useful. We should expand it a bit. This section lacks additional references.
FIXME: audit these references and links, they are likely very outdated
- Command line parsing
  - Guide: The Rustc Driver and Interface
  - Driver definition: `rustc_driver`
  - Main entry point: `rustc_session::config::build_session_options`
- Lexical Analysis: Lex the user program to a stream of tokens
  - Guide: Lexing and Parsing
  - Lexer definition: `rustc_lexer`
  - Main entry point: `rustc_lexer::cursor::Cursor::advance_token`
- Parsing: Parse the stream of tokens to an Abstract Syntax Tree (AST)
  - Guide: Lexing and Parsing
  - Guide: Macro Expansion
  - Guide: Name Resolution
  - Parser definition: `rustc_parse`
  - Main entry points:
    - Entry point for first file in crate
    - Entry point for outline module parsing
    - [Entry point for macro fragments][parse_nonterminal]
  - `AST` definition: `rustc_ast`
  - Feature gating: TODO
  - Early linting: TODO
- The High Level Intermediate Representation (HIR)
  - Guide: The HIR
  - Guide: Identifiers in the HIR
  - Guide: The `HIR` Map
  - Guide: Lowering `AST` to `HIR`
  - How to view the `HIR` representation for your code: `cargo rustc -- -Z unpretty=hir-tree`
  - Rustc `HIR` definition: `rustc_hir`
  - Main entry point: TODO
  - Late linting: TODO
- Type Inference
  - Guide: Type Inference
  - Guide: The ty Module: Representing Types (semantics)
  - Main entry point (type inference): `InferCtxtBuilder::enter`
  - Main entry point (type checking bodies): the `typeck` query
    - These two functions can't be decoupled.
- The Mid Level Intermediate Representation (MIR)
  - Guide: The `MIR` (Mid level IR)
  - Definition: `rustc_middle/src/mir`
  - Definition of sources that manipulate the `MIR`: `rustc_mir_build`, `rustc_mir_dataflow`, `rustc_mir_transform`
- The Borrow Checker
  - Guide: MIR Borrow Check
  - Definition: `rustc_borrowck`
  - Main entry point: `mir_borrowck` query
- MIR Optimizations
  - Guide: MIR Optimizations
  - Definition: `rustc_mir_transform`
  - Main entry point: `optimized_mir` query
- Code Generation
  - Guide: Code Generation
  - Generating Machine Code from `LLVM-IR` with LLVM - TODO: reference?
  - Main entry point: `rustc_codegen_ssa::base::codegen_crate`
    - This monomorphizes and produces `LLVM-IR` for one codegen unit. It then starts a background thread to run LLVM, which must be joined later.
    - Monomorphization happens lazily via `FunctionCx::monomorphize` and `rustc_codegen_ssa::base::codegen_instance`
Footnotes

[^1]: proc macro metadata has a trimmed-down version which only contains the exported macros, without anything else defined inside the proc macro.