Overview of the compiler

In this chapter, we aim to describe rustc at a high level, showing how its various areas work together. We introduce the demand-driven architecture and organize rustc's areas into a graph based on their information dependencies. For each area, we describe its purpose and motivation, and provide a short summary with links to chapters that go into more detail.

UNDER CONSTRUCTION

  • This section lacks motivation and reasoning for the query architecture.

UNRESOLVED QUESTIONS

  • How do we describe the relationship between trait solving and the areas that it intertwines with?
  • How do we describe error handling and rustc_interface?
  • Where/how do we describe the distinction between early lints and late lints? Do we talk about them briefly in this section already?

Conventional compiler texts often describe traditional compilers as being organized into linear passes, where a pass eagerly produces information for a subsequent pass to consume. rustc is organized a bit differently. For the most part, rustc adopts a "demand-driven" system, where one area provides information for another area that demands it via lazily-computed queries. If we view the compiler's knowledge about a crate as a database, then queries are functions which ask the database for information. Initially, this database is empty; it is populated on demand as areas request information via queries. Queries can demand information from other sub-queries, which in turn can demand information from area providers. This information dependency flow forms, at item granularity, a directed acyclic graph (DAG) whose nodes are queries and whose leaves are the areas which provide the demanded information.
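The demand-driven idea can be sketched as a memoizing query cache. This is a heavily simplified, hypothetical model (real rustc queries are declared through macros, keyed by interned IDs, and tracked for incremental compilation), but the lazy-plus-memoized shape is the same:

```rust
use std::collections::HashMap;

// A much-simplified sketch of a demand-driven query cache.
// The names here are hypothetical, not rustc's actual API.
struct QueryCache {
    type_of_cache: HashMap<u32, String>, // item id -> computed "type"
}

impl QueryCache {
    fn new() -> Self {
        QueryCache { type_of_cache: HashMap::new() }
    }

    // `type_of` is computed lazily: only when some consumer demands it,
    // and at most once per key thanks to memoization.
    fn type_of(&mut self, item: u32) -> String {
        if let Some(cached) = self.type_of_cache.get(&item) {
            return cached.clone();
        }
        // A real provider may itself demand other queries here,
        // forming the DAG described above.
        let computed = format!("ty_of_item_{item}");
        self.type_of_cache.insert(item, computed.clone());
        computed
    }
}
```

The second call for the same key returns the cached result without re-running the provider, which is what makes re-querying cheap.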

UNDER CONSTRUCTION Insert a query diagram to illustrate the above point.

Ideally, the entire compiler would be organized as queries. However, as of the time of writing (March 2024), rustc still has several areas which are not yet querified and work like conventional passes. The following diagram illustrates this distinction as well as the information dependency between areas.

FIXME What do we want the arrows to actually mean?

High-level overview

To produce the final executable, we take one of several code generation pathways. For example, to produce an executable following the LLVM pathway, we rely on the synthesis of LLVM IR from SSA code generation. Code generation in turn relies on information produced by Monomorphization Collection, and so on. We keep following the dependencies of the querified areas until we reach HIR Lowering, which depends on information from non-querified areas of the compiler. At that boundary, we make a distinction between the pull-based information dependencies of querified areas and the push-based information dependencies of the non-querified areas.

To pass information between areas, the compiler uses various Intermediate Representations (IRs), which provide common interfaces to allow areas to communicate with each other. These include but are not limited to:

  • Tokens
  • Abstract Syntax Tree (AST)
  • High-Level Intermediate Representation (HIR)
  • Mid-Level Intermediate Representation (MIR)

These IRs gradually transform the surface Rust syntax by simplifying language constructs, and enrich the information available to lower-level areas through many analyses.

We will explain each area in more detail in the next section.

Brief Summaries of the Areas

The Compiler Driver and the Interface

The compiler driver (rustc_driver) is responsible for exercising the unstable interface exposed by the compiler interface rustc_interface to orchestrate the compilation phases and queries in the correct order. Splitting the driver and the interface allows third parties to use rustc's internals as a library (such as rustdoc) through the interface.

Non-Querified Areas

Tokenization

The compiler takes the input, a stream of Unicode characters, and transforms it into a sequence of Tokens called a TokenStream. This has multiple benefits:

  • Separation of concerns: the lexer can be concerned with lower-level details such as interning identifiers, collecting numeric literals, identifying keywords and operators, associating source locations with tokens, and generating useful error reports for these cases. This simplifies the job of the parser, as it can then work with tokens, which are a higher-level construct than Unicode characters.
  • Simplifying the grammar: with a separate lexer, the parser can focus on accepting a valid program according to a grammar defined in terms of tokens instead of individual Unicode characters.
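The character-to-token step can be illustrated with a toy lexer. This is not rustc_lexer's actual API, just a sketch of the separation of concerns: once tokenization is done, the parser never has to look at individual characters.

```rust
// A toy lexer sketch (not rustc_lexer's actual API).
#[derive(Debug, PartialEq)]
enum Token {
    Ident(String),
    Number(u64),
    Plus,
}

fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next(); // skip whitespace between tokens
        } else if c == '+' {
            chars.next();
            tokens.push(Token::Plus);
        } else if c.is_ascii_digit() {
            // Collect a whole numeric literal into one token.
            let mut n = 0u64;
            while let Some(&d) = chars.peek() {
                if let Some(v) = d.to_digit(10) {
                    n = n * 10 + v as u64;
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(Token::Number(n));
        } else if c.is_alphabetic() || c == '_' {
            // Collect a whole identifier into one token.
            let mut ident = String::new();
            while let Some(&i) = chars.peek() {
                if i.is_alphanumeric() || i == '_' {
                    ident.push(i);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(Token::Ident(ident));
        } else {
            // A real lexer would report an error with a source
            // location here; the sketch just skips the character.
            chars.next();
        }
    }
    tokens
}
```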

See Lexing and Parsing for more details.

Parsing

Given a token stream, the compiler builds an Abstract Syntax Tree (AST), whose tree structure enables tree-based reasoning and effectively represents the syntactical structure of the program. Rust has a macro system, which means that macros need to be expanded in order to construct the full AST. Initially, a pre-expansion AST is constructed with placeholders for macros that are pending expansion.
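A hypothetical miniature AST can make both points concrete: expressions form a tree that is convenient to recurse over, and a pre-expansion AST can hold a placeholder node for a macro call that has not been expanded yet. (rustc's real AST in rustc_ast is far richer; these names are illustrative only.)

```rust
// A hypothetical miniature AST, loosely echoing rustc's.
#[derive(Debug)]
enum Expr {
    Number(u64),
    Add(Box<Expr>, Box<Expr>),
    // Pre-expansion ASTs keep a placeholder for pending macro calls.
    MacroCall { name: String },
}

// Tree-based reasoning: analyses can recurse structurally.
fn depth(e: &Expr) -> usize {
    match e {
        Expr::Number(_) | Expr::MacroCall { .. } => 1,
        Expr::Add(l, r) => 1 + depth(l).max(depth(r)),
    }
}
```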

See Lexing and Parsing for more details.

Macro Expansion, Metadata Loading, Early Name Resolution and Late Name Resolution

TODO: this section needs a lot of work because I don't understand this area too well so far

After the pre-expansion AST is constructed, the compiler tries to build the full AST for a crate which has all macros expanded and all modules inlined. Unresolved macros are iteratively expanded, which requires resolving imports and macro names early (i.e. Early Name Resolution), but not other names yet.

To provide import suggestions, the compiler may perform speculative crate loading: it reads crate metadata for import suggestion candidates without emitting diagnostics for the speculatively loaded crates.

See Macro Expansion and Name Resolution for more details.

AST Validation

Some basic sanity checks are performed on the AST, such as not having more than u16::MAX parameters, to catch some obvious user errors as well as errors in AST construction / macro expansion.
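A standalone sketch of this kind of check (the real validation lives in rustc's AST-passes area; this simplified version only models the parameter-count limit mentioned above):

```rust
// Sketch of an AST-validation-style check: reject functions with
// more than u16::MAX parameters. Illustrative only.
fn check_param_count(num_params: usize) -> Result<(), String> {
    if num_params > u16::MAX as usize {
        Err(format!(
            "function can not have more than {} parameters",
            u16::MAX
        ))
    } else {
        Ok(())
    }
}
```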

See AST Validation for more details.

Querified Areas

HIR Building

High-level Intermediate Representation (HIR) is a lower-level IR than AST. When compared to AST, HIR has desugared constructs (such as desugaring for loops into loops) to make analysis easier. When compared to MIR, HIR resembles the surface Rust syntax more closely.

HIR building is the lowering from the Abstract Syntax Tree (AST) into HIR. Sanity checks are performed on the lowered HIR to catch obvious lowering mistakes.
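The for-loop desugaring mentioned above can be approximated by hand in surface syntax. The exact HIR lowering differs in details (it uses language-item paths and labels), but the shape is this:

```rust
// Surface syntax: a `for` loop.
fn sum_for(v: &[i32]) -> i32 {
    let mut total = 0;
    for x in v {
        total += x;
    }
    total
}

// Hand-written approximation of the HIR desugaring: a plain `loop`
// driving an iterator via `match` on `next()`.
fn sum_desugared(v: &[i32]) -> i32 {
    let mut total = 0;
    let mut iter = IntoIterator::into_iter(v);
    loop {
        match Iterator::next(&mut iter) {
            Some(x) => total += x,
            None => break,
        }
    }
    total
}
```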

See AST Lowering for more details.

HIR Type Lowering

HIR Type Lowering converts the more surface-syntax-like HIR type system entities (types, lifetimes/regions, constants in type positions) into rustc_middle::ty representations to express the semantics of the type.

See the The ty module: representing types for more information.

HIR Well-formedness Checking

A type is well-formed (WF) when it satisfies a set of requirements which are fundamental to the type's validity. The requirements are type-specific, for example:

  • For a tuple to be WF, all elements except for the last need to be Sized.
  • For an algebraic data type (struct/enum/union) to be WF, its generics need to satisfy the where clauses on the ADT.

The notion of type well-formedness can be inductively extended for entities that contain types, such as trait predicates. A trait predicate T0: Trait<P1, ...> is well-formed if T0 and P1, ... are well-formed in addition to Trait's where clauses being satisfied.
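Both requirements can be seen in a small example. The well-formed lines below compile; the commented-out lines are the kind of ill-formed code that WF checking rejects before type inference even runs (`NotDebug` is a hypothetical type without a Debug impl):

```rust
use std::fmt::Debug;

// The `where`-style bound means `Wrapper<T>` is only well-formed
// when `T: Debug` holds.
struct Wrapper<T: Debug>(T);

fn wf_examples() -> usize {
    // WF: every tuple element except the last is Sized, and the
    // bound `String: Debug` is satisfied.
    let pair: (u8, String) = (1, String::from("ok"));
    let wrapped = Wrapper(pair.1.clone());

    // Ill-formed, rejected by WF checking:
    // let bad: (str, u8);          // non-last tuple element is unsized
    // let bad = Wrapper(NotDebug); // bound `NotDebug: Debug` unsatisfied

    wrapped.0.len() + pair.0 as usize
}
```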

HIR well-formedness checking is an early collection of checks on mostly surface Rust syntax (such as parameters, local variables and return types) which are run before further type-checking and inference, for two purposes:

  1. For better error-reporting: a program which is not WF is more likely to give hard-to-understand or spurious errors.
  2. To enable making simplifying assumptions in later HIR typeck: such as not having to check if a tuple's fields are Sized when type checking a tuple.0 expression.

The HIR is well-formed (WF) when it satisfies a set of requirements that include but are not limited to:

  • well-formedness of types in type signatures
  • object safety is respected
  • function call ABI requirements (e.g. for "rust-call") are satisfied
  • receiver type can be used as a self type
  • variances are compatible
  • and many more

Note that HIR WF checking is not the only place where WF checks take place: there are further WF checks in THIR/MIR to complement the HIR WF checks.

HIR Body Type-checking

🚧 TODO 🚧 (remove the cat meme when complete with draft)

The goal of HIR body type-checking is to catch mistakes that violate the rules of the type system, and to infer types and other information necessary for THIR/MIR building. HIR body type-checking recursively walks and checks each expression. This process produces information such as the type of each expression. Along the way, type inference, implicit coercions, trait bound checking and method resolution are performed. Notably,

  • Method resolution is not performed in (early/late) Name Resolution because type information is required.
  • Type inference is performed on HIR and not MIR because HIR maps closer to the surface Rust syntax (and retains the tree-based structure), which enables more descriptive errors and suggestions to be produced.
  • MIR building requires method resolution information from HIR type-checking to guide the lowering.

Compared to MIR typeck, where all types are already specified, HIR typeck has to infer types. HIR typeck also checks all closures together with their parent body, while MIR typeck/borrowck checks them separately.
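A small example of the inference work HIR typeck performs: at the `Vec::new()` call the element type is unknown, and it is only pinned down by a later use.

```rust
// At `Vec::new()` the element type is an inference variable;
// the later `push(1i32)` resolves it to `i32`.
fn inferred_vec() -> Vec<i32> {
    let mut v = Vec::new(); // type of `v` not yet known here
    v.push(1i32);           // inference: element type = i32
    v
}
```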

See Type checking for more details.

Typed High-level Intermediate Representation (THIR) Building and Analysis

Typed High-level Intermediate Representation (THIR) is a further desugared version of High-level Intermediate Representation (HIR) with full type information embedded. THIR is the last IR where lexical structure is meaningful for analysis in the chain of lowering from HIR to THIR to MIR and beyond, where lower-level IRs transition from being lexically-aware to being dataflow-aware. For example, unsafeck is performed on THIR and not MIR because we care about the lexical structure -- whether something exists inside an unsafe block lexically. To further simplify tree-based reasoning, THIR has several simplifications over HIR:

  • Explicit auto-deref/auto-ref and coercions.
  • Method calls are desugared into regular function calls.
  • Overloaded operators like + are lowered into function calls or builtin operations.
  • Match ergonomics are desugared.

With the aforementioned properties, the construction of THIR synthesizes enough information for MIR building, and makes lowering to MIR easier and less error-prone than lowering HIR directly to MIR would be.
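Two of these desugarings can be written out by hand in surface syntax: a method call becomes a plain function call, and an overloaded operator becomes a trait-method call.

```rust
use std::ops::Add;

// Surface syntax: an overloaded operator and a method call.
fn surface(a: i32, b: i32, s: &str) -> (i32, usize) {
    (a + b, s.len())
}

// THIR-style desugaring of the same body, written by hand:
// `+` becomes `Add::add`, `s.len()` becomes `str::len(s)`.
fn desugared(a: i32, b: i32, s: &str) -> (i32, usize) {
    (Add::add(a, b), str::len(s))
}
```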

See The Typed High-level Intermediate Representation for more details.

Mid-level Intermediate Representation (MIR) Building, Transformations and Analysis

Mid-level Intermediate Representation (MIR) is an Intermediate Representation that is lowered from Typed High-level Intermediate Representation (THIR). Compared to its precursor THIR, MIR has three key distinguishing properties:

  1. Simplified constructs: MIR simplifies a lot of the surface language syntactical constructs. For example, matches are lowered into a series of SwitchInts and branches.
  2. Explicit semantic constructs: MIR makes explicit various semantic aspects of Rust that do not appear in the surface syntax. This includes things like drops, drop flags, unwinding, codegen intrinsics, retags, and others.
  3. Dataflow-aware: MIR is designed to be dataflow-aware, and removes lexical information still present in THIR to enable dataflow-based analysis.
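The match-to-SwitchInt lowering can be approximated by hand. Real MIR is a control-flow graph of basic blocks rather than structured code, so this is only a rough sketch of the shape: read the discriminant, then branch on it.

```rust
#[derive(Clone, Copy)]
enum Shape {
    Circle,
    Square,
    Triangle,
}

// Surface syntax: a match on an enum.
fn sides(s: Shape) -> u32 {
    match s {
        Shape::Circle => 0,
        Shape::Square => 4,
        Shape::Triangle => 3,
    }
}

// Rough shape of the MIR lowering: MIR first reads the
// discriminant, then does a SwitchInt-style dispatch on it.
fn sides_mirlike(s: Shape) -> u32 {
    let discr = s as u8; // stand-in for MIR's Discriminant rvalue
    match discr {        // stand-in for SwitchInt(discr)
        0 => 0, // Circle
        1 => 4, // Square
        _ => 3, // Triangle
    }
}
```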

A series of transformations are performed on MIR to prepare it for use by codegen, Compile-Time Function Evaluation (CTFE), and other analyses. MIR has three dialects which are distinguished by semantic changes. Within each dialect, MIR has phases which represent additional constraints on the well-formedness of MIR within a particular dialect. Dialects and phases are made explicit so that different consumers of MIR can be provided with a flavor of MIR that has desirable properties suitable for each of them. Some of the analyses include borrow-checking, drop elaboration, coroutine transformations, const promotion, further well-formedness checking and some late type-checking.

MIR Dialects and Phases

MIR optimizations are generally opportunistic: because they run before monomorphization, there are, for example, constants related to generic parameters whose values we cannot know. When an opportunity does arise, however, MIR optimizations try to achieve one or both of the following objectives:

  1. to reduce the amount of work for the codegen backends, and
  2. to produce simpler or more optimizable backend output.

See Mid-level Intermediate Representation for more details.

Metadata Encoding

Rust libraries are compiled to archives which consist of object code and metadata. Metadata is used by downstream crates to understand the interface of upstream crates, including (but not limited to) exported items, function signatures, type definitions and MIR for generic functions [1]. In this sense, metadata can be compared to precompiled C headers. Metadata Encoding is the area responsible for serializing this information into a common binary format that can be understood by both upstream and downstream crates.

Metadata is serialized via types which implement Encodable, some of which are derived while others are hand-written implementations. Symmetrically, metadata can be read back by Decodable derives and manual deserialization implementations.
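A toy sketch of the Encodable/Decodable pattern (the real traits live in rustc's serialization infrastructure; this type, format, and method names are hypothetical): one side serializes a record into bytes, and the other side symmetrically reads it back.

```rust
// Hypothetical metadata record: a function signature summary.
struct FnSig {
    name: String,
    num_args: u8,
}

impl FnSig {
    // Hand-written "Encodable"-style serialization:
    // length-prefixed name, then the argument count.
    fn encode(&self, buf: &mut Vec<u8>) {
        buf.push(self.name.len() as u8);
        buf.extend_from_slice(self.name.as_bytes());
        buf.push(self.num_args);
    }

    // Symmetric "Decodable"-style deserialization.
    fn decode(buf: &[u8]) -> FnSig {
        let len = buf[0] as usize;
        let name = String::from_utf8(buf[1..1 + len].to_vec()).unwrap();
        FnSig { name, num_args: buf[1 + len] }
    }
}
```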

See Libraries and Metadata for more details.

Monomorphization Collection

Monomorphization collection is the area responsible for collecting Mono Items which contribute to code generation: functions, methods, closures, statics, drop glue, constants, VTables and object shims. Functions need to be monomorphized -- they might have generic type parameters, in which case the compiler has to instantiate the functions with the provided type arguments.
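For example, a generic function used at two concrete types yields two distinct mono items. Conceptually, the collector records instantiations equivalent to the hand-expanded copies below:

```rust
// One generic function...
fn double<T: std::ops::Add<Output = T> + Copy>(x: T) -> T {
    x + x
}

// ...used as `double::<i32>` and `double::<f64>` becomes two mono
// items, conceptually equivalent to these hand-expanded copies:
fn double_i32(x: i32) -> i32 {
    x + x
}
fn double_f64(x: f64) -> f64 {
    x + x
}
```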

See Monomorphization for more details.

The Trait System

🚧 TODO 🚧 I don't know how to describe the Trait System at all here.

Static-Single Assignment (SSA) Code Generation

Static Single Assignment (SSA) code generation is responsible for lowering Mid-Level Intermediate Representation (MIR) into SSA form (another kind of IR) which has the property that each variable is assigned to exactly once and is defined before use. Lowering into SSA form enables more optimizations since it simplifies the properties of variables. It also makes it easier for code generation backends to handle. This area abstracts the common code generation logic and interfaces which code generation backends will need to implement.
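The single-assignment property can be illustrated in plain Rust: reassignments to one mutable variable become a chain of fresh names, each assigned exactly once and defined before use (the `x0`/`x1`/`x2` naming mimics SSA's subscripted versions).

```rust
// Before SSA: `x` is reassigned twice.
fn not_ssa() -> i32 {
    let mut x = 1;
    x = x + 2;
    x = x * 3;
    x
}

// After SSA renaming: every name is assigned exactly once and
// defined before use.
fn ssa_form() -> i32 {
    let x0 = 1;
    let x1 = x0 + 2;
    let x2 = x1 * 3;
    x2
}
```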

In order to lower MIR to SSA, we depend on Monomorphization Collection to collect Mono Items such as functions, methods and closures which contribute to code generation.

See Code Generation for more details.

Codegen backends: LLVM, GCC and Cranelift

🚧 TODO 🚧 need more information.

Additional References and Resources

UNDER CONSTRUCTION This section lacks additional references

The Query System

Areas

UNDER CONSTRUCTION This should retain key entry points from the previous overview version. Can be very useful. We should expand it a bit. This section lacks additional references.

FIXME: audit these references and links, they are likely very outdated

Footnotes

  1. Proc macro metadata is a trimmed-down version which only contains the exported macros, without anything else defined inside the proc macro.