rustc_ast, rustc_expand, and rustc_builtin_macros are all undergoing refactoring, so some of the links in this chapter may be broken.
Rust has a very powerful macro system. In the previous chapter, we saw how the parser sets aside macros to be expanded (it temporarily uses placeholders). This chapter is about the process of expanding those macros iteratively until we have a complete AST for our crate with no unexpanded macros (or a compile error).
First, we will discuss the algorithm that expands and integrates macro output into ASTs. Next, we will take a look at how hygiene data is collected. Finally, we will look at the specifics of expanding different types of macros.
Many of the algorithms and data structures described below are in rustc_expand, with basic data structures in rustc_expand::base.
Also of note, cfg and cfg_attr are treated differently from other macros, and are handled in rustc_expand::config.
First of all, expansion happens at the crate level. Given the raw source code for a crate, the compiler will produce a massive AST with all macros expanded, all modules inlined, etc. The primary entry point for this process is the MacroExpander::fully_expand_fragment method. With few exceptions, we use this method on the whole crate (see "Eager Expansion" below for more detailed discussion of edge case expansion issues).
At a high level, fully_expand_fragment works in iterations. We keep a queue of unresolved macro invocations (that is, macros we haven't found the definition of yet). We repeatedly try to pick a macro from the queue, resolve it, expand it, and integrate it back. If we can't make progress in an iteration, this represents a compile error. Here is the algorithm:
- Initialize a queue of unresolved macros.
- Repeat until queue is empty (or we make no progress, which is an error):
  - Resolve imports in our partially built crate as much as possible.
  - Collect as many macro Invocations as possible from our partially built crate (fn-like, attributes, derives) and add them to the queue.
  - Dequeue the first element, and attempt to resolve it.
  - If it's resolved:
    - Run the macro's expander function that consumes a TokenStream or AST and produces a TokenStream or AstFragment (depending on the macro kind). (A TokenStream is a collection of TokenTrees, each of which is a token (punctuation, identifier, or literal) or a delimited group (anything inside ()/[]/{})).
      - At this point, we know everything about the macro itself and can call set_expn_data to fill in its properties in the global data; that is the hygiene data associated with ExpnId. (See the "Hygiene" section below).
    - Integrate that piece of AST into the big existing partially built AST. This is essentially where the "token-like mass" becomes a proper set-in-stone AST with side-tables. It happens as follows:
      - If the macro produces tokens (e.g. a proc macro), we parse into an AST, which may produce parse errors.
      - During expansion, we create SyntaxContexts (hierarchy 2). (See the "Hygiene" section below)
      - These three passes happen one after another on every AST fragment freshly expanded from a macro:
        - NodeIds are assigned by InvocationCollector. This also collects new macro calls from this new AST piece and adds them to the queue.
        - "Def paths" are created and DefIds are assigned to them by DefCollector.
        - Names are put into modules (from the resolver's point of view) by BuildReducedGraphVisitor.
    - After expanding a single macro and integrating its output, continue to the next iteration of fully_expand_fragment.
  - If it's not resolved:
    - Put the macro back in the queue.
    - Continue to next iteration...
If we make no progress in an iteration, then we have reached a compilation error (e.g. an undefined macro). We attempt to recover from failures (unresolved macros or imports) for the sake of diagnostics. This allows compilation to continue past the first error, so that we can report more errors at a time. Recovery can't cause compilation to succeed. We know that it will fail at this point. The recovery happens by expanding unresolved macros into ExprKind::Err.
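The loop described above can be sketched with a toy worklist. This is only an illustrative model, not the compiler's code: ToyMacro, its fields, and this free-standing fully_expand function are hypothetical names, macro invocations are reduced to plain strings, and "resolution" is reduced to checking whether a name is currently visible. It shows why unresolved invocations are put back in the queue (a later expansion may make them resolvable) and how "no progress" is detected.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// A toy macro (hypothetical): expanding it yields plain tokens, further
/// macro invocations, and names of macros that become visible afterwards.
struct ToyMacro {
    output: Vec<&'static str>,
    calls: Vec<&'static str>,
    defines: Vec<&'static str>,
}

/// Expands `initial` to a fixed point. Returns the produced tokens, or the
/// invocations left unresolved once a full pass makes no progress.
fn fully_expand(
    all: &HashMap<&'static str, ToyMacro>,
    visible: &mut HashSet<&'static str>,
    initial: &[&'static str],
) -> Result<Vec<&'static str>, Vec<&'static str>> {
    let mut queue: VecDeque<&'static str> = initial.iter().copied().collect();
    let mut output: Vec<&'static str> = Vec::new();
    // Number of consecutive invocations that failed to resolve.
    let mut stalled = 0;
    while let Some(name) = queue.pop_front() {
        match all.get(name) {
            // Resolved: expand, integrate the output, queue nested calls.
            Some(mac) if visible.contains(name) => {
                output.extend(mac.output.iter().copied());
                queue.extend(mac.calls.iter().copied());
                visible.extend(mac.defines.iter().copied());
                stalled = 0;
            }
            // Not resolved (yet): put it back and try the next one.
            _ => {
                queue.push_back(name);
                stalled += 1;
                // Every queued invocation failed in a row: no progress.
                if stalled >= queue.len() {
                    return Err(queue.into_iter().collect());
                }
            }
        }
    }
    Ok(output)
}

fn main() {
    let mut all = HashMap::new();
    all.insert("outer", ToyMacro { output: vec!["a"], calls: vec![], defines: vec!["inner"] });
    all.insert("inner", ToyMacro { output: vec!["b"], calls: vec![], defines: vec![] });
    let mut visible = HashSet::new();
    visible.insert("outer");
    // `inner` is queued first but only becomes resolvable after `outer` expands.
    println!("{:?}", fully_expand(&all, &mut visible, &["inner", "outer"]));
}
```

In this sketch the error case simply returns the stuck invocations; the real compiler instead recovers for the sake of diagnostics, as described above.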
Notice that name resolution is involved here: we need to resolve imports and macro names in the above algorithm. This is done in rustc_resolve::macros, which resolves macro paths, validates those resolutions, and reports various errors (e.g. "not found" or "found, but it's unstable" or "expected x, found y"). However, we don't try to resolve other names yet. This happens later, as we will see in the next chapter.
Eager expansion means that we expand the arguments of a macro invocation before the macro invocation itself. This is implemented only for a few special built-in macros that expect literals; expanding arguments first for some of these macros results in a smoother user experience. As an example, consider the following:
macro bar($i: ident) { $i }
macro foo($i: ident) { $i }
foo!(bar!(baz));
A lazy expansion would expand foo! first. An eager expansion would expand bar! first.
Eager expansion is not a generally available feature of Rust. Implementing eager expansion more generally would be challenging, but we implement it for a few special built-in macros for the sake of user experience. The built-in macros are implemented in rustc_builtin_macros, along with some other early code generation facilities like injection of standard library imports or generation of the test harness. There are some additional helpers for building their AST fragments in rustc_expand::build. Eager expansion generally performs a subset of the things that lazy (normal) expansion does. It is done by invoking fully_expand_fragment on only part of a crate (as opposed to the whole crate, like we normally do).
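The difference is observable from user code. The sketch below contrasts the built-in concat!, which expects literals and therefore has its nested stringify! argument expanded first, with a user-defined macro_rules macro, which receives its argument as unexpanded token trees. The names count_tts, eager, and lazy_token_count are made up for this illustration.

```rust
// A user-defined macro is expanded lazily: it receives its arguments as raw
// token trees, so a nested invocation is still unexpanded when it runs.
macro_rules! count_tts {
    ($($t:tt)*) => { [$(stringify!($t)),*].len() };
}

// `concat!` needs literals, so the nested `stringify!` is expanded first
// and its string literal result is then concatenated.
fn eager() -> &'static str {
    concat!("answer_", stringify!(forty_two))
}

// `count_tts!` sees three token trees: `stringify`, `!`, and `(forty_two)`.
fn lazy_token_count() -> usize {
    count_tts!(stringify!(forty_two))
}

fn main() {
    println!("{}", eager());
    println!("{}", lazy_token_count());
}
```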
Here are some other notable data structures involved in expansion and integration:
- ResolverExpand - a trait used to break crate dependencies. This allows the resolver services to be used in rustc_ast, despite rustc_resolve and pretty much everything else depending on rustc_ast.
- ExtCtxt/ExpansionData - various intermediate data kept and used by expansion infrastructure in the process of its work
- Annotatable - a piece of AST that can be an attribute target, almost the same thing as AstFragment except for types and patterns that can be produced by macros but cannot be annotated with attributes
- MacResult - a "polymorphic" AST fragment, something that can turn into a different AstFragment depending on its AstFragmentKind - item, or expression, or pattern etc.
If you have ever used C/C++ preprocessor macros, you know that there are some annoying and hard-to-debug gotchas! For example, consider the following C code:
#define DEFINE_FOO struct Bar {int x;}; struct Foo {Bar bar;};
// Then, somewhere else
struct Bar {
...
};
DEFINE_FOO
Most people avoid writing C like this, and for good reason: it doesn't compile. The struct Bar defined by the macro clashes with the struct Bar defined in the code. Consider also the following example:
#define DO_FOO(x) {\
int y = 0;\
foo(x, y);\
}
// Then elsewhere
int y = 22;
DO_FOO(y);
Do you see the problem? We wanted to generate a call foo(22, 0), but instead we got foo(0, 0) because the macro defined its own y!
These are both examples of macro hygiene issues. Hygiene relates to how to handle names defined within a macro. In particular, a hygienic macro system prevents errors due to names introduced within a macro. Rust macros are hygienic in that they do not allow one to write the sorts of bugs above.
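For comparison, here is a rough Rust counterpart of the DO_FOO example, written with macro_rules!. The names foo, do_foo, and call_with_outer_y are made up for this sketch. Because local variables introduced by a macro_rules! macro are hygienic, the y declared inside the macro cannot capture the caller's y.

```rust
fn foo(x: i32, y: i32) -> (i32, i32) {
    (x, y)
}

macro_rules! do_foo {
    ($x:expr) => {{
        // This `y` is introduced by the macro and is distinct from any `y`
        // at the call site, even though the names are spelled the same.
        let y = 0;
        foo($x, y)
    }};
}

fn call_with_outer_y() -> (i32, i32) {
    let y = 22;
    // The argument `y` still refers to the caller's `y`.
    do_foo!(y)
}

fn main() {
    println!("{:?}", call_with_outer_y());
}
```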
At a high level, hygiene within the Rust compiler is accomplished by keeping track of the context where a name is introduced and used. We can then disambiguate names based on that context. Future iterations of the macro system will allow greater control to the macro author to use that context. For example, a macro author may want to introduce a new name to the context where the macro was called. Alternately, the macro author may be defining a variable for use only within the macro (i.e. it should not be visible outside the macro).
The context is attached to AST nodes. All AST nodes generated by macros have
context attached. Additionally, there may be other nodes that have context
attached, such as some desugared syntax (non-macro-expanded nodes are
considered to just have the "root" context, as described below).
Throughout the compiler, we use rustc_span::Spans to refer to code locations. This struct also has hygiene information attached to it, as we will see later.
Because macro invocations and definitions can be nested, the syntax context of a node must be a hierarchy. For example, if we expand a macro and there is another macro invocation or definition in the generated output, then the syntax context should reflect the nesting.
However, it turns out that there are actually a few types of context we may want to track for different purposes. Thus, there are not just one but three expansion hierarchies that together comprise the hygiene information for a crate.
All of these hierarchies need some sort of "macro ID" to identify individual elements in the chain of expansions. This ID is ExpnId. All macros receive an integer ID, assigned continuously starting from 0 as we discover new macro calls. All hierarchies start at ExpnId::root(), which is its own parent.
rustc_span::hygiene contains all of the hygiene-related algorithms (with the exception of some hacks in Resolver::resolve_crate_root) and structures related to hygiene and expansion that are kept in global data.
The actual hierarchies are stored in HygieneData. This is a global piece of data containing hygiene and expansion info that can be accessed from any Ident without any context.
The first hierarchy tracks the order of expansions, i.e., when a macro invocation is in the output of another macro.
Here, the children in the hierarchy will be the "innermost" tokens. The ExpnData struct itself contains a subset of properties from both macro definition and macro call available through global data. ExpnData::parent tracks the child -> parent link in this hierarchy.
For example,
macro_rules! foo { () => { println!(); } }
fn main() { foo!(); }
In this code, the AST nodes that are finally generated would have hierarchy:
root
    expn_id_foo
        expn_id_println
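The child -> parent links of this hierarchy can be modelled with a small table. The sketch below is a toy, not the real rustc_span::hygiene types: ToyHygieneData, fresh_expn, and ancestors are hypothetical names, and the toy ExpnId and ExpnData keep only an index, a name, and the parent link. It reproduces the hierarchy shown above and walks it from the innermost expansion back to the root, which is its own parent.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ExpnId(usize);

/// Toy expansion data: just a label and the child -> parent link.
struct ExpnData {
    name: &'static str,
    parent: ExpnId,
}

/// Toy stand-in for the global hygiene table.
struct ToyHygieneData {
    expn_data: Vec<ExpnData>,
}

impl ToyHygieneData {
    const ROOT: ExpnId = ExpnId(0);

    fn new() -> Self {
        // The root expansion is its own parent.
        Self { expn_data: vec![ExpnData { name: "root", parent: Self::ROOT }] }
    }

    /// IDs are handed out consecutively as new expansions are discovered.
    fn fresh_expn(&mut self, name: &'static str, parent: ExpnId) -> ExpnId {
        self.expn_data.push(ExpnData { name, parent });
        ExpnId(self.expn_data.len() - 1)
    }

    /// Follows parent links from `id` up to and including the root.
    fn ancestors(&self, id: ExpnId) -> Vec<&'static str> {
        let mut chain = Vec::new();
        let mut cur = id;
        loop {
            let data = &self.expn_data[cur.0];
            chain.push(data.name);
            if cur == Self::ROOT {
                break;
            }
            cur = data.parent;
        }
        chain
    }
}

fn main() {
    let mut hygiene = ToyHygieneData::new();
    let foo = hygiene.fresh_expn("expn_id_foo", ToyHygieneData::ROOT);
    let println_id = hygiene.fresh_expn("expn_id_println", foo);
    println!("{:?}", hygiene.ancestors(println_id));
}
```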
The second hierarchy tracks the order of macro definitions, i.e., when we are expanding one macro, another macro definition is revealed in its output. This one is a bit tricky and more complex than the other two hierarchies.
SyntaxContext represents a whole chain in this hierarchy via an ID. SyntaxContextData contains data associated with the given SyntaxContext; mostly it is a cache for results of filtering that chain in different ways. SyntaxContextData::parent is the child -> parent link here, and SyntaxContextData::outer_expns are individual elements in the chain. The "chaining operator" is SyntaxContext::apply_mark in compiler code.
A Span, mentioned above, is actually just a compact representation of a code location and SyntaxContext. Likewise, an Ident is just an interned Symbol + Span (i.e. an interned string + hygiene data).
For built-in macros, we use the context: SyntaxContext::empty().apply_mark(expn_id), and such macros are considered to be defined at the hierarchy root. We do the same for proc-macros because we haven't implemented cross-crate hygiene yet.
If the token had context X before being produced by a macro then after being produced by the macro it has context X -> macro_id. Here are some examples:
Example 0:
macro m() { ident }
m!();
Here ident originally has context SyntaxContext::root(). ident has context ROOT -> id(m) after it's produced by m.
Example 1:
macro m() { macro n() { ident } }
m!();
n!();
In this example the ident has context ROOT originally, then ROOT -> id(m) after the first expansion, then ROOT -> id(m) -> id(n).
Example 2:
Note that these chains are not entirely determined by their last element, in other words ExpnId is not isomorphic to SyntaxContext.
macro m($i: ident) { macro n() { ($i, bar) } }
m!(foo);
After all expansions, foo has context ROOT -> id(n) and bar has context ROOT -> id(m) -> id(n).
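The chains in Example 2 can be reproduced with a toy interning table. This is a simplified model under stated assumptions: ToyContexts and chain are hypothetical names, expansions are identified by plain strings such as "id(m)", and this apply_mark only takes a context and an expansion label, so it ignores everything else the compiler's SyntaxContext::apply_mark deals with. It shows that two contexts can share the same last element and still be different chains.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct SyntaxContext(usize);

/// Toy table of syntax contexts (hypothetical, heavily simplified).
struct ToyContexts {
    // Entry `i` holds (parent context, outermost expansion) of context `i`.
    data: Vec<(SyntaxContext, &'static str)>,
    // Interning table so that equal chains share a single ID.
    interned: HashMap<(SyntaxContext, &'static str), SyntaxContext>,
}

impl ToyContexts {
    const ROOT: SyntaxContext = SyntaxContext(0);

    fn new() -> Self {
        Self { data: vec![(Self::ROOT, "ROOT")], interned: HashMap::new() }
    }

    /// The "chaining operator": extends `ctxt` with one more expansion.
    fn apply_mark(&mut self, ctxt: SyntaxContext, expn: &'static str) -> SyntaxContext {
        if let Some(&existing) = self.interned.get(&(ctxt, expn)) {
            return existing;
        }
        let new = SyntaxContext(self.data.len());
        self.data.push((ctxt, expn));
        self.interned.insert((ctxt, expn), new);
        new
    }

    /// Renders the whole chain, e.g. "ROOT -> id(m) -> id(n)".
    fn chain(&self, ctxt: SyntaxContext) -> String {
        let mut parts = Vec::new();
        let mut cur = ctxt;
        while cur != Self::ROOT {
            let (parent, expn) = self.data[cur.0];
            parts.push(expn);
            cur = parent;
        }
        parts.push("ROOT");
        parts.reverse();
        parts.join(" -> ")
    }
}

fn main() {
    let mut ctxts = ToyContexts::new();
    // `foo` was written at the root and is only produced by `n`.
    let foo = ctxts.apply_mark(ToyContexts::ROOT, "id(n)");
    // `bar` was produced by `m` first, and then by `n`.
    let after_m = ctxts.apply_mark(ToyContexts::ROOT, "id(m)");
    let bar = ctxts.apply_mark(after_m, "id(n)");
    println!("foo: {}", ctxts.chain(foo));
    println!("bar: {}", ctxts.chain(bar));
}
```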
Finally, one last thing to mention is that currently, this hierarchy is subject
to the "context transplantation hack". Basically, the more modern (and
experimental) macro
macros have stronger hygiene than the older MBE system,
but this can result in weird interactions between the two. The hack is intended
to make things "just work" for now.
The third and final hierarchy tracks the location of macro invocations.
In this hierarchy ExpnData::call_site is the child -> parent link.
Here is an example:
macro bar($i: ident) { $i }
macro foo($i: ident) { $i }
foo!(bar!(baz));
For the baz AST node in the final output, the first hierarchy is ROOT -> id(foo) -> id(bar) -> baz, while the third hierarchy is ROOT -> baz.
Macro backtraces are implemented in rustc_span using the hygiene machinery in rustc_span::hygiene.
Above, we saw how the output of a macro is integrated into the AST for a crate, and we also saw how the hygiene data for a crate is generated. But how do we actually produce the output of a macro? It depends on the type of macro.
There are two types of macros in Rust: macro_rules! macros (a.k.a. "Macros By Example" (MBE)) and procedural macros (or "proc macros"; including custom derives). During the parsing phase, the normal Rust parser will set aside the contents of macros and their invocations. Later, macros are expanded using these portions of the code.
Some important data structures/interfaces here:
- SyntaxExtension - a lowered macro representation, contains its expander function, which transforms a TokenStream or AST into another TokenStream or AST + some additional data like stability, or a list of unstable features allowed inside the macro.
- SyntaxExtensionKind - expander functions may have several different signatures (take one token stream, or two, or a piece of AST, etc). This is an enum that lists them.
- BangProcMacro/TTMacroExpander/AttrProcMacro/MultiItemModifier - traits representing the expander function signatures.
MBEs have their own parser distinct from the normal Rust parser. When macros are expanded, we may invoke the MBE parser to parse and expand a macro. The MBE parser, in turn, may call the normal Rust parser when it needs to bind a metavariable (e.g. $my_expr) while parsing the contents of a macro invocation. The code for macro expansion is in compiler/rustc_expand/src/mbe/.
It's helpful to have an example to refer to. For the remainder of this chapter, whenever we refer to the "example definition", we mean the following:
macro_rules! printer {
    (print $mvar:ident) => {
        println!("{}", $mvar);
    };
    (print twice $mvar:ident) => {
        println!("{}", $mvar);
        println!("{}", $mvar);
    };
}
$mvar is called a metavariable. Unlike normal variables, rather than binding to a value in a computation, a metavariable binds at compile time to a tree of tokens. A token is a single "unit" of the grammar, such as an identifier (e.g. foo) or punctuation (e.g. =>). There are also other special tokens, such as EOF, which indicates that there are no more tokens. Token trees result from paired parentheses-like characters ((...), [...], and {...}); they include the open and close and all the tokens in between (we do require that parentheses-like characters be balanced). Having macro expansion operate on token streams rather than the raw bytes of a source file abstracts away a lot of complexity. The macro expander (and much of the rest of the compiler) doesn't really care that much about the exact line and column of some syntactic construct in the code; it cares about what constructs are used in the code. Using tokens allows us to care about what without worrying about where. For more information about tokens, see the Parsing chapter of this book.
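To make the notion of a token tree concrete, here is a minimal sketch that groups an already-split list of tokens into trees and rejects unbalanced delimiters. It is an illustration only: this TokenTree enum, parse_trees, and the helper functions are toy definitions invented for this example, not the compiler's types, and tokens are plain strings without spans.

```rust
/// Toy token tree: a single token, or a delimited group of trees.
#[derive(Debug, PartialEq)]
enum TokenTree {
    Token(String),
    Delimited(char, Vec<TokenTree>),
}

/// Returns the closing delimiter matching an opening one, if any.
fn closer(open: &str) -> Option<&'static str> {
    match open {
        "(" => Some(")"),
        "[" => Some("]"),
        "{" => Some("}"),
        _ => None,
    }
}

/// Parses trees until `close` is found (or until the end if `close` is None).
fn parse_until(
    tokens: &[&str],
    pos: &mut usize,
    close: Option<&str>,
) -> Result<Vec<TokenTree>, String> {
    let mut trees = Vec::new();
    while *pos < tokens.len() {
        let tok = tokens[*pos];
        *pos += 1;
        if Some(tok) == close {
            return Ok(trees);
        }
        if let Some(c) = closer(tok) {
            // An opening delimiter starts a nested group.
            let inner = parse_until(tokens, pos, Some(c))?;
            trees.push(TokenTree::Delimited(tok.chars().next().unwrap(), inner));
        } else if matches!(tok, ")" | "]" | "}") {
            return Err(format!("unexpected closing delimiter `{}`", tok));
        } else {
            trees.push(TokenTree::Token(tok.to_string()));
        }
    }
    match close {
        None => Ok(trees),
        Some(c) => Err(format!("missing closing delimiter `{}`", c)),
    }
}

fn parse_trees(tokens: &[&str]) -> Result<Vec<TokenTree>, String> {
    let mut pos = 0;
    parse_until(tokens, &mut pos, None)
}

fn main() {
    println!("{:?}", parse_trees(&["printer", "!", "(", "print", "foo", ")"]));
    println!("{:?}", parse_trees(&["(", "foo"]));
}
```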
Whenever we refer to the "example invocation", we mean the following snippet:
printer!(print foo); // Assume `foo` is a variable defined somewhere else...
The process of expanding the macro invocation into the syntax tree println!("{}", foo) and then expanding that into a call to Display::fmt is called macro expansion, and it is the topic of this chapter.
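Putting the example definition and the example invocation together gives a complete program. The function name run_example and the value 5 are arbitrary choices for this sketch, and the trailing println! calls are only a hand-written approximation of what the "print twice" arm produces before println! itself is expanded further.

```rust
macro_rules! printer {
    (print $mvar:ident) => {
        println!("{}", $mvar);
    };
    (print twice $mvar:ident) => {
        println!("{}", $mvar);
        println!("{}", $mvar);
    };
}

fn run_example() {
    let foo = 5;

    // The example invocation: matches the first arm.
    printer!(print foo);

    // Matches the second arm.
    printer!(print twice foo);

    // Hand-written approximation of the expansion of the line above.
    println!("{}", foo);
    println!("{}", foo);
}

fn main() {
    run_example();
}
```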
There are two parts to MBE expansion: parsing the definition and parsing the invocations. Interestingly, both are done by the macro parser.
Basically, the MBE parser is like an NFA-based regex parser. It uses an algorithm similar in spirit to the Earley parsing algorithm. The macro parser is defined in compiler/rustc_expand/src/mbe/macro_parser.rs.
The interface of the macro parser is as follows (this is slightly simplified):
fn parse_tt(
    &mut self,
    parser: &mut Cow<'_, Parser<'_>>,
    matcher: &[MatcherLoc]
) -> ParseResult
We use these items in the macro parser:
- parser is a reference to the state of a normal Rust parser, including the token stream and parsing session. The token stream is what we are about to ask the MBE parser to parse. We will consume the raw stream of tokens and output a binding of metavariables to corresponding token trees. The parsing session can be used to report parser errors.
- matcher is a sequence of MatcherLocs that we want to match the token stream against. They're converted from token trees before matching.
In the analogy of a regex parser, the token stream is the input and we are matching it against the pattern matcher. Using our examples, the token stream could be the stream of tokens containing the inside of the example invocation print foo, while matcher might be the sequence of token (trees) print $mvar:ident.
The output of the parser is a ParseResult, which indicates which of three cases has occurred:
- Success: the token stream matches the given matcher, and we have produced a binding from metavariables to the corresponding token trees.
- Failure: the token stream does not match matcher. This results in an error message such as "No rule expected token blah".
- Error: some fatal error has occurred in the parser. For example, this happens if there is more than one pattern match, since that indicates the macro is ambiguous.
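The Success and Failure cases can be illustrated with a deliberately tiny matcher. This is a sketch under heavy simplifying assumptions, not the compiler's parse_tt: ToyMatcherLoc, ToyParseResult, toy_parse_tt, and is_ident are hypothetical names, tokens are plain strings, only literal tokens and ident metavariables are supported, and matching is a single left-to-right pass rather than an NFA. Since such a matcher cannot be ambiguous, the Error case is left out, and the failure messages are merely illustrative strings.

```rust
use std::collections::HashMap;

/// Toy matcher positions: a literal token or an `ident` metavariable.
enum ToyMatcherLoc {
    Token(&'static str),
    MetaVarIdent(&'static str),
}

#[derive(Debug, PartialEq)]
enum ToyParseResult {
    /// Bindings from metavariable names to the matched tokens.
    Success(HashMap<String, String>),
    /// The token stream does not match the matcher.
    Failure(String),
}

/// Stand-in for "calling back to the normal Rust parser" for an ident.
fn is_ident(tok: &str) -> bool {
    let mut chars = tok.chars();
    match chars.next() {
        Some(c) if c.is_alphabetic() || c == '_' => {
            chars.all(|c| c.is_alphanumeric() || c == '_')
        }
        _ => false,
    }
}

fn toy_parse_tt(tokens: &[&str], matcher: &[ToyMatcherLoc]) -> ToyParseResult {
    let mut bindings = HashMap::new();
    let mut input = tokens.iter();
    for loc in matcher {
        match (loc, input.next()) {
            // A literal token must match exactly.
            (ToyMatcherLoc::Token(expected), Some(tok)) if *expected == *tok => {}
            // A metavariable binds the next token if it is an identifier.
            (ToyMatcherLoc::MetaVarIdent(name), Some(tok)) if is_ident(tok) => {
                bindings.insert(name.to_string(), tok.to_string());
            }
            (_, Some(tok)) => {
                return ToyParseResult::Failure(format!("no rules expected the token `{}`", tok));
            }
            (_, None) => {
                return ToyParseResult::Failure("unexpected end of macro invocation".to_string());
            }
        }
    }
    // Leftover input means the matcher did not cover the whole stream.
    match input.next() {
        Some(tok) => ToyParseResult::Failure(format!("no rules expected the token `{}`", tok)),
        None => ToyParseResult::Success(bindings),
    }
}

fn main() {
    // Matcher for the arm `print $mvar:ident` of the example definition.
    let matcher = [ToyMatcherLoc::Token("print"), ToyMatcherLoc::MetaVarIdent("mvar")];
    println!("{:?}", toy_parse_tt(&["print", "foo"], &matcher));
    println!("{:?}", toy_parse_tt(&["print", "twice", "foo"], &matcher));
}
```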
The full interface is defined in the macro parser module (compiler/rustc_expand/src/mbe/macro_parser.rs).
The macro parser does pretty much exactly the same as a normal regex parser with one exception: in order to parse different types of metavariables, such as ident, block, expr, etc., the macro parser must sometimes call back to the normal Rust parser.
As mentioned above, both definitions and invocations of macros are parsed using the macro parser. This is extremely non-intuitive and self-referential. The code to parse macro definitions is in compiler/rustc_expand/src/mbe/macro_rules.rs. It defines the pattern for matching a macro definition as $( $lhs:tt => $rhs:tt );+. In other words, a macro_rules definition should have in its body at least one occurrence of a token tree followed by => followed by another token tree. When the compiler comes to a macro_rules definition, it uses this pattern to match the two token trees per rule in the definition of the macro using the macro parser itself.
In our example definition, the metavariable $lhs would match the patterns of both arms: (print $mvar:ident) and (print twice $mvar:ident). And $rhs would match the bodies of both arms: { println!("{}", $mvar); } and { println!("{}", $mvar); println!("{}", $mvar); }. The parser would keep this knowledge around for when it needs to expand a macro invocation.
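Because the definition pattern is itself an ordinary MBE pattern, it can be tried out from user code. In the sketch below, count_rules and rule_count are hypothetical names: count_rules uses the pattern $( $lhs:tt => $rhs:tt );+ to match the body of the example definition (passed in as plain tokens) and reports how many rules, i.e. $lhs token trees, it found. It is meant as an analogy for how the definition is split into arms, not as the compiler's code.

```rust
// Matches one or more `lhs => rhs` pairs separated by `;`, where each side
// is a single token tree, and counts the pairs.
macro_rules! count_rules {
    ($( $lhs:tt => $rhs:tt );+) => {
        [$( stringify!($lhs) ),+].len()
    };
}

fn rule_count() -> usize {
    // The body of the example `printer` definition, used purely as input
    // tokens here; nothing inside it is expanded.
    count_rules!(
        (print $mvar:ident) => {
            println!("{}", $mvar);
        };
        (print twice $mvar:ident) => {
            println!("{}", $mvar);
            println!("{}", $mvar);
        }
    )
}

fn main() {
    println!("{}", rule_count());
}
```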
When the compiler comes to a macro invocation, it parses that invocation using the same NFA-based macro parser that is described above. However, the matcher used is the first token tree ($lhs) extracted from the arms of the macro definition. Using our example, we would try to match the token stream print foo from the invocation against the matchers print $mvar:ident and print twice $mvar:ident that we previously extracted from the definition. The algorithm is exactly the same, but when the macro parser comes to a place in the current matcher where it needs to match a non-terminal (e.g. $mvar:ident), it calls back to the normal Rust parser to get the contents of that non-terminal. In this case, the Rust parser would look for an ident token, which it finds (foo) and returns to the macro parser. Then, the macro parser proceeds in parsing as normal. Also, note that exactly one of the matchers from the various arms should match the invocation; if there is more than one match, the parse is ambiguous, while if there are no matches at all, there is a syntax error.
For more information about the macro parser's implementation, see the comments in compiler/rustc_expand/src/mbe/macro_parser.rs.
There is an old and mostly undocumented effort to improve the MBE system, give
it more hygiene-related features, better scoping and visibility rules, etc. There
hasn't been a lot of work on this recently, unfortunately. Internally, macro
macros use the same machinery as today's MBEs; they just have additional
syntactic sugar and are allowed to be in namespaces.
Procedural macros are also expanded during parsing, as mentioned above. However, they use a rather different mechanism. Rather than having a parser in the compiler, procedural macros are implemented as custom, third-party crates. The compiler will compile the proc macro crate and the specially annotated functions in it (i.e. the proc macros themselves), passing them a stream of tokens.
The proc macro can then transform the token stream and output a new token stream, which is synthesized into the AST.
It's worth noting that the token stream type used by proc macros is stable, so rustc does not use it internally (since our internal data structures are unstable). The compiler's token stream is rustc_ast::tokenstream::TokenStream, as previously. This is converted into the stable proc_macro::TokenStream and back in rustc_expand::proc_macro and rustc_expand::proc_macro_server. Because the Rust ABI is unstable, we use the C ABI for this conversion.
TODO: more here. #1160
Custom derives are a special type of proc macro.
TODO: more? #1160