
End-to-end queries in rustc

Brief description of the early compilation

Compilation of a crate starts by parsing the top-level module, then recursively follows the declared modules. When the parser encounters a macro call, an expansion point is created, to be filled later during macro expansion.

This unexpanded AST is traversed to collect definitions: macro definitions, item-likes (functions, consts, types), generic parameters, lifetimes, closures. Everything that may be accessed by name or from another crate is a definition. Definitions are identified by their DefPath: the sequence of nested definitions from the crate root. Definitions are assigned an index, the LocalDefId, to ease manipulation.
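
For illustration, here is a hypothetical module annotated with the DefPath of each definition it contains (the names are invented and the DefPath rendering is approximate):

mod config {
    // DefPath: crate_root::config — the module is a definition.

    // DefPath: crate_root::config::Options
    pub struct Options {
        // Fields are definitions too: crate_root::config::Options::threads
        pub threads: usize,
    }

    // DefPath: crate_root::config::parse
    pub fn parse(s: &str) -> Options {
        // Closures are definitions as well, even though they have no name:
        // crate_root::config::parse::{closure#0}
        let to_num = |s: &str| s.len();
        Options { threads: to_num(s) }
    }
}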

Once macros have been gathered, the AST is expanded by replacing the expansion points with AstFragments, which may contain arbitrary AST nodes. The macro invocations and definitions inside those new AstFragments, including new macro definitions, are then collected to participate in macro expansion and name resolution.

Once a fixed point is reached, there should be no remaining expansion point. The AST lints are called, and we proceed to lowering.

Lowering transforms the AST tree into the HIR tree. Its purpose is to:

  • resolve all the names from the AST;
  • gather and resolve in-band lifetimes;
  • desugar some constructs into simpler ones
    (for loops, try blocks, async blocks, impl Trait).

As a consequence, lowering may create new definitions as it runs (in-band lifetimes for instance). Once the HIR is built, it is indexed.
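
As an illustration of the desugaring item above, a for loop becomes roughly the following, written back as surface Rust (the actual lowering produces HIR directly and uses language items and hygienic bindings):

fn sum(xs: Vec<u32>) -> u32 {
    let mut total = 0;
    // Original: `for x in xs { total += x; }`
    // Approximate desugaring:
    match IntoIterator::into_iter(xs) {
        mut iter => loop {
            match Iterator::next(&mut iter) {
                None => break,
                Some(x) => {
                    total += x;
                }
            }
        },
    };
    total
}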

The index_hir() query walks the HIR to organise the nodes in a manner suitable for incremental compilation. The HIR is split into owners: item-like nodes and exported macro definitions. Not all definitions are HIR owners: generic parameters, lifetimes and closures are not. An owner's enclosing node is accessible as hir_owner(local_def_id), and the contained nodes are accessible as hir_owner_nodes(local_def_id). This separation reduces incremental invalidation when only the body of a function is modified.

Indexing is also responsible for computing HIR parents, making it possible to walk the tree from a node to the crate root. Note: the parent of a HIR owner is not always an owner; it can be a statement.

Objective

The end objective is to eventually avoid re-parsing a file that has not changed since the last compilation session.

This document aims to be a simplified description of the current and target query systems. Some subtleties of the current implementation have deliberately been ignored. If such imprecisions hide a notable undescribed difficulty, please let the author know.

We chose to push passes into the query system from the bottom up. In order, this requires us to:

  1. Perform HIR indexing for each owner independently;
  2. Make HIR lowering a query;
  3. Make AST expansion a query;
  4. Make AST parsing a query.

The principal difficulty is that the compiler triggers evaluation by iterating on definitions and invoking queries on them. If lowering becomes a query, we will end up creating definitions while iterating on them elsewhere. During an incremental session, the dependency graph may even try to evaluate queries on definitions which are yet to be created.

Incremental HIR indexing

Indexing the HIR collects the HirIds of the HIR tree, and builds two maps:

  • from the HirId to the HIR node;
  • from the HirId to the node parent's HirId.

The two maps are then saved for access by two queries:

  • hir_owner, dedicated to accessing HIR owners;
  • hir_owner_nodes, which gives access to the nodes inside the owner.

For now, HIR indexing walks the HIR tree for the full crate in order, and builds the two maps at once.

This indexing should be changed to only walk the HIR starting from a HIR owner, and stop when encountering an enclosed owner. The difficulty will be in computing the parent of the HIR owners.

Implementation:

  • #82891: make HIR parenting and definition parenting consistent;
  • #83114: create a hir_owner_parent query whose purpose is to map a HIR owner to its parent's HirId;
  • #82681: perform indexing for each HIR owner independently;
  • #83158: create a new type OwnerId as a refinement for LocalDefId, to be used as argument for hir_* queries.

Current state:

digraph {
    node [fontname=Courier, shape=box];
    HIR [label="krate()
items: OwnerId -> Node"];
    index [label="index_hir()
nodes: HirId -> Node
parents: HirId -> HirId"];
    owner [label="hir_owner(OwnerId)
node: Node
parent: HirId"];
    owner_nodes [label="hir_owner_nodes(OwnerId)
local_nodes: ItemLocalId -> Node
local_parents: ItemLocalId -> ItemLocalId
"];
    
    HIR -> index;
    index -> {owner owner_nodes}
}

Objective:

digraph {
    node [fontname=Courier, shape=box];
    HIR [label="krate()
items: OwnerId -> Node"];
    index [label="index_hir(OwnerId)
local_nodes: ItemLocalId -> Node
local_parents: ItemLocalId -> ItemLocalId
child_item_parents: LocalDefId -> ItemLocalId"];
    definitions [label="definitions()
def_parent: LocalDefId -> OwnerId"];
    owner [label="hir_owner(OwnerId)
node: Node"];
    owner_parent [label="hir_owner_parent(OwnerId)\nparent: HirId"];
    owner_nodes [label="hir_owner_nodes(OwnerId)
local_nodes: ItemLocalId -> Node
local_parents: ItemLocalId -> ItemLocalId"];

    HIR -> index;
    index -> {owner owner_nodes};
    index -> owner_parent [label="at def_parent"];
    definitions -> owner_parent;
    {rank=same; HIR definitions};
}

Notation:

  • Each node is a query: the first line is the name and invocation key, the remaining lines are the returned fields;
  • Edges point in the direction of the data flow, from the callee to the caller;
  • [] denotes a collection (Vec, HashSet);
  • -> denotes an associative collection (BTreeMap, HashMap);
  • these graphs describe information flow, they are not an accurate description of what rustc will actually do.
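
Under the objective layout, hir_owner_parent could be computed by combining definitions() and index_hir(), roughly as sketched below. This is exposition only: the names mirror the objective graph rather than rustc's actual API, id.def_id is assumed to give the underlying LocalDefId, and the crate root (which has no parent) is omitted.

fn hir_owner_parent(tcx: TyCtxt<'_>, id: OwnerId) -> HirId {
    // Which owner encloses `id`? Provided by the `definitions()` query.
    let parent_owner: OwnerId = tcx.definitions().def_parent[&id.def_id];
    // Where inside that owner does `id` appear? Provided by `index_hir`.
    let local_id: ItemLocalId =
        tcx.index_hir(parent_owner).child_item_parents[&id.def_id];
    HirId { owner: parent_owner, local_id }
}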

Incremental lowering

Trying to turn earlier passes into queries quickly hits two walls. rustc compilation is essentially pull-based, the principal pulling keys being a traversal of the full HIR tree and the iteration over all definitions. However, AST->HIR lowering is allowed to create new definitions as part of its desugaring. As a consequence, new definitions may pop out of thin air while we are iterating over all definitions.

This behaviour can actually be cured quite easily, by splitting the definition table per owner, and iterating over the definitions using a graph traversal over queries:

struct HIR {
    /// Items that were created by lowering this owner.
    children: FxHashMap<OwnerId, HIR>,
    ..
}

query lower_to_hir(tcx: TyCtxt<'_>, id: OwnerId) -> HIR;

fn for_each_definition(tcx: TyCtxt<'_>, f: impl Fn(LocalDefId)) {
    let mut work = vec![CRATE_DEF_ID];

    while let Some(head) = work.pop() {
        let hir = tcx.lower_to_hir(head);
        f(head);
        work.extend(hir.children.keys());
    }
}

In rustc, this will be implemented using a HIR visitor traversing the whole crate.

Graph of the query system

This system starts with a fully expanded AST expanded_ast, along with the collected definitions fragment_definitions.

digraph {
    node [fontname=Courier, shape=box];
    edge [concentrate=true];

    AST [label="item_ast(OwnerId)
ast: Ast"];

    HIR [label="lower_to_hir(OwnerId)
node: Node
children: OwnerId -> Node
child_item_parents: LocalDefId -> ItemLocalId"];

    crate [label="hir_crate()
owners: [OwnerId]"];

    owner [label="hir_owner(OwnerId)
node: Node"];
    owner_parent [label="hir_owner_parent(OwnerId)
parent: HirId"];
    owner_nodes [label="hir_owner_nodes(OwnerId)
local_nodes: ItemLocalId -> Node
local_parents: ItemLocalId -> ItemLocalId"];

    AST -> HIR;
    HIR -> owner_nodes [label="flattening children"];
    owner_nodes -> owner;
    HIR -> owner_parent [label="at def_parent"];

    owner_nodes -> crate [label="recursively\nfrom root"];
}

The definition of owner_nodes(id) takes care of fetching information either from lower_to_hir(id) or from lower_to_hir(id.parent).children.
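
A possible shape for that dispatch is sketched below. This is exposition only: def_parent and index_nodes are assumed helpers standing for the definition-parent lookup and the per-owner indexing, OwnerNodes stands for the result of hir_owner_nodes, and the crate root case is omitted.

fn hir_owner_nodes(tcx: TyCtxt<'_>, id: OwnerId) -> OwnerNodes {
    // Owners synthesised while lowering their parent have no AST of their
    // own: their nodes live in the parent's `children` map.
    if let Some(node) = tcx.lower_to_hir(def_parent(tcx, id)).children.get(&id) {
        index_nodes(node)
    } else {
        // Otherwise, lower this owner's own AST and index the result.
        index_nodes(&tcx.lower_to_hir(id).node)
    }
}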

Incremental late resolution

WIP

Incremental macro expansions

A similar iteration scheme allows us to perform AST expansion as a query, as long as all the parsing is done beforehand. When trying to make AST expansion or parsing incremental, we face the limitations of calling queries by LocalDefId: definitions do not exist yet at that stage, or are an inconvenient way to split the unexpanded AST. As a consequence, we need to change the key representation.

We chose to split the parsed AST according to expansion points ExpnId, which identify AST nodes containing macro invocations. In order to actually call queries with an ExpnId, we need a way to convert a LocalDefId into an ExpnId in a situation where all the LocalDefIds do not exist yet. A LocalDefId is just a shorthand for a DefPath; walking this definition path upwards recovers the expansion that created each definition, as the find_expansion query below shows.

struct ExpandedAst {
    fragment: AstFragment,
    children: FxHashMap<OwnerId, ExpnId>,
}

query expand_from(tcx: TyCtxt<'_>, ex: ExpnId) -> ExpandedAst;

query find_expansion(tcx: TyCtxt<'_>, id: OwnerId) -> ExpnId {
    let def_path = tcx.def_paths[id];
    match def_path.parent {
        // This is the crate root.
        None => ROOT_EXPN_ID,
        Some(parent) => {
            let parent_expn = tcx.find_expansion(parent);
            let expanded_ast = tcx.expand_from(parent_expn);
            expanded_ast.children[id]
        }
    }
}

Note: the migration from the current system will also require a few refactorings. For one, we will need to stop having two-phase initialization of ExpnData. For incremental expansion to be worthwhile, we will need to migrate all AST passes to use HIR or AST fragments.

Graph of the query system

This system starts with a fully expanded AST expanded_ast, along with the collected definitions fragment_definitions.

digraph {
    node [fontname=Courier, shape=box];
    edge [concentrate=true];

    resolver_fragment [label="fragment_definitions()
def_parent: ExpnId -> (LocalDefId -> OwnerId)
def_children: ExpnId -> (OwnerId -> [LocalDefId])
values: ExpnId -> (Symbol -> LocalDefId)
types: ExpnId -> (Symbol -> LocalDefId)
macros: ExpnId -> (Symbol -> LocalDefId)"];
    expand [label="expanded_ast()
ast: ExpnId -> AstFragment
children: OwnerId -> ExpnId"];

    expand -> dexp [label="walk def-path\nfrom root"];
    { rank=same; resolver_fragment expand }

    dexp [label="find_expansion(OwnerId)\nexpansion_path: [ExpnId]"];

    expand -> { AST ast_def } [label="recursively"];
    resolver_fragment -> resolver [label="merge\nfrom root"];
    resolver_fragment -> ast_def;

    dexp -> { rank=same; AST resolver ast_def }

    resolver [label="resolver(OwnerId)
reachable_values: Symbol -> LocalDefId
reachable_types: Symbol -> LocalDefId"]
    ast_def [label="ast_definitions(OwnerId)
def_parent: OwnerId
def_children: [LocalDefId]"]
    AST [label="item_ast(OwnerId)
ast: Ast
children: [OwnerId]"];

    HIR [label="item_hir(OwnerId)
node: Node
extra_definitions: [LocalDefId]
"];
    definitions [label="definitions(OwnerId)
def_parent: OwnerId
children: [LocalDefId]"];

    index [label="index_hir(OwnerId)
local_nodes: ItemLocalId -> Node
local_parents: ItemLocalId -> ItemLocalId
child_item_parents: LocalDefId -> ItemLocalId"];
    owner [label="hir_owner(OwnerId)
node: Node"];
    owner_parent [label="hir_owner_parent(OwnerId)\nparent: HirId"];
    owner_nodes [label="hir_owner_nodes(OwnerId)
local_nodes: ItemLocalId -> Node
local_parents: ItemLocalId -> ItemLocalId"];

    {resolver AST ast_def} -> HIR -> index;
    {HIR ast_def} -> definitions;
    index -> {owner owner_nodes};
    definitions -> owner_parent;
    index -> owner_parent [label="at def_parent"];

    definitions -> "all_definitions()" [label="recursively\nfrom root"];
}

Incremental macro resolution

Macro resolution and expansion are currently performed in a fixed-point loop, where expanded macros can influence the resolutions in future expansions. This very subtle order is documented here and here. The subtlety comes from the conflict between scopes that rule name resolution, and the possibility for macro expansions to add names to an existing scope.

Expansions form a tree: each expansion has a parent and an index inside this parent, defined in declaration order. From an expansion, we are able to enumerate all its parents recursively, as well as all the expansion points that appear before it in source order.

The conservative restricted shadowing rule is as follows. Consider a macro invocation \(I\) and a resolution \(A\). Let \(A'\) be another resolution; then:

  • if \(A\) is closer in scope than \(A'\), select \(A\);
  • otherwise, \(A'\) is closer in scope than \(A\):
    • if \(A'\) comes from a parent expansion of \(A\) or \(I\), select \(A'\);
    • otherwise, report an error.

If we replace error reporting by a speculative choice of either candidate, the resolution can be performed using the following algorithm:

  1. Look for candidates in the parent expansions of the invocation \(I\).
    This corresponds to the very conservative shadowing in petrochenkov's comment. As it is strictly more conservative, it will find strictly fewer candidates. As all candidates in this search are found in parent expansions of \(I\), there is never any ambiguity, and we just need to find the closest in terms of scoping.
  2. If we have a candidate \(A\), we have already considered all resolutions that come from a parent expansion of \(A\) or \(I\). Therefore, any other candidate must be ambiguous. Do not bother looking for them, and return \(A\).
  3. Otherwise, we need to find a candidate that is not in a parent expansion of \(I\). We walk all the expansion points earlier than \(I\) in reverse order, and return the first candidate \(A\) we find. We have already considered all candidates from parent expansions of \(A\) in (1), so any resolution we find later will be ambiguous.

In case of an ambiguity, no error is reported; rather, a candidate is picked deterministically. Once the AST is fully expanded, the full resolver is computed in preparation for lowering; this full resolver will be able to report ambiguities.

// For exposition only
#[derive(Clone)]
struct ExpnId {
    parent: Option<Box<ExpnId>>,
    index: u64,
}

fn resolve_macro(
    tcx: TyCtxt<'_>,
    invocation: Ident,
    scope: ScopeId,
    start: ExpnId,
) -> Option<OwnerId> {
    let mut candidate = None;
    for eid in start.parents() {
        let local_defs = tcx.fragment_definitions(eid);
        if let Some(new_candidate) = local_defs.resolve_macro(invocation, scope) {
            if candidate.as_ref().map_or(true, |c| new_candidate.scope >= c.scope) {
                candidate = Some(new_candidate);
            }
        }
    }
    // We have found a candidate: go with it.
    // The construction of the full resolver will
    // report ambiguities from expansions we did not consider.
    if candidate.is_some() {
        return candidate;
    }
    // We do not have a candidate yet:
    // speculatively perform expansions until we find something.
    for eid in start.parents() {
        for c in (0..eid.index).rev() {
            let eid = ExpnId { index: c, ..eid.clone() };
            let local_defs = tcx.fragment_definitions(eid);
            let candidate = local_defs.resolve_macro(invocation, scope);
            if candidate.is_some() {
                // Found something: return it. The full resolver
                // will report an error if it was actually ambiguous.
                return candidate;
            }
        }
    }
    None
}
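
The parents() iterator used above is not spelled out. With the exposition-only ExpnId, one possible implementation, which yields the starting expansion first and then walks up to the root, is:

impl ExpnId {
    /// Enumerate this expansion and all its ancestors, up to the root.
    fn parents(&self) -> impl Iterator<Item = ExpnId> {
        std::iter::successors(Some(self.clone()), |e| e.parent.as_deref().cloned())
    }
}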

Incremental parsing

The same trick can be employed to switch from an ExpnId (which refers to an AST node) to a reference to a file. This allows out-of-line modules to be parsed incrementally. For macros, we can reuse rust-analyzer's trick: create virtual files into which a span can point. In that case, the macro invocation is performed by locate_tokens, and the resulting tokens are returned by tokens.

Eventually, item bodies could be handled lazily, by creating an artificial expansion point in parse and stopping the expansion there in expand_from.

Design proposal

What follows is a description of the end state of the query system once all passes have been included in it. This is a maximalist proposal; restricted versions can be obtained by cutting the graph horizontally. Some changes can be performed gradually from the current behaviour of rustc; others are more invasive and will require careful thought in themselves.

Key stability

In order to re-run queries from one compilation session to the next, we need to save a stable representation of their keys. Definitions have a stable and terse representation using their DefPathHash. Likewise, ExpnIds form a tree, so an expansion can be identified by its path from the root. For files, we could get away with storing the canonical filesystem path for physical files, and the stable version of (DefId, ExpnId) for macro expansions.
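
As an illustration, such stable keys could be gathered in a single enum along these lines. This is a sketch with invented names: DefPathHash is rustc's stable hash of a definition path, and an expansion is identified here by its path of indices in the expansion tree.

use std::path::PathBuf;

enum StableQueryKey {
    // A definition: stable across sessions via its DefPathHash.
    Def(DefPathHash),
    // An expansion point: the sequence of child indices from the root of
    // the expansion tree down to this expansion.
    Expn(Vec<u32>),
    // A physical file: its canonical filesystem path.
    File(PathBuf),
    // A macro-expansion virtual file: the stable form of (DefId, ExpnId).
    MacroFile(DefPathHash, Vec<u32>),
}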

Description of the queries

We separate the queries according to their key, to emphasize where the change of representation occurs. A consolidated sketch of their signatures follows the lists.

Splitting by file:

  • tokens: tokenize the bytes in a given file;
  • parse: read a file, tokenize it, parse it, and create expansion points (ExpnId) at chosen nodes: macro invocations, out-of-line modules;
  • locate_tokens: find the correct file for out-of-line modules.

Splitting by AST expansion points:

  • collect_definitions: walk the AST fragment to create definitions, and fill the resolver
    (this corresponds to rustc_resolve/def_collector.rs);
  • resolve_macro: find the definition of a macro from its name;
  • expand_from: expand one AST expansion point into an AST fragment, either by invoking a macro, or by expanding an out-of-line module;
  • find_expansion: map a definition to the nested expansions that produce it, by walking its DefPath from the crate root.

Splitting by AST/HIR owners:

  • item_ast: clean up the AST returned by expand_from;
  • resolver: gather all the definitions accessible from an expansion point for name resolution;
  • ast_definitions: extract the definitions inside the current owner;
  • item_hir: lower the AST to HIR for each owner, and store newly created definitions alongside;
  • definitions: merge ast_definitions and extra definitions from lowering;
  • index_hir: walk the HIR to record each node's parent;
  • hir_owner, hir_owner_nodes and hir_owner_parent are projections from index_hir.
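
Gathered in the query pseudocode style used earlier, these could look roughly as follows; the result types are placeholders for the fields listed in the graph below, not actual rustc types.

query tokens(tcx: TyCtxt<'_>, file: File) -> TokenStream;
query parse(tcx: TyCtxt<'_>, file: File) -> AstFragment;
query locate_tokens(tcx: TyCtxt<'_>, ex: ExpnId) -> File;

query collect_definitions(tcx: TyCtxt<'_>, ex: ExpnId) -> FragmentDefinitions;
query resolve_macro(tcx: TyCtxt<'_>, name: Symbol, ex: ExpnId) -> ExpnId;
query expand_from(tcx: TyCtxt<'_>, ex: ExpnId) -> ExpandedAst;
query find_expansion(tcx: TyCtxt<'_>, id: OwnerId) -> Vec<ExpnId>;

query item_ast(tcx: TyCtxt<'_>, id: OwnerId) -> Ast;
query resolver(tcx: TyCtxt<'_>, id: OwnerId) -> Resolver;
query ast_definitions(tcx: TyCtxt<'_>, id: OwnerId) -> AstDefinitions;
query item_hir(tcx: TyCtxt<'_>, id: OwnerId) -> Hir;
query definitions(tcx: TyCtxt<'_>, id: OwnerId) -> Definitions;
query index_hir(tcx: TyCtxt<'_>, id: OwnerId) -> OwnerNodes;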

Name resolution

Name resolution happens on the pre-expansion AST (for macro resolution) and on the expanded AST (for values, types and lifetimes). Gathering all the names can be performed on the

Bootstrapping

The query system is initialized using the top-level crate module:

  • find_expansion(LOCAL_CRATE) = [CRATE_EXPN_ID];
  • locate_tokens(CRATE_EXPN_ID) = { "./lib.rs", .. }.

The entry point is the all_definitions iterator, which walks all HIR owners to find nested definitions. It can be implemented as a simple DAG visit using the definitions query.
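
A sketch of that visit, in the same style as the for_each_definition example earlier: is_owner is an assumed helper, OwnerId and LocalDefId are used interchangeably as in that example, and a visited set would be needed if a definition could be reached through several owners.

fn all_definitions(tcx: TyCtxt<'_>) -> Vec<LocalDefId> {
    // Every definition found so far, starting with the crate root.
    let mut out = vec![CRATE_DEF_ID];
    // Worklist of HIR owners whose children remain to be visited.
    let mut owners = vec![CRATE_DEF_ID];
    while let Some(owner) = owners.pop() {
        for &child in tcx.definitions(owner).children.iter() {
            out.push(child);
            // Only owners have a `definitions(OwnerId)` entry of their own.
            if is_owner(tcx, child) {
                owners.push(child);
            }
        }
    }
    out
}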

Graph of the query system

digraph {
    node [fontname=Courier, shape=box];
    edge [concentrate=true];

    tokens [label="tokens(File)
tokens: TokenStream"]

    tokens -> parse;

    parse [label="parse(File)
ast: AstFragment"];

    parse -> sexp;

    sexp [label="locate_tokens(ExpnId)
source: File"]

    { parse sexp } -> { rank=same; resolver_fragment; expand };

    resolver_fragment [label="collect_definitions(ExpnId)
def_parent: LocalDefId -> OwnerId
def_children: OwnerId -> [LocalDefId]
values: Symbol -> LocalDefId
types: Symbol -> LocalDefId
macros: Symbol -> LocalDefId"];
    expand [label="expand_from(ExpnId)
ast: AstFragment
children: OwnerId -> ExpnId"];

    expand -> dexp [label="walk def-path\nfrom root"];
    resolver_fragment -> expand;

    dexp [label="find_expansion(OwnerId)\nexpansion_path: [ExpnId]"];

    { // Macro stuff
        edge [style=dotted];
        resolve_macro [label="resolve_macro(Symbol, ExpnId)\nsource: ExpnId"];
        { resolver_fragment expand } -> resolve_macro [label="previous"];
        resolve_macro -> { expand sexp };
        {rank=same; resolve_macro sexp }
    }

    expand -> { AST ast_def } [label="recursively"];
    resolver_fragment -> resolver [label="merge\nfrom root"];
    resolver_fragment -> ast_def;

    dexp -> { rank=same; AST resolver ast_def }

    resolver [label="resolver(OwnerId)
reachable_values: Symbol -> LocalDefId
reachable_types: Symbol -> LocalDefId"]
    ast_def [label="ast_definitions(OwnerId)
def_parent: OwnerId
def_children: [LocalDefId]"]
    AST [label="item_ast(OwnerId)
ast: Ast
children: [OwnerId]"];

    HIR [label="item_hir(OwnerId)
node: Node
extra_definitions: [LocalDefId]
"];
    definitions [label="definitions(OwnerId)
def_parent: OwnerId
children: [LocalDefId]"];

    index [label="index_hir(OwnerId)
local_nodes: ItemLocalId -> Node
local_parents: ItemLocalId -> ItemLocalId
child_item_parents: LocalDefId -> ItemLocalId"];
    owner [label="hir_owner(OwnerId)
node: Node"];
    owner_parent [label="hir_owner_parent(OwnerId)\nparent: HirId"];
    owner_nodes [label="hir_owner_nodes(OwnerId)
local_nodes: ItemLocalId -> Node
local_parents: ItemLocalId -> ItemLocalId"];

    {resolver AST ast_def} -> HIR -> index;
    {HIR ast_def} -> definitions;
    index -> {owner owner_nodes};
    definitions -> owner_parent;
    index -> owner_parent [label="at def_parent"];

    definitions -> "all_definitions()" [label="recursively\nfrom root"];
}