# AST("CST")/Parsing Evolution
## Background
Terminology around parsing can be quite opinionated, but I found the "language processing [Megamodel](https://grammarware.github.io/parsing/index.html)" to be helpful.
### Relevant structures, in rustc (quoted names are from the "Megamodel"):
* Token tree
* "Parse Tree", but for a simpler grammar
* token streams are sequences of such token trees
* AST (`syntax::ast`)
* (either "Concrete" or "Abstract") "Syntax Tree"
* not "Parse Tree" because not all tokens are present
* arguably not "Abstract" because little is destroyed
* the main example I can think of is trailing commas
* uniquely heap-allocated (using `P<T>` which newtypes `Box<T>`)
* losslessly identified by global `NodeId`s
* nodes lossily refer to their source with `Span`s
* HIR (`rustc::hir`)
* "Abstract Syntax Tree"
* not "Concrete" because of syntactic desugarings
* `(expr)` -> `expr`
* `for pat in expr {block}` -> `loop {...}`
* etc.
* started off identical to AST, from where it diverged
* arguably a graph, not a tree, because of the per-node parents
* long-term we might want to get rid of intra-`hir::{Ty,Expr,Pat}` parenting (and keep the definition graph), but that's out of scope here
Note: there is a distinct lack of a "Parse Forest/Tree" for the Rust grammar, and this is where `wg-grammar` and @matklad's `libsyntax2` come in.
While eventually they might replace the AST layer entirely, that is longer-term and therefore out of scope here.
Regardless, I am **not** proposing a rename to the "Megamodel" terminology, and will keep using the rustc names throughout the rest of this.
### Operations on the AST, in rustc:
* Parsing (`syntax::parse`)
* from token streams to AST nodes
* hand-written Recursive Descent
* `LL(k)`-like with `k < 4`, AFAIK
* pro: custom error diagnostics/recovery
* cons: no formal grammar, harder to incrementalize
* doesn't assign `NodeId`s immediately
* Name resolution (`rustc_resolve`)
* requires `DefId`s for definitions and `NodeId`s for uses
* visits the AST to collect definitions in scope and resolve uses
* highly stateful because it runs in lockstep with macro expansion
* hard to incrementalize
* Transformation framework (`syntax::fold`)
* nodes are transformed by-value (`Node -> Node`)
* provides no way to automatically track changes made
* **NOTE**: I'm assuming [#58061](https://github.com/rust-lang/rust/pull/58061) didn't happen, since AFAICT it's a dead end we have to back out of
* Expansion (`syntax::expand`, `syntax_ext`)
* primary user of the AST transformation framework
* assigns IDs (`NodeId` or different) as-needed
* runs name resolution on partially expanded ASTs
* `proc_macro` macros/attributes/custom derives
* consume and produce token streams
* requires converting AST nodes back to token streams
* main originator of non-source tokens (and inaccurate `Span`s)
* builtin/legacy macros/attributes/derives
* consume and produce AST nodes
* main originator of non-parsed AST nodes
* with the notable exception of `#[cfg]` stripping
* AST->HIR lowering (`rustc::hir::lowering`)
* integrates name resolution results (using `NodeId`s)
* desugars several syntactic constructs
## AST design constraints
Given the operations that need to be supported, I consider these necessary:
### A. Global node identity (unique, precise, lossless)
Needed for:
* attaching (e.g. name resolution) information without mutation
* referring to definitions (albeit those could be factored out of the AST)
* mapping changes for incremental reparsing/recompilation
Possible representations:
* pointers (included just for completeness, not great)
* flat index (`NodeId` today)
* two-level ("container" and "element") index (`DefId` and `HirId` today)
* "container" can be several things
* "crate" for `DefId`
* "local definition" (second level of `DefId`) for `HirId`
* can be generalized to some sort of "source container" for AST
* could be made quite compact, see [#53560](https://github.com/rust-lang/rust/issues/53560)
* "container" and "element range"
* variation on the above, but oriented around leaves, not nodes
* leaf elements could be source bytes (like in `Span`), or tokens
* tokens are more plausible, as macro expansion can produce artificial ones
* friendlier to IDEs, but unlikely to have an effect on incremental
While AST nodes today contain their identity (`NodeId`), this needn't be the case.
We only need to know the ID while visiting, so nodes could store their "sizes" (i.e. the number of `NodeId`s, tokens, etc. used by the entire subtree), and the visitor would automatically adjust its ID by that relative amount.
However, this is incompatible with mutation, and therefore would require frozen-on-creation nodes to put into practice.
Possible assignment strategies:
* lazy/late (as today)
* only makes sense for IDs contained in the nodes themselves
* pro: only generates the minimum necessary
* con: requires more mutation
* eager, on node creation (typically parsing)
* pro: correct-by-default, without further mutation
* con: may generate far more IDs than used in the final AST
* not a serious concern if the IDs are relative
### B. Mapping back to origin/source (can be imprecise/lossy)
Needed for:
* diagnostics referring to the source code of a node
* IDEs
Possible representations of origin info:
* can be referred to from AST nodes either:
* inline
* via an interned index / pointer
* a combination of the above (`Span` today)
* "source range" and expansion info (`SpanData` today)
* "container" and "token range"
* could be shared with the node identity
* each "container" would need:
* token stream (potentially shared)
* source and/or expansion info
* pro: improved diagnostics for `proc_macro` output
* pro: more stable (e.g. across whitespace changes)
* con: requires an extra step to get a "source range"
Possible representations of "source ranges":
* flattened byte ranges (used by `SourceMap` today)
* all the sources ("files") are ("logically") concatenated
* pro: maximally compact, as there are no gaps between sources
* con: finding the source (and intra-source range) requires binary search
* source and intra-source byte range
* can reuse the concept of "container" (of tokens) for "source"
* pro: immediate access to the appropriate source
* con: a bit more wasteful in encoding space
### C. Mapping back to a token stream (lossless)
Needed for passing AST nodes to proc macros/attributes/custom derives.
Possible implementations:
* convert (i.e. "pretty-print") the AST, disregarding its origin
* lossy even for a "CST"; the AST would need to become a "Parse Tree" to be lossless
* even lossier if you pretty-print to strings instead of tokens
* hygiene information is the more important part here
* used today for AST nodes that were modified
* **NOTE**: this is based on an unsound heuristic, due to pretty printing being to a string (see above)
* track the original token stream/token range
* easy to do for parsing
* much harder after any modifications
* used today for unmodified AST nodes (caveat: see above)
## Action plan
### **TODO** switch to arenas and rethink all this ~~Immediate (high-priority)~~
~~First off, I think we should revert [#58061](https://github.com/rust-lang/rust/pull/58061) (`Folder` -> `MutVisitor`), which the rest of the document assumes didn't happen.
While `T -> T` and `&mut T -> ()` are somewhat equivalent, only the former lets you vary the input and output independently, and without that I don't see how we can make progress.~~
~~While today's AST implementation may not be great for IDEs or incremental reparsing/recompilation, constraints A and B are met, while C (mapping AST nodes to token streams) isn't.~~
~~For improving the situation with C, I propose we track the original `TokenStream` on any node that may end up passed to a proc macro, and invalidate it (resorting to lossy pretty-printing) on any modification.
(While being able to detect modifications might be useful to incremental, we shouldn't be focusing on that right now)
To detect modifications, we need to change the transformation ("folding") infrastructure, in one of several ways:~~
* perform runtime equality checks/hashing (it'd likely be too expensive)
* enforce that everything is allocated completely frozen, and compare pointers/IDs to determine equality
* change the signatures to propagate the information, e.g.:
```rust
fn fold_expr(&mut self, e: P<Expr>) -> P<Expr>;
```
would become
```rust
fn fold_expr(&mut self, e: P<Expr>) -> Change<P<Expr>>;
```
where `Change` can be
```rust
pub enum Change<T> {
    New(T),
    Old,
}
```
This raises a second problem: because we transform by-value, this would require additional copying.
We can instead pass references, effectively requiring the AST to be modified in a "persistent" manner:
```rust
fn fold_expr(&mut self, e: Rc<Expr>) -> Change<Rc<Expr>>;
```
Or if we arena-allocate the AST (which gets close to the previous alternative):
```rust
fn fold_expr(&mut self, e: &'arena Expr) -> Change<&'arena Expr>;
```
* (ab)use generativity to enforce equality:
```rust
fn fold_expr(&mut self, e: Old<'o, P<Expr>>) -> Change<'o, P<Expr>>;
```
where `Change` and `Old` would be
```rust
pub enum Change<'o, T> {
    New(T),
    Old(Old<'o, T>),
}
pub struct Old<'o, T> {
    value: T,
    _marker: PhantomData<&'o mut &'o ()>,
}
```
This would, sadly, require automating a lot of code like this:
```rust
// NOTE: this can't be allowed outside of "privileged" (wrt `Old`) code.
// Otherwise, the generativity-based enforcement flies out of the window.
match e {
    // NOTE: using `Old(x)` as a shorthand for `Old { value: x, .. }`.
    Old(Expr::Unary(unop, inner)) => match inner.fold(self) {
        Change::Old(Old(inner)) =>
            Change::Old(Old(Expr::Unary(unop, inner))),
        Change::New(inner) =>
            Change::New(Expr::Unary(unop, inner)),
    },
    Old(Expr::Binary(binop, left, right)) => match (left.fold(self), right.fold(self)) {
        (Change::Old(Old(left)), Change::Old(Old(right))) =>
            Change::Old(Old(Expr::Binary(binop, left, right))),
        (Change::New(left), Change::New(right))
        | (Change::New(left), Change::Old(Old(right)))
        | (Change::Old(Old(left)), Change::New(right)) =>
            Change::New(Expr::Binary(binop, left, right)),
    },
    // ...
}
```
So while this may be somewhat appealing theoretically, it's unlikely to work well for us.
**TODO**: hide this section, since it's big and likely not worthwhile