---
breaks: false
---
Links:
* [parsing in rust-analyzer](https://hackmd.io/XoQrzR8GRLa64jpjylQ7Bw?both)
* Original [libsyntax2](https://github.com/rust-lang/rfcs/pull/2256) RFC
# Library-ifying rustc Parser
In the end, we want to see rust-analyzer and rustc share the same syntax tree and parsing code.
However, the design of a shared syntax tree would have to differ significantly from what rustc uses today.
This is discussed in the [libsyntax2 RFC](https://github.com/rust-lang/rfcs/pull/2256).
The ["Design Goals"](https://hackmd.io/XoQrzR8GRLa64jpjylQ7Bw?both#Design-Goals) section of the rust-analyzer document, linked above, is also a good summary of what rustc's libsyntax is not. The syntax tree design in rust-analyzer seems reasonable enough, but there are certainly plausible alternatives.
Additionally, the syntax tree is a data structure which is used extensively in the compiler, and swapping it atomically is hard. For this reason, I propose to focus on **extracting only the parser itself**, and continuing to use different syntax tree data structures in rust-analyzer and in rustc.
I hope that, by itself, making a library out of the parser will be beneficial for rustc.
Subjectively, extracting the lexer in a similar way made it significantly easier to understand how strings are tokenized.
I hope that this works for the parser as well.
Additional benefits:
* rust-analyzer won't need to maintain a separate, buggy parser
* this work should make eventual library-ification of the AST itself easier
* this work should make experiments with different ASTs or parsing algorithms easier
The main cost is that we'll get a (likely significant) amount of glue code between the parser library and the concrete rustc AST.
This code hopefully won't be difficult to maintain (it should be stupid), but it will be there, slowing down compile time, runtime (?) and development time (but the last one might be offset by better factoring).
## Suggested Approach
1. Remove extra concerns from the parser.
For example, the parser should not load out-of-line modules (https://github.com/rust-lang/rust/issues/64197). In general, minimize the parser as far as possible within the current architecture.
2. Separate parsing from details of token streams.
As with the AST, sharing the `TokenStream` definition would be annoying.
It should be relatively straightforward to refactor the parser to work with an abstract iterator of simple, flat tokens. That is, the parser should look only at a token's kind, and not at its contents, with some targeted hack for contextual keywords.
* as a subtask, we migrate rustc to the proc-macro2 token model (https://github.com/rust-lang/rust/issues/63689)
3. Separate parsing from details of the AST.
This is the bulk of the work.
One possible interface for an AST-agnostic parser is the one used in rust-analyzer, though there might be better designs:
```rust
pub struct Token {
    pub kind: SyntaxKind,
    pub is_joined_to_next: bool,
}

/// Abstracts over where the parser's tokens come from.
pub trait TokenSource {
    fn current(&self) -> Token;
    fn lookahead_nth(&self, n: usize) -> Token;
    /// Is the current token the given (contextual) keyword?
    fn is_keyword(&self, kw: &str) -> bool;
    fn bump(&mut self);
}

/// Abstracts over the concrete syntax tree being built.
pub trait TreeSink {
    fn token(&mut self, kind: SyntaxKind, n_tokens: u8);
    fn start_node(&mut self, kind: SyntaxKind);
    fn finish_node(&mut self);
    fn error(&mut self, error: ParseError);
}

pub fn parse(
    token_source: &mut dyn TokenSource,
    tree_sink: &mut dyn TreeSink,
) { ... }
```
The benefit of the above interface is that it is super minimal, and should allow keeping the parser's code focused solely on parsing.
The main drawback is that it doesn't express the types of nodes, so there will be a significant amount of typing glue to bridge this interface with the typed, spanned AST in rustc.
A super-minimal and ugly proof of concept of a parser which produces both rustc and rust-analyzer style ASTs is available here: https://github.com/matklad/rust-analyzer/blob/9303a7e5781fa6255ddc7cb53aeb52643b68d765/crates/ra_syntax/src/parsing/ast_parsing.rs
4. Tweak parser resilience. Before we can use the parser in rust-analyzer, we must make sure that it handles incomplete and broken code in a useful way.
5. Remove existing rust-analyzer parser.
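
To make the `TokenSource`/`TreeSink` interface from step 3 a bit more concrete, here is a minimal sketch of how the pieces could fit together. Everything beyond the two traits is an assumption made up for illustration: the toy `SyntaxKind` variants, the `VecTokens` source, the `DumpSink` sink, the `dump_tree` helper, and the one-rule grammar (`struct Name` or `union Name`) are not rustc or rust-analyzer code. The point is only to show that the parser looks at token kinds, uses `is_keyword` as the targeted hack for the contextual keyword `union`, and never learns what tree the sink builds.

```rust
// Toy kinds (hypothetical): real parsers have hundreds of these.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SyntaxKind { StructKw, UnionKw, Ident, Eof, Item }

#[derive(Clone, Copy, Debug)]
pub struct Token { pub kind: SyntaxKind, pub is_joined_to_next: bool }

#[derive(Debug)]
pub struct ParseError(pub String);

pub trait TokenSource {
    fn current(&self) -> Token;
    fn lookahead_nth(&self, n: usize) -> Token;
    fn is_keyword(&self, kw: &str) -> bool;
    fn bump(&mut self);
}

pub trait TreeSink {
    fn token(&mut self, kind: SyntaxKind, n_tokens: u8);
    fn start_node(&mut self, kind: SyntaxKind);
    fn finish_node(&mut self);
    fn error(&mut self, error: ParseError);
}

// Hypothetical token source backed by a vector of (kind, text) pairs.
pub struct VecTokens<'a> { tokens: Vec<(SyntaxKind, &'a str)>, pos: usize }

impl VecTokens<'_> {
    fn get(&self, idx: usize) -> Token {
        // Reading past the end yields an endless stream of Eof tokens.
        let kind = self.tokens.get(idx).map_or(SyntaxKind::Eof, |t| t.0);
        Token { kind, is_joined_to_next: false }
    }
}

impl TokenSource for VecTokens<'_> {
    fn current(&self) -> Token { self.get(self.pos) }
    fn lookahead_nth(&self, n: usize) -> Token { self.get(self.pos + n) }
    fn is_keyword(&self, kw: &str) -> bool {
        self.tokens.get(self.pos).map_or(false, |t| t.1 == kw)
    }
    fn bump(&mut self) {
        if self.pos < self.tokens.len() { self.pos += 1; }
    }
}

// Hypothetical sink that renders the tree as indented text.
#[derive(Default)]
pub struct DumpSink { pub out: String, pub errors: Vec<String>, depth: usize }

impl TreeSink for DumpSink {
    fn token(&mut self, kind: SyntaxKind, _n_tokens: u8) {
        self.out.push_str(&format!("{}{:?}\n", "  ".repeat(self.depth), kind));
    }
    fn start_node(&mut self, kind: SyntaxKind) {
        self.out.push_str(&format!("{}{:?}\n", "  ".repeat(self.depth), kind));
        self.depth += 1;
    }
    fn finish_node(&mut self) { self.depth -= 1; }
    fn error(&mut self, error: ParseError) { self.errors.push(error.0); }
}

// Toy grammar: Item = (`struct` | `union`) Ident
pub fn parse(src: &mut dyn TokenSource, sink: &mut dyn TreeSink) {
    sink.start_node(SyntaxKind::Item);
    if src.current().kind == SyntaxKind::StructKw {
        sink.token(SyntaxKind::StructKw, 1);
        src.bump();
    } else if src.current().kind == SyntaxKind::Ident && src.is_keyword("union") {
        // Contextual keyword: lexed as Ident, reported to the sink as UnionKw.
        sink.token(SyntaxKind::UnionKw, 1);
        src.bump();
    } else {
        sink.error(ParseError("expected `struct` or `union`".to_string()));
    }
    if src.current().kind == SyntaxKind::Ident {
        sink.token(SyntaxKind::Ident, 1);
        src.bump();
    } else {
        sink.error(ParseError("expected a name".to_string()));
    }
    sink.finish_node();
}

// Convenience wrapper: parse a token list and return (tree dump, errors).
pub fn dump_tree(tokens: Vec<(SyntaxKind, &str)>) -> (String, Vec<String>) {
    let mut src = VecTokens { tokens, pos: 0 };
    let mut sink = DumpSink::default();
    parse(&mut src, &mut sink);
    (sink.out, sink.errors)
}
```

A rustc-flavored sink would implement the same four `TreeSink` methods but assemble typed, spanned AST nodes instead of text, which is where the typing glue mentioned above would live. Note that on a missing name the toy parser still closes the `Item` node and reports an error rather than bailing out, which is a very small version of the resilience asked for in step 4.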
## Looking at Swift
It is interesting that Swift has been solving (for several years now, I think?) exactly the same problem.
They have a traditional clang-inspired AST in the current compiler, as well as a newer full-fidelity IDE-friendly [libsyntax](https://github.com/apple/swift/tree/aba9c8b4c24d3c43aaf960211a5ecdc6ceb4ba46/lib/Syntax).
They use a different approach: instead of making the parser polymorphic in the type of tree it can produce, they made the parser produce *both* trees simultaneously. In other words, the parser contains explicit code for constructing both the AST and the CST (the CST code is much smaller overall).
The most recent progress is reported here: https://medium.com/@kitasuke/deep-dive-into-integrating-libsyntax-into-the-compiler-pipeline-2d478c8600a1
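
For contrast with the polymorphic design, the "produce both trees at once" idea can be sketched in a few lines. This is not Swift's code and does not mirror its data structures; the `Ast`, `Cst`, `lex`, and `parse_both` names and the toy grammar (integers joined by `+`) are all assumptions made for illustration. The single parsing loop feeds two trees: a lossy typed AST and a full-fidelity CST that keeps every character, whitespace included.

```rust
// Lossy, typed tree: what a compiler wants.
#[derive(Debug, PartialEq)]
pub enum Ast { Num(i64), Add(Box<Ast>, Box<Ast>) }

// Full-fidelity tree: what an IDE wants. Tokens keep their exact source text.
#[derive(Debug, PartialEq)]
pub enum Cst { Token(String), Node(&'static str, Vec<Cst>) }

impl Cst {
    /// Reconstructs the original source text from the tree.
    pub fn text(&self) -> String {
        match self {
            Cst::Token(t) => t.clone(),
            Cst::Node(_, children) => children.iter().map(Cst::text).collect(),
        }
    }
}

// Character classes: 0 = digit, 1 = whitespace, 2 = anything else.
fn class(c: char) -> u8 {
    if c.is_ascii_digit() { 0 } else if c.is_whitespace() { 1 } else { 2 }
}

/// Splits input into digit runs, whitespace runs, and single other characters.
fn lex(input: &str) -> Vec<String> {
    let mut tokens: Vec<String> = Vec::new();
    for c in input.chars() {
        let extends_run = class(c) != 2
            && tokens.last().and_then(|t| t.chars().next()).map(class) == Some(class(c));
        if extends_run {
            tokens.last_mut().unwrap().push(c);
        } else {
            tokens.push(c.to_string());
        }
    }
    tokens
}

/// Parses `num (+ num)*`, building the AST and the CST in the same pass.
/// Returns None on any syntax error (a real parser would recover instead).
pub fn parse_both(input: &str) -> Option<(Ast, Cst)> {
    let mut ast: Option<Ast> = None;
    let mut cst: Vec<Cst> = Vec::new();
    let mut expect_num = true;
    for tok in lex(input) {
        let is_trivia = tok.chars().all(char::is_whitespace);
        if !is_trivia {
            if expect_num {
                // AST side: fold the number into a left-associative sum.
                let n: i64 = tok.parse().ok()?;
                ast = Some(match ast.take() {
                    None => Ast::Num(n),
                    Some(lhs) => Ast::Add(Box::new(lhs), Box::new(Ast::Num(n))),
                });
            } else if tok != "+" {
                return None;
            }
            expect_num = !expect_num;
        }
        // CST side: every token, trivia included, is recorded verbatim.
        cst.push(Cst::Token(tok));
    }
    if expect_num {
        return None; // empty input or a trailing `+`
    }
    Some((ast?, Cst::Node("SumExpr", cst)))
}
```

In this sketch the CST side is a single `push` per token while the AST side carries the grammar-specific logic, which loosely matches the observation that the CST code is the smaller of the two. The trade-off against the `TreeSink` design is that the parser here knows about both tree types directly.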