stable MIR librarification

---
breaks: false
tags: steering, rustc
---

# stable MIR librarification

:::info
(FYI: This is the doc for [steering meeting compiler-team#498](https://github.com/rust-lang/compiler-team/issues/498), "Exposing a versioned, semi-stable MIR interface".)
:::

## Big Picture

Today, MIR is an "implementation detail" of `rustc`. Nonetheless, MIR provides an excellent foundation for tools, especially for those in the formal methods community who want to perform sophisticated analyses and make stronger guarantees about the behavior of Rust programs.

The Rust compiler can better serve the needs of such tools by establishing a *stable definition for MIR*.

### Meeting Goals

* Make `rustc` developers aware of the *problem* we want to solve,
* Establish the *scope* of an initial "stable MIR" deliverable,
* Identify the *owners* who want to take charge of driving the effort forward,
* Specify the *next steps* that the owners all agree upon.

## Background

### Goal

Developers, especially in the formal methods space, want to make tools that process MIR, the "middle IR" of rustc, as input. Some tools will perform direct analysis on the MIR itself. Other tools will translate the MIR into another format.

* There are some who have mentioned wanting to output MIR from their tool; e.g. as a [Coq backend][coq], and then it would presumably be input into `rustc` for a final binary. (It is a reasonable thing to consider doing, if only for internal testing purposes.)

[coq]: https://github.com/rust-lang/lang-team/blob/master/design-meeting-minutes/2021-02-17-Increasing-Trust-in-Rust-Compiler.md#notes

### Status Quo

Currently, there are two ways to make tools that operate on the MIR generated by `rustc`:

1. Make your own `rustc` backend, and work with the definition of MIR that `rustc` itself uses, or
2. Leverage the `--emit=mir` option to `rustc`, and ignore the emitted warning that "This output format is intended for human consumers only and is subject to change without notice."
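
To make the second option concrete, here is a minimal sketch of the kind of text scraping a tool ends up doing when it consumes `--emit=mir` output. Everything in it is an assumption for illustration: the sample text is hand-written to resemble a MIR dump and is not exact output from any compiler version, and `SAMPLE_MIR` and `basic_block_labels` are hypothetical names, not part of any `rustc` API.

```rust
/// Hand-written text that only approximates what `--emit=mir` prints.
/// Real output differs between compiler versions (assumption, not a real dump).
const SAMPLE_MIR: &str = "\
fn add(_1: i32, _2: i32) -> i32 {
    let mut _0: i32;

    bb0: {
        _0 = Add(_1, _2);
        return;
    }
}
";

/// Hypothetical helper: collects the labels of lines shaped exactly like `bbN: {`.
fn basic_block_labels(mir_text: &str) -> Vec<String> {
    mir_text
        .lines()
        .filter_map(|line| {
            // A block header is assumed to end with ": {" and start with "bb".
            let label = line.trim().strip_suffix(": {")?;
            let digits = label.strip_prefix("bb")?;
            let is_index = !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit());
            is_index.then(|| label.to_string())
        })
        .collect()
}

fn main() {
    // Prints the labels found in the sample text.
    println!("{:?}", basic_block_labels(SAMPLE_MIR));
}
```

The matcher only recognizes headers of the exact form `bbN: {`. A header that carries an extra annotation, such as `bb1 (cleanup): {`, is silently skipped, which is the sort of breakage a text format without stability guarantees invites.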
### Problems with status quo

Neither of these options is great.

The first approach requires adding a custom `rustc` backend[^backend], which is doable, but requires significant engineering and ongoing maintenance effort.

[^backend]: There may be other reasons to use a custom backend, since it provides access to the compiler state outside of MIR itself.

The second approach has two problems:

1. It violates the provided warning (which is in place because the Rust developers have wanted, and have utilized, the freedom to revise the definition of MIR as necessary while doing development work on `rustc`).
2. It is also unrealistic, because the emitted MIR does not contain sufficient information for most tools to do useful analysis.

Felix's opinion: The Rust community is reaching a point where there is enough interest in Rust itself, and specifically in providing stronger guarantees about certain Rust code, that the Rust compiler developers should loosen their grip on control over the definition of MIR.

:::info
There are actually two potential audiences for a separate definition of MIR.

One potential audience for a separate MIR definition is humans who would want to talk about the semantics of *fragments of MIR*: "what does this snippet of code do?"

However, this document is aimed largely at tool developers; for them, the consumers of the MIR are the tools themselves: computational procedures that want to analyze/execute library/procedures represented by the MIR. [^cannot-please-everyone-or-can-you]

[^cannot-please-everyone-or-can-you]: If we *can* trivially please the audience of humans who want to look at fragments of MIR, we should keep them in mind. After all, people use `--emit=mir` today in large part because it *is* human-readable. Felix thinks they are not the primary target for this work, but wants to hear counter-arguments for why they should be given heavier consideration.
Celina notes that it should be a *secondary* goal; "it will be nice to have this for many reasons specially if we choose to enable MIR input to the compiler. Besides debugging, this could help with testing, prototyping and even documentation (e.g.: RFCs)."
:::

## Proposed Deliverable

Here's what we should do:

1. [Define stable-MIR](#Define-stable-MIR), which [remains independent from rustc's MIR,](#Remain-independent-from-rustc-MIR)
2. [Support stable-MIR as rustc codegen backend](#stable-MIR-as-a-codegen-backend) (i.e. a rustc-MIR to stable-MIR conversion)
3. [and, potentially, also accept stable-MIR as a `rustc` input.](#maybe-stable-MIR-as-input-format)

### Define stable-MIR

Provide a separate, stable (probably [semantically versioned][semver]) definition for MIR (which we call "stable-MIR") that third parties can interface with.

* This "definition" would take the form of a set of type definitions, representation invariants, and a description of how to interpret the meaning of each defined structure.
* The definition would need to be able to capture *all* of the information needed for analysis tools to run. One way to think of this: whatever is in the stable-MIR value should be enough for a codegen backend to create the object code for that function/library.
    * This has two immediate consequences:
        1. it needs to be able to represent type definitions (which are not included in today's `--emit=mir` output from `rustc`), and
        2. it needs to represent other payloads, potentially arbitrary extensions, that today's MIR omits.[^tool-attributes]
* for now, we leave open the question of where the definition for stable-MIR comes from; there are many potential starting points, and that choice is not meant to be debated in this meeting. The real question is *who* will own choosing that starting point.[^stork]
* we also leave open the question of whether to use a subset of Rust's type system when defining the structures (and enums/unions/etc) of stable-MIR.
  (The main motivation for using a limited subset of Rust's rich type system would be to enable tools to be written in a larger set of alternative languages.) The answer to this question is not intended to be debated in this meeting; we mention it solely to make it clear that it is an open question to be resolved later.

[semver]: https://semver.org/

[^tool-attributes]: As noted by Celina: “One thing we noticed is that codegen and our tool sometimes require information from HIR instead of MIR. Ideally the MIR should be a complete representation for the next stage of the compiler. One good example is tool attributes”

[^stork]: We *could* factor `rustc_middle::mir` into its own crate; but it is not clear whether this would actually be useful in the short term. We could also add a separate definition that is sufficiently expressive to meet the criteria outlined in this document.

### Remain independent from rustc-MIR

`rustc` should continue to be allowed to revise its own internal definition of MIR (which we call "rustc-MIR"), especially with respect to how the `rustc` backend is architected. If we want to add new features that enable certain optimizations, or add extra state that improves the `rustc` user experience, we should remain free to do that.

* There are ways to make very flexible formats[^twobit] that would allow `rustc` to revise its definition while still being able to emit backwards-compatible instances of the format. But it is not clear how well those strategies will apply to MIR or variants of MIR.

[^twobit]: pnkfelix's personal favorite example of this is the Intermediate Representations used by the Twobit Scheme compiler, where all of the intermediate forms are themselves valid Scheme programs that have the same semantics as the original input (metadata resulting from e.g.
static analyses is embedded as literal data in the program representation): http://www.larcenists.org/Twobit/syntax.html

### stable-MIR as a codegen backend

We should extend `rustc` with the ability to convert its own instance of MIR into an instance of the stable-MIR.

* This might look like an in-memory conversion. In this case, one would have to determine the appropriate plumbing to hook analysis tools up to the running `rustc`, or otherwise hand off the stable-MIR value to the linked tool via shared memory or similar.
* Alternatively, this might look like serializing to some wire/file format.
    * Such a file format might be human readable text that can be parsed into an instance of the definition; but this is not ideal for all scenarios. Pretty-printing and parsing may be too expensive for some use cases.
* Supporting both in-memory conversion and full-blown serialization of some form is probably best, but this is a decision that can be left up to the owners of this project.

### maybe: stable-MIR as input format

If we choose to make `rustc` accept stable-MIR as an *input format*, we should be careful about what guarantees we make on what rustc accepts as input.

For example, if we accept any MIR inputs, then we should probably support round-tripping at a bare minimum. I.e., this diagram should [commute][]:

[commute]: https://www.math3ma.com/blog/commutative-diagrams-explained

```graphviz
digraph g {
    A -> B [label="rustc --emit=stable-mir "]
    B -> C [label=" rustc --input-is-mir interim.mir "]
    A -> C [label=" rustc input.rs"]
    A[label="input.rs"]
    B[label="interim.mir"]
    C[label="output LLVM IR"]
}
```

but we should probably *not* guarantee that arbitrary stable-MIR will be accepted by the compiler, even if it is syntactically valid (and type-checks, etc).

## Ownership

Who at this meeting is interested?
:smile:

### Cross-cutting concerns

By its definition, stable-MIR is a project with stakeholders from `rustc` itself and stakeholders that are entirely external to `rustc`. (If we end up in a situation where only `rustc` developers can use stable-MIR, then we will have failed to deliver what was promised.)

This cross-cutting nature raises several questions: *Where* would the stable MIR definition live? *Who* will be in charge of initial implementation? *Who* will be in charge of maintaining it? *Who* will maintain its interface with `rustc` itself?

We do not think we should try to answer all those questions today. We just want to acknowledge them as questions that do need to be answered.

### wg-stable-mir

We think a working group, let us call it wg-stable-mir, is warranted.

We want to know who from the compiler team would be able to help that working group.

(We also think we will want to engage stakeholders from outside the compiler team to join wg-stable-mir, so that we get direct engagement with the people who will benefit most from this work.)

### Organizing development

We want to enable independent development of stable-mir while also having ongoing integration with `rustc`. This is a known problem for many subprojects of `rustc` (such as `miri`, `rust-analyzer`, etc).

Really, this is largely a matter of surveying the existing solutions to this problem, and choosing whichever one has worked best so far. We should avoid inventing a new process here if we can avoid it.

## References

related T-lang meeting from 2021-02-17: https://github.com/rust-lang/lang-team/blob/master/design-meeting-minutes/2021-02-17-Increasing-Trust-in-Rust-Compiler.md

# Appendix

## Open Follow-up Questions

The following are various other questions that came up during the drafting of this document, but they do not need to be answered today.

* MIR semantics and MIR Phases
    * Do we want to have one MIR definition that represents all its phases?
      From [MirPhase definition](https://github.com/rust-lang/rust/blob/4459e720bee5a741b962cfcd6f0593b32dc19009/compiler/rustc_middle/src/mir/mod.rs#L131): *"These phases describe dialects of MIR. Since all MIR uses the same datastructures, the dialects forbid certain variants or values in certain phases"*.
    * Which phase should we start with?
    * Should stable MIR contain polymorphic (generic) code?
* Human readable MIR: Today, a goal for `--emit=mir` is that it remain human-readable, since its stated purpose is *for* humans to read it in the course of debugging the compiler.
    * Is that also a goal for the stable MIR output format?
* Version semantics:
    * What would require a major bump?
        * New intrinsics?
        * Experimental features?

# Meeting Questions/Discussion Topics below Here

## related work: formality, chalk-ty

We should delegate some parts of this to "related work":

* The formality repo is an attempt to define MIR semantics
* The chalk-ty repo is an attempt to represent Rust types in a way that can be shared across many projects
* there *may* be a need for a stable-ty library too, that's not clear to me

## suggestion: expose as queries first

nikomatsakis: I mostly agree with what's in this document, but I have some specific proposals I would advocate for...

* The MIR format should be defined in Rust data structures
* The semantics should be defined by the formality repo; formality's definition of MIR should align closely to the definition that appears in the stable-mir library
    * There should be an option to emit formality MIR, too, either in rustc or here
* The initial way to get access to this should be via rustc's query system
    * you would initialize rustc as today and do a `stable_mir(def_id)` query
    * This allows us to encode `Ty<'tcx>` and also to peek at HIR attributes (at least to start)
    * Over time, these can be phased out in favor of `chalk-ty` (or perhaps a `stable-ty`) and adding relevant HIR attributes directly into `stable-mir`
* This frontend should at least *start* outside of rustc to allow rapid iteration
    * the clippy approach makes sense to me
* The frontend will probably want to evolve to be similar to how miri works -- that logic can be factored out and shared amongst tools. I'm not sure of the details but I know it does clever stuff to give access to the stdlib etc.

Things I would defer:

* We should not attempt to serialize to files or read in from files yet
    * That could come eventually, but it's good to focus on an MVP

## would we expose analyses on stable mir?

tmandry: The formal methods doc [talks about](https://hackmd.io/qLk-wnwtStOhCOBeF5YuOw?view#The-borrow-checker-apis-are-difficult-to-use-and-unstable-) wanting to see borrow check information. Should we expose this? How would we, if borrow check is written in terms of actual MIR?

#### should we consider instead splitting mir-for-analysis from mir-for-optimizations-and-codegen (and making the first semi-stable)?

tmandry: I've heard t-compiler members talk in the past about how MIR serves dual purposes but doesn't necessarily do both well.

> [name=bjorn3] we are moving unsafeck to thir. if we also move (polonius) borrowck to thir, mir would only be used for optimizations and codegen.
>
> [name=tmandry] I wasn't aware of these plans, but it could also defeat some of the use cases in the other doc. Though as long as we expose polonius data and make it "joinable" with the stable MIR it should be useful enough.

## Unstable implementation details

> [name=bjorn3] How would stable MIR handle unstable rust features? Many stable things internally use unstable rust features we may (or do) want to change. For example closures internally use the rust-call calling convention which we likely want to replace with variadic generics once those get introduced. Calling `Box<dyn FnOnce()>` uses unsized arguments. Unsizing like `Rc<[u8; 16]> as Rc<[u8]>` uses several unstable traits and non-trivial logic inside the codegen backend. Any method call on a trait object uses yet other unstable traits. Trait objects also use the unstable (and recently changed) vtable layout internally.