# Polonius inputs
(Note for potential readers on hackmd: this WIP write-up describes my understanding of the inputs prior to the recent changes incorporating facts about liveness. It will still contain the relation `region_live_at` even though the work moving this computation from rustc to Polonius is being completed.)
update: most of this document has been posted as a [PR](https://github.com/rust-lang/polonius/pull/126)
---
In this document, we'll describe what inputs Polonius needs: Polonius computes its analyses starting from "input facts", which can be seen as a little database of information about a piece of Rust code (most often: a function).
In this analogy, the database is the [`AllFacts` struct](https://github.com/rust-lang/polonius/blob/master/polonius-engine/src/facts.rs#L6-L65), which contains all the data in tables (or relations), here as a handful of `Vec`s of rows. The table rows are these "facts": this terminology comes from Datalog, which Polonius uses to do its computations. It is also why the `rustc` flag [outputting this data](https://rust-lang.github.io/polonius/generate_inputs.html) is named `-Znll-facts`, and why the files themselves are named `*.facts`.
In order to be used in various contexts (mainly: in-memory from `rustc`, and from on-disk test files in the Polonius repository) this structure is generic over the types of facts, only requiring them to be [`Atom`](https://github.com/rust-lang/polonius/blob/master/polonius-engine/src/facts.rs#L90-L94)s. The goal is to use interned values, represented as numbers, in the Polonius computations.
These generic types of facts are the concepts Polonius manipulates: abstract `regions` containing `loans` at `points` in the CFG. If you include the liveness computation and the move/overwrite analysis, there are also `variables` and `move paths`. The relations are their semantics (including the specific relationships between the different facts).
Whenever we describe the facts/Atom types stored in the input relations, we'll use this naming convention:
- an `R` prefix for regions
- an `L` prefix for loans
- a `P` prefix for points in the CFG
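As a rough illustration of what "interned values, represented as numbers" means, here is a minimal sketch that maps textual names to dense numeric indices. The `Interner` type and its methods are hypothetical and are not part of polonius-engine; real fact types would additionally need to satisfy the `Atom` trait.

```rust
use std::collections::HashMap;

/// Hypothetical interner (not a polonius-engine type): maps textual names,
/// such as point or region names, to dense numeric indices.
#[derive(Default)]
struct Interner {
    indices: HashMap<String, usize>,
    names: Vec<String>,
}

impl Interner {
    /// Returns the index for `name`, allocating a fresh one on first sight.
    fn intern(&mut self, name: &str) -> usize {
        if let Some(&index) = self.indices.get(name) {
            return index;
        }
        let index = self.names.len();
        self.indices.insert(name.to_string(), index);
        self.names.push(name.to_string());
        index
    }

    /// Recovers the original name for an index, e.g. to print a fact.
    fn name(&self, index: usize) -> &str {
        &self.names[index]
    }
}
```

One such table per kind of fact (regions, loans, points) would keep the `R`, `L` and `P` index spaces separate.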
Let's start with the simplest relation: the representation of the Control Flow Graph, in the `cfg_edge` relation.
### 1. `cfg_edge`
`cfg_edge(P1, P2)`: as its name suggests, this relation stores that there's a CFG edge between the point `P1` and the point `P2`.
For each MIR statement location, 2 Polonius points are generated: the "Start" point and the "Mid" point (some of the other Polonius inputs will later be recorded at each of the points). These 2 points are linked by an edge recorded in this relation.
Then, another edge will be recorded, linking this MIR statement to its successor statement(s): from the mid point of the current location to the start point of the successor location. Even though it's encoded differently in MIR, the same applies when the successor location is in another block: the edge links the mid point of the current location to the start point of the successor block's first location.
For example, for this MIR (edited from the example for clarity, and to only show the parts related to the CFG):
```rust
bb0: {
    ...          // bb0[0]
    ...          // bb0[1]
    goto -> bb3; // bb0[2]
}
...
bb3: {
    ...          // bb3[0]
}
```
we will record these input facts (as mentioned before, they'll be interned) in the `cfg_edge` relation, shown here as pseudo Rust:
```rust
cfg_edge = vec![
    // statement at location bb0[0]:
    (bb0-0-start, bb0-0-mid),
    (bb0-0-mid, bb0-1-start),
    // statement at location bb0[1]:
    (bb0-1-start, bb0-1-mid),
    (bb0-1-mid, bb0-2-start),
    // terminator at location bb0[2]:
    (bb0-2-start, bb0-2-mid),
    (bb0-2-mid, bb3-0-start),
];
```
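The two rules described above (a Start-to-Mid edge per location, then a Mid-to-Start edge to each successor location) can be sketched in compilable Rust. This is only an illustration: `Point`, `Block` and `cfg_edges` are hypothetical names, and the block representation is far simpler than the MIR `rustc` actually works with.

```rust
/// A Polonius point: a MIR location (block index, statement index),
/// taken either at its Start or at its Mid point.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Point {
    Start(usize, usize),
    Mid(usize, usize),
}

/// Hypothetical simplified basic block: how many locations it has
/// (statements plus terminator) and which blocks its terminator targets.
struct Block {
    locations: usize,
    successors: Vec<usize>,
}

/// Emits the `cfg_edge` facts for a list of blocks.
fn cfg_edges(blocks: &[Block]) -> Vec<(Point, Point)> {
    let mut edges = Vec::new();
    for (b, block) in blocks.iter().enumerate() {
        for i in 0..block.locations {
            // Every location links its Start point to its Mid point.
            edges.push((Point::Start(b, i), Point::Mid(b, i)));
            if i + 1 < block.locations {
                // Successor within the same block.
                edges.push((Point::Mid(b, i), Point::Start(b, i + 1)));
            } else {
                // Terminator: link to the first location of each target block.
                for &target in &block.successors {
                    edges.push((Point::Mid(b, i), Point::Start(target, 0)));
                }
            }
        }
    }
    edges
}
```

With block 0 having three locations and a single successor, block 3, the first six edges produced match the facts listed above.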
### 2. `borrow_region`
`borrow_region(R, L, P)`: this relation stores that the region `R` may refer to data from loan `L` from the point `P` and onwards.
For every borrow expression, a loan will be created and there will be a fact stored in this relation to link this loan to the region of the borrow expression.
For example, in
```rust
let mut a = 0;
let r = &mut a; // this creates loan L0
//      ^ let's call this 'a
```
there will be a `borrow_region` fact linking `L0` to `'a` at this point. This loan will flow along the CFG and the subset relationships between regions, and the computation will require that its terms are respected or it will generate an illegal access error.
### 3. `universal_region`
`universal_region(R)`: this relation stores that the region `R` is a "universal region"/"free region"/"placeholder region" and not defined in the MIR body we're checking.
Those are the default universal regions (`'static`) and the ones defined on functions which are generic over a lifetime. For example, with
```rust
fn my_function<'a, 'b>(x: &'a u32, y: &'b u32) {
...
}
```
the `universal_region` relation will also contain facts for `'a` and `'b`.
### 4. `killed`
`killed(L, P)`: this relation stores that a prefix of the path borrowed in loan `L` is overwritten at the point `P`.
For example, with
```rust
let mut a = 1;
let mut b = 2;
let mut q = &mut a;
let r = &mut *q; // loan L0 of `*q`
// `q` can't be used here, one has to go through `r`
q = &mut b; // killed(L0)
// `q` and `r` can be used here
```
the loan `L0` will be `killed` by the assignment, and this fact will be stored in the relation. When we compute which loans each region contains along the CFG, the `killed` points will stop this loan's propagation to the next CFG point.
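To make the effect of `killed` on propagation concrete, here is a small sketch. It is a deliberate simplification that only looks at `borrow_region`, `cfg_edge` and `killed`, and ignores liveness and the subset relationships between regions. The function name `propagate_loans` and the `u32` fact types are assumptions made for the example.

```rust
use std::collections::BTreeSet;

type Region = u32;
type Loan = u32;
type Point = u32;

/// Computes which loans each region contains at each point, flowing loans
/// forward along CFG edges and stopping at points where the loan is killed.
fn propagate_loans(
    borrow_region: &[(Region, Loan, Point)],
    cfg_edge: &[(Point, Point)],
    killed: &[(Loan, Point)],
) -> BTreeSet<(Region, Loan, Point)> {
    let mut contains: BTreeSet<(Region, Loan, Point)> =
        borrow_region.iter().copied().collect();
    let mut worklist: Vec<(Region, Loan, Point)> = borrow_region.to_vec();

    while let Some((region, loan, point)) = worklist.pop() {
        // A loan killed at this point does not reach the successor points.
        if killed.contains(&(loan, point)) {
            continue;
        }
        for &(from, to) in cfg_edge {
            if from == point && contains.insert((region, loan, to)) {
                worklist.push((region, loan, to));
            }
        }
    }
    contains
}
```

In this sketch the loan is still present at the killing point itself; it is only prevented from reaching that point's successors.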
### 5. `outlives`
`outlives(R1, R2, P)`: this relation stores that the region `R1` outlives region `R2` at the point `P`.
This is the standard Rust syntax `'a: 'b` where the *lifetime* `'a` outlives the lifetime `'b`. From the point of view of regions as sets of loans, this is seen as a subset-relation: with all the loans in `'a` flowing into `'b`, `'a` contains a subset of the loans `'b` contains.
The type system defines subtyping rules for references, which will create `outlives` facts to relate the reference type to the referent type.
For example,
```rust
let a: u32 = 1;
let b: &u32 = &a;
//            ^ let's call this 'a
//     ^ and let's call this 'b
```
To be valid, this last expression requires that the type `&'a u32` is a subtype of `&'b u32`. This requires `'a: 'b` and the `outlives` relation will contain this basic fact that `'a` outlives / is a subset of / flows into `'b` at this point.
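The "flows into" reading can be sketched as a small fixpoint computation: every loan contained in `R1` is added to `R2` for each `outlives(R1, R2, P)` fact. For brevity this sketch considers a single, implicit point `P`. The name `flow_loans` and the `u32` fact types are assumptions made for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Region = u32;
type Loan = u32;

/// Propagates loans along `outlives` facts (all taken at the same point)
/// until nothing changes: if `r1` outlives `r2`, every loan in `r1` must
/// also be in `r2`.
fn flow_loans(
    outlives: &[(Region, Region)],
    mut loans: BTreeMap<Region, BTreeSet<Loan>>,
) -> BTreeMap<Region, BTreeSet<Loan>> {
    let mut changed = true;
    while changed {
        changed = false;
        for &(r1, r2) in outlives {
            // Copy the loans of `r1` first, so that `r2` can be mutated after.
            let from: Vec<Loan> = loans.get(&r1).into_iter().flatten().copied().collect();
            let into = loans.entry(r2).or_default();
            for loan in from {
                changed |= into.insert(loan);
            }
        }
    }
    loans
}
```

After the fixpoint, the loans of `r1` are a subset of the loans of `r2` for every recorded pair, which is the subset relation described above.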
### 6. `region_live_at`
`region_live_at(R, P)`: this relation stores that a region `R` appears in a live variable at the point `P`.
These facts are created by the liveness computation.
### 7. `invalidates`
`invalidates(P, L)`: this relation stores that a loan `L` is invalidated at the point `P`.
Loans have terms which must be respected: ensuring shared loans are only used to read and not write or mutate, or that a mutable loan is the only way to access a referent. An illegal access of the path borrowed by the loan is said to *invalidate* the terms of the loan, and this fact will be recorded in the `invalidates` relation.
Since the goal of the borrow checking analysis is to find these possible errors, this relation is important to the computation. Any loans it contains, and in turn, any region containing those loans, are the key facts the computation tracks.
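As a hedged sketch of how these facts could be combined into errors: a loan invalidated at a point is reported when some region that is live at that point still contains it. The `region_contains` input stands for the result of the loan propagation (it is not an input relation), and the function name and `u32` fact types are assumptions made for the example.

```rust
type Region = u32;
type Loan = u32;
type Point = u32;

/// Reports every `(loan, point)` pair where the loan is invalidated while
/// a region that is live at that point still contains it.
fn find_errors(
    region_contains: &[(Region, Loan, Point)],
    region_live_at: &[(Region, Point)],
    invalidates: &[(Point, Loan)],
) -> Vec<(Loan, Point)> {
    let mut errors = Vec::new();
    for &(point, loan) in invalidates {
        let loan_is_live = region_contains.iter().any(|&(r, l, p)| {
            l == loan && p == point && region_live_at.contains(&(r, point))
        });
        if loan_is_live {
            errors.push((loan, point));
        }
    }
    errors
}
```

An invalidation at a point where no live region contains the loan is not reported in this sketch.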
---
Below are Albin's additions about the liveness facts:
### 8. `var_used`
`var_used(V, P)`: when the variable `V` is used for anything but a drop at point `P`. E.g., `x = y + z` would have `var_used(y, _)` and `var_used(z, _)` at that point.
### 9. `var_defined`
`var_defined(V, P)`: when the variable `V` is assigned to (killed) at point `P`. E.g., `x = 17` would have `var_defined(x, _)` at that point.
### 10. `var_drop_used`
`var_drop_used(V, P)`: when the variable `V` is possibly used in a drop at point `P`; i.e. there is an expression like `drop(x)` there.
### 11. `var_uses_region`
`var_uses_region(V, R)`: when the type of `V` includes the provenance `R`.
### 12. `var_drops_region`
`var_drops_region(V, R)`: when the type of `V` includes the provenance `R`, and `V` also implements a custom drop method which might need all of `V`'s data. **Fixme: this requires a lot more explanation**.
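To give an idea of how `region_live_at` could be derived from some of these facts, here is a sketch of a standard backward liveness analysis. It only uses `var_used`, `var_defined`, `var_uses_region` and `cfg_edge`, and leaves out the drop-related facts entirely, so it should not be read as the exact rules Polonius uses. The function name and the `u32` fact types are assumptions made for the example.

```rust
use std::collections::BTreeSet;

type Var = u32;
type Region = u32;
type Point = u32;

/// Derives `(region, point)` liveness facts: a variable is live where it is
/// used, and at any predecessor point that does not redefine it. A region is
/// live wherever a variable whose type mentions it is live.
fn region_live_at(
    cfg_edge: &[(Point, Point)],
    var_used: &[(Var, Point)],
    var_defined: &[(Var, Point)],
    var_uses_region: &[(Var, Region)],
) -> BTreeSet<(Region, Point)> {
    let mut var_live: BTreeSet<(Var, Point)> = var_used.iter().copied().collect();
    let mut worklist: Vec<(Var, Point)> = var_used.to_vec();

    // Walk the CFG edges backwards from each use.
    while let Some((var, point)) = worklist.pop() {
        for &(pred, succ) in cfg_edge {
            if succ == point
                && !var_defined.contains(&(var, pred))
                && var_live.insert((var, pred))
            {
                worklist.push((var, pred));
            }
        }
    }

    let mut live = BTreeSet::new();
    for &(var, point) in &var_live {
        for &(v, region) in var_uses_region {
            if v == var {
                live.insert((region, point));
            }
        }
    }
    live
}
```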