a-mir-formality subtyping

# a-mir-formality subtyping This article explains the subtyping and type inference algorithm currently implemented in a-mir-formality. ## Components The algorithm is a combination of a few pieces: * The “[simplesub]” approach to inference, extended with Rust’s [existing subtyping algorithm](http://smallcultfollowing.com/babysteps/blog/2014/07/09/an-experimental-new-type-inference-scheme-for-rust/). * Universes (which [simplesub] calls “levels”) to handle foralls. * A “subtype” algorithm that is deeply integrated with the trait solver, supporting a rich set of types including for-alls, exists, and implication types. * Built-in support for the “outlives” relation, which is deeply intertwined with inference. * Unifying type aliases, associated types, and TAITs with “alias types”. One other importance piece is that we separate out the approach to *environment elaboration* so that it only requires a very simple notion of syntactic equality. I won’t talk about that in this article. ## Simple Sub Our algorithm is — in some ways — a simplified version of the [simplesub] algorithm that excludes the intersection and union types and adds in a Rust-y twist or two. Interestingly, [simplesub] is itself a simplified version of Stephen Dolan’s [“Algebraic Subtyping”] thesis. So we are simple^2! [simplesub]: https://dl.acm.org/doi/10.1145/3409006 [Algebraic Subtyping]: https://www.cs.tufts.edu/~nr/cs257/archive/stephen-dolan/thesis.pdf ### Introducing an example type system based on ranges To explain simplesub, we’ll start with a very simple type system where types are bounded integer types and functions: ``` T = (N .. N) | fn(T) -> T N = some number ``` A type like `(3 .. 5)` is then a subtype of some larger range (e.g., `(0 .. 6)`), but not of a disjoint range like `(6 .. 10)` or a partially intersecting range like `(4 .. 6)`. If this seems abstract to you, consider that polonius thinks of regions as sets of leases, where each lease corresponds to some expression like `&x` in the source. We can interpret a range type like `(N .. N)` as a set of “lease ids”, in which case it’s the same basic concept. If we didn’t have to deal with type inference, we could write a simple “core subtyping” algorithm like so (in a kind of Rust-like pseudocode notation): ```rust fn subtype(T1, T2) -> bool { match (T1, T2) { ((N1 .. N2), (N3 .. N4)) if (N3 <= N1 && N2 <= N4) => true, (fn(T1) -> T2, fn(T3) -> F4)) => (subtype(T3, T1) && subtype(T2, T4)), _ => false, } } ``` ### Simplesub: introducing type variables Our goal is to extend this core subtyping algorithm to support type variables and inference. For example we might want to take this program… ```rust let mut x = 3; x = 5; let mut y = x; y = 8; ``` …and infer that `x: (3 .. 5)` and `y: (3 .. 8)`. To do that, we are going to extend our type grammar to include inference variables, written `?X`: ``` T = (N .. N) | fn(T) -> T | ?X ``` And then we are going to create a typing environment `Env` that contains the state of our type inferencer. To start, this is just a set of *inequalities*, where each inequality either has the form `(X <= T)` or `(X >= T)`: ``` Env = (Inequality …) Inequality = (X <= T) | (X >= T) ``` Now we can extend our subtyping algorithm. Before, when it only dealt with concrete types, it just returned a boolean. But now it’s going to return *either* a new environment or an error. The function is still basically a big match of cases that correspond to “valid subtypes”, and it now returns `Err(())` instead of false: ```rust fn subtype(mut Env, T1, T2) -> Result<Env, ()> { match (T1, T2) { …, _ => Err(()), } } ``` Let’s go through the cases of the match one by one. First we have the “known types” cases. These are easy, they work just like before, but we are now returning `Ok(Env)` instead of true and `Err(())` instead of false: ``` ((N1 .. N2), (N3 .. N4)) if (N3 <= N1 && N2 <= N4) => Ok(Env), (fn(T1) -> T2, fn(T3) -> F4)) => { Env = subtype(Env, T3, T1)?; subtype(Env, T2, T4)) } ``` The next case is more interesting. What do we do when we have `(?T <= U)` for some inference variable `?T` and some type `U`? The answer is that we want to add a bound `(?T <= U)` into the environment; but not *only* that. We also want to look for any bound `(?T >= V)` that is already present in the environment and ensure that `V <= U` is true:[^occurs] [^occurs]: In a-mir-formality, we do one additional step which is called the *occurs check* and which is designed to prevent infinite types like `Vec<Vec<Vec<…>>>`. The simplesub paper doesn’t have this check because they support recursive types, and hence could have a type like `μΤ. Vec<T>` to indicate that infinite sequence. a-mir-formality includes an occurs-check (though in writing this document, I realized it has a bug in it!) ``` (?T, U) => { Env = Env + (?T <= U)); for V in (V | (?T >= V) in Env) { Env = subtype(Env, V, U)?; } Ok(Env) } ``` The case for `(T, ?U)` is the same, but with the role of `<=` and `>=` reversed. Note that in this case we add an inequality `(?U >= T)` to the environment (i.e., the type variable always comes first): ``` (T, ?U) => { Env = Env + (?U >= T)); for V in (V | (?U <= V) in Env) { Env = subtype(Env, T, V)?; } Ok(Env) } ``` ### An intuition for simplesub Maybe you are starting to get an intuition for simplesub now — the core part is very simple: 1. For each type variable `?X`, we track their relationships to other things by either adding a `?X <= U` or `?X >= T` to the environment as appropriate. 2. Whenever we add an upper bound `?X <= U` to the environment, we find each lower bound `?X >= T` and ensure that `T <= U`. 3. Likewise, whenever we add a lower bound `?X >= T` to the environment, we find each upper bound `?X <= U` and ensure that `T <= U`. In other words, each lower bound `T` of `?X` must be less than each upper bound `U`. If that is ever violated, there is an error in the program somewhere. One thing that is cool about this algorithm is that we never compute the *least* upper bound or the *greatest* lower bound. That is really useful for rustc because for our higher-ranked types there is no “least” upper bound — there are a set of upper bounds, but none of them is necessarily *least* (i.e., we can find a bunch of mutual supertypes for two fn types, but it’s not clear one of them is the *best* supertype). ### An example with simplesub OK, let’s try this out with our code example and see how it works. If you recall, we wanted to infer types for this example: ```rust let mut x = 3; x = 5; let mut y = x; y = 8; ``` The way we do this is by first giving `x` and `y` fresh type variables as their types: ```rust let mut x: ?X = 3; x = 5; let mut y: ?Y = x; y = 8; ``` We’ll start with an empty environment `Env0 = []` and then compute a series of subtype relationships. Let’s go line by line in the source. ```rust let mut x: ?X = 3; // <— next line … ``` Here, the expression `3` would have the type `(3..3)`, and we would assign that type as a subtype of `?X`. The starting environment `Env0` ``` Env1 = subtype(Env0, (3..3), ?X) ``` The resulting environment `Env1` is just going to be a list with one inequality,`[(?X >= (3..3))]` (you can read that as “`?X` is some supertype of `(3..3)`”). ```rust let mut x: ?X = 3; // Env1 = subtype(Env0, (3..3), ?X) x = 5; // <— next line … ``` In a very similar way to the previous case, this results in a call: ``` Env2 = subtype(Env1, (5..5), ?X) ``` This again comes into the case where the type variable `?X` is a supertype of some other type, in this case `(5..5)`. As a result, we will add another bound of `(?X >= 5..5)` into the environment, resulting in `Env2 =[(?X >= (3..3)) (?X >= (5..5))]` — i.e,. `?X` must be a supertype of both `(3..3)` and `(5..5)`. Since there are no lower bounds yet on `?X`, the algorithm stops here. *Sidenote:* The simplesub algorithm assumes that `?X` can be equal to *any* set of supertypes. This could work ok in a language like ML where everything is a pointer etc; the idea is that at *some* point you will have to “produce” an `?X`, and so you’ll be forced to supply some kind of subtype that is a subtype of all of the supertypes. If you have two types no common upper bound, then that would be impossible. In Rust, I think we would prefer to “sanity check” that the set of supertypes could *possibly* be combined. That way if you have `let mut x = 3_i32; x = 5_i64;` you would immediately get a type error. The current a-mir-formality code doesn’t yet implement this “fast fail” check, but it makes sense and should be relatively straightforward to add. (Good starter project!) The next line in our source is… ```rust let mut x: ?X = 3; // Env1 = subtype(Env0, (3..3), ?X) x = 5; // Env2 = subtype(Env1, (5..5), ?X) y = x; // <— next line … ``` This line can be checked with the call ``` Env3 = subtype(Env2, ?X, ?Y) ``` We actually hit a slightly different case in the algorithm this time than before. Before, we always had the type variable only on the supertype side, and so we added “lower-bounds” like `(?X >= (3..3))`. But *this* time the type variable `?X` is on the *subtype* side, so we will add an upper-bound `(?X <= ?Y)` (just because of the ordering of the match arms, the algorithm prefers to add upper-bounds), yielding this intermediate result: ``` [?X >= (3..3), ?X >= (5..5), ?X <= ?Y] ``` Now we finally come to the case where the same variable (`?X`) has both lower bounds (`(3..3)`, `(5..5)`) and an upper-bound (`?Y`)! Therefore, we will relate them. The final result is: ``` Env3 = [?X >= (3..3), ?X >= (5..5), ?X <= ?Y, ?Y >= (3..3), ?Y >= (5..5)] ``` Now we come to the final line in our source: ``` let mut x: ?X = 3; // Env1 = subtype(Env0, (3..3), ?X) x = 5; // Env2 = subtype(Env1, (5..5), ?X) y = x; // Env3 = subtype(Env2, ?X, ?Y) y = 8; ``` Here we will invoke… ``` Env4 = subtype(Env3, (8..8), ?Y) ``` As in previous cases, this will add a lower bound `?Y >= (8..8)`. The final environment `Env4` is: ``` Env4 = [?X >= (3..3), ?X >= (5..5), ?X <= ?Y, ?Y >= (3..3), ?Y >= (5..5), ?Y >= (8..8)] ``` From these constraints, you can see that the `?X` is `(3..5)`, `?Y` is `(3..8)`, and `?X <= ?Y` holds — of course, we’d need to write a simple algorithm to find the solution, but that’s not too hard. Just do a fixed-point iteration over the inequalities, growing each variable as needed. ### An aside: intersection, union types to represent “type schemes” In ML speak, a *type scheme* is the type of a generic function (before it has been applied to specific arguments). The idea is that ordinary values being type checked have a *type*, but functions and things in the environment have a *type scheme*. Each time you reference the function, the type scheme gets instantiated — this is analogous to early-bound generic parameters in Rust. In the simplesub paper, they show a way to convert a typing environment like `Env4` into “union” and “intersection” types: * an *intersection* type `A && B` means “both an A and a B at the same type”. * If you think of Java classes, you could have a type `String && Object`, and that would effectively be the same as `String`, because every type is both a string and an object. (You could also call this “greatest lower bound”, because `String <= Object`.) * Some types have an empty intersection. e.g. `String && Integer` (still in Java) have no intersection. In Rust, types like `i32 && i64` have no intersection. In MLSub, this doesn’t make the intersection type *illegal*, it just means that there will be no valid instances of it (i.e., you could never generate a value whose type `T` is a subtype of `i32 && i64`). * Intersections appear in *input* (or contravariant) position: e.g., in function arguments. For example, a function type like `fn foo(f: String && Object)` might be inferred if `foo` used both `String` and `Object` methods on `f`. * a *union* type `A || B` means “either an A or a B value”. * Thinking again of Java classes, the type `String || Object` would be equivalent to `Object`, because a `String` could also be an `Object`. (You could also call this “least upper bound”, because `Object >= String`.) * In MLSub, there is no such thing as an “illegal union”, because all values are considered interchangeable. That could be true in Java too. In Rust, at least Rust today, it won’t work, since `i32 || i64` makes no sense. Even if we adopted anonymous enums, it still wouldn’t make sense, because an anonymous enum would need to carry some kind of discriminant to pick between the two cases, and hence it would be *either* an `i32` *not* a `i64` in terms of its layout. * Unions appear in *output* (or covariant) position: e.g., in function return types. For example, a function type like `fn foo() -> String || Object` might be inferred if `foo` returned both a `String` *and* an `Object`. Intersection and union types, though, aren’t really *necessary*. They are important to MLsub because function schemes (i.e., generic function types) in that language don’t include anything like a where clause, they are just a set of type variables and series of input and output types: ``` FnType (in ML) ::= for<{X}> (T_arg …) => T_ret ``` So, imagine that we had this function (in a kind of weird Java-like notation-y language): ```rust fn foo(x: ?X) { let x1: String = x; let x2: Object = x; } ``` This would give us an inference state like ``` String <: ?X // because of `let x1: String = x` Object <: ?X // because of `let x1: Object = x` ?X <: ?Y // because of `return x ``` How should we encode the resulting type? With intersection and union types, we can write something like this ``` for<X> fn(String && Object && X) ``` This says that the value is a string, an object, and an `X`, all at the same time.(The `X` isn’t really needed here, but it can be used to connect different variables. ## Type schemes with where clauses In a-mir-formality, I opted for a different approach. We [represent type schemes](https://github.com/nikomatsakis/a-mir-formality/blob/f64717f01178706a5e6aec23326e847c2003634a/src/ty/grammar.rkt#L19) as an expression of the form ``` Scheme ::= (TypeVariables WhereClauses Ty) ``` In other words, the example function above would have a type like ``` for<X> (where X <= String, X <= Object) fn(X) ``` ## Occurs check In MLsub, they permit recursive types, which therefore have infinite (if tree-structured) size. We don’t support that in Rust, so we add something called the *occurs check*, which checks that when you have a constraint like `?X <: T`, `?X` doesn’t appear in `T` (writing this doc made me realize we don’t currently reject something like `?X <: ?Y` and `?Y <: Vec<?X>`).