
Fixing CFI VTables

Today, most things you can do that generate vtables will SIGILL with CFI enabled. Calling methods on a &dyn Foo works properly, but nearly everything else can cause trouble. This document explores a proposed fix so that trait objects, closures, etc. work under CFI in Rust.

Background (What is CFI?)

CFI, or control-flow integrity, is a class of mechanisms for hardening compiled code by enforcing that the runtime trace of the compiled program stays within the control flow graph of the source. It accomplishes this by restricting indirect jumps[1] to targets that analysis of the source program can't rule out.

Kinds of CFI

In order to do this efficiently and precisely, CFI mechanisms are separated into different classes. Backwards-edge CFI specifically refers to protections for ret or equivalent instructions. Because these follow a stack discipline, there are a wide variety of specialized techniques that work better for this class of indirect jumps (shadow call stack[2], safe stack[3], stack cookies[4], etc.). The other major class is known as forwards-edge CFI. This covers calls through function pointers, longjmp, and calls that dispatch through vtables (e.g. calls through virtual functions in C++, calls through a trait object in Rust). There are a variety of approaches to this as well, such as IBT/BTI[5], FineIBT[6], LLVM CFI, and KCFI.

LLVM CFI

When you hear someone say they "enabled CFI" for something, this is almost always what they mean - this is LLVM's software-based forward-edge control flow protection which determines its alias sets based on types. This system gives you two tools:

  • Attach type metadata (an integer and a string) to a function declaration or global declaration.
  • Test and branch on whether a given LLVM address belongs to a particular equivalence class.

You can attach multiple equivalence classes to a given address, but functions and globals must not share an equivalence class.

C++

In C++, the integer is used as a byte offset into the vtable, and the string is an Itanium-mangled version of the type. Functions and globals use an offset of 0. The offset acts as an additional disambiguator: a method that is not installed at a particular offset can't have been reached by loading from that offset, even if the type signature is otherwise compatible.

Clang offers a number of different hardenings based on these tools:

  • cfi-cast-strict / cfi-derived-cast / cfi-unrelated-cast - Dynamically check based on type information whether a cast is legal, at various levels of strictness.
  • cfi-vcall - Checks the type of the vtable before invoking a method on it to make sure it is a vtable known to have a method available to the provided receiver type.
  • cfi-nvcall - Uses the type of the vtable to check that direct calls are being called on objects of the correct type.
  • cfi-icall - Uses the type of a function pointer to check if it has the expected signature before calling.
  • cfi-mfcall - Like cfi-icall, but with support for pointer-to-member-function.

Implementation + Optimization

LLVM implements the llvm.type.test intrinsic in two different ways depending on whether it is the type signature of a global or a function.

Globals

Typesigning globals is primarily used for vtables in C++, though there's not anything preventing you from using it for something else.

LLVM will place all type-signed globals into the same region, and then for each call to llvm.type.test, compute a lower bound, upper bound, alignment restriction, and validity bitvector for which globals are legal. As an optimization, it will do its best to lay out globals which have at least one type in common next to each other, preserve alignment, and try to sort them so that bitvectors are all ones as often as possible (and so unnecessary). See the CFI design document for further details.

Functions

Functions are implemented similarly to globals, but use a jump table rather than placing all the functions in the same section. This allows alignment to remain constant and small despite the greatly varying size of functions (and possibly even size varying based on the layout itself). If I had three functions, f, g, and h, and all had their addresses taken in ways the compiler couldn't track, they'd end up in the table like this:

f:
jmp f_real
int3
int3
int3
g:
jmp g_real
int3
int3
int3
h:
jmp h_real
int3
int3
int3

Every time the source language tried to take "address of f", it'd get f, not f_real. If f and g have the same type signature, but h has a different one, at a callsite to the first signature it'd do a range and alignment check, then jump. If f and h had the same signature, it would use the bitvector to check. As with globals, the compiler will do its best to lay these out such that bitvectors are not usually necessary, but their flexibility means that functions can be assigned multiple types.
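
As a rough sketch of the check described above (the constants are illustrative, not what LLVM actually emits, and the bitvector lookup is omitted), the call-site logic amounts to a range-plus-alignment test against the jump table:

fn is_valid_target(addr: usize, table_start: usize, entry_count: usize) -> bool {
    // Each jump-table entry in the sketch above is a jmp plus int3 padding,
    // so valid targets must be entry-aligned and fall inside the table.
    const ENTRY_SIZE: usize = 8; // illustrative size, chosen to match the sketch
    let offset = addr.wrapping_sub(table_start);
    offset < entry_count * ENTRY_SIZE && offset % ENTRY_SIZE == 0
}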

DSOs

All of the above can only be done efficiently within a single lowering pass, which is why it requires LTO - laying out the global table and the canonical jump table requires global information. Unfortunately, shared libraries are quite popular, and inherently lack this global knowledge.

To address this, there is an experimental mode where in addition to the inlinable checks described above, each module:

  • Exports a function called __cfi_check which can be used to check whether a given address in that module is a valid member of a given type's alias set
  • If the inlinable checks fail, the caller locates the module which contains the pointer to be checked, and invokes its __cfi_check if that module is CFI-enabled.

KCFI (Kernel CFI)

LLVM CFI has a few major drawbacks. Notably:

  • LTO must be used.
  • Calls between modules are slow, experimental, and require matching compilers.
  • The address of the same function in two different modules may be different.
  • Lock-in to LLVM (e.g. you can't use GCC)

In the particular case of embedded and kernel code, many of these drawbacks move from unfortunate to untenable. The Linux kernel supported traditional LLVM CFI for a bit, but:

  • The kernel is large, so LTO is more of a hit to build performance
  • Kernel modules are frequently built by third parties, which now need to ensure they use identical compilers.
  • Addresses of functions are compared for equality within the kernel, which led to bugs.
  • The kernel generally wants to be buildable with gcc, even if clang is a well-supported option these days.

This led to the introduction of KCFI, which makes a few observations:

  • C (not C++) lacks vtables to protect or typesign
  • All executable memory should be read-only
  • C doesn't have polymorphism, so every function has a principal type

From this, they came up with a simplified version of the LLVM type testing system:

  • Every function has a prefix before it that encodes a type ID as a valid instruction.
  • Before each indirect call, the caller extracts the tag from the header immediately before the function and checks it against the expected value.

These typeIDs are the xxHash of the Itanium-mangled representation of the function's type, which means that as long as compilers agree on the instruction to embed them in, this gives you cross-compiler CFI without global knowledge.
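
As a rough Rust-level sketch of what an instrumented call site conceptually does (the real check is a few emitted instructions that trap rather than panic, and the 4-byte offset is an assumption for illustration):

fn kcfi_call(target: *const u8, expected_id: u32, arg: usize) -> usize {
    unsafe {
        // Read the type ID embedded just before the function's entry point.
        let found_id = core::ptr::read_unaligned(target.sub(4) as *const u32);
        if found_id != expected_id {
            // Real KCFI emits a trap instruction here instead of panicking.
            panic!("KCFI type mismatch");
        }
        // Only after the tag matches does the indirect call happen.
        let f: extern "C" fn(usize) -> usize = core::mem::transmute(target);
        f(arg)
    }
}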

The primary limitations here are:

  • You cannot typesign globals - since you don't jump to them, you don't have an implicit check that they are in read-only memory, so you can't trust any type tag on them.
  • You cannot attach multiple types to the same function - even redesigning it to go to two types would make every indirect call site require 3 more instructions.

Rust CFI Usage Scenarios

Incomplete support for CFI may prevent Rust from being used in several environments that we could make real improvements in.

Mixed-language Vulnerabilities

Especially in the embedded world, Rust tends to get mixed into existing C/C++ codebases. Unfortunately, this can produce a mixture that is less secure than either language alone. While Rust has robust protection against memory safety violations baked into the language, hardened C++ environments have invested in CFI and related hardening features to prevent a memory safety violation from turning into a full instruction-pointer hijack. This means a CFI-hardened C++ program can become easier to exploit when it adds a Rust dependency that is not built with CFI: memory unsafety in the C++ code can corrupt data that Rust later uses for an indirect call, and the Rust side then fails to detect the bad control-flow transfer.

We want to be positioned to tell people that yes, using a Rust component in your existing stack will improve your security posture.

Android Kernel

The Android Kernel is generally deployed with KCFI enabled. We are currently working to deploy a Rust-based rewrite of the binder[7] driver after a number of memory safety issues over the years. This component is exposed to literally every process on the system, so any vulnerabilities in it are almost always exploitable. To even consider deploying it, the functions Rust exposes to C must be appropriately tagged with compatible KCFI types. To be confident deploying it, we need Rust's internal indirect calls to be protected as well, because C may corrupt them, even if Rust doesn't, and because unsafe code in the Rust portion of the kernel may have its own bugs.

Current State

Currently, regular functions work in a way compatible with C, largely thanks to the -fsanitize=cfi-icall-experimental-normalize-integers[8] flag.

Trait objects with receivers of &self or &mut self can have their methods called.

However, the following will currently result in a CFI mismatch (which leads to a program abort) or a bug! triggering:

  1. <S as Foo>::foo as fn(&S) where Foo is a trait with fn foo(&self) - you can't use methods from traits as functions
  2. Calling a method with signature fn foo(self: Arc<Self>) on a trait object - you can't use anything but &self or &mut self as a receiver
  3. let f: &fn() = &((|| ()) as _); - you can't convert a closure to a callable function pointer[9]
  4. let _: Box<dyn Foo> = Box::new(S); - you can't drop a trait object; the drop entry has the wrong type.
  5. If you use self_cell, you'll get an infinite loop

This means that in practice, you can't enable CFI in a real Rust codebase and have it run. You'll also notice one thing in common here - other than the first and last entries, these are the result of calling vtable entries whose type does not match the Virtual call made to them, usually because the vtable entry has its fully concretized type while the Virtual type is abstract. The first one is a result of trying to fix this abstraction problem in a specific case by adjusting the type encoding.
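
As a concrete (hypothetical) reproducer for case 4, as I read the list above: the vtable's drop entry points at drop_in_place::<S>, whose alias set reflects the concrete fn(*mut S), while the virtual drop performed when the Box goes out of scope checks for the abstract fn(*mut dyn Foo), so the type test fails.

trait Foo {}
struct S;
impl Foo for S {}

fn main() {
    // The coercion below builds a vtable for S as dyn Foo; dropping the Box
    // goes through that vtable's drop_in_place entry and trips the mismatch.
    let _x: Box<dyn Foo> = Box::new(S);
}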

Proposed Solution (CFI Shims)

If the vtable entries have alias sets that are incompatible with virtual call alias sets, what should we do about this?

Picking an Alias Set

Since object-safe vtable functions are not directly exposed to C (they aren't extern "C" ABI, so there's no way to usefully do that), we are largely free to select whatever alias set we want. We explore several options:

Singleton

All call-sites and definition sites with Rust ABI have the same alias set. This is almost equivalent to disabling CFI on intra-Rust calls. This will work and is simple, but it leaves us open to mixed-language issues and makes incorrect unsafe code much more likely to be exploitable.

Rust ABI Compatibility

Use the rules for ABI compatibility to normalize types down before encoding them when the ABI is "Rust". This should cover all object-safe methods, which should cover the vtable. We would still need to use traditional, type-based encoding for non-"Rust" ABIs for compatibility with external code we link against.

Normalization would proceed on each argument and the return type before encoding as we already do, likely with a different prefix in order to prevent Rust ABI annotations from joining the alias set of extern "C" functions.

The per-type normalization would work roughly like this, applied as a visitor:

  1. If it's a *const T, *mut T, &mut T, Box<T, Global>, NonNull<T>, replace it with a *mut <T as Pointee>::Metadata, normalizing the type before proceeding.
  2. Rewrite usize or isize to the platform's pointer width.
  3. Rewrite char to u32
  4. Rewrite any extern "foo" fn(..) -> T to extern "foo" fn(), do not recurse into arguments
  5. Rewrite any align=1 ZST to ()
  6. Rewrite any repr(transparent) struct to its unique field, with a stack to catch cycles. On cycle, don't apply this rule but apply any other rules that apply.
  7. Rewrite NonZero<T> to T
  8. Strip Option<T> to T if T is subject to null pointer optimization.

Finally, any argument whose type is PassMode::Ignore in the ABI (the usual example being ZSTs in the "Rust" ABI) is removed.
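
As an illustration of how much these rules collapse (hypothetical functions, 64-bit target assumed), the two unrelated signatures below would land in the same alias set: rule 1 turns &mut Foo and Box<Bar> into the same thin-pointer representation, rule 2 makes usize and u64 identical, and rule 5 plus the PassMode::Ignore step drop the ZST arguments entirely.

use std::marker::PhantomData;

struct Foo;
struct Bar;

// Same normalized signature under the rules above, despite sharing no types.
fn f(_a: &mut Foo, _b: usize, _zst: PhantomData<Foo>) {}
fn g(_a: Box<Bar>, _b: u64, _zst: ()) {}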

The major remaining caveat is that Virtual uses force_thin to become ABI compatible. This means that when computing the "ABI signature" for a Virtual call, the first argument should always become *mut (), i.e. "a pointer with no metadata".

The primary advantage of this approach is that Miri already enforces these rules on indirect call boundaries, so it should be valid already with no real modifications to codegen, just an alternate computation of the alias set.

This approach sounds enticing, but it leaves alias sets extremely weak, not much better than the Singleton strategy. I haven't done measurements, but since all fn foo(&self) would be mutually compatible regardless of which trait we're talking about, CFI would not be enforcing much beyond an IBT-style scheme here - little more than arity restrictions.

Middle Ground

Restrict further than the ABI compatibility rules require, while maintaining the property that the normalization procedure can assign a principal alias set per def_id/args combination. This restriction means we will never need to introduce an additional shim, because the alias set can be determined without casing out on the InstanceDef. We maintain this restriction on our choices because if we allowed differing alias sets per shim[10], we would incur the implementation complexity and runtime overhead of the type signature approach, and may as well use that as our basis, since it would be more precise.

The only piece that we know actually mismatches in type-based alias sets today is the receiver type. So, for the first argument of any function, apply the ABI normalization rules we just proposed; all other arguments and the return type must match the type exactly.

This still leaves fn(&self) of all kinds compatible with each other, but at least fn(&self, &Foo) and fn(&self, &Bar) are no longer compatible, which improves things somewhat.
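
Concretely (hypothetical traits), the middle ground keeps the first argument normalized but everything else exact:

struct Foo;
struct Bar;

trait A { fn f(&self); }          // shares an alias set with B::g
trait B { fn g(&self); }          // (both are "thin receiver, no other args")
trait C { fn h(&self, _: &Foo); } // distinct from D::i: the second argument
trait D { fn i(&self, _: &Bar); } // is matched exactly, not normalized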

Limitations

We could do better than this while still maintaining a principal alias set, but we have a lot of limitations without whole-program analysis. We would effectively need to class potential receivers into components based on a transitive closure over an undirected variant of the "implements" and "supertype" relationships. With whole-program analysis, we have some hope of making a few smaller components to class things into, rather than all legal object receivers just becoming *mut (). Without whole-program analysis (avoiding which is one of the goals of KCFI), all public types and traits must be assumed to be connected (a foreign crate could easily create a link), and so most receiver types contract to the same representation.

Actual Alias Analysis

We could use LLVM's alias analysis passes on an un-instrumented variant of the program to dynamically generate alias sets. This would have the advantage of being able to use dataflow and LLVM type compatibility, but the disadvantage of requiring LTO without any hope of cross-DSO support. It would also lead to difficult-to-predict changes in alias sets when implementation details of the program change. This might be interesting research, or an experiment, but it doesn't satisfy our practical goals.

Type Signatures

This is what the codebase attempts to do today, and it generates the bugs described above because it never alters shimming, and without that no assignment of type-signature-based principal alias sets can be correct.

In this approach, every Instance has a principal alias set, and this alias set is allowed to depend on shim information as well. If we need to call a given def_id + args combination at two different types (say, a thin *const dyn MyTrait and &MyConcrete), then there will be a way to shim it via InstanceDef that results in each type being present. When creating an instance that may be the recipient of an indirect call (whether because it is going into a vtable or because a function pointer is being created), it will be resolved in a way that applies an appropriate InstanceDef to get the receiver type to match, not just be compatible.

The rest of the design assumes this is what we're doing.

Multiple Variants are Required

For LLVM CFI, we could perhaps attach every possible type that a method could have. However, if we want KCFI to work, we need at least one compilation mode where every address has at most one type. The rest of the design assumes this restriction. It will still solve the crashing problems of LLVM-CFI, just possibly not in a way that leverages the type-test system for optimal performance.

We cannot just make each Instance have the alias set expected by its virtual call for a few reasons:

  • The underlying instance should still be accessible at its concrete type - see bug 1
  • If trait Parent { fn f(&self); } trait Child: Parent {}, we expect to have an entry that works for *const dyn Parent and one that works for *const dyn Child[11]
  • drop_in_place<T> is inserted into every vtable. Since a given concrete T may implement any number of traits, and every one needs a drop_in_place entry in its vtable, we can't just adjust the type of drop_in_place<T> (see the sketch below).
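
A small sketch of that last point (hypothetical types): one concrete type, two unrelated traits, two vtables, and the single drop_in_place::<S> is expected to satisfy both.

trait A {}
trait B {}

struct S(String);

impl A for S {}
impl B for S {}

fn main() {
    // Both vtables embed the same drop_in_place::<S>, but one virtual drop
    // would expect it at a type abstracted to dyn A and the other at dyn B,
    // so a single alias set on the entry can't satisfy both call sites.
    let a: Box<dyn A> = Box::new(S(String::from("a")));
    let b: Box<dyn B> = Box::new(S(String::from("b")));
    drop(a);
    drop(b);
}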

This leaves two approaches open:

Default fn-ptr alias set

Instances by default carry the alias set at which it would be legal to perform an explicit indirect call (i.e. fn-pointer based). Shim adjustments are applied when producing instances for a vtable to abstract them.

The primary advantage of this approach is implementation simplicity, with results that look as one might expect. For example, an Item for a function fn foo(&self) with Self = Bar will have the alias set for fn(&Bar), not fn(*const dyn Foo) with an unprinted caveat that this is a thin-call receiver rather than a fat dyn pointer. It also results in fewer changes from the status quo when CFI is not enabled, as implicit argument transmutes will not need to be inserted into the generation of the base instance for methods. This keeps the changes generated by CFI mostly separated from normal compilation.

Basically, this assigns the alias set you'd expect if you were not thinking about vtables, and is simpler to implement and test.

This approach is what I have implemented in the current PR.

Default vtable alias set

Instances by default carry the alias set at which it would be legal to perform a vtable call (i.e. thin dyn based). Shim adjustments are applied when producing instances for function pointers. Virtual calls are adjusted from today so that instead of calling using an expected signature of *const dyn MyTrait, they call with *const dyn MySuperTrait if MySuperTrait provides the method being called.

The primary advantage of this approach is that it will likely result in faster and smaller code, as creating a function pointer from a trait function is uncommon relative to vtable dispatch. This means that the common case will avoid a shim.

This adds complexity, because not all Instance kinds can select a default vtable alias set, yet they are still used in vtables. drop_in_place via DropGlue is a key example - it cannot determine the trait it is being used in, because it may be used in multiple traits. This means that some instances do not need a shim when going into a vtable and others do, and we need to figure out which are which and shim them.

If we make the instance type match its alias set, this also means that the logic for determining whether an instance needs to be called with a "thin self" will expand from a simple enum check on the InstanceDef type[12] to a computation involving both the shim kind and the def_id in order to determine whether that leading *const dyn Foo is a fat or thin pointer.

This logic gets even more confusing if we have a trait function which takes a fat &dyn MyTrait as its first argument, with no receiver, and uses a where Self: Sized clause to keep the trait object-safe. We now need to check not only whether an item implements a trait, but whether it really would go in a vtable, before we can determine whether it is force_thin or not.

The other approach dodges this by only shimming things as they're put into the vtable, which means they're pre-selected to have an object-safe receiver that can be rewritten to a thin pointer.

This approach would also make the extra arguments approach more viable, as ReifyShim would likely cover some of what is needed.

Don't enable CFI shims by default

These shims should be small, but there may be a lot of them, and they introduce indirection through trampolines in many cases. Generating them when CFI is not enabled would lead to unnecessary code bloat, confuse the optimizer, etc., so they should only be produced when CFI is on.

Variant Instance Strategy

Regardless of which alias set we assign to unshimmed Instances, we know that we need to create new Instances either for vtables or for function pointers. These instances need a field recording which abstracted Self is in use, since drop_in_place demonstrates that we will have an unbounded number of variants for each existing instance. We still need the original Self type as well, as we need to be able to generate the original shim or call the original instance. We explore several strategies for making this information available:

Add it to Instance

It seems tempting to add the replacement Self type to Instance, either by packing it into Args as an extra field or by creating an Option<Ty> field for use during CFI. However, MIR generation is done on InstanceDef, not Instance. This presents two problems:

  1. The MIR generator needs a concrete type, not just a type variable, to repeatedly unwrap the receiver type to make alternate receivers (e.g. Arc<Self>) compatible with thin_self.
  2. We cannot cheaply tell whether a cast was needed, so we don't know whether any instance we're calling should be thin or not.

Extra InstanceDef arguments

Only some InstanceDef elements can possibly be inside a vtable today - we could just add an extra argument to all of them, and add a helper method for getting the InstanceDef's abstracted Self which we could reference in shim generation. I started going down this road initially, but then remembered that Fn-family trait objects cover almost everything else; it's only Virtual, Intrinsic, and ThreadLocalShim that are not going to end up in a vtable.

We could still do this if we really want to avoid a wrapping InstanceDef, but we'd end up adding it to several different shim types and repeating our logic in several places.

Wrapping InstanceDef: CfiShim

This creates a new shim kind that is essentially a modifier for all other shim kinds. We can now transform any InstanceDef into a variant that abstracts to a particular type.

The primary advantages of this approach are:

  1. The compiler internal representation of non-shimmed code remains identical. No size increases or changes to the shim generation code other than a check for shim enablement.
  2. Composes well with new shims - if a new shim is added, it will more than likely Just Work with CFI out of the box.
  3. Centralizes logic for CFI concerns - for example, as discussed later, a CFI shim for a function abstracted to a particular trait goes in the crate that performed the &dyn cast, not in the defining crate. This decision is written in one place, not 7.

The primary disadvantage of this representation is that wrapping some InstanceDefs (closure-likes and trait implementation items) requires additional logic.

This is the approach we assume through the rest of the design. Switching to extra InstanceDef arguments would be compatible with the rest of the design, but require adding codepaths to several branches rather than the central one.

Computing the Alias Set

Given an instance, how do we want to get the alias set?

Separate Machinery

We could add a separate method to Instance which computes the principal alias set. If the ABI is "Rust", we use the logic described in this section. "rustcall" likely needs special casing. All other ABIs directly compute the type and then call the type encoder, as they do today.

This allows us to avoid modifying the type in a way that would change how Rust expects to interact with things. Rather than being a transmute followed by a direct call, or a transmute followed by an existing shim, generated code can just be a direct call or the existing shim.

An Instance is available when assigning an alias set to a declaration, but not when making an indirect call - there will be an Instance only for direct calls and indirect calls through a vtable (Virtual). This means that we still need to have our call-site alias set be computable from the function pointer type as well.

Advantages:

  • Fewer changes to the flow of the compiler
  • Flexibility to adjust alias sets to contain non-type information for Virtual calls

Disadvantages:

  • Complexity in implementation - the type is available early on, whereas the alias set requires normalization and arguments.
  • Complexity in maintenance - instances now have an extra piece of metadata computed for them that could be easily brought out of sync when updating any shim generation code.
  • Complexity in debugging - it is not clear what the perceived alias set of an instance is as it goes through the compiler unless explicit debug statements are added.

In retrospect, this might be worth trying as a refactor. The current patchset uses the type directly.

Use the Type Directly

We generate shims such that the type of the shim reflects the generalization we're allowing. For example, with

trait Foo {
  fn foo(&self);
}

struct Bar;

impl Foo for Bar {
  fn foo(&self) {}
}

the Bar vtable for trait Foo would have a foo entry with explicit type fn(*const dyn Foo), and a hidden[13] force_thin annotation based on the instance. The drop_in_place entry would have type fn(*mut dyn Foo) as well.

Advantages:

  • It's obvious what is misaligned when looking at any stage of the compiler - if you're trying to make an indirect call, the target needs the same type.

Disadvantages:

  • We need to insert a transmute in our shims to change the type. This will codegen out, but it is otherwise not needed.

This is the approach currently implemented and assumed for the rest of the design.

Replacing the Receiver

Our primary goal on each shim is to convert the receiver type of the function to match the abstracted type. We have several options:

Rewrite the Receiver Directly

The most straightforward approach would be to examine the first argument's type, match on the receiver structures, and replace it (sketched below). This approach is simple, and works for &self, &mut self, Self, Box<Self>, Rc<Self>, Arc<Self>, and Pin<P> with P matching the rest of this pattern aside from Self. There are two difficulties this produces:

  • This does not support arbitrary external receiver types. This is not just a nice-to-have, as the Rust support in the Linux kernel uses multiple custom receivers.
  • A non-trivial amount of custom code is required to perform this rewrite.
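
To show what the direct rewrite would produce (hypothetical trait; a surface-syntax approximation of changes that actually happen on rustc's internal types):

use std::pin::Pin;
use std::sync::Arc;

trait Foo {
    fn a(&self);
    fn b(self: Arc<Self>);
    fn c(self: Pin<&mut Self>);
}

struct Bar;

// Rewriting the receiver of Bar's vtable entries, expressed as function types:
//   a: fn(&Bar)          ->  fn(*const dyn Foo)     (thin via force_thin)
//   b: fn(Arc<Bar>)      ->  fn(Arc<dyn Foo>)       (thin via force_thin)
//   c: fn(Pin<&mut Bar>) ->  fn(Pin<&mut dyn Foo>)  (thin via force_thin)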

Compute type at concrete and dyn MyTrait, merge

When producing a shimmed function, rewrite the underlying def_id, if it is closure-like or a trait function, to the generic version of it with appropriate arguments. When we're only shimming vtables, it's guaranteed that all functions have a receiver, and after this transformation, the receiver is always in the first position.

Instantiate the function signature at the concrete type, then instantiate the receiver at the abstract type, and replace the first argument with the rewritten receiver.

This largely works, but could theoretically fail if a receiver type depends on an associated type of MyTrait.

The main downside here is that it involves two type instantiations, creating a new signature, and all returned types are now FnPtr, which is not quite accurate. It also slightly weakens CFI, since the associated types are no longer qualifiers.

Compute type at abstract Self, with appropriate associated types

Produce a complete trait object type for the concrete type's implementation of the trait in question at the trait arguments. For example, if we have

trait Foo<U> {
    type T1;
}
trait Bar<U>: Foo<U> {
    type T2;
}
struct C;
impl Foo<u32> for C {
    type T1 = u8;
}
impl Bar<u32> for C {
    type T2 = i8;
}

then when we try to produce the type for C as it implements Bar<u32>, we emit dyn Bar<u32, T1 = u8, T2 = i8>. These constraints will be present at the call site, because they are required for a valid trait object.

We then replace the def_id as in the previous approach, switching closure-likes to the call family of functions with appropriate arguments, and trait implementations to the trait method they're implementing. This means that the first parameter of everything in the vtable at this point will be Self, and we can use the trait object type above. Because we added the associated types, the entire type should instantiate cleanly against the generalized type.

This is what is currently implemented in the patchset.

Which crate do CfiShims go in?

In this design, these shims should go in the crate where the vtable is created. The defining crate for the underlying instance cannot know all types the shim will need to be defined at.

If we shimmed function pointers rather than vtables, shims other than DropGlue (since we again don't know all the types it needs to be instantiated at) and similar shims[14] can go in the defining crate.

MIR Compatibility

This section is only needed if we choose to express the alias set directly through the type. If we were to use a separate mechanism, we would not need to adjust the MIR.

When we rewrite the type of a function through a shim, we need to line up the type of the generated MIR with the type of the function. We generate MIR for the wrapped InstanceDef, then rewrite the body to include a transmute from the new receiver to the one the original instance would have expected.

To support alternate receivers, we unfortunately have to unwrap to the inner dyn pointer. For example, if we have Arc<dyn Foo>, we unwrap to NonNull<ArcInner<dyn Foo>>, then *mut ArcInner<dyn Foo> to match what the ABI expects and allow force_thin to do its job.

Patchstack Tour

This is a brief overview of the patchstack, intended to help reviewers find specific sections they're looking for. This refers to the patchstack as it is, and not to the design or alternate implementations.

Prelude

Introduce trait_obj_ty query

This computes the trait object type with associated types that will be later used to compute the abstract type of an instance. It's in a query both because some of the functions it calls are not available at the intended call-site, and because it is called with the same argument several times.

Refactor visiting instance_def

This splits InstanceDef visit code out into a function to make the introduction of CFI shims a little smaller

Refactor fmt_instance

Same as above - factors out a function to decrease noise in the CFI shim patch

Refactor to create InstanceDef::fn_sig

There are several places throughout the code where VTableShim is special-cased to handle its cast from self to *mut self. Most of these are annotated with a comment by eddyb to factor it out. Since the shim generator needed to do it one more time, I factored it out first.

CFI Shims

This creates the InstanceDef::CfiShim variant. It wraps another InstanceDef, which is expected to be another shim, and carries an abstract type, as computed by the trait_obj_ty query. The user is intended to construct a CfiShim via the .cfi_shim instance method. This method will be a no-op if cfi_shims() does not return true on the session, currently controlled by either CFI or KCFI being enabled. If it's wrapping a closure-like or a trait method implementation, these are replaced with a ReifyShim pointing to the abstract method, as described in the design. When generating the shim, it will prepend a transmute from the abstracted receiver to the concrete receiver.

Generate Shims

This enables usage of these shims with CFI

CFI: Apply CFI shims to drops

Attach .cfi_shim() to vtable drop and to collector visiting. This makes trait object drops start working.

CFI: Enable vtable shimming

Attach .cfi_shim() to vtable method entry and collector visiting. Alternate receivers begin to work. Standard receivers already work at this point because we changed the encoding of any method on a trait to abstract itself if it used &self or &mut self as a receiver.

Fixups

These fix individual remaining bugs, though in ways that may depend on CFI shimming being enabled already.

Revert "CFI: Fix SIGILL reached via trait objects"

This removes the encoding modification we previously had that explicitly re-encodes &self and &mut self on trait methods. It is no longer needed with the rest of this design, and removing it makes function pointers to methods on a trait work again.

CFI: Skip non-passed arguments

Some conversions to function pointers, most notably the conversion from a closure that captures nothing to a function pointer, depends on the understanding that a PassMode::Ignore argument does not alter the ABI. This patch loosens the CFI around these arguments by skipping them when generating the alias set. The alternative would be to introduce another kind of shim to explicitly truncate a non-passed argument from the type.
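
The motivating case, as I understand it, is the capture-free closure coercion: the underlying call function still has a zero-sized closure-environment argument in its Rust-level signature, but that argument is PassMode::Ignore at the ABI level, so skipping ignored arguments lets the pointer's alias set and the callee's alias set agree.

fn main() {
    // A closure that captures nothing coerces to a plain fn pointer; calling
    // through it is one of the conversions this patch keeps working.
    let f: fn(i32) -> i32 = |x| x + 1;
    assert_eq!(f(41), 42);
}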

CFI: Handle dyn with no principal

In user Rust, a bare dyn is not a type. However, it effectively appears in drop_in_place when generating a vtable with no principal trait. For example:
let x: Box<dyn Send> = Box::new(MyType) as _;
will construct a drop call for a vtable with no trait. Send and other auto-traits are non-principal, so at vtable allocation time, we have no description for the dropped type other than "It's a pointer to an object that was converted into a dyn ? at some point". Because we are describing our alias set in the type system, this means that the receiver type uses dyn as a self type, which causes hiccups in several places. This teaches those places to tolerate a dyn with no predicates.

CFI: Support self_cell-like recursion

This patch, or something like it, can and probably should land even if the rest don't. The type encoder wants to flatten #[repr(transparent)] into its single, non-ZST field for compatibility. The existing code attempts to avoid recursion by generalizing pointers, but the use of PhantomData or any similar structure defeats this. This pattern is used in self_cell, so it's not just an adversarial example.

CFI: Generate super vtables explicitly

Super vtables are currently skipped in the collector because the child vtable includes instances for all the super vtable entries. In our case though, the super vtable will be shimmed to a different abstract type than the entries in the child vtable. This means it will have different instances, so we need to generate them.

CFI: Strip auto traits from Virtual receivers

Auto traits are not present when generating vtables, but they may be present on the receiver when calling them. This strips auto traits off the receiver of virtual calls so that they're compatible with the target, since a virtual call cannot require those additional bounds because it was an object-safe method.
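
A small example of the mismatch this addresses (hypothetical trait): the vtable behind the call was built for dyn Foo, while the receiver at the call site mentions the auto trait as well.

trait Foo {
    fn m(&self);
}

struct S;
impl Foo for S {
    fn m(&self) {}
}

fn call(x: &(dyn Foo + Send)) {
    // Receiver type is dyn Foo + Send, but the vtable entries were shimmed
    // for plain dyn Foo; stripping the auto trait makes the two agree.
    x.m();
}

fn main() {
    call(&S);
}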

Future Work

This stack gets us to "Most Rust code actually builds and runs under CFI", but there are more improvements we can still make in the CFI area.

Shim-less LLVM-CFI

A lot of this design is determined by the restrictions of KCFI - namely that we can't typesign globals, and every function can only have one type signature.

We can leverage both of those capabilities to remove shims and make things more efficient. If we do these, we'll want to enable KCFI in userspace to ensure it continues to work, as the implementations will diverge.

  1. Make Virtual call through dyn DefiningTrait rather than dyn MaybeSuperTrait.
  2. When attaching CFI types to a method, attach both its concrete type, and the method type instantiated with the generalized object type described earlier, with force_thin applied.
  3. For drop_in_place<T>, we still need a shim, but only one per crate. The crate implementing some trait for T should provide the shim, and that single shim should carry the types for all traits the crate implements for T.

Skip signature check on virtual calls

In short, check the vtable, don't check the virtual call itself.

I haven't verified that the checks on loading vtables work correctly, but I see the implementation. If you checked the load of the vtable, you've already checked the type of the method, so you don't need to do it again. The vtables reside in a read-only region, so if you did a type-checked load, you're already safe.

Reduce KCFI Shim Count

We can't get rid of shims entirely, because the alias set expression we're using fundamentally doesn't have a principal alias set per unshimmed instance. Each shim we produce is essentially a witness for membership in an additional alias set. We can avoid these by reducing the number of alias sets a method is in, or by making the unshimmed case match what is common in real world code.

Reduce total possible alias sets

If we make Virtual use the defining trait as a receiver rather than the current one, we no longer need a shim for every supertrait on the trait the method implements. If deep supertrait hierarchies are used, this could significantly reduce the number of possible shims.

Make unshimmed the common case

We can switch to vtable-compatible defaults. Without the Virtual call change, this wouldn't do much - supertraits would bring most of the shims back, while adding complexity. However, with it, this would make shim generation much less likely, as calling a method through a vtable is much more common than converting a trait method to a function pointer. drop_in_place would be by far the most common shim remaining. This is potentially significantly more complicated, but would improve performance.

FineIBT Support

FineIBT is only available on some CPUs, but it provides more flexibility of implementation. The main benefits the Linux experimental implementation claims are:

  • The check acts as a speculation barrier
  • Because the check requires no data reads of the target, execute-only memory (XOM) becomes possible
  • Avoiding those data reads can also improve performance

Today, they are using a hashed C signature as an alias set, the same way KCFI does. However, the interesting part is that the alias set policy is enforced at the callee. This means that if we convert the Virtual call to call at the trait definition type, we could use a custom enforcement that accepts either of two identifiers rather than just one, e.g.

fn_ptr_entry_f:
endbr
CHECK_FNPTR_BUNDLE
jmp direct_f
virtual_entry_f:
endbr
CHECK_FNPTR_BUNDLE
direct_entry_f:
// Actual implementation

This is experimental though, and not available on all chips, so it wouldn't actually allow us to ship Rust in the Android kernel. I'm mostly including it in case all this discussion of CFI got you excited, because this is probably the direction the version of this without whole-program-analysis is going.


  1. Indirect jumps are control transfer instructions which use a computed rather than a constant target. ↩︎

  2. Shadow Call Stack uses a separate stack in x18 which is only accessed around call/return to store return addresses. This makes it difficult to overwrite the return address because the real one is only accessible through a register which isn't used elsewhere in the program. ↩︎

  3. Safe Stack partitions the stack into two stacks, one of which holds compiler-controlled values (register spills, return addresses, etc.) and one which has address-taken and programmer controlled values. This makes it difficult to overwrite "safe" values from out-of-bounds writes relative to "unsafe" values because they are no longer adjacent. ↩︎

  4. Stack cookies put a randomized (extent of randomization varies by implementation) value onto the stack immediately after the return address. This value is checked before returning, which makes linear overwrites from the stack onto the return address difficult to perform because the cookie should be unpredictable. ↩︎

  5. These are a pair of hardware-accelerated indirect branch controls present on recent Intel and ARM CPUs respectively. They both work on the same principle - a special instruction that would decode into a nop on earlier versions of the CPU is placed at every location in the program it would be legal for an indirect branch to go. When the protection is enabled, indirect control flow transfers that do not end on one of these landing pads will fault. ↩︎

  6. Fine IBT is a software extension of IBT/BTI in which callers move the identifier for an alias set into a caller-saved register, and after each landing pad, there is an efficient check of the provided identifier. The paper does not select a scheme for defining alias classes, but the experimental Clang implementation uses hashes of type signatures, similar to KCFI. See FineIBT Support for further discussion. ↩︎

  7. Binder is the name of Android's IPC system. It's in the kernel to allow it to make scheduling decisions, reduce copy count, and manage the lifetime of objects passed between more than just two processes. ↩︎

  8. Kudos to @rcvalle for designing and pushing that flag through LLVM. ↩︎

  9. I believe @rcvalle has an alternate fix in progress for this focused on adjusting closure encoding. ↩︎

  10. We technically ignore VTableShim here, because it is only used for unsized_fn_params (which is in no danger of stabilization), and call_once, which is behind the fn_traits feature. This means that without using unstable features that aren't commonly used, we can set all trait methods that take self to encode as though they took *mut Self without breaking anything. ↩︎

  11. We might be able to avoid this in the future by changing virtual calls when in CFI mode to perform an implicit trait upcast. ↩︎

  12. Today, whether it's Virtual; with my patches, whether it's Virtual or CfiShim. ↩︎

  13. In an ideal world, thin dyn pointers would be explicit in the type rather than based on logic around what instance is being called. ↩︎

  14. Since I haven't implemented this, I don't know that DropGlue is the only one that still requires a shim to go into the vtable, but it is an example. ↩︎
