This doc describes a proposed design and current work on extending Miri to support the Rust C FFI.
The plan involves some changes to underlying data structures that are part of rustc, and the doc also explains what these changes are and the reasoning behind them.
At its core, Miri is an abstract machine. It represents the state of the program it is executing, including an internal model of the memory used by the process. It consists of an Evaluator struct that has fields for all the components of the runtime state.
The Rust compiler provides a Machine trait designed to help instantiate an interpreter for MIR. It provides hooks for different operations involved in program execution. For example, the trait provides a hook function memory_read, which is used to add custom functionality to memory reads. The code for this hook stub in the rustc Machine is as follows:
/// Hook for performing extra checks on a memory read access.
///
/// Takes read-only access to the allocation so we can keep all the memory read
/// operations take `&self`. Use a `RefCell` in `AllocExtra` if you
/// need to mutate.
#[inline(always)]
fn memory_read(
_tcx: TyCtxt<'tcx>,
_machine: &Self,
_alloc_extra: &Self::AllocExtra,
_tag: (AllocId, Self::TagExtra),
_range: AllocRange,
) -> InterpResult<'tcx> {
Ok(())
}
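To make the hook mechanism concrete, here is a minimal, self-contained sketch of the same pattern: a trait with a default no-op hook that one implementor keeps and another overrides. The toy Machine trait, AllocRange struct, PlainMachine and LoggingMachine below are simplified stand-ins invented for illustration, not the rustc or Miri definitions.

```rust
use std::cell::RefCell;

/// Simplified stand-in for the byte range touched by an access (not rustc's type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct AllocRange {
    start: usize,
    size: usize,
}

/// Toy machine trait: the read hook has a default body that does nothing.
trait Machine {
    /// Per-allocation state owned by the machine.
    type AllocExtra;

    fn memory_read(
        _machine: &Self,
        _alloc_extra: &Self::AllocExtra,
        _range: AllocRange,
    ) -> Result<(), String> {
        Ok(())
    }
}

/// A machine that keeps the default hook.
struct PlainMachine;

impl Machine for PlainMachine {
    type AllocExtra = ();
}

/// A machine that records every read. The hook only gets `&Self::AllocExtra`,
/// so mutation goes through a `RefCell`, as the doc comment above suggests.
struct LoggingMachine;

impl Machine for LoggingMachine {
    type AllocExtra = RefCell<Vec<AllocRange>>;

    fn memory_read(
        _machine: &Self,
        alloc_extra: &Self::AllocExtra,
        range: AllocRange,
    ) -> Result<(), String> {
        // Toy check standing in for the real data race / borrow checks.
        if range.size == 0 {
            return Err("zero-sized read rejected by this toy check".to_string());
        }
        alloc_extra.borrow_mut().push(range);
        Ok(())
    }
}
```

The interpreter core would call the hook on every read without knowing which machine it is driving; only the overriding machine pays for the extra bookkeeping.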
In Miri, the Evaluator implements the rustc Machine trait and overrides some of its functions. Among these is the memory_read function described above: here, Miri has some custom functionality for tracking and dealing with data races and stacked borrows when memory is accessed.
The Machine runs inside an evaluation context. This is the InterpCx (Interpreter Context) struct, again provided by the rustc interpreter support. Miri has its own version of the InterpCx, the MiriEvalContext, which is just the base InterpCx with the appropriate lifetime parameters for the Miri Evaluator.
/// A rustc InterpCx for Miri.
pub type MiriEvalContext<'mir, 'tcx> = InterpCx<'mir, 'tcx, Evaluator<'mir, 'tcx>>;
Miri also provides an extension trait for custom evaluation contexts even within Miri itself. This is the mechanism by which different parts of Miri modularize their customizations to the environment. For example, Miri provides some functionality for detecting data races. As part of this functionality, the data race code extends the evaluation context with some data race-specific functions: this is done by extending the MiriEvalContext.
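The extension-trait pattern can be sketched in a few lines. Everything here is a simplified stand-in: the toy InterpCx and Evaluator have no lifetime parameters, and DataRaceEvalContextExt with its two methods is a hypothetical name chosen for illustration, not Miri's actual trait.

```rust
/// Toy interpreter context, generic over the machine it drives.
struct InterpCx<M> {
    machine: M,
}

/// Toy machine state with a counter owned by the "data race" module.
#[derive(Default)]
struct Evaluator {
    reads_observed: u32,
}

/// Mirrors the alias in the text, minus the lifetime parameters.
type MiriEvalContext = InterpCx<Evaluator>;

/// Hypothetical extension trait: one module groups its helpers here ...
trait DataRaceEvalContextExt {
    fn note_read(&mut self);
    fn reads_observed(&self) -> u32;
}

/// ... and implements them for the shared context type, so callers can
/// write `cx.note_read()` as if the method were built into the context.
impl DataRaceEvalContextExt for MiriEvalContext {
    fn note_read(&mut self) {
        self.machine.reads_observed += 1;
    }

    fn reads_observed(&self) -> u32 {
        self.machine.reads_observed
    }
}
```

Each module can define its own trait of this shape, which keeps module-specific helpers out of the core context type.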
Miri currently has some limited support for foreign function calls via emulation. This is all contained in the foreign_items module.
This support consists of a hardcoded list of manually emulated functions, built to support commonly used foreign functions such as malloc. As it stands, there is a custom extension to the MiriEvalContext (in shims/mod) that implements a custom hook for function calls. This hook calls a function emulate_foreign_item if the function being called is identified as being a “foreign item” (i.e., if its body cannot be found). The relevant call, along with the corresponding comments, is included below to illustrate.
// Try to see if we can do something about foreign items.
if this.tcx.is_foreign_item(instance.def_id()) {
// An external function call that does not have a MIR body. We either find MIR elsewhere
// or emulate its effect.
// This will be Ok(None) if we're emulating the intrinsic entirely within Miri (no need
// to run extra MIR), and Ok(Some(body)) if we found MIR to run for the
// foreign function
// Any needed call to `goto_block` will be performed by `emulate_foreign_item`.
return this.emulate_foreign_item(instance.def_id(), abi, args, dest, ret, unwind);
}
Since this list of supported foreign functions is hardcoded, it is limited to only built-in native calls (and is not an exhaustive list of these). If Miri encounters a foreign item whose name is unknown, then it throws an unsupported exception and crashes the interpreter.
We are not touching the C code being executed. As much as possible, we constrained the modifications to Miri itself, with some modifications to the rustc compiler.
We have hooks around calls to external C functions that handle any tagging or custom allocation of data returned from C calls or passed as arguments to C calls.
Currently we're only supporting this feature on Linux.
In order to call C code from a Rust program executing in Miri, we are extending Miri with the libffi crate. This provides an interface to the host system’s libffi. It allows us to dispatch calls to linked code.
Miri doesn’t currently have a mechanism to link to external C code. We’ve implemented this by adding a new command line argument -Zmiri-extern-so-file that users can use to specify a path to a shared object file.
When an external C call is encountered by Miri, the steps it follows to dispatch the call are:
- Load the shared object file using the libloading crate
- Look up the function in the loaded library using libloading
- Call the function using libffi’s call function
Following is a simplified/condensed version of the code we added to call a function that returns an i32 primitive using libffi. Note that we’ve removed the error handling code for simplicity.
unsafe {
// get the libloading::Library and extract the function
let lib = this.machine.external_so_lib.as_ref().unwrap();
let func: libloading::Symbol<unsafe extern fn()> =
lib.get(link_name.as_str().as_bytes());
// get the code pointer
let ptr = CodePtr(*func.deref() as *mut _);
// call function and get return value (in this case an i32)
let x = call::<i32>(ptr, &libffi_args.as_slice());
// store the value in Miri's internal memory
this.write_int(x, dest)?;
}
CodePtr is a code pointer type supplied by libffi, to provide access to the function being called.
Part of the simplification of the code above is that it elides the type conversion required to turn values from their Miri representation into the corresponding values that get passed into the C function call. This is required for both the function arguments (to construct the libffi_args vector, we iterate over the arguments to the call in Miri) and for the function return.
To determine a correspondence between the Miri types and the C types, we refer to the available Miri types (TyKinds) and the types that implement the CType trait in libffi. Clearly these do not have a 1:1 correspondence: there are many more complex types with a Miri TyKind representation that are not explicitly supported by CType. For us to support these types, we will need to make use of the “catch-all” CTypes: the pointers *const T and *mut T.
As a demonstrative example of this conversion code, here is the code for converting a list of arguments that are all i32.
// Get the function arguments, and convert them to `libffi`-compatible form.
let mut libffi_args = Vec::<CArg>::with_capacity(args.len());
for cur_arg in args.iter() {
libffi_args.push(Self::scalar_to_carg(
this.read_scalar(cur_arg)?,
&cur_arg.layout.ty,
this,
)?);
}
// Convert them to `libffi::high::Arg` type.
let libffi_args = libffi_args
.iter()
.map(|cur_arg| cur_arg.arg_downcast())
.collect::<Vec<libffi::high::Arg<'_>>>();
// ...
// scalar_to_carg for i32
match arg_type.kind() {
// If the primitive provided can be converted to a type matching the type pattern
// then create a `CArg` of this primitive value with the corresponding `CArg` constructor.
// the ints
TyKind::Int(IntTy::I32) => {
return Ok(CArg::Int32(k.to_i32()?));
}
The code for getting the corresponding return type is similar, just matching over the dest.layout.ty.kind() (the dest is the destination where the return is stored) instead of the arg_type.
As discussed above, Miri handles dispatching to its emulated foreign functions through a function called emulate_foreign_item_by_name in the foreign_items module. In our implementation, we are adding the dispatch to linked foreign functions before the match that tries to call the built-in emulated functions.
In foreign_items.rs:
fn emulate_foreign_item_by_name(...) {
let this = self.eval_context_mut();
// First deal with any external C functions in linked .so file
// (if any SO file is specified).
if this.machine.external_so_lib.as_ref().is_some() {
// An Ok(false) here means that the function being called was not exported
// by the specified SO file; we should continue and check if it corresponds to
// a provided shim.
if this.call_and_add_external_c_fct_to_context(link_name, dest, args)? {
return Ok(EmulateByNameResult::NeedsJumping);
}
}
// continue to testing the emulated functions
The effect of this decision is that now, if Miri encounters a call to a linked foreign function that has the same name as a built-in (emulated) function, then the linked implementation will be run instead of the emulated version. If there is no linked foreign function then the execution will proceed as before: Miri will check to see if the foreign function matches one that is emulated, and if not, it will throw an unsupported error. The reasoning behind this design decision is that if a developer provides a function that has the same signature as a built-in function, it will take precedence over the built-in function, and we want to model this behavior.
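The lookup order described above can be illustrated with a small stand-alone model. This is only a sketch: the ForeignFn alias, the Dispatch enum and call_foreign are hypothetical names, and plain hash maps stand in for the loaded shared object and for Miri's shim table.

```rust
use std::collections::HashMap;

/// Toy signature shared by every "foreign" function in this model.
type ForeignFn = fn(&[i64]) -> i64;

/// Which path handled the call.
#[derive(Debug, PartialEq)]
enum Dispatch {
    Linked(i64),
    Shim(i64),
    Unsupported(String),
}

/// Try the linked library first (if one was given), then the built-in
/// shims, and report an unsupported item if neither knows the name.
fn call_foreign(
    linked: Option<&HashMap<&str, ForeignFn>>,
    shims: &HashMap<&str, ForeignFn>,
    link_name: &str,
    args: &[i64],
) -> Dispatch {
    // A linked symbol shadows a shim with the same name.
    if let Some(f) = linked.and_then(|lib| lib.get(link_name)) {
        return Dispatch::Linked(f(args));
    }
    match shims.get(link_name) {
        Some(f) => Dispatch::Shim(f(args)),
        None => Dispatch::Unsupported(link_name.to_string()),
    }
}
```

With no library supplied (`linked` is `None`), the function behaves exactly like the shim-only lookup, which matches the "execution will proceed as before" case.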
Miri support for the C FFI for functions that take/return primitive values is done, and we're in the process of merging it upstream, with some ongoing discussion and great feedback from the Miri developers (see this PR).
As discussed, C FFI support for functions that take/return primitive values is fairly straightforward. The real technical challenge is in dealing with shared memory between the languages. This includes C pointers returned from C functions (which then may be used in the Rust program, and/or modified by future C function calls), and Miri pointers passed as arguments to C functions.
There's been some discussion on how to support/represent C pointers in Miri and how to pass Miri pointers to C. This can be found in this Zulip thread, and this GitHub issue. We summarize the main points here.
Here are some points about how the Miri allocator works, brought up in the Zulip discussion, that are relevant to the proposed implementation.
This means a pointer to the data bytes of Miri memory can be passed directly to a foreign function, but that we need to account for the metadata (provenance in particular) and make sure that after calls to foreign functions it is modified properly. The effect of foreign function calls on provenance is discussed in this section.
Allocation means modifications to the compiler
The Miri allocator is actually the rustc allocator: Allocations in Miri are created through hooks into the Allocation creation/modification code in rustc.
In its current state, the bytes of an Allocation are not directly accessible to Miri. To be able to pass a pointer to Miri memory to a foreign function call, we need access to the actual machine address of the bytes of an Allocation so that it can be passed to the foreign function.
In Miri, the function alloc_base_addr serves to get an address for a given allocation, specified by an AllocId.
Right now, a fake address is used, but if the machine address of the bytes of the specified Allocation were accessible, then we could just use this instead (with no change to the functionality of Miri on non-FFI code).
A problem: when the bytes field of an Allocation is created, it is not actually aligned with the alignment parameter specified.
This causes alignment mismatches if we pass the machine address of the underlying bytes directly to a foreign function, as the bytes won't necessarily have the alignment that the foreign function expects them to have.
How does this manifest as a bug? The current setup causes issues with double-dereferencing. For example, consider this C function:
int double_deref(const int **p) {
return **p;
}
And the Rust that calls it:
extern "C" {
fn double_deref(x: *const *const i32) -> i32;
}
fn main() {
unsafe {
let base: i32 = 42;
let base_p: *const i32 = &base as *const i32;
let base_pp: *const *const i32 = &base_p as *const *const i32;
assert_eq!(double_deref(base_pp), 42); // seg fault!!
}
}
Here the C program segfaults, because the first dereference *p does not match the actual address of base_p.
In C, when we dereference **p this is not a valid dereference and it crashes.
rustc changes required
- Expose the bytes field of an Allocation, so it can be accessed in Miri.
- Align the bytes field of an allocation when it is created or updated.
These changes are currently in a PR.
There is still some work to be done here, as seen in the discussion on the PR.
The code for manual alignment of the bytes is unsafe, and the current plan is to remove the unsafe code from rustc by following Ralf's suggestion, which is:
- Parameterize Allocation over the type of the bytes, defaulting to Box<[u8]>
- Create a trait AllocationBytes and constrain the type of bytes to this trait; then in this trait we include all the operations we'd like
- Keep the Box<[u8]> implementation in rustc, and implement AllocationBytes with the manual alignment for the Miri Machine.
This way all the unsafe code for aligning the bytes is constrained to the Miri codebase.
The current code for manually aligning the bytes also has an issue that needs to be addressed:
With this current code, when the bytes field is deallocated, even though the size is right, in the layout the alignment might be under-required: for example we might have allocated it with alignment 4 and deallocate it with alignment 2.
This is undefined behaviour violating the memory fitting requirements of dealloc – we're going to work around this by adding a wrapper struct for the Box<[u8]> that stores the alignment it was allocated with, and manually implements deallocation with the right alignment.
Specifically, we're going to use this AlignedSlice that @maurer designed.
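The deallocation requirement can be illustrated with a small wrapper that remembers the Layout it was allocated with. This is a hedged sketch of the general idea only: AlignedBytes and its methods are hypothetical names, and this is not the AlignedSlice design referred to above.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};
use std::ptr::NonNull;

/// Hypothetical byte buffer that stores its allocation layout, so that
/// deallocation uses exactly the size and alignment of the allocation.
struct AlignedBytes {
    ptr: NonNull<u8>,
    layout: Layout,
}

impl AlignedBytes {
    /// Allocate `size` zeroed bytes aligned to `align` (a power of two).
    fn zeroed(size: usize, align: usize) -> Self {
        assert!(size > 0, "zero-sized buffers are not handled in this sketch");
        let layout = Layout::from_size_align(size, align).expect("invalid layout");
        // SAFETY: `layout` has a non-zero size, as checked above.
        let raw = unsafe { alloc_zeroed(layout) };
        let ptr = NonNull::new(raw).unwrap_or_else(|| handle_alloc_error(layout));
        AlignedBytes { ptr, layout }
    }

    /// Machine address of the first byte.
    fn addr(&self) -> usize {
        self.ptr.as_ptr() as usize
    }

    fn as_slice(&self) -> &[u8] {
        // SAFETY: `ptr` is valid for `layout.size()` initialized (zeroed) bytes
        // for as long as `self` is alive.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.layout.size()) }
    }
}

impl Drop for AlignedBytes {
    fn drop(&mut self) {
        // SAFETY: `ptr` came from `alloc_zeroed` with this exact layout,
        // which is what `dealloc` requires.
        unsafe { dealloc(self.ptr.as_ptr(), self.layout) }
    }
}
```

Because the layout travels with the buffer, the "allocated with alignment 4, deallocated with alignment 2" mismatch cannot occur in this sketch.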
- Modify the alloc_base_addr function to use the actual address of the bytes of an Allocation instead of generating a fake address.
These changes are done and in this PR. This PR will be updated with the new changes to Miri when we:
- add the AllocationBytes trait and remove the unsafe code from rustc
- implement AllocationBytes in Miri
- use AlignedSlice for proper deallocation
There are some tests in Miri’s test suite that fail if they’re executed using the real bytes of the Allocation as the address.
All of the failures are because of allocation alignment assumptions being violated, and we don’t think they correspond to bugs in our implementation.
Note: this was found when we were testing – the version of Miri we pushed only uses the real bytes for the address if we’re executing in the FFI mode, and so none of these tests are affected.
These are listed and explained in this linked doc.
In order to support foreign functions that return pointers to foreign memory, we need further modifications to the structure of Allocations.
As it stands, the alloc_id_from_addr
function in Miri deals with retrieving the corresponding Allocation from a given address.
This working is predicated on there being an existing Allocation for an address to be valid – but of course, if a foreign function returns the address of a pointer to foreign memory, this will not correspond to an existing Allocation.
So, we will need to create an Allocation for this external memory.
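A toy model of the address lookup shows why a foreign address fails until an Allocation is registered for it. The AddressMap type and its register method are hypothetical; only the name alloc_id_from_addr is taken from the text, and the real function is more involved.

```rust
use std::collections::BTreeMap;

/// Toy allocation identifier.
#[derive(Debug, Clone, Copy, PartialEq)]
struct AllocId(u64);

/// Maps each allocation's base address to its size in bytes and its id.
#[derive(Default)]
struct AddressMap {
    by_base: BTreeMap<u64, (u64, AllocId)>,
}

impl AddressMap {
    /// Record that `id` covers the bytes `base .. base + size`.
    fn register(&mut self, base: u64, size: u64, id: AllocId) {
        self.by_base.insert(base, (size, id));
    }

    /// Find the allocation containing `addr`, if any.
    fn alloc_id_from_addr(&self, addr: u64) -> Option<AllocId> {
        // The only candidate is the allocation with the greatest base <= addr.
        let (&base, &(size, id)) = self.by_base.range(..=addr).next_back()?;
        if addr - base < size {
            Some(id)
        } else {
            None
        }
    }
}
```

In this model, an address returned from C resolves to `None` until something calls `register` for it, which is the gap the new kind of Allocation is meant to fill.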
The current structure of Allocations is such that they own their bytes.
However, this won't work for bytes in foreign memory, which are not even owned by the Rust program executing.
We propose that instead of using Box<[u8]> for the type of the bytes field of an Allocation, we introduce a new enum type:
pub enum AllocBytes {
/// Owned, boxed slice of [u8].
Boxed(Box<[u8]>),
/// Address, size of the type stored, and length of the allocation.
/// This is used for representing pointers to bytes that belong to a
/// foreign process (such as pointers into C memory, passed back to Rust
/// through an FFI call).
Addr(AddrAllocBytes),
}
This enum type will implement the AllocationBytes trait discussed above, and then we will use AllocBytes as the type that the Miri Machine will parameterize the Allocation bytes with.
Here, AllocBytes::Boxed represents the Box<[u8]> with manual alignment.
The AllocBytes::Addr variant is used to represent Allocations corresponding to foreign memory.
For this we use another new data structure, AddrAllocBytes, which represents a section of memory, starting at a particular address, and of a specified length.
pub struct AddrAllocBytes {
/// Address of the beginning of the bytes.
pub addr: u64,
/// Size of the type of the data being stored in these bytes.
pub type_size: usize,
/// Length of the bytes, in multiples of `type_size`;
/// it's in a `RefCell` since it can change dynamically,
/// depending on how it's used in the program. UNSAFE
pub len: std::cell::RefCell<usize>,
}
With these changes, we can support foreign pointers in Miri by, in the case of foreign functions that return pointers, creating an allocation of this kind and adding it to memory.
Then, alloc_id_from_addr works as before, since now the foreign address does have a corresponding Allocation.
This part of the implementation is still a work in progress.
Specifically, one thing that we will change is that in our current implementation, we've added a function allocate_ptr_raw_addr in the rustc Memory to allow Miri to create this new kind of Allocation, and store it in memory.
However, when we create the AllocationBytes trait and move the AllocBytes enum to Miri, this function will no longer be necessary.
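As a rough illustration of the planned trait-based design, the sketch below parameterizes a toy Allocation over its byte storage, with Box<[u8]> as the default. The method names (zeroed, as_bytes, base_addr), the VecBytes type and the fields of this toy Allocation are assumptions made for the example; the real trait would need whatever operations rustc's Allocation actually performs on its bytes.

```rust
/// Hypothetical trait: the operations an allocation needs from its storage.
trait AllocationBytes {
    /// Create `len` zeroed bytes.
    fn zeroed(len: usize) -> Self;
    /// View the stored bytes.
    fn as_bytes(&self) -> &[u8];
    /// Machine address of the first byte.
    fn base_addr(&self) -> usize {
        self.as_bytes().as_ptr() as usize
    }
}

/// Default storage: a plain boxed slice, with no unsafe code.
impl AllocationBytes for Box<[u8]> {
    fn zeroed(len: usize) -> Self {
        vec![0u8; len].into_boxed_slice()
    }

    fn as_bytes(&self) -> &[u8] {
        self
    }
}

/// Stand-in for a machine-specific storage type that lives outside the
/// compiler (a real one could add manual alignment or foreign addresses).
struct VecBytes(Vec<u8>);

impl AllocationBytes for VecBytes {
    fn zeroed(len: usize) -> Self {
        VecBytes(vec![0u8; len])
    }

    fn as_bytes(&self) -> &[u8] {
        &self.0
    }
}

/// Toy allocation, generic over its byte storage.
struct Allocation<B: AllocationBytes = Box<[u8]>> {
    bytes: B,
    align: usize,
}

impl<B: AllocationBytes> Allocation<B> {
    fn new(len: usize, align: usize) -> Self {
        Allocation { bytes: B::zeroed(len), align }
    }
}
```

Code that names plain `Allocation` keeps using the boxed slice, while a machine can pick its own storage by writing `Allocation<VecBytes>`.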
The AddrAllocBytes len field represents the length of the Allocation.
It is a RefCell so that it can be modified over the lifetime of the Allocation.
Unfortunately, we can't know the exact size of the memory the C pointer refers to without some instrumentation or interception of the C code executing, which we are not currently doing.
The initial value chosen for len determines the size that Miri considers valid for the pointer.
We want to allow C to return (for example) arrays as pointers and for Rust to then access sequential elements in the array.
For example, we should be able to run the following program.
// C code
int* array_pointer_test() {
const int COUNT = 3;
int *arr = malloc(COUNT*sizeof(int));
for(int i = 0; i < COUNT; ++i)
arr[i] = i;
return arr;
}
extern "C" {
fn array_pointer_test() -> *mut i32;
}
// Return pointer to array of i32 from C,
// and read part of the array as a slice
fn main() {
unsafe {
let arr_ptr = array_pointer_test();
let slice = std::slice::from_raw_parts(
arr_ptr as *const i32, 3u64 as usize);
assert_eq!(slice, [0, 1, 2]);
assert_eq!(*arr_ptr, 0);
assert_eq!(*arr_ptr.offset(1), 1);
}
}
When we create an Allocation for arr_ptr in Miri, this needs to have a len large enough that the creation of a slice of length 3 and the access to *arr_ptr.offset(1) are not out of bounds.
Our current hack solution is to give every Allocation corresponding to a C pointer a len of 1000 (so in this case, we consider arr_ptr to point to an array of 1000 i32 elements).
The reasoning is that this should be large enough to cover the vast majority of pointers.
This doesn't actually allocate any memory, so it is not wasting space; it just means that an access that is actually out of bounds, but still within this assumed length, will not be caught.
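The effect of the chosen len on bounds checking can be sketched as follows. The struct mirrors the AddrAllocBytes fields shown earlier; size_in_bytes and access_in_bounds are hypothetical helpers added for illustration, not functions from the implementation.

```rust
use std::cell::RefCell;

/// Mirrors the fields of `AddrAllocBytes` from the text.
struct AddrAllocBytes {
    addr: u64,
    type_size: usize,
    len: RefCell<usize>,
}

impl AddrAllocBytes {
    /// Number of bytes considered valid: `len` elements of `type_size` bytes.
    fn size_in_bytes(&self) -> usize {
        self.type_size * *self.len.borrow()
    }

    /// Hypothetical helper: is an access of `access_size` bytes at byte
    /// `offset` from `addr` inside the assumed extent of the allocation?
    fn access_in_bounds(&self, offset: usize, access_size: usize) -> bool {
        offset
            .checked_add(access_size)
            .map_or(false, |end| end <= self.size_in_bytes())
    }
}
```

For a pointer to i32 with len 1000, both the 12-byte slice and the read at offset 4 from the example are accepted; so would be a read at offset 400, which lies past the three ints that C really allocated. That is the missed-error case described above.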
This comment in the related GitHub issue raises some important questions about provenance. In particular:
We already need to create some provenance value for pointers to C memory.
One idea would be to have one provenance value for these, and then give this same C provenance value to any pointers to Miri memory that are exposed to C (i.e., passed as arguments).
This would involve recursing through any exposed Miri pointers (basically, building a pointer reachability graph and giving all of these pointers the same provenance). This is the same idea as the retag_fields option in stacked borrows, which determines if retagging (modifying the provenance) should recurse into fields, but in this case it should always be true.
Of course, we can't recurse into C memory, since we don't know if there are any pointer fields in a pointer to C memory – Miri will know that a pointer returned from C is a pointer into C memory, but not know the underlying structure of that memory.
We know that anything accessed through a pointer into C memory will be known to have the C provenance.
Note: things might get complicated here if a C object stores a pointer into Miri memory, but we will be able to tell that it is Miri memory by checking that there is a corresponding Miri AllocId for its address.
The C provenance value is a similar idea to the Wildcard provenance that already exists in Miri; we would reuse this to tag all the memory exposed to or originating from C.
The advantage of this option is that we would not need to track the list of exposed addresses and re-sync the provenance of this entire list after every call to C: there is no need to re-sync the provenance since we already know that everything in the sync list will have the same (C) provenance! This would be much more efficient than the more complex idea proposed below, and would be simpler to implement.
This idea would also not result in any loss in provenance data in memory not passed into the FFI. Essentially, it would only affect the provenance data for Miri memory that is pointed to by a Miri pointer that is passed to C.
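The reachability step can be sketched as a breadth-first traversal over a "points-to" map. This is only a model of the idea: exposed_to_c is a hypothetical function, allocations are plain integer ids, and the map stands in for scanning the pointer fields of each Miri allocation.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Toy allocation identifier.
type AllocId = u32;

/// Given, for each Miri allocation, the allocations its pointer fields
/// refer to, collect everything reachable from the arguments passed to C.
/// Every allocation in the result would receive the single C provenance.
fn exposed_to_c(
    points_to: &HashMap<AllocId, Vec<AllocId>>,
    args: &[AllocId],
) -> HashSet<AllocId> {
    let mut exposed = HashSet::new();
    let mut queue: VecDeque<AllocId> = args.iter().copied().collect();
    while let Some(id) = queue.pop_front() {
        // `insert` returns false for already-visited nodes, so cycles terminate.
        if exposed.insert(id) {
            if let Some(targets) = points_to.get(&id) {
                queue.extend(targets.iter().copied());
            }
        }
    }
    exposed
}
```

Allocations that are not reachable from any argument never enter the set, which matches the claim that provenance for memory not passed into the FFI is left untouched.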
We have various options when it comes to what to do with the provenance of memory exposed to C.
The first option would be the most conservative, and the most sound without implementing a strategy for looking at the specific effect of C code on values in memory. However, it is also pretty useless: it would mean all provenance in the entire program is lost with any use of the C FFI at all. We propose that the second option is a better idea: it allows the checks that make use of provenance to continue unchanged for the pointers not explicitly exposed to C, and still allows reasoning about and tracking of the memory that is exposed.
The last solution is the best solution, as it is the most precise. We have some ideas about how this might be implemented, but they are future work for now, and we propose solution 2 in the meantime.
We propose that the FFI support and strict provenance mode not be allowed to be used together in Miri.
We have a couple of potential ideas for implementing more fine-grained reasoning about provenance of pointers exposed to C.
We could add a list that tracks all the pointers and their provenance values.
This would be in the Miri evaluation context, separate from the memory itself.
Then, after every FFI call, we would iterate over the memory and compare the pointer values to their original values (in the newly added list).
With this, we would catch all changes to the Miri memory. We would also be able to say what the changes to provenance should be, as we could identify when Miri pointers have been reassigned and what they have been reassigned to.
This solution still only has one "C provenance" value for all pointers to C memory (and this would be the provenance for Miri pointers reassigned to refer to C memory).
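A minimal model of this snapshot-and-compare idea is sketched below. All names are hypothetical: PtrSlot models one pointer stored in Miri memory, snapshot records values before the FFI call, and resync compares afterwards, with the miri_alloc_of closure standing in for the address-to-allocation lookup.

```rust
use std::collections::HashMap;

/// Toy provenance: a specific Miri allocation, or the single C provenance.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Provenance {
    Miri(u32),
    C,
}

/// A pointer stored somewhere in Miri memory.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PtrSlot {
    value: u64,
    prov: Provenance,
}

/// Before the FFI call: remember the value stored in every pointer slot,
/// keyed by the slot's own address.
fn snapshot(slots: &HashMap<u64, PtrSlot>) -> HashMap<u64, u64> {
    slots.iter().map(|(&addr, slot)| (addr, slot.value)).collect()
}

/// After the FFI call: every slot whose value changed gets its provenance
/// recomputed from where it now points. Returns the changed slot addresses.
fn resync(
    slots: &mut HashMap<u64, PtrSlot>,
    before: &HashMap<u64, u64>,
    miri_alloc_of: impl Fn(u64) -> Option<u32>,
) -> Vec<u64> {
    let mut changed = Vec::new();
    for (&addr, slot) in slots.iter_mut() {
        if before.get(&addr) != Some(&slot.value) {
            slot.prov = match miri_alloc_of(slot.value) {
                Some(id) => Provenance::Miri(id),
                None => Provenance::C,
            };
            changed.push(addr);
        }
    }
    changed.sort_unstable();
    changed
}
```

The cost is visible in the sketch: both functions walk every tracked slot on every call, which is the overhead the single-provenance option avoids.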
Pros
Cons
AddressSanitizer (ASAN) is a sanitizer designed to find memory use errors, such as use-after-free bugs. Of particular relevance for us, ASAN allows for specific memory to be exposed or "poisoned", such that access to poisoned memory would be considered a bug.
We could use ASAN to set up "guards" on the Miri memory that C doesn't have explicit access to, by only exposing the data it should be able to see and "poisoning" the rest. This would mean that our assumption about the provenance of the non-directly-C-accessible Miri memory not changing in the presence of C calls would be verifiable – and it would let us catch errors if C does access this memory. We might also be able to use it to detect if C doesn't access Miri memory that it technically has access to, in which case that provenance information can also remain unmodified.
ASAN has an interface for C/C++, and can be used with FFI, as long as the C/C++ code it is sanitizing is linked with ASAN when it is compiled. At a high level, it seems like it should work out-of-the-box on the C code being called with the Rust C FFI.
This solution also still has "C provenance" for all memory modified by or originating from C.
Pros
Cons
In this last section we summarize the current state of the project, and the work that still needs to get done.
At this point, we can call C functions from Miri with the following argument and return types:
Here are some examples of function calls that we support:
// C code
void deref_and_print(int *p) {
printf("deref in C has value: %d\n", *p);
}
long add_short_to_long(short x, long y) {
return x + y;
}
int* pointer_test() {
int *point = malloc(sizeof(int));
*point=1;
return point;
}
int* array_pointer_test() {
const int COUNT = 3;
int *arr = malloc(COUNT*sizeof(int));
for(int i = 0; i < COUNT; ++i)
arr[i] = i;
return arr;
}
// double dereference pointers, and swap what values they're pointing to
// note: this is only writing non-pointer values to memory
void swap_double_ptrs(short **x, short **y) {
short temp = **x;
**x = **y;
**y = temp;
}
// write non-pointer values to memory represented by pointers
void set(short *x, short val) { *x = val; }
extern "C" {
fn add_short_to_long(x: i16, y: i64) -> i64;
fn pointer_test() -> *mut i32;
fn deref_and_print(x: *mut i32);
fn array_pointer_test() -> *mut i32;
fn swap_double_ptrs(x: *mut *mut i16, y: *mut *mut i16);
fn set(x: *mut i16, v: i16);
}
fn main() {
unsafe {
// test function that adds an i16 to an i64
assert_eq!(add_short_to_long(-1i16, 123456789123i64), 123456789122i64);
// test return pointer to i32 from C, dereference, modify in Rust,
// and see changes in C
let ptr = pointer_test();
assert_eq!(*ptr, 1);
*ptr = 5;
assert_eq!(*ptr, 5);
deref_and_print(ptr); // void function that prints: *ptr is 5 in C
// test return pointer to array of i32 from C,
// and read part of the array as a slice
let arr_ptr = array_pointer_test();
let slice = std::slice::from_raw_parts(arr_ptr as *const i32, 3u64 as usize);
assert_eq!(slice, [0, 1, 2]);
assert_eq!(*arr_ptr, 0);
assert_eq!(*arr_ptr.offset(1), 1);
// mutate the pointer and see it reflected in the slice
*arr_ptr.offset(1) = 5;
assert_eq!(slice, [0, 5, 2]);
// test passing a Rust pointer to C and reassigning its value
let mut set_base: i16 = 1;
let mut set_base_p: *mut i16 = &mut set_base as *mut i16;
set(set_base_p, 3);
assert_eq!(set_base, 3);
assert_eq!(*set_base_p, 3);
// test passing two double pointers, and swapping the _values_ they point to
// note: this is _not_ C writing pointers to Miri memory
let mut new_base: i16 = 2;
let mut new_base_p: *mut i16 = &mut new_base as *mut i16;
let new_base_pp: *mut *mut i16 = &mut new_base_p as *mut *mut i16;
let set_base_pp: *mut *mut i16 = &mut set_base_p as *mut *mut i16;
assert_eq!(**set_base_pp, 3);
assert_eq!(**new_base_pp, 2);
assert_ne!(*new_base_pp, *set_base_pp);
swap_double_ptrs(set_base_pp, new_base_pp);
assert_ne!(*new_base_pp, *set_base_pp);
assert_eq!(**set_base_pp, 2);
assert_eq!(**new_base_pp, 3);
}
}
This is a subset of our whole test suite, which can be found:
There are some things that are not quite working yet: In particular, C functions that write pointers to Miri memory cause Miri to crash with a UB error stating that the memory access is invalid when this memory is dereferenced to a value.
For example:
// C code
void setptr(short **x, short *val) { *x = val; }
extern "C" {
fn setptr(p: *mut *mut i16, x: *mut i16);
}
fn main() {
    unsafe {
        // set up the pointers, as in the supported example above
        let mut set_base: i16 = 1;
        let mut set_base_p: *mut i16 = &mut set_base as *mut i16;
        let set_base_pp: *mut *mut i16 = &mut set_base_p as *mut *mut i16;
        // test passing a double pointer and a single pointer,
        // and reassigning the double pointer
        // to point to the single pointer
        let mut new_base: i16 = 2;
        let new_base_p: *mut i16 = &mut new_base as *mut i16;
        assert_ne!(new_base_p, set_base_p);
        setptr(set_base_pp, new_base_p);
        assert_eq!(new_base_p, set_base_p);
        // uh oh: the following code breaks
        // let rust_ddref = **set_base_pp;
        // let rust_dref = *set_base_p;
    }
}
This is because the Miri pointers exposed to C need to be updated with Wildcard provenance after calls to C, but this is not done yet.
In these cases, the expected provenance is now wrong, since the pointers have been reassigned in C.
There are also some concrete work items that still need to get done:
- Update rustc with the AllocationBytes trait, and then propagate this change into Miri.
- Change the Box<[u8]> Allocation bytes representation to use the AlignedSlice when it needs to be manually aligned.
- Assign C (Wildcard) provenance to the pointers to Miri memory that C gets access to; this should enable support for C functions that write pointers to Miri memory.
In addition to the concrete work items listed above, there are various interesting avenues for future work on this project.
- Intercept malloc and free to provide Miri with information on the actual size and lifetime of C pointers. This would replace the hardcoded "len 1000" in our Allocations.