# Miri C FFI Extension
This doc describes a proposed design and current work on extending Miri to support the Rust C FFI.
The plan involves some changes to underlying data structures that are part of `rustc`, and the doc also explains what these changes are and the reasoning behind them.
## Relevant links
* [**Fork of Miri**](https://github.com/emarteca/miri)
* [Pull request for support of C functions with `int` args/returns](https://github.com/rust-lang/miri/pull/2363)
* [Pull request for using the real address of bytes as the address in Miri](https://github.com/rust-lang/miri/pull/2498)
* [**Fork of rustc**](https://github.com/emarteca/rust/)
* [Pull request for adding serialization support for `Box` with a custom allocator](https://github.com/rust-lang/rust/pull/99920#issuecomment-1199907461)
* [Pull request for exposed and aligned bytes in Allocations in `rustc`](https://github.com/rust-lang/rust/pull/100467)
* [Pull request for adding different types of `bytes` in Allocations (for foreign memory) in `rustc`](https://github.com/rust-lang/rust/pull/100899)
* [Linked GitHub issue](https://github.com/rust-lang/miri/issues/2365)
## Miri Design
At its core, Miri is an [abstract machine](https://github.com/rust-lang/miri/blob/master/src/machine.rs). It represents the state of the program it is executing, including an internal model of the memory used by the process. It consists of an `Evaluator` struct that has fields for all the components of the runtime state.
### The Abstract Machine
The Rust compiler provides a `Machine` trait designed to help instantiate an interpreter for [MIR](https://doc.rust-lang.org/nightly/nightly-rustc/src/rustc_const_eval/interpret/machine.rs.html). It provides hooks for different operations involved in program execution. For example, the trait provides a hook function `memory_read` which is used to add custom functionality to memory reads. The code for this hook stub in the rustc `Machine` is as follows:
```[rust]
/// Hook for performing extra checks on a memory read access.
///
/// Takes read-only access to the allocation so we can keep all the memory read
/// operations take `&self`. Use a `RefCell` in `AllocExtra` if you
/// need to mutate.
#[inline(always)]
fn memory_read(
    _tcx: TyCtxt<'tcx>,
    _machine: &Self,
    _alloc_extra: &Self::AllocExtra,
    _tag: (AllocId, Self::TagExtra),
    _range: AllocRange,
) -> InterpResult<'tcx> {
    Ok(())
}
```
In Miri, the `Evaluator` implements the rustc [`Machine` trait](https://github.com/rust-lang/miri/blob/master/src/machine.rs#L488). There, it overrides some trait functions. Among these functions is the [`memory_read` function](https://github.com/rust-lang/miri/blob/master/src/machine.rs#L745) described above: here, Miri has some custom functionality for tracking and dealing with data races and stacked borrows when memory is accessed.
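The hook pattern can be illustrated with a self-contained sketch. Everything here is a simplified stand-in: this `Machine` trait, `PlainMachine`, `CheckingMachine`, and `interpreter_read` are hypothetical names, not the rustc or Miri definitions. The point is only that the interpreter core calls a hook with a no-op default, and a concrete machine overrides it to add its own checks.

```rust
use std::cell::Cell;

/// Simplified, hypothetical stand-in for rustc's `Machine` trait.
trait Machine {
    /// Default hook: accept every memory read without extra checks.
    fn memory_read(&self, _offset: usize, _len: usize) -> Result<(), String> {
        Ok(())
    }
}

/// A machine that keeps the default hook.
struct PlainMachine;
impl Machine for PlainMachine {}

/// A machine that overrides the hook with its own bookkeeping and checks.
struct CheckingMachine {
    alloc_size: usize,
    reads_seen: Cell<usize>,
}

impl CheckingMachine {
    fn new(alloc_size: usize) -> Self {
        CheckingMachine { alloc_size, reads_seen: Cell::new(0) }
    }

    fn reads_seen(&self) -> usize {
        self.reads_seen.get()
    }
}

impl Machine for CheckingMachine {
    fn memory_read(&self, offset: usize, len: usize) -> Result<(), String> {
        // The hook only receives `&self`, so mutation needs interior mutability.
        self.reads_seen.set(self.reads_seen.get() + 1);
        match offset.checked_add(len) {
            Some(end) if end <= self.alloc_size => Ok(()),
            _ => Err(format!("read of {} bytes at offset {} is out of bounds", len, offset)),
        }
    }
}

/// The interpreter core calls the hook before doing the actual read.
fn interpreter_read<M: Machine>(
    machine: &M,
    memory: &[u8],
    offset: usize,
    len: usize,
) -> Result<Vec<u8>, String> {
    machine.memory_read(offset, len)?;
    offset
        .checked_add(len)
        .and_then(|end| memory.get(offset..end))
        .map(|bytes| bytes.to_vec())
        .ok_or_else(|| String::from("raw access out of range"))
}
```

With `PlainMachine` every in-range read succeeds, while `CheckingMachine::new(4)` rejects a 2-byte read at offset 3 and counts each read it was asked about.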
### The Evaluation Context
The `Machine` is running inside an evaluation context. This is the [`InterpCx` (Interpreter Context) struct](https://doc.rust-lang.org/nightly/nightly-rustc/src/rustc_const_eval/interpret/eval_context.rs.html#31), provided again by the rustc interpreter support. Miri has its own version of the `InterpCx`, the [`MiriEvalContext`](https://github.com/rust-lang/miri/blob/master/src/machine.rs#L469), which is just the base `InterpCx` with the appropriate lifetime parameters, for the Miri Evaluator.
```[rust]
/// A rustc InterpCx for Miri.
pub type MiriEvalContext<'mir, 'tcx> = InterpCx<'mir, 'tcx, Evaluator<'mir, 'tcx>>;
```
Miri also provides an extension trait for custom evaluation contexts even within Miri itself. This is the mechanism by which different parts of Miri modularize their customizations to the environment. For example, Miri provides some [functionality for detecting data races](https://github.com/rust-lang/miri/blob/master/src/concurrency/data_race.rs). As part of this functionality, the data race module extends the evaluation context with some data-race-specific functions: this is done by extending the `MiriEvalContext`.
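A minimal sketch of the extension-trait pattern, under heavy simplification: the `InterpCx`, `Evaluator`, and `MiriEvalContext` below are toy stand-ins without lifetimes, and `DataRaceEvalContextExt`, `record_racy_access`, `racy_access_count`, and `new_context` are hypothetical names chosen for illustration.

```rust
/// Toy stand-in for rustc's `InterpCx`, generic over the machine.
struct InterpCx<M> {
    machine: M,
}

/// Toy stand-in for Miri's `Evaluator`.
struct Evaluator {
    racy_accesses: u32,
}

/// Toy version of the Miri alias: the base context specialised to Miri's machine.
type MiriEvalContext = InterpCx<Evaluator>;

/// Extension trait owned by one module (here: a toy data race checker).
/// The module adds its functions to the context without touching `InterpCx`.
trait DataRaceEvalContextExt {
    fn record_racy_access(&mut self);
    fn racy_access_count(&self) -> u32;
}

impl DataRaceEvalContextExt for MiriEvalContext {
    fn record_racy_access(&mut self) {
        self.machine.racy_accesses += 1;
    }

    fn racy_access_count(&self) -> u32 {
        self.machine.racy_accesses
    }
}

/// Helper that builds an empty context for the example.
fn new_context() -> MiriEvalContext {
    InterpCx { machine: Evaluator { racy_accesses: 0 } }
}
```

Any code that has the trait in scope can call `record_racy_access` directly on the context, which is how separate modules keep their additions isolated from each other.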
### Current FFI Support
Miri does currently have some limited support for foreign function calls via emulation. This is all contained in the [`foreign_items` module](https://github.com/rust-lang/miri/blob/master/src/shims/foreign_items.rs).
This support consists of a hardcoded list of manually emulated functions, built to support commonly used foreign functions such as `malloc`. As it stands, there is a custom extension to the `MiriEvalContext` ([in shims/mod](https://github.com/rust-lang/miri/blob/master/src/shims/mod.rs#L25)) that implements a custom hook for function calls. This hook calls a function `emulate_foreign_item` if the function being called is identified as being a “foreign item” (i.e., if its body cannot be found). The relevant call, along with the corresponding comments, is included below to illustrate.
```[rust]
// Try to see if we can do something about foreign items.
if this.tcx.is_foreign_item(instance.def_id()) {
    // An external function call that does not have a MIR body. We either find MIR elsewhere
    // or emulate its effect.
    // This will be Ok(None) if we're emulating the intrinsic entirely within Miri (no need
    // to run extra MIR), and Ok(Some(body)) if we found MIR to run for the
    // foreign function
    // Any needed call to `goto_block` will be performed by `emulate_foreign_item`.
    return this.emulate_foreign_item(instance.def_id(), abi, args, dest, ret, unwind);
}
```
Since this list of supported foreign functions is hardcoded, it is limited to only built-in native calls (and is not an exhaustive list of these). If Miri encounters a foreign item whose name is unknown, it reports an unsupported-operation error and stops interpreting the program.
## Proposed plan
We are not touching the C code being executed. As much as possible, we constrain the modifications to Miri itself, with some modifications to the `rustc` compiler.
We add hooks around calls to external C functions that handle any tagging or custom allocation of data returned from C calls or passed as arguments to C calls.
### System support
Currently we're only supporting this feature on Linux.
### Calling C from Miri
In order to call C code from a Rust program executing in Miri, we are extending Miri with the [`libffi`](https://docs.rs/libffi/latest/libffi/index.html) crate. This provides an interface to the host system’s `libffi`. It allows us to dispatch calls to linked code.
#### Linking C code
Miri doesn’t currently have a mechanism to link to external C code. We’ve implemented this by adding a new command line argument `-Zmiri-extern-so-file` that users can use to specify a path to a shared object file.
#### Dispatching calls to C code
When an external C call is encountered by Miri, the steps it follows to dispatch the call are:
1. Load the specified linked C shared object file (if applicable)
* using the [`libloading`](https://docs.rs/libloading/0.7.3/libloading/) crate
2. Load the specified function call from the linked library
* again using `libloading`
3. Convert all arguments to the function into values that can be passed into C
4. Call the C function
* using [`libffi`’s `call` function](https://docs.rs/libffi/latest/libffi/high/call/fn.call.html)
5. Store the return value of the function
Following is a simplified/condensed version of the code we added to call a function that returns an `i32` primitive using `libffi`. Note that we’ve removed the error handling code for simplicity.
```[rust]
unsafe {
    // get the libloading::Library and extract the function
    // (error handling elided: the `unwrap`s stand in for the real checks)
    let lib = this.machine.external_so_lib.as_ref().unwrap();
    let func: libloading::Symbol<unsafe extern "C" fn()> =
        lib.get(link_name.as_str().as_bytes()).unwrap();
    // get the code pointer
    let ptr = CodePtr(*func.deref() as *mut _);
    // call function and get return value (in this case an i32)
    let x = call::<i32>(ptr, libffi_args.as_slice());
    // store the value in Miri's internal memory
    this.write_int(x, dest)?;
}
```
[`CodePtr`](https://docs.rs/libffi/latest/libffi/low/struct.CodePtr.html) is a code pointer type supplied by `libffi`, to provide access to the function being called.
#### A note on types
Part of the simplification of the code above is that it elides the type conversion required to turn values from their Miri representation into their corresponding values that get passed into the C function call. This is required for both the function arguments (to construct the `libffi_args` vector, we iterate over the arguments to the call in Miri) and for the function return.
To determine a correspondence between the Miri types and the C types, we refer to the available Miri types ([`TyKind`](https://doc.rust-lang.org/nightly/nightly-rustc/src/rustc_type_ir/sty.rs.html#26)s) and the types that implement the [`CType` trait](https://docs.rs/libffi/latest/libffi/high/types/trait.CType.html) in `libffi`. Clearly these do not have a 1:1 correspondence: there are many more complex types with a Miri `TyKind` representation that are not explicitly supported by `CType`. For us to support these types, we will need to make use of the “catch-all” CTypes: `*const T` and `*mut T`, the pointers.
As a demonstrative example of this conversion code, here is the code for converting a list of arguments that are all `i32`.
```[rust]
// Get the function arguments, and convert them to `libffi`-compatible form.
let mut libffi_args = Vec::<CArg>::with_capacity(args.len());
for cur_arg in args.iter() {
    libffi_args.push(Self::scalar_to_carg(
        this.read_scalar(cur_arg)?,
        &cur_arg.layout.ty,
        this,
    )?);
}
// Convert them to `libffi::high::Arg` type.
let libffi_args = libffi_args
    .iter()
    .map(|cur_arg| cur_arg.arg_downcast())
    .collect::<Vec<libffi::high::Arg<'_>>>();
// ...
// scalar_to_carg for i32 (here `k` is the scalar and `arg_type` is its type)
match arg_type.kind() {
    // If the primitive provided can be converted to a type matching the type pattern
    // then create a `CArg` of this primitive value with the corresponding `CArg` constructor.
    // the ints
    TyKind::Int(IntTy::I32) => {
        return Ok(CArg::Int32(k.to_i32()?));
    }
    // ... remaining cases elided
}
```
The code for getting the corresponding return type is similar, just matching over the `dest.layout.ty.kind()` (the `dest` is the destination where the return is stored) instead of the `arg_type`.
#### When and where are we dispatching the C calls?
As discussed above, Miri handles dispatching to its emulated foreign functions through a function called `emulate_foreign_item_by_name` in the `foreign_items` module. In our implementation, we are adding the dispatch to linked foreign functions before the match to try and call the built-in emulated functions.
In `foreign_items.rs`:
```[rust]
fn emulate_foreign_item_by_name(...) {
    let this = self.eval_context_mut();
    // First deal with any external C functions in linked .so file
    // (if any SO file is specified).
    if this.machine.external_so_lib.as_ref().is_some() {
        // An Ok(false) here means that the function being called was not exported
        // by the specified SO file; we should continue and check if it corresponds to
        // a provided shim.
        if this.call_and_add_external_c_fct_to_context(link_name, dest, args)? {
            return Ok(EmulateByNameResult::NeedsJumping);
        }
    }
    // continue to testing the emulated functions
    // ...
}
```
The effect of this decision is that if Miri encounters a call to a linked foreign function that has the same name as a built-in (emulated) function, the linked implementation is run instead of the emulated version. If there is no linked foreign function with that name, execution proceeds as before: Miri checks whether the foreign function matches one that is emulated, and if not, it reports an unsupported-operation error. The reasoning behind this design decision is that if a developer provides a function with the same name as a built-in function, their version takes precedence over the built-in one, and we want to model this behavior.
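The lookup order described here can be sketched without any FFI at all. In this toy model the "linked library" is just a map from names to Rust function pointers, and `ForeignFn`, `emulate_builtin`, `demo_library`, and `call_foreign` are hypothetical names; the real implementation resolves symbols in the shared object instead.

```rust
use std::collections::HashMap;

/// Toy signature for a foreign function: integer arguments, integer result.
type ForeignFn = fn(&[i64]) -> i64;

/// Toy built-in shims, matched by name (not Miri's real list).
fn emulate_builtin(name: &str, args: &[i64]) -> Option<i64> {
    match name {
        "abs" => args.first().map(|x| x.abs()),
        _ => None,
    }
}

/// A "linked" `abs` that is deliberately distinguishable from the shim.
fn linked_abs(args: &[i64]) -> i64 {
    args[0].abs() + 1000
}

fn linked_twice(args: &[i64]) -> i64 {
    args[0] * 2
}

/// Stand-in for the functions exported by the user-supplied shared object.
fn demo_library() -> HashMap<&'static str, ForeignFn> {
    let mut lib: HashMap<&'static str, ForeignFn> = HashMap::new();
    lib.insert("abs", linked_abs as ForeignFn);
    lib.insert("twice", linked_twice as ForeignFn);
    lib
}

/// Dispatch order: linked library first, then built-in shims, then an error.
fn call_foreign(
    linked: Option<&HashMap<&'static str, ForeignFn>>,
    name: &str,
    args: &[i64],
) -> Result<i64, String> {
    if let Some(f) = linked.and_then(|lib| lib.get(name)) {
        return Ok((*f)(args));
    }
    if let Some(value) = emulate_builtin(name, args) {
        return Ok(value);
    }
    Err(format!("unsupported foreign function: {}", name))
}
```

Calling `abs` with the demo library present reaches `linked_abs`; without a library the same call falls through to the shim, and a name known to neither produces the error.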
Miri support for the C FFI for functions that take/return primitive values is done, and we're in the process of merging it upstream, with some ongoing discussion and great feedback from the Miri developers (see [this PR](https://github.com/rust-lang/miri/pull/2363)).
## Pointers
As discussed, C FFI support for functions that take/return primitive values is fairly straightforward. The real technical challenge is in dealing with _shared_ memory between the languages. This includes C pointers returned from C functions (which then may be used in the Rust program, and/or modified by future C function calls), and Miri pointers passed as arguments to C functions.
There's been some discussion on how to support/represent C pointers in Miri and how to pass Miri pointers to C.
This can be found in [this Zulip thread](https://rust-lang.zulipchat.com/#narrow/stream/269128-miri/topic/libffi.20.2B.20pointers), and [this GitHub issue](https://github.com/rust-lang/miri/issues/2365).
We summarize the main points here.
### Passing Miri pointers to external C functions
Here are some points about how the Miri allocator works, brought up in the Zulip discussion, that are relevant to the proposed implementation.
* Allocations in Miri are not resized or moved after they are created.
* If an FFI call works in the original Rust program, then the data in any Miri allocation whose pointer is passed as an argument must already be compatible with the type the foreign function expects.
* Miri pointers, in addition to the data bytes, also carry metadata such as the provenance.
This means a pointer to the data bytes of Miri memory can be passed directly to a foreign function, but that we need to account for the metadata (provenance in particular) and make sure that after calls to foreign functions it is modified properly.
The effect of foreign function calls on provenance is discussed [in this section](https://hackmd.io/eFY7Jyl6QGeGKQlJvBp6pw?view#Provenance).
#### Implementation details: modifications to `Allocation` means modifications to the compiler
The Miri allocator is actually the `rustc` allocator -- Allocations in Miri are created through hooks into the Allocation creation/modification code in `rustc`.
In its current state, the bytes of an Allocation are not directly accessible to Miri.
To be able to pass a pointer to Miri memory to a foreign function call, we need access to the actual machine address of the bytes of an Allocation so that it can be passed to the foreign function.
In Miri, [the function `alloc_base_addr`](https://github.com/rust-lang/miri/blob/master/src/intptrcast.rs#L159) serves to get an address for a given allocation, specified by an AllocID.
Right now, a fake address is used, but if the machine address of the bytes of the specified Allocation were accessible, we could use it instead (with no change to the behaviour of Miri on non-FFI code).
**A problem**: when the `bytes` field of an Allocation is created, it is not actually aligned with the alignment parameter specified.
This causes issues of alignment mismatch if we pass the machine address of the underlying bytes directly to a foreign function, as the bytes won't necessarily have the alignment that the foreign function expects it to have.
**How does this manifest as a bug?**
The current setup causes issues with double-dereferencing.
For example, consider this C function:
```[C]
int double_deref(const int **p) {
    return **p;
}
```
And the Rust that calls it:
```[rust]
extern "C" {
    fn double_deref(x: *const *const i32) -> i32;
}
fn main() {
    unsafe {
        let base: i32 = 42;
        let base_p: *const i32 = &base as *const i32;
        let base_pp: *const *const i32 = &base_p as *const *const i32;
        assert_eq!(double_deref(base_pp), 42); // seg fault!!
    }
}
```
Here the C program segfaults: the value obtained from the first dereference `*p` is not the real machine address of the data it is meant to point to.
In C, the second dereference in `**p` is therefore not a valid dereference, and it crashes.
##### `rustc` changes required
* Add getters for the address of the `bytes` field of an Allocation, so it can be accessed in Miri.
* Properly align the `bytes` field of an allocation when it is created or updated.
These changes are currently [in a PR](https://github.com/rust-lang/rust/pull/100467).
There is still some work to be done here, as seen in the discussion on the PR.
The code for manual alignment of the `bytes` is unsafe, and the current plan is to remove the unsafe code from `rustc` by following [Ralf's suggestion](https://github.com/rust-lang/rust/pull/100467#discussion_r950847697), which is:
* Parameterize the `Allocation` over the type of the `bytes`, defaulting to `Box<[u8]>`
* Define a trait such as `AllocationBytes` and constrain the type of `bytes` to this trait; then in this trait we include all the operations we'd like
* Use the normal `Box<[u8]>` implementation in `rustc`, and implement `AllocationBytes` with the manual alignment for the Miri `Machine`.
This way all the unsafe code for aligning the bytes is constrained to the Miri codebase.
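A rough sketch of what this parameterization could look like. The trait name `AllocationBytes` comes from the suggestion above, but its method set (`zeroed`, `base_addr`) and this cut-down `Allocation` struct are assumptions made for illustration; the real `rustc` type has many more fields.

```rust
use std::ops::{Deref, DerefMut};

/// Operations an `Allocation` needs from its byte storage.
/// The method set here is an assumption, not the final trait.
pub trait AllocationBytes: Sized + Deref<Target = [u8]> + DerefMut {
    /// Create `len` zeroed bytes; implementors may honour `align`.
    fn zeroed(len: usize, align: usize) -> Self;

    /// Machine address of the first byte.
    fn base_addr(&self) -> usize {
        let bytes: &[u8] = self;
        bytes.as_ptr() as usize
    }
}

/// The plain implementation `rustc` itself would keep using:
/// safe code, no alignment guarantee beyond that of `u8`.
impl AllocationBytes for Box<[u8]> {
    fn zeroed(len: usize, _align: usize) -> Self {
        vec![0u8; len].into_boxed_slice()
    }
}

/// Cut-down allocation, generic over its byte storage with a default.
pub struct Allocation<Bytes: AllocationBytes = Box<[u8]>> {
    bytes: Bytes,
    align: usize,
}

impl<Bytes: AllocationBytes> Allocation<Bytes> {
    pub fn new_zeroed(len: usize, align: usize) -> Self {
        Allocation { bytes: Bytes::zeroed(len, align), align }
    }

    pub fn bytes(&self) -> &[u8] {
        &self.bytes
    }

    pub fn bytes_mut(&mut self) -> &mut [u8] {
        &mut self.bytes
    }

    pub fn align(&self) -> usize {
        self.align
    }

    pub fn base_addr(&self) -> usize {
        self.bytes.base_addr()
    }
}
```

Because the type parameter has a default, existing code that names `Allocation` without arguments would keep using `Box<[u8]>`, while Miri could plug in its own aligned storage type by implementing the trait.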
The current code for manually aligning the bytes also has an issue that needs to be addressed:
With the current code, when the `bytes` field is deallocated the size is right, but the alignment in the layout passed to the deallocator may be smaller than the one used for allocation: for example, we might have allocated with alignment 4 and deallocate with alignment 2.
This is undefined behaviour, since it violates the memory fitting requirements of `dealloc`. We're going to work around this by adding a wrapper struct for the `Box<[u8]>` that stores the alignment it was allocated with and manually implements deallocation with the right alignment.
Specifically, we're going to use [this `AlignedSlice`](https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=f105a2ed162d41f2592d61a130a35de9) that @maurer designed.
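To make the deallocation requirement concrete, here is a minimal sketch of a buffer that remembers its `Layout` and frees itself with exactly that layout. It is not the linked `AlignedSlice`; `AlignedBytes` and its methods are hypothetical names, and it uses only `std::alloc`.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};
use std::ptr::NonNull;

/// Byte buffer with a caller-chosen alignment (hypothetical illustration type).
pub struct AlignedBytes {
    ptr: NonNull<u8>,
    /// Stored so that `drop` can pass the exact same layout to `dealloc`.
    layout: Layout,
}

impl AlignedBytes {
    /// Allocates `len` zeroed bytes aligned to `align`.
    /// Returns `None` for a zero length or an invalid alignment.
    pub fn zeroed(len: usize, align: usize) -> Option<Self> {
        let layout = Layout::from_size_align(len, align).ok()?;
        if layout.size() == 0 {
            // The global allocator must not be asked for zero bytes.
            return None;
        }
        // SAFETY: `layout` has a non-zero size.
        let raw = unsafe { alloc_zeroed(layout) };
        let ptr = NonNull::new(raw).unwrap_or_else(|| handle_alloc_error(layout));
        Some(AlignedBytes { ptr, layout })
    }

    pub fn addr(&self) -> usize {
        self.ptr.as_ptr() as usize
    }

    pub fn as_slice(&self) -> &[u8] {
        // SAFETY: `ptr` is valid for `layout.size()` initialized (zeroed) bytes.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.layout.size()) }
    }

    pub fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: as above, and `&mut self` guarantees unique access.
        unsafe { std::slice::from_raw_parts_mut(self.ptr.as_ptr(), self.layout.size()) }
    }
}

impl Drop for AlignedBytes {
    fn drop(&mut self) {
        // SAFETY: same pointer and same layout that were used to allocate.
        unsafe { dealloc(self.ptr.as_ptr(), self.layout) }
    }
}
```

Keeping the `Layout` next to the pointer is what avoids the mismatch described above: the alignment used on deallocation can never differ from the one used on allocation.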
##### Miri changes required
* Modify the `alloc_base_addr` function to use the actual address of the bytes of an Allocation instead of generating a fake address.
These changes are done and [in this PR](https://github.com/rust-lang/miri/pull/2498).
This PR will be updated with the new changes to Miri when we:
* Implement the `AllocationBytes` trait and remove the unsafe code from `rustc`
* Move this unsafe code into a custom type in Miri that implements `AllocationBytes`
* This custom type for aligned Allocations will use the `AlignedSlice` for proper deallocation
##### Miri tests that fail if executed in FFI mode
There are some tests in Miri’s test suite that fail if they’re executed using the real bytes of the `Allocation` as the address.
All of the failures are because of allocation alignment assumptions being violated, and we don’t think they correspond to bugs in our implementation.
_Note_: this was found when we were testing – the version of Miri we pushed only uses the real bytes for the address if we’re executing in the FFI mode, and so none of these tests are affected.
These are listed and explained in [this linked doc](https://hackmd.io/zM1YBMH9TTiVh6s7U8qOMg).
### Passing pointers to C memory back to Miri
In order to support foreign functions that return pointers to foreign memory, we need further modifications to the structure of Allocations.
As it stands, [the `alloc_id_from_addr` function in Miri](https://github.com/rust-lang/miri/blob/master/src/intptrcast.rs#L62) deals with retrieving the corresponding Allocation from a given address.
This lookup assumes that every valid address has an existing Allocation -- but of course, if a foreign function returns a pointer to foreign memory, its address will _not_ correspond to an existing Allocation.
So, we will need to create an Allocation for this external memory.
The current structure of Allocations is such that they own their bytes.
However, this won't work for bytes in foreign memory, which are not even owned by the Rust program executing.
We propose that instead of using `Box<[u8]>` for the type of the `bytes` field of an Allocation, we introduce a new enum type:
```[rust]
pub enum AllocBytes {
    /// Owned, boxed slice of [u8].
    Boxed(Box<[u8]>),
    /// Address, size of the type stored, and length of the allocation.
    /// This is used for representing pointers to bytes that belong to a
    /// foreign process (such as pointers into C memory, passed back to Rust
    /// through an FFI call).
    Addr(AddrAllocBytes),
}
```
This enum type will implement the `AllocationBytes` trait discussed above, and the Miri `Machine` will then use `AllocBytes` as the type it parameterizes the `Allocation` bytes with.
Here, `AllocBytes::Boxed` represents the `Box<[u8]>` with manual alignment.
The `AllocBytes::Addr` variant is used to represent Allocations corresponding to foreign memory.
For this we use another new data structure, `AddrAllocBytes`, which represents a section of memory, starting at a particular address, and of a specified length.
```[rust]
pub struct AddrAllocBytes {
    /// Address of the beginning of the bytes.
    pub addr: u64,
    /// Size of the type of the data being stored in these bytes.
    pub type_size: usize,
    /// Length of the bytes, in multiples of `type_size`;
    /// it's in a `RefCell` since it can change dynamically,
    /// depending on how it's used in the program. UNSAFE
    pub len: std::cell::RefCell<usize>,
}
```
With these changes, we can support foreign pointers in Miri by, in the case of foreign functions that return pointers, creating an allocation of this kind and adding it to memory.
Then, `alloc_id_from_addr` works as before, since now the foreign address _does_ have a corresponding Allocation.
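The effect on address lookup can be sketched with a toy address map. This is not Miri's `intptrcast` code: `AddressMap`, `register`, and this simplified `AllocId` are hypothetical, and the model only captures the idea that a foreign address becomes resolvable once an allocation has been registered for it.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for an allocation identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AllocId(pub u64);

/// Maps each allocation's base address to its id and size in bytes.
#[derive(Default)]
pub struct AddressMap {
    by_base: BTreeMap<u64, (AllocId, u64)>,
    next_id: u64,
}

impl AddressMap {
    /// Registers an allocation (Miri-owned or foreign) covering
    /// `base .. base + size` and returns its fresh id.
    pub fn register(&mut self, base: u64, size: u64) -> AllocId {
        let id = AllocId(self.next_id);
        self.next_id += 1;
        self.by_base.insert(base, (id, size));
        id
    }

    /// Finds the allocation containing `addr`, if any.
    pub fn alloc_id_from_addr(&self, addr: u64) -> Option<AllocId> {
        // The only candidate is the allocation with the closest base at or below `addr`.
        let (&base, &(id, size)) = self.by_base.range(..=addr).next_back()?;
        if addr - base < size {
            Some(id)
        } else {
            None
        }
    }
}
```

Before a foreign range is registered, looking up an address inside it returns `None`; after the FFI layer registers the range for a returned pointer, the same lookup succeeds.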
This part of the implementation is still a work in progress.
* [Miri branch with these ongoing changes](https://github.com/emarteca/miri/tree/c-pointers)
* [PR to `rustc` branch with these ongoing changes](https://github.com/rust-lang/rust/pull/100899)
Specifically, one thing that we will change is that in our current implementation, we've added a function `allocate_ptr_raw_addr` in the `rustc` `Memory` to allow Miri to create this new kind of Allocation, and store it in memory.
However, when we create the `AllocationBytes` trait and move the `AllocBytes` enum to Miri, this function will no longer be necessary.
##### Length of C pointers: current hack solution
The `AddrAllocBytes` `len` field represents the length of the Allocation.
It is a `RefCell` so that it can be modified over the lifetime of the Allocation.
Unfortunately, we can't know the exact size of the memory the C pointer refers to without some instrumentation or interception of the C code executing, which we are not currently doing.
The initial value chosen for `len` determines the size that Miri considers valid for the pointer.
We want to allow C to return (for example) arrays as pointers and for Rust to then access sequential elements in the array.
For example, we should be able to run the following program.
```[c]
// C code
#include <stdlib.h>

int* array_pointer_test() {
    const int COUNT = 3;
    int *arr = malloc(COUNT*sizeof(int));
    for(int i = 0; i < COUNT; ++i)
        arr[i] = i;
    return arr;
}
```
```[rust]
extern "C" {
    fn array_pointer_test() -> *mut i32;
}
// Return pointer to array of i32 from C,
// and read part of the array as a slice
fn main() {
    unsafe {
        let arr_ptr = array_pointer_test();
        let slice = std::slice::from_raw_parts(arr_ptr as *const i32, 3);
        assert_eq!(slice, [0, 1, 2]);
        assert_eq!(*arr_ptr, 0);
        assert_eq!(*arr_ptr.offset(1), 1);
    }
}
```
When we create an Allocation for `arr_ptr` in Miri, this needs to have a `len` large enough that the creation of a slice of length 3 and the access to `*arr_ptr.offset(1)` are not out of bounds.
Our current *hack solution* is to give every Allocation corresponding to a C pointer a `len` of 1000 (so in this case, we consider `arr_ptr` to point to an array of 1000 `i32` values, i.e. 4000 bytes).
The reasoning is that this should be large enough to cover the vast majority of pointers.
This doesn't actually allocate any memory, so it is not wasting space, it just means that if an access is actually out of bounds this error will not be caught.
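The consequence of the fixed length can be shown with a small bounds-check sketch. The struct mirrors the `AddrAllocBytes` fields shown earlier, but `ASSUMED_FOREIGN_LEN`, `for_foreign_ptr`, `size_in_bytes`, and `access_in_bounds` are hypothetical helpers written for this example.

```rust
use std::cell::RefCell;

/// Mirrors the fields of `AddrAllocBytes`; the methods below are illustrative only.
pub struct AddrAllocBytes {
    pub addr: u64,
    pub type_size: usize,
    pub len: RefCell<usize>,
}

/// The placeholder element count given to every foreign pointer.
pub const ASSUMED_FOREIGN_LEN: usize = 1000;

impl AddrAllocBytes {
    /// Describes foreign memory at `addr` holding elements of `type_size` bytes.
    pub fn for_foreign_ptr(addr: u64, type_size: usize) -> Self {
        AddrAllocBytes { addr, type_size, len: RefCell::new(ASSUMED_FOREIGN_LEN) }
    }

    /// Size that is treated as valid, in bytes.
    pub fn size_in_bytes(&self) -> usize {
        self.type_size * *self.len.borrow()
    }

    /// Whether an access of `access_size` bytes at `byte_offset` is accepted.
    pub fn access_in_bounds(&self, byte_offset: usize, access_size: usize) -> bool {
        match byte_offset.checked_add(access_size) {
            Some(end) => end <= self.size_in_bytes(),
            None => false,
        }
    }
}
```

For a pointer to `int` (4 bytes) the accepted range is 4000 bytes, so reading the three real elements is allowed, but so is reading element 999 of a 3-element array; only accesses beyond the placeholder range are rejected.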
<!---
### Distinguishing Miri memory from external (foreign) memory
In order to be able to efficiently distinguish between foreign memory and Miri memory (i.e., for a given location in memory, determine if it corresponds to C memory or Miri), we propose to reserve a large section of virtual memory for the Miri allocator. This will ensure that there is no overlap in the memory used by the Miri allocator, and external calls. Using this, we can tell if the memory is Miri internal merely by looking at its `AllocID`.
If the return of a C call is a Miri pointer (i.e., if it is contained in the memory range reserved by the Miri allocator), then we will need to handle this case: likely we'll need to create the corresponding `Allocation` object (if it doesn't already exist), and we'll have to track that this Miri memory is shared with C (as discussed in the next section).
If the return of a C call is a C pointer (i.e., if it is in the foreign range of memory), then we can represent it with placeholder `AllocID` that indicates that it is a foreign pointer.
-->
## Provenance
[This comment in the related GitHub issue](https://github.com/rust-lang/miri/issues/2365#issuecomment-1187651410) raises some important questions about provenance.
In particular:
* Pointers in Miri have provenance metadata.
If a C function returns a Miri pointer, this provenance data will have been stripped.
How do we restore the provenance?
* What about the provenance of pointers to C memory that C returns?
### One "C" provenance value for all memory explicitly exposed to C
We already need to create some provenance value for pointers to C memory.
One idea would be to have one provenance value for these, and then give this _same_ C provenance value to any pointers to Miri memory that are exposed to C (i.e., passed as arguments).
This would involve recursing through any exposed Miri pointers (basically, building a pointer reachability graph and giving all of these pointers the same provenance) -- this is the same idea as the `retag_fields` option in stacked borrows, which determines if retagging (modifying the provenance) should recurse into fields, but in this case it should always be true.
Of course, we can't recurse into C memory, since we don't know if there are any pointer fields in a pointer to C memory -- Miri will know that a pointer returned from C is a pointer into C memory, but not know the underlying structure of that memory.
We know that anything accessed through a pointer into C memory will be known to have the C provenance.
_Note: things might get complicated here if a C object stores a pointer into Miri memory -- but we will be able to tell that it is Miri memory by checking that there is a corresponding Miri AllocID for its address_
The C provenance value is a similar idea to the `Wildcard` provenance that already exists in Miri -- we would reuse this to tag all the memory exposed to or originating from C.
The advantage of this option is that we would not need to track the list of exposed addresses and re-sync the provenance of this entire list after every call to C: there is nothing to re-sync, since everything in the sync list already has the same C provenance.
This would be much more efficient than the more complex idea proposed below, and would be simpler to implement.
This idea would also not result in any loss in provenance data in memory not passed into the FFI.
Essentially, it would only affect the provenance data for Miri memory that is pointed to by a Miri pointer that is passed to C.
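The recursive exposure step can be sketched as a reachability traversal over a toy memory model. Nothing here is Miri's real data structure: `Memory`, `Provenance`, `add_alloc`, `expose_to_c`, and `provenance_of` are hypothetical, and allocations are reduced to ids plus the ids their stored pointers refer to.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Toy provenance: either an ordinary per-allocation tag or the shared C value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provenance {
    Tagged(u64),
    ExposedToC,
}

/// Toy memory: which allocations each allocation's pointers refer to,
/// and the provenance currently associated with each allocation.
#[derive(Default)]
pub struct Memory {
    points_to: HashMap<u32, Vec<u32>>,
    provenance: HashMap<u32, Provenance>,
}

impl Memory {
    /// Adds an allocation `id` whose stored pointers refer to `targets`.
    pub fn add_alloc(&mut self, id: u32, targets: &[u32]) {
        self.points_to.insert(id, targets.to_vec());
        self.provenance.insert(id, Provenance::Tagged(u64::from(id)));
    }

    pub fn provenance_of(&self, id: u32) -> Option<Provenance> {
        self.provenance.get(&id).copied()
    }

    /// Gives the shared C provenance to every allocation reachable
    /// from the pointers passed as arguments to a foreign call.
    pub fn expose_to_c(&mut self, args: &[u32]) {
        let mut seen: HashSet<u32> = HashSet::new();
        let mut queue: VecDeque<u32> = args.iter().copied().collect();
        while let Some(id) = queue.pop_front() {
            if !seen.insert(id) {
                continue; // already handled; also guards against pointer cycles
            }
            self.provenance.insert(id, Provenance::ExposedToC);
            if let Some(targets) = self.points_to.get(&id) {
                queue.extend(targets.iter().copied());
            }
        }
    }
}
```

In the `double_deref` example, exposing `base_pp` would also mark `base_p` and `base`, while an allocation that is not reachable from the arguments keeps its original tag.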
#### Design decision: only changing provenance of explicitly exposed memory
We have various options when it comes to what to do with the provenance of memory exposed to C.
1. Recognize that _all_ memory is exposed to C, and so use only the "C provenance" for all memory in a Rust program if the C FFI is used, regardless of whether or not it is explicitly exposed.
2. Make an assumption that C will only modify the Rust memory that is exposed to it, and use the "C provenance" for pointers to all of this memory, while leaving the provenance of the rest of the Rust memory untouched (**this is our proposed plan**).
3. Implement a strategy for more fine-grained reasoning about the provenance of the pointers exposed to C.
The first option would be the most conservative, and the most sound without implementing a strategy for being able to look at the specific effect of C code on values in memory.
However, it is also pretty useless -- this would mean all provenance in the entire program is lost with any use of the C FFI at all.
We propose that the second option is a better idea -- it allows the checks that make use of provenance to continue unchanged for the pointers _not_ explicitly exposed to C, and still allows reasoning about and tracking of the memory that is exposed.
The last solution is the best solution, as it is the most precise.
We have some ideas about how this might be implemented, but they are future work for now, and we propose solution 2 in the meantime.
#### Strict provenance?
We propose that the FFI support and strict provenance mode not be allowed to be used together in Miri.
#### Questions
* What level of provenance tracking is acceptable?
* Can we use the strategy of having one provenance value for anything exposed to C?
* What is the plan for Miri's support of non strict provenance mode in the future?
### Ideas for more fine-grained reasoning about provenance
We have a couple of potential ideas for implementing more fine-grained reasoning about provenance of pointers exposed to C.
#### Sync list: keeping track of C changes to Miri pointer provenance
We could add a list that tracks all the pointers and their provenance values.
This would be in the Miri evaluation context, separate from the `memory` itself.
Then, after every FFI call, we would iterate over the memory and compare the pointer values to their original values (in the newly added list).
With this, we would catch all changes to the Miri memory.
We would also be able to say what the changes to provenance should be, as we could identify when Miri pointers have been reassigned and what they have been reassigned to.
This solution still only has one "C provenance" value for all pointers to C memory (and this would be the provenance for Miri pointers reassigned to refer to C memory).
**Pros**
* All the changes would be constrained to Miri itself
**Cons**
* Inefficient in both time and space -- storing these extra lists and running all these extra checks would add a lot of overhead
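A minimal sketch of the compare-after-call step, assuming the sync list is a snapshot that maps the address of each pointer-sized slot to the pointer value stored there. `Snapshot`, `snapshot`, `Reassigned`, and `diff_after_call` are hypothetical names; the sketch only detects which slots changed and leaves the provenance update itself out.

```rust
use std::collections::BTreeMap;

/// Address of a pointer-sized slot -> pointer value stored in that slot.
pub type Snapshot = BTreeMap<u64, u64>;

/// Builds a snapshot from `(slot address, stored pointer)` pairs.
pub fn snapshot(slots: &[(u64, u64)]) -> Snapshot {
    slots.iter().copied().collect()
}

/// A slot whose stored pointer differs before and after the foreign call.
#[derive(Debug, PartialEq, Eq)]
pub struct Reassigned {
    pub slot: u64,
    pub old_target: u64,
    pub new_target: u64,
}

/// Compares the snapshot taken before an FFI call with memory afterwards
/// and reports every tracked slot that now points somewhere else.
pub fn diff_after_call(before: &Snapshot, after: &Snapshot) -> Vec<Reassigned> {
    before
        .iter()
        .filter_map(|(&slot, &old)| match after.get(&slot) {
            Some(&new) if new != old => Some(Reassigned {
                slot,
                old_target: old,
                new_target: new,
            }),
            _ => None,
        })
        .collect()
}
```

Each reported entry names the old and new target, which is the information needed to decide what the slot's provenance should become. The cost noted above is visible here too: every tracked slot is stored twice and visited after every call.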
#### ASAN: using a sanitizer to detect memory accesses and modifications
[AddressSanitizer (ASAN)](https://clang.llvm.org/docs/AddressSanitizer.html) is a sanitizer designed to find memory use errors, such as use-after-free bugs.
Of particular relevance for us, ASAN allows for specific memory to be exposed or "poisoned", such that access to poisoned memory would be considered a bug.
We could use ASAN to set up "guards" on the Miri memory that C doesn't have explicit access to, by only exposing the data it should be able to see and "poisoning" the rest.
This would mean that our assumption about the provenance of the non-directly-C-accessible Miri memory not changing in the presence of C calls would be verifiable -- and it would let us catch errors if C _does_ access this memory.
We might also be able to use it to detect if C _doesn't_ access Miri memory that it technically has access to, in which case that provenance information can also remain unmodified.
ASAN has [an interface for C/C++](https://github.com/llvm/llvm-project/blob/main/compiler-rt/include/sanitizer/asan_interface.h), and can be used with FFI, as long as the C/C++ code it is sanitizing is linked with ASAN when it is compiled.
At a high level, it seems like it should work out-of-the-box on the C code being called with the Rust C FFI.
This solution also still has "C provenance" for all memory modified by or originating from C.
**Pros**
* More efficient, and only requires the compilation of the linked C/C++ program with ASAN.
**Cons**
* Requires ASAN: now the changes are no longer constrained to Miri.
* It is not immediately clear how we could use it to get specific provenance values if C *does* access the memory, even if C is only reassigning Miri pointers to each other (in which case their provenance values should be swapped, for example).
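To make the "guard" idea concrete, here is a minimal sketch in Rust. It is a toy model only: it does not call ASAN, and `GuardMap`, `expose`, and `access_ok` are hypothetical names invented for illustration. It assumes that everything is poisoned by default and that only explicitly exposed ranges may be touched by foreign code.

```rust
use std::ops::Range;

/// Toy stand-in for a poisoning mechanism: every address is treated as
/// poisoned unless it falls inside a range that was explicitly exposed.
/// (Hypothetical type; not part of Miri or ASAN.)
#[derive(Default)]
struct GuardMap {
    exposed: Vec<Range<usize>>,
}

impl GuardMap {
    /// Mark `len` bytes starting at `addr` as accessible to foreign code.
    fn expose(&mut self, addr: usize, len: usize) {
        self.exposed.push(addr..addr + len);
    }

    /// Returns true if the whole access `[addr, addr + len)` lies inside
    /// a single exposed range, i.e. it would not touch poisoned memory.
    fn access_ok(&self, addr: usize, len: usize) -> bool {
        self.exposed
            .iter()
            .any(|r| r.start <= addr && addr + len <= r.end)
    }
}
```

In this model, exposing an 8-byte buffer at `0x1000` makes a 4-byte access at `0x1004` acceptable, while a 4-byte access at `0x1006` (which runs past the end) or any access at `0x2000` would be flagged. With real ASAN, the equivalent bookkeeping would be done by the sanitizer runtime through the interface linked above.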
## Current state of the project
In this last section we summarize the current state of the project, and the work that still needs to get done.
### What is working?
At this point, we can call C functions from Miri with the following argument and return types:
* Primitive (integer) or void/empty arguments and returns
* Arguments that are pointers to Miri memory
  * the C function can change the _value_ that the pointer points to
  * we _do not yet_ support C functions that write pointers to Miri memory
* Arguments and returns that are pointers to C memory
Here are some examples of function calls that we support:
```[c]
// C code
#include <stdio.h>
#include <stdlib.h>

void deref_and_print(int *p) {
    printf("deref in C has value: %d\n", *p);
}

long add_short_to_long(short x, long y) {
    return x + y;
}

int* pointer_test() {
    int *point = malloc(sizeof(int));
    *point = 1;
    return point;
}

int* array_pointer_test() {
    const int COUNT = 3;
    int *arr = malloc(COUNT * sizeof(int));
    for (int i = 0; i < COUNT; ++i)
        arr[i] = i;
    return arr;
}

// double dereference pointers, and swap what values they're pointing to
// note: this is only writing non-pointer values to memory
void swap_double_ptrs(short **x, short **y) {
    short temp = **x;
    **x = **y;
    **y = temp;
}

// write non-pointer values to memory represented by pointers
void set(short *x, short val) { *x = val; }
```
```[rust]
extern "C" {
    fn add_short_to_long(x: i16, y: i64) -> i64;
    fn pointer_test() -> *mut i32;
    fn deref_and_print(x: *mut i32);
    fn array_pointer_test() -> *mut i32;
    fn swap_double_ptrs(x: *mut *mut i16, y: *mut *mut i16);
    fn set(x: *mut i16, v: i16);
}

fn main() {
    unsafe {
        // test function that adds an i16 to an i64
        assert_eq!(add_short_to_long(-1i16, 123456789123i64), 123456789122i64);

        // test return pointer to i32 from C, dereference, modify in Rust,
        // and see changes in C
        let ptr = pointer_test();
        assert_eq!(*ptr, 1);
        *ptr = 5;
        assert_eq!(*ptr, 5);
        deref_and_print(ptr); // void function that prints: *ptr is 5 in C

        // test return pointer to array of i32 from C,
        // and read part of the array as a slice
        let arr_ptr = array_pointer_test();
        let slice = std::slice::from_raw_parts(arr_ptr as *const i32, 3u64 as usize);
        assert_eq!(slice, [0, 1, 2]);
        assert_eq!(*arr_ptr, 0);
        assert_eq!(*arr_ptr.offset(1), 1);
        // mutate the pointer and see it reflected in the slice
        *arr_ptr.offset(1) = 5;
        assert_eq!(slice, [0, 5, 2]);

        // test passing a Rust pointer to C and reassigning its value
        let mut set_base: i16 = 1;
        let mut set_base_p: *mut i16 = &mut set_base as *mut i16;
        set(set_base_p, 3);
        assert_eq!(set_base, 3);
        assert_eq!(*set_base_p, 3);

        // test passing two double pointers, and swapping the _values_ they point to
        // note: this is _not_ C writing pointers to Miri memory
        let mut new_base: i16 = 2;
        let mut new_base_p: *mut i16 = &mut new_base as *mut i16;
        let new_base_pp: *mut *mut i16 = &mut new_base_p as *mut *mut i16;
        let set_base_pp: *mut *mut i16 = &mut set_base_p as *mut *mut i16;
        assert_eq!(**set_base_pp, 3);
        assert_eq!(**new_base_pp, 2);
        assert_ne!(*new_base_pp, *set_base_pp);
        swap_double_ptrs(set_base_pp, new_base_pp);
        assert_ne!(*new_base_pp, *set_base_pp);
        assert_eq!(**set_base_pp, 2);
        assert_eq!(**new_base_pp, 3);
    }
}
```
The examples above are a subset of our full test suite, which can be found here:
* [Rust code](https://github.com/emarteca/miri/blob/c-pointers/tests/pass/extern-so/call_extern_c_fcts.rs)
* Corresponding [C code being called](https://github.com/emarteca/miri/blob/c-pointers/tests/extern-so/test.c)
### What is not quite working, and needs to get done?
Some things are not quite working yet.
In particular, C functions that write pointers to Miri memory cause Miri to abort with a UB error, reporting an invalid memory access when the rewritten pointer is later dereferenced to a value.
For example:
```[c]
// C code
void setptr(short **x, short *val) { *x = val; }
```
```[rust]
extern "C" {
    fn setptr(p: *mut *mut i16, x: *mut i16);
}

fn main() {
    unsafe {
        // setup: same pointers as in the working example above
        let mut set_base: i16 = 1;
        let mut set_base_p: *mut i16 = &mut set_base as *mut i16;
        let set_base_pp: *mut *mut i16 = &mut set_base_p as *mut *mut i16;

        // test passing a double pointer and a single pointer,
        // and reassigning the double pointer
        // to point to the single pointer
        let mut new_base: i16 = 2;
        let new_base_p: *mut i16 = &mut new_base as *mut i16;
        assert_ne!(new_base_p, set_base_p);
        setptr(set_base_pp, new_base_p);
        assert_eq!(new_base_p, set_base_p);

        // uh oh: the following code breaks
        // let rust_ddref = **set_base_pp;
        // let rust_dref = *set_base_p;
    }
}
```
This is because the Miri pointers exposed to C need to be updated with `Wildcard` provenance after calls to C, but this is not done yet.
In these cases, the expected provenance is now wrong, since the pointers have been reassigned in C.
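As a rough illustration of the missing step, the following sketch models the provenance update in isolation. It is not Miri code: `ToyProvenance`, `SlotProvenance`, and `downgrade_after_c_call` are hypothetical names, and the model assumes we track, per pointer-sized slot in Miri memory, the provenance of the pointer stored there, together with a list of `(base, len)` ranges that were exposed to C.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for Miri's provenance: either a
/// concrete allocation, or "could point into anything that was exposed".
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ToyProvenance {
    Concrete(u64), // id of the allocation the pointer is known to point into
    Wildcard,
}

/// Maps the address of each pointer-sized slot in (toy) Miri memory to the
/// provenance currently recorded for the pointer stored in that slot.
type SlotProvenance = HashMap<usize, ToyProvenance>;

/// After a call into C, any slot inside a range that C could write to may
/// now hold a different pointer, so its recorded provenance is downgraded.
fn downgrade_after_c_call(slots: &mut SlotProvenance, exposed: &[(usize, usize)]) {
    for (addr, prov) in slots.iter_mut() {
        let reachable = exposed
            .iter()
            .any(|&(base, len)| base <= *addr && *addr < base + len);
        if reachable {
            *prov = ToyProvenance::Wildcard;
        }
    }
}
```

In the `setptr` example, the slot holding `set_base_p` lies in memory exposed through `set_base_pp`, so under this model its provenance would become `Wildcard` after the call, and the later dereference would no longer be checked against the stale allocation.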
There are also some concrete work items that still need to get done:
* Replace the unsafe code in `rustc` with the `AllocationBytes` trait, and then propagate this change into Miri.
* Change the `Box<[u8]>` Allocation bytes representation to use the `AlignedSlice` when it needs to be manually aligned.
* Keep up with the existing PRs and make the changes/fixes requested.
* Properly set up propagation of C (`Wildcard`) provenance to the pointers to Miri memory that C gets access to -- this _should_ enable support for C functions that write pointers to Miri memory.
### Directions for future work
In addition to the concrete work items listed above, there are various interesting avenues for future work on this project.
* Incorporate ASAN to enable more fine-grained reasoning about the provenance of pointers once they are passed to C (as described above).
  This would allow us to:
  * Find bugs where C is accessing memory it should not have access to.
  * Confidently use the provenance information of pointers in Miri that C did not change (since then we will _know_ that C did not modify the memory).
* Intercept calls to `malloc` and `free` to provide Miri with information on the actual size and lifetime of C pointers.
  * We could use something more precise than "every C pointer has `len` 1000" in our Allocations.
* Run this on real code bases where C FFI is used, and see what bugs we can find!
* ... and other suggestions welcome :)
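For the `malloc`/`free` interception idea in the list above, the bookkeeping on the Miri side could look roughly like the sketch below. This is an assumption-laden illustration, not existing Miri code: `CAllocRegistry` and its methods are hypothetical names, and the interception mechanism itself (how `on_malloc` and `on_free` get called) is left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical registry of live C heap allocations, keyed by base address.
#[derive(Default)]
struct CAllocRegistry {
    live: BTreeMap<usize, usize>, // base address -> size in bytes
}

impl CAllocRegistry {
    /// Record an allocation reported by an intercepted `malloc`.
    fn on_malloc(&mut self, base: usize, size: usize) {
        self.live.insert(base, size);
    }

    /// Record an intercepted `free`. Returns false if `base` was not a
    /// live allocation (for example, a double free).
    fn on_free(&mut self, base: usize) -> bool {
        self.live.remove(&base).is_some()
    }

    /// Find the live allocation containing `addr`, as `(base, size)`.
    fn lookup(&self, addr: usize) -> Option<(usize, usize)> {
        self.live
            .range(..=addr) // all allocations starting at or below `addr`
            .next_back() // the closest one
            .filter(|&(&base, &size)| addr < base + size)
            .map(|(&base, &size)| (base, size))
    }
}
```

With such a registry, a pointer returned from C could be given the real size of its allocation instead of a fixed `len`, and an access after `on_free` would simply fail the `lookup`.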
<!---
#### Sync list: keeping track of C changes to Miri pointer provenance
If we want to be more precise with the provenance of Miri pointers exposed to C, we propose the following approach:
##### PNVI-ae-udi
PNVI-ae-udi is a provenance-aware memory model for C.
[This great blog post](https://gankra.github.io/blah/tower-of-weakenings/#strict-provenance-news) states that with PNVI-ae-udi the abstract machine stores a list of all the exposed addresses.
Then, when there is an `int2ptr` transmute, the machine traverses the list of exposed addresses to see if there are any matches.
If there is one match, then we can assign this pointer (the transmuted int) the provenance of the matching address.
If there is more than one match, then report an error -- this means that the same memory location has multiple provenances.
##### Miri pointers exposed to C
We're planning to basically implement PNVI-ae-udi.
We can `expose_addr` all the Miri pointers that get passed to C, and store this list in the machine.
This will be the list of all the pointers whose provenance needs to be re-synchronized after every call to C.
Let's call this list the **sync list**.
We plan to implement this as a `RangeMap` so we can access the `Allocation` from the pointer (and get the provenance metadata).
The `expose_addr` needs to be _recursive_: we need to track all the pointers that are exposed.
If (for example) a pointer to a struct or an array is exposed, then this also exposes pointers to all the fields/elements.
Any of these "sub-pointers" can be modified by C or returned individually.
When a Miri pointer is returned from a call to C, we can use the sync list to restore its provenance when we call `from_exposed_addr` (i.e., when we transmute it back to a pointer from an int).
###### Resyncing the provenance of everything in the sync list
After every call to C, we also need to re-sync the provenance of everything in the sync list.
This is because C may have modified these exposed pointers in a way that would modify the provenance.
For example:
* expose Miri pointers `ptr_a` and `ptr_b` to C
* this exposes `ptr_a.field_a` and `ptr_b.field_b` to C
* in C, reassign `ptr_a.field_a = ptr_b.field_b`
* now, the provenance should be updated to reflect that `ptr_a.field_a` now points into `ptr_b`
###### What do we do if a Miri pointer is returned from C that is not in the sync list?
Technically, there is no limit to the effect C could have on Miri data: since it is not an isolated process, to reason soundly about what external calls could have modified we would need to assume the entire Miri memory could be affected.
However, a conclusion of "anything could happen" would not be very useful.
We don't want to consider every single Miri address as exposed.
For now, we propose that if a pointer to Miri memory is returned from a C call that is _not_ in the list of exposed addresses, then we throw an error.
###### Potential for future optimizations
If a pointer is `const` or is passed to C as read-only, then we don't need to recurse through the entire structure of the exposed object.
This is also a potential opportunity for checking to see if C modified memory that it was not supposed to: if we track a list of read-only exposed addresses in the machine and check for data modification in this memory after a call to C.
For now though, we propose to have a single sync list of exposed addresses in the machine, and recurse through the entire structure when we're exposing sub-pointers.
##### C pointers passed to Miri
All pointers to C memory that are returned from C should also be tracked in our sync list.
We propose to just have one "C provenance" value for all pointers to C memory.
##### Removing addresses from the list of exposed addresses
We don't want the sync list of addresses tracked in the machine to grow indefinitely.
Figuring out when addresses should be removed from the sync list gets complicated.
For pointers to Miri memory in the sync list, when this memory is deallocated then we can remove their address from the sync list.
If an object is destroyed then (if there are no memory leaks) the memory of all of its child objects will also be deallocated -- so, this will also remove the child pointers from the sync list.
For pointers to C memory in the sync list, any deallocation won't necessarily be done through Miri.
Our current plan is to implement a reference counting system on the Miri side: if the reference count to a pointer to C memory goes to zero, then we will remove it from the sync list.
The reference counting itself can get complicated too: to do this precisely we'd need to account for copying of the C pointers in the Rust program, and then track the combined references to all of these aliases.
### Intercepting memory reads and dispatching on memory origin
Given that with this proposed approach we can distinguish whether a pointer corresponds to C memory or Miri memory based on the value of the corresponding `AllocID`, we can use the `memory_read` machine hook function described above as it takes the `AllocID` of the memory being read as an argument.
Here, we can dispatch any custom behaviour that we want to trigger on the reads of memory from each language.
Note that this `memory_read` hook function is read-only, so we can't use it to modify the internal state of the `Evaluator`.
Another machine hook function we might need to modify is `ptr_get_alloc`.
The signature for this function is (taken from the [rustc `Machine`](https://doc.rust-lang.org/nightly/nightly-rustc/src/rustc_const_eval/interpret/machine.rs.html#323)):
```[rust]
/// Convert a pointer with provenance into an allocation-offset pair
/// and extra provenance info.
///
/// The returned `AllocId` must be the same as `ptr.provenance.get_alloc_id()`.
///
/// When this fails, that means the pointer does not point to a live allocation.
fn ptr_get_alloc(
    ecx: &InterpCx<'mir, 'tcx, Self>,
    ptr: Pointer<Self::PointerTag>,
) -> Option<(AllocId, Size, Self::TagExtra)>;
```
Miri already has a custom implementation of this hook function ([here](https://github.com/emarteca/miri/blob/int-function-args-returns/src/machine.rs#L764)).
We may need to modify this hook function to customize getting the `AllocRef` of C external memory.
#### Synchronizing memory
With this proposed plan, since memory is directly shared between the languages and there is no reallocation on the Miri side, we do not think that we will need to manually synchronize memory (as the pointers will persist).
### Advice appreciated!
Any advice/ideas on this proposed plan would be appreciated.
In particular:
* How should we go about reserving a section of memory for the Miri allocator?
* Is it possible to use the same `AllocID` for all foreign pointers, and what are the pros/cons of this idea?
* Do the machine hooks we propose to use seem reasonable (any other hooks we should be using instead / in addition?)
* Are there any situations in which we will actually need to maintain two versions of some shared memory? (i.e., a Miri version and a C version)
* And, following this, if so, we need to design a plan for memory synchronization.
* How does the provenance tracking sound?
-->