Precise tracking of FFI events

# Precise tracking of Miri FFI events

This proposal aims to improve the information available to Miri about accesses to the machine's memory that happen once pointers to it cross the FFI boundary. As it stands, this proposal applies solely to FFI done via libffi on Unix systems, enabled with the `-Zmiri-native-lib` parameter.

The existing approach within Miri consists of invalidating more or less all data about the memory that is passed to foreign code: all provenances which can be derived from the pointer(s) sent across the boundary are exposed, and all accessible memory is assumed to have been initialised by the foreign code. The only way to make any stronger guarantees about what actually happened to the memory is to track the reads and writes done by the foreign code, which so far Miri does not attempt to do.

## Possible implementations

Three realistic options exist for improving the information we have, but a lot of the scaffolding required for each will be shared with the others. Namely, the machine's memory will need to be kept separate from that of Miri as a whole so that we can hook up some form of tracking to it. This is currently done with a small custom allocator, but it could be replaced by any allocator crate so long as we control the span from which it allocates. The options are:

### `sigaction`/`userfaultfd`/similar-based

This would be the simplest option, giving us page-grained information on the accesses done. In effect, for whatever the system page size is on the host machine, we would know whether at least one read, at least one write, or at least one of each was performed somewhere on that page.

#### Benefits:

- By far the simplest of all options in terms of code size
- We can still use this to avoid exposing unnecessary provenances in many cases, as it's unlikely that a single page will contain a lot of pointers
- Easily portable to any Unix system

#### Drawbacks:

- We *only* get page-grained info...

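
To make "page-grained" concrete, here is a minimal sketch of the bookkeeping a fault handler could feed under this option. It is only a model: `PageAccess`, `PageLog`, and `record_fault` are hypothetical names that do not exist in Miri, and the sketch assumes the handler learns nothing beyond the faulting address and whether the access was a write.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// What we know about a single page once the foreign call has returned.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct PageAccess {
    pub read: bool,
    pub written: bool,
}

/// Hypothetical per-call log, keyed by page index (address / page size).
#[derive(Debug)]
pub struct PageLog {
    page_size: u64,
    pages: BTreeMap<u64, PageAccess>,
}

impl PageLog {
    pub fn new(page_size: u64) -> Self {
        assert!(page_size.is_power_of_two(), "page sizes are powers of two");
        Self { page_size, pages: BTreeMap::new() }
    }

    /// Record one fault. Assumption: the handler only sees the faulting
    /// address and the kind of access, not its size.
    pub fn record_fault(&mut self, addr: u64, is_write: bool) {
        let entry = self.pages.entry(addr / self.page_size).or_default();
        if is_write {
            entry.written = true;
        } else {
            entry.read = true;
        }
    }

    /// Byte ranges that must conservatively be treated as (possibly) read.
    pub fn possibly_read(&self) -> Vec<Range<u64>> {
        self.ranges_where(|access| access.read)
    }

    /// Byte ranges that must conservatively be treated as (possibly) written.
    pub fn possibly_written(&self) -> Vec<Range<u64>> {
        self.ranges_where(|access| access.written)
    }

    /// Whole-page ranges, in ascending order, for pages matching `pred`.
    fn ranges_where(&self, pred: impl Fn(&PageAccess) -> bool) -> Vec<Range<u64>> {
        self.pages
            .iter()
            .filter(|(_, access)| pred(*access))
            .map(|(&page, _)| {
                let start = page * self.page_size;
                start..start + self.page_size
            })
            .collect()
    }
}
```

Note that a single one-byte write marks the entire page as possibly written, which is exactly the imprecision listed under the drawbacks.
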
### `ptrace` + a disassembler

This consists of making use of debug tooling to intercept each read and write that happens once FFI has been initiated, and then reporting these back at the end. Effectively, we cause the foreign code to segfault every time it accesses our memory, intercept the signal without letting the process die, and disassemble the instruction which caused the segfault to determine the size and type of access.

#### Benefits:

- We get very precise (perfect, with some effort) information on every single access done
- As long as we have an appropriate disassembler for the host architecture, this should work on the vast majority of systems (it is quite rare for `ptrace` to be disabled)
- Can be easily extended to support other functionality, like intercepting `malloc`/`mmap` calls if we ever want to allow Miri to receive pointers from FFI
- With more difficulty, could also be extended to prevent the foreign code from causing UB in Miri itself via access through an unlucky dangling pointer
- *Vague gestures in the direction of letting foreign code call Rust closures / functions*

#### Drawbacks:

- Miri needs to operate as 2 processes, since generally only parent processes can trace children
- We will need to either abstract over disassembler internals to get the needed info, or pull in a relatively bulky cross-platform one (e.g. [`llvm_sys`](https://docs.rs/llvm-sys/latest/llvm_sys/))
- IPC is needed to communicate the events that the parent detected back to the child process
- Performance during FFI will suffer somewhat due to the constant stop-and-start

### `FUSE`-based

We could guarantee perfect information on reads and writes regardless of architecture if we accept requiring systems to support `FUSE` in order to use the better FFI. This would consist of registering (entirely in memory) a "filesystem" which transparently maps reads and writes to someplace in memory, and then `mmap`ing "files" from it to act as our machine's memory.

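
As a rough illustration of what the "filesystem" would have to do, the sketch below models a single in-memory "file" whose read and write handlers log every access before touching the backing bytes. It deliberately uses no real `FUSE` bindings: `TrackedFile`, `Access`, and the handler signatures are assumptions made for this example, and a real implementation would sit behind whatever callbacks the chosen `FUSE` crate exposes.

```rust
use std::ops::Range;

/// One logged access, as a byte range within the "file".
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Access {
    Read(Range<u64>),
    Write(Range<u64>),
}

/// Hypothetical backing store for one in-memory "file".
pub struct TrackedFile {
    data: Vec<u8>,
    log: Vec<Access>,
}

impl TrackedFile {
    pub fn new(len: usize) -> Self {
        Self { data: vec![0; len], log: Vec::new() }
    }

    /// Bounds-check `offset..offset + len` against the backing buffer.
    fn checked_range(&self, offset: u64, len: usize) -> Option<Range<usize>> {
        let end = offset.checked_add(len as u64)?;
        if end > self.data.len() as u64 {
            return None;
        }
        // `end` fits in the buffer, so both casts are lossless.
        Some(offset as usize..end as usize)
    }

    /// Stand-in for the filesystem's read handler: log, then serve the bytes.
    pub fn read(&mut self, offset: u64, len: usize) -> Option<&[u8]> {
        let range = self.checked_range(offset, len)?;
        self.log.push(Access::Read(offset..offset + len as u64));
        Some(&self.data[range])
    }

    /// Stand-in for the filesystem's write handler: store the bytes, then log.
    pub fn write(&mut self, offset: u64, bytes: &[u8]) -> Option<()> {
        let range = self.checked_range(offset, bytes.len())?;
        self.data[range].copy_from_slice(bytes);
        self.log.push(Access::Write(offset..offset + bytes.len() as u64));
        Some(())
    }

    /// Every access seen so far, in order.
    pub fn log(&self) -> &[Access] {
        &self.log
    }
}
```

Out-of-bounds requests return `None` and are not logged; how such requests should be reported is a design choice this sketch leaves open.
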
#### Benefits:

- We cleanly get to have code execute at every access, and get told whether it's a read or a write and its exact size
- Completely arch-independent
- Can compose with the `ptrace` approach and remove the need for a disassembler

#### Drawbacks:

- Requires systems to have the `FUSE` kernel module loaded, which is not as ubiquitous as simply allowing `ptrace`
- Requires us to put in logic to actually implement an entire (very simple) filesystem, which might be a good amount of code
- Will also need IPC to communicate with Miri

### Honourable mentions

These are ideas that were considered but, in my opinion, either do not present any significant improvement over the above while being more complex, or have some kind of dealbreaker constraint.

#### Hooking into Valgrind

This would entail attempting to run the C code under Valgrind or some other freestanding profiling tool. It was discarded because it would require running *all* of Miri inside of Valgrind (likely at a significant perf cost, *always*), as there's no way to run only certain bits of code or libraries with it. Plus, we'd still need a 2-process architecture doing it this way.

#### Using an instruction interpreter

No significant advantage over the `ptrace` + disassembler idea, but it was floated in the past as a suggestion on this topic. It most likely also requires a lot more arch-specific code.

#### Hooking into a sanitiser e.g. `ubsan`

Would be basically perfect, except that it requires access to the source of the library, and recompiling it may modify its actual semantics (e.g. if the C code has UB, or if it was at all miscompiled); we need to preserve its real semantics, as that's what non-Miri Rust code will interact with.

## Detailed structure

See my forks of [`rustc`](https://github.com/nia-e/rust/tree/exposed-ffi) and [`miri`](https://github.com/nia-e/miri/tree/ffi-ptrace) for code.

So far, the code that has been written has gone for the `ptrace` option, using a somewhat lightweight but x86-only disassembler ([`iced_x86`](https://docs.rs/iced-x86/latest/iced_x86/)). It also already intercepts calls to `malloc` and `mmap` in order to trace allocations made in foreign code.

The structure of an FFI call before this proposal was as follows:

- `call_native_fn()` is executed, which itself calls `prepare_exposed_for_native_call()` right before the FFI call. This sets all uninitialised memory to zeroes and marks it as initialised, and exposes all provenances that can be derived from the arguments passed
- `libffi::ffi::call()` is executed
- And that's about it

Now, instead, and in a lot of detail, it looks like:

- `call_native_fn()` is executed and `shims::trace::Supervisor::init()` is called; if this is the first FFI call, it will fork the process and monitor events (otherwise it's a no-op). Miri continues as the child
- If the previous `init()` did not report an error, `prepare_exposed_for_native_call()` is *not* called (but it is in case of error, as a fallback to the old behaviour; if this happens, all of this is disregarded and things go on as before)
- Immediately before `libffi::ffi::call()`, we execute `shims::trace::Supervisor::start_ffi()` which:
  - Calls `mprotect` on every page of memory owned by the `MiriMachine`, setting it to `PROT_NONE` (cannot be read or written)
  - Sends an IPC message to the supervisor process asking it to attach to us, and halts Miri until this happens
- Upon attaching, the supervisor does a few more things to the Miri process:
  - The first byte of the `malloc`, `realloc`, and `free` functions is replaced with the architecture-specific breakpoint instruction (on x86, `0xCC`)
  - Syscalls are also set as breakpoints (via an actual `ptrace` option, this time)
- `libffi::ffi::call()` is called in the child
- The supervisor loops while waiting for an event to happen in the child. Namely:
  - On `SIGSEGV`, the supervisor process checks whether the segfault was triggered by an access to the `MiriMachine`'s memory. If yes:
    - We `longjmp` the child process by hand into a function labelled `mempr_off()`, which sets the page that triggered the segfault to RW
    - Instead of returning, this function raises a `SIGSTOP` at the end, which we wait for
    - We `longjmp` the child back to where the segfault happened and restore all registers (this is thankfully done in an arch-independent way by `ptrace`)
    - The child process steps 1 instruction, then pauses
    - The supervisor reads the bytes between where the instruction pointer was pre-step and where it is now, and feeds them into `iced_x86` for disassembly
      - This tells us the size and kind of access performed
      - If we commit to having arch-specific code, we could also feed it info about register states to get exact info on the size and kind of access performed; right now we conservatively overestimate (e.g. `iced_x86`'s `OpAccess::CondRead` becomes a yes-for-sure read)
    - We again `longjmp`, this time into `mempr_on()`, which acts similarly but reapplies the page protection
    - When this is over, we let the code continue as before
  - On `SIGTRAP`, we check if we are at the known address of `malloc`, `realloc`, or `free`. If so:
    - We grab the relevant info from registers, e.g. the size of the `malloc` call; this is actually quite easy to port, since all of these functions only have nice pointer-sized arguments
    - The original contents of the function that were overwritten with the breakpoint instruction are restored
    - We read the return address from the stack and write a breakpoint *there*, so that we get another `SIGTRAP` when `malloc`/etc. returns
    - We wait for the return, then grab the return value from registers again (e.g. in the case of `free`, whether it was successful); if everything went okay, we save the new pointer to our known list on `malloc`, delete it on `free`, or do both on `realloc`
    - Restore everything as it was and continue
  - If we get the unique `PtraceSyscall` event, we check if it was an `mmap`/`munmap` and perform similar logic to the `SIGTRAP` block to get the addresses and return values (though with no need for writing instructions)
  - If at any point the child process dies or e.g. performs a real segfault, we print some debug info about it and kill it
- Once the FFI call returns, the child process immediately calls `shims::trace::Supervisor::end_ffi()`, which undoes everything that `start_ffi()` did and causes the parent process to detach and send over IPC all the accesses and allocations it intercepted. The parent will still hang around in case e.g. we enter a new FFI call and need it to attach again
- After returning all the way back into `call_native_fn()`, and right before it returns, Miri calls `shims::trace::Supervisor::get_events()`, which grabs the obtained information from the IPC channel. This consists of `reads: Vec<Range<u64>>` (`addr..addr+len`), `writes: Vec<Range<u64>>` (same), and `mappings: Vec<Range<u64>>` (same), which it then uses to update Miri's internal state tracking. Currently, the mappings are ignored but the read and write accesses are fully integrated (modulo bugs).
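
For illustration, the shape of the data returned by `get_events()` could look roughly like the sketch below. The three field names and types come from the description above; the struct name `FfiEvents` and the `coalesce` helper are hypothetical, and whether the actual code merges ranges before applying them is an assumption made here only to show one way the ranges might be normalised.

```rust
use std::ops::Range;

/// Hypothetical container for what the supervisor sends back over IPC.
/// Each range is `addr..addr + len`.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct FfiEvents {
    pub reads: Vec<Range<u64>>,
    pub writes: Vec<Range<u64>>,
    pub mappings: Vec<Range<u64>>,
}

/// Sort ranges by start address and merge any that overlap or touch, so
/// that each byte is reported at most once. Empty ranges are dropped.
pub fn coalesce(mut ranges: Vec<Range<u64>>) -> Vec<Range<u64>> {
    ranges.retain(|r| r.start < r.end);
    ranges.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            if r.start <= last.end {
                // Overlapping or adjacent: extend the previous range.
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}

impl FfiEvents {
    /// Normalise reads and writes; mappings are passed through untouched,
    /// mirroring the fact that they are currently ignored.
    pub fn normalised(self) -> Self {
        Self {
            reads: coalesce(self.reads),
            writes: coalesce(self.writes),
            mappings: self.mappings,
        }
    }
}
```

Merging is harmless for this purpose because only the set of touched bytes matters for marking memory as read or initialised, not the number or order of individual accesses.
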