Rust Patterns from Hardening a Compiler Driver

# Lessons learned from tinkering with a Compiler Driver ## Part 01: Shake your foundations (Ayy-ayy-oh!) So, here's the context: [stable-mir-json](https://github.com/runtimeverification/stable-mir-json) is a rustc driver that serializes Stable MIR to JSON. It wraps the compiler, which means the thing it depends on changes constantly, panics on inputs we can't predict, and has internal details (thread pools, global state, platform-specific paths) that bleed through our abstractions in inconvenient ways. This document covers the patterns that fell out of a [PR-140: Shake your foundation](https://github.com/runtimeverification/stable-mir-json/pull/142): catching panics from code we don't own, eliminating hardcoded paths that rot between nightly bumps, and generally making the tool less fragile against its own dependency. None of these are novel; what's useful is seeing where the "obvious" approach went wrong first. ## 1. Catching Panics from Foreign Code You Can't Fix ### The situation We have a type visitor (`TyCollector`) that walks every reachable type in a crate and asks rustc for its memory layout. For most types this works fine. But some types (certain `dyn Trait` constructions, for instance) cause rustc's layout computation to outright panic. Not return an error; *panic*. We don't control that code, we can't fix it, and we can't predict which types trigger it. But we process hundreds of types per crate, and one bad layout shouldn't take down the whole run. ### The approach The original code was straightforward: ```rust let maybe_layout_shape = ty.layout().ok().map(|layout| layout.shape()); ``` This appeared at three sites in the `Visitor` impl. When rustc panicked, the process died. So, what do you do when a library panics and you can't stop it from panicking? You wrap it in `catch_unwind`: ```rust fn try_layout_shape( ty: &stable_mir::ty::Ty, ) -> Result<Option<stable_mir::abi::LayoutShape>, String> { let result = catch_unwind(AssertUnwindSafe(|| ty.layout().ok().map(|l| l.shape()))); match result { Ok(shape) => Ok(shape), Err(payload) => { let message = if let Some(s) = payload.downcast_ref::<&str>() { (*s).to_string() } else if let Some(s) = payload.downcast_ref::<String>() { s.clone() } else { "(non-string panic payload)".to_string() }; Err(message) } } } ``` A few things to unpack in the above snippet: **`AssertUnwindSafe`.** A natural question: why do we need this wrapper? `catch_unwind` requires its closure to be `UnwindSafe`, a trait that exists to prevent you from observing partially-mutated state after a panic. Our `ty` reference borrows from compiler internals that might be in an inconsistent state if the call panics. So the compiler (correctly!) refuses to let us use it across an unwind boundary. `AssertUnwindSafe` is us saying "I know, and I promise I won't touch `ty` if this panics." That promise is easy to keep here: the `Err` branch only uses the panic payload, never `ty`. If you find yourself reaching for `AssertUnwindSafe`, make sure you can actually hold up your end of the bargain. **The return type tells the whole story.** `Result<Option<LayoutShape>, String>` encodes three distinct outcomes: `Ok(Some(shape))` means layout succeeded; `Ok(None)` means layout returned a normal error (rustc said "I can't compute this," which is fine); `Err(message)` means rustc panicked. The caller can distinguish "this type legitimately has no layout" from "computing the layout crashed the compiler." That distinction matters for diagnostics. **The downcast chain.** `catch_unwind` gives you a `Box<dyn Any + Send>` (see [docs](https://doc.rust-lang.org/nightly/core/any/trait.Any.html#impl-dyn+Any+%2B+Send)), which is delightfully unhelpful. The actual payload could be a `&str` (from `panic!("literal")`), a `String` (from `panic!("{}", formatted)`), or honestly anything else. The nested `downcast_ref` checks handle the common cases. You'll write this exact chain every time you `catch_unwind` code you don't control; it's boilerplate but unavoidable. ### The misstep: swapping the global panic hook Here's where it gets interesting. The first version of this fix also tried to suppress the default panic hook's backtrace output, because `catch_unwind` still lets the hook fire before catching the unwind: ```rust // First attempt (38c9ae4): suppress panic output per-call let prev_hook = take_hook(); set_hook(Box::new(|_| {})); let result = catch_unwind(AssertUnwindSafe(|| ty.layout().ok().map(|l| l.shape()))); set_hook(prev_hook); ``` Looks clean, right? Save the old hook, install a no-op, do your thing, restore. Turns out this is wrong in three separate ways, and they're all the same root cause: `set_hook` and `take_hook` operate on a single *process-global* [panic hook](https://doc.rust-lang.org/nightly/std/panic/fn.set_hook.html) (an `AtomicPtr` under the hood). First: **thread safety.** rustc uses threads internally for parallel query evaluation. If a panic fires on some other thread during the window between our `take_hook` and the restoring `set_hook`, that panic hits our no-op hook and is silently swallowed. We've turned a diagnosable crash into silent data corruption. The window is small, but "small race window" is not "no race." Second: **re-entrancy.** If the code inside our `catch_unwind` triggers something that itself panics-and-recovers (and rustc does this; query cycle detection works this way), our no-op hook suppresses *their* diagnostic too. We're not just suppressing our own caught panics; we're suppressing all panics process-wide for the duration of the call. Third: **there is no scoped alternative.** Unlike `thread_local!` state, there's no way to say "suppress the panic hook for just this call stack" or "just this thread." The API is inherently global. It's designed for one-time setup at program start (installing a custom crash reporter, say), not for per-call suppression around individual operations. The fix (commit [`4067da0`](https://github.com/runtimeverification/stable-mir-json/pull/142/changes/4067da0b76a0c4d1cef3573e4c5c1706d340a074)) was to just drop the hook manipulation entirely. The default hook prints backtraces to stderr, which is noisy but harmless. Our tool collects the panic messages via the `catch_unwind` payload anyway and prints its own summary at the end. The stderr backtraces are redundant noise, not incorrect output. The lesson is worth stating plainly: when choosing between "slightly noisy but correct" and "clean but subtly racy," take the noisy option. Suppressing output from code you don't control is cosmetic; races on global state are correctness. If the noise genuinely matters (it confuses users, say), the right fix is `RUST_BACKTRACE=0` or post-processing stderr, not swapping a global hook around individual calls. ### The principle `catch_unwind` is appropriate when you're calling foreign or library code that can panic on inputs you can't predict, you need to keep processing after the panic, and you can genuinely throw away the panicking operation's state without corrupting anything downstream. It is *not* a general error handling mechanism. Rust panics are not exceptions; `catch_unwind` is not try/catch. Reach for it at FFI-like boundaries where you need isolation, not for control flow. Note: `catch_unwind` only catches panics that unwind. If the panic hook calls `abort()`, or if the code was compiled with `panic = abort`, `catch_unwind` does nothing. In our case rustc is compiled with `panic = unwind`, so this works. Verify that assumption for your own dependencies. ## 2. Collecting Errors During Traversal, Reporting After ### The situation The type visitor processes hundreds or thousands of types. Some will fail (in our case, by panicking during layout computation). We don't want to stop on the first failure, but we also don't want to log errors inline and move on; that produces interleaved noise mixed in with normal output that's hard to parse. We need a way to say "note this failure, keep going, and tell me everything that went wrong at the end." ### The approach The `TyCollector` struct gains a `layout_panics` field to accumulate failures: ```rust pub(super) struct TyCollector<'tcx> { tcx: TyCtxt<'tcx>, pub types: TyMap, pub layout_panics: Vec<LayoutPanic>, resolved: HashSet<stable_mir::ty::Ty>, } ``` Each call site uses a method that records rather than propagates: ```rust fn layout_shape_or_record( &mut self, ty: &stable_mir::ty::Ty, ) -> Option<stable_mir::abi::LayoutShape> { match try_layout_shape(ty) { Ok(shape) => shape, Err(message) => { self.layout_panics.push(LayoutPanic { ty: *ty, message }); None } } } ``` Then the caller (in `collect.rs`) checks the accumulator after traversal is complete: ```rust if !ty_visitor.layout_panics.is_empty() { eprintln!( "warning: {} type layout(s) could not be computed (rustc panicked):", ty_visitor.layout_panics.len() ); for panic in &ty_visitor.layout_panics { eprintln!(" type {:?}: {}", panic.ty, panic.message); } } ``` ### The principle This is "accumulate then report," and it sits between two more familiar patterns. "Fail fast" (`Result` with `?`) stops at the first problem, which is too aggressive when failures are independent and non-fatal. "Log and continue" (printing inside the visitor) produces coherent output only by accident; when you're processing a thousand types, the three panic messages get lost in the noise. The accumulator gives you a clean separation of concerns: `layout_shape_or_record` doesn't need to decide how to format or where to print. It records the fact; someone else decides what to do with it. The caller gets a single coherent summary with a count and per-type details, and it's free to change its mind later (bail on any panics, filter to certain types, whatever) without touching the visitor. But this only works when failures are truly non-fatal. If a layout panic means downstream data is *wrong* (not just missing), you need `Result` propagation, full stop. Here, `None` layout is a clean "don't know" that serialization handles gracefully, so accumulation is appropriate. Don't reach for this pattern when the right answer is to stop. ## 3. Eliminating Hardcoded Paths by Querying the Toolchain ### The situation stable-mir-json depends on a specific Rust nightly. The nightly version is the source of truth; it lives in `rust-toolchain.toml`. But the same information (or things derived from it) was also needed in three other places: a helper binary that sets up `LD_LIBRARY_PATH`, a CI workflow that installs the toolchain, and a test script that checks out the right rustc commit. When any of these got out of sync, things broke in ways that were hard to diagnose because the failure message had nothing to do with a stale nightly string. ### The approach The original `record_ld_library_path` function had the nightly date and *architecture* hardcoded: ```rust let rust_toolchain_path = format!( "{}/.rustup/toolchains/nightly-2024-11-29-aarch64-apple-darwin/lib", rustup_home ); ``` This breaks if you change the nightly. It breaks if you're on x86. It breaks if rustup stores toolchains somewhere non-default. It's brittle in three independent dimensions. The fix is to just ask the compiler where it lives: ```rust fn rustc_lib_path() -> Result<PathBuf> { let output = std::process::Command::new("rustc") .args(["--print", "sysroot"]) .output()?; if !output.status.success() { bail!("rustc --print sysroot failed"); } let sysroot = String::from_utf8(output.stdout)?.trim().to_string(); Ok(PathBuf::from(sysroot).join("lib")) } ``` The same principle rippled through the PR. CI had a hardcoded nightly: ```yaml # Before: hardcoded, manually synced toolchain: nightly-2024-11-29 # After: derived from rust-toolchain.toml CHANNEL=$(yq -r '.toolchain.channel' rust-toolchain.toml) ``` And the test script had a manually-maintained `rustc-commit` in a TOML metadata section: ```toml # Before: manually looked up and pasted [metadata] rustc-commit = "a2545fd6fc66b4323f555223a860c451885d1d2b" ``` ```bash # After: derived from the running compiler RUSTC_COMMIT=$(rustc -vV | grep 'commit-hash' | cut -d' ' -f2) ``` The `[metadata]` section was removed from `rust-toolchain.toml` entirely. It was a manually-maintained cache of a derivable value; removing it removes the obligation to keep it in sync. ### The principle Every piece of derived information that you write down by hand is a sync obligation you're signing up for. The nightly date determines the commit hash; the commit hash determines which source to check out; the nightly determines the sysroot path. These are all functions of the same input, and encoding their outputs as separate constants means every nightly bump requires updating N files in lockstep. You will eventually forget one. The fix: compute derived values from the source of truth at the point of use. `rustc --print sysroot` gives you the lib path. `rustc -vV` gives you the commit hash. `yq .toolchain.channel rust-toolchain.toml` gives you the nightly. One file changes; everything else reacts. The tradeoff is real: you now depend on `rustc` being available and correct when these values are needed. For a tool that *is* a rustc driver, that's a safe bet. For a tool that might need to run without a compiler installed, you'd want a hardcoded fallback. Know your deployment context. ## 4. Deduplicating Platform-Conditional Code with `cfg` on Bindings ### The situation macOS uses `DYLD_LIBRARY_PATH`; Linux uses `LD_LIBRARY_PATH`. The entire logic for reading the path, falling back to a default, creating the file, and writing the result is identical on both platforms. The only thing that differs is a single string. ### The approach The original code had two complete `#[cfg]` blocks, each ~20 lines, that differed by exactly one string literal. The file-creation logic, the error handling, the fallback path construction: all copy-pasted. It looked like a lot of code, but it was really one function written twice. The fix puts the `#[cfg]` on the binding, not the block: ```rust fn record_ld_library_path(smir_json_dir: &Path) -> Result<PathBuf> { let lib_path = { #[cfg(target_os = "macos")] let env_var = "DYLD_LIBRARY_PATH"; #[cfg(not(target_os = "macos"))] let env_var = "LD_LIBRARY_PATH"; if let Some(paths) = env::var_os(env_var) { paths .to_str() .ok_or_else(|| anyhow::anyhow!("Couldn't cast {} to str", env_var))? .to_string() } else { let toolchain_lib = rustc_lib_path()?; format!("{}:/usr/local/lib:/usr/lib", toolchain_lib.display()) } }; let mut ld_library_file = std::fs::File::create(smir_json_dir.join("ld_library_path"))?; writeln!(ld_library_file, "{lib_path}")?; Ok(lib_path.into()) } ``` The `#[cfg]` attributes sit on the `let env_var = ...` bindings. Only the varying value is conditional; the 15 lines of logic that use `env_var` are written once. ### The principle `#[cfg]` in Rust can be applied at many granularities: items, expressions, statements, and (as here) individual let bindings. When platform-specific code differs by a single value, put the `#[cfg]` on just that value's binding rather than duplicating the surrounding logic. Same principle as "parameterize, don't duplicate," but using conditional compilation instead of a function parameter, so there's zero runtime cost. This only works when the platform difference is a *value*, not a *type* or *control flow*. If the macOS path needs a fundamentally different sequence of operations (not just a different string), you're back to `#[cfg]` on blocks. But check first whether the difference is really structural or just a different constant. In my experience, it's usually a different constant. Note: both `let` bindings must have the same type. Here they're both `&str`, so the compiler sees a single `env_var: &str` regardless of platform. If the types differed, you'd need a wrapper or `Cow`. ## 5. Single Source of Truth for Build Configuration ### The situation We pin a specific nightly in `rust-toolchain.toml`. CI needs that same nightly. Test scripts need the rustc commit that backs it. The helper binary needs the sysroot path. One authoritative value, multiple consumers. ### The approach Before this PR, the nightly version appeared in three places: | Location | Value | Sync mechanism | |----------|-------|----------------| | `rust-toolchain.toml` `[toolchain].channel` | `nightly-2024-11-29` | Source of truth | | `.github/workflows/test.yml` | `nightly-2024-11-29` (hardcoded) | Manual copy | | `rust-toolchain.toml` `[metadata].rustc-commit` | `a2545fd...` (hardcoded) | Manual lookup + copy | After the PR, the derived values are computed at the point of use: | Location | Derives from | How | |----------|-------------|-----| | CI toolchain step | `rust-toolchain.toml` | `yq -r '.toolchain.channel' rust-toolchain.toml` | | Test script rustc commit | Active compiler | `rustc -vV \| grep commit-hash` | | Helper binary lib path | Active compiler | `rustc --print sysroot` | The `[metadata]` section was removed entirely. It was a manually-maintained cache of a derivable value. ### The principle Every duplicated constant is a future bug. The cost isn't the initial copy; it's the maintenance: someone bumps the nightly in `rust-toolchain.toml` and forgets to update CI, or updates CI but forgets the metadata commit hash. These failures are silent until the test suite breaks on a PR that "only changed the nightly." (Ask me how I know.) The principle from build system design: **derive, don't declare.** If a value can be computed from something that's already authoritative, compute it. Yes, you pay a cost (shelling out to `rustc -vV`, parsing TOML with `yq`), but you pay it once at build time, and what you get back is freedom from sync obligations. The tradeoff: derivation adds tool dependencies. CI now needs `yq`; the test script needs the right `rustc` on `PATH` (which `rust-toolchain.toml` ensures via rustup, so this is self-consistent). If those tools aren't available, derivation fails loudly, which is strictly better than silently using a stale hardcoded value that passes locally and breaks in CI at 2am.