# GOSIM unconf 2024: better rust codegen

[raw notes](https://hackmd.io/xIH7YrPMT_KRkjkWdiCOPg)

At the GOSIM 2024 unconf on the 20th of October, we ran a short unconf session on improving rust code generation: having rustc emit code that runs faster. We looked at three general topics: changing the language to more directly express efficient codegen, improving rust MIR optimizations, and improving how LLVM operates on rustc-generated LLVM IR.

## attendees

- Gary Guo
- Folkert de Vries
- Amanieu d'Antras
- Josh Triplett
- Jack Huey
- Nicholas Nethercote
- David Lattimore
- ... (let me know who I've missed)

# Language features

## Labeled match

Draft RFC by Folkert: https://github.com/folkertdev/rust-rfcs/blob/labeled-match/text/0000-labeled-match.md

Labeled match can express control flow that we cannot easily express today (like C's switch fallthrough), and can be used for improved codegen (generating a direct jump instead of a jump table). Labeled match had already been discussed with many attendees in one-on-one conversations, so it was not given a lot of attention here.

## Improve ergonomics of per-cpu-feature implementations

In various places (SIMD code, the linux kernel) it is possible to write faster implementations of a function by using a specific CPU feature. But running code that uses a CPU feature on a system that does not actually have that feature is UB, so you have to be careful about which variation you run. Deciding which function variation to run at runtime can be slow.

In glibc, a mechanism called IFUNC is used: this resolves which variation to call at program load time, based on the available CPU features. In this approach, the relevant functions are all indirect calls that go via the global offset table (GOT). This table gets patched to refer to the best variation. The kernel actually goes over the whole binary and fixes up the indirection, which is even more efficient.

Related things in the rust space:

- https://docs.rs/multiversion/0.7.4/multiversion/index.html can compile the same function with multiple target features (e.g. for autovectorization), then picks the best one at runtime. The CPU feature checking is done only once. This uses an atomic containing a function pointer.
- [externally implementable functions](https://github.com/rust-lang/rfcs/pull/3632): a mechanism for defining a function whose implementation can be defined (or overridden) in another crate.
- [struct target features](https://github.com/rust-lang/rfcs/blob/01c44d38f5e22ea79a09aa91e067955891a61dea/text/3525-struct-target-feature.md) improves the ergonomics of defining functions with target features. [Zulip thread](https://rust-lang.zulipchat.com/#narrow/channel/257879-project-portable-simd/topic/struct_target_features.20.28RFC.20.233525.29)

A simpler use case for "check once, cache the result" is something like [`tracing`](https://docs.rs/tracing/latest/tracing/)'s log level: the value is cached in an atomic, but that means every log print must still perform an atomic read. A good next step would be a crate that implements static if (an if that stores its result in an interior-mutable static), to see what performance impact that has.
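A minimal sketch of that check-once, cache-in-an-interior-mutable-static pattern (not a proposed API; the feature and function names are placeholders for illustration):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// 0 = not yet checked, 1 = feature absent, 2 = feature present.
static AVX2_STATE: AtomicU8 = AtomicU8::new(0);

/// "Static if": the CPU feature check runs at most once; every later call
/// is a single relaxed atomic load, like `tracing`'s cached log level.
/// (Racing threads may both run the check, but they store the same value.)
#[cfg(target_arch = "x86_64")]
fn has_avx2_cached() -> bool {
    match AVX2_STATE.load(Ordering::Relaxed) {
        0 => {
            let detected = std::arch::is_x86_feature_detected!("avx2");
            AVX2_STATE.store(if detected { 2 } else { 1 }, Ordering::Relaxed);
            detected
        }
        state => state == 2,
    }
}

#[cfg(target_arch = "x86_64")]
fn main() {
    if has_avx2_cached() {
        // dispatch to the AVX2 variation here
        println!("avx2 available");
    } else {
        println!("avx2 not available");
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```

Benchmarking something like this against redoing the feature detection on every call (and against an IFUNC-style resolved call) would show how much the remaining atomic load actually costs.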
## Guaranteed Tail Calls

The blockers appear to be:

- it is unclear what amount of optimization is actually guaranteed
- bikeshedding on the syntax (`become` keyword, `#[tail] return`, maybe others)

A limitation of the current approach is that a function that gets tail-called must have the same signature as the caller. That requirement exists so that registers can be reused, which makes it easier to guarantee performance, but there might be appetite for the more basic "a tail call is just a jump to the function body" approach too.

Suggestion by Gary: make `become` just change the drop order (anything not used in the `become <expr>` gets dropped before the `become`), and use a separate pragma for the tail call itself. Even without the `#[tail]` pragma, LLVM can do more with such a call (because it is in "tail position").

### callee-cleanup versus caller-cleanup

When a function returns, who is responsible for cleaning up its stack? It could be either the callee, doing the work before it returns, or the caller, cleaning up after receiving the result of the callee.

Should we question whether Rust's default ABI should be callee-cleanup or caller-cleanup? We should try the experiment: even if it's not always better, it could be opt-in for TCO. A [recent post by Graydon Hoare](https://hachyderm.io/@graydon@types.pl/113353301335646877) suggests that this has been thought about before.

Amanieu: on platforms where arguments are passed in registers, you might not need the signatures to match if the differences are only in arguments passed in registers.

# MIR optimizations

MIR optimization appears pretty basic today: even simple patterns are not optimized, and the MIR optimizer (by default) skips large functions (i.e. exactly the functions where you want optimization). But MIR optimizations have great potential: the rust compiler has more context about rust programs than LLVM does, and the process is fully controlled by rustc developers.

Gary: it looks like some basic patterns are optimization blockers. E.g. taking the address of a value can block optimizations, and `Drop` taking `&mut self` blocks optimizations.

Folkert: by default MIR won't optimize functions larger/more complex than a certain threshold (but these are likely the functions that would benefit most). Even basic things like `let x = 1; match x { 1 => 'a', _ => 'b' }` don't reliably get simplified by MIR.

[Trifecta](https://trifectatech.org/) is interested in working on MIR optimizations if funding can be found.

## Fixing the pass ordering problem

We discussed using an approach similar in spirit to Cranelift for working around the compiler pass ordering problem. The core idea is that the optimizer has a library of rewrite rules and applies these rules to the program, e.g. `x + 0 => x`. Normally, the order in which such rewrites are applied matters: applying one rule can cause another not to match further down the line.

[Equality saturation](https://egraphs-good.github.io/) explores all possible rewrites without using excessive memory. At some point the fuel for the rewrite process runs out (e.g. a time limit, or a limit on the number of steps), and the best combination of the rewrites that were explored is picked. Equality saturation does not directly apply to more complicated structures like whole programs, so modifications are needed, but such a modified system still shares many high-level properties with equality saturation.

Amanieu: probably not helpful to do [what Cranelift does] in the Rust compiler. It primarily works with pure operations, not operations with side effects (e.g. memory loads/stores).
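A toy sketch of the rewrite-rule idea (the expression type and rule are made up for illustration; a real equality-saturation engine keeps all equivalent forms around rather than rewriting destructively like this does):

```rust
// A tiny expression language with a single rewrite rule, `x + 0 => x`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Const(i64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Apply `x + 0 => x` (and `0 + x => x`) bottom-up, destructively.
fn simplify(e: Expr) -> Expr {
    match e {
        Expr::Add(lhs, rhs) => {
            let lhs = simplify(*lhs);
            let rhs = simplify(*rhs);
            match (lhs, rhs) {
                // The rewrite rule itself.
                (x, Expr::Const(0)) | (Expr::Const(0), x) => x,
                (lhs, rhs) => Expr::Add(Box::new(lhs), Box::new(rhs)),
            }
        }
        other => other,
    }
}

fn main() {
    // (a + 0) + 0  =>  a
    let e = Expr::Add(
        Box::new(Expr::Add(
            Box::new(Expr::Var("a".to_string())),
            Box::new(Expr::Const(0)),
        )),
        Box::new(Expr::Const(0)),
    );
    assert_eq!(simplify(e), Expr::Var("a".to_string()));
}
```

The ordering problem shows up as soon as there is more than one rule: applying one rule destructively can prevent another from matching later, which is exactly what equality saturation avoids by keeping all the equivalent forms around until the fuel runs out.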
## `Hidden` function visibility & other linkage details

The way that Rust references functions is not optimal: it assumes the function might come from a shared object, and so generates longer instruction sequences than it needs to. We should mark all our internal symbols as having hidden visibility.

We should review symbol visibility and see if we're doing the right thing; it sounds like we might not be.

The thread-local storage (TLS) model is also very pessimistic by default. This makes rayon much slower (`__tls_get_addr` is very high in profiles).

# LLVM

Discussed only very briefly: in practice, LLVM generates more instructions for a rust program than for an equivalent C program. The theory is that LLVM overfits on the output of clang: rustc generates LLVM IR that is close to, but not identical to, clang's output, causing optimizations to be missed.

Folkert: based on conversations with e.g. the [rav1d](https://github.com/memorysafety/rav1d) team, there is probably a lot of low-hanging fruit in LLVM, where just making an existing optimization also fire on rustc's LLVM IR output would improve codegen.

Which companies are working on both LLVM and Rust? Could we talk to them about working on LLVM to improve Rust codegen?
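One concrete way to start looking for that low-hanging fruit is to compare the IR and assembly that rustc and clang emit for equivalent functions. A minimal sketch (the function and file names are placeholders, not something from the session):

```rust
// example.rs -- a small function whose output can be compared against an
// equivalent C function compiled with clang, e.g.:
//
//   rustc -O --crate-type=lib --emit=llvm-ir,asm example.rs
//   clang -O2 -S -emit-llvm example.c
//
// Differences in the emitted IR/assembly point at optimizations that fire
// on clang's output but not on rustc's.
#[no_mangle]
pub fn sum_slice(xs: &[u32]) -> u32 {
    // wrapping_add matches C's unsigned wraparound semantics.
    xs.iter().fold(0u32, |acc, &x| acc.wrapping_add(x))
}
```

Any transformation that fires on clang's output but not on rustc's is a candidate for the kind of LLVM-side fix described above.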