# GOSIM unconf 2024: better rust codegen
[raw notes](https://hackmd.io/xIH7YrPMT_KRkjkWdiCOPg)
At the GOSIM 2024 unconf on the 20th of October, we ran a short unconf session on improving rust code generation: having rustc emit code that runs faster. We looked at three general topics: changing the language to more directly express efficient codegen, improving rust MIR optimizations, and improving how LLVM handles rustc-generated LLVM IR.
## attendees
- Gary Guo
- Folkert de Vries
- Amanieu d'Antras
- Josh Triplett
- Jack Huey
- Nicholas Nethercote
- David Lattimore
- ... (let me know who I've missed)
# Language features
## Labeled match
draft RFC by Folkert here https://github.com/folkertdev/rust-rfcs/blob/labeled-match/text/0000-labeled-match.md
Labeled match can express control flow that we cannot express easily today (like C's switch fallthrough), and can be used for improved codegen (generating a direct jump instead of going through a jump table). Labeled match had already been discussed with many attendees in one-on-one conversations, so it was not given a lot of attention here.
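For illustration, here is a sketch of a small state machine using labeled match. The syntax is hypothetical and only loosely based on the draft RFC; today this would be written as a `loop { match state { ... } }`, which typically compiles to a jump back to a shared dispatch point rather than a direct jump from one state to the next.

```rust
// Hypothetical syntax, loosely based on the draft RFC; details are not settled.
enum State {
    Start,
    Middle,
    Done,
}

fn run(mut acc: u32) -> u32 {
    'machine: match State::Start {
        State::Start => {
            acc += 1;
            // Jump directly into the `Middle` arm, without re-dispatching
            // through a jump table.
            continue 'machine State::Middle;
        }
        State::Middle => {
            acc *= 2;
            continue 'machine State::Done;
        }
        State::Done => acc,
    }
}
```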
## Improve ergonomics of per-cpu-feature implementations
In various places (SIMD code, the linux kernel) it is possible to write faster implementations of a function by using a specific CPU feature. But running code that uses a CPU feature on a system that does not actually have that feature is UB, so you have to be careful with which variation you run. Deciding what function variation to run at runtime can be slow.
In glibc, a mechanism called IFUNC is used: this resolves which variation to call at program load time based on the available CPU features. In this approach, the relevant functions are all indirect calls that go via the global offset table (GOT). This table gets patched to refer to the best variation. The kernel actually goes over the whole binary and patches out the indirection, which is even more efficient.
Related work in the rust space:
- https://docs.rs/multiversion/0.7.4/multiversion/index.html can compile the same function with multiple target features (e.g. for autovectorization), then picks the best one at runtime. The CPU feature checking is done only once. This uses an atomic containing a function pointer (see the sketch after this list).
- [externally implementable functions](https://github.com/rust-lang/rfcs/pull/3632): A mechanism for defining a function whose implementation can be defined (or overridden) in another crate.
- [struct target features](https://github.com/rust-lang/rfcs/blob/01c44d38f5e22ea79a09aa91e067955891a61dea/text/3525-struct-target-feature.md) improve the ergonomics of defining functions with target features. [zulip thread](https://rust-lang.zulipchat.com/#narrow/channel/257879-project-portable-simd/topic/struct_target_features.20.28RFC.20.233525.29)
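For illustration, a sketch of the "check the CPU feature once, cache the result" dispatch pattern, roughly in the spirit of what multiversion does. The function names are made up, and `OnceLock` is used here for simplicity where multiversion stores a function pointer in an atomic.

```rust
use std::sync::OnceLock;

type SumFn = fn(&[f32]) -> f32;

fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    // A real version would use AVX2 intrinsics; the scalar loop is a
    // stand-in so the sketch compiles.
    xs.iter().sum()
}

/// Picks the best implementation the first time it is called; after that,
/// each call only pays for an atomic load, a branch, and an indirect call.
pub fn sum(xs: &[f32]) -> f32 {
    static IMPL: OnceLock<SumFn> = OnceLock::new();
    let f = IMPL.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        if is_x86_feature_detected!("avx2") {
            // Sound to call: the runtime check guarantees AVX2 is available.
            return |xs: &[f32]| unsafe { sum_avx2(xs) };
        }
        sum_scalar
    });
    f(xs)
}
```

The IFUNC mechanism described above goes one step further: the variation is resolved once at load time, so the per-call check disappears entirely.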
A simpler use case for the "check once, cache result" pattern is something like [`tracing`](https://docs.rs/tracing/latest/tracing/)'s log level: the value is cached in an atomic, but that means every log print must still perform an atomic read.
A good next step would be a crate that implements static if (an if that stores its result in an interior-mutable static), to see what performance impact that has.
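A minimal sketch of what such a static-if helper could look like, assuming the "store the condition's result in an interior-mutable static" design described above. The names are illustrative; no such crate is implied to exist.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Caches the result of a boolean condition in an interior-mutable static:
/// 0 = not yet evaluated, 1 = false, 2 = true.
pub struct StaticIf {
    state: AtomicU8,
}

impl StaticIf {
    pub const fn new() -> Self {
        StaticIf { state: AtomicU8::new(0) }
    }

    #[inline]
    pub fn check(&self, cond: impl FnOnce() -> bool) -> bool {
        match self.state.load(Ordering::Relaxed) {
            1 => false,
            2 => true,
            _ => {
                // The condition may run more than once under contention,
                // so it should be idempotent.
                let value = cond();
                self.state.store(if value { 2 } else { 1 }, Ordering::Relaxed);
                value
            }
        }
    }
}

static EXPENSIVE_CHECK: StaticIf = StaticIf::new();

fn hot_path() -> u32 {
    // After the first call this is a relaxed load plus a branch, which is
    // exactly the cost the proposed experiment would measure.
    if EXPENSIVE_CHECK.check(|| std::env::var_os("FAST_PATH").is_some()) {
        1
    } else {
        2
    }
}
```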
## Guaranteed Tail Calls
the blockers appear to be
- it is unclear what amount of optimization is actually guaranteed
- bikeshedding on the syntax (`become` keyword, `#[tail] return`, maybe others)
A limitation of the current approach is that a function that gets tail-called must have the same signature as the caller. That restriction exists so that argument registers can be reused, which makes it easier to guarantee performance, but there might be appetite for the more basic "a tail call is just a jump to the function body" approach too.
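For illustration, a minimal sketch using the `become` keyword from the in-progress explicit tail calls work. This is nightly-only, and the feature gate name and exact rules are assumptions based on the current proposal.

```rust
// Nightly-only; the feature gate name is an assumption based on the
// in-progress explicit tail calls work.
#![feature(explicit_tail_calls)]

fn count_down(n: u64, acc: u64) -> u64 {
    if n == 0 {
        acc
    } else {
        // `become` guarantees the call reuses the current stack frame, so the
        // recursion cannot overflow the stack. Caller and callee have the
        // same signature here, matching the restriction described above.
        become count_down(n - 1, acc + n)
    }
}
```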
Suggestion by Gary: make `become` just change the drop order (anything not used in the `become <expr>` gets dropped before the `become`), and use a separate pragma for the tail call itself. Even without the `#[tail]` pragma LLVM can do more with such a call (because it is in "tail position").
### callee-cleanup versus caller-cleanup
When a function returns, who is responsible for cleaning up its stack? It could be either the callee, doing the work before its return, or the caller, cleaning up after receiving the result of the callee.
Should we question whether Rust's default ABI should be callee-cleanup or caller-cleanup? We should try the experiment: even if it's not always better, it could be opt-in for TCO. A [recent post by Graydon Hoare](https://hachyderm.io/@graydon@types.pl/113353301335646877) suggests that this has been thought about before.
Amanieu: On platforms where arguments are passed in registers, you might not need to match the full signature as long as the differences are only in register-passed arguments.
# MIR optimizations
MIR optimization appears pretty basic today: even simple patterns are not optimized, and the MIR optimizer (by default) skips large functions (i.e. exactly the functions where you want optimization). But MIR optimizations have great potential: the rust compiler has more context than LLVM about rust programs, and the process is fully controlled by rustc developers.
Gary: It looks like some basic patterns are optimization blockers. E.g. taking the address of a value can block optimizations, and `Drop` uses `&mut self`, which also blocks optimizations.
Folkert: by default MIR won't optimize functions larger/more complex than a certain threshold (but these are likely the functions that would benefit most). Even basic things like `let x = 1; match x { 1 => 'a', _ => 'b' }` don't reliably get simplified by MIR. [Trifecta](https://trifectatech.org/) is interested in working on MIR optimizations if funding can be found.
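For concreteness, this is the kind of simplification meant here: `x` is known to be `1`, so the function body should fold to a constant, but per the notes above MIR does not reliably do this today.

```rust
fn example() -> char {
    let x = 1;
    match x {
        1 => 'a',
        _ => 'b',
    }
}

// The hoped-for MIR-level result is equivalent to:
fn example_folded() -> char {
    'a'
}
```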
## Fixing the pass ordering problem
We discussed using an approach similar in spirit to Cranelift for working around the compiler pass ordering problem.
The core idea is that an optimizer has a library of rewrite rules, and applies these rewrite rules to the program, e.g. `x + 0 => x`. Normally, the order in which such rewrites are applied matters: applying one rule can cause another not to match down the line. [Equality saturation](https://egraphs-good.github.io/) explores all possible rewrites without using excessive memory. At some point, the fuel for the rewrite process runs out (e.g. a time limit, or a limit on the number of steps), and the best combination of the rewrites that were explored is picked.
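As a toy illustration of a rewrite rule like `x + 0 => x` applied to an expression tree. This is a plain destructive rewrite, not equality saturation; an e-graph-based engine would instead record both forms as equivalent and pick the best one at extraction time.

```rust
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Const(i64),
    Var(&'static str),
    Add(Box<Expr>, Box<Expr>),
}

/// Applies `x + 0 => x` (and `0 + x => x`) bottom-up.
fn rewrite_add_zero(e: Expr) -> Expr {
    match e {
        Expr::Add(lhs, rhs) => {
            let lhs = rewrite_add_zero(*lhs);
            let rhs = rewrite_add_zero(*rhs);
            match (lhs, rhs) {
                (x, Expr::Const(0)) | (Expr::Const(0), x) => x,
                (l, r) => Expr::Add(Box::new(l), Box::new(r)),
            }
        }
        other => other,
    }
}

fn main() {
    let e = Expr::Add(Box::new(Expr::Var("x")), Box::new(Expr::Const(0)));
    assert_eq!(rewrite_add_zero(e), Expr::Var("x"));
}
```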
Equality saturation does not directly apply to more complicated structures like whole programs, so modifications are needed, but the modified system still shares many high-level properties with equality saturation.
Amanieu: Probably not helpful to do [what Cranelift does] in the Rust compiler: it primarily works with pure operations, not operations with side effects (e.g. memory loads/stores).
## `Hidden` function visibility & other linkage details
The way that Rust references functions is not optimal: it assumes a function might come from a shared object, and therefore generates longer instruction sequences than necessary. We should mark all our internal symbols as having hidden visibility, and more generally review symbol visibility to check whether we're doing the right thing; it sounds like we might not be.
The thread-local storage (TLS) model is also very pessimistic by default, which makes rayon much slower (`__tls_get_addr` shows up very high in profiles).
# LLVM
Discussed only very briefly: in practice, LLVM generates more instructions for a rust program than for an equivalent C program. The theory is that LLVM overfits on the output of clang: rust generates LLVM IR that is close to, but not identical to, clang's output, causing optimizations to be missed.
Folkert: Based on conversations with e.g. the [rav1d](https://github.com/memorysafety/rav1d) team, there is probably a lot of low-hanging fruit in LLVM where making an existing optimization also fire on rustc's LLVM IR output would improve codegen.
Which companies are working on both LLVM and Rust? Could we talk to them about working on LLVM to improve Rust codegen?