# Rustc's parallel LLVM backend
Main phases of a crate going from rustc and getting out a binary artifact:
1. Crate is split into *codegen units*, a rustc-specific data structure that says what gets translated where
2. Each codegen unit is translated to an LLVM module. Each LLVM module is in its own `LLVMContextRef` and is independent of all other modules/contexts, allowing us to send these between threads.
3. Codegen units are always optimized. Even at `-O0` we optimize. (think of `#[inline(always)]`)
4. If enabled, codegen units then participate in ThinLTO or "fat LTO"
5. Remaining codegen units are then translated to machine code
Step (1) cannot be parallelized, it's a query.
Step (2) today is sequential. With a parallel rustc it can be parallelized
Step (3) is parallelized today
Step (4) cannot be parallelized (by definition it's serial work)
Step (5) is parallelized today
An excellent visual overview of codegen today looks like the graph in [this comment](https://github.com/rust-lang/rust/issues/64913#issuecomment-537226853). The notable points on that graph are:
* Thread 0 is the thread that generates LLVM modules (translates to LLVM). It is drawn at the top, and at some point it finishes producing modules.
* The "stair step" look of the other threads represents how the main thread creates an LLVM module and then sends it to another thread. After sending the LLVM module to another thread the main thread keeps translating. This "stair" is "fixed" with a parallel rustc and it can look more like a wall.
* This is an optimized build, so ThinLTO is enabled. There's a first point where all LLVM cgus stop. That's where the first phase of optimization is done and a small amount of serial work is done. This serial work is attributed to Thread 6, you can see it poke out a bit.
* Afterwards a wall of optimization happens while all CGUs are optimized after ThinLTO in parallel.
![](https://i.imgur.com/Ei7b0iD.png)
## Codegen Units
For the purposes of parallelism codegen units aren't really too interesting beyond the fact that rustc can create multiple codegen units for one crate. One very interesting aspect though is that the split of a crate into codegen units is automatic and has no user input. We ideally want each codegen unit to take roughly the same amount of time in LLVM to avoid having one thread spinning on an extra-huge CGU while all other threads are idle because their LLVM modules were small.
[PRs like this](https://github.com/rust-lang/rust/pull/65281) are recent attempts to improve the auto-splitting algorithm on behalf of rustc.
Apart from codegen splitting, though, this isn't particularly interesting with respect to the parallelism of the backend.
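To make the balancing goal concrete, here's a minimal sketch of one way to spread work evenly: sort items by estimated cost and always hand the next one to the lightest CGU. This is *not* rustc's actual partitioning algorithm; `Item`, `size_estimate`, `partition`, and `total_size` are hypothetical names made up for illustration.

```rust
/// Hypothetical stand-in for something rustc would translate,
/// tagged with a made-up estimate of how expensive it is for LLVM.
#[derive(Debug, Clone, PartialEq)]
pub struct Item {
    pub name: &'static str,
    pub size_estimate: usize,
}

/// Greedily distribute `items` across `n` codegen units: biggest items
/// first, each one going to whichever CGU is currently the lightest.
pub fn partition(mut items: Vec<Item>, n: usize) -> Vec<Vec<Item>> {
    assert!(n > 0, "need at least one codegen unit");
    // Largest first, so the big items get spread out before the small ones.
    items.sort_by(|a, b| b.size_estimate.cmp(&a.size_estimate));

    let mut cgus: Vec<Vec<Item>> = vec![Vec::new(); n];
    let mut totals = vec![0usize; n];
    for item in items {
        // Index of the lightest CGU so far (the first one wins on ties).
        let (idx, _) = totals
            .iter()
            .enumerate()
            .min_by_key(|&(_, total)| *total)
            .unwrap();
        totals[idx] += item.size_estimate;
        cgus[idx].push(item);
    }
    cgus
}

/// Sum of the size estimates in one CGU.
pub fn total_size(cgu: &[Item]) -> usize {
    cgu.iter().map(|item| item.size_estimate).sum()
}
```

With items of size 8, 7, 6, 5, and 4 split two ways, no CGU ends up more than one small item heavier than the other, which is the property we want so that no thread is stuck with an extra-huge module.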
## Main Thread vs Workers
> **Note**: this is documentation that is basically only relevant to today's architecture. Much of this will change with a truly parallel rustc.
Translation from Rust MIR to LLVM IR requires the `TyCtxt` to be around, which means that today this translation is a single-threaded task. The main thread is the only thread which can create an LLVM module, and as a result it will create CGUs and then send them to a "coordinator thread" for further work.
The "coordinator thread" above is unconditionally spawned by rustc and is used to coordinate work between the main thread and worker threads which are optimizing/codegen'ing. It has a [very large doc block](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1055-L1189). This thread is responsible for a few critical tasks:
* It actually signals the main thread when it can translate a new CGU (more on this later)
* It receives translated CGUs from the main thread, and then it will spawn a new thread to optimize the CGU
* It collects optimized CGUs from worker threads. It also performs serial ThinLTO work (or linking in "fat LTO").
* It manages codegen for each CGU.
* It is responsible for collecting all of the results and sending them back to the main thread when everything is ready.
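The message flow described above can be sketched with plain channels and threads. This is a heavily simplified toy, not rustc's implementation: `Module`, `Message`, `spawn_worker`, and `compile` are hypothetical names, "optimizing" is just flipping a flag, and there is no jobserver, LTO, or error handling.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical stand-in for an LLVM module.
#[derive(Debug)]
pub struct Module {
    pub name: String,
    pub optimized: bool,
}

/// Messages the coordinator receives (illustrative names only).
enum Message {
    /// The main thread finished translating one CGU.
    Translated(Module),
    /// The main thread has nothing left to translate.
    TranslationDone,
    /// A worker finished optimizing a module.
    Optimized(Module),
}

/// One short-lived thread per module: do the work, report back, exit.
fn spawn_worker(module: Module, tx: Sender<Message>) {
    thread::spawn(move || {
        let done = Module { optimized: true, ..module };
        // If the coordinator is gone there is nobody left to report to.
        let _ = tx.send(Message::Optimized(done));
    });
}

/// "Main thread": translate each CGU serially and hand it to the coordinator.
pub fn compile(cgu_names: &[&str]) -> Vec<Module> {
    let (tx, rx) = channel::<Message>();
    let worker_tx = tx.clone();

    let coordinator = thread::spawn(move || {
        let mut in_flight = 0usize;
        let mut translating = true;
        let mut finished = Vec::new();
        while translating || in_flight > 0 {
            // `worker_tx` lives in this thread, so a sender always exists.
            match rx.recv().expect("a sender is always alive here") {
                Message::Translated(module) => {
                    in_flight += 1;
                    spawn_worker(module, worker_tx.clone());
                }
                Message::TranslationDone => translating = false,
                Message::Optimized(module) => {
                    in_flight -= 1;
                    finished.push(module);
                }
            }
        }
        finished
    });

    for name in cgu_names {
        let module = Module { name: name.to_string(), optimized: false };
        tx.send(Message::Translated(module)).unwrap();
    }
    tx.send(Message::TranslationDone).unwrap();

    // Workers finish in arbitrary order, so sort for a stable result.
    let mut modules = coordinator.join().expect("coordinator panicked");
    modules.sort_by(|a, b| a.name.cmp(&b.name));
    modules
}
```

Note how the main thread keeps translating while earlier modules are already being optimized; that overlap is what produces the "stair step" in the timing graph.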
Today the coordinator thread has a "mysterious" channel in `TyCtxt` which is a backchannel to send it information. This may well have been refactored by now, but the gist is that there's a channel for sending messages to it.
Worker threads which actually perform work are pretty simple. They have some shenanigans to ensure that error reports from LLVM are routed back to the main thread as well as handling of panics to ensure things are torn down cleanly. Other than that though they're pretty standard "just go do the work then exit" threads. We don't currently have a thread pool, and I don't think we've actually yet seen benchmarks showing that we need a thread pool...
## ThinLTO
This was discussed a bit above and you can see it in action [with the timing graph](https://github.com/rust-lang/rust/issues/64913#issuecomment-537226853). The general gist is that *any parallelism at all* requires multiple CGUs. We cannot run passes on an LLVM module in parallel; each LLVM module is single-threaded (think today's rustc `TyCtxt`). Therefore to get any parallelism whatsoever we need to create multiple codegen units for the backend.
As soon as you create multiple codegen units though you've now removed inlining opportunities between those codegen units. Due to the automatic nature of partitioning and the lack of "`#[inline]` everywhere" when ThinLTO was added, we basically need an automatic compiler-built-in way of recovering inlining opportunities. While we *could* just execute full LTO (merge everything into one CGU) that removes parallelism opportunities.
ThinLTO exists for this purpose! ThinLTO is designed to allow cross-module inlining to happen and then performs optimization passes on each module in isolation. AKA it requires a piece of serial work to calculate some inlining data structures and then each module can be optimized in parallel. This has almost all of the benefits of "single CGU fat LTO" but critically can take advantage of all your CPU cores.
In any case, that's the motivation for ThinLTO. The impact on the compiler is that we have a coordination point for crates just after optimization and just before codegen when compiled in release mode. This is handled by the coordinator thread and is quite complicated today, unfortunately.
## Jobserver
Parallelism in a build tool is hard. Cargo will, for example, spawn `$NCPU` `rustc` processes in parallel. It would be a big bummer for each `rustc` to then spawn `$NCPU` threads, possibly creating `$NCPU * $NCPU` units of work on the system. In addition to overloading the system work-wise it can also cause a lot of OOM situations because that's a huge number of CGUs in memory. Anyway the "solution" to this is to use a [jobserver](https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html), which was invented by `make` and we've lifted it to use in Cargo.
A "jobserver" is a glorified IPC semaphore. You add in a bunch of tokens to it, share it with all your processes, and then whenever a process wants to do more work it acquires a token first. By placing N tokens in a jobserver you're guaranteed that no more than N units of work will be running at a time.
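As an in-process analogy (a real jobserver is shared *between processes*, which this sketch doesn't attempt), here's a tiny counting semaphore built from `Mutex` and `Condvar`. `TokenPool` and `peak_concurrency` are hypothetical names for illustration and have nothing to do with the `jobserver` crate's API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// A pool of tokens; holding one means "I'm allowed to do work".
pub struct TokenPool {
    available: Mutex<usize>,
    cond: Condvar,
}

impl TokenPool {
    pub fn new(tokens: usize) -> Arc<Self> {
        Arc::new(TokenPool { available: Mutex::new(tokens), cond: Condvar::new() })
    }

    /// Block until a token is available, then take it.
    pub fn acquire(&self) {
        let mut n = self.available.lock().unwrap();
        while *n == 0 {
            n = self.cond.wait(n).unwrap();
        }
        *n -= 1;
    }

    /// Put a token back and wake up one waiter.
    pub fn release(&self) {
        *self.available.lock().unwrap() += 1;
        self.cond.notify_one();
    }
}

/// Spawn `jobs` threads that each need a token to "work", and report the
/// highest number of threads that were ever working at the same time.
pub fn peak_concurrency(tokens: usize, jobs: usize) -> usize {
    let pool = TokenPool::new(tokens);
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (pool, running, peak) = (pool.clone(), running.clone(), peak.clone());
            thread::spawn(move || {
                pool.acquire();
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::yield_now(); // pretend to do some work
                running.fetch_sub(1, Ordering::SeqCst);
                pool.release();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

However many jobs are thrown at it, `peak_concurrency(2, 8)` never reports more than 2, which is the whole point of the token scheme.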
The integration in rustc isn't too interesting here. The [`jobserver` crate](https://crates.io/crates/jobserver) has all the cross-platform details and is used by both `rustc` and Cargo. First `rustc` needs a [`Client`](https://docs.rs/jobserver/0.1.17/jobserver/struct.Client.html), which it obtains via the [`Client::from_env`](https://docs.rs/jobserver/0.1.17/jobserver/struct.Client.html#method.from_env) method. If that fails it just creates a local one with `$NCPU` tokens.
The weird thing about jobservers is that acquiring a token always blocks. Apparently across platforms and across `make` versions you just can't rely on nonblocking I/O. That's a bummer for our coordinator thread, because while it's blocked waiting for a jobserver token other messages might come in. Turns out `make` literally relies on EINTR via SIGCHLD to wake up the blocking call for a jobserver token. It's weird. In any case we solve this with **another** helper thread via [`Client::into_helper_thread`](https://docs.rs/jobserver/0.1.17/jobserver/struct.Client.html#method.into_helper_thread).
The jobserver helper thread works like so:
* Occasionally you call [`HelperThread::request_token`](https://docs.rs/jobserver/0.1.17/jobserver/struct.HelperThread.html#method.request_token). This will cause the helper thread to attempt to read a token from the jobserver (blocking the helper thread)
* When acquired, the helper thread will invoke the callback passed to [`Client::into_helper_thread`](https://docs.rs/jobserver/0.1.17/jobserver/struct.Client.html#method.into_helper_thread). In rustc's case this sends the `Acquired` on a channel back to the coordinator thread.
* The coordinator thread receives a token and then may let itself spawn more work.
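The steps above can be sketched with std-only stand-ins. This is an assumption-laden toy that mimics the shape of the helper thread rather than using the real `jobserver` crate: the "jobserver" here is just a channel whose `recv` blocks, and `Token`, `Message`, `Helper`, `spawn_helper`, and `acquire_two_tokens` are hypothetical names.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Hypothetical stand-in for an acquired jobserver token.
pub struct Token;

/// Messages the coordinator can receive (illustrative only).
pub enum Message {
    Acquired(Token),
    WorkerDone,
}

/// Handle used to ask the helper thread for one more token.
pub struct Helper {
    requests: Sender<()>,
}

impl Helper {
    /// Never blocks the caller; the helper thread does the blocking read.
    pub fn request_token(&self) {
        let _ = self.requests.send(());
    }
}

/// Spawn a helper that performs one blocking read on `jobserver` per
/// request and passes each token it gets to `on_token`.
pub fn spawn_helper<F>(jobserver: Receiver<Token>, mut on_token: F) -> Helper
where
    F: FnMut(Token) + Send + 'static,
{
    let (requests, rx) = channel::<()>();
    thread::spawn(move || {
        // The loop ends once the `Helper` (the only request sender) is dropped.
        for () in rx {
            match jobserver.recv() {
                Ok(token) => on_token(token),
                Err(_) => break, // the toy jobserver went away
            }
        }
    });
    Helper { requests }
}

/// Coordinator side: request two tokens, then keep handling messages
/// until both have arrived. Returns how many tokens were received.
pub fn acquire_two_tokens() -> usize {
    let (jobserver_tx, jobserver_rx) = channel::<Token>();
    let (coord_tx, coord_rx) = channel::<Message>();

    let helper = spawn_helper(jobserver_rx, move |token| {
        let _ = coord_tx.send(Message::Acquired(token));
    });
    helper.request_token();
    helper.request_token();

    // Someone else (think Cargo or `make`) makes two tokens available.
    jobserver_tx.send(Token).unwrap();
    jobserver_tx.send(Token).unwrap();

    let mut tokens = Vec::new();
    while tokens.len() < 2 {
        match coord_rx.recv().unwrap() {
            Message::Acquired(token) => tokens.push(token),
            Message::WorkerDone => { /* other coordinator work would go here */ }
        }
    }
    tokens.len()
}
```

The key property is that the coordinator only ever blocks on its own message channel, so token acquisition and every other kind of message arrive through the same `recv` call.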
Tokens are [stored locally in the coordinator thread](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1224-L1226), are [then used to limit](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1343-L1358) how much work is spun up if we have work to do, and they are [immediately released](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1360-L1361) if we have tokens but aren't using them for work. In general this strategy works out pretty well because the units of parallelism here are extremely coarse (whole LLVM modules) so excessive jobserver traffic doesn't really show up much.
## Limiting CGUs in-memory
The last thing that's a real gotcha with the parallel backend today is a feature added long ago in [#43506](https://github.com/rust-lang/rust/pull/43506). The jobserver is a coarse way to limit the amount of active memory on a system, but that only helps if each process doesn't allocate a ton of memory in a single-threaded context. Before [#43506](https://github.com/rust-lang/rust/pull/43506) rustc would, serially, translate all modules to LLVM and *then* process all modules in parallel. This means that rustc had a massive peak memory spike during translation where `TyCtxt` and every single LLVM module were all resident in memory at the same time.
The solution to this problem was basically to avoid all LLVM modules being resident in memory at the same time. The way to do that was to hook the main translation thread into the coordinator thread, and have the coordinator thread start/stop the main thread whenever it thinks that there are enough LLVM modules in memory. Inside the coordinator thread loop [is a block that governs this](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1253-L1275) as well as [a heuristic function](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1518-L1526). This heuristic attempts to keep the number of active LLVM modules in memory at a minimum, stopping the main thread from translating new LLVM modules as appropriate and using the main thread's jobserver token to instead optimize/codegen a module.
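To give a feel for the kind of decision being made, here's a deliberately simplified sketch. It is *not* the heuristic linked above: `MainThreadAction`, `next_main_thread_action`, and the `max_queued` threshold are hypothetical, and the real code weighs more state than this.

```rust
/// What the main thread should spend its jobserver token on next.
#[derive(Debug, PartialEq, Eq)]
pub enum MainThreadAction {
    /// Translate another CGU to an LLVM module.
    Translate,
    /// Stop translating and optimize/codegen an already-queued module.
    LlvmWork,
    /// Nothing to do right now.
    Idle,
}

/// Toy policy: keep translating only while the queue of not-yet-processed
/// modules is short; otherwise drain the queue to cap memory usage.
pub fn next_main_thread_action(
    queued_modules: usize,
    max_queued: usize, // made-up tuning knob
    translation_done: bool,
) -> MainThreadAction {
    if !translation_done && queued_modules < max_queued {
        MainThreadAction::Translate
    } else if queued_modules > 0 {
        MainThreadAction::LlvmWork
    } else {
        MainThreadAction::Idle
    }
}
```

The important bit is the middle branch: once enough modules are waiting, the main thread is better off shrinking the queue than growing it.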
## How parallel rustc changes translation
A parallel compiler will radically change our LLVM backend and (I believe) drastically simplify it. As soon as we can use `TyCtxt` in parallel this lifts the limitation that only the main thread can produce LLVM modules, allowing us to remove the coordinator thread entirely and restructure in a much more understandable format. It won't be "insanely simple" but for an optimized (ThinLTO) build we could start having driver code that looks like:
```rust
// serially partition the crate
let cgus = tcx.codegen_units();

// in parallel translate CGUs to LLVM and then optimize.
let objects = cgus
    .par_iter()
    .map(|cgu| translate_to_llvm(cgu))
    .map(|cgu| optimize(cgu))
    .collect::<Vec<_>>();

// serially calculate ThinLTO data
let data = thin_lto::prepare(&objects);

// and then in parallel again do final optimizations and codegen
let objects = objects
    .into_par_iter()
    .map(|cgu| thinlto_optimize_and_codegen(&data, cgu))
    .collect::<Vec<_>>();

return objects;
```
One of the critical simplifications is that jobserver specialization goes away since it's baked into whatever rustc has for "run this loop in parallel" already. Additionally the previous section, limiting CGUs in-memory, no longer needs special treatment. We don't run the risk of the main thread running away and translating all LLVM modules at once because there is no main thread.
... as I write this though I realize that we don't necessarily drop `TyCtxt` eagerly here and that may be important. Anyway this is definitely something we can play with but I think will become much simpler at least for removing the coordinator thread. (because oh man would it be nice to do that.)
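For anyone who wants to play with the parallel/serial/parallel shape of that driver without a compiler attached, here's a toy that runs with just the standard library. Everything in it is a stand-in: `par_map` is a naive thread-per-item helper built on `std::thread::scope` (not a thread pool and not whatever rustc ends up using), and `Module`, `build`, and the "sizes" are made up.

```rust
use std::thread;

/// Hypothetical stand-in for an LLVM module.
#[derive(Debug, Clone, PartialEq)]
pub struct Module {
    pub name: String,
    pub size: usize,
}

/// Run `f` over every input on its own scoped thread, preserving order.
fn par_map<T: Send, U: Send>(inputs: Vec<T>, f: impl Fn(T) -> U + Sync) -> Vec<U> {
    thread::scope(|s| {
        let f = &f;
        // Spawn everything first so the work actually overlaps...
        let handles: Vec<_> = inputs
            .into_iter()
            .map(|input| s.spawn(move || f(input)))
            .collect();
        // ...then collect results in the original order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

pub fn build(names: &[&str]) -> Vec<String> {
    // serially "partition the crate"
    let cgus: Vec<String> = names.iter().map(|n| n.to_string()).collect();

    // in parallel "translate and optimize" each CGU
    let modules = par_map(cgus, |name| {
        let size = name.len() * 10;
        Module { name, size }
    });

    // serially compute data that needs to see every module (the ThinLTO-ish step)
    let total: usize = modules.iter().map(|m| m.size).sum();

    // and then in parallel again do the "final optimizations and codegen"
    par_map(modules, |m| format!("{}.o ({} of {} units)", m.name, m.size, total))
}
```

There's no coordinator, no message enum, and no start/stop signalling here; the two `par_map` calls with a serial step in between are the entire control flow.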