Small projects build relatively quickly, with a combination of check builds and incremental compilation, but issues compound
at scale, e.g. in larger projects with interdependent crates, where available parallelism often goes untapped.
It's hard to specify precise projects in advance. Most of the initial work would be investigating reasons for any current
slowness and then proposing solutions.
Hopefully this can be a holistic look at how users are experiencing compile times. We don't have to limit investigations to the perf.rlo
benchmarks, and measurements can be gathered on real projects. (The low-hanging fruit that is easily seen in our existing benchmarks
is disappearing, but there is surely more to be found elsewhere.)
Similarly, we can also look at wins by
- aggregation of marginal gains: even small wins compound in number (many small wins become big wins) and over time
- reducing ecosystem-wide TCO: even small wins compound over space, i.e. all users.
Some possible priorities:
- A. Making pipelined compilation as efficient as possible
- B. Improving raw compilation speed
- C. Improving support for persistent, cached and distributed builds (for cases where incremental compilation is not an obvious
win but caching can sometimes be one, e.g. on CI)
(we will refer to these items elsewhere in the document as e.g. "theme-pipelined-compilation")
1. Improving compilation times by making rustc/cargo faster
2. Improving compilation times by preventing rustc/cargo from getting slower
3. Improving compilation times by changing the input code: finding sources of slowness in a project (e.g. unused dependencies, duplicate dependencies, buggy build scripts causing unnecessary recompiles, pipelines stalled by the project's architecture, e.g. some possibly slow-to-build-and-execute proc-macros)
4. Helping with other people's tasks achieving the same goals (e.g. reviews, benchmarks, etc.)
The size/complexity of some shorter tasks can be roughly guesstimated (S/M/L), and we'll refer to them as e.g. "size-s"; other tasks are open-ended ongoing work, harder to measure and to call complete ("size-open-ended").
Only scope 1 above is easily and directly measurable; scopes 2-4 are indirect and more open-ended, although concrete tasks are identifiable as parts of ongoing work towards a goal.
### 1. Gathering benchmarks to detect problems and opportunities (the benchmarking itself is generally size-s)
- benchmark/optimize execution and compile times of key crates in the ecosystem: syn/quote/serde and similar slow-to-compile crates that are heavily depended on, or the list of most popular crates (e.g. the playground refers to the 100/200 most popular ones), and so on. Goal: improvements to all dependent crates.
- look for extreme cases of compilation (pathological compile times?) on crates.io; look for the crates that most exercise the different queries, compilation phases and passes, etc. At some scale, quadratic spots can be hit, e.g. in coherence checking; and separate rustc from LLVM (and evaluate the impact of e.g. polymorphization) with at least:
* timely / differential-dataflow (likely nowadays: materialize): Frank has mentioned before that they generated gigabytes of LLVM IR (from what I imagine was excessive monomorphization and extremely generic APIs)
- with the increasing use of compile-time features, check if the CTFE engine/miri are imposing a big cost in practice (and if so, see how much a better VM / faster interpreter, some JIT, etc, could help). Benchmarking: size-s
- analyze proc-macro usage on all of crates.io to see their impact on the ecosystem, look for common work and patterns that could be moved into the proc-macro server, as well as help move along in-progress work there (and try to find optimization opportunities there if needed).
- try to assess the `parallel-compiler` in-progress work on all of crates.io (I'm not sure whether this cfg is enforced in CI, but it does still compile as of writing). The results could be seen as a minimum threshold, though, since IIUC some queries could be re-architected to make better use of parallelism (e.g. doing some work at the module level in parallel)
- maybe interesting: contrast miniserde vs serde vs serde-erased: their design goals are different; see how they respectively impact compile times and the ecosystem (and a possible link to "replacing monomorphization with dynamic dispatch")
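To make the monomorphization/dynamic-dispatch trade-off mentioned in the last point concrete, here's a minimal sketch (function names are made up for illustration): the generic version gets one copy of IR and machine code per instantiated type, while the `dyn` version is compiled once and pays an indirect call instead.

```rust
use std::fmt::Display;

// A generic function is monomorphized: each concrete `T` it is called with
// produces its own copy of LLVM IR and machine code to optimize.
fn describe_mono<T: Display>(value: T) -> String {
    format!("value = {value}")
}

// The dynamic-dispatch version is compiled exactly once, no matter how many
// types call it; the cost moves to an indirect call at runtime.
fn describe_dyn(value: &dyn Display) -> String {
    format!("value = {value}")
}

fn main() {
    // Three instantiations of `describe_mono` (i32, f64, &str)...
    assert_eq!(describe_mono(1_i32), "value = 1");
    assert_eq!(describe_mono(2.5_f64), "value = 2.5");
    assert_eq!(describe_mono("hi"), "value = hi");
    // ...versus a single compiled body for `describe_dyn`.
    assert_eq!(describe_dyn(&1_i32), "value = 1");
    assert_eq!(describe_dyn(&"hi"), "value = hi");
}
```

An opt-in lever trading runtime speed for compile speed would essentially move code from the first shape to the second.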
The first step of gathering data is in-progress, and we've tried to focus on:
- differentiating leaf crates from the full compilation schedule
- CPU and memory usage measurements
- metrics of "size" (line counts, amount of LLVM IR or MIR) to have some idea of throughput in looking for outliers
- cargo timings (although available parallelism here needs to be tailored carefully to common realistic cases)
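As a sketch of the "size metrics for throughput" idea, here's a small hypothetical helper (not existing tooling): given a size metric (lines, MIR, or LLVM IR) and compile time per crate, flag crates whose throughput falls far below the median, using the median absolute deviation as a robust spread estimate.

```rust
// Median of a mutable slice (sorts in place; fine for small benchmark sets).
fn median(values: &mut [f64]) -> f64 {
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mid = values.len() / 2;
    if values.len() % 2 == 0 {
        (values[mid - 1] + values[mid]) / 2.0
    } else {
        values[mid]
    }
}

/// Flag crates whose throughput (size units per second of compile time) sits
/// more than `k` MADs below the median: candidates for closer profiling.
fn throughput_outliers(samples: &[(&str, f64, f64)], k: f64) -> Vec<String> {
    let rates: Vec<f64> = samples.iter().map(|(_, size, secs)| size / secs).collect();
    let med = median(&mut rates.clone());
    let mad = median(&mut rates.iter().map(|r| (r - med).abs()).collect::<Vec<_>>());
    let mut flagged = Vec::new();
    for ((name, _, _), rate) in samples.iter().zip(&rates) {
        if *rate < med - k * mad {
            flagged.push(name.to_string());
        }
    }
    flagged
}

fn main() {
    // Hypothetical numbers: (crate, size metric, compile seconds).
    let samples = [
        ("regex", 120_000.0, 30.0),    // 4000 units/s
        ("serde", 200_000.0, 50.0),    // 4000 units/s
        ("rand", 40_000.0, 10.0),      // 4000 units/s
        ("slowpoke", 10_000.0, 100.0), // 100 units/s: the outlier
    ];
    assert_eq!(throughput_outliers(&samples, 3.0), ["slowpoke"]);
}
```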
### 2. Helping others improve perf, and helping people help themselves (size-open-ended, indirect tasks, scopes 2-4)
- help with the weekly perf report done by the performance WG
- collaboration with interested people
- observability: allowing people to understand which parts of their code or dependencies slow compile times down, and how they could improve build times by changing their own project (pulling things into different crates, limiting mono items using dynamic dispatch, outlining and polymorphization opportunities). A mix between:
* showing unused dependencies and duplicates
* cargo llvm-lines and cargo bloat
* displaying heavy items in the monomorphization graph and their actual cost in CGUs, rather than MIR statement counts
* unexpected rebuilds via build.rs: these sometimes cause long compile times, and there's a hard to find environment variable to track the cause of these rebuilds, which can help fix such issues. We should at least document it better, and maybe surface it in the CLI in a more prominent manner.
* maybe being able to monitor builds over time and show possible improvements (to users or rustc devs), e.g. single-core bottlenecks in their cargo timings graphs (not necessarily related to the old telemetry ideas)
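A sketch of what detecting single-core bottlenecks from timing data could look like (a hypothetical helper; cargo's actual timings format is not assumed here): a sweep over per-unit build intervals, totaling the stretches where only one unit was building.

```rust
/// Total time during which exactly one unit of work was building: the
/// stretches where the build is fully serialized and extra cores sit idle.
/// Intervals are (start, end) pairs in seconds.
fn serialized_time(intervals: &[(f64, f64)]) -> f64 {
    // Sweep line: +1 event at each start, -1 at each end.
    let mut events: Vec<(f64, i32)> = Vec::new();
    for &(start, end) in intervals {
        events.push((start, 1));
        events.push((end, -1));
    }
    events.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    let (mut active, mut last_t, mut total) = (0i32, 0.0f64, 0.0f64);
    for (t, delta) in events {
        // Accumulate the span since the previous event if concurrency was 1.
        if active == 1 {
            total += t - last_t;
        }
        active += delta;
        last_t = t;
    }
    total
}

fn main() {
    // Three units: two overlap on [2, 4]; one runs alone on [8, 10].
    let timings = [(0.0, 4.0), (2.0, 6.0), (8.0, 10.0)];
    // Serialized stretches: [0,2] + [4,6] + [8,10] = 6 seconds single-core.
    assert!((serialized_time(&timings) - 6.0).abs() < 1e-9);
}
```

Surfacing such stretches (and which unit occupied them) would point directly at pipeline stalls users could fix by restructuring crates.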
### 3. rustc (generally theme-raw-compilation-speed)
#### 3a. rustc - beginning of the pipeline
- help the frontend work to make it more incremental, queryfied, and parallel. (There are many open issues and PRs, each with a different scope, so more "size-open-ended")
- emit rmetas as early in the pipeline as possible, for both efficiency and latency. Some work in this area has already been tried by njn and it didn't make that much of an improvement. The situation may have changed (higher core counts, deeper/wider build graphs), and it could also be more impactful if combined with the cargo-specific build schedule changes we'll touch on below (theme-pipelined-compilation; and there are some quick things to try here as size-s)
- optimize encoding/decoding of metadata files as well as incremental compilation caches, e.g. LEB128 encoding via SIMD, as well as encoding & decoding of Spans. Note that LEB128 handling has already been heavily optimized, and low-hanging fruit is unlikely here. (Some SIMD LEB128 experiments could be size-s/size-m if SSSE3 can be detected early enough -- of course, the concerns of micro-architecture levels apply here and to all similar topics requiring levels other than 1, be it in libcore/libstd or rustc. Scalar improvements could require rearchitecting things to not encode one int at a time.) (theme-pipelined-compilation as well)
- (probably not super interesting: a few things have been tried before by njn) lexing/parsing: a few years back, some perf tests were done on the list of pre-interned symbols (as well as a few PRs, which I don't remember being merged, to improve cacheability and open up parallelism opportunities here) which seemed interesting to try again, especially as one such change had landed and was categorized as a "rogue optimization" in a t-compiler meeting. Gather a list of the most common identifiers/names on crates.io and libstd/libcore, and use them as pre-interned symbols to speed up lexing/parsing. Or use data from previous compilations to intern the most common symbols in the current crate, or from the whole crate graph to limit symbol lookups when importing metadata.
- look for parallelism opportunities in incremental compilation cache encoding
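For reference on the LEB128 point above, this is the scalar shape of unsigned LEB128 (the variable-length integer encoding used for metadata and incremental caches): one integer at a time, 7 payload bits per byte, which is exactly the loop-carried structure that makes SIMD batching awkward without rearchitecting.

```rust
// Scalar unsigned LEB128: 7 bits of payload per byte, high bit set on all
// but the last byte. Note the data dependency: each output byte depends on
// the running shift of a single input value.
fn write_u32_leb128(out: &mut Vec<u8>, mut value: u32) {
    loop {
        if value < 0x80 {
            out.push(value as u8); // last byte: high bit clear
            return;
        }
        out.push((value as u8 & 0x7f) | 0x80); // continuation byte
        value >>= 7;
    }
}

/// Returns the decoded value and the number of bytes consumed.
fn read_u32_leb128(bytes: &[u8]) -> (u32, usize) {
    let (mut result, mut shift, mut pos) = (0u32, 0u32, 0usize);
    loop {
        let byte = bytes[pos];
        pos += 1;
        result |= u32::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return (result, pos);
        }
        shift += 7;
    }
}

fn main() {
    let mut buf = Vec::new();
    write_u32_leb128(&mut buf, 624_485);
    assert_eq!(buf, [0xe5, 0x8e, 0x26]); // the classic DWARF example
    assert_eq!(read_u32_leb128(&buf), (624_485, 3));
}
```

A SIMD variant would have to decode several continuation-bit runs at once (e.g. via a movemask over the high bits), which is why the encoder's one-value-at-a-time call sites are the real obstacle.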
#### 3b. rustc - general
- there are obviously allocations on the happy path that could be avoided (e.g. the most recent example from my own memory: reordering generic parameters), and similar thinking could be applied to avoid gathering diagnostic data when there are ultimately no errors that would make use of it (kind of size-open-ended, as this entails many small tasks). (But those areas are hard to find, and tooling will not easily point at areas of interest)
- look into reducing memory usage by dropping pieces of context earlier (from recent perf tests on the intervals PR, I was surprised at some state being retained in borrowck until codegen and the final context drop). Also look into doing the drops on another thread (if that's not done already, I think it's not) e.g. the "drop AST" tasks. (Although there can be an impact on allocators tailored for highly parallel applications, e.g. it could drain tcmalloc's size class unexpectedly). Generally size-s, size-m. (Can we use the multi-process trick where an allocating process takes the hit of memory deallocation while the initiating process handles the results and can quit earlier?)
- how well does the query system work for both its memoization and incremental compilation use-cases? do they impose significant costs on one another? (e.g. would they individually work better if separated? say, hash stability looks like a concern for cross-session serialization that could impact general use and memoization, when needing to limit the existing structures' capabilities à la removing some `Ord` impls recently)
- related (although long, complicated, and unlikely to be worthwhile): can salsa be evaluated, on the subsets it shares with the query system?
- are there ideas from the polonius work on location insensitivity, separating and avoiding work for loans and subset errors, that could be applied to NLLs?
- Obligations and ObligationForest (although some of this has been worked on, and has been optimized extensively already since it's so hot in profiles):
* a bunch of in-progress work was never (to my knowledge) completed (in rustc and possibly in ena as well), and since this is quite regularly hot, maybe completing it could help here
* obligations are stored in a Vec but require deduplication, and this has some performance impact (and in some cases a big impact on compile times, cf. recent issues). Improvements here could mean using a dedicated set (or understanding the source of these duplicate obligations: the deduping is a workaround, and may be avoidable)
- some parts that are expected to be feature-gated impact the rest of the code even when not active (hard to find in general; maybe sifting through old perf.rlo benchmark results could help locate some of these, but in general tools cannot easily help here). A couple of examples are/were known, and there may be others:
* the closure capture RFC 2229 implementation (it's now enabled by default on edition 2021, and therefore not a concern per se)
* some checks in negative impls appear to impact regular code (because coherence can already be hot at scale), although some work has already been done to mitigate them
- allocators (generally size-s):
* revisit jemalloc for sized deallocations (there has been some activity in this area and it seems like they're an actual loss, which the authors didn't expect). Open issues there so that they can be fixed if possible. Check jemalloc stats for the benchmarked cases. Possibly move to the 5.3 release, which should be out soon.
* if jemalloc's fast-path sized deallocations aren't actually a win (while we expected to be using its slow path), then test alternatives: mimalloc seems to have faster non-sized deallocs (and using it was generally a win in rustc benchmarks at the time, compared to older versions of jemalloc before switching to tikv's most recent rev), or tcmalloc (I don't remember it being tested recently, possibly ever). (mimalloc could also be nice to evaluate on windows, as the rustc allocator for the win32 targets)
* mimalloc: in my limited tests (mostly incremental + debug profiles on the shorter perf.rlo benchmarks), it seemed like a win of 1-5% (but with big regressions in max-rss, supposedly fixed in later beta releases of mimalloc, although those also regressed performance). And apparently since I've written this, someone else has also started evaluating it.
* jemalloc seems to have windows CI and testing under windows: is the situation similar to years ago, when using it on windows was a pessimization, or have these regressions since been fixed?
* snmalloc (a couple of years ago its performance was close to mimalloc's, and some inspiration happened between the two) (but it uses sized deallocations IIRC): snmalloc v1's regular "unsized" deallocations are in my tests slower than jemalloc's (or mimalloc's), but there were up to 4% improvements with sdallocx, with no big issue (apart from a const eval test failing unexpectedly for OOM, which needs investigating: it looks like an allocator panic/abort rather than returning an error that the allocation failed). Less lock/mutex use shows up in the profile compared to jemalloc, and supposedly multi-threaded performance and cross-thread deallocations are a key feature (which ties up nicely with the other "Drop XXX" items in this list). snmalloc v2 is being worked on with more improvements. Promising in any case. (Although I'm not quite 100% confident I managed to override LLVM's allocator at the same time as rustc's)
* amanieu mentioned rpmalloc as another possibility
- evaluate MIR optimizations and their potential:
* what's the maximum performance LLVM could achieve if rustc generated perfect IR (size-s?). Also: the MIR inliner looked promising in recent tests; it would be good to better analyze and understand where it offers gains (intuitively, it should be because inlining allows rustc to optimize the MIR further, greatly reducing the amount of IR given to LLVM)
* monomorphizations are likely the area with the biggest potential for improvement and impact (and the source of many wins in the past): targeted optimizations there (or manual hints?) to limit the size and number of instantiations will help. Polymorphization is intuitively of interest here, and evaluating it globally, finding possible issues and expanding its scope is expected to be worthwhile. Focusing existing MIR optimizations on the most heavily instantiated generic functions could be as well. On the opposite side of the spectrum, there could be an opt-in to trading runtime speed for compilation speed, e.g. by switching from generic function monomorphization to dynamic dispatch.
* iterators seem to generate a ton of IR that LLVM can take a while to handle; are there other such patterns that codegen emits?
* how common is drop-glue bloat due to inlining a bunch of code in leaf crates (IIRC alex crichton had opened an issue about this, encountered in one of the stdlib types)? Manually outlining these would help here (but maybe the LLVM IR outliner & similarity-analyzer passes are worth trying as well?)
* (or: pcwalton's local-GVN tests, although incomplete, seemed interesting as well; maybe look into that.)
* there must still be things that are hashed multiple times à la hashes-of-hashes (as has happened before); check this
* quantify the impact of incremental hash verifications (+ current non-determinism): can it be graphed and tracked on perf.rlo, and most importantly can it be reduced and removed altogether (this is hard and would require completing the work to limit access to data that is not stable across sessions, it's already in-progress and helping there could be interesting)
* can we use e.g. LCGs + Unhash on things to reduce the number of pieces of data hashed very often?
* check if we can actually make some changes in fxhash (unlikely to be super worthwhile): some WIP PRs exist, ahash's creator could be starting a low quality hash, and maybe some ILP improvements could help throughput without increasing collisions (or vectorizing the hashing of bigger non-primitive types even more). A lot of work has already been put into this, so easy pickings are unlikely, as fxhash is frustratingly effective in our benchmarks.
* (.. or in the hashing of the incremental compilation cache? where cryptographic properties are more important than runtime hashing speed per se)
* why is there MD5 hashing in incremental, and why are we sometimes hashing full source code with fxhash (check that it's indeed fast enough and out of the critical path, as we think it is) instead of using some other unique interned value for that piece of code?
* re-analyze the sizes and distributions of values for hashes and keys: lots of small values, lots of empty tables in the end, and see if we can special-case/optimize the most common cases
- random codegen and simd:
* technically rustc is a rust program, and improving codegen improves rust compile times (of course, this is indirect and hard to test in the absence of runtime benchmarks à la lolbench). Some work could be interesting there; a lot of tests have been done in the past, including some apparently showing that the niche layout optimization (on Option-like enums) can be a pessimization in some cases, because of bad codegen of match/discriminant checking (it's not clear whether fixing that would be an interesting compiler-throughput improvement, and there is some in-progress work towards fixing it. It does block more niche optimization work from landing though.)
* some of the derives lower to AST in interesting ways, e.g. PartialEq/Eq on enums extract the discriminant and then match on the values, which also lowers to extracting the discriminant; check whether that AST lowering has an impact downstream, from MIR construction all the way to codegen?
* the built-in derive macros compile slowly compared to a manual implementation of the traits (e.g. rylev's benchmarks), and can compile slowly in general. Can their cost be evaluated on all of crates.io, to see how much using MIR shims would help and defer costs until they're actually used? (Maybe recognizable dummy implementations could also reduce typechecking and other steps, if there's a way to take advantage of their final known structure, or of similarity between different instantiations, to avoid early/middle work?) (Maybe here as well, look at whether polymorphization could help improve the scaling behaviour when there are a lot of derives)
* are there hot loops with bounds checks in them? they are often hurtful in performance-sensitive contexts. (at a recent LLVM dev meeting there was mention of a recent pass trying to use dominators to help track redundant checks, which could be interesting to try)
* use SIMD in libcore/libstd/rustc e.g. for rust_data_structures; the intervalset could use it for compression, searches and intersections, and there are a bunch of Lemire's work that could provide inspiration. Maybe collecting use-cases could help provide incentive to improve/fix the situation.
* changes to libstd, e.g. sorting with SIMD acceleration (à la VxSort) and/or Lomuto/Hoare partitioning (unfortunately, stjepang has deleted all his github repositories, including the pattern-defeating quicksort benchmarks he did when implementing it into the unstable sorting functions in the stdlib)
* check usage of the crc32fast crate, and whether rustc is indeed able to take advantage of SSE4.2 at runtime
* check real-world impact of aborting when a panic occurs during drop, it looked promising on syn in amanieu's WIP PR
- data structures (more exploratory with size-open-ended):
* try using intervals/interval set in more areas of the compiler: e.g. in borrowck, compute reachability at the same time as the SCCs à la Nuutila, or where dataflow is super sparse, or where some metadata is super dense (CTFE allocations)
* IntervalSet: check if datafrog's Leapers optimizations can apply to bulk inserting ranges, and caching binary search insertion points when inserting multiple (sorted) intervals. (Or if van Emde Boas and Eytzinger layouts from e.g. Paul Khuong's papers on binary search layouts could be useful here). (Also: SIMD).
* can some point-queries be turned into data-oriented range-queries (and could specialized structures help here, e.g. implicit in-order forests, various segment/fenwick/etc. trees)?
* (are there heavily used sorted structures where tries, adaptive radix trees, etc. could help?)
* (check cranelift's bforests properties?)
* can probabilistic data structures à la bloom/binary fuse filters speed up the fast type rejection in coherence in cases where there cannot be any overlapping impls (or help lower the cost of seemingly quadratic overlap checks)?
* in general, there's often (bigger) objects with disparate data, partially read in different contexts. Splitting some of these entities in hot/cold sets, making them smaller and densely packed together in memory, in addition to following more data-oriented design principles, should help. (this conjecture would need to be validated in benchmarks but it's so common that it is a likely source of cache misses in all hot pieces of code)
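A minimal sketch of the hot/cold split conjectured in the last point (the type and field names are made up, not rustc's actual structures): the same items laid out as one big record versus split into a dense hot array plus a cold side table.

```rust
// Before: one fat record per item. A pass that only reads `flags` still
// strides over the cold fields, dragging them through the cache.
#[allow(dead_code)]
struct ItemFat {
    flags: u8,             // hot: read in every pass
    name: String,          // cold: only read when emitting diagnostics
    span_snippet: String,  // cold
    docs: Option<String>,  // cold
}

// After: the hot scan walks a dense `Vec<u8>`, touching far fewer cache
// lines; cold data lives apart, at the same index.
struct Items {
    flags: Vec<u8>,                              // hot, indexed by item id
    cold: Vec<(String, String, Option<String>)>, // same index, rarely read
}

impl Items {
    // The hot-path scan: only ever touches the packed flags array.
    fn count_flagged(&self, mask: u8) -> usize {
        self.flags.iter().filter(|&&f| f & mask != 0).count()
    }
}

fn main() {
    let items = Items {
        flags: vec![0b01, 0b10, 0b11, 0b00],
        cold: Vec::new(), // left empty in this sketch
    };
    assert_eq!(items.count_flagged(0b01), 2); // items 0 and 2
}
```

The behavior is identical either way; the win, if any, would show up as fewer cache misses in the hot scans, which is why the conjecture needs validating in benchmarks.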
#### 3c. rustc - end of the pipeline
- split debuginfo can be a big time saver in debug builds for big crates; evaluate it (do we need to check and track the performance of thorin?) to have data for stabilization and inclusion in "fast build" profiles. (Could it also help when building rustc itself with debuginfo?) Look for crates with big debuginfo where linking takes a long time, and see how it helps there. (theme-pipelined-compilation)
- try to evaluate and see how to evolve the linker situation, both in detangling rustc from it, and in checking numbers and issues on LLD, and on mold in non-LTO cases (e.g. test it out on all of crates.io). (Also: split dwarf helps reduce the work done by the linker in debug builds.) (theme-pipelined-compilation)
- codegen: can some parallelism be moved up into codegen_ssa (although that could create tricky requirements on the backends, even if cg_clif doesn't really use cg_ssa that much IIRC) or into CGU partitioning?
- LLVM (there were ideas in theme-raw-compilation-speed to track perf on CI with LLVM master, but the breaking changes there would make that hard, although for this cycle it seems we've tracked these way earlier than we used to):
* LLD master is supposedly faster than 13.0.1, so that could be interesting to verify in the LLVM 14 upgrade
* IIRC clang 14 has a recent new LICM optimization hoisting some loads; check whether that's in clang or LLVM, and whether it could be worth testing with the LLVM 14 upgrade as well
* (other fixes to optimizations causing actual issues in rust (e.g. creating strings from constant data) will be present in LLVM 14 as well)
* while we wouldn't expect to be making perf improvements to LLVM itself (besides WG-llvm members, that is), the opposite could be done: finding actual issues, be they related to the NewPM, or the catastrophic inlining issues that still exist, etc.
- cg_clif & cranelift backend: (can range from size-s to size-m, to unfeasible)
* it seems the work turning the backend into a rustup component, making it available to test for more people, is stalled on some windows tests or work
* cranelift also lacks features present in gcc and llvm, like inline assembly, and possibly some unwinding work as well IIRC: how should graceful errors or degradation work here, so that it can land and be iterated upon?
* update the results of the rust benchmark suite for the backend (it'll be good to check whether all old and new benchmarks look similar to the most recently published numbers)
* test it out on all of crates.io, to see if the couple of missing features are in fact seldom depended on (although inline assembly support recently being stabilized will tend to increase that possible pain point)
* check if the lazy JIT code and plans are still good, or what remains to be done and tested here
* check the existing backend's parallelism (in theory cranelift was designed to allow one function to be compiled per thread)
* cranelift is in the middle of an intertwined transition to ISLE and regalloc2, which should improve performance and unlock other opportunities. Helping there could improve the cg_clif backend sooner rather than later.
- PGO (generally size-s; but not generally easy to test?):
* the current crates used for PGO are a subset of the benchmarked crates, with apparently good results on the other benchmarks. But what is the effect on crates.io?
* can this subset be improved for both perf.rlo and crates.io? are there possibly other, more interesting crates, exercising code paths not currently profiled here?
* can we enable it on more targets than only x64 linux?
* BOLT: evaluate the effect it could have on rustc (although it could require intel-only LBR?) and LLVM
* recently published research looked into machine-learned PGO without a profiling step, replacing LLVM's BPI heuristics with offline-trained model results. It could be interesting to try out, e.g. to speed up bootstrap and our CI.
### 4. Cargo and build systems (mostly theme-pipelined-compilation)
- people don't usually use rustc directly, so see if there are important cargo codepaths that could be sped up / parallelized
- look into tuning cargo's build plan / timings to see if the parallelism level can be improved (are there interesting algorithms in taskflow, or elsewhere in the DAG-scheduling literature, that could be of interest here?). There are a few cases where pipelines can be stalled by long tasks, delaying successors; if that were known to be the case, those tasks could be scheduled earlier in the pipeline. Providing scheduling hints, or using previous builds' timings to prioritize how cargo schedules units of work, could offer that (e.g. prioritizing units of work on the critical path). This and alternative scheduling could be evaluated on all of crates.io. There are already issues about this: https://github.com/rust-lang/cargo/issues/7396, https://github.com/rust-lang/cargo/issues/5125, or https://github.com/rust-lang/cargo/issues/7437
- could we map cargo's build graph to a causal profile (Coz) and evaluate virtual speedups per crate over the complete schedule?
- check if cargo's named profiles have all the settings needed for fast build configurations (IIRC they mostly do) to be able to use the cg_clif backend, share_generics, LLD/mold, etc. (maybe setting the linker via cargo still requires rustflags?)
- evaluate possible faster cargo configs on all of crates.io
- evaluate the -Z binary-dep-depinfo flag (and whether it's currently set by tools and the perf collector and so on; I think it is) and whether we can use it by default (there's a compiler-team MCP about it IIRC) to help track transitive dependencies (and unused ones?), e.g. with a better build plan from the point above
- surely a ton more here (including some tasks about the theme-distributed-builds)
- (switching allocators, if cargo doesn't already use e.g. jemalloc, wouldn't have a lot of benefits?)
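The critical-path prioritization idea above can be sketched as follows (a hypothetical representation; cargo's real build plan is of course not a pair of string-keyed maps): compute each unit's "bottom level" -- its own duration plus the longest chain of dependents after it -- and have the scheduler pick ready units by decreasing bottom level.

```rust
use std::collections::HashMap;

/// Bottom level of each unit: its own build time plus the longest chain of
/// dependents after it. Assumes an acyclic graph (a valid build plan).
fn bottom_levels<'a>(
    duration: &HashMap<&'a str, f64>,
    dependents: &HashMap<&'a str, Vec<&'a str>>,
) -> HashMap<&'a str, f64> {
    fn level<'a>(
        unit: &'a str,
        duration: &HashMap<&'a str, f64>,
        dependents: &HashMap<&'a str, Vec<&'a str>>,
        memo: &mut HashMap<&'a str, f64>,
    ) -> f64 {
        if let Some(&v) = memo.get(unit) {
            return v;
        }
        // Longest remaining chain among this unit's dependents (0 for leaves).
        let longest_tail = dependents
            .get(unit)
            .into_iter()
            .flatten()
            .map(|&d| level(d, duration, dependents, memo))
            .fold(0.0, f64::max);
        let v = duration[unit] + longest_tail;
        memo.insert(unit, v);
        v
    }
    let mut memo = HashMap::new();
    for &unit in duration.keys() {
        level(unit, duration, dependents, &mut memo);
    }
    memo
}

fn main() {
    // A tiny graph: syn -> serde_derive -> bin, and small -> bin.
    let duration =
        HashMap::from([("syn", 10.0), ("serde_derive", 5.0), ("small", 1.0), ("bin", 1.0)]);
    let dependents = HashMap::from([
        ("syn", vec!["serde_derive"]),
        ("serde_derive", vec!["bin"]),
        ("small", vec!["bin"]),
    ]);
    let levels = bottom_levels(&duration, &dependents);
    // syn dominates the critical path (10 + 5 + 1 = 16), so start it first
    // even if `small` happens to be ready at the same time.
    assert_eq!(levels["syn"], 16.0);
    assert!(levels["syn"] > levels["small"]);
}
```

Using durations observed in previous builds as the `duration` inputs is one way the "scheduling hints from past timings" idea could be fed into such a heuristic.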
### 5. Tooling (mostly indirect tasks, in scope 2-4)
- check WIP PRs that are still not merged in sccache, and things to do to improve caching there or in general (related to linking), between projects, or with pre-built artifacts (possibly on crates.io, but that is hard) (theme-distributed-builds)
- sometimes things compile so slowly that we have to kill the process, but measureme and the self-profiler don't support that. It would be nice to support it, to at least get a bit of information in those cases (when not recording event keys)
- can we have automation tooling to help bisect compile-time issues?
* there have been (and there likely still are) a bunch of unseen regressions in both rustc and LLVM (projection caching, nested decorator types, NewPM and inlining, lack of vectorization on AVX2 and vectorization regressions). Can they be found, tracked and fixed?
* track the benchmarks that are most affected by some given queries and their variance/history (this instability can sometimes cause confusion about results, in particular recently this has happened on process_obligations).
* land measureme/rustc/perf.rlo infrastructure to track performance counters in the self-profilers instead of just wall time
* record and display hits and misses stats of the query system
* record and display CPU cache misses and branch mispredictions in the perf collector ?
* record and display the amount of IO done, to track and correlate size changes on metadata and incremental compilation with the amount of data read, and changes in performance (esp. on spinning disks)
* record and display syscalls
* better display the benchmarks broken, and fixed, by PRs, rather than just globally en passant on the status page
* make artifact sizes more prominent; track and graph them over time?
* add a toggle to filter out the rustdoc benchmarks on the compare UI
* per-benchmark and per-query graphs on details page, stats (e.g. change point analysis à la Laurie Tratt), more iterations, record more data as byproduct of benchmarking (e.g. cargo timings)
* could better tracking of variance ranges, and change-point analysis, have helped avoid the hashing issue #1126?
* (why is the codegen-schedule not working?)
- perf bot:
* ~~display the profile as well as the benchmark name. Without that, it can be confusing whether benchmarks in the summary are rustdoc benches~~ My PR doing that landed already.
* ~~ability to launch a perf run on a subset of benchmarks (e.g. only for rustdoc)~~ neither guillaume nor I knew that: this already works
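A toy illustration of the change-point analysis mentioned above (a least-squares sketch, not Tratt's actual algorithm): pick the split of a benchmark's wall-time series that minimizes the summed squared deviation of the two halves, i.e. the index where the mean most plausibly shifted.

```rust
/// Single change-point detector: returns the index i (1 <= i < len) that
/// best splits the series into two constant-mean segments. Returns 0 for
/// series too short to split.
fn change_point(series: &[f64]) -> usize {
    // Sum of squared deviations from the segment mean.
    let sse = |xs: &[f64]| -> f64 {
        let mean = xs.iter().sum::<f64>() / xs.len() as f64;
        xs.iter().map(|x| (x - mean).powi(2)).sum()
    };
    (1..series.len())
        .min_by(|&a, &b| {
            let cost = |i: usize| sse(&series[..i]) + sse(&series[i..]);
            cost(a).partial_cmp(&cost(b)).unwrap()
        })
        .unwrap_or(0)
}

fn main() {
    // A benchmark that regressed between run 3 and run 4.
    let wall_times = [10.1, 10.0, 10.2, 10.1, 12.9, 13.0, 13.1];
    assert_eq!(change_point(&wall_times), 4);
}
```

Ranking PRs by how sharply they shift a benchmark's segment means, rather than by a single before/after delta, is the kind of signal that could have flagged a slow-burn regression earlier.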
### 6. Misc (exploratory, size-open-ended)
- MIR-only rlibs for crates.io (faster to download?), and check other recent use-cases where they've been mentioned, even though they didn't seem to be a win a few years ago
- remove unused crates (and duplicates) from rustc and try to find sources of bloat (generally size-s and size-open-ended, and I've already opened a few PRs for that)
- with such a long history of changes and refactors in the crate structure, there's likely project-wide dead code in rustc. An on-demand lint tracking the overall uses of items could be interesting (and incidentally speed up compile times)