Compiler performance roadmap for 2022

Update: With 2022 having passed, this roadmap is now a historical document. It was a useful tool and guided a lot of performance work in 2022. Good progress was made on most top-level items, except for "Better UX for perf evaluation" where not much progress was made. Future performance work will be directed elsewhere.

There are many ideas on how to reduce Rust compile times and avoid regressions. It can be hard to know in advance which ones are likely to work. This document outlines a relatively small number of tractable tasks with a high chance of success that we hope to complete in 2022. It is partly informed by the analysis of lqd's large-scale profiling exercise.

It is best to think of this document as a rough guide, rather than a strict prescription. There are ideas not covered by this roadmap that could help compiler perf, including large projects like the parallel compiler. This document's existence does not preclude people working on those ideas.

As well as a plan, this document will serve as a means of tracking who is doing/has done what work. Task assignments are shown in square brackets, e.g. "[name]". Task completion is indicated like so:

  • incomplete
  • complete

Faster single crate compilation

Tasks that will speed up compilation of individual crates.

  • Fully optimize rustc on all tier-1 platforms. There are various outside-the-compiler optimizations only applied to x86_64-unknown-linux-gnu. We should bring these optimizations to other popular tier-1 platforms like Mac and Windows. In some cases this may be as easy as using the relevant scripts on the appropriate CI builders. This could give 15-30% wins for relatively little effort!

    • PGO on Windows [lqd + Kobzol, done in #96978, lots of 10-20% wins]
    • PGO on Mac [CI configuration/capacity is the limiting factor here]
    • Build LLVM with ThinLTO on Windows.
    • Build LLVM with ThinLTO on Mac.
    • Use BOLT on x86-64 Linux (the only Rust tier 1 platform supported by BOLT). Expected 5-10% improvement atop the existing set of PGO/ThinLTO. [Kobzol, #94381 did it for the LLVM backend, giving 3.5% bootstrap improvement and 3.6% average max-rss improvement.]
    • A better allocator on Windows than the system allocator.

    Issue #103595 is tracking PGO/LTO/BOLT/allocators more comprehensively across all the most popular platforms.

  • Hot code. Try to optimize the hottest code, as identified by Cachegrind/DHAT/etc in lqd's data set.

    • parse_tt and other functions related to macro parsing are easily the hottest, and do many allocations, and are likely worth significant effort. [nnethercote, this blog post has details]
    • Some crates trigger uses of huge BitSets, e.g. http-0.2.6. This causes high memory usage, lots of memcpying, etc. [nnethercote, #93984]
    • There is a long tail of opportunities that affect a few crates. It will be worth looking at each of them briefly; there are probably a few easy wins. See the analysis document for details. [nnethercote + others]
    • Metadata decoding takes roughly constant time for crates like std and core. For tiny crates this is a high fraction of compilation time, and may be worth some effort reducing. It's also something that is repeated for every crate in a multi-crate project. [martingms, #95981, helped a bit] [nnethercote, #95981, helped a lot]
  • FxHasher improvements. #93651 refers to a promising change to rustc-hash. The performance benefit should be reconfirmed and if it persists, the improvement merged and imported into rustc. [lqd, #96863; results seemed good on AMD, but indifferent on Intel. We decided to give up on this because FxHasher is a tar pit and lots of time has been previously wasted on attempts to improve it.]

  • Linker improvements.

    • lld is much faster than the default linkers, but isn't used by default. Can it be? [lqd, MCP 510 is the first step, which links to other PRs.]
    • mold is a new linker that is reputed to be even faster than lld. We should investigate how well it works with Rust, and whether it can be made easier to use, including documentation. Given that it is new and not yet multi-platform, this would be for opt-in use. [nnethercote successfully tried -Clink-arg=-fuse-ld=mold on a small program on Linux, found it to be faster than lld, and updated the perf-book accordingly. lqd's work on lld will also partly pave the way for mold in the future.]

Faster project compilation

Tasks that will speed up the compilation of multi-crate projects.

  • Improve cargo scheduling. The choice of which crates to compile first can greatly affect overall time. Use better heuristics to improve scheduling, e.g. based on crate size, or past records of compile times. Support for crate priority (a measure of the transitive number of dependencies) in the wait queue has landed in #11032 (and there are more benchmark results here): when multiple crates are waiting to be built, e.g. when there are a lot of units of work or a low core count, the crate with the highest priority will be picked next. [lqd. More work can be done here, as described in the description; an experiment where these priorities are computed differently, with popular proc-macros preferred, can also be found here. Similarly, we could choose different defaults targeting compilation speed, like disabling debuginfo for build dependencies in #10493. These topics can be subtle and require interaction with the cargo team, which has little time at the moment. Both directions have potential and measured improvements, but now may not be the best time to pursue them.]

  • Improve syn/proc-macro2/quote. These crates incur a lot of compilation costs, being among the most popular crates and quite slow to compile. Also, they block compilation of proc macro crates that are themselves often slow to compile. This leads to long and slow dependency chains like: proc-macro2, quote, syn, serde_derive, serde, serde_json. (And that omits the build scripts.) Can they be improved somehow?

  • Investigate build script costs. Build scripts are typically tiny but take a surprisingly long time to compile. What is going on, and can they be improved? Alternatively, can we provide features (e.g. in Cargo) to avoid the need for build scripts that just do simple things like setting conditional compilation flags? [lqd, in-progress. Note: build scripts are a cargo concept, so any changes there may need time from t-cargo and the rustup wg, which is currently limited as mentioned above]
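
The crate-priority idea from the "Improve cargo scheduling" item can be sketched as follows. This is a minimal illustration, not cargo's actual implementation: it assumes "priority" means the number of crates that transitively depend on a crate, and the names `Graph`, `priorities`, `pick_next`, and `example_graph` are hypothetical. The example graph is a simplified version of the proc-macro2 to serde_json chain mentioned above, plus an unrelated leaf crate.

```rust
use std::cmp::Reverse;
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical dependency graph: each crate maps to the crates it depends on.
type Graph = BTreeMap<&'static str, Vec<&'static str>>;

/// Priority of a crate = number of crates that transitively depend on it
/// (an assumed reading of "priority"; not cargo's actual code).
fn priorities(graph: &Graph) -> BTreeMap<&'static str, usize> {
    // Invert the edges: crate -> crates that directly depend on it.
    let mut dependents: BTreeMap<&'static str, Vec<&'static str>> = BTreeMap::new();
    for (&krate, deps) in graph {
        dependents.entry(krate).or_default();
        for &dep in deps {
            dependents.entry(dep).or_default().push(krate);
        }
    }

    // Walk the inverted edges from each crate and count everything reachable.
    let mut result = BTreeMap::new();
    for &krate in dependents.keys() {
        let mut seen: BTreeSet<&'static str> = BTreeSet::new();
        let mut stack = vec![krate];
        while let Some(current) = stack.pop() {
            for &dependent in &dependents[current] {
                if seen.insert(dependent) {
                    stack.push(dependent);
                }
            }
        }
        result.insert(krate, seen.len());
    }
    result
}

/// Picks the ready crate with the highest priority; ties go to the
/// alphabetically smallest name so the choice is deterministic.
fn pick_next(
    ready: &[&'static str],
    priority: &BTreeMap<&'static str, usize>,
) -> Option<&'static str> {
    ready
        .iter()
        .copied()
        .max_by_key(|krate| (priority.get(*krate).copied().unwrap_or(0), Reverse(*krate)))
}

/// Simplified, illustrative graph; real crates have more dependencies.
fn example_graph() -> Graph {
    BTreeMap::from([
        ("proc-macro2", vec![]),
        ("quote", vec!["proc-macro2"]),
        ("syn", vec!["proc-macro2", "quote"]),
        ("serde_derive", vec!["proc-macro2", "quote", "syn"]),
        ("serde", vec!["serde_derive"]),
        ("serde_json", vec!["serde"]),
        ("log", vec![]),
    ])
}

fn main() {
    let priority = priorities(&example_graph());
    // Both crates are ready to build, but only one job slot is free.
    let ready = ["log", "proc-macro2"];
    // proc-macro2 unblocks five other crates, log unblocks none.
    println!("{:?}", pick_next(&ready, &priority));
}
```

With a single free job slot, the sketch builds proc-macro2 before log, which gets the long dependency chain started as early as possible. Other heuristics mentioned above, such as crate size or recorded compile times, could be folded into the sort key in the same way.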

Better benchmarks

rustc-perf is the primary benchmark suite used for gauging Rust compilation speed, in particular for detecting improvements and regressions. It can be improved in several ways.

  • Split into "primary" and "secondary" benchmarks. The suite contains a lot of "real-world" crates, like cargo, diesel, ripgrep, and syn. It also contains a number of "synthetic" stress tests and microbenchmarks, either extracted from real-world code, or simply constructed out of thin air. When looking at the performance effects of a change, changes to the speed of real-world tests are more important than changes to the speed of synthetic tests, but the latter often crowd out the former.

    It should be simple to split the suite into "primary" and "secondary" benchmarks, and present the primary results on all pages before the secondary results. The division would be roughly "real-world" vs. "synthetic", but the terms "primary" and "secondary" allow for some human judgment. E.g. the helloworld benchmark isn't exactly "real-world" but it's a useful indication of the minimum time taken for any program, and might be considered a "primary" benchmark despite its simplicity.

    [Kobzol, #1181]

  • Update the real-world benchmarks. The versions in the suite are mostly quite old, e.g. 3 or 4 years old. For example, the syn crate version in the suite is 0.11.11 from April 2017, but the latest version of that crate at the time of writing is 1.0.86.

    We should update the real-world benchmarks to their latest versions to ensure we are benchmarking widely-used code. The previous upgrade of hyper (and its renaming within the suite as hyper-2) provides precedent for this, though perhaps we should switch to using the crate version number, e.g. syn-1.0.86, which would be visually noisier but have a clearer meaning.

    [nnethercote, rylev, lqd, Kobzol; tracked here]

  • Add new benchmarks. lqd's data set identifies the following.

    • Widely-used crates that are not in the benchmark suite, such as libc, quote, proc-macro2, cfg-if, and log.
    • Hot macro parsing functions (like parse_tt), used in numerous crates, that the existing benchmarks hardly use.
    • Crates that stress the compiler in interesting ways, such as nalgebra.

    We should consider adding new benchmarks to represent these. We should also consider adding one or more build scripts (as leaf crates, if possible) because these are common but not represented.

    [nnethercote, tracked here]

  • Remove or combine low-value benchmarks. So that the suite doesn't grow in size monotonically, we should remove or combine benchmarks that are uninteresting. Highly synthetic benchmarks should be candidates for this; it is generally better to have a real-world benchmark that stresses a particular aspect of the compiler. For example, tuple-stress contains a huge literal extracted from a real program, whereas deep-vector contains a huge vec! literal with thousands of zeroes. And we don't need three different benchmarks testing "deeply nested" behaviour.

    [nnethercote, tracked here]

  • Establish benchmark add/remove/update policies. Previous additions, removals, and updates have been done on an ad hoc basis. It would be good to establish some basic policies around this. For example, should we update real-world crates every 1 year? Every 2 years?

    [nnethercote, #1318]

  • Consider multi-crate benchmarks. The suite mostly measures intra-crate compilation speed, specifically that of the final crate compiled in a package. (There is also one multi-crate benchmark, measuring the time taken to compile the compiler itself. This is a moving target because it always compiles the latest version of the compiler.) Project-level improvements, such as pipelining, do not show up much in the suite results.

    We should consider whether to add new benchmark types to capture the multi-crate improvements/regressions. Adding cargo --timings to the profiling tools would also be useful.

Better UX for perf evaluation

Identify and implement features and/or process improvements to improve drive-by performance evaluation of pull requests, both in likelihood of occurrence and effective decision making.

  • Define a set of rules for go/no-go decisions. Enable clear-cut cases to land without manual review by the perf team (as signaled by rust-timer comments), and identify cases that should be referred to the performance team for evaluation. [All comments now have a clear "ACTION NEEDED"/"no action needed" marker, and the perf team is now CC'd on all "ACTION NEEDED" ones and often triages/comments quickly. Not quite what this item was asking for, but both changes make life easier for rustc devs who don't intimately understand the perf suite.]

  • Survey compiler team on impediments to result triage. Enable triage of results and/or local investigation into them. Depending on results of discussions, may involve cutting down time to perf.r-l.o results (try + perf run) or providing more granular per-benchmark results pages (e.g., cachegrind diffs). [Known problems: regressions in rollups; noisy results]

  • Improve memory usage tracking. Current max-rss statistics are sufficiently noisy that the significance threshold is often high, which prevents small improvements from registering. In addition, memory usage outside the peak is also important but is not tracked at all. This is of particular importance as core counts in machines greatly increase in the next 2-3 years, while available memory remains largely flat, meaning that memory usage in rustc must go down if we are to spread across all cores.

  • Expose more/all metrics into decision-making. Currently, the only metric we use is instructions:u for benchmarks, excluding all other metrics. We also don't include bootstrap data in summary reports. The other metrics are then essentially not considered when evaluating performance changes. [RSS and cycles are now put in the GitHub comments, but more can be done here]
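
The point in the "Improve memory usage tracking" item, that a noisy metric needs a high significance threshold and so hides small improvements, can be illustrated with a small sketch. The formula here (a fixed factor times the mean absolute run-to-run change) is an assumption chosen for simplicity and is not the calculation rustc-perf actually uses. The function names and the noise figures are made up for illustration.

```rust
/// Mean of the absolute values; 0.0 for an empty slice.
fn mean_abs(values: &[f64]) -> f64 {
    if values.is_empty() {
        return 0.0;
    }
    values.iter().map(|v| v.abs()).sum::<f64>() / values.len() as f64
}

/// Hypothetical threshold: `factor` times the typical size of run-to-run
/// changes (in percent) seen when the compiler did not meaningfully change.
fn significance_threshold(noise_history: &[f64], factor: f64) -> f64 {
    factor * mean_abs(noise_history)
}

/// A change (in percent) counts as significant only if it exceeds the threshold.
fn is_significant(change: f64, threshold: f64) -> bool {
    change.abs() > threshold
}

fn main() {
    // Made-up percentage changes between runs of an unchanged compiler.
    let instructions_noise = [0.1, -0.2, 0.1, -0.1]; // a quiet metric
    let max_rss_noise = [1.5, -2.0, 2.5, -1.0]; // a noisy metric

    let improvement = -1.0; // a genuine 1% reduction

    let quiet = significance_threshold(&instructions_noise, 3.0);
    let noisy = significance_threshold(&max_rss_noise, 3.0);

    println!("quiet metric: significant = {}", is_significant(improvement, quiet));
    println!("noisy metric: significant = {}", is_significant(improvement, noisy));
}
```

Under these assumed numbers, the same 1% improvement clears the threshold for the quiet metric but not for the noisy one, which is why reducing max-rss noise matters for recognizing small memory wins.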
