# Compiler performance roadmap for 2022
There are many ideas on how to reduce Rust compile times and avoid regressions. It can be hard to know in advance which ones are likely to work. This document outlines a relatively small number of tractable tasks with a high chance of success that we hope to complete in 2022. It is partly informed by the [analysis](https://hackmd.io/mxdn4U58Su-UQXwzOHpHag?view) of lqd's [large-scale profiling exercise](https://github.com/lqd/rustc-benchmarking-data).
It is best to think of this document as a rough guide, rather than a strict prescription. There are ideas not covered by this roadmap that could help compiler perf, including large projects like the parallel compiler. This document's existence does not preclude people working on those ideas.
As well as a plan, this document will serve as a means of tracking who is doing/has done what work.
Task assignments are shown in square brackets, e.g. "\[name\]". Task completion is indicated like so:
- [ ] incomplete
- [x] complete
## Faster single crate compilation
Tasks that will speed up compilation of individual crates.
- [ ] **Fully optimize `rustc` on all tier-1 platforms**. Various outside-the-compiler optimizations are applied *only* to x86_64-unknown-linux-gnu. We should bring these optimizations to other popular tier-1 platforms like Mac and Windows. In some cases this may be as easy as using the relevant scripts on the appropriate CI builders. This could give 15-30% wins for relatively little effort!
- [x] PGO on Windows \[lqd + Kobzol, done in [#96978](https://github.com/rust-lang/rust/pull/96978), lots of 10-20% wins]
- [ ] PGO on Mac [CI configuration/capacity is the limiting factor here]
- [ ] Build LLVM with ThinLTO on Windows.
- [ ] Build LLVM with ThinLTO on Mac.
- [x] Use [BOLT](https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/) on x86-64 Linux (the only Rust tier 1 platform supported by BOLT). Expected 5-10% improvement atop the existing set of PGO/ThinLTO. \[Kobzol, [#94381](https://github.com/rust-lang/rust/pull/94381) did it for the LLVM backend, giving 3.5% bootstrap improvement and 3.6% average max-rss improvement.]
- [ ] A better allocator on Windows than the system allocator.
Issue [#103595](https://github.com/rust-lang/rust/issues/103595) is tracking PGO/LTO/BOLT/allocators more comprehensively across all the most popular platforms.
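For local experimentation, some of these switches can already be tried through rustc's bootstrap configuration. The fragment below is a sketch, not a verified recipe: the option names are written from memory and should be checked against `config.example.toml` in a rust-lang/rust checkout, and PGO additionally requires profile collection (e.g. via the CI scripts).

```toml
# Hypothetical config.toml fragment for a rust-lang/rust checkout.
# Verify option names against config.example.toml before relying on them.

[llvm]
# Build LLVM itself with ThinLTO, as the x86_64 Linux CI builders do.
thin-lto = true

[rust]
# Build rustc's own crates with ThinLTO.
lto = "thin"
```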
- [x] **Hot code.** Try to optimize the hottest code, as identified by Cachegrind/DHAT/etc in lqd's data set.
- [x] `parse_tt` and other functions related to macro parsing are easily the hottest, and do many allocations, and are likely worth significant effort. \[nnethercote, this [blog post](https://nnethercote.github.io/2022/04/12/how-to-speed-up-the-rust-compiler-in-april-2022.html) has details\]
- [x] Some crates trigger uses of huge `BitSet`s, e.g. http-0.2.6. This causes high memory usage, lots of `memcpy`ing, etc. \[nnethercote, [#93984](https://github.com/rust-lang/rust/pull/93984)\]
- [x] There is a long tail of opportunities that affect a few crates. It will be worth looking at each of them briefly, there are probably a few easy wins. See the [analysis](https://hackmd.io/mxdn4U58Su-UQXwzOHpHag?view) document for details. [nnethercote + others]
  - [x] Metadata decoding takes roughly constant time for crates like `std` and `core`. For tiny crates this is a high fraction of compilation time, and may be worth some effort to reduce. It's also something that is repeated for every crate in a multi-crate project. \[martingms, [#95981](https://github.com/rust-lang/rust/pull/95981), helped a bit\] \[nnethercote, [#95981](https://github.com/rust-lang/rust/pull/95981), helped a lot\]
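The huge-`BitSet` issue above stems from dense bitsets costing memory proportional to the domain size, regardless of how many bits are set. A toy sketch (this is not rustc's actual `BitSet`, which lives in `rustc_index`, but it shows the cost model):

```rust
/// Toy dense bitset: heap memory is proportional to the domain size,
/// not to the number of set bits.
struct DenseBitSet {
    words: Vec<u64>,
}

impl DenseBitSet {
    fn new(domain_size: usize) -> Self {
        // One bit per possible element, rounded up to whole words.
        Self { words: vec![0; (domain_size + 63) / 64] }
    }
    fn insert(&mut self, i: usize) {
        self.words[i / 64] |= 1 << (i % 64);
    }
    fn contains(&self, i: usize) -> bool {
        self.words[i / 64] & (1 << (i % 64)) != 0
    }
    fn heap_bytes(&self) -> usize {
        self.words.len() * std::mem::size_of::<u64>()
    }
}

fn main() {
    // A domain of one million elements costs ~122 KiB even with a single
    // bit set, and every clone or memcpy of the set pays that full cost.
    let mut set = DenseBitSet::new(1_000_000);
    set.insert(3);
    assert!(set.contains(3));
    assert!(!set.contains(4));
    assert_eq!(set.heap_bytes(), 125_000);
}
```

This is why crates that trigger huge domains (like http-0.2.6 did) see high memory usage and `memcpy` traffic, and why switching to a sparser representation helps.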
- [x] **FxHasher improvements.** [#93651](https://github.com/rust-lang/rust/pull/93651) refers to a promising change to `rustc-hash`. The performance benefit should be reconfirmed and, if it persists, the improvement merged and imported into rustc. \[lqd, [#96893](https://github.com/rust-lang/rust/pull/96893); results seemed good on AMD, but indifferent on Intel. We decided to give up on this because `FxHasher` is a tar pit and lots of time has previously been wasted on attempts to improve it.\]
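For context, `FxHasher`'s core mixing step is a rotate-xor-multiply over word-sized chunks, which is why it is so fast and so hard to improve without regressing something. A simplified sketch of the idea (the multiplier constant is written from memory and should be checked against the `rustc-hash` source; the real hasher also has specialized `write_u8`/`write_usize` methods and a 32-bit variant):

```rust
use std::hash::{Hash, Hasher};

// 64-bit multiplier used by rustc-hash (assumption: quoted from memory,
// check the rustc-hash source for the authoritative value).
const SEED: u64 = 0x51_7c_c1_b7_27_22_0a_95;

/// Simplified FxHash-style hasher: rotate, xor in a word, multiply.
struct FxLikeHasher {
    hash: u64,
}

impl Hasher for FxLikeHasher {
    fn write(&mut self, bytes: &[u8]) {
        for chunk in bytes.chunks(8) {
            // Zero-pad the final partial chunk to a full word.
            let mut buf = [0u8; 8];
            buf[..chunk.len()].copy_from_slice(chunk);
            let word = u64::from_le_bytes(buf);
            self.hash = (self.hash.rotate_left(5) ^ word).wrapping_mul(SEED);
        }
    }
    fn finish(&self) -> u64 {
        self.hash
    }
}

fn fx_hash<T: Hash>(value: &T) -> u64 {
    let mut hasher = FxLikeHasher { hash: 0 };
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Deterministic within a process, and cheap: no SipHash rounds
    // like the standard library's `DefaultHasher`.
    assert_eq!(fx_hash(&42u64), fx_hash(&42u64));
    assert_ne!(fx_hash(&1u64), fx_hash(&2u64));
}
```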
- [ ] **Linker improvements.**
- [ ] `lld` is much faster than the default linkers, but isn't used by default. Can it be? [lqd, [MCP 510](https://github.com/rust-lang/compiler-team/issues/510) is the first step, which links to other PRs.]
  - [x] `mold` is a new linker that is reputed to be even faster than `lld`. We should investigate how well it works with Rust, and whether it can be made easier to use, including documentation. Given that it is new and not yet multi-platform, this would be for opt-in use. [nnethercote successfully tried `-Clink-arg=-fuse-ld=mold` on a small program on Linux, found it to be faster than `lld`, and updated the perf-book accordingly. lqd's work on `lld` will also partly pave the way for `mold` in the future.]
## Faster project compilation
Tasks that will speed up the compilation of multi-crate projects.
- [x] **Improve cargo scheduling**. The choice of which crates to compile first can greatly affect overall compile time. Use better heuristics to improve scheduling, e.g. based on crate size or past records of compile times.
Support for crate priority (a measure of the transitive number of dependents) in the wait queue has landed in [#11032](https://github.com/rust-lang/cargo/pull/11032) (and there are more benchmark results [here](https://github.com/lqd/rustc-benchmarking-data/tree/main/experiments/cargo-schedules/pending-queue-sorted)): when multiple crates are waiting to be built, e.g. when there are many units of work or a low core count, the highest-priority crate is picked next. [lqd. More work can be done here, as described in the PR description. An experiment where these priorities are computed differently, preferring popular proc macros, can be found [here](https://github.com/lqd/rustc-benchmarking-data/tree/main/experiments/cargo-schedules/pending-queue-prioritized). Similarly, we could choose different defaults targeting compilation speed, like disabling debuginfo for build dependencies in [#10493](https://github.com/rust-lang/cargo/pull/10493). These topics can be subtle and require interaction with the cargo team, which has little time at the moment. Both directions have potential and measured improvements, but now may not be the best time to pursue them.]
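The priority heuristic can be summarized as: count each crate's transitive dependents, and when several crates are ready, build the one that unblocks the most downstream work first. A toy model of that idea (crate names chosen for illustration only; this is not cargo's actual implementation):

```rust
use std::collections::{HashMap, HashSet};

/// For each crate, count its transitive dependents: the crates that
/// (directly or indirectly) cannot start building until it is done.
fn priorities<'a>(dependents: &HashMap<&'a str, Vec<&'a str>>) -> HashMap<&'a str, usize> {
    fn visit<'a>(
        c: &'a str,
        dependents: &HashMap<&'a str, Vec<&'a str>>,
        seen: &mut HashSet<&'a str>,
    ) {
        for &d in dependents.get(c).into_iter().flatten() {
            if seen.insert(d) {
                visit(d, dependents, seen);
            }
        }
    }
    dependents
        .keys()
        .map(|&c| {
            let mut seen = HashSet::new();
            visit(c, dependents, &mut seen);
            (c, seen.len())
        })
        .collect()
}

fn main() {
    // Toy graph based on the chain mentioned in this roadmap:
    // proc-macro2 -> quote -> syn -> serde_derive -> serde.
    // Edges point from a crate to its direct dependents.
    let mut dependents = HashMap::new();
    dependents.insert("proc-macro2", vec!["quote", "syn"]);
    dependents.insert("quote", vec!["syn"]);
    dependents.insert("syn", vec!["serde_derive"]);
    dependents.insert("serde_derive", vec!["serde"]);
    dependents.insert("serde", vec![]);

    let p = priorities(&dependents);
    assert_eq!(p["proc-macro2"], 4);

    // With several crates ready at once, schedule the highest priority first.
    let mut ready = vec!["serde_derive", "proc-macro2"];
    ready.sort_by_key(|c| std::cmp::Reverse(p[*c]));
    assert_eq!(ready, vec!["proc-macro2", "serde_derive"]);
}
```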
- [ ] **Improve `syn`/`proc-macro2`/`quote`**. These crates incur a lot of compilation cost, being among the most popular crates and quite slow to compile. They also block compilation of proc macro crates that are themselves often slow to compile. This leads to long, slow dependency chains like: `proc-macro2`, `quote`, `syn`, `serde_derive`, `serde`, `serde_json`. (And that omits the build scripts.) Can they be improved somehow?
- [ ] **Investigate build script costs.** Build scripts are typically tiny but take a surprisingly long time to compile. What is going on, and can they be improved? Alternatively, can we provide features (e.g. in Cargo) to avoid the need for build scripts that just do simple things like setting conditional compilation flags? [lqd, in-progress. Note: build scripts are a cargo concept, so any changes there may need time from t-cargo and the rustup wg, which is currently limited as mentioned above]
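For reference, many build scripts do little more than emit a conditional-compilation flag, yet cargo must still compile, link, and run each one as a separate executable. A hypothetical minimal `build.rs` illustrating the pattern (the `has_x` cfg name is made up):

```rust
// build.rs (hypothetical minimal example): the entire script exists
// just to emit one cfg flag, yet it still costs a full compile + link
// + run per build.

const CFG_DIRECTIVE: &str = "cargo:rustc-cfg=has_x";

fn main() {
    // Cargo parses `cargo:`-prefixed lines from the script's stdout.
    println!("{CFG_DIRECTIVE}");
    // Re-run only if the build script itself changes.
    println!("cargo:rerun-if-changed=build.rs");
}
```

A declarative way to express such trivial cfg-setting directly in `Cargo.toml` could eliminate this entire compile-link-run cycle for many crates.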
## Better benchmarks
[rustc-perf](https://github.com/rust-lang/rustc-perf/) is the primary benchmark suite used for gauging Rust compilation speed, in particular for detecting improvements and regressions. It can be improved in several ways.
- [x] **Split into "primary" and "secondary" benchmarks.** The suite contains a lot of "real-world" crates, like `cargo`, `diesel`, `ripgrep`, and `syn`. It also contains a number of "synthetic" stress tests and microbenchmarks, either extracted from real-world code or simply constructed out of thin air. When looking at the performance effects of a change on perf.rust-lang.org, changes to the speed of real-world tests are more important than changes to the speed of synthetic tests, but the latter often crowd out the former.
It should be simple to split the suite into "primary" and "secondary" benchmarks, and present the primary results on all pages before the secondary results. The division would be roughly "real-world" vs. "synthetic", but the terms "primary" and "secondary" allow for some human judgment. E.g. the `helloworld` benchmark isn't exactly "real-world" but it's a useful indication of the minimum time taken for any program, and might be considered a "primary" benchmark despite its simplicity.
- [x] **Update the real-world benchmarks.** The versions in the suite are mostly quite old, e.g. 3 or 4 years. For example, the `syn` crate version in the suite is 0.11.11 from April 2017, but the latest version of that crate at the time of writing is 1.0.86.
We should update the real-world benchmarks to their latest versions to ensure we are benchmarking widely-used code. The previous upgrade of `hyper` (and its renaming within the suite as `hyper-2`) provides precedent for this, though perhaps we should switch to using the crate version number, e.g. `syn-1.0.86`, which would be visually noisier but have a clearer meaning.
\[nnethercote, rylev, lqd, Kobzol; tracked [here](https://hackmd.io/d9uE7qgtTWKDLivy0uoVQw)\]
- [x] **Add new benchmarks.** lqd's data set identifies the following.
- Widely-used crates that are not in the benchmark suite, such as `libc`, `quote`, `proc-macro2`, `cfg-if`, and `log`.
- Hot macro parsing functions (like `parse_tt`), used in numerous crates, that the existing benchmarks hardly use.
- Crates that stress the compiler in interesting ways, such as `nalgebra`.
We should consider adding new benchmarks to represent these. We should also consider adding one or more build scripts (as leaf crates, if possible) because these are common but not represented.
\[nnethercote, tracked [here](https://hackmd.io/d9uE7qgtTWKDLivy0uoVQw)\]
- [x] **Remove or combine low-value benchmarks.** So that the suite doesn't grow in size monotonically, we should remove or combine benchmarks that are uninteresting. Highly synthetic benchmarks should be candidates for this; it is generally better to have a real-world benchmark that stresses a particular aspect of the compiler. For example, `tuple-stress` contains a huge literal extracted from a real program, whereas `deep-vector` contains a huge `vec!` literal with thousands of zeroes. And we don't need three different benchmarks testing "deeply nested" behaviour.
\[nnethercote, tracked [here](https://hackmd.io/d9uE7qgtTWKDLivy0uoVQw)\]
- [x] **Establish benchmark add/remove/update policies.** Previous additions, removals, and updates have been done on an ad hoc basis. It would be good to establish some basic policies around this. For example, should we update real-world crates every 1 year? Every 2 years?
- [ ] **Consider multi-crate benchmarks.** The suite mostly measures intra-crate compilation speed, specifically that of the final crate compiled in a package. (There is also one multi-crate benchmark, measuring the time taken to compile the compiler itself. This is a moving target because it always compiles the latest version of the compiler.) Project-level improvements, such as pipelining, do not show up much in the suite results.
We should consider whether to add new benchmark types to capture multi-crate improvements/regressions. Adding `cargo build --timings` support to the profiling tools would also be useful.
## Better UX for perf evaluation
Identify and implement features and/or process improvements to improve drive-by performance evaluation of pull requests, both in likelihood of occurrence and effective decision making.
- [ ] **Define a set of rules for go/no-go decisions**. Enable clear-cut cases to land without manual review by the perf team (as signaled by rust-timer comments), and refer unclear cases to the performance team for evaluation. [All comments now have a clear "ACTION NEEDED"/"no action needed" marker, and the perf team is now CC'd on all "ACTION NEEDED" ones and often triages/comments quickly. Not quite what this item was asking for, but both changes make life easier for rustc devs who don't intimately understand the perf suite.]
- [ ] **Survey compiler team on impediments to result triage**. Enable triage of perf.rust-lang.org results and/or local investigation into them. Depending on results of discussions, may involve cutting down time to perf.r-l.o results (try + perf run) or providing more granular per-benchmark results pages (e.g., cachegrind diffs).
[Known problems: regressions in rollups; noisy results]
- [ ] **Improve memory usage tracking**. Current max-rss statistics are sufficiently noisy that the significance threshold is often high, preventing small improvements from being detected. Also, tackling memory usage *not* at peak is important and currently not tracked at all. This matters particularly because core counts in machines will greatly increase over the next 2-3 years while available memory remains largely flat, meaning that memory usage in rustc must go down if we are to spread work across all cores.
- [ ] **Expose more/all metrics in decision-making**. Currently, the only metric we use for benchmarks is instructions:u; all other metrics are excluded. We also don't include bootstrap data in summary reports. These metrics are therefore essentially not considered when evaluating performance changes. [RSS and cycles are now included in the GitHub comments, but more can be done here]