# Benchmarking Rust runtimes
We have good benchmarking for Rust compile times, via [rustc-perf](https://github.com/rust-lang/rustc-perf/). We should add benchmarking for Rust runtimes, i.e. the speed of the generated code.
There is a defunct project called [lolbench](https://github.com/anp/lolbench) which used to do this, but it was only run on each Rust release. It hasn't run for a couple of years and its website (lolbench.rs) no longer exists, but the code is still available on GitHub.
The (in-progress) implementation is [here](https://github.com/rust-lang/rustc-perf/tree/master/collector/runtime-benchmarks).
## Goals
The goals are similar to the goals of the existing compile-time benchmarks: to detect regressions and improvements in the Rust compiler.
There are several types of runtime performance changes that we would like to detect:
1) **Codegen changes**
We can add small micro-benchmarks that can detect codegen regressions. For example, if some very small benchmark suddenly takes 3x more instructions after an innocent-looking change, that can indicate a change in codegen.
These are already covered by codegen tests, but we can also add them to the perf suite. It is probably easier to write a small benchmark and let the infrastructure track its instruction counts over time than to write a full codegen test.
2) **LLVM changes**
LLVM is regularly updated, and its updates cause non-trivial changes in the performance of generated code. Therefore we would like to detect regressions caused by LLVM bumps and also to track how much the generated code improves when we upgrade LLVM.
3) **MIR optimization changes**
When a new MIR pass is added or an existing one is updated, we currently do not have a good way of measuring how it affects the performance of generated code. With a runtime benchmark suite, we could be more confident about the effect of MIR optimizations on real-ish code.
An explicit non-goal is to compare Rust speed against the speed of other languages.
- There are existing benchmark suites for such purposes (e.g. the [Computer Language Benchmarks Game](https://benchmarksgame-team.pages.debian.net/benchmarksgame/index.html), formerly the Great Computer Language Shootout).
- These suites are magnets for controversy, because the results often depend greatly on the quality of the implementations.
- Cross-language comparisons won't help compiler developers on a day-to-day basis.
## Benchmark methodology
This is possibly the most important section and also the one with the most unanswered questions. How do we actually measure the performance of the runtime benchmarks? Here are three groups of metrics that we could use, from the most to the least stable:
1) Instruction counts - this is probably the most stable and dependable metric, so we should definitely measure it, but it doesn't tell the whole story (e.g. cache/TLB misses, branch mispredictions etc.).
2) Cycles, branch mispredictions, cache misses, etc. - these metrics could give us a better idea of how the code behaves, but they are also increasingly noisy and sensitive to annoying effects like code layout.
3) Wall time - this is ultimately the metric that interests us the most (how fast is this program compiled with `rustc`?), but it's also the most difficult one to measure correctly. We could measure wall times by launching the benchmark multiple times (here we have an advantage over the comptime benchmarks, because the runtime ones will probably be much quicker to execute than compiling a crate), but there will probably still be considerable noise.
We could try to employ techniques to reduce noise (disable ASLR, hyper-threading, turbo boost and interrupts, pin threads to cores, compile code with over-aligned functions, etc.), but that won't solve everything. A rough sketch of the naive approach follows below.
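As a rough illustration of the naive wall-time approach (a minimal sketch; the workload, iteration count and use of the median are made up for illustration, and it assumes `std::hint::black_box` is available):
```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    let mut samples = Vec::with_capacity(30);
    for _ in 0..30 {
        let start = Instant::now();
        // `black_box` hides the input/output from the optimizer so the
        // work isn't optimized away or hoisted out of the loop.
        let sum: u64 = black_box(&data).iter().sum();
        black_box(sum);
        samples.push(start.elapsed());
    }
    samples.sort();
    // Report the median rather than the mean to reduce the impact of outliers.
    println!("median: {:?}", samples[samples.len() / 2]);
}
```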
A related question is what tool to use to actually execute the benchmarks. Using `cargo bench` or `Criterion` would probably be quite noisy and only produce wall times. We could measure the other metrics too, but they would also include the benchmarking tool itself, which seems bad.
Another option would be to write a small benchmark runner that would e.g. let the user define a block of code to be benchmarked using a macro. We could then use e.g. `perf_event_open` to manually gather metrics only for that specific block of code. This is basically what [`iai`](https://github.com/bheisler/iai) does (although it seems unmaintained?).
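For illustration, here is a minimal sketch of measuring only a specific block with the third-party `perf-event` crate (one possible wrapper around `perf_event_open`; the workload is made up, and the crate/API choice is an assumption, not a decision):
```rust
use perf_event::Builder;

fn main() -> std::io::Result<()> {
    // By default the counter measures retired instructions of the current thread.
    let mut counter = Builder::new().build()?;

    // Setup that we do *not* want to measure.
    let data: Vec<u64> = (0..1_000_000).collect();

    counter.enable()?;
    // Only this block is measured, excluding setup and the harness itself.
    let sum: u64 = data.iter().sum();
    counter.disable()?;

    println!("sum = {sum}, instructions = {}", counter.read()?);
    Ok(())
}
```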
## Benchmark selection
There are many things that we could benchmark. We could roughly divide them into two categories (although the distinction might not always be clear):
1) **Micro-benchmarks**
Small-ish pieces of code, like the ones in the current rustc benchmark suite. For these we care mostly about their codegen.
Example: check that an identity match is a no-op, check that collecting a vector doesn't allocate unnecessarily, etc. Such micro-benchmarks would probably have considerable overlap with existing codegen tests (see the sketch after this list).
2) **Real-world benchmarks**
These benchmarks would probably be more useful, as they should check the performance of common pieces of code that occur in real-world Rust projects.
Example: computing an `n-body` simulation, searching in text using the `regex`/`aho-corasick` crate, stress-testing a `hashbrown` table etc.
These two categories kind of correspond to the existing `primary` and `secondary` categories of comptime benchmarks.
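To make the micro-benchmark category more concrete, a codegen-focused benchmark could be as small as the following (a hypothetical example; the point is that the infrastructure would track its instruction count over time):
```rust
use std::hint::black_box;

// An identity `match` should compile down to a no-op, so the instruction
// count of this loop should stay flat across compiler versions.
fn identity_match(x: u32) -> u32 {
    match x {
        0 => 0,
        n => n,
    }
}

fn main() {
    let mut total = 0u32;
    for i in 0..1_000_000u32 {
        total = total.wrapping_add(identity_match(black_box(i)));
    }
    black_box(total);
}
```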
Example pull request that adds a runtime benchmark: https://github.com/rust-lang/rustc-perf/pull/1459
- **Sorting** - take e.g. `slice::sort_unstable` microbenchmarks from `stdlib` and port them to `rustc-perf`.
- **Text manipulation** - use e.g. the `regex` crate to go through a body of text and find/replace several regexes (see the sketch after this list).
- **Hashmap** - insertion/lookup/removal from hashmaps containing items of various sizes and counts.
- **I-slow issues** - port issues with the `I-slow` label from the `rustc` repo. Candidates:
- [x] [BufReader](https://github.com/rust-lang/rust/issues/102727) issues - [PR](https://github.com/rust-lang/rustc-perf/pull/1460)
- [nested, chunked iteration](https://github.com/rust-lang/rust/issues/53340), the crate from which it was extracted [has benchmarks and sample data](http://chimper.org/rawloader-rustc-benchmarks/)
- wasmi ([#102952](https://github.com/rust-lang/rust/issues/102952))
- **Compression/Encoding** - adapting benchmarks from various compression and encoding crates
- **Hashing**
- **Math/simulation**
- [x] nbody simulation - [PR](https://github.com/rust-lang/rustc-perf/pull/1459)
- stencil kernels
- **Complex applications** that can still be condensed into a CPU-bound benchmark.
- rendering the acid3 test with servo
- techempower [web framework benches](https://github.com/TechEmpower/FrameworkBenchmarks/tree/master/frameworks/Rust)
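As an example of the text-manipulation idea, a benchmark could run a few searches and replacements over a fixed corpus with the `regex` crate (a minimal sketch; the corpus file and the patterns are made up for illustration):
```rust
use regex::Regex;

fn main() {
    // Assumption: a sample text file is bundled alongside the benchmark.
    let haystack = include_str!("sherlock.txt");

    // Count occurrences of capitalized two-word names.
    let names = Regex::new(r"[A-Z][a-z]+ [A-Z][a-z]+").unwrap();
    let count = names.find_iter(haystack).count();

    // Replace all years with a placeholder.
    let years = Regex::new(r"\b(18|19)\d{2}\b").unwrap();
    let replaced = years.replace_all(haystack, "YEAR");

    println!("names: {count}, bytes after replacement: {}", replaced.len());
}
```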
## Benchmark location
We should decide in which repository the benchmarks should live.
- `rustc-perf`: Since we will most probably want to use the existing rustc-perf infrastructure for storing and visualising the results, this is the obvious choice.
- `rust`: This would probably make the benchmarks more discoverable and easier to add for existing rustc developers, but the same could be said about the existing comptime benchmarks.
- A separate repository (e.g. reviving `lolbench`): Probably not worth it to spread the benchmarks across yet another repository.
## Benchmark configuration
What configurations should we measure? A vanilla `release` build is the obvious starting point. Any others?
## Infrastructure (CI) cost
Obviously, it will take some CI time to execute the benchmarks. We should decide whether the current perf.rlo infrastructure can handle it.
Since the vast majority of rustc commits shouldn't affect codegen, we can make the runtime benchmarks optional for manual perf runs (e.g. `@rust-timer build runtime=yes`). For automated runs on merge commits, we could run the benchmarks only if some specific parts of the compiler are changed (MIR, LLVM). But this was already envisioned for comptime benchmarks and may not work well.
## Regression policy
We should also think about how important the runtime benchmarks will be for us. Are they lower or higher priority than the comptime benchmarks? Do we want to block merging a PR because of a runtime regression? Do we want to run the runtime suite on all merged commits?
## Implementation ideas
We could reuse the `rustc-perf` infrastructure. We can use the same DB as is used for comptime benchmarks (e.g. store profile=opt, scenario=runtime or something like that, and reuse the `pstat_series` table). We could put the runtime benchmarking code under the `collector` crate, because we will need the existing infrastructure to use a specific rustc version for actually compiling the runtime benchmarks.
Then we could prepare a benchmarking mini-library that would allow crates to register a set of benchmarks, execute them and write the results as e.g. JSON to stdout.
Something like this:
```rust
fn main() {
    let mut suite = BenchmarkSuite::new();
    suite.register("bench1", || { /* benchmarked code */ });
    suite.run();
}
```
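A possible shape of the mini-library behind that API (a hand-wavy sketch, assuming we only record wall time and hand-roll the JSON output to avoid dependencies; the real version would presumably hook into hardware performance counters):
```rust
use std::time::Instant;

pub struct BenchmarkSuite {
    benches: Vec<(&'static str, Box<dyn Fn()>)>,
}

impl BenchmarkSuite {
    pub fn new() -> Self {
        Self { benches: Vec::new() }
    }

    pub fn register(&mut self, name: &'static str, f: impl Fn() + 'static) {
        self.benches.push((name, Box::new(f)));
    }

    pub fn run(self) {
        // Emit one JSON object per benchmark so that `collector` can parse
        // the results from stdout and store them in the database.
        for (name, f) in self.benches {
            let start = Instant::now();
            f();
            let nanos = start.elapsed().as_nanos();
            println!("{{\"name\": \"{name}\", \"wall_time_ns\": {nanos}}}");
        }
    }
}
```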
Then we could create a new directory for runtime benchmarks in `collector`. We could either put all of the runtime benchmarks into a single crate or (preferably) create several crates (each with different dependencies etc., based on the needs of its benchmarks) that would contain a set of benchmarks.
`collector` would then go through all the crates, build them, execute them, read the results from the JSON and store them in the DB. I'm not sure how the interface should look; maybe we could introduce another level to the existing commands, like `bench_next comptime --profiles ...` and `bench_next runtime --benchmarks x,y,z`.
A sketch of this implementation can be found [here](https://github.com/rust-lang/rustc-perf/tree/lolperf/runtime).
## User interface
Since we already have the perf.rlo dashboard, it probably makes the most sense to reuse it. Even though we will probably reuse the DB structure of the existing comptime benchmarks, the runtime benchmarks might warrant a separate UI page, for these reasons:
1) Avoid mixing runtime benchmarks with the comptime benchmarks. The compare page is already complex as it is, and stuffing a bunch of very different benchmarks into it would make it worse.
2) This is not decided yet, but it's possible that we will eventually have **a lot** of runtime benchmarks. After all, they will probably be much faster to execute than the comptime benchmarks, so we could in theory afford that. The compare page might not be prepared for such a large number of benchmarks.
3) We can add runtime-specific functionality to the runtime performance UI page. For example, we could display the codegen diff of some benchmark (this might not be so easy though). Maybe we could just include a redirect to godbolt :)