Benchmarking Rust runtimes

We have good benchmarking for Rust compile times, via rustc-perf. We should add benchmarking for Rust runtimes, i.e. the speed of the generated code.

There is a defunct project called lolbench which used to do this. It was only run on each Rust release. It hasn't run for a couple of years, and the website no longer exists, but the code is still available on GitHub.

The (in-progress) implementation is here.

Goals

The goals are similar to the goals of the existing compile-time benchmarks: to detect regressions and improvements in the Rust compiler.

There are several types of runtime performance changes that we would like to detect:

  1. Codegen changes
    We can add small micro-benchmarks that can detect codegen regressions. For example, if some very small benchmark suddenly takes 3x more instructions after an innocent-looking change, that can indicate a change in codegen (see the sketch after this list).
    These are already covered by codegen tests, but we can also add them to the perf suite. It is probably easier to write a small benchmark and let the infrastructure track its instruction counts over time than to write a full codegen test.
  2. LLVM changes
    LLVM is regularly updated and its updates cause non-trivial changes in the performance of generated code. Therefore we would like both to detect regressions caused by LLVM bumps and to track how much the generated code improves when we upgrade LLVM.
  3. MIR optimization changes
    When a new MIR pass is added or an existing one is updated, we currently do not have a good way of measuring how it affects the performance of generated code. With a runtime benchmark suite, we could be more confident about the effect of MIR optimizations on real-ish code.
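
To illustrate (1), here is a minimal sketch of such a micro-benchmark. The function names are invented, black_box only prevents the compiler from optimizing the calls away, and the instruction-count tracking itself would be done by the perf infrastructure rather than by the benchmark:

fn identity_match(x: u32) -> u32 {
    // An identity match should compile down to a no-op; a sudden jump in the
    // instruction count of this benchmark would point at a codegen change.
    match x {
        0 => 0,
        v => v,
    }
}

fn collect_vec(n: usize) -> Vec<usize> {
    // Collecting a range into a Vec should allocate exactly once.
    (0..n).collect()
}

fn main() {
    std::hint::black_box(identity_match(std::hint::black_box(42)));
    std::hint::black_box(collect_vec(std::hint::black_box(10_000)));
}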

An explicit non-goal is to compare Rust speed against the speed of other languages.

  • There are existing benchmark suites for such purposes (e.g. the Computer Language Benchmarks Game, formerly the Great Computer Language Shootout).
  • These suites are magnets for controversy, because the results often depend greatly on the quality of the implementations.
  • Cross-language comparisons won't help compiler developers on a day-to-day basis.

Benchmark methodology

This is possibly the most important section and also the one with the most unanswered questions. How do we actually measure the performance of the runtime benchmarks? Here are three groups of metrics that we could use, from the most to the least stable:

  1. Instruction counts - this is probably the most stable and dependable metric, so we should definitely measure it, but it also doesn't tell the whole story (e.g. cache/TLB misses, branch mispredictions etc.).
  2. Cycles, branch mispredictions, cache misses, etc. - these metrics could give us a better idea of how the code behaves, but they are also increasingly noisy and sensitive to annoying effects like code layout.
  3. Wall time - this is ultimately the metric that interests us the most (how fast is this program compiled with rustc?), but it's also the most difficult one to measure correctly. We could measure wall times by launching the benchmark multiple times (here we have an advantage over the comptime benchmarks, because the runtime ones will probably be much quicker to execute than compiling a crate), but there will probably still be considerable noise. A sketch of this approach follows this list.
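
To make the wall-time approach concrete, here is a minimal sketch using std::time::Instant. The workload and iteration count are invented for illustration; aggregation (median, outlier filtering) and noise handling would live in the benchmarking infrastructure:

use std::time::Instant;

fn workload() -> u64 {
    // Placeholder workload; a real benchmark would run the actual benchmarked code.
    (0..1_000_000u64).map(|x| x.wrapping_mul(x)).sum()
}

fn main() {
    // Run the workload several times and print each wall time.
    for _ in 0..10 {
        let start = Instant::now();
        std::hint::black_box(workload());
        println!("{} ns", start.elapsed().as_nanos());
    }
}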

We could try to employ techniques to reduce noise (disable ASLR, hyper-threading, turbo-boost, interrupts, pin threads to cores, compile code with over-aligned functions etc.), but that won't solve everything.

A related question is what tool to use to actually execute the benchmarks. Using cargo bench or Criterion will probably be quite noisy and only produce wall times. We could measure the other metrics too, but they would include the overhead of the benchmark tool itself, which seems bad.

Another option would be to write a small benchmark runner that would e.g. let the user define a block of code to be benchmarked using a macro. We could then use e.g. perf_event_open to manually gather metrics only for that specific block of code. This is basically what iai does (although it seems unmaintained?).
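
As an illustration of that idea (not the actual in-progress implementation), here is a sketch that assumes the perf-event crate, a Rust wrapper over perf_event_open, to count instructions only for a specific block of code:

fn benchmarked_code() -> u64 {
    // Placeholder for the user-defined block of code to be measured.
    (0..100_000u64).map(|x| x.wrapping_mul(x)).sum()
}

fn main() -> std::io::Result<()> {
    use perf_event::{events::Hardware, Builder};

    // Count retired instructions only between enable() and disable(),
    // so the measurement excludes the benchmark runner itself.
    let mut counter = Builder::new().kind(Hardware::INSTRUCTIONS).build()?;

    counter.enable()?;
    let result = std::hint::black_box(benchmarked_code());
    counter.disable()?;

    println!("instructions: {}", counter.read()?);
    std::hint::black_box(result);
    Ok(())
}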

Benchmark selection

There are many things that we could benchmark. We could roughly divide them into two categories (although the distinction might not always be clear):

  1. Micro-benchmarks
    Small-ish pieces of code, like the ones in the current rustc benchmark suite. For these we care mostly about their codegen.
    Example: check that an identity match is a no-op, check that collecting a vector doesn't allocate unnecessarily, etc. Such micro-benchmarks would probably have considerable overlap with existing codegen tests.
  2. Real-world benchmarks
    These benchmarks would probably be more useful, as they should check the performance of common pieces of code that occur in real-world Rust projects.
    Example: computing an n-body simulation, searching in text using the regex/aho-corasick crate, stress-testing a hashbrown table, etc. (see the sketch after this list).
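
For illustration, here is a minimal sketch of such a real-world-ish benchmark using the regex crate; the pattern and corpus are invented:

use regex::Regex;

fn main() {
    // Invented corpus; a real benchmark would ship a fixed body of text.
    let haystack = "rustc compiles rust programs and regex searches them ".repeat(10_000);

    // Compile the pattern once, then search, as real code typically would.
    let re = Regex::new(r"\b\w+s\b").unwrap();
    let matches = re.find_iter(&haystack).count();

    // black_box keeps the result (and thus the search) from being optimized away.
    std::hint::black_box(matches);
}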

These two categories kind of correspond to the existing primary and secondary categories of comptime benchmarks.

Example pull request that adds a runtime benchmark: https://github.com/rust-lang/rustc-perf/pull/1459

  • Sorting - take e.g. slice::sort_unstable microbenchmarks from stdlib and port them to rustc-perf.
  • Text manipulation - use e.g. the regex crate to go through a body of text and find/replace several patterns.
  • Hashmap - insertion/lookup/removal from hashmaps containing items of various sizes and counts (see the sketch after this list).
  • I-slow issues - port issues with the I-slow label from the rustc repo. Candidates:
  • Compression/Encoding - adapting benchmarks from various compression and encoding crates
  • Hashing
  • Math/simulation
    • nbody simulation - PR
    • stencil kernels
  • Complex applications that can still be condensed into a CPU-bound benchmark.
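
As a concrete illustration of the hashmap candidate above, here is a minimal sketch using std::collections::HashMap; the sizes and types are invented, and hashbrown could be stressed directly in the same way:

use std::collections::HashMap;

fn main() {
    let mut map: HashMap<u64, String> = HashMap::new();

    // Insertion.
    for i in 0..100_000u64 {
        map.insert(i, i.to_string());
    }
    // Lookup.
    let mut found = 0u64;
    for i in 0..100_000u64 {
        if map.contains_key(&i) {
            found += 1;
        }
    }
    // Removal.
    for i in (0..100_000u64).step_by(2) {
        map.remove(&i);
    }

    std::hint::black_box((found, map.len()));
}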

Benchmark location

We should decide in which repository the benchmarks should live.

  • rustc-perf - Since we will most probably want to use the existing rustc-perf infrastructure for storing and visualising the results, this is an obvious choice.
  • rust - This would probably make the benchmarks more discoverable and easier for existing rustc developers to add to, but the same could be said about the existing comptime benchmarks.
  • A separate repository (e.g. revive lolbench) - Probably not worth it to spread the benchmarks across yet another repository.

Benchmark configuration

What configurations should we measure? A vanilla release build is the obvious starting point. Any others?

Infrastructure (CI) cost

Obviously, it will take some CI time to execute the benchmarks. We should decide whether the current perf.rlo infrastructure can handle it.

Since the vast majority of rustc commits shouldn't affect codegen, we can make the runtime benchmarks optional for manual perf runs (e.g. @rust-timer build runtime=yes). In terms of automated runs on merge commits, we could run the benchmarks only when specific parts of the compiler are changed (e.g. MIR or LLVM). But this was already envisioned for comptime benchmarks and may not work well.

Regression policy

We should also think about how important the runtime benchmarks will be for us. Are they lower or higher priority than comptime benchmarks? Do we want to block merging a PR because of a runtime regression? Do we want to run the runtime suite on all merged commits?

Implementation ideas

We could reuse the rustc-perf infrastructure. We can use the same DB as is used for comptime benchmarks (e.g. store profile=opt, scenario=runtime or something like that, and reuse the pstat_series table). We could put the runtime benchmarking code under the collector crate, because we will need the existing infrastructure to use a specific rustc version for actually compiling the runtime benchmarks.

Then we could prepare a benchmarking mini-library that would allow crates to register a set of benchmarks; it would execute them and write the results (e.g. as JSON) to stdout.

Something like this:

fn main() {
   let mut suite = BenchmarkSuite::new();
   suite.register("bench1", || { ... });
   suite.run();
}
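
To make this more concrete, here is one hypothetical shape of such a mini-library; the struct, method names and JSON format are invented for illustration (the real implementation would presumably use serde and report hardware counters in addition to wall time):

use std::time::Instant;

struct BenchmarkSuite {
    benchmarks: Vec<(&'static str, Box<dyn Fn()>)>,
}

impl BenchmarkSuite {
    fn new() -> Self {
        Self { benchmarks: Vec::new() }
    }

    fn register(&mut self, name: &'static str, benchmark: impl Fn() + 'static) {
        self.benchmarks.push((name, Box::new(benchmark)));
    }

    fn run(self) {
        for (name, benchmark) in self.benchmarks {
            let start = Instant::now();
            benchmark();
            let nanos = start.elapsed().as_nanos();
            // Hand-rolled JSON to keep the sketch dependency-free.
            println!(r#"{{"benchmark": "{name}", "wall_time_ns": {nanos}}}"#);
        }
    }
}

fn main() {
    let mut suite = BenchmarkSuite::new();
    suite.register("bench1", || {
        std::hint::black_box((0..1_000_000u64).sum::<u64>());
    });
    suite.run();
}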

Then we could create a new directory for runtime benchmarks in collector. We could either put all of the runtime benchmarks into a single crate or (preferably) create several crates (each with different dependencies etc., based on the needs of its benchmarks) that would contain a set of benchmarks.

collector would then go through all the crates, build them, execute them, read the results from the JSON and store them into the DB. I'm not sure how the interface should look; maybe we could introduce another level to the existing commands, like bench_next comptime --profiles ... and bench_next runtime --benchmarks x,y,z.

A sketch of this implementation can be found here.

User interface

Since we already have the perf.rlo dashboard, it probably makes the most sense to reuse it. Even though we will probably reuse the DB structure of the existing comptime benchmarks, the runtime benchmarks might warrant a separate UI page, for these reasons:

  1. Avoid mixing runtime benchmarks with the comptime benchmarks. The compare page is already complex as it is, and stuffing a bunch of very different benchmarks into it would make it worse.
  2. This is not decided yet, but it's possible that we will eventually have a lot of runtime benchmarks. After all, they will probably be much faster to execute than the comptime benchmarks, so we could in theory afford that. The compare page might not be prepared for such a large number of benchmarks.
  3. We can add runtime-specific functionality to the runtime performance UI page. For example, we could display the codegen diff of some benchmark (this might not be so easy though). Maybe we could just include a redirect to godbolt :)