Real World v Synthesis: Dawn of Benchmarks
The main points pnkfelix wants to address today are:
- Why we include pathological benchmarks in the test suite. (It is definitely motivated, but it's good to make sure everyone is on the same page about that point.)
- How to categorize the existing set of benchmarks
If you want to add a question/topic for the meeting, you can jump to that section at the end.
Background Text
Some summary text, adapted from the Zulip archive topic "categorize 'real world' vs 'pathological' benchmarks?" and the T-compiler meeting of 2021-12-16:
pnkfelix wants to make sure we avoid over-indexing our efforts on pathological benchmarks that are solely meant to expose certain kinds of bad scenarios (by multiplying certain patterns of code by factors of 100x or 1000x). When such a case regresses by 0.8%, it's not nearly as concerning as when real-world code like diesel or serde regresses by 0.8%.
Note from jyn514 and nagisa: pathological benchmarks that do a lot of a single thing aren't necessarily as pathological as they may seem – real code may do a lot of the same thing (many match arms, deeply nested types, large basic blocks) too. E.g., deeply-nested was added because there was no async-like benchmark before.
wwiser refines the above by noting that some of the pathological benchmarks are meant to catch superlinear (e.g. quadratic or exponential) regressions; externs, for instance, used to be quadratic in run time. We want to avoid going back to that, but a 10% regression that stays linear wouldn't necessarily be the end of the world if it improved real-world code.
Categorization
There are many categorizations one could construct here. Our goal should probably be to have two categories, in the name of simplicity for people looking at perf.rlo.
But for the meta conversation we are having now (and that the performance investigations will presumably continue having in the future), we should at least be aware of finer-grain distinctions.
Here are some properties, potentially overlapping, that I think it may be useful to define:
- real-world: directly reflects a case found in the wild.
- synthetic: constructed (or derived) for the purposes of benchmarking. Often in service of micro-benchmarking, where it's trying to zero in on one facet of the system. It has two main forms I'm aware of:
- distilled: takes a real-world case and refines it to focus on the detail of interest. E.g. a benchmark that continues to act as an effective witness to some prior regression. (Thus it can have characteristics of both the real-world and the synthetic…)
- toy: a program that we do not expect to actually see in practice in the real world. (Note that these are still sometimes useful for understanding.)
- pathological: takes some pattern of coding and multiplies it. (Note that the real world can yield pathological scenarios.) This can be a method of distillation, but it can also yield toys…
- important: this labels benchmarks that represent code that is popularly used in the community or that matters to Rust stakeholders. We want to prioritize performance enhancements (and addressing regressions) for these benchmarks, and are potentially willing to take regressions elsewhere to achieve it.
- outdated: no longer models a real-world case of interest
To be clear: pnkfelix is not suggesting that we present such fine-grain labels on perf.rlo. E.g., others have suggested a simple "primary"/"secondary" categorization, which is probably fine. But these fine-grain labels may be useful as metadata to drive the decisions about how benchmarks are categorized now and in the future.
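To make that concrete, here is a minimal sketch (hypothetical Rust; the type and variant names are invented and do not correspond to anything in rustc-perf) of how such fine-grain labels could be carried as metadata on a benchmark, with several labels allowed at once:

```rust
// Hypothetical sketch only: none of these names exist in rustc-perf.
// A benchmark can carry several fine-grain labels at once
// (e.g. "synthetic distilled" + "pathological").

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Label {
    RealWorld,          // directly reflects a case found in the wild
    SyntheticDistilled, // a real-world case refined to focus on one detail
    SyntheticToy,       // not expected to occur in practice
    Pathological,       // some coding pattern multiplied by 100x/1000x
    Important,          // popular code, or code that matters to stakeholders
    Outdated,           // no longer models a real-world case of interest
}

struct BenchmarkMeta {
    name: &'static str,
    labels: Vec<Label>,
}

fn main() {
    // Encoding one of the concrete examples discussed below.
    let deep_vector = BenchmarkMeta {
        name: "deep-vector",
        labels: vec![Label::SyntheticToy, Label::Pathological],
    };
    println!("{}: {:?}", deep_vector.name, deep_vector.labels);
}
```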
So, some concrete examples taken from the Zulip chat:
- deep-vector just contains a vec!() literal with 135,000 zero elements, which is 100% artificial: synthetic toy, pathological
- tuple-stress contains an array with 64k tuples of real geographic data, which was stripped out of a real program: synthetic distilled, pathological
- inflate is a really old version of that crate that doesn't reflect the current version, and it's a weird example: real-world, outdated
- deeply-nested was added because there was no async-like benchmark before, and heavily async code uses lots of nested types: synthetic distilled, pathological, important
Once we have a classification, what should we do with it?
Simple ideas
I guess the dumb thing is to just split everything (tables, pages of graphs, etc.) into two halves: real first, then artificial
pnkfelix wants to stress: unimportant toy benchmarks can be invaluable in dissecting a problem once it is identified. So continuing to have access to them seems useful. What is potentially not useful is letting them be an important part of day-to-day workflows: developers should not be stressing about small (or maybe any) regressions to toy benchmarks, and they should not be occupying time during weekly performance triage unless motivated by some other regression.
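As a rough sketch of how the "two halves" presentation and the triage filter could be derived from the metadata (again hypothetical; the names and the derivation rule are invented, not the actual perf.rlo logic):

```rust
// Hypothetical sketch only: the names and the derivation rule are invented.
// The coarse category shown on the dashboard is derived from finer metadata;
// the artificial half stays available but is left out of default triage.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Coarse {
    Real,       // shown first, drives weekly triage
    Artificial, // accessible on demand, not part of day-to-day workflows
}

struct Benchmark {
    name: &'static str,
    real_world: bool, // would come from the fine-grain labels
}

fn coarse_category(b: &Benchmark) -> Coarse {
    if b.real_world { Coarse::Real } else { Coarse::Artificial }
}

fn main() {
    let benchmarks = [
        Benchmark { name: "diesel", real_world: true },
        Benchmark { name: "deep-vector", real_world: false },
    ];

    // "Real first, then artificial": partition for display...
    let (real, artificial): (Vec<_>, Vec<_>) = benchmarks
        .iter()
        .partition(|b| coarse_category(b) == Coarse::Real);

    // ...and only the real half is surfaced during weekly triage.
    for b in &real {
        println!("triage: {}", b.name);
    }
    println!("{} artificial benchmark(s) available on demand", artificial.len());
}
```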
Draft Classification
Contributed by nnethercote
My draft classification:
- Real: cargo, clap-rs, cranelift-codegen, diesel, encoding, futures, html5ever, hyper-2, piston-image, regex, ripgrep, serde, stm32f4, style-servo, syn, tokio-webpush-simple, ucd, webrender, webrender-wrench
- Maybe real: inflate, keccak, wg-grammar, unicode_normalization
- Artificial: coercions, ctfe-stress-4, deeply-nested, deeply-nested-async, deeply-nested-closures, deep-vector, derive, externs, issue-*, many-assoc-items, match-stress-enum, match-stress-exhaustive_patterns, regression-31157, token-stream-stress, tuple-stress, unify-linearly, unused-warnings, wf-projection-stress-65510
Appendices
Hennessy and Patterson breakdown
Hennessy & Patterson is always good for this sort of thing, too. Chapter 1 distinguishes: real applications, modified applications, kernels, toy benchmarks, and synthetic benchmarks.
(if time permits, pnkfelix will transcribe relevant definitions from their copy of H&P)
Breakdown from the nofib suite
For anyone who likes thinking about this topic, sections 2.2-2.4 of http://web.mit.edu/~ezyang/Public/10.1.1.53.4124.pdf may be of interest. It's a Haskell benchmarking suite divided into "real", "spectral", and "imaginary" subsets.
Transcribed from Will Partain's paper on nofib.
Real subset
- Written to standard Haskell (version 1.2 or later); for us, stable Rust
- Written by someone trying to get a job done, not by someone trying to make a pedagogical or stylistic point
- (pnkfelix wonders whether "trying to get a job done" includes expertise level; i.e. do you benchmark beginner code that has not had an optimization pass? Ah: see potential answer below)
- Performs some useful task such that someone other than the author might want to execute the program for other than watch-a-demo reasons
- Neither implausibly small nor impossibly large (the Glasgow Haskell compiler, written in Haskell, falls in the latter category)
- Aside from pnkfelix: We are benchmarking the compiler itself on perf.rlo. Is nofib solely used for benchmarking the output code, or is it also used to benchmark Haskell compilers? (Perhaps we have inherently placed our task into an impossibly large area according to this document, and are just varying the inputs to that task… but still we must press on…)
- The run time and space for the compiled program must be neither too small (e.g. time less than five secs.) nor too large (e.g. such that a research student in a typical academic setting could not run it).
Other desiderata for the Real subset as a whole:
- Written by diverse people, with varying functional-programming skills and styles, at different sites
- Include programs of varying "ages", from first attempts, to heavily-tuned rewritten-four-times behemoths, to transliterations-from-LML, etc…
- Span across as many different application areas as possible.
- The suite, as a whole, should be able to compile and run to completion overnight, in a typical academic Unix computing environment
Spectral subset
Don't quite meet the criteria for Real programs, usually failing the stipulation that someone other than the author might want to run them. Many of these programs fall into Hennessy and Patterson's category of "kernel" benchmarks, being "small, key pieces from real programs"
Imaginary subset
The usual small toy benchmarks, e.g. primes, kwic, queens, and tak. These are distinctly unimportant, and you may get a special commendation if you ignore them completely. They can be quite useful as test programs, e.g. to answer the question, "Does the system work at all after Simon's changes?"
Meeting Questions/Topics
- pnkfelix: Bi-classification sound good? Or should we go tri-classification like nofib?
- jack huey: an important question is less how we categorize benchmarks than how we use the categorization to decide whether a PR is acceptable perf-wise (one possible shape is sketched after this list)
    - rylev: In general we leave it up to not-very-scientific heuristics on what is acceptable to merge or not. We'll likely want to decide what is acceptable for "important" benchmarks first and then decide how other categories differ.
- aaron hill: the set of metrics itself should be expanded. We should display changes in disk usage / artifact sizes
    - aaron hill: Right now, the only way to see artifact size changes is to open each individual benchmark page
    - rylev: This is a well-known limitation and has simply not been implemented. We'd love to implement it but just haven't had the time yet
    - aaron hill: and there can often be trade-offs between compilation speed and disk usage
- nagisa: Right now we use statistics extensively to hide performance changes we consider irrelevant (currently only those within noise). I think it is important that we are careful not to hide what we consider synthetic benchmarks in the same way.
- pnkfelix: I think our current dashboard both shows too much (as in, too many benchmarks, or at least too many toy ones given equal prominence to real-world ones), and also too little (having to switch between different metrics, rather than being able to see multiple metrics at once).
    - rylev: We also need to take profile (e.g., release, debug, doc) and scenario (i.e., incremental compilation with different changes and cache states) into account
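Picking up jack huey's question (and nagisa's caution about not simply hiding synthetic results), one possible shape for using the categorization when judging a PR is sketched below. This is purely illustrative: the thresholds and names are invented, and this is not how perf.rlo currently decides relevance.

```rust
// Hypothetical sketch only: thresholds and names are invented, and this is
// not how perf.rlo currently decides relevance. The idea: hold important
// real-world benchmarks to a tighter significance bar than synthetic/stress
// benchmarks, instead of hiding the latter entirely.

#[derive(Clone, Copy)]
enum Category {
    Important, // e.g. diesel, serde: small regressions are already concerning
    Secondary, // synthetic/toy/stress: only flag large changes
}

struct ResultDelta {
    benchmark: &'static str,
    category: Category,
    percent_change: f64, // positive = regression
}

/// Should this delta be flagged during PR review / weekly triage?
fn is_relevant(d: &ResultDelta) -> bool {
    let threshold = match d.category {
        Category::Important => 0.5, // tight bar for real-world code
        Category::Secondary => 5.0, // loose bar for stress tests
    };
    d.percent_change.abs() >= threshold
}

fn main() {
    let deltas = [
        ResultDelta { benchmark: "diesel", category: Category::Important, percent_change: 0.8 },
        ResultDelta { benchmark: "deep-vector", category: Category::Secondary, percent_change: 0.8 },
    ];
    for d in &deltas {
        println!("{}: flag = {}", d.benchmark, is_relevant(d));
    }
}
```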
Discussion Topics
- Assuming some (simple) classification, maybe a binary one, how would we use it to improve things?
- What metrics should we gather and make prominent, by default, in our dashboard?
- crater for benchmarks?