# Rustc-perf - multi-collectors
This document notes some of the design considerations required to make
`rustc-perf` both multi-architecture and multi-collector, and to support new features such as backfilling missing benchmark results for non-standard benchmark parameters. The overall philosophy is to provide a base to build upon, balancing today's needs with tomorrow's ideals; the document may be too sparse in some places and too detailed in others.
For the purposes of discussion, the table below defines a glossary of terms. The naming aims to minimally identify the constituent parts of the system; the precise names are illustrative and open to improvement. A short code sketch after the table illustrates how some of these terms fit together.
## Keywords
| Term | Meaning |
|------|---------|
| **artifact** | A single Rust compiler toolchain built from a specific commit SHA. |
| **metric** | A quantifiable metric gathered during the execution of the compiler (e.g. instruction count). |
| **benchmark** | A Rust crate that will be used for benchmarking the performance of `rustc` (a compile-time benchmark) or its codegen quality (a runtime benchmark). |
| **profile** | Describes how to run the compiler (e.g. `cargo build/check`). A profile is a **benchmark parameter**. |
| **scenario** | Further specifies how to invoke the compiler (e.g. incremental rebuild/full build). A scenario is a **benchmark parameter**. |
| **backend** | Codegen backend used when invoking `rustc`. A backend is a **benchmark parameter**. |
| **target** | Roughly the Rust target triple, e.g. `aarch64-unknown-linux-gnu`. A target is a **benchmark parameter**. |
| **benchmark suite** | A set of *benchmarks*. We have two suites - compile-time and runtime. |
| **test case** | A combination of a *benchmark* and its *benchmark parameters* that uniquely identifies a single *test*. For compile-time benchmarks it's *benchmark* + *profile* + *scenario* + *backend* + *target*; for runtime benchmarks it's just the *benchmark*. |
| **test** | Identifies the act of benchmarking an *artifact* under a specific *test case*. Each test consists of several *test iterations*. |
| **test iteration** | A single actual execution of a *test*. |
| **collection** | A set of all *statistics* for a single *test iteration*. |
| **test result** | The result of gathering all *statistics* from a single *test*. Aggregates results from all *test iterations* of that *test*, so a *test result* is essentially the union of *collections*. Usually we just take the minimum of each statistic out of all its *collections*. |
| **statistic** | A single measured value of a *metric* in a *test result*. |
| **run** | A set of all *test results* for a set of *test cases* measured on a single *artifact*. |
| **benchmark request** | A request for benchmarking a *run* of a given *artifact*. It is either created from a try build on a PR, or automatically determined from merged master/release *artifacts*. |
| **collector** | A physical runner for benchmarking the compiler. |
| **cluster** | One or more collectors of the same target, for benchmarking the compiler. |
| **collector_id** | A unique identifier of a *collector* (hard-coded at first for simplicity). |
| **COLLECTOR_PER_TARGET_COUNT** | Number of collectors *per target* (initially `2 x x86_64`, `1 x AArch64`). |
| **COLLECTOR_COUNT_TOTAL** | Total number of collectors across **all** targets. |
| **job** | High-level "work item" that defines a set of *test cases* that should be benchmarked on a specific collector. |
| **job_queue** | Queue of *jobs*. |
| **MAX_JOB_RETRIES** | Maximum number of retries for a *job*. |
| **Assigning a job** | The act of allocating one or more *jobs* to a collector. |
| **website** | A standalone server responsible for inserting work into the queue. |
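To make the relationships between these terms concrete, here is a minimal Rust sketch (types and field names are illustrative, not the actual rustc-perf schema) of a compile-time *test case* and of aggregating a *test result* as the per-metric minimum over its *collections*:
```rust
use std::collections::HashMap;

/// Benchmark parameters that uniquely identify a compile-time test case.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct CompileTestCase {
    benchmark: String, // e.g. "serde"
    profile: String,   // e.g. "check"
    scenario: String,  // e.g. "incr-full"
    backend: String,   // e.g. "llvm" or "cranelift"
    target: String,    // e.g. "x86_64-unknown-linux-gnu"
}

/// One collection: all statistics gathered during a single test iteration,
/// keyed by metric name (e.g. instruction count).
type Collection = HashMap<String, f64>;

/// Aggregate a test result from its collections by taking the minimum of
/// each statistic across all test iterations, as described above.
fn aggregate_test_result(collections: &[Collection]) -> HashMap<String, f64> {
    let mut result: HashMap<String, f64> = HashMap::new();
    for collection in collections {
        for (metric, value) in collection {
            result
                .entry(metric.clone())
                .and_modify(|current| *current = current.min(*value))
                .or_insert(*value);
        }
    }
    result
}
```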
## Launching a multi-collector system
To support both multi-collector and multi-architecture execution, we propose
building a parallel system with separate DB tables: some for new concepts, some duplicating old data in a better format. This would allow us to ship code incrementally and test it in a live environment on a parallel collector. Once we have confirmed that the new system works as expected, we will remove the old code and switch over to the new system.
## Requirements
We want to support the following features, which affect the design of the whole system:
- The *artifact* (commit SHA of the compiler) is used as the unique identifier of benchmark results (of the *run*).
- When a *benchmark request* is created, it might request *test cases* that were not previously benchmarked for its parent commit (e.g. Cranelift codegen backend). In that case it should be possible to *backfill* additional *test cases* into the *run* of the parent commit, even though it was benchmarked previously and marked as finished.
- However, it will not be possible to remove old results, only to append to them, and only for *test cases* that were missing previously (see the sketch after this list).
- This should be useful for requesting non-default benchmark parameters on a PR, e.g. Clippy or rustdoc with JSON output.
- *Test cases* should be split into multiple subsets, so that each subset is always executed on exactly one *collector*.
- They should be split based on the *target* and a subset of *benchmarks*.
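A minimal sketch of the backfilling requirement, assuming a hypothetical string key per *test case*: the test cases to backfill are simply the set difference between what the request asks for and what the parent *run* already contains.
```rust
use std::collections::HashSet;

/// Illustrative only: a test case is represented here as a plain string key
/// (e.g. "serde;check;full;cranelift;x86_64-unknown-linux-gnu").
type TestCaseKey = String;

/// Given the test cases requested for a PR and the test cases already stored
/// for the parent run, return the ones that have to be backfilled into the
/// parent. Existing parent results are never removed, only appended to.
fn test_cases_to_backfill(
    requested: &HashSet<TestCaseKey>,
    already_benchmarked_for_parent: &HashSet<TestCaseKey>,
) -> HashSet<TestCaseKey> {
    requested
        .difference(already_benchmarked_for_parent)
        .cloned()
        .collect()
}
```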
## High-level design
From the outside, the whole system will behave quite similarly to how it did [before](https://kobzol.github.io/rust/rustc/2023/08/18/rustc-benchmark-suite.html). The website will make sure that try build benchmark requests from PRs, as well as master and published artifacts from our CI, are benchmarked in a timely manner.
Benchmarks are always recorded in the DB using *benchmark requests*. These can be created in two ways:
- When someone does `@rust-timer queue/build` on a PR, a benchmark request for a try build will be created and stored in the [benchmark_requests](#benchmark_requests-table) table, with the `waiting for artifacts` or `waiting for parent` status (`queue` vs `build`).
- For the `build` command, we should also check if a request for the same commit SHA wasn't already made previously. We can either error here or allow backfilling data (but this should be super rare).
- When the website notices a missing master/published artifact, it will also be stored in this table.
The website will run a periodic cron job (e.g. every minute) that will do a number of things for the different types of artifacts:
> Note that the descriptions here use some terminology described in the `benchmark_request` table. It's a dependency cycle and we have to unwrap it somewhere :)
### Published artifacts
The website will go through all recently published artifacts, and check whether they are done by looking at the `tag` and `status` columns in the `benchmark_request` table.
- If the request is already marked as `completed`, nothing happens.
- If the request is `in progress`, nothing happens.
- If the request is missing, it will be immediately inserted into the table and [*enqueued*](#Enqueuing-a-commit).
### Master artifacts
The website will go through all recent master commits, and check whether they are done by looking at the `tag` and `status` columns in the `benchmark_request` table.
- If the request is already marked as `completed`, nothing happens.
- If the request is `in progress`, check [request completion](#Checking-request-completion).
- If the request is `waiting for parent` commit benchmark to be completed, nothing happens.
- If the request is missing, we will recursively find the set of parent master commits that are missing data (by looking at their status in `benchmark_request`); see the sketch after this list.
- If the set is non-empty, these commits will be handled recursively with the same logic as this commit.
- If the set is empty, the request will be *enqueued*.
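A sketch of that recursive discovery, assuming hypothetical `has_benchmark_request` / `parent_master_commit_of` helpers (the real implementation would query the DB and the git history). Here "missing data" is simplified to "no `benchmark_request` row exists"; the real check might also consider the request's status.
```rust
use std::collections::HashSet;

/// Hypothetical lookups; in practice these would query the `benchmark_request`
/// table and the git history of the master branch.
fn has_benchmark_request(_sha: &str) -> bool { unimplemented!() }
fn parent_master_commit_of(_sha: &str) -> Option<String> { unimplemented!() }

/// Walk up the chain of master parents and collect every commit that has no
/// `benchmark_request` entry yet. If the returned set is empty, the commit
/// itself can be enqueued; otherwise the missing parents are handled first,
/// recursively, with the same logic.
fn missing_master_parents(sha: &str) -> HashSet<String> {
    let mut missing = HashSet::new();
    let mut current = parent_master_commit_of(sha);
    while let Some(parent) = current {
        if has_benchmark_request(&parent) {
            // A request already exists (in whatever state); the cron job will
            // drive it to completion, so the walk can stop here.
            break;
        }
        missing.insert(parent.clone());
        current = parent_master_commit_of(&parent);
    }
    missing
}
```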
### Try artifacts
> The logic for try artifacts can either happen both in cron and in the GH webhook listener (that receives `@rust-timer queue/build` notifications), or only in cron.
The website will go through all try artifacts in `benchmark_request` that are not yet marked as `completed`.
- If the request is `waiting for artifacts`, do nothing (sometime later a GH notification will switch the status to `waiting for parent` once the artifacts are ready).
- If the request is `waiting for parent`:
- Recursively find a set of **grandparent** master commits that are missing data (by looking at their status in `benchmark_request`). This could happen on the edge switch from `waiting for artifacts` to `waiting for parent` in the GH webhook handler, or it could happen in each cron invocation.
- If that set is empty, generate all necessary **parent** jobs and check if they are all completed in the `job_queue`.
- If yes, *enqueue* the request.
- If not, insert these jobs into the `job_queue`. This is where backfilling happens, as we can backfill e.g. new backends for a parent master commit that was previously only benchmarked with LLVM.
- If the request is `in progress`, check [request completion](#Checking-request-completion).
## Enqueuing a commit
Enqueuing a commit means two things:
1) Generate all jobs for the request.
2) Insert them into `job_queue` **and atomically** set the request's status to `in progress` (see the sketch below).
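A minimal sketch of the atomicity requirement, using a hypothetical `Db`/`Transaction` API rather than rustc-perf's actual database layer; the important property is that the job inserts and the status change commit together or not at all.
```rust
/// Hypothetical database handle; the real code would use rustc-perf's
/// existing connection/transaction types.
struct Db;
struct Transaction;

impl Db {
    fn begin(&self) -> Transaction { Transaction }
}
impl Transaction {
    fn insert_job(&mut self, _request_id: i64, _job: &Job) {}
    fn set_request_status(&mut self, _request_id: i64, _status: &str) {}
    fn commit(self) {}
}

/// A job as it will be stored in `job_queue` (simplified).
struct Job {
    target: String,
    backend: String,
    profile: String,
    benchmark_set: u32,
    collector_id: String,
}

/// Enqueue a benchmark request: generate all of its jobs and, in a single
/// transaction, insert them into `job_queue` and flip the request to
/// `in progress`.
fn enqueue_request(db: &Db, request_id: i64, jobs: Vec<Job>) {
    let mut tx = db.begin();
    for job in &jobs {
        tx.insert_job(request_id, job);
    }
    tx.set_request_status(request_id, "in progress");
    tx.commit();
}
```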
## Checking request completion
Once the website sees a try or a master request with status `in progress`, it will check if all its jobs in `job_queue` have been completed. We could either:
1) Store a FK for each job that links it to a single benchmark request. With this approach, we can simply query all jobs belonging to a given request and check if they are completed.
- This would however mean that "fake" backfilled jobs that were inserted into the DB for a master parent commit would link to a benchmark request that wouldn't be fully consistent with the job (e.g. a job with cranelift backend would link to a request that does not ask for cranelift). However, that might not be an issue :man-shrugging:
2) Alternatively, we can generate all jobs required for the try benchmark request and check whether all of them are in the DB and *completed*. This has the benefit that the collector wouldn't have to touch `benchmark_request` at all (but it shouldn't really matter, it would only read anyway).
Once we determine that a request has completed, we switch its state from `in progress` to `completed` and, if it was a try or master request, send a comment to its PR. A small sketch of this completion check follows.
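Whichever option we pick, the final check reduces to something like this sketch (status names taken from the job lifecycle described in the next section; it assumes the jobs for the request have already been generated):
```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum JobStatus {
    Queued,
    InProgress,
    Failed,
    Success,
}

/// A request counts as completed once every one of its jobs has finished,
/// i.e. ended up either `failed` or `success`.
fn request_is_completed(job_statuses: &[JobStatus]) -> bool {
    job_statuses
        .iter()
        .all(|status| matches!(status, JobStatus::Failed | JobStatus::Success))
}
```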
## Job lifecycle
When a job is inserted into the job queue, it starts in the status `queued`.
### Collector
Once a collector tries to pick up a job, it does the following (sketched in code after this list):
- If there is already a job for the collector in state `in progress`, it keeps that same status, but increments the retry counter, and then goes on to benchmark the job.
- If the retry counter reaches a predetermined maximum, the job is marked as `failed` instead.
- Invariant: there shouldn't ever be more than a single job in state `in progress` for a single collector.
- If there are jobs for the collector in state `queued`, it picks one up (according to the [job ordering](#Job-ordering)) and marks it as `in progress`.
- Note that each job already contains a predetermined collector ID, so two collectors shouldn't ever race on the same job.
- If the collector fails expectedly while benchmarking a job (i.e. it gets an error `Result`) and it thinks the error is unrecoverable, it marks the job as `failed`.
- If the collector fails unexpectedly during benchmarking a job (i.e. panic/crash), the job will stay at `in progress` and it should be picked up later once the collector restarts.
- If the benchmark job is successful, it is marked as `success`.
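A sketch of that pickup logic, with hypothetical queue accessors standing in for the actual SQL queries against `job_queue`:
```rust
const MAX_JOB_RETRIES: u32 = 3; // illustrative value

#[derive(Clone, Copy, PartialEq, Eq)]
enum JobStatus { Queued, InProgress, Failed, Success }

struct QueuedJob {
    id: i64,
    status: JobStatus,
    retry: u32,
}

/// Hypothetical queue access for a single collector; the real implementation
/// would be SQL queries against `job_queue` filtered by `collector_id`.
fn in_progress_job_for(_collector_id: &str) -> Option<QueuedJob> { unimplemented!() }
fn next_queued_job_for(_collector_id: &str) -> Option<QueuedJob> { unimplemented!() }
fn update_job(_job: &QueuedJob) { unimplemented!() }

/// Pick the next job for this collector, implementing the retry rules above.
fn pick_up_job(collector_id: &str) -> Option<QueuedJob> {
    // A job left `in progress` means the collector previously crashed on it:
    // bump the retry counter, or give up and mark it as failed.
    if let Some(mut job) = in_progress_job_for(collector_id) {
        if job.retry >= MAX_JOB_RETRIES {
            job.status = JobStatus::Failed;
            update_job(&job);
        } else {
            job.retry += 1;
            update_job(&job);
            return Some(job);
        }
    }
    // Otherwise take the first `queued` job assigned to this collector
    // (ordered as described in the job ordering section).
    if let Some(mut job) = next_queued_job_for(collector_id) {
        job.status = JobStatus::InProgress;
        update_job(&job);
        return Some(job);
    }
    None
}
```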
### Website
In the cron job, the website goes through benchmark requests that are marked as `in progress`. For each such request, it:
- Gets all jobs for that request.
- Finds out if they are all *completed* (either `failed` or `success`). If not, it bails out.
- If yes, and it's a master/try artifact, it sends a PR comment to GitHub with the result of the benchmark request.
It also looks for jobs with a non-NULL `completed_at` date and removes from `job_queue` those that were completed more than 30 days ago.
## `benchmark_request` table
This table stores permanent benchmark requests for try builds on PRs and for master and published artifacts. If any benchmarking happens (through the website), there has to be a record of it in `benchmark_request`.
Columns with `?` are `NULL`able.
| Column | Data Type |
|--------------|--------------|
| id | auto int |
| tag | text |
| parent_sha | text? |
| commit_type | text |
| pr | int |
| created_at | timestamptz |
| completed_at | timestamptz? |
| status | text |
| backends | text |
| profiles | text |
- `tag` represents the commit SHA for master/try artifacts and the release name for release artifacts.
- `commit_type` is `master`/`try`/`release`.
- `completed_at` is set when `status` becomes `completed`.
- The benchmark parameters included in this table determine what we can backfill.
- `backends` => backfill Cranelift
- `profiles` => backfill Clippy/DocJson
The `status` of the request can be one of the following (sketched as an enum after the list):
- `waiting for artifacts`: a try build is waiting until CI produces the artifacts needed for benchmarking
- `waiting for parent`:
- master artifact waits for all its (grand)parent benchmark requests to be completed
- try artifact waits for all its (grand)parent benchmark requests to be completed, plus optionally for all its direct parent jobs to be completed (due to backfilling)
- `in progress`: jobs for this request are currently in `job_queue`, waiting to be benchmarked
- `completed`: all jobs have been completed, and a GH PR comment was sent for try/master builds
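For illustration, the `status` column could map onto an enum like this (a sketch; the actual names in the implementation may differ):
```rust
/// Possible values of the `status` column of `benchmark_request`.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum BenchmarkRequestStatus {
    WaitingForArtifacts,
    WaitingForParent,
    InProgress,
    Completed,
}

impl BenchmarkRequestStatus {
    /// Textual form stored in the database.
    fn as_str(&self) -> &'static str {
        match self {
            Self::WaitingForArtifacts => "waiting for artifacts",
            Self::WaitingForParent => "waiting for parent",
            Self::InProgress => "in progress",
            Self::Completed => "completed",
        }
    }
}
```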
### Benchmark requests ordering
We need to figure out how to construct a "virtual queue" to display on the status page. This queue is also used to estimate when a given benchmark request will finish. A code sketch of the ordering follows the list.
1) In-progress requests
- Sort them by start time, then by PR number
2) Release requests
- Sort them by release date, then by name
3) Requests whose parent is ready
- Do a topological sort (topological index = transitive number of parents that are not finished yet)
- Order by topological index, type (master before try), then PR number, then `created_at`
4) Requests that are waiting for artifacts
- Order by PR number, then `created_at`
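A sketch of this ordering as a comparison function; field names are illustrative, and the "parent is ready" group is simplified to "everything else":
```rust
use std::cmp::Ordering;

/// Simplified view of a benchmark request, just for ordering purposes.
struct Request {
    status: Status,
    commit_type: CommitType,
    pr: u32,
    created_at: u64,           // unix timestamp, illustrative
    started_at: Option<u64>,   // set once the request is in progress
    release_date: Option<u64>, // set for release artifacts
    name: String,              // tag: commit SHA or release name
    unfinished_parents: usize, // transitive number of unfinished parents
}

#[derive(PartialEq, Eq)]
enum Status { WaitingForArtifacts, WaitingForParent, InProgress, Completed }

// Declaration order gives the desired "release before master before try".
#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum CommitType { Release, Master, Try }

/// Group index: in-progress first, then releases, then requests whose parent
/// is ready (simplified to "everything else"), then requests still waiting
/// for artifacts.
fn group(r: &Request) -> u8 {
    match r.status {
        Status::InProgress => 0,
        _ if r.commit_type == CommitType::Release => 1,
        Status::WaitingForArtifacts => 3,
        _ => 2,
    }
}

fn queue_order(a: &Request, b: &Request) -> Ordering {
    group(a).cmp(&group(b)).then_with(|| match group(a) {
        0 => a.started_at.cmp(&b.started_at).then(a.pr.cmp(&b.pr)),
        1 => a.release_date.cmp(&b.release_date).then(a.name.cmp(&b.name)),
        2 => a
            .unfinished_parents
            .cmp(&b.unfinished_parents)
            .then(a.commit_type.cmp(&b.commit_type))
            .then(a.pr.cmp(&b.pr))
            .then(a.created_at.cmp(&b.created_at)),
        _ => a.pr.cmp(&b.pr).then(a.created_at.cmp(&b.created_at)),
    })
}
```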
## `job_queue` table
This table stores ephemeral benchmark jobs, which specifically tell the collector which benchmarks it should execute. The jobs will be kept in the table for ~30 days after being completed, so that we can quickly figure out what master parent jobs we need to backfill when handling try builds.
If a backfill is requested after 30 days (which should be incredibly rare), new jobs will be created. That shouldn't matter: the collector will pick them up, do essentially a no-op (because the test results will already be in the DB), and mark the jobs as finished, at which point they will stay in the queue for another 30 days.
The table keeps the following invariant: each job stored into it has all its corresponding parent test cases benchmarked and stored in the DB.
| Column | Data Type |
|-----------------|-------------|
| id | auto int |
| request_id | FK to `benchmark_request` |
| target | text |
| backend | text |
| profile | text |
| benchmark_set | int |
| collector_id | text |
| started_at | timestamptz? |
| completed_at | timestamptz? |
| status | text |
| retry | int |
| error | text? |
- `request_id` is a FK that allows fetching commit SHA, PR number and commit type
- `collector_id` could alternatively be a FK to the `collector_config` table
- `status` is one of `queued`, `in progress`, `failed`, `success`.
- `retry` marks the number of times the job has been retried after failing. A job can be retried up to a predetermined maximum (`MAX_JOB_RETRIES`).
- `error` contains a "global" error that happened during the job. Benchmark errors are actually stored in a separate `errors` table, which links to a given artifact and benchmark. But there can also be non-benchmark errors, such as failure to download an artifact from CI (the most common error). That would be stored here.
### Job ordering
When a collector determines what job to pull from the queue, it should:
- Filter only jobs with `status` in TODO
- Order them by (`commit_type`, `pr`, `created_at`, `sha`)
- `commit_type`: "release" then "master" then "try"
---
REST of the document (@kobzol ended here)
---
## High-level diagrammatic overview

*Figure: overview of the job queue hierarchy.*
This structure is then used in combination with the following, which allows multiple collectors to read configuration to determine which jobs they should take:

*Figure: overview of how a collector consults some intermediary tables to know which benchmark "test iterations" it should perform.*
## `benchmark_set`
A `benchmark_set` tells a collector which benchmarks it must run.
If, for example, the crate `serde` belongs to set 1, the collector assigned to set 1 will run every combination of profile, scenario, backend, and target for serde. Thus, when both the Cranelift and LLVM backends are requested, that same collector handles them all.
The configuration will be hardcoded in the GitHub repository and changes to it will be made through pull requests. This saves us from configuring things at the database level, where changes have little visibility.
Some of the downsides of this static dependency:
- If a collector goes offline, the queue stalls. We could mitigate this by having an idle collector take over the job; if both collectors ran all jobs over time, we would know how long it takes each collector to benchmark a particular job. Alternatively, we could send a message to Zulip to notify us.
- An open question: if we decommission a collector, what do we do with the old results? How do we re-balance the jobs?
## Benchmark set schema
While somewhat verbose, the schema below provides a way for a collector to look
up which jobs it should be benchmarking. The simplest representation would be a
`benchmark_set_id` and a list of strings for the jobs. However, to allow
for future extensibility (perhaps some jobs have special configuration), a job
object with a `"name": "<job_name>"` pairing seems a reasonable
starting point. Currently, the split between compile-time and runtime benchmarks
is made by directory; it could be encoded here instead.
```json
{
  "<benchmark_set_id>": {
    "jobs": [
      { "name": "<job_name>" }
    ]
  }
}
```
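For illustration, loading this configuration in the collector could look like the following sketch (assuming the `serde`/`serde_json` crates; struct and function names are made up):
```rust
use std::collections::HashMap;

use serde::Deserialize;

/// One entry of the hardcoded benchmark set configuration shown above.
#[derive(Debug, Deserialize)]
struct BenchmarkSet {
    jobs: Vec<BenchmarkJob>,
}

#[derive(Debug, Deserialize)]
struct BenchmarkJob {
    name: String,
}

/// Parse the JSON configuration into a map keyed by benchmark set id.
fn parse_benchmark_sets(raw: &str) -> serde_json::Result<HashMap<String, BenchmarkSet>> {
    serde_json::from_str(raw)
}

fn main() -> serde_json::Result<()> {
    // Illustrative set id and benchmark names.
    let raw = r#"{ "1": { "jobs": [ { "name": "serde" }, { "name": "ripgrep" } ] } }"#;
    let sets = parse_benchmark_sets(raw)?;
    println!("{:?}", sets.get("1"));
    Ok(())
}
```
Keeping the parsed representation a plain map keyed by set id keeps the collector's lookup of its own `benchmark_set_id` trivial.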
In the absence of an admin dashboard for maintaining rustc-perf configuration,
hardcoded JSON in the repository is seemingly the simplest approach.
If we need to change the configuration, we can do so by submitting PRs, which
gives us a form of audit trail for configuration changes, as opposed to
SSH-ing into a database to update a table, which is fairly opaque.
The downside of a hardcoded configuration is that it becomes another thing we
need to update when adding a new benchmark to the repo. We could add an
"Adding a New Benchmark" section to the README.md with a checklist that
describes the process for adding a new benchmark.
## How to split the benchmark jobs for multiple collectors?
For the purposes of discussion we will assume `COLLECTOR_PER_TARGET_COUNT = 2` and `JOB_COUNT = 4`.
We need to roughly determine how long each job takes in order to split the jobs
evenly between the collectors.
Say we have the following jobs:
- `A` 4 mins
- `B` 20 mins
- `C` 6 mins
- `D` 10 mins
In this case, one collector would take jobs `A`, `C` and `D`, and the other would take
`B`, as this is a perfect 20-minute split per collector. If there were only one
collector, it would have to take all of the jobs. To set this split up we would
need to compose a list of all the jobs together with estimates of how long each
one takes.
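One simple way to compute such a split is a greedy "longest job first" assignment, sketched below with the durations from the example above (job names and estimates are illustrative):
```rust
/// Estimated duration of a job, in minutes (illustrative).
struct JobEstimate {
    name: &'static str,
    minutes: u32,
}

/// Greedy "longest processing time first" split: sort jobs by descending
/// duration and always give the next job to the currently least-loaded
/// collector. Not guaranteed to be optimal, but good enough for a rough
/// balance, and it trivially handles the single-collector case.
fn split_jobs(mut jobs: Vec<JobEstimate>, collector_count: usize) -> Vec<Vec<JobEstimate>> {
    let mut buckets: Vec<Vec<JobEstimate>> = (0..collector_count).map(|_| Vec::new()).collect();
    let mut loads = vec![0u32; collector_count];
    jobs.sort_by(|a, b| b.minutes.cmp(&a.minutes));
    for job in jobs {
        // Index of the collector with the smallest total load so far.
        let (idx, _) = loads
            .iter()
            .enumerate()
            .min_by_key(|(_, load)| **load)
            .expect("collector_count must be > 0");
        loads[idx] += job.minutes;
        buckets[idx].push(job);
    }
    buckets
}

fn main() {
    // The example from the text: A=4, B=20, C=6, D=10 split over 2 collectors.
    let jobs = vec![
        JobEstimate { name: "A", minutes: 4 },
        JobEstimate { name: "B", minutes: 20 },
        JobEstimate { name: "C", minutes: 6 },
        JobEstimate { name: "D", minutes: 10 },
    ];
    for (i, bucket) in split_jobs(jobs, 2).into_iter().enumerate() {
        let names: Vec<_> = bucket.iter().map(|j| j.name).collect();
        let total: u32 = bucket.iter().map(|j| j.minutes).sum();
        println!("collector {i}: {names:?} ({total} min)");
    }
}
```
With the four example jobs this assigns `B` to one collector and `D`, `C`, `A` to the other, matching the 20-minute split described above.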
## Information about collectors
We might want to store some information about the collectors that are
running, so that we can identify which collector ran which job. This may also be
useful for internal bookkeeping. The table below includes a suggested
`last_heartbeat_at` column so that we can detect when a collector has gone offline:
the collector would run a cron job that periodically updates this timestamp, and the
website, which is responsible for queueing work, would use it to determine whether
the collector is still alive. `is_active` denotes whether the collector should be
used for benchmarking. A sketch of the liveness check follows the table.
**`collector_config` table**
| Column | Data Type |
|--------|------|
| id | UUID |
| target | TEXT |
| date_added | TIMESTAMPTZ |
| last_heartbeat_at | TIMESTAMPTZ |
| benchmark_set | UUID |
| is_active | BOOLEAN |
- `benchmark_set`: if this is NULL, it can be assumed that the collector should do all of the benchmarking
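A sketch of the website-side liveness check, assuming an illustrative five-minute timeout:
```rust
use std::time::{Duration, SystemTime};

/// How long a collector may go without a heartbeat before the website
/// considers it offline (illustrative threshold).
const HEARTBEAT_TIMEOUT: Duration = Duration::from_secs(5 * 60);

/// Simplified view of a `collector_config` row.
struct CollectorConfig {
    is_active: bool,
    last_heartbeat_at: SystemTime,
}

/// The website-side liveness check: a collector is usable only if it is
/// marked active and has sent a heartbeat recently enough.
fn collector_is_alive(collector: &CollectorConfig, now: SystemTime) -> bool {
    if !collector.is_active {
        return false;
    }
    match now.duration_since(collector.last_heartbeat_at) {
        Ok(elapsed) => elapsed <= HEARTBEAT_TIMEOUT,
        // A heartbeat "from the future" (clock skew) is treated as alive.
        Err(_) => true,
    }
}
```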
### Job Queue open questions (2025/06/04)
- How to handle iterations?
- How to handle errors? (retry count)
- Which benchmark parameters are in the job?
- target, backend
- **How to represent the benchmark sets?**
- How to represent runtime benchmarks and the special rustc benchmark?
### Debugging
Have a temporary page where we can inspect the contents and ordering of the queue.