Rust-lang CI issues

This document briefly describes various problems that plague the CI of rust-lang/rust. I don't propose any solutions here; I just enumerate what I consider to be the problematic aspects of our CI.

Amount of (wasted) work

  • We rebuild the universe from scratch, even if we only change a single comment that does not affect functionality. More generally, we have almost no caching in CI, apart from caching the root Docker images on Linux and using sccache for LLVM builds.
  • We also recompile dependencies many times. The dist builders build a bunch of tools and crates, and AFAIK these do not share their target directories, so the same dependencies are compiled over and over. I checked a random dist build and it built the same version of syn around six times.
  • For every rollup, we unroll each PR contained in it and run a separate Linux try build for it. This is done unconditionally, even if the PR is not marked as a potential regression in any way, and in fact we never run any benchmarks on most of these unrolled PRs. To be fair, these builds also help with bisection, not just perf.
  • We build all dist artifacts on each merge. Most of these artifacts will never be used by anyone, apart from those tagged as nightly/beta/stable. That being said, it does make development and maintenance of rustc simpler and allows quick bisection, so it is probably worth it. But do we really need to build rust-analyzer for every PR?
  • The CI success rate of our auto builds is barely above 50%, which means that we effectively waste half of the CI time we spend.
  • Our CI jobs run completely independently of one another ("embarrassingly parallel"), which means they duplicate a lot of work. The most prominent example is every runner doing the initial stage1 build from scratch.
  • We are very liberal in adding new CI jobs and additional work for our CI to do. While this is good for Rust users, it does increase our CI costs. We also build CI artifacts for all kinds of targets (like the S390x IBM mainframe target).
  • The Rust compiler is notoriously not fast, especially for clean builds, and its speed fundamentally affects the speed of our CI. The CI could benefit from the compiler being better optimized, or at least configured to be faster (e.g. by using the LLD linker across the board).
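As a small illustration of that last point, bootstrap's config.toml already exposes knobs that trade some configuration effort for CI time. A sketch (the option names are as I recall them and should be double-checked against config.example.toml in rust-lang/rust):

```toml
# Sketch of CI-oriented speed settings for bootstrap's config.toml.
# Option names/locations are from memory; verify against config.example.toml.

[llvm]
# Reuse a prebuilt LLVM from CI instead of building it from scratch.
download-ci-llvm = true

[rust]
# Link with LLD, which is typically much faster than the default linker.
use-lld = true
```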

Maintainability

  • It is very difficult to figure out "what does our CI actually run". The definition of CI jobs is spread across literally hundreds of files: jobs.yml, ci.yml, calculate-job-matrix.py, one Makefile, dozens of bash scripts, dozens of Dockerfiles, and also essentially the whole of bootstrap.
  • It is not easy to reproduce CI jobs exactly on a local machine. While it works ok-ish for Linux jobs, you still cannot currently easily execute the same thing as CI does, mostly because of missing or different environment variables.
  • The Dockerfiles are not deduplicated. We have dozens of Dockerfiles for Linux jobs that use almost arbitrary versions of their base images and install arbitrary versions of dependencies, without any shared common base. This makes it challenging to maintain these images and to update them regularly.
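One way to reduce that duplication could be a shared base image that the per-job Dockerfiles extend. A hypothetical sketch (the image name and package set below are made up for illustration; they don't correspond to any existing file in rust-lang/rust):

```dockerfile
# Hypothetical shared base image for Linux CI jobs: pins a single distro
# version and the packages that virtually every job needs.
FROM ubuntu:22.04
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      build-essential cmake curl git python3 ninja-build && \
    rm -rf /var/lib/apt/lists/*
```

A per-job Dockerfile would then start with `FROM rust-ci-linux-base` and add only its job-specific dependencies, so bumping the distro or a common package becomes a single-file change.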

Observability

  • We don't really know what tests we even run on CI. Are any important tests skipped? Do we run all tests on all Tier 1 targets? Hard to say. We do in fact store this information (in metrics.json), but we don't have any tooling to visualize it in an aggregated way or to send alerts if something regresses (see below).
  • We don't know exactly how many resources we spend. This recently got better with the DataDog integration, but it's still quite basic, as we don't yet have information about which parts of a CI job we spend the most time on.
  • We do not have any CI alerts. It has happened a few times that the duration or consumed resources of some of our CI jobs regressed heavily without anyone noticing for days or weeks (it's possible that such regressions have also happened in the actual tests that we run; maybe we started skipping something important!).
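As an illustration of the kind of tooling that is missing, here is a tiny aggregator over a simplified, metrics.json-like structure. The real schema produced by bootstrap differs; every field name below is a placeholder for the sketch:

```python
# Sketch of an aggregator for CI test metrics. The input structure only
# *mimics* metrics.json: the real schema produced by bootstrap is
# different, so treat every field name here as hypothetical.

def aggregate_tests(metrics):
    """Sum passed/ignored/failed counts over all test suites in all invocations."""
    totals = {"passed": 0, "ignored": 0, "failed": 0}
    for invocation in metrics.get("invocations", []):
        for child in invocation.get("children", []):
            if child.get("kind") == "test_suite":
                for key in totals:
                    totals[key] += child.get(key, 0)
    return totals

# Tiny in-memory example standing in for a parsed metrics.json file.
sample = {
    "invocations": [
        {
            "cmdline": "x.py test",
            "children": [
                {"kind": "test_suite", "name": "tests/ui",
                 "passed": 17000, "ignored": 120, "failed": 0},
                {"kind": "test_suite", "name": "tests/codegen",
                 "passed": 800, "ignored": 5, "failed": 0},
            ],
        }
    ],
}

print(aggregate_tests(sample))  # {'passed': 17800, 'ignored': 125, 'failed': 0}
```

Comparing such aggregates between two builds would already be enough to raise an alert when a suite silently shrinks or disappears.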