Cargo Artifact Caching

GitHub issues

  • #5931 - per-user compiled artifact cache
  • #8716 - Reconsider RUSTFLAGS artifact caching

Scope for hackathon

  • Single-user package cache (no cross-user or cross-machine sharing)
  • Single shared cache directory (no daemon, no GC, no pruning)
  • Registry crates only (local / git / etc. crates have a longer tail of issues)
  • No changes to: local target dir layout, fingerprint definition, metadata definition
  • Don't support RUSTFLAGS yet (not currently included in file hash, treated specially by Cargo)
    • There is a possible low-effort option to support it, but not for MVP

Hackathon outcome

hackathon branch in Arlo's fork: https://github.com/arlosi/cargo/tree/hackathon
Comparison on GitHub: https://github.com/rust-lang/cargo/compare/master...arlosi:cargo:hackathon?expand=1

What works?

cargo build -Z shared-user-cache supports caching of registry crates that have no build script and whose dependencies are themselves registry crates without build scripts.

Cached files get copied out of the target directory when put into the cache, and get copied back (or reflinked, if that's supported) into the target directory when fetching from the cache. So the general operation of cargo and rustc is unaltered.
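
For illustration, here is a minimal sketch of the "copy back, or reflink when supported" step. It assumes the reflink crate's reflink_or_copy helper, which may or may not be what the hackathon branch actually uses:

    // Sketch only: restore one cached output file into the target directory,
    // preferring a copy-on-write clone where the filesystem supports it.
    use std::io;
    use std::path::Path;

    fn restore_output(cache_src: &Path, target_dst: &Path) -> io::Result<()> {
        // reflink_or_copy tries a reflink first and silently falls back to a
        // regular byte-for-byte copy on filesystems without reflink support.
        let _ = reflink::reflink_or_copy(cache_src, target_dst)?;
        Ok(())
    }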

The caching is implemented with cacache, which handles the concurrency issues nicely. This is a new Cargo dependency, but it is fairly mature. The maintainer added a sync-only feature for us during the hackathon, allowing us to use the crate without bringing an async runtime into Cargo.
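
As a rough illustration of how cacache is used (essentially as a key-value store via its synchronous API), here is a minimal sketch; the cache directory, key, and payload below are placeholders, not what the branch actually stores:

    // Minimal cacache usage sketch; not the actual keys/values used by the branch.
    fn cacache_demo() -> Result<(), cacache::Error> {
        let cache_dir = "/path/to/shared-user-cache";

        // Store some bytes under a key (in our case derived from the fingerprint);
        // cacache takes care of atomicity and concurrent access, and returns the
        // content hash ("integrity") of what was stored.
        let _integrity =
            cacache::write_sync(cache_dir, "example-fingerprint-key", b"artifact bytes")?;

        // Read them back, possibly from another process.
        let data = cacache::read_sync(cache_dir, "example-fingerprint-key")?;
        assert_eq!(data, b"artifact bytes".to_vec());
        Ok(())
    }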

There is a Cache trait, roughly sketched after this list, which defines:

  • get that takes a fingerprint and a list of output files, and returns "true" if those output files could be populated from cache (or "false" if there was a miss);
  • put which takes the same, and populates the cache if no entry for the given key yet exists.
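
The shape is roughly as sketched below. The concrete types on the hackathon branch differ; CargoResult and Fingerprint are Cargo's internal types (stubbed here so the sketch is self-contained), and the output-file representation is a guess:

    use std::path::PathBuf;

    // Stand-ins for cargo's internal types, just to keep the sketch self-contained.
    type CargoResult<T> = anyhow::Result<T>;
    struct Fingerprint; // cargo's real Fingerprint struct, elided here

    trait Cache {
        /// Try to populate `outputs` from the entry keyed by `fingerprint`.
        /// Ok(true) means a hit (files were restored); Ok(false) means a miss.
        fn get(&self, fingerprint: &Fingerprint, outputs: &[PathBuf]) -> CargoResult<bool>;

        /// Store `outputs` under the key derived from `fingerprint`, unless an
        /// entry for that key already exists.
        fn put(&self, fingerprint: &Fingerprint, outputs: &[PathBuf]) -> CargoResult<()>;
    }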

We are open to all feedback about whether we are using enough of the current fingerprint in our cache key.

Limitations

The following is considered "uncacheable" and, thus, anything that depends on one of these is also uncacheable:

  • Anything other than registry crates (as agreed prior to the hackathon).
  • The outputs of build scripts and proc macros.
  • Anything built with RUSTFLAGS set (other than a couple of minor exemptions).

Is it faster?

Slightly. rustc is shockingly quick at rlib compilation, and most build time is taken by the final binary compilation. So we see modest wins (10-20% of end-to-end wall clock time) when doing a clean build if the cache is prepopulated. We did observe that the build script binaries for proc-macro2, windows-sys, serde, serde_json and syn were all getting cached during our testing.

Where is the caching implemented?

The fingerprint::calculate function previously built a Fingerprint and validated that the local filesystem dependencies were up to date.

This function is now extended to hit the cache for any Fingerprint that is not UpToDate after the fingerprint::check_filesystem check. If a cache hit occurs, the fingerprint gets set to a new FsStatus::LoadedFromCache status.

Subsequently, the fingerprint::compare_old_fingerprint method:

  • Writes out a new on-disk Fingerprint for any crate that is LoadedFromCache but doesn't have an on-disk fingerprint yet.
  • Writes out a .cached marker file; if that file exists the next time fingerprint::check_filesystem runs, that method will skip checking any local deps- fingerprint files.

Consequently, crates restored from the cache have no deps- fingerprint files at all.
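
In pseudocode, the flow looks roughly like this; all of the types and helpers below are simplified stand-ins rather than cargo's real signatures:

    // Greatly simplified sketch of the modified flow described above.
    use std::path::PathBuf;

    #[derive(PartialEq)]
    enum FsStatus {
        Stale,
        UpToDate,
        LoadedFromCache, // the new status introduced by the shared cache work
    }

    struct Fingerprint {
        fs_status: FsStatus,
    }

    // After check_filesystem has run, try the shared cache for anything stale.
    fn try_cache(
        fp: &mut Fingerprint,
        outputs: &[PathBuf],
        cache_hit: impl Fn(&[PathBuf]) -> bool,
    ) {
        if fp.fs_status != FsStatus::UpToDate && cache_hit(outputs) {
            // Cache hit: the output files have been restored into the target dir.
            fp.fs_status = FsStatus::LoadedFromCache;
        }
    }

    // Later, compare_old_fingerprint finishes the bookkeeping for cached crates.
    fn finish_cached_crate(fp: &Fingerprint, has_on_disk_fingerprint: bool) {
        if fp.fs_status == FsStatus::LoadedFromCache {
            if !has_on_disk_fingerprint {
                // write out a fresh on-disk Fingerprint for this crate
            }
            // write the `.cached` marker file so that the next run of
            // check_filesystem skips the (absent) deps- fingerprint files
        }
    }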

TL;DR: compared to local crates, after a cache hit:

  • Cached crates have exactly the same output files in the same target_dir locations as local crates.
  • Cached crates have exactly the same crate-level Fingerprints as local crates.
  • Cached crates have .cached marker files, whereas local crates do not.
  • Local crates have deps- fingerprint files (one per dependency), whereas cached crates do not.

This seemed the simplest implementation to us, and seems reasonably correct, but we are open to horrified cries of "WHAT WERE YOU THINKING?" from people who have been working with this code for years ;-)

What didn't happen?

We were considering restructuring the directory layout of a cached crate, to colocate all files from the crate; this would have helped make the file layout more understandable.

But we wound up handling caching by copying into and out of the target directory, which meant we had to preserve the target directory's existing layout, so we couldn't reorganize it. And we wanted to use cacache with minimal overhead, so we leave the cache directory layout entirely up to cacache, which uses its own organizing scheme.

What remains to be done before submitting a PR?

Especially want Cargo team feedback here.

  • Validation that the cache key is correct, and comprehensive.
  • Validation that the set of things that makes a crate "uncacheable" is complete.
  • More tests, certainly. Open to feedback on what testing should be added.

Questions:

  • Are any of the tasks below critical enough to have in the first PR?
    • For example, is cache GC/pruning a requirement for even the first version?
    • Likewise, how about multiple providers?
  • Any concerns with adding cacache as a dependency?

What remains to be done before stabilization?

Especially want Cargo team feedback here. MSFT team wants to get this to stabilization but needs to understand the scope.

Overall cache concerns: the layout of the cache isn't too important until there is GC. Once there is GC, we have to track state related to it, and if there is an unstable feature related to GC, the build cache has to track those extra complexities too. We should understand how this new cache (or new cache framework) fits in with all the current ones.
The target directory should not be tracked file-by-file; it should be tracked on a per-package basis. Then we wouldn't need cacache and could use Eric's new locking system instead.
Figuring out which package owns a random target-dir file is painful, because you have to build an entire database of the meaning and relationships of every file, essentially reverse engineering the purpose of each file into the DB.
How do we extend the cache API for GC concerns?
Currently, caching on a per-file basis doesn't work well because the paths involved differ depending on the target dir. What if you use --remap-path-prefix? That doesn't quite work either, because if the remap argument changes you have to invalidate the cache.
The usage of cacache isn't really leveraging it as a content-addressable store; it's more just a filesystem-atomic key-value store. Something that was simply that might avoid some potential complexity.
epage: There's an active PR that shifts to three lock modes: one for GC (exclusive from everything), one multi-reader mode, and one "cache append" mode (exclusive with GC and with other appenders, but atomic with respect to readers). The reader-writer aspect of that model could work here.
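
(As an illustration, those three modes might look roughly like the enum below; the names are invented for this sketch and are not taken from the PR.)

    // Illustrative only; not the PR's actual types or names.
    enum CacheLockMode {
        /// GC / pruning: exclusive with respect to every other mode.
        GcExclusive,
        /// Normal cache reads: any number of builds may hold this concurrently.
        Shared,
        /// "Cache append": excludes GC and other appenders, but can run
        /// alongside Shared readers because each append is atomic.
        Append,
    }
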
epage: Target dir revisions: it makes sense that this was out of scope for the hackathon, but if we do this before revising the target dir structure, it will add complexity later. danpao: Not sure? We can change the layout later on. epage: It depends on what granularity the GC has to track resources at.
Reflinking/softlinking is not supported on NTFS, and hardlinking is scary.
Rob: Is it possible to wrap some kind of decent provider abstraction around the existing target directory code? The existing target directory is a cache; can we modularize it as such? epage: We probably don't want to put hashes in a directory name, etc. Would like to see the target directory reworked so that the caching model and the target directory model are aligned. Ideally we could move directly from the target directory to the cache, with no complex transformation between file layouts like today.
Target dir backcompat: what do we do? There are a lot of tools that inspect or use target-dir files (cargo expand, etc.; dozens of them), and all of them would break if the layout changed. We don't actually know how to solve that problem. Unstable flags generate output that ends up in the deps folder in particular places. In the most extreme form, only intermediate artifacts would be cached, and only final artifacts would end up in the target directory.
CAN we change the target dir layout? It's unstable, but do we actually want to break the dependents in reality? Do cargo expand and other tools have the ability to decouple from the layout? One option: a config setting for a new target-dir layout that's off by default, allowing the two layouts to exist side by side. Hopefully this will also simplify things once we get to GC, because we could sanely drop directories that correspond 1-1 to packages.
Scott: How do we change the target directory? Could you have a crate that helps you find the files related to a crate? Build an API around accessing the files. Arlo: Similar to thoughts around build.rs?
eh2406: The fingerprint directory is fairly terrible: lots of tiny files. If SQLite is coming in anyway, could there be one SQLite database? Eric: The intent in the plan is to eventually get there. eh2406: We currently have a bad design that's hard to interact with; would like to move to a cleaner API.
Rob: Is it possible to do a "cache extreme" mode that builds a simple filesystem provider with the target-dir layout we want? And could we use that for prototyping the new target-dir layout altogether, leaving the old target dir empty? It doesn't solve the backcompat problems for tools dependent on the old layout. eh2406: We will have to do backcompat hacks, but could at least start with the simple desired target-dir layout in the new world.
Internal exploration of build-script caching? These are good things to look at during the stabilization window for the MVP.

cacache's dependencies aren't too bad for a first PR; it would be OK for the first PR to include cacache. We could potentially merge what we have and then, through further work on it, incrementally rewrite it away from cacache. No GC prototyping in the first PR.
Make sure docs and tests are consistent and complete in the first PR.
Cargo already takes a .crate file and extracts it to a path; because the path is supposed to be unique, it doesn't record the hash of what went in there. That works great until a buggy registry changes the contents, causing interesting crashes. If there were a user-interpretable path plus a recorded hash of what was extracted, reads could fail on mismatch, so you'd get consistent reads (like a content-addressable store) while also keeping the advantages of a meaningful filesystem path. Transitioning from content-based to path-based storage could leverage local marker files (in the path-based location) that let us catch metadata consistency issues ASAP.

  • GC / pruning.

Eric is working on cleaning up the global registry cache to start with; the target directory comes later, perhaps much later. epage: Could look at checking liveness from workspaces and concrete roots, and at cleaning up entire target dirs.

  • Support for multiple cache providers?
  • Read-only provider for static content?

Eric: No, not for the stable MVP; let's add extensions and extensibility later. Though Arlo points out that we may want to prototype one before we stabilize.

  • Support for crates with build scripts?
    • Possible to add unstable flag for this support to allow experimentation in deterministic environments?
  • Other flags?
    • Should we support a flag to permit more parts of RUSTFLAGS as being cachable, for teams that expect their environments to be deterministic?
    • Should we have a flag to select whether to use integrity-checked copying out of the cache, or direct copying (for speed)?
  • Demonstrated GitHub implementation?
  • Any of the other tasks below?

NOT needed (per epage): the first stabilization does NOT need to support build scripts or proc macros. BUT we could do those as later unstable passes.

Post-hackathon tasks

  1. Add support for hard-linking?
  2. Implement content-comparing user-shared-cache provider
    • Report diagnostic details whenever non-matching content is stored for a given cache key
  3. Implement read-only provider for static content (allowing convenient delivery of prebuilt content)
  4. Include RUSTFLAGS in -C extra-filename
  5. Consider RUSTFLAGS_BINARY as a thing
  6. Some kind of Azure/network blob-store cache provider implementation.
  7. Some kind of GitHub Actions CI integration so CI can prepopulate your team's network cache

WHOLE SEPARATE HACKATHONS

Cargo asyncification

'Nuff said :-)

Cargo-Rustc contract revision

  • Some better contract between Cargo and rustc. It would be great not to have to expand archives all the time just for rustc.
  • Even better, a conversational protocol that would eliminate TOCTTOU bugs between rustc and cargo.
  • CLI & stdout for cargo<->rustc communication is bursting at the seams. Probably need to rethink that.
  • The lint table is moving to Cargo.toml, but we can't tell rustc what the source of those lints is at the moment.
  • Could rustc support direct file access from unexpanded tarballs? Could this align well with per-crate-compilation directories?
  • Currently we create a full checkout for every commit that gets built on the system. It would be even nicer if rustc could pull from git directly.
    • Virtualize rustc file access!
    • How does debugger find the file then?

Office Hours

https://meet.jit.si/CargoOfficeHours
1PM Pacific

Second office hours discussion notes (2023/10/05)

fill this in

First Office hours discussion notes (2023/09)

Ed's concern: Layout might not be the right thing here at all; it seems to serve a very different purpose. The target directory is meant to commingle everything.
Overall issue: how do you know what files to push/pull if there is a "shared cache target dir" (e.g. the hackathon reuses Layout as-is)?
The current Layout scatters per-package files all around it.
Could we instead have "per-crate Layouts", so crates still put their files in the relative locations they expect, but we have one per-crate parent directory that lets us track them all easily?

Where does a shared target dir currently cause self-footstepping? Could the hackathon tackle some of those places? Let's look for bugs.
When you refer to a relative path, it mostly stays relative to ensure that you can rename your project folder. But if you have a shared directory and two projects that both point to "/a", they have no idea about each other. Or, if a build target produces foo.exe, another project could collide there. "Registry crates only" simplifies this, however.

"Package has no build script and package does not depend on a proc macro." Those can both read from external state on the system. Perhaps allow packages to opt in to claiming they are deterministic.

At end of hackathon, let's shoot to land SOMETHING as unstable.

What are the issues with "multiple target directories"? Currently we tell rustc "look at this dir and slurp up what you need."
There might be two files that both look reasonable to rustc, so rustc would not be able to pick a winner.
Would need to either:

  • copy things out into some structure that works with rustc's behavior, or
  • we enumerate every metafile and rlib to rustc explicitly, so that we remove rustc's implicit behavior.
    It is not immediately clear how to discover all of rustc's implicit behavior.

A directory per package would require that sort of system.
A multiple-plugin system would also add to this complexity.

Once we have this, do we actually have most of a story for cargo binstall and for bundled packages of precached crates? (e.g. bundled packages = another LayoutScope that's static, not built into)

Ed: concern about community adoption if it's too corporate-focused.
Ed: concern that the cache won't have a high enough hit rate.
Eric is looking at cache GC: https://github.com/rust-lang/cargo/pull/12634
1 directory per cached item makes GC much easier.

Future possibilities

Network cache

Big opportunity for CI build-speed caching. The community could get behind this.

  • Speed up within CI
  • Problems with existing solutions
    • hard to know exactly what to cache
    • larger cache than needed slows things down
    • cache is all or nothing
  • Allow users read-only access to CI's cache
  • Concern: GHA's Windows networking is slow

Rob: suggests having multiple cache directories (tiered cache)
Ed: avoid the term "layout": we have a package hash; it doesn't have to coexist with other packages.

Would be really cool if there was a better interface between Cargo and rustc. Would make change detection easier and avoid TOCTTOU issues. Something beyond CLI and response files.

Reading directly from tarballs/git

If rustc & cargo could read directly out of tarballs could save some overhead.
Likewise could potentially avoid largely redundant git clones if could pull from git directly.
