Not keeping unresolved drvs outputs

---
title: Not keeping unresolved drvs outputs
tags: ca
---

# Suggested design

## Initial

1. We don't store anything about non-resolved derivations.
2. Whenever we realise a derivation, we first resolve it, and then query the db to get the information about the corresponding resolved derivation (if it exists).
3. This design is flawed because one can only gc-root individual resolved derivation mappings, which is not very useful.

## Final

1. Also have "composite" mappings for resolved derivations.
2. They refer to other mappings, with the leaf mappings in the graph being resolved derivations.
3. Composite mappings are most useful for GC.
4. Mappings can be signed.
   1. Initial mappings are signed straightforwardly.
   2. Composite mappings are signed *with their references*, e.g. one should build a merkle tree or something of the hashes of signatures for each child drv, and then include that tree root with the rest of the original composite mapping to sign.

These rules on composite signatures ensure that stores are not allowed to be more vague about what they trust, just because we introduced composite mappings for GC.
This specificity ("everything I believe is proven from resolved drv mapping axioms") makes them easier to audit.

An alternative design would be to only have composite mappings *internally* for GC purposes, and always exchange resolved derivation mappings between stores.
This achieves the same goal of no vague trust.

# Issues

Assume two derivations, `gcc` and `libfoo`, such that `libfoo` has a build-time dep on `gcc`.
Assume also that the build of `gcc` isn't deterministic.

What should happen if:

1. You build `libfoo`,
2. You gc `gcc` (but not `libfoo`),
3. You build `gcc` again?

## Initial version

Since we don't have nice GC pinning, let's be very explicit about what we keep before GCing:

1. Pin all mappings involved in building `libfoo`, so that would be
   - the libfoo-resolved mapping
   - the gcc-resolved mapping

If my understanding is correct, the content of the db will be:

- After step 1:
  - A mapping `resolved(gcc) -> outPath1(gcc)`
  - A mapping `resolved(libfoo) -> outPath1(libfoo)`
  - A ValidPath entry for `outPath1(gcc)` and `outPath1(libfoo)`
- After the gc:
  - A mapping `resolved(gcc) -> outPath1(gcc)` (somehow kept because we need it to re-resolve `libfoo` **How?**)
  - A mapping `resolved(libfoo) -> outPath1(libfoo)`
  - A ValidPath entry for `outPath1(libfoo)` (but not `outPath1(gcc)` as it has been gc'ed)

Then, the third step will fail:

1. The build for GCC itself will succeed.
2. But when we try to register the new mapping, it will fail because the old mapping is still there, and disagrees on the path/CA of what GCC produces.

We can however go back and delete the old mapping first, and then try again.
Now step 3 succeeds.
But we will have paid a price:

- The `libfoo` provenance has been "truncated".
  If one tries to combine resolved mappings to reduce the number of "black box" content-addressed paths with no deriver, they will get stuck with the big GCC output(s) that no longer have a deriver.
  - Note that this rebuilding can happen with or without the unresolved `libfoo` derivation.
    We can try to fit together mappings given an existing `libfoo` unresolved derivation, or we can just replace `inputSrcs` with `inputDrvs`, *creating* unresolved derivations as we go.
- More concretely, if one tries to rebuild `libfoo`, they will reuse the new `gcc` build, but we will have to rebuild `libfoo` as the old `libfoo` build used different outputs (by CA).
- In general, no build plan will be able to use both `gcc` and `libfoo` build artifacts without having separate (unresolved) GCC derivations to map to each of the non-deterministic outputs.

## Final

The basic facts about the conflict are the same, but the mapping GC story is nicer.

- By default, having built `libfoo` with an out-link, we will have a GC root (TODO decide mechanism) on the composite mapping from the *un*resolved `libfoo` to the final output.
  - This mapping will depend on the unresolved and resolved mappings of `gcc` and the resolved one of `libfoo`.
  - This mapping will keep everything alive during GC, and prevent any manual deletion of the resolved gcc mapping.
- We can still delete the composite mappings (for gcc and libfoo) and then delete the resolved-gcc mapping, as before.
  But the idea is that deleting the composite mapping constitutes *consent* from the user that conflicts may now happen.
- Notice that preventing conflicts the old way was opt-in (add a bunch of manual GC roots on individual resolved derivations), while with the new way it's opt-out (composite mappings from unresolved drvs are pinned by default).

# Commentary

Clearly the final design is better, but I think to get it right we need a bunch of good machinery in place:

- Some plan for mapping GC roots
  - Need something like symlinks for "out links"
- Some plan for mapping references
- (Optional, but John very much wants) a merkle tree of signatures for composite mappings, for auditing and to catch corruption in the mapping references graph.

This is a lot of prereqs!
Now we could try to skip more steps, but John is worried about trying to retrofit these strong guarantees of complete references and signatures after the fact.

Conversely, if we just focus on the resolved mappings, we have something which is crude (e.g. maybe you even lose all your mappings every GC, too bad) but also correct (e.g. without GC roots one *should* lose all their mappings).

I'm much more comfortable iterating on a design when we try to be correct every step of the way, implementing fewer features correctly rather than more features possibly-incorrectly.

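
To make the "mapping GC roots" and "mapping references" prerequisites a bit more concrete, here is a minimal Python sketch of the `gcc`/`libfoo` scenario. It is only an illustration under assumptions: the `Mapping` class, the `live_keys`, `collect` and `register` helpers, and the string keys are all hypothetical and do not correspond to anything in Nix itself. The sketch marks mappings reachable from a root through their references, and refuses to register a mapping that disagrees with an existing one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical mapping record: derivation key -> output path, plus references."""
    key: str            # e.g. "resolved(gcc)" or "unresolved(libfoo)"
    out_path: str
    refs: tuple = ()    # keys of the mappings this (composite) mapping depends on


def live_keys(db, roots):
    """Return the keys of all mappings reachable from the GC roots."""
    live, stack = set(), list(roots)
    while stack:
        key = stack.pop()
        if key in live or key not in db:
            continue
        live.add(key)
        stack.extend(db[key].refs)
    return live


def collect(db, roots):
    """Garbage-collect the mapping db, keeping only mappings reachable from roots."""
    live = live_keys(db, roots)
    return {k: m for k, m in db.items() if k in live}


def register(db, mapping):
    """Register a mapping, failing if an existing one disagrees on the output path."""
    old = db.get(mapping.key)
    if old is not None and old.out_path != mapping.out_path:
        raise ValueError(f"conflicting mapping for {mapping.key}")
    db[mapping.key] = mapping


# Step 1: build libfoo (which builds gcc first).
db = {}
for m in [
    Mapping("resolved(gcc)", "outPath1(gcc)"),
    Mapping("unresolved(gcc)", "outPath1(gcc)", ("resolved(gcc)",)),
    Mapping("resolved(libfoo)", "outPath1(libfoo)"),
    Mapping("unresolved(libfoo)", "outPath1(libfoo)",
            ("unresolved(gcc)", "resolved(gcc)", "resolved(libfoo)")),
]:
    register(db, m)

# Step 2: GC with the composite libfoo mapping as a root, and with no root at all.
pinned = collect(db, roots={"unresolved(libfoo)"})
unpinned = collect(db, roots=set())

# Step 3: a non-deterministic gcc rebuild produces a different output.
try:
    register(pinned, Mapping("resolved(gcc)", "outPath2(gcc)"))
    conflict = False
except ValueError:
    conflict = True  # the pinned old mapping disagrees with the new one
register(unpinned, Mapping("resolved(gcc)", "outPath2(gcc)"))
```

With the composite root in place every mapping survives the GC and the rebuilt `gcc` conflicts with the old mapping; without any root all mappings are lost and the new one registers cleanly, which matches the "crude but also correct" behaviour described above.
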
# Trust-or-gc

A weaker version of this is to store/sign the drvOutputId->outputPath mapping only for resolved derivations, and for the other ones just have a (signed) mapping to their resolved drv.

This means that we're not blindly trusting the output mappings for unresolved derivations (which amounts to trusting the build of the whole dependency tree in one block), but that the trust is divided between:

1. Trusting the dependencies of the build (the resolution function)
2. Trusting the build itself (the output mapping for the resolved derivation)

This means that as long as the runtime deps are available (either because we or a remote cache have them, or because we can rebuild them deterministically), we can recompute the resolved drv (and know that it's the right one) and use that to check the build.

What we lose compared to a stricter tracking of all the build inputs is that if we can't reproduce the correct resolved derivation, then we can't know which build input hasn't been reproduced, but:

1. I don't think it really has any security implication (knowing which input differs doesn't really give you any interesting information),
2. That seems to be very much an edge case (I think anyone really serious about the trust of their builds should keep the build outputs anyway),
3. It's always possible to keep more precise mappings in an external db for the 0.01% of users that could need it.

An interesting property of this approach is that because we don't re-resolve already built derivations, we can happily gc build-time inputs and rebuild them even if they aren't deterministic, but we can also check the whole closure of a derivation if we want to.
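
The two-level lookup described in this section can be sketched as follows. This is a minimal Python illustration under assumptions, not an actual Nix data structure: the `Store` class, its method names and the `resolved_id` hashing scheme are hypothetical, and signatures are left out entirely.

```python
import hashlib


def resolved_id(name, input_paths):
    """Hypothetical id of a resolved derivation: its name plus its concrete input paths."""
    payload = "\n".join([name, *sorted(input_paths)])
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


class Store:
    def __init__(self):
        self.resolutions = {}  # unresolved drv name -> resolved drv id
        self.outputs = {}      # resolved drv id -> output path

    def record_build(self, drv, input_paths, out_path):
        """Record both halves of the trust: the resolution and the build output."""
        rid = resolved_id(drv, input_paths)
        self.resolutions[drv] = rid
        self.outputs[rid] = out_path

    def lookup(self, drv):
        """Follow unresolved -> resolved -> output path without re-resolving."""
        rid = self.resolutions.get(drv)
        return None if rid is None else self.outputs.get(rid)

    def check_resolution(self, drv, input_paths):
        """Recompute the resolved drv from the available inputs and compare."""
        return self.resolutions.get(drv) == resolved_id(drv, input_paths)


store = Store()
store.record_build("gcc", [], "outPath1(gcc)")
store.record_build("libfoo", ["outPath1(gcc)"], "outPath1(libfoo)")

# gcc was gc'ed and rebuilt to a different path; libfoo is not re-resolved,
# so its recorded output is still found.
libfoo_out = store.lookup("libfoo")

# Checking the closure only succeeds with the inputs libfoo was really built from.
ok_with_original = store.check_resolution("libfoo", ["outPath1(gcc)"])
ok_with_rebuilt = store.check_resolution("libfoo", ["outPath2(gcc)"])
```

Here a failed check only says that the recomputed resolved derivation differs; as noted above, it does not say which build input failed to reproduce.
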