Trusting CA derivations realisations

---
title: Trusting CA derivations realisations
date: 2020-12-04
tags: ca
---

:::info
Quick vocabulary note: In here (and probably everywhere from now on because I like it)

1. the term **realisation** designates a triplet `(derivation, outputName, outputPath)` − what used to be called `DerivationOutput`;
2. a **derivation output** refers to the tuple `derivation!outputName` − what used to be called `DerivationOutputId`.
:::

# What to keep/which trust do we want?

## Coherence

There are a few sorts of coherence we might be interested in:

**Global coherence**: No (derivation, outputName) is assigned more than one path in the store.

**Build-time local coherence**: No (derivation, outputName) is assigned more than one path in any build-time closure.

**Run-time local coherence**: No (derivation, outputName) is assigned more than one path in any run-time closure.

Run-time coherence is more lenient than build-time coherence. Global coherence is always stronger than run-time coherence, and may also be stronger than build-time coherence if build closures are somehow tracked.

We definitely want run-time coherence; we *might* also want build-time coherence, although that's more debatable.

Global coherence is the easiest to implement because we can leverage SQL unique constraints.

## Traceability/reproducibility

Problem: I have a drv `libfoo.drv` that yielded an output path `/123-foo`, how can I verify this?

There are several ways in which this could be handled, depending on how granular we want to be (and thus how precisely we want to be able to pinpoint a possible corruption/non-determinism):

### Possible kinds of reproducibility

We can define three kinds of reproducibility for a derivation output:

1. **Pointwise reproducibility**: Every (transitive) input of the derivation is reproducible (or: if we take an arbitrary derivation in the build-time closure and rebuild it with the same inputs we will get the same realisations)
2. **Global reproducibility**: The output path is reproducible (or: if we take the whole derivation graph and rebuild it from scratch we will get the same result)
3. **Mixed reproducibility** (very bad name): The build-environment of the toplevel derivation is reproducible, and given that build-environment, the toplevel derivation itself is reproducible.

(Note that because the output path includes references to its runtime dependencies, global reproducibility implies that the whole runtime closure is reproducible).

Pointwise reproducibility is a stronger property than mixed reproducibility, which itself is stronger than global reproducibility.

In practice we're mostly interested in global reproducibility (build-time-only inputs are in a way just temporary garbage that doesn't matter, the same way it's fine for a concurrent program to have infinitely many different execution paths as long as these converge to the same result).

Global reproducibility is however a difficult property to ensure (in particular because its global nature makes it non-composable). It's also much harder to debug the cases where it doesn't hold − precisely because it's a global property, so the only thing we can say is “the whole closure isn't reproducible” without necessarily being able to pinpoint which derivation is at fault.

Local (pointwise) reproducibility, on the other hand, is really easy to check, and if it's broken then it's possible to pinpoint precisely which derivation couldn't be reproduced.

The consequence is that it might be interesting to be able to ensure local reproducibility.

Mixed reproducibility is a somewhat weird thing, probably not a worthy goal by itself, but interesting nonetheless because it can naturally emerge as an easy-to-check property and composes slightly better than global reproducibility.

### Requirements for checking reproducibility

What do we need to store about a derivation output closure to be able to check its reproducibility according to the definitions above?

If we're only interested in global reproducibility, all we need (apart from the derivation tree itself) is the realisation of the derivation output: `outPath(foo.drv!out) = /123-foo`. In particular we don't need to know to which derivation `foo.drv` was resolved.

Local (pointwise) reproducibility, on the other hand, requires much more information: we need to be able to trace all the steps that occurred during the build, meaning that we must be able to replay all the resolutions that happened, which in turn means that for every transitive build input we must know which of its realisations was used during the build.

### Which reproducibility do we want?

#### The case for no requirements

Input-addressed Nix works under the assumption of some (not formally defined and probably undefinable) form of *observational equivalence* between all the possible realisations of a derivation. The idea behind that is that although a derivation might not be deterministic (and in practice that's fairly often the case), all its possible realisations should have the same behavior given a “reasonable” usage.

Content-addressed Nix obviously works at its best when everything is fully reproducible, but we can still get it to perform reasonably under the same *observational equivalence* condition. That way, we can keep most of the ease-of-use of ia-Nix, while still benefiting from most advantages of ca-Nix.

This means that non-determinism should be as invisible as possible where it can be. In particular, a non-deterministic intermediate step (a build-time-only input) shouldn't matter at all for the final build output. Or more precisely, once a derivation is realised, it shouldn't have to remember which realisations of its inputs have been used (as the assumption is that these are all equivalent, so it doesn't matter which one was used).

So Nix shouldn't need any reproducibility guarantee to work.

#### The case for strict requirements

Regardless of storage or build savings, content-addressing is really useful for security and auditing: if we know the exact hashes of the inputs that have been used for a build, then we can know which inputs have been used (or at least check that they match what we expect), meaning that we can check each build in isolation to ensure that it hasn't been tampered with.

In other words, the possibility of checking **pointwise reproducibility** is a must-have for auditing purposes.

### One schema to rule them all

A summary of the above is that we shouldn't enforce anything (or at least not by default), but should still provide a way to reconstruct the whole history of a given realisation of a derivation output.

A possibility for that would be to store two (possibly conflicting, but that's the point of checking it) things:

1. A mapping from each realised derivation output to its output path. This should serve as the “source of truth” for the general usage.
2. For each realised derivation, a pointer to the realisation of each of its dependencies that has been used for the build.

The first mapping would be a “weak” source of truth in that it would generally be considered as the source of truth, but we could check it against the information given by the second mapping (rebuilding stuff if needed) to ensure its validity.

To avoid the DDD issue (`hello` has been built with a realisation of `gcc` that isn't available anymore, and rebuilding `gcc` leads to a new realisation, so I can't register that without having two conflicting entries), I suggest that we decouple the knowledge of realisations from their trust as valid existing derivations. So we'd have two tables, one that lists all the realisations that have been known at some point, and one that lists the ones that are known and trusted to be valid right now. This means that it's fine for a valid realisation to refer to a potentially-invalid one.

The final schema would look something like:

```sql=
-- Outputs of all the builds that have happened for derivations − resolved
-- or not
create table Realisations (
    id integer primary key autoincrement not null,
    drvPath text not null,
    outputName text not null,
    outputPath text not null
    -- ^ Or maybe just `outputHash` as we can compute the path from the hash,
    -- and it makes more sense to check against the hash than the path
);

-- Realisations that are known/trusted to be valid (and map to an existing
-- store path)
create table TrustedRealisations (
    userName text not null,
    realisation integer not null,
    foreign key (realisation) references Realisations(id)
);

-- Each row (realisation, inputRealisation) means
-- that the realisation `realisation` has been computed with the assumption
-- that
-- ```
-- outputPath(inputRealisation.drvPath!inputRealisation.outputName) ==
--   inputRealisation.outputPath
-- ```
create table RealisationDeps (
    realisation integer not null,
    inputRealisation integer not null,
    foreign key (realisation) references Realisations(id),
    foreign key (inputRealisation) references Realisations(id)
);
```

Normal build mode would only read the `Realisations` and `TrustedRealisations` tables (to know what needs to be rebuilt).

A check mode would query the `RealisationDeps` table to replay the history of a build − and check each step along the way.

Open questions:

1. There could be *several* histories for a given resolution, should we just pick one at random? (cc @rLhuoqbiTjK7Gi3lnJubAg)
2. How do the runtime dependencies of resolutions fit in here?
3. Should the `RealisationDeps` be local-only things? Or do we want to copy them around one way or another?
4. How do we GC everything?

# Random jotting of other ideas that have popped-up (just to remember them)

- Resolved drvs could only hold a weak pointer to their input srcs
- Resolved drvs could also hold a pointer to the input drv (but that would be hashed modulo) for reproducibility
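
Coming back to the check mode described under “One schema to rule them all”: the history lookup could be sketched roughly as below. This is only an illustration, not how Nix does it. It assumes SQLite (through Python's standard `sqlite3` module), a trimmed-down copy of the `Realisations` and `RealisationDeps` tables, and made-up derivation names and store paths. The helper names `make_example_db` and `build_history` are hypothetical.

```python
import sqlite3

# Trimmed-down copy of the schema sketched in this note (assumption: SQLite).
SCHEMA = """
create table Realisations (
    id integer primary key autoincrement not null,
    drvPath text not null,
    outputName text not null,
    outputPath text not null
);
create table RealisationDeps (
    realisation integer not null references Realisations(id),
    inputRealisation integer not null references Realisations(id)
);
"""


def make_example_db():
    """Build an in-memory store with made-up realisations (hypothetical data)."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany(
        "insert into Realisations (id, drvPath, outputName, outputPath) "
        "values (?, ?, ?, ?)",
        [
            (1, "glibc.drv", "out", "/aaa-glibc"),
            (2, "gcc.drv", "out", "/bbb-gcc"),
            (3, "hello.drv", "out", "/ccc-hello"),
        ],
    )
    # gcc was built against glibc; hello was built against gcc and glibc.
    db.executemany(
        "insert into RealisationDeps (realisation, inputRealisation) values (?, ?)",
        [(2, 1), (3, 2), (3, 1)],
    )
    return db


def build_history(db, realisation_id):
    """Return every realisation that `realisation_id` transitively depended on.

    A check mode could rebuild each returned entry and compare the resulting
    output path with the recorded one.
    """
    return db.execute(
        """
        with recursive history(id) as (
            select inputRealisation from RealisationDeps where realisation = ?
            union
            select d.inputRealisation
            from RealisationDeps d join history h on d.realisation = h.id
        )
        select r.drvPath, r.outputName, r.outputPath
        from history h join Realisations r on r.id = h.id
        order by r.id
        """,
        (realisation_id,),
    ).fetchall()


if __name__ == "__main__":
    for drv, output, path in build_history(make_example_db(), 3):
        print(f"{drv}!{output} -> {path}")
```

The recursive query uses `union` rather than `union all`, so a realisation reached through several paths (here `glibc.drv!out`) is only listed once. Picking between several possible histories, the first open question above, is left out of this sketch.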