Remote caching of ca derivations

---
title: Remote caching of ca derivations
tags: ca
---

We consider the following set of drvs:

```nix
with import <nixpkgs> {};
rec {
  libhello = stdenv.mkDerivation { ... };
  libhello2 = libhello.overrideAttrs (a: { doCheck = false; });
  helloWith = lib: stdenv.mkDerivation { buildInputs = [ cmake lib ]; ... };
  hello = helloWith libhello;
  hello2 = helloWith libhello2;
}
```

- `libhello` and `libhello2` are two different derivations, but with the same output path
- `hello` and `hello2` are two different derivations, but they resolve to the same one
- `cmake` is only a build-time dependency of `hello` (and not a dependency of `libhello` at all)

# Scenarios to cover

## Substitution, same drv

### Scenario

1. Alice builds `hello` and pushes the result to Charlie, the binary cache.
   ```shell
   alice@alice> nix copy --to charlie .#hello
   ```
2. Bob wants to build `hello` too, using Charlie as a binary cache
   ```shell
   bob@bob> nix build .#hello --option substituters charlie
   ```

### Expected result

Bob shouldn't have to build anything and should end up with the runtime closure of `hello` in his store.

## Substitution, different drv (but same resolved drv)

### Scenario

1. Alice builds `hello` and pushes the result to Charlie, the binary cache.
   ```shell
   alice@alice> nix copy --to charlie .#hello
   ```
2. Bob wants to build `hello2`
   ```shell
   bob@bob> nix build .#hello2 --option substituters charlie
   ```

### Expected result

Bob should:

1. Build `libhello2`
2. Download the rest of the runtime closure of `hello2`, which is the same as the closure of `hello`

In particular, Bob should neither rebuild `hello2` nor fetch `cmake`.

## Substitution of a dependency

### Scenario

1. Alice builds `hello` and pushes the result to Charlie, the binary cache.
   ```shell
   alice@alice> nix copy --to charlie .#hello
   ```
2. Bob wants to build `libhello`
   ```shell
   bob@bob> nix build .#libhello --option substituters charlie
   ```

### Expected result

Bob should fetch everything from the remote cache and end up with the runtime closure of `libhello` in his store.

## Direct copy

_(This scenario isn't an absolute priority for the first step, but it's likely that supporting the first two makes this either redundant or very easy to add)_

### Scenario

1. Alice builds `hello` and pushes the result directly to Bob
   ```shell
   alice@alice> nix copy --to bob .#hello
   ```
2. Bob builds `hello2`
   ```shell
   bob@bob> nix build .#hello2 --option substituters charlie
   ```

### Expected result

Bob should:

1. Build `libhello2`
2. Register `hello2!out` (with the same output path as `hello!out`, as they have the same resolved drv)

# Design proposal 1: resolve everything locally

(This should mostly be what @Ericson2314 suggested)

## High-level build-loop implementation

```cpp
struct DrvOutputId {
  StorePath drvPath;
  string outputName;

  Derivation drv();
};

// Similar to DrvOutputId, but with the extra assumption that the derivation at
// `drvPath` is a `BasicDerivation`
typedef DrvOutputId BasicDrvOutputId;

// Recursively build (or fetch) a derivation output.
void buildDrvOutput(LocalStore& store, DrvOutputId& id, Store& substituter) {
  BasicDerivation resolvedDrv = resolveDrv(store, id, substituter);
  BasicDrvOutputId resolvedId{resolvedDrv.path, id.outputName};
  buildBasicDrvOutput(store, resolvedId, substituter);
}

// Recursively build (or fetch) an output of a basic derivation.
// This assumes that all the inputs are present either locally or in the remote
// cache (as Nix has no way to know how to build them otherwise)
void buildBasicDrvOutput(LocalStore& store, BasicDrvOutputId& id, Store& substituter) {
  if (store.queryDrvOutputInfo(id)) return;
  fetchFromCache(substituter, id);
  if (store.queryDrvOutputInfo(id)) return;
  for (StorePath & input : id.drv().inputPaths) {
    // At that point, we know that either the local store or the substituter
    // knows about that path, so we can just fetch it if it is absent
    // XXX: This breaks if the substituter is inconsistent and claimed to know
    // this path when it doesn't
    if (!store.isValidPath(input))
      fetchFromCache(substituter, input);
  }
  store.realiseDrv(id.drv());
}

// Get the path of a derivation output (building it only if needed)
StorePath getOutPathOf(LocalStore& store, DrvOutputId& id, Store& substituter) {
  auto resolvedInput = resolveDrv(store, id, substituter);
  auto resolvedOutputId = DrvOutputId{resolvedInput.path, id.outputName};
  auto inputOutputInfo = store.queryDrvOutputInfo(resolvedOutputId);
  if (!inputOutputInfo)
    inputOutputInfo = substituter.queryDrvOutputInfo(resolvedOutputId);
  if (!inputOutputInfo) {
    // Neither the local store nor the substituter know about that input, so
    // we'll have to build it to resolve the derivation
    buildBasicDrvOutput(store, resolvedOutputId, substituter);
    // We've just built it locally, so this should return something
    inputOutputInfo = store.queryDrvOutputInfo(resolvedOutputId);
  }
  // In case we've fetched it from a substituter
  store.registerDrvOutput(id, inputOutputInfo);
  return inputOutputInfo->outPath;
}

// Resolve a derivation, realising its inputs if needed
BasicDerivation resolveDrv(LocalStore& store, DrvOutputId& id, Store& substituter) {
  auto drv = id.drv();
  map<DrvOutputId, StorePath> outputsOfInputDrvs;
  for (auto & [inputDrv, inputOutputName]: drv.inputDrvs) {
    auto inputId = DrvOutputId{inputDrv, inputOutputName};
    outputsOfInputDrvs[inputId] = getOutPathOf(store, inputId, substituter);
  }
  return drv.replaceInputs(outputsOfInputDrvs);
}
```

## What needs to be stored?

### Remotely

For the above algorithm to work, we need `queryDrvOutputInfo` to be implemented for the remote cache, meaning that it must hold a mapping from every known `BasicDrvOutputId` to its output path (+ a signature and whichever metadata might be needed).

Strictly speaking, there's no need for the remote cache to know anything about unresolved derivations, as all the resolving happens locally.

### Locally

An important property to keep is that if `foo!out` is a valid path, then `nix build foo!out` must be a no-op and work offline.

This means that Nix must be able to resolve `foo` locally. This can be done in two ways:

1. Keep the output path mappings for the whole build-time closure of `foo`, or
2. Memoize the result of `resolveDrv` in the db

Option 1 is a bit annoying for two reasons:

1. It means partially losing the property that `foo!out` is totally oblivious to its build inputs. I'm however not sure this has any practical implication (that's already more-or-less how Nix works when `keep-derivers` is set)
2. More importantly, this means that the local store will have drv-output mappings for store paths that it doesn't have. I fear that this might have some fishy interactions with non-deterministic derivations: what happens if we have in the db `outPath(foo!out) == /nix/store/xxx`, but we later build it and get `yyy` rather than `xxx` as the output hash? We can't reuse the first one because the corresponding path has been gc'ed in the meantime, but we can't replace it by the second either because there are live store paths that depend on the old mapping.

Option 2 is probably less problematic in that regard, but also less flexible: if I change one input of the derivation, then I'll have to query for all its other inputs. But that's similar to the way Nix already works wrt the drv inputs in general, so it's probably fair.
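
To make the notion of resolving more concrete, here is a minimal, self-contained sketch. It is a toy model and not Nix's actual implementation: `ToyDerivation`, `OutputMappings`, `resolve`, `ResolveMemo` and `resolveExample` are hypothetical names, store paths are plain strings, and the paths used are made up. It illustrates why `hello` and `hello2` resolve to the same derivation, and how memoizing the result (option 2) lets a later resolution succeed without any output mapping being available.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-ins for the real types: store paths are plain strings.
using StorePath = std::string;
using DrvOutputId = std::pair<StorePath, std::string>;  // (drvPath, outputName)
using OutputMappings = std::map<DrvOutputId, StorePath>;

struct ToyDerivation {
  std::string builder;              // toy stand-in for the rest of the drv
  std::set<DrvOutputId> inputDrvs;  // unresolved inputs (outputs of other drvs)
  std::set<StorePath> inputPaths;   // resolved inputs (plain store paths)
};

bool operator==(const ToyDerivation& a, const ToyDerivation& b) {
  return a.builder == b.builder && a.inputDrvs == b.inputDrvs &&
         a.inputPaths == b.inputPaths;
}

// Replace each input drv output by its output path. Throws std::out_of_range
// when a mapping is missing; the algorithm above would build or fetch instead.
ToyDerivation resolve(const ToyDerivation& drv, const OutputMappings& known) {
  ToyDerivation resolved{drv.builder, {}, drv.inputPaths};
  for (const auto& input : drv.inputDrvs)
    resolved.inputPaths.insert(known.at(input));
  return resolved;
}

// Option 2: remember the resolved form of each drv path, so that resolving it
// again needs neither the mappings nor the network.
struct ResolveMemo {
  std::map<StorePath, ToyDerivation> resolvedByDrvPath;

  const ToyDerivation& get(const StorePath& drvPath, const ToyDerivation& drv,
                           const OutputMappings& known) {
    auto it = resolvedByDrvPath.find(drvPath);
    if (it == resolvedByDrvPath.end())
      it = resolvedByDrvPath.emplace(drvPath, resolve(drv, known)).first;
    return it->second;
  }
};

void resolveExample() {
  // libhello and libhello2 are different drvs with the same output path.
  OutputMappings known{
      {{"libhello.drv", "out"}, "/nix/store/aaa-libhello"},
      {{"libhello2.drv", "out"}, "/nix/store/aaa-libhello"},
      {{"cmake.drv", "out"}, "/nix/store/bbb-cmake"}};
  ToyDerivation hello{
      "build-hello", {{"cmake.drv", "out"}, {"libhello.drv", "out"}}, {}};
  ToyDerivation hello2{
      "build-hello", {{"cmake.drv", "out"}, {"libhello2.drv", "out"}}, {}};

  // Different derivations, same resolved derivation.
  assert(!(hello == hello2));
  assert(resolve(hello, known) == resolve(hello2, known));

  // Once memoized, resolving again works with no mappings at all.
  ResolveMemo memo;
  memo.get("hello.drv", hello, known);
  assert(memo.get("hello.drv", hello, OutputMappings{}) == resolve(hello, known));
}
```

Calling `resolveExample()` from a `main` function runs the checks. The memo is keyed by drv path here, which mirrors the trade-off described for option 2: changing any input yields a different drv path, hence a fresh resolution of all the inputs.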
# Design proposal 2

## High-level build loop implementation

```cpp
struct DrvOutputId {
  StorePath drvPath;
  string outputName;

  Derivation drv();
};

// Similar to DrvOutputId, but with the extra assumption that the derivation at
// `drvPath` is a `BasicDerivation`
typedef DrvOutputId BasicDrvOutputId;

// Recursively build (or fetch) a derivation output.
void buildDrvOutput(LocalStore& store, DrvOutputId& id, Store& substituter) {
  if (store.queryDrvOutputInfo(id)) return;
  fetchFromCache(store, id, substituter);
  if (store.queryDrvOutputInfo(id)) return;
  BasicDerivation resolvedDrv = resolveDrv(store, id, substituter);
  BasicDrvOutputId resolvedId{resolvedDrv.path, id.outputName};
  buildBasicDrvOutput(store, resolvedId, substituter);
}

// Recursively build (or fetch) an output of a basic derivation.
// This assumes that all the inputs are present
void buildBasicDrvOutput(LocalStore& store, BasicDrvOutputId& id, Store& substituter) {
  if (store.queryDrvOutputInfo(id)) return;
  fetchFromCache(substituter, id);
  if (store.queryDrvOutputInfo(id)) return;
  store.realiseDrv(id.drv());
}

// Resolve a derivation, realising its inputs if needed
BasicDerivation resolveDrv(LocalStore& store, DrvOutputId& id, Store& substituter) {
  auto drv = id.drv();
  map<DrvOutputId, StorePath> outputsOfInputDrvs;
  for (auto & [inputDrv, inputOutputName]: drv.inputDrvs) {
    auto inputId = DrvOutputId{inputDrv, inputOutputName};
    // The input is not resolved yet, so go through the recursive entry point
    buildDrvOutput(store, inputId, substituter);
    outputsOfInputDrvs[inputId] = store.queryDrvOutputInfo(inputId)->outPath;
  }
  return drv.replaceInputs(outputsOfInputDrvs);
}
```

## What needs to be stored

### Remotely

For this to work, the remote server should have the `DrvOutputId->OutputPath` mapping for both the resolved derivations and the non-resolved ones.

### Locally

Like the server, the client must know the `DrvOutputId->OutputPath` mappings for all the derivations it instantiates.
For pushing to the remote cache, we also want to know the dependencies of each derivation output (_i.e._ the set of drv outputs used to build its reference closure), because we want to push all of these to the remote cache, so that the third scenario (“Substitution of a dependency”) works properly.

## Limitations

This proposal doesn't really satisfy the second scenario (“Substitution, different drv”): if we change `libhello`, then we need to re-resolve `hello`, which itself requires fetching `cmake`.

# Proposal 3

This is a mix of proposals 1 and 2.

## High-level build loop

Essentially similar to the one of proposal 1. The differences are that:

1. We don't store locally the drv-output mappings that we don't need (or we might store them, but only as a cache)
2. We store the drv-output mappings for the **non-resolved** drvs (in addition to the resolved ones).

The consequences of that are:

1. We don't need to re-resolve drvs that have already been built locally
2. Non-determinism is less of a concern, as storing the drv-output mappings for non-resolved drvs means that we have a non-ambiguous source of truth that doesn't depend on the output path of build-time dependencies anymore (meaning that a built drv output can assume a build-time dependency that doesn't exist anymore).

```cpp
struct DrvOutputId {
  StorePath drvPath;
  string outputName;

  Derivation drv();
};

// Similar to DrvOutputId, but with the extra assumption that the derivation at
// `drvPath` is a `BasicDerivation`
typedef DrvOutputId BasicDrvOutputId;

// A bidirectional map associating drv output ids to their output path
// (and the output paths to the drv outputs that can produce them)
struct DrvResolvings {
  map<DrvOutputId, StorePath> directMappings;
  multimap<StorePath, DrvOutputId> inverseMappings;

  void insert(DrvOutputId &id, StorePath &path) {
    directMappings.insert({id, path});
    inverseMappings.insert({path, id});
  }
};

// Memoize the known resolvings, to
// 1. Be able to fallback if a substituter is corrupted (not implemented)
// 2. Be able to register the original drv with the correct inputs
DrvResolvings knownResolvings;

// Recursively build (or fetch) a derivation output.
void buildDrvOutput(LocalStore &store, DrvOutputId &id, Store &substituter) {
  // Try getting the unresolved output id, either from the local store or from
  // the remote cache
  if (store.queryDrvOutputInfo(id)) return;
  fetchFromCache(store, id, substituter);
  if (store.queryDrvOutputInfo(id)) return;

  // If we couldn't, resolve and realise the drv
  BasicDerivation resolvedDrv = resolveDrv(store, id, substituter);
  BasicDrvOutputId resolvedId{resolvedDrv.path, id.outputName};
  auto outputMappings = buildBasicDrvOutput(store, resolvedId, substituter);

  // Also register the built outputs as the outputs of *this* drv
  // The dependencies of each of these drv outputs are the drv outputs that
  // produced the store paths that are a dependency of the corresponding output
  // in the resolved drv.
  for (auto &[outputName, outputInfo] : outputMappings) {
    auto [first, last] =
        knownResolvings.inverseMappings.equal_range(outputInfo.outPath);
    for (auto producer = first; producer != last; ++producer)
      outputInfo.dependencies.insert(producer->second);
  }
  registerOutputsOf(store, id.drvPath, outputMappings);
}

// Recursively build (or fetch) an output of a basic derivation.
// This assumes that all the inputs are present either locally or in the remote
// cache (as Nix has no way to know how to build them otherwise)
map<string, DrvOutputInfo> buildBasicDrvOutput(LocalStore &store,
                                               BasicDrvOutputId &id,
                                               Store &substituter) {
  // First see whether the output exists locally or in a remote cache
  if (auto outputInfo = store.queryDrvOutputInfo(id))
    return {{id.outputName, *outputInfo}};
  fetchFromCache(substituter, id);
  if (auto outputInfo = store.queryDrvOutputInfo(id))
    return {{id.outputName, *outputInfo}};

  // If not, fetch its inputs and build it
  for (StorePath &input : id.drv().inputPaths) {
    // At that point, we know that either the local store or the substituter
    // knows about that path, so we can just fetch it if it is absent
    // XXX: This breaks if the substituter is inconsistent and claimed to know
    // this path when it doesn't
    if (!store.isValidPath(input))
      fetchFromCache(substituter, input);
  }
  auto outputMappings = store.realiseDrv(id.drv());
  registerOutputsOf(store, id.drvPath, outputMappings);
  return outputMappings;
}

void registerOutputsOf(Store &store, StorePath &drvPath,
                       map<string, DrvOutputInfo> &outputMappings) {
  for (auto &[outputName, outputInfo] : outputMappings)
    store.registerDrvOutput(DrvOutputId{drvPath, outputName}, outputInfo);
}

// Get the path of a derivation output (building it only if needed)
StorePath getOutPathOf(LocalStore &store, DrvOutputId &id, Store &substituter) {
  auto resolvedInput = resolveDrv(store, id, substituter);
  auto resolvedOutputId = DrvOutputId{resolvedInput.path, id.outputName};
  auto inputOutputInfo = store.queryDrvOutputInfo(resolvedOutputId);
  if (!inputOutputInfo)
    inputOutputInfo = substituter.queryDrvOutputInfo(resolvedOutputId);
  if (!inputOutputInfo) {
    // Neither the local store nor the substituter know about that input, so
    // we'll have to build it to resolve the derivation
    buildBasicDrvOutput(store, resolvedOutputId, substituter);
    // We've just built it locally, so this should return something
    inputOutputInfo = store.queryDrvOutputInfo(resolvedOutputId);
  }
  knownResolvings.insert(id, inputOutputInfo->outPath);
  return inputOutputInfo->outPath;
}

// Resolve a derivation, realising its inputs if needed
BasicDerivation resolveDrv(LocalStore &store, DrvOutputId &id, Store &substituter) {
  auto drv = id.drv();
  map<DrvOutputId, StorePath> outputsOfInputDrvs;
  for (auto &[inputDrv, inputOutputName] : drv.inputDrvs) {
    auto inputId = DrvOutputId{inputDrv, inputOutputName};
    outputsOfInputDrvs[inputId] = getOutPathOf(store, inputId, substituter);
  }
  return drv.replaceInputs(outputsOfInputDrvs);
}
```

## What needs to be stored?

### Remotely

- A mapping from every known `DrvOutputId` to its output path (+ a signature and whichever metadata might be needed).

### Locally

- Like remotely, a mapping from every `DrvOutputId` present locally to its output path
- We also (probably) need to store the set of dependencies of each drv output to be able to copy them to the remote store
- The mapping between drv outputs that are remotely-known but not present in the local store and their output path can be kept as a separate cache, but doesn't have to be stored in the db itself, which reduces the issues with non-reproducible drvs quite a lot.
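
The local storage described above can be sketched as follows. This is a toy model under stated assumptions, not the actual store schema: `ToyLocalStore`, `registerDrvOutput`, `rememberRemote`, `queryOutPath`, `producersOf` and `storageExample` are hypothetical names, store paths are plain strings, and the paths are made up. It keeps the mappings for locally present outputs (with an inverse index, as in `DrvResolvings`) separate from a disposable cache of mappings that are only known remotely.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

using StorePath = std::string;
using DrvOutputId = std::pair<StorePath, std::string>;  // (drvPath, outputName)

struct ToyLocalStore {
  // In the db: mappings for drv outputs whose output path is present locally.
  std::map<DrvOutputId, StorePath> registered;
  // In the db: inverse index, from a path to every drv output producing it.
  std::multimap<StorePath, DrvOutputId> producers;
  // Outside the db: mappings learnt from a remote cache for absent paths.
  std::map<DrvOutputId, StorePath> remoteHints;

  void registerDrvOutput(const DrvOutputId& id, const StorePath& path) {
    if (registered.emplace(id, path).second) producers.emplace(path, id);
    remoteHints.erase(id);  // the db is now the source of truth for this id
  }

  void rememberRemote(const DrvOutputId& id, const StorePath& path) {
    if (registered.count(id) == 0) remoteHints[id] = path;
  }

  // Look in the db first, then in the cache; nullptr when the id is unknown.
  const StorePath* queryOutPath(const DrvOutputId& id) const {
    auto local = registered.find(id);
    if (local != registered.end()) return &local->second;
    auto hint = remoteHints.find(id);
    if (hint != remoteHints.end()) return &hint->second;
    return nullptr;
  }

  // Every registered drv output producing one of `paths`: a candidate set of
  // dependencies to copy alongside an output that references these paths.
  std::set<DrvOutputId> producersOf(const std::set<StorePath>& paths) const {
    std::set<DrvOutputId> result;
    for (const auto& path : paths) {
      auto range = producers.equal_range(path);
      for (auto it = range.first; it != range.second; ++it)
        result.insert(it->second);
    }
    return result;
  }
};

void storageExample() {
  ToyLocalStore store;
  DrvOutputId libhello("libhello.drv", "out");
  DrvOutputId libhello2("libhello2.drv", "out");
  DrvOutputId cmakeOut("cmake.drv", "out");

  store.registerDrvOutput(libhello, "/nix/store/aaa-libhello");
  store.registerDrvOutput(libhello2, "/nix/store/aaa-libhello");
  store.rememberRemote(cmakeOut, "/nix/store/bbb-cmake");

  // Both drv outputs are recorded as producers of the shared path.
  assert(store.producersOf({"/nix/store/aaa-libhello"}).size() == 2);

  // Dropping the cache loses only the remotely-known mapping.
  store.remoteHints.clear();
  assert(store.queryOutPath(cmakeOut) == nullptr);
  assert(*store.queryOutPath(libhello) == "/nix/store/aaa-libhello");
}
```

Keeping `remoteHints` out of the persisted maps is the point of the last bullet: a stale or non-reproducible remote mapping can be discarded at any time without touching the mappings that live store paths depend on.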