# Understanding computational reproducibility
Sub-chapter before the more applied chapter, with some definitions and finer distinctions.
[Research Software Sharing, Publication, & Distribution Checklists](https://gitlab.com/HDBI/data-management/checklists)
- Degrees of reproducibility
- What are the most and least reproducible practices and what is a good level to aim for in your case?
- Verifiability vs Reproducibility [[@Hinsen2018]]
- The ability to understand and interrogate the logic of a program vs the ability simply to have it run
- Understanding dependencies
- language
- system
- buildtime
- runtime
- Package management vs Environment management
- A hierarchy of reproducibility? - it's a bit non-linear
- Results / Output Only
- Binaries / Container Images / VMs
- Source Code
- Source Code + Version Information
- Source Code + Versioned Dependencies
- Source Code + Versioned Dependencies + System dependencies
- x + runtime dependencies
- x + build dependencies
- bootstrappable - ability to build all dependencies from source in the appropriate order
logseq.order-list-type:: number
- Tools for package &/or environment management
- language specific
- renv (R environment manager - wraps package management)
- language agnostic
- conda / mamba
- Nix
- GUIX
- VMs
- containers
- Hardware & Firmware considerations
- CPU Architectures
- x86, Arm, RISC-V
- Virtual: WASM
- Drivers, Firmware / microcode
- GPU
- CUDA & other proprietary APIs
- Differences in output resulting from implementation choices hidden from the developer (ROCm vs NVIDIA)
# ... title TBC
Understanding the limits of reproducible computation
Considerations for choosing a reproducible computational environment tool - limitations
This chapter covers a number of different tools and technologies for describing, capturing, and sharing computational environments.
Two axes of comparison on which these methods differ are:
- Verifiability (as defined in...), or the ability to inspect and understand how the environment was built
- Completeness, or how much of the environment is described and how exactly
A virtual machine image is a highly complete way to capture a computational environment, but in the absence of a description of how it was built it is not particularly verifiable.
## Verifiability
A container or virtual machine (VM) image, once built, will very likely give the same outputs given the same inputs, but it is something of a black box.
Whilst such images can be inspected and details of how they were built inferred, this is not always easy, and it is not really a feature around which these technologies were designed.
Their design grew out of the pragmatic need to capture working computational environments for deployed software, to keep things working in ever-growing production environments.
This is changing: provenance, software manifests or 'bills of materials', and 'supply chain' security (link not a supplier post...) are increasingly becoming a concern of industry.
This is for reasons of security and compliance with new cybersecurity regulations, so more money and attention are now being focused on these problems.
The file specifying how the container or VM image is built, e.g. a Dockerfile, helps us to some degree, as it is a record of the steps taken to build an image of an environment.
However, these methods can fall short in the completeness with which they capture those steps.
A common example is starting a Dockerfile with `FROM ubuntu:latest`: depending on when you run the build, you may get a different result.
It is common to follow this with an update of the operating system packages, in case important patches have come out since the base image was built; this introduces another time-dependent aspect.
It is also not uncommon to rely on various resources from the internet during such a build, and what is at the other end of a URL can change or disappear from one build to the next.
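As a minimal sketch (the digest shown is a placeholder, not a real image digest), compare a time-dependent build recipe with one that pins its base image by content digest:
```dockerfile
# Time-dependent: 'latest' points at whatever Ubuntu image is current
# when the build runs, and apt pulls whatever package versions exist today.
FROM ubuntu:latest
RUN apt-get update && apt-get -y upgrade

# More repeatable alternative: pin the base image to a specific content
# digest (placeholder shown) so every build starts from identical bytes.
FROM ubuntu@sha256:<digest-of-a-specific-base-image>
```
Pinning the base image helps, but the `apt-get` step remains time-dependent unless package versions are also pinned or fetched from a snapshot archive.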
Another common problem here is 'packing more than you need': you might add something to your computational environment that you then never use, and it is not always easy to tell what you do and do not need.
Many of the dependencies of what runs in your environment remain implicit.
Your base operating system is sufficient to provide the environment that you need, but is all of it necessary?
When trying to understand your environment, most of these technologies leave you with a lot of noise: a lot of extra parts that may or may not be relevant.
(This is also extra attack surface if you are taking a security lens on the problem.)
Producing a reproducible image of a computational environment is somewhat analogous to the problem of producing reproducible binary builds of compiled software.
The ability for people to independently produce bit-for-bit identical binary builds from the same source code is a valuable feature for verifying that a cached binary build of a piece of software is what it claims to be.
Independent parties can take the same code, build it, and, if they get the same result, attest to its correctness; this is a valuable countermeasure against the introduction of malicious code during the build process that is not present in the source tree.
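In practice this attestation often comes down to comparing checksums of the built artefacts; a sketch (file names purely illustrative):
```sh
# Each independent builder compiles the same tagged source tree,
# then publishes the checksum of the artefact they produced.
sha256sum mytool-1.2.3-linux-x86_64.tar.gz

# If the build is reproducible the hashes match, and a user can verify
# a downloaded artefact against a published checksum file.
sha256sum --check mytool-1.2.3-checksums.txt
```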
[Reproducible Builds](https://web.archive.org/web/20240603001422/https://reproducible-builds.org/) is a project sharing best practices for achieving this across a number of open source software projects.
(The [recent attempt to introduce a backdoor into the XZ utils library](https://en.wikipedia.org/w/index.php?title=XZ_Utils_backdoor&oldid=1226548252) is an interesting case study into the intricacies of these sorts of attacks and the limits of technical measures for protecting such codebases.)
## Completeness
There are more or less complete ways to describe a computational environment.
We've seen a number of approaches to this problem in this chapter, but many either miss certain aspects of an environment in their description or, whilst they might capture the whole environment, do so in an opaque fashion.
We will start by discussing the nature of the dependency problem in a little more depth, and then move on to some tools which offer more complete descriptions of computational environments.
These are [Nix](https://nixos.org/) and other Nix-like package management solutions such as [GUIX](https://guix.gnu.org/).
## Dependencies - a deeper look
When working on an individual software project, you as the developer will likely be paying most attention to the packages that you are using from your language's package repository.
You may use a language-specific package manager to install and manage the packages that your project needs; in Python this might be `pip`.
It is important to remember that your project's dependencies may also have dependencies; these become transitive dependencies of your project, creating a dependency tree.
You may also use a language-specific environment management tool so that you can have different projects with different versions of the same dependencies on your system at the same time; in Python this might be `venv`.
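As a minimal sketch of this workflow (package names purely illustrative):
```sh
# Create and activate an isolated environment for this project
python -m venv .venv
source .venv/bin/activate

# Install the packages the project uses directly
pip install numpy pandas

# Record the exact versions of everything installed, including
# transitive dependencies, so the environment can be recreated later
pip freeze > requirements.txt

# On another machine (or later), recreate the same package set
pip install -r requirements.txt
```
Note that this pins language-level packages only; it says nothing about the Python interpreter version or any system libraries those packages rely on.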
Where things start to get a bit more complicated is with system dependencies.
A package in one language may expect another tool to be installed on the system, for example a performant image-manipulation library written in a different language, which the language-specific package calls to perform certain tasks.
Another source of complexity is the distinction between build-time and run-time dependencies.
A build-time dependency is only needed whilst a piece of software is being packaged or compiled, not once it is built.
Once built, the packaged or binary version requires only its run-time dependencies to be executed.
A compiler, for example, can be a build-time dependency: once compiled, the tool likely does not need the compiler to perform its function, unless it is itself a build tool.
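A small illustration of the distinction (assuming a Linux system with gcc installed):
```sh
# gcc is a build-time dependency: it is needed to compile the program
gcc -o hello hello.c

# Once built, the binary needs only its run-time dependencies;
# on Linux the shared libraries it links against can be listed with ldd
ldd ./hello
```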
This leads in nicely to the bootstrapping problem for software dependency trees.
Software is written as source code but is generally distributed as pre-built, packaged binaries; this saves time and resources, as the software only needs to be built once centrally.
All of the tools necessary to build a given sequence of pieces of software must be available in the correct order to build up the dependency tree from source.
For this to be possible the dependency tree must be a directed acyclic graph: if you want to be able to bootstrap from source, you cannot have a piece of software that indirectly depends on itself in order to be built.
You can easily end up with loops in this graph when using pre-built binaries from others if you are not fastidious about keeping such loops out.
As you may imagine, this gets interesting very fast for things like compilers; how these issues are resolved at the base of the tree is a fascinating subject in its own right, but out of scope for the current discussion.
For a large collection of software, like a modern operating system, compiling everything in the right order from source 'manually' is a very laborious process.
If you want to experience some of what this is like, or just get an idea of what the process would involve, then check out the [Linux From Scratch](https://web.archive.org/web/2/https://www.linuxfromscratch.org/) project.
At the root of the tree is a potentially very minimal set of binary files; [GUIX is perhaps the project with the most complete ability to bootstrap from the most minimal starting point](https://web.archive.org/web/20240601042922/https://guix.gnu.org/en/blog/2023/the-full-source-bootstrap-building-from-source-all-the-way-down/), and in a completely automated fashion.
### Dependency resolution & package conflicts
You can sometimes encounter situations where you need two tools installed which have mutually incompatible dependencies.
This arises because, conventionally, a package is installed in one place on your system, under a single name by which it can be invoked in a given environment.
The name 'PackageX' can refer to only one version of that package in your environment.
Even if you can have version 1.0 and version 2.0 of 'PackageX' installed on your system at the same time, by naming them differently centrally and aliasing them to their usual name in your environment, you cannot generally depend on the different versions in the same environment.
For example: 'Tool A' requires 'PackageX' version >2.0, 'Tool B' requires 'PackageX' version 1.0 to 1.6, and you need both 'Tool A' and 'Tool B' in your project - what do you do?
If you are using these tools at different steps of your analysis, it may be possible, indeed desirable, to separate out those steps using a pipeline/workflow management tool where each step is run in its own environment with only the dependencies that it needs.
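One way this can look in practice is sketched below, using Snakemake's per-rule conda environments (rule names, file paths, and environment files are purely hypothetical):
```python
# Snakefile sketch: each rule runs in its own conda environment,
# so Tool A and Tool B never have to share a set of dependencies.
rule run_tool_a:
    input: "data/raw.csv"
    output: "results/step_a.csv"
    conda: "envs/tool_a.yaml"   # environment pinning PackageX >2.0
    shell: "tool_a {input} > {output}"

rule run_tool_b:
    input: "results/step_a.csv"
    output: "results/step_b.csv"
    conda: "envs/tool_b.yaml"   # environment pinning PackageX 1.0-1.6
    shell: "tool_b {input} > {output}"
```
(Other workflow managers such as Nextflow offer equivalent per-process environment or container directives.)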
Sometimes, though, you are just stuck with this problem and have no good solutions.
Dependency resolution is in general an NP-hard problem, so searching your dependency tree for a mutually compatible set of package versions can take a very long time and may not return a valid result.
Nix-like package management side-steps this issue by making the dependencies of each package explicit, internally giving each package version a unique name, and linking the dependencies of each package independently.
This way, each tool can refer to a different version of the same dependency without this resulting in a conflict in an environment containing both tools.
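For illustration (hashes and version numbers are placeholders), in Nix every package build lives under its own uniquely hashed path in the store, so both versions coexist and each tool is linked against the one it was built with:
```
/nix/store/<hash-a>-packagex-1.6/
/nix/store/<hash-b>-packagex-2.1/
```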
### Nix-like package management
Nix, and tools such as GUIX that adopt many of the same underlying design principles, take a 'functional' approach to package management.
Software packages in Nix are called derivations.
Ideally a derivation is a 'pure' or 'referentially transparent' function, meaning that it has no 'side effects' (it changes nothing outside the function when run) and its output is completely determined by its inputs.
So in order to package a piece of software you have to give the derivation everything needed to build the software as an input, and the build is performed in an isolated environment where it cannot access anything not explicitly included as an input.
For example, when you provide source code from a remote repository to a derivation, you also provide a hash of that code, which is recorded in the derivation; the derivation is only valid if the remote supplies exactly the same code when you attempt to rebuild it.
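A minimal sketch of what this looks like (the package name, URL, and hash are placeholders, not a real package):
```nix
{ stdenv, fetchurl }:

stdenv.mkDerivation {
  pname = "mytool";
  version = "1.2.3";

  # The fetched source is only accepted if it matches the recorded hash.
  src = fetchurl {
    url = "https://example.org/mytool-1.2.3.tar.gz";
    sha256 = "<sha256-of-the-source-tarball>";
  };

  # Build-time dependencies would also be declared explicitly here,
  # e.g. nativeBuildInputs = [ cmake ];
}
```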
This approach requires a somewhat higher up-front investment, as it requires you to be specific and exhaustive about what software you are using when packaging your code.
This has some excellent benefits in the long term.
The article [Building a Secure Software Supply Chain with GNU Guix](https://doi.org/10.22152/programming-journal.org/2023/7/1) goes over the uniquely robust trust architecture of this approach to software packaging.
I will note that Thomas Depierre observed in his 2022 blog post [I am not a supplier](https://web.archive.org/web/2/https://www.softwaremaxims.com/blog/not-a-supplier) that conceiving of open source software as part of the 'software supply chain' misses the crucial point that open source maintainers are not generally treated the way suppliers of other goods and services would be in a supply chain, and should be if you want to treat their output as part of your supply chain.
The technical robustness of your system is great for reducing the number of parties whom you need to trust, or the number of places untrusted parties could target, but ultimately the trustworthiness of the software comes back to its maintainers and developers.
If a project you depend on is maintained by one person who gets burnt out and stops doing security updates, it does not matter that you have a technically robust trust architecture.
reframe - research reliability
### Firmware & Microcode
These are pieces of our computational environments that are often forgotten, but they are of increasing importance as more computation is performed by dedicated hardware accelerators; this is especially true for machine learning computations.
We are seeing the sophistication of the software installed at a low level on our hardware devices increase as these devices become more capable.
Many PC components from SSDs to GPUs are now essentially entire small computers in their own right.
Functionality which might previously have lived in operating system drivers is moving on-device and is accessed by the driver through relatively high-level interfaces.
As our hardware does more, how and what it does matters more for reproducible and trustworthy computation.
Unfortunately the firmware of GPUs and CPUs is almost universally closed and proprietary.
(RISC-V CPUs are somewhat more open than x86 and Arm, but still a very small market segment.)
This is a significant problem for the independent verifiability of the trustworthiness of computation, as many of the same arguments advanced by Ken Thompson in [Reflections on Trusting Trust](https://doi.org/10.1145/1283920.1283940) about compiler-based attacks apply to the potential for firmware to be used as an attack vector.