# Experimenting with CSI-based Software Catalogs on Kubernetes

## Software collections in a containerized world?

### Do you remember?

Do you remember [**Software Collections**](https://www.softwarecollections.org/en/)? The motto of this project backed by Red Hat was:

> All versions of any software on your system. Together.

The promise was to give you the power to build, install, and use multiple versions of software on the same system, without affecting system-wide installed packages. And it worked, even winning a **Top Innovator award** at DeveloperWeek 2014.

Coming back to this promise, the main point was:

> ... without affecting system-wide installed packages

In other words, being able to _**provide additional tooling without changing anything to the current state of the Operating System seen as a whole**_.

### A new landscape but the same need

Things have changed since 2014: the container revolution popped up, and containers solved a great number of problems by providing features like execution isolation, file system layering, volume mounting, etc.

One could say that, thanks to containers, the old Software Collections are not needed anymore. But Container Orchestrators came as well (I'll stick to Kubernetes in this blog post), and deploying workloads as containers inside Pods has become a standard. Finally, even workloads like build pipelines or IDEs move to the cloud and also run inside containers.

And containers also have a limitation... At some point we start experiencing, inside containers, the same type of limitation, and the same type of need, that Software Collections once tried to solve at the Operating System level.

Why? Because container images form a single inheritance tree: a *container* is based on a single *container image*, which is like a container template, and a *container image* is optionally based on a single parent *container image*, its layers being overlays stacked on top of each other. So to build a container image, you typically start from a basic operating system image, and layers are added one by one, each on top of the previous one, to provide every additional tool or feature needed by your container. In the end, a container image is precisely like a snapshot of the current state of an Operating System at a given point in time.

And here also, the promise would be useful: being able to _**provide additional tooling without changing anything to the current state of the Operating System seen as a whole**_, in other words _**without having to build a new container image**_.

### Combinatorial explosion?

Let's take an <a name="example">example</a>:

- I'd like to run a Quarkus application, let's say the [_Getting Started_](https://github.com/quarkusio/quarkus-quickstarts/tree/master/getting-started) example, directly from source code, in development mode. I will need at least a JDK version and a Maven version, on top of the base system on which my Quarkus server will run.
- But I'd also like to test it with the widest range of versions and flavors of the JDK, Maven, and the underlying operating system.

For each combination of the possible variants of those 3 components (JDK, Maven, OS), I would need to create a dedicated container image. And what if I also want to test with as many Gradle versions as possible, and also cover the native build use-case, which requires GraalVM? Now imagine the combinatorial explosion that will occur if I also want the choice to include arbitrary versions of all my preferred tools.
### Inheritance vs composition

Once again we are limited by a _**single-inheritance model**_ while we would need _**composition**_.

Obviously, it would sometimes be great to be able to compose a number of features or tools inside a container, possibly even different versions of the same tools, without having to produce a new dedicated container image. We just need to... _**compose container images**_ at runtime.

Allowing that _**in full generality**_ seems tricky to implement, if not impossible, at least in the current state of what Kubernetes and containers are. But what about the limited case where we would like to inject external self-contained tooling or read-only data into an existing container?

## Towards composable containerized software catalogs on Kubernetes

Injecting external self-contained tooling or read-only data into a container at runtime would obviously be particularly relevant if you think of things like Java, Maven, Gradle, even Node, NPM, TypeScript, and the growing number of self-contained Go utilities like Kubectl and Helm, as well as the Knative or Tekton CLI tools. None of them needs to be "installed" strictly speaking, since they were designed to be downloaded and simply extracted to work on most Linux variants of a given platform.

### Let's combine 2 forefront container technologies

Let me now introduce 2 container technologies that will indeed make such a software catalog quite simple to implement:

- [Container Storage Interface (CSI)](https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/), and more specifically [CSI Inline Ephemeral volumes](https://kubernetes.io/blog/2020/01/21/csi-ephemeral-inline-volumes/),
- [`buildah`](https://buildah.io/) containers.

According to the Kubernetes documentation,

> CSI was developed as a standard for exposing arbitrary block and file storage systems to containerized workloads on Container Orchestration Systems (COs) like Kubernetes. With the adoption of the Container Storage Interface, the Kubernetes volume layer becomes truly extensible. Using CSI, third-party storage providers can write and deploy plugins exposing new storage systems in Kubernetes without ever having to touch the core Kubernetes code. This gives Kubernetes users more options for storage and makes the system more secure and reliable.

This opens many doors to implement and integrate storage solutions into Kubernetes. On top of that, the **CSI Inline Ephemeral Volumes** feature, still in beta for now, allows specifying a CSI volume of a given type, along with all its parameters, directly in the Pod spec, and only there. That's exactly what we need: a way to reference, directly inside the Pod definition, the name of a tool to inject into the Pod containers.

As for `buildah`, it is a now well-known CLI tool that facilitates building OCI container images. Among many others, it provides two features that are very interesting for us:

- creating a container (from an image) that does not execute any command at start, but can be manipulated, completed, and modified in order to possibly create a new image from it,
- mounting such a container to get access to its underlying file system.

The first attempt at combining these 2 technologies started as a prototype implemented by the Kubernetes-CSI contributors: https://github.com/kubernetes-csi/csi-driver-image-populator, which I took inspiration from and used to bootstrap the current work.
It provides a very lightweight and simple CSI driver, with the `image.csi.k8s.io` identifier, that uses container images as volumes. Deployed with a DaemonSet, the driver runs as a Pod on each node of the Kubernetes cluster and waits for volume mount requests.

From an image reference specified in the Pod as a CSI volume parameter ...

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: main
    image: main-container-image
    volumeMounts:
    - name: composed-container-volume
      mountPath: /somewhere-to-add-the-composed-container-filesystem
  volumes:
  - name: composed-container-volume
    csi:
      driver: image.csi.k8s.io
      volumeAttributes:
        image: composed-container-image
```

... it both creates a container and mounts its file system through `buildah`. The container file system is thus available to be mounted directly as a Pod volume. Upon Pod removal, the Pod volume is unmounted by the driver, and the `buildah` container is removed.

### Now adapt it for the Software Catalogs use-case

However, some aspects of the design of the `csi-driver-image-populator` prototype do not really fit our use-case for containerized Software Catalogs:

- We don't need containers in the Pod to have write access to composed image volumes. The whole idea here is to inject _**read-only**_ tools and data into the Pod containers through the CSI inline volumes.
- Sticking to the read-only use-case will allow us to use a single `buildah` container for a given tool image, and share its mounted file system with all the Pods that reference it: the number of `buildah` containers would then only depend on the number of images provided by the software catalog on the CSI driver side. This will open the door to additional performance optimizations.
- For both performance and security reasons, we should certainly avoid automatically pulling the container image mounted as a CSI inline volume. Instead of pulling the image at mount time, let's keep image pulling managed by an external component, outside the CSI driver, and let the CSI driver expose already-pulled images only. This will allow us to limit the images that can be mounted to a well-defined list and, in other words, to stick to a managed software catalog.
- Finally, for Kubernetes clusters that use an OCI-conformant container runtime ([cri-o](https://cri-o.io/) for example), we should be able to reuse images already pulled by the Kubernetes container runtime on the cluster node. This will avoid duplicating container image layers both on the Kubernetes node and inside the CSI driver container file system. This can be simply implemented by using the [`containers-storage`](https://github.com/containers/storage/blob/master/docs/containers-storage.conf.5.md#storage-options-table) _Additional Image Stores_ feature, as described in [this `buildah` blog article](https://developers.redhat.com/blog/2019/08/14/best-practices-for-running-buildah-in-a-container/).

To validate the idea described in this article, the changes mentioned just above were implemented in the newly-created [`csi-based-tool-provider`](https://github.com/katalogos/csi-based-tool-provider) CSI driver project, starting from the `csi-driver-image-populator` prototype to bootstrap the code.

### And provide dedicated <a name="tooling-images">tooling images</a>

In general, the new `csi-based-tool-provider` driver is able to mount, as a Pod read-only volume, any file system sub-path of any container image.
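To give a rough idea of what this looks like from the Pod's point of view, here is a minimal sketch. The driver identifier and the tooling image follow the full example shown later in this post, while the `path` volume attribute is purely hypothetical and only serves to illustrate the sub-path idea, not an attribute the driver is guaranteed to expose under that name:

```yaml
# Minimal sketch of a read-only tool volume.
# The `path` volume attribute is hypothetical and only illustrates sub-path mounting.
apiVersion: v1
kind: Pod
metadata:
  name: tool-volume-example
spec:
  containers:
  - name: main
    image: registry.access.redhat.com/ubi8/ubi
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: jdk
      mountPath: /usr/lib/jvm/jdk-11   # where the tool appears inside the container
      readOnly: true
  volumes:
  - name: jdk
    csi:
      driver: toolprovider.csi.davidfestal
      volumeAttributes:
        image: quay.io/dfestal/csi-tool-openjdk11u-jdk_x64_linux_hotspot_11.0.9.1_1:latest
        path: /   # hypothetical attribute: which sub-path of the image to expose
```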
But it would still be useful to define a typical structure for the container images that would populate such a Software Catalog.

For "no-installation" software like Java, which is simply delivered as an archive to extract, the most straightforward way is to use `FROM scratch` images with the software directly extracted at the root of the file system. An example `Dockerfile` for the OpenJDK 11 image would be:

```
FROM registry.access.redhat.com/ubi8/ubi as builder
WORKDIR /build
RUN curl -L https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.9.1%2B1/OpenJDK11U-jdk_x64_linux_hotspot_11.0.9.1_1.tar.gz | tar xz

FROM scratch
WORKDIR /
COPY --from=builder /build/jdk-11.0.9.1+1 .
```

The same stands for the Maven distribution required by our Quarkus example mentioned above.

## Coming back to our example

Now let's come back to our Quarkus "Getting Started" quickstart example, for which I don't want to build any dedicated container image, but only use an interchangeable basic OS image for my container and manage additional tooling through CSI volume mounts of images from my new Containerized Software Catalog.

The full deployment looks like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-based-tool-provider-test
spec:
  selector:
    matchLabels:
      app: csi-based-tool-provider-test
  replicas: 1
  template:
    metadata:
      labels:
        app: csi-based-tool-provider-test
    spec:
      initContainers:
      - name: git-sync
        image: k8s.gcr.io/git-sync:v3.1.3
        volumeMounts:
        - name: source
          mountPath: /tmp/git
        env:
        - name: HOME
          value: /tmp
        - name: GIT_SYNC_REPO
          value: https://github.com/quarkusio/quarkus-quickstarts.git
        - name: GIT_SYNC_DEST
          value: quarkus-quickstarts
        - name: GIT_SYNC_ONE_TIME
          value: "true"
      containers:
      - name: main
        image: registry.access.redhat.com/ubi8/ubi
        args:
        - ./mvnw
        - compile
        - quarkus:dev
        - -Dquarkus.http.host=0.0.0.0
        workingDir: /src/quarkus-quickstarts/getting-started
        ports:
        - containerPort: 8080
        env:
        - name: HOME
          value: /tmp
        - name: JAVA_HOME
          value: /usr/lib/jvm/jdk-11
        - name: M2_HOME
          value: /opt/apache-maven-3.6.3
        volumeMounts:
        - name: java
          mountPath: /usr/lib/jvm/jdk-11
        - name: maven
          mountPath: /opt/apache-maven-3.6.3
        - name: source
          mountPath: /src
      volumes:
      - name: java
        csi:
          driver: toolprovider.csi.davidfestal
          volumeAttributes:
            image: quay.io/dfestal/csi-tool-openjdk11u-jdk_x64_linux_hotspot_11.0.9.1_1:latest
      - name: maven
        csi:
          driver: toolprovider.csi.davidfestal
          volumeAttributes:
            image: quay.io/dfestal/csi-tool-maven-3.6.3:latest
      - name: source
        emptyDir: {}
```

First, let me just mention that, in order to clone the example source code from GitHub, I reuse the [`git-sync`](https://github.com/kubernetes/git-sync) utility inside an `initContainer` of my Kubernetes `Deployment`, but that's just for the sake of laziness and doesn't relate to the current work.

The first really interesting part is:

```yaml
      ...
      volumes:
      - name: java
        csi:
          driver: toolprovider.csi.davidfestal
          volumeAttributes:
            image: quay.io/dfestal/csi-tool-openjdk11u-jdk_x64_linux_hotspot_11.0.9.1_1:latest
      - name: maven
        csi:
          driver: toolprovider.csi.davidfestal
          volumeAttributes:
            image: quay.io/dfestal/csi-tool-maven-3.6.3:latest
      ...
```

It uses the new CSI driver to expose my 2 [tooling images](https://github.com/katalogos/csi-based-tool-provider/tree/master/examples/catalog) as CSI read-only volumes. This makes the Java and Maven installations available for the main Pod container to mount them at the needed place.
And since the Pod container owns the final path where the Java and Maven installations will be mounted, it can also set the related environment variables accordingly:

```yaml
      ...
      containers:
      - name: main
        ...
        env:
        ...
        - name: JAVA_HOME
          value: /usr/lib/jvm/jdk-11
        - name: M2_HOME
          value: /opt/apache-maven-3.6.3
        volumeMounts:
        - name: java
          mountPath: /usr/lib/jvm/jdk-11
        - name: maven
          mountPath: /opt/apache-maven-3.6.3
        ...
```

Finally, the container that will build and run the application source code in development mode can be based on a bare Operating System image, and has nothing more to do than call the [recommended startup command](https://quarkus.io/guides/getting-started#running-the-application):

```yaml
      - name: main
        image: registry.access.redhat.com/ubi8/ubi
        args:
        - ./mvnw
        - compile
        - quarkus:dev
        - -Dquarkus.http.host=0.0.0.0
        workingDir: /src/quarkus-quickstarts/getting-started
        ...
```

Here the example will start on a [RHEL 8 UBI image](https://developers.redhat.com/products/rhel/ubi). But the great thing is that you can, for example, switch to an `ubuntu` image and the server will start and run the same way, without any other change. And if you want to switch to another version of Maven, just change the reference to the corresponding container image in the `maven` CSI volume.

Now if you scale up this deployment to 10 Pods, the same underlying Java and Maven installations will be used. No files will be duplicated on disk, and no additional container will be created on the cluster node. Only additional `bind mount`s will be issued on the cluster node. And it will be the same whatever the number of workloads that use the Java and Maven tooling images on this node.

### What does it solve?

Beyond the example I used for this article, we can foresee a number of use-cases where such composable containerized software catalogs could be useful.

First, it would obviously allow reducing the overall size of image layers stored on Kubernetes cluster nodes, by removing the combinatorial-explosion effect of having to manage the versioning and lifecycle of both the underlying system and all the various system-independent tools in a single image.

A similar thought comes to mind when considering a Pod with a number of containers running microservices, all using common tools: the tools could then be shared among all the Pod containers, even though they might have distinct base images.

More concretely, I'm thinking of the [OpenShift Web Terminal](https://www.openshift.com/blog/a-deeper-look-at-the-web-terminal-operator-1) for example. By default, the Pod started by the OpenShift Console to run a Web Terminal defines a container which embeds the typical tooling that you might need. But if you need some additional tools, you have to replace this image with your own customized one, built by your own means. This would not be necessary if all the CLI tools could be provided as volumes in a basic container, as sketched in the example below. This would also relieve the CI burden of having to rebuild the all-in-one container image each time one of the tools needs to be updated. Going even one step further, this should even allow using, in the Web Terminal, exactly the same versions of the Kubernetes-related command-line tools (like `oc` and `kubectl`) as the version of the underlying OpenShift cluster.
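To make that Web Terminal idea a bit more concrete, here is a purely hypothetical sketch: the catalog image name and mount path below are invented for illustration and do not correspond to published images, but the mechanism is exactly the same as in the Quarkus example above.

```yaml
# Hypothetical sketch: a terminal-like container receiving its CLI tools
# from a catalog volume instead of embedding them in its own image.
apiVersion: v1
kind: Pod
metadata:
  name: web-terminal-like
spec:
  containers:
  - name: terminal
    image: registry.access.redhat.com/ubi8/ubi
    command: ["sleep", "infinity"]
    env:
    - name: PATH
      value: /opt/cli-tools/bin:/usr/local/bin:/usr/bin:/bin   # put the injected tools on the PATH
    volumeMounts:
    - name: kube-cli
      mountPath: /opt/cli-tools/bin
      readOnly: true
  volumes:
  - name: kube-cli
    csi:
      driver: toolprovider.csi.davidfestal
      volumeAttributes:
        image: quay.io/example/csi-tool-oc-kubectl:latest   # invented catalog image carrying oc and kubectl
```

Upgrading or adding CLI tools would then only mean pointing the volume at another catalog image, without rebuilding the terminal image itself.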
I also imagine how it could be used to inject off-the-shelf build tools into [Tekton Task Steps](https://github.com/tektoncd/pipeline/blob/master/docs/tasks.md#defining-steps), thus relieving the burden of having to change, and possibly rebuild, Step container images each time you want to run your pipeline with different build tool variants or versions.

Last but not least, this could of course benefit the various cloud-enabled IDEs, such as [Eclipse Che](https://www.eclipse.org/che/): a simple example is the ability to easily and efficiently switch the Java or Maven installations you use in your workspace, share these installations among the various containers, or even have several versions available at the same time. But here as well, this would greatly reduce the CI burden of building and maintaining a container image for each combination of underlying OS and tools... and easily unlock the combination of developer tools at runtime according to the developer's needs.

### What about performance?

In the very first implementation, the new `csi-based-tool-provider` driver used to run [`buildah manifest`](https://github.com/containers/buildah/blob/master/docs/buildah-manifest.md) commands to store the various metadata related to mounted images, and the associated containers and volumes, inside an [OCI manifest](https://github.com/opencontainers/image-spec/blob/master/manifest.md#oci-image-manifest-specification). Though this was useful to get a POC working quickly, it required hard locks on the whole CSI mounting / unmounting process (`NodePublishVolume` and `NodeUnpublishVolume` CSI requests), in order to avoid concurrent modification of this global index and ensure consistency. Moreover, the `buildah` container was initially created on the fly at mount time if necessary, and as soon as a given tool was not mounted by any Pod container anymore, the corresponding `buildah` container was removed by the CSI driver. This design could lead to mount durations of several seconds, especially when mounting a given image for the first time.

Instead of this, the driver now uses an embeddable, high-performance, transactional key-value database: [BadgerDB](https://github.com/dgraph-io/badger). This allows much better performance and less contention due to read-write locks. In addition, the list of container images exposed by the driver is now configured through a [mounted ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/#mounted-configmaps-are-updated-automatically) (see the sketch at the end of this section), and the images, as well as their related `buildah` containers, are created, managed, and cleaned up in background tasks.

With these 2 simple changes, volume mount durations were reduced to fractions of a second, as shown by the following graph of the corresponding [Prometheus](https://prometheus.io/) metric:

![](https://i.imgur.com/lJzq7mN.png)

On a local [Minikube](https://minikube.sigs.k8s.io/docs/) installation, for a simple Pod containing:

* **only one** mounted CSI volume with the JDK image mentioned above,
* **only one** very simple container (doing nothing more than listing the content of the mounted volume and then sleeping),

the average duration required to mount the JDK inside the Pod fluctuated between 15 and 20 ms, which is pretty insignificant compared to the overall Pod startup duration, which oscillated between 1 and 3 seconds at the same time.
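As for the catalog configuration itself, here is a hypothetical sketch of the kind of mounted ConfigMap mentioned above. The key names and structure are assumptions made purely for illustration, not the driver's actual configuration format; only the image references come from the example in this post.

```yaml
# Hypothetical sketch of the catalog ConfigMap driving the CSI driver.
# Key names and structure are assumptions, not the driver's actual format.
apiVersion: v1
kind: ConfigMap
metadata:
  name: csi-based-tool-provider-catalog   # name is illustrative
data:
  images: |
    quay.io/dfestal/csi-tool-openjdk11u-jdk_x64_linux_hotspot_11.0.9.1_1:latest
    quay.io/dfestal/csi-tool-maven-3.6.3:latest
```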
### Can I test it?

The related code is available in the [`csi-based-tool-provider` GitHub repository](https://github.com/katalogos/csi-based-tool-provider), as well as instructions to test it using pre-built container images.

## What next?

Though the POC presented in this blog post is at an early alpha stage, some of the next steps to move it forward can already be imagined.

### Please welcome Katalogos

There is much to build upon the foundation of the `csi-based-tool-provider` CSI driver. But the first step is certainly to set up a wider project dedicated to Kubernetes Containerized Software Catalogs, whose first component will be the CSI driver. We've called it **Katalogos**, from the ancient Greek word for a catalog, a register, especially used for enrollment.

### Package it as a complete solution

As soon as the wider Katalogos project is bootstrapped, here are some next steps that come to mind:

- Provide a _Software Catalog Manager_ component to organize, pull, and manage images as Software Catalogs, and make them available to the CSI driver on each cluster node,
- Provide an Operator to install the CSI driver as well as configure the Software Catalog Manager,
- Provide a way to easily inject the required CSI volumes, as well as the related environment variables, into Pods according to annotations,
- Provide related tooling and processes to easily build Software Catalogs that can feed the Software Catalog Manager,
- Extend the mechanism to the more complex case of software packages that are not inherently self-contained.

### Get feedback and build a community

The goal of this article was to present some ideas, with the related minimal POC, for a project which, I believe, could meet a real need in the current state of containerized development. And since it's also an attempt at getting feedback, sparking interest, and gathering other use-cases where it would fit, please give it a try, open issues, fork the GitHub repository... or simply star it.