---
title: Supporting out-of-tree drivers on OpenShift
- "@zvonkok"
reviewers:
- "@ashcrow"
- "@darkmuggle"
- "@cgwalter"
approvers:
- "@ashcrow"
- "@cgwalters"
- "@darkmuggle"
creation-date: 2020-04-03
last-updated: 2020-07-21
status: provisional
see-also:
- "/enhancements/TODO.md"
replaces:
- "/enhancements/TODO.md"
superseded-by:
- "/enhancements/TODO.md"
---
# Supporting out-of-tree drivers on OpenShift
## Release Signoff Checklist
- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
## Open Questions [optional]
1. Should SRO be CVO or OLM managed? SRO creates a ClusterOperator object for must-gather and better status reporting to customers/developers/users.
2. Should the driver-toolkit container be part of the payload? It should be accessible through registry.redhat.io/openshift4 without a cluster for out-of-tree driver development, testing on prereleases, ... If a customer has a pull-secret for OCP they should be able to pull it without a separate "login".
## Summary
OpenShift will support out-of-tree and third-party kernel drivers, and the
supporting software for the underlying operating systems, via containers.
## Terminology
### Day 0, Day 1, Day 2
The terms Day 0, Day 1, and Day 2 refer to different phases of the software life
cycle, and there are different interpretations of what Day \<Z\> means. In the context
of OpenShift:
- Day 1 covers all operations involved in installing an OpenShift cluster
- Day 2 covers all operations performed after an OpenShift cluster is installed
This enhancement concentrates solely on Day 2 operations.
### DriverContainers
DriverContainers are increasingly used in cloud-native environments, especially
on pure container operating systems, to deliver hardware drivers to the
host. DriverContainers are more than a delivery mechanism for the driver itself:
they extend the kernel stack beyond the out-of-box software and hardware
features of a specific kernel. Additionally, a DriverContainer can handle the
configuration of modules and start userland services.
DriverContainers work on various container-capable Linux distributions.
With DriverContainers the host always stays "clean" and does not clash with
different library versions or binaries on the host. Prototyping is easier, and
updates are done by pulling a new container, with loading and unloading handled
by the DriverContainer (including checks on /proc, /sys and other files to make
sure that all traces are removed).
## Current Solutions
Here are the current solutions in use today.
### Special Resource Operator
For any day-2 management of kernel modules we can leverage the Special Resource
Operator (SRO) features. SRO was written in such a way that it is highly
customizable and configurable to any hardware accelerator or out-of-tree kernel
module.
A detailed description of SRO and its inner workings can be found in the
following two blog posts:
- [https://red.ht/2JQuNwB](https://red.ht/2JQuNwB)
- [https://red.ht/34ubzq3](https://red.ht/34ubzq3)
SRO supports full lifecycle management of an accelerator stack, but it can also
be used in a stripped-down form to manage e.g. only one kernel module.
Furthermore, SRO can handle multiple kernel modules from different vendors and is
able to model dependencies between them.
Another important feature is the ability to consume build artifacts from other
kernel modules to build a more sophisticated DriverContainer. SRO is capable of
delivering out-of-tree drivers and supporting software stacks for kernel
features and hardware that is not shipped as part of the standard Fedora/RHEL
distribution.
Ideally, SRO would pull a prebuilt DriverContainer with precompiled drivers from
the vendor. Any module updates (and downgrades) will be delivered by container.
The Special Resource Operator (SRO) is currently only available in OperatorHub.
SRO has proven in the past to be the template for enabling hardware on
OpenShift. Its ability to handle several DriverContainers with only one
copy of SRO running makes it a preferable solution to tackle kmods on OpenShift.
SRO is going to be a core component of OpenShift and delivered/managed by CVO.
Here is an example of how one can use SRO + KVC to deliver a simple kernel module
via container in an OpenShift cluster:
[https://bit.ly/2EAlLEF](https://bit.ly/2EAlLEF)
### kmods-via-containers (KVC)
[kmods-via-containers](https://github.com/kmods-via-containers/) is a framework
for building and delivering kernel modules via containers. The implementation
for this framework was inspired by the work done by Joe Doss on
[atomic-wireguard](https://github.com/jdoss/atomic-wireguard). This framework
relies on 3 independently developed pieces.
1. [The kmods-via-containers code/config](https://github.com/kmods-via-containers/kmods-via-containers)
Delivers the stencil code and configuration files for building and delivering
kmods via containers. It also delivers a service `kmods-via-containers@.service`
that can be instantiated for each instance of the KVC framework.
2. The kernel module code that needs to be compiled
This repo contains the source code for building the kernel module. It can be
delivered by vendors and generally
knows *nothing* about containers. Most importantly, if someone wanted to deliver
this kernel module via the KVC framework, the owners of the code don't need to
be consulted. The project provides an
[example kmod repo](https://github.com/kmods-via-containers/simple-kmod).
3. A KVC framework repo for the kernel module to be delivered
This repo defines a container build configuration as well as a library,
userspace tools, and config files that need to be created on the host system.
This repo does not have to be developed by the owner of the kernel module being
delivered.
It must define a few functions in the bash library:
- `build_kmods()`
- Performs the kernel module container build
- `load_kmods()`
- Loads the kernel module(s)
- `unload_kmods()`
- Unloads the kernel module(s)
- `wrapper()`
- A wrapper function for userspace utilities
Customers can hook in their own procedures for how to build, load, unload, etc. their
kernel modules. We are providing only an interface, not the actual "complicated"
implementation of those steps (facade pattern). Customers can then use any
tool(s) (akmods, dkms, ..) they need to build their modules.
[This repo](https://github.com/kmods-via-containers/kvc-simple-kmod) houses
an example using `simple-kmod`.
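As a minimal, illustrative sketch (not the actual kvc-simple-kmod code), a KVC
library for a hypothetical module could implement the hooks roughly as below;
the module name, image name, and podman invocations are assumptions:
```bash
#!/bin/bash
# Illustrative KVC library for a hypothetical "simple-kmod" module.
# The real hooks are free to use whatever build tooling the vendor prefers.

KMOD_NAME="simple-kmod"
IMAGE="simple-kmod-container:latest"

build_kmods() {
    # Build a container image that compiles the module against the running kernel.
    podman build -t "${IMAGE}" --build-arg KVER="$(uname -r)" .
}

load_kmods() {
    # Load the module from inside the (privileged) container.
    podman run --rm --privileged "${IMAGE}" modprobe "${KMOD_NAME}"
}

unload_kmods() {
    # Unload the module on the host.
    modprobe -r "${KMOD_NAME}"
}

wrapper() {
    # Run a userspace utility that ships inside the container.
    podman run --rm --privileged "${IMAGE}" "$@"
}
```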
## Motivation
In OpenShift v3.x out-of-tree drivers could be installed easily on the nodes,
since each node was a full RHEL node with a subscription and the needed tools
could be installed with yum.
In OpenShift v4.x this changed with the introduction of RHCOS. There are
currently two different documented ways to enable out-of-tree drivers: one
is using an SRO-based operator and the other is using kmods-via-containers.
We want to come up with a unified solution that works for our customers across
RHCOS and RHEL. The solution should also help customers that are currently on
RHEL7 to consider moving to RHCOS which is fully managed and easier to support
in OpenShift.
### DriverContainer Management on OpenShift
More and more customers/partners want to enable hardware and/or software on
OpenShift that needs kernel drivers which are currently (and possibly forever)
out-of-tree, i.e. not upstream/in-tree.
We want to provide a unified way to support multiple out-of-tree kernel drivers
on OpenShift as a day-2 operation. It has to work in the same way for classical
(RHEL7, RHEL8, Fedora) and container-based operating systems (RHCOS, FCOS).
Many customers/partners are using [dkms](https://github.com/dell/dkms)/akmods as
the solution to build and rebuild modules on kernel changes. Adopting dkms/akmods
is not a workable solution for OpenShift/OKD; we need to create and own an
acceptable build and delivery mechanism for RHEL, Fedora, and OpenShift.
### Fill the gap of providing drivers that are not yet, or will never be, upstream
For the DriverContainer we need to cover several stages of driver packaging.
- **source repository or archive** The driver is available as source code, is
not packaged, and/or is required to be set up before the cluster is available;
this is where KVC can help
- **kmod-{vendor}-src.rpm** The next step is a source RPM package that can be
recompiled with rpmbuild. This is also the base for akmods, dkms
- **precompiled-{vendor}-{kernelversion}.rpm** Precompiled RPMs are the wishful
thinking for the future. DriverContainers could then be built easily just by
installing RPMs.
Some drivers will *never* be upstreamed and can be in any of the states
described above. The proposed solution needs to handle drivers in any state,
built by "any" tool.
The compilation of kernel modules was always anticipated to be the fallback
solution when dealing with kernel modules. Some kernel modules will always be
out-of-tree and are not going to be included upstream in the near future; for
others we are working with the vendors on upstreaming them to the mainline
kernel.
### Goals
- A unified way to deploy out-of-tree drivers on OpenShift 4.x on all supported
Red Hat Operating Systems
- The solution should avoid rebuilds on every node and allow for distribution of
drivers on a cluster using the cluster registry
- A solution for day-2 kernel modules
- Support upgrades of OpenShift for multiple kernel module providers
- Hierarchical initialization of kernel modules (modeling dependencies)
- Handle dependencies between kernel modules (depmod) in tree and out of tree
- Should support disconnected and proxy environments
- Support heterogeneous clusters:
- OpenShift with RHEL7, 8 and RHCOS
### Non-Goals
- The solution is not a replacement for the traditional way of delivering kernel
modules; customers/partners should be aware that we prefer they deliver the
drivers upstream
- We are not providing a way to build the drivers; this is the business logic of
a specific vendor. We are providing the interface to hook into specific stages
of a DriverContainer
- Extending customer support for third-party modules or implications of said
modules.
## Proposal
The SRO pattern showed how to enable hardware and the complete hardware
accelerator stack on OpenShift. The heavy lifting was the management of the
DriverContainer; only approximately 5% of the logic behind SRO was used for
deploying the remaining parts of the stack.
Based on the current SROv1alpha1 we're going to build a new version, SROv1beta1, that
has more functionality focusing on the out-of-tree driver aspect.
The new version of SRO will have an API update and will hence be called SROv1beta1 for
Tech Preview and SROv1 for GA.
### Combining both approaches
For managing the module in a container we are going to use KVC as the framework
of choice. Targeting RHCOS also solves the problem for RHEL7 and RHEL8. Those
KVC containers, aka DriverContainers, are managed by SRO.
### Day-2 DriverContainer Management on OpenShift
For any day-2 kernel module management or delivery we propose using SRO as the
building block on OpenShift.
We will run a single copy of SRO as part of OpenShift that is able to
handle multiple kernel module drivers using the proposed CRs shown below.
The following section will cover three kernel module instantiations: (1) a single
kernel module, (2) multiple kernel modules with build artifacts, and (3) full-stack
enablement.
There are three main parts involved in the enablement of a kernel module. We
have a specific (1) set of meta information needed for each kernel module, a (2)
set of manifests to deploy a DriverContainer plus enablement stack and lastly
(3) a framework running inside the container for managing the kernel module
(dkms like functions).
*(1) The metadata are encoded in the CR for a special resource*
*(2) The manifests with templating functions to inject runtime information
are the so called recipes*
*(3) This will be done by KVC and some enhancements that will be discussed later*
The following section will walk one through the enablement of the different
use-case scenarios. After deploying the operator the first step is to create an
instance of a special-resource. Following are some example CRs for how one would
instantiate SRO to manage a kernel module or hardware driver.
#### Example CR for a single kernel module #1
```yaml
apiVersion: sro.openshift.io/v1alpha1
kind: SpecialResource
metadata:
name: <vendor>-<kmod>
spec:
metadata:
version: <semver>
driverContainer:
- git:
ref: "release-4.3"
uri: "https://gitlab.com/<vendor>/<kmod>.git"
```
The second example below shows the combined capabilities of SRO for dealing with
multiple DriverContainers and artifacts. On the other hand, SRO can also be
used in a minimalistic form where we only deploy a simple kmod: the example CR
above would create only one DriverContainer from the git repository provided.
For each kernel module one would provide one CR with the needed information.
#### Example CR for a hardware vendor (all settings) #2
```yaml
apiVersion: sro.openshift.io/v1alpha1
kind: SpecialResource
metadata:
name: <vendor>-<hardware>
spec:
metadata:
version: <semver>
namespace: <vendor>-<driver>
machineConfigPool: <vendor>-<mcp>
matchLabels:
<vendor>-<label>: "true"
configuration:
- name: "key_id"
value: ["AWS_ACCESS_KEY_ID"]
- name: "access_key"
value: ["AWS_SECRET_ACCESS_KEY"]
driverContainer:
source:
git:
ref: "master"
uri: "https://gitlab.com/<vendor>/driver.git"
buildArgs:
- name: "DRIVER_VERSION"
value: "440.64.00"
- name: "USE_SPECIFIC_DRIVER_FEATURE"
value: "True"
runArgs:
- name: "LINK_TYPE_P1" # 1st Port
value: "2" #Ethernet
- name: "LINK_TYPE_P2" # 2nd Port
value: "2" #Ethernet
artifacts:
hostPaths:
- sourcePath: "/run/<vendor>/usr/src/<artifact>"
destinationDir: "/usr/src/"
images:
- name: "<vendor>-{{.KernelVersion}}:latest"
kind: ImageStreamTag
namespace: "<vendor>-<hardware>"
pullSecret: "vendor-secret"
paths:
- sourcePath: "/usr/src/<vendor>/<artifact>
destinationDir: "/usr/src/"
claims:
- name: "<vendor>-pvc"
mountPath: "/usr/src/<vendor>-<internal>"
nodeSelector:
key: "deployment-cluster"
values: ["frontend", "backend"]
dependsOn:
- name: <CR_NAME_VENDOR_ID_SRO>
- name: <CR_NAME_VENDOR_ID_KJI>
```
SRO will manage several special resources in different namespaces, hence
the CRD will have cluster scope. SRO can take care of creating and
deleting the namespace for the special resource, which makes cleanup of
a special resource easy: just delete the namespace. Otherwise one would have a
manual step of creating the new namespace before creating the CR for a special resource.
If no spec.metadata.namespace is supplied, SRO will set
the namespace to the CR name by default to separate the resources of each special resource.
With the above information SRO is capable of deducing everything needed
to build and manage a DriverContainer. All manifests in SRO are templates that
are rendered during reconciliation with runtime and meta information.
Recipes will have a version field to distinguish them across operator upgrades.
An operator upgrade will not create an updated version of the recipe; this is done by
editing the CR and updating the version field.
##### MachineConfigPools
There is also an optional field to set a MachineConfigPool per special resource.
A paused MCP will not be upgraded but all other workers, masters and operators will be.
An upgrade could introduce an incompatibility with the special resource and the kernel.
The production workload can stay in the paused MCP and an updated special resource
nodeSelector can be used to deploy the special resource to the upgraded nodes.
SRO can handle different kernel versions in a cluster see [OpenShift Rolling Updates](#OpenShift-Rolling-Updates)
This can reduce application downtime where we would have always a working version running
in the cluster. If the new upgraded Node can handle the special resoure the MPC can be unpaused
an the rolling upgrade can be finished.
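As a sketch, pausing such a dedicated MCP is a single field on the MachineConfigPool
object (the pool name is the placeholder from the CR example above):
```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: <vendor>-<mcp>
spec:
  paused: true   # MCO skips OS updates for nodes in this pool until unpaused
```
Returning to the CR itself, a reduced example using only the most common fields looks like this: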
```yaml
metadata:
name: <vendor>-<hardware>
spec:
metadata:
namespace: <vendor>-<hardware>
configuration:
- key: "key_id"
value: "ACCESS_KEY_ID"
- key: "access_key"
value: "SECRET_ACCESS_KEY"
driverContainer:
source:
git:
ref: "master"
uri: "https://gitlab.com/<vendor>/driver.git"
```
The name is used to prefix all resources (Pod, DaemonSet, RBAC,
ServiceAccount, Namespace, etc) created for this very specific
**{vendor}-{hardware}**. The DriverContainer section optionally takes the git
repository from a vendor. This repository has all tools and scripts to build the
kernel module. The base image for a DriverContainer is a UBI 7 or UBI 8 image with the KVC
(kmods-via-containers) framework installed. Simpler builds can be accomplished
by including the Dockerfile in the Build YAML.
KVC provides hooks to build, load, and unload the kernel modules, and a wrapper for
userspace utilities. We might extend the number of hooks to have an interface
similar to dkms.
The configuration section can be used to provide an arbitrary set of key-value
pairs that can later be templated into the manifests for any kind of information
needed in the enablement stack.
```yaml
buildArgs:
- name: "DRIVER_VERSION"
value: "440.64.00"
- name: "USE_SPECIFIC_DRIVER_FEATURE"
value: "True"
```
Another important field is the build arguments. We have often seen
incompatibilities between workloads and driver versions. Selecting a specific
version is sometimes the only way to have a workload run successfully on
OpenShift or bare metal. This field can also be used by an administrator to
upgrade or downgrade a kernel module due to CVEs, bug fixes or incompatibilities.
Some drivers also have flags to enable or disable specific features of the
driver.
```yaml
runArgs:
- name: "LINK_TYPE_P1" # 1st Port
value: "2" #Ethernet
- name: "LINK_TYPE_P2" # 2nd Port
value: "2" #Ethernet
```
Run arguments can be used to provide configuration settings for the driver.
Some hardware accelerators e.g. need to change specific attributes that are
only available after the DriverContainer is executed.
```yaml
artifacts:
hostPaths:
- sourcePath: "/run/<vendor>/usr/src/<artifact>"
destinationDir: "/usr/src/"
images:
- name: "<vendor>-{{.KernelVersion}}:latest"
kind: ImageStreamTag
namespace: "<vendor>-<hardware>"
pullSecret: "vendor-secret"
paths:
- sourcePath: "/usr/src/<vendor>/<artifact>
destinationDir: "/usr/src/"
claims:
- name: "<vendor>-pvc"
mountPath: "/usr/src/<vendor>-<internal>"
```
The next section is used to tell SRO where to find build artifacts from other
drivers. Some drivers need e.g. symbol information from kernel modules, header
files or the complete driver sources to be built successfully. We are providing
two ways for these artifacts to be consumed. (1) Some vendors expose the build
artifacts in a hostPath. The DriverContainer with KVC needs a hook for preparing
the sources, which means it would copy from sourcePath on the host to the
destinationDir in the DriverContainer. (2) The other way to get build artifacts
is to use a DriverContainer image that is already built to get the needed
artifacts (We are assuming here that the vendor is not exposing any artifacts
to the host). We can leverage those images in a multi-stage build for the
DriverContainer.
```yaml
nodeSelector:
key: "feature.../pci-<VENDOR_ID>.present"
values: ["val1", "val2"]
```
The next section is used to filter the nodes on which a kernel module or driver
should be deployed. It makes no sense to deploy drivers on nodes where the
hardware is not available. Furthermore, this can also be used to target
subsets of special nodes, either by creating labels manually or by leveraging NFD's
hook functionality.
To retrieve the correct image we are using SRO templating to inject the
correct runtime information; here we are using **{{.KernelVersion}}** as a
unique identifier for DriverContainer images.
For the case when no external or internal repository is available, or in a
disconnected environment, SRO can also consume sources from a PVC. This makes
it easy to provide SRO with packages or artifacts that are only available
offline.
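A minimal sketch of such a claim, matching the PVC name used in the claims
section above (size and access mode are assumptions):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <vendor>-pvc
  namespace: <vendor>-<hardware>
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi   # space for offline driver sources, RPMs or other artifacts
```
Dependencies between special resources are declared with the dependsOn field: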
```yaml
dependsOn:
- name: <CR_NAME_VENDOR_ID_SROv2>
imageReference: "true"
- name: <CR_NAME_VENDOR_ID_KJI>
```
There are kernel modules that rely on symbols exported by another kernel
module; this is also handled by SRO. We can model this dependency with the
dependsOn tag. Multiple SRO CR names can be provided that have to be done
(all states ready) before the current CR can be kicked off. CRs with no
dependsOn tag can be executed/created/handled simultaneously.
Users should usually deploy only the top-level CR and SRO will take care of
instantiating the dependencies. There is no need to create all the CRs in the
dependency chain; SRO will take care of it.
If special resource *A* uses a container image from another special resource *B*,
e.g. using it as a base container for a build, SRO will set up the correct RBAC
rules to make this work.
```yaml
buildArgs:
- name: "KVER"
value: "{{.KernelVersion}}" (1)
- name: "KMODVER"
value: "SRO"
```
One can also use template variables in the CR that are correctly rendered by
SROv2 in the final manifest. SRO does a two-pass templating: the first pass
injects the variable into the manifest, and the second pass renders this
injected variable. Even if we do not know a cluster's runtime information
beforehand, we can use it in a CR.
#### DriverContainer Manifests (recipes)
The third part of enablement are the manifests for the DriverContainer. SRO
provides a set of predefined manifests that are completely templatized and SRO
updates each tag with runtime and meta information. They can be used for any
kernel module. Each Pod has a ConfigMap as an entrypoint, this way custom
commands or modification can be easily added to any container running with
SRO. See [https://red.ht/34ubzq3](https://red.ht/34ubzq3) for a complete list
of annotations and template parameters.
To ensure that a DriverContainer is running successfully, SRO provides several
annotations to steer the behaviour of the deployment. We can enforce an ordered
startup of the different stages: if the drivers are not loaded, it makes no sense to
start up e.g. a DevicePlugin; it will simply fail, and so will all other dependent
resources.
DriverContainer manifests can be annotated to tell SROv2 to wait for full
deployment of DaemonSets or Pods; SROv2 watches the status of these resources.
Some DriverContainers can be in a running state but are still executing scripts
before being fully operational. SROv2 provides a special annotation for the
manifest to look for a specific regex in the container logs to match before
declaring a DriverContainer operational. This way we can guarantee that
drivers are loaded and subsequent resources are running successfully.
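Purely as an illustration, such an annotated DriverContainer manifest could look
like the sketch below; the annotation keys are assumptions, and the blog post
above documents the actual ones:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: <vendor>-driver-container
  annotations:
    # Assumed key: tell SRO to wait until this DaemonSet is fully deployed
    # before creating dependent resources (e.g. a DevicePlugin).
    specialresource.openshift.io/wait: "true"
    # Assumed key: only declare the DriverContainer operational once this
    # regex matches a line in the container logs.
    specialresource.openshift.io/wait-for-logs: '^.*driver loaded.*$'
spec:
  # ... DaemonSet spec (selector, template with the DriverContainer) ...
```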
#### Supporting Disconnected Environments
SRO will first try to pull a DriverContainer. If the DriverContainer does not
exist, SROv2 will kick off a BuildConfig to build the DriverContainer on the
cluster. Administrators could build a DriverContainer upfront and push it to an
internal registry. If is able to pull it, it will ignore the BuildConfig and try
to deploy another DriverContainer if specified (ImageContentSourcePolicy).
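For illustration, assuming a vendor registry mirrored into an internal registry
(registry names are placeholders), such a policy could look like this:
```yaml
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: <vendor>-drivercontainer-mirror
spec:
  repositoryDigestMirrors:
    - source: registry.example.com/<vendor>/drivercontainer
      mirrors:
        # Internal or disconnected mirror that actually serves the image
        - internal-registry.example.com/<vendor>/drivercontainer
```
Note that such mirrors apply only to pulls by digest, which also has to be
accommodated in the DriverContainer naming scheme (see Implementation Details below).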
### Operator Metrics & Alerts
Like DevicePlugins the new operator should provide metrics and alerts on the
status of the DriverContainers. Alerts could be used for update, installation or
runtime problems. Metrics could expose resource consumption, because some of the
DriverContainers are also shipping daemons and helper tools that are needed to
enable the hardware.
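A hedged sketch of such an alert, assuming a hypothetical metric exported by SRO
(the metric name and namespace are illustrative):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: special-resource-operator-rules
  namespace: openshift-special-resource-operator   # hypothetical namespace
spec:
  groups:
    - name: sro.drivercontainer.rules
      rules:
        - alert: DriverContainerNotReady
          # Hypothetical metric: 1 when the DriverContainer for a special
          # resource is fully operational, 0 otherwise.
          expr: sro_drivercontainer_ready == 0
          for: 15m
          labels:
            severity: warning
          annotations:
            message: "DriverContainer for special resource {{ $labels.specialresource }} is not ready."
```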
### User Stories [optional]
#### Story 1: Day-2 DriverContainer Kernel Module
As a vendor of kernel extensions I want a procedure to build modules on
OpenShift with all relevant dependencies. Before loading the module I may need
to do some housekeeping and start helper binaries and daemons. This procedure
should enable an easy way to interact with the module and startup and teardown
of the entity delivering the kernel extension. I should also be able to run
several instances of the kernel extension (A/B testing, stable and unstable
testing).
#### Story 2: Multiple DriverContainers
As an administrator I want to enable a hardware stack to enable a specific
functionality with several kernel modules. These kernel modules may have an
order and the procedure enabling them needs to expose a way to model this
dependency. It may even be the case that a specific module needs kernel modules
loaded that are already installed on the node. The intended clusters are either
behind a proxy or completely disconnected and hence the anticipated procedure
has to work in these environments too.
#### Story 3: Day-2 DriverContainer Accelerator
As a vendor of a hardware accelerator I want a procedure to enable the
accelerator on OpenShift no matter which underlying OS is running on the nodes.
The life-cycle of the drivers should be fully managed with the ability to
upgrade and downgrade drivers for the accelerator.
It should support all kernel versions (major, minor, z-stream) and handle driver
errors gracefully. Uninstalling the drivers should not leave any trace of the
previous installation (keep the node as clean as possible). The driver will not
be upstreamed to the mainline kernel, which means it will always be out-of-tree.
#### Story 4: Multiple Driver Containers with Artifacts
As an administrator I want to enable a specific vendor stack. I need to build
kernel modules that are dependent on each other during the build and at the time
of loading. Specific build artifacts need to be available during the build but
not for loading. These artifacts can be available in another DriverContainer or
extracted during runtime.
### Implementation Details/Notes/Constraints
DriverContainers need at least the following packages:
- kernel-devel-$(uname -r)
- kernel-headers-$(uname -r)
- kernel-core-$(uname -r)
kernel-core is needed for running `depmod` inside the DriverContainer to resolve
all symbols and load the dependent modules. The DriverContainer does not
install the kmods on the host, so the modules already installed on
the host are missing in the container.
These packages can be installed from different sources; SRO can currently handle
all three:
- from base repository
- from EUS repository
- from machine-os-content (missing kernel-core)
[BuildConfigs do not support volume mounts](https://issues.redhat.com/browse/DEVEXP-17),
so they cannot be used for artifacts on a hostPath. If we have all artifacts stored in
a container image, a BuildConfig can be leveraged for building. Where no host
build artifacts are needed, a BuildConfig is the first choice because of the
functionality it provides (triggers, source, output, etc).
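A sketch of such a BuildConfig with both triggers, assuming a Docker-strategy
build of the vendor repository on top of a driver-toolkit ImageStream (names,
the base ImageStreamTag, and the templated tag are placeholders):
```yaml
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: <vendor>-driver-container
  namespace: <vendor>-<hardware>
spec:
  source:
    git:
      uri: "https://gitlab.com/<vendor>/driver.git"
      ref: "master"
  strategy:
    dockerStrategy:
      from:
        kind: ImageStreamTag
        name: "driver-toolkit:latest"        # placeholder base image stream
      buildArgs:
        - name: KVER
          value: "{{.KernelVersion}}"        # rendered by SRO at reconcile time
  output:
    to:
      kind: ImageStreamTag
      name: "<vendor>-driver-container:{{.KernelVersion}}"
  triggers:
    - type: ConfigChange                     # rebuild when SRO updates the BuildConfig
    - type: ImageChange                      # rebuild when the base image changes
      imageChange: {}
```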
One of the most important points: we have no issues with SELinux, because we are
interacting with the same SELinux context (container_file_t). Accessing host
devices, libraries, or binaries from a container breaks the confinement,
as we would have to allow containers to access host labels.
For DriverContainers to build the kernel modules we need entitlements first. The
e2e story is described here: [https://bit.ly/2XjZq5D](https://bit.ly/2XjZq5D)
We need to provide an interface for vendors to hook in their business logic for
building the drivers.
For prebuilt containers pullable from a vendor's repository we're going to use an
ImageContentSourcePolicy; currently only pulling by digest works, we cannot pull
by tag. We need to accommodate this in the naming scheme of a
DriverContainer.
#### The driver-toolkit container
The handling of repositories and extracting RPMs from the machine-os-content
can be a complex task. To make this easier, SRO builds a driver-toolkit base container
for easier out-of-tree driver building. This base container has the right kernel versions
that are needed for a specific OpenShift release.
This container should preferably be built by ART and pushed to the
registry.redhat.io/openshift4 registry. The build should be done on all z-stream releases
and nightlies to cover customers' pre-release testing and to catch any changes of
the kernel between releases.
Customers that want to build out-of-tree drivers would not need entitlements per se and would
have all needed RPMs at hand. This container should be externally accessible so it can be used
in customer CI/CD pipelines that do not need a full cluster installation.
This base container could also be used as a prototyping and testing tool for developers
in the pre-release phase. Drivers tested against a pre-release would make sure that when an
OpenShift version goes GA the customer has already tested several versions before that date.
There could be several z-stream releases with the very same kernel, but there wouldn't be a
single z-stream with different kernels.
Currently the driver-toolkit built by ART can only be tagged with the OpenShift "full" version (x.y.z),
meaning it is currently not easy to relate a specific driver-toolkit:vX.Y.Z to a specific node,
since nodes can be on different versions in the cluster depending on the state of the MCPs.
For building the driver-toolkit on the cluster as a fallback solution, if we do not have a recent
build, the other problem is that we cannot easily relate the nodes to the correct machine-os-content.
The proposal is to create an annotation on the release payload pointing to the machine-os-content.
The machine-os-content already has the kernel version annotation.
```text
release-payload:4.7.2 -> moc:8.3 -> kernel-4.20
release-payload:4.7.0 -> moc:8.2 -> kernel-4.19
mcp0: node -> kernel-4.20
mcp1: node -> kernel-4.19
```
The *primary key* of those two datasets would be the kernel. This would also solve the issue of
finding the right machine-os-content for a specific release. The extensions are used to build on cluster as a
fallback solution if the driver-toolkit container is not available, e.g. for an early nightly build.
Otherwise one would need to do the following (the `oc adm ...` command literally
pulls the container, mounts it and reads the manifest to print out the osImage URL):
```bash
$ CNT=`buildah from registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-03-17-153948`
$ MNT=`buildah mount $CNT`
$ yq '.data.osImageURL' $MNT/release-manifests/0000_80_machine-config-operator_05_osimageurl.yaml
"registry.ci.openshift.org/ocp/4.8-2021-0 ... "
```
A simple inspect of the image should work in this case (https://issues.redhat.com/browse/ART-2763),
see also `Can we update os-release to reflect the "full" version of OpenShift?` on coreos-devel.
```bash
$ skopeo inspect docker://registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-03-17-153948 | grep os-image-url
"io.openshift.release.os-image-url": "registry.ci.openshift.org/ocp/4.8-2021-03-17-153948@sha256:cb00332da7d29f98990058cbe4376615905cf05857ff81c0cb408ca6365b4196"
```
From here we can use the annotations without pulling the image:
```bash
$ skopeo inspect docker://registry.ci.openshift.org/ocp/4.8-2021-03-17-153948@sha256:cb00332da7d29f98990058cbe4376615905cf05857ff81c0cb408ca6365b4196 | grep kernel
"com.coreos.os-extensions": "kernel-rt;kernel-devel;qemu-kiwi;usbguard",
"com.coreos.rpm.kernel": "4.18.0-240.15.1.el8_3.x86_64",
"com.coreos.rpm.kernel-rt-core": "4.18.0-240.15.1.rt7.69.el8_3.x86_64",
```
### Risks and Mitigations
What are the risks of this proposal and how do we mitigate. Think broadly. For
example, consider both security and how this will impact the larger OKD
ecosystem.
How will security be reviewed and by whom? How will UX be reviewed and by whom?
Consider including folks that also work outside your immediate sub-project.
## Design Details
### Test Plan
**Note:** *Section not required until targeted at a release.*
Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.
All code is expected to have adequate tests (eventually with coverage
expectations).
### Upgrade / Downgrade Strategy
#### Red Hat Kernel ABI
Red Hat kernels guarantee a stable kernel application binary interface (kABI). If modules
are only using whitelisted symbols then they can leverage weak-updates in the
case of an upgrade. A kmod that is built on 8.0 can easily be loaded on all
subsequent y-stream releases. The weak-update is nothing more than a symlink in
`/lib/modules/..../weak-updates` for the driver.
We will extend KVC to check if an out-of-tree driver is able to use weak-updates and
leverage the weak-modules script (part of RHEL) to create the correct symlinks.
If the driver is not kABI compatible, SRO will create an alert on the console for
awareness.
On some rare occasions the kABI can change (CVEs, bugs, etc), hence as a preflight
check SRO is going to compare the current kABI with the kABI coming with the update.
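For the kABI-compatible case, a hedged sketch of the weak-update step inside a
DriverContainer (the module path is a placeholder, and the exact weak-modules
invocation would need to be validated against the RHEL version in the image):
```bash
# Freshly built out-of-tree module for the kernel it was compiled against.
KMOD="/lib/modules/$(uname -r)/extra/<kmod>.ko"

# weak-modules (shipped with RHEL kmod packaging) reads module paths from
# stdin and creates the weak-updates symlinks for compatible installed kernels.
echo "${KMOD}" | /usr/sbin/weak-modules --add-modules
```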
#### Updates in OpenShift
Updates in OpenShift can happen in two ways:
1. Only the payload (operators and needed parts on top of the OS) is updated
2. The payload and the OS are simultaneously updated.
The first case is "easy" the new version of the operator will reconcile the
expected state and verify that all parts of the special resource stack are
working and then "do" nothing.
For the second case, the new operator will reconcile the expected state and see
that there is a mismatch regarding the kernel version of the DriverContainer and
the updated Node.
It will try to pull the new image with the correct kernel version.
If the correct DriverContainer cannot be pulled, will update the BuildConfig
with the right kernel version and OpenShift will reinitiate the build since we
have the ConfigChange trigger as described above.
Besides the ConfigChange trigger, we also added the ImageChange trigger, which
is important when the base image is updated due to CVEs or other bug fixes.
For this to happen automatically we are leveraging OpenShift ImageStreams; an
ImageStream is a collection of tags that is automatically updated with the
latest images. It is like a container repository that represents a virtual view
of related images.
To always stay up to date, another possibility would be to register a github/gitlab
webhook so that every time the DriverContainer code changes a new container is
built.
One just has to make sure that the webhook is triggered on a specific release
branch; it is not advisable to monitor a fast-moving branch (e.g. master) that
would trigger frequent builds.
#### OpenShift Rolling Updates
The OpenShift update can be split into two major parts. The first one being the
upgrade of all CVO managed operators (with OLM) and the second part the update
of the operating system upgrade.
An operator that is deploying a ClusterOperator object can signal CVO if it is
ready to be upgraded or not (Upgradeable=False). MCO and other operator will
check several cluster constraints and signal upgradebility.
SRO will use this very first phase to execute a preflight check to see if the
new kernel that is coming with the operating system is compatible with the
currently deployed out-of-tree drivers. SRO will set Upgradeable=False and only
set it to True if the special resources that it manages can either be pulled
(meaning a customer/partner CI/CD pipeline has already created them) or be
built successfully with the new kernel. This way one can guard the
special resource from being updated and prevent an upgrade to a non-working
kernel.
If the preflight checks are successful, SRO needs to take care of the operating
system upgrade. CVO will do the upgrade of MCO as the last step and create all
necessary manifests (e.g. an updated osimage ConfigMap with the new
URL pointing to the osImage) that are used by MCO to roll out the new OS.
By default there are two MachineConfigPools in OpenShift (master and worker);
when MCO starts updating the OS it will do it one machine at a time in each
MachineConfigPool. To prevent MCO from updating a specific MCP we can set it to
`paused: true`.
One way to handle this rolling update is to wait for all machines to be updated
in an MCP and then roll out the new drivers, but this would mean service or
application downtime, and depending on the number of nodes it could be very long.
SRO's goal here is to keep the downtime to a bare minimum.
To handle this situation SRO will create one DriverContainer DaemonSet for each
triplet of cluster, OS and kernel version. Supposing we have an MCP with 5
machines, and X is the current version and Y the version to be upgraded to, we
would have the following picture (N is the node):
```bash
Node: N:0 N:1 N:2 N:3 N:4
OS Version: V:X V:X V:X V:X V:X
DaemonSet: D:X D:X D:X D:X D:X
```
MCO will do a rolling update, pick the first node and apply version Y to it;
the action item for SRO is to now create a new DaemonSet with the new version Y
and provide the new drivers to the node. Since we ran the preflight check
in the first phase, SRO knows that it will work and creates the new DaemonSet.
```bash
Node: N:0 N:1 N:2 N:3 N:4
OS Version: V:Y V:X V:X V:X V:X
DaemonSet: D:Y D:X D:X D:X D:X
```
The DaemonSet will have a nodeSelector targeting the different kernel versions,
meaning a DaemonSet built for version X will only run on nodes with compatible
kernel X and a DaemonSet built for version Y will only run on nodes with compatible
kernel Y. This has the effect that the new DaemonSet (Y) will automatically scale
up to the new nodes updated to the new version Y by MCO, and the DaemonSet (X)
will automatically scale down.
```bash
Node: N:0 N:1 N:2 N:3 N:4
OS Version: V:Y V:Y V:Y V:Y V:X
DaemonSet: D:Y D:Y D:Y D:Y D:X
```
MCO will do the rolling update of all the nodes and the DaemonSets will scale
up or down automatically, making sure that a working version of the
out-of-tree driver is running on all nodes at any time.
This way SRO can support multiple version skews of OpenShift and handle several
MCPs running different versions. Depending on the cluster constraints SRO
will also delete obsolete, no longer "supported" DriverContainer DaemonSet versions.
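A sketch of the per-kernel-version selection, assuming the NFD kernel-version
label is used as the nodeSelector (DaemonSet name and image are placeholders):
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: <vendor>-driver-container-{{.KernelVersion}}
spec:
  selector:
    matchLabels:
      app: <vendor>-driver-container-{{.KernelVersion}}
  template:
    metadata:
      labels:
        app: <vendor>-driver-container-{{.KernelVersion}}
    spec:
      nodeSelector:
        # NFD exposes the running kernel version as a node label, so the
        # DaemonSet built for kernel X only lands on nodes running kernel X.
        feature.node.kubernetes.io/kernel-version.full: "{{.KernelVersion}}"
      containers:
        - name: driver-container
          image: "<vendor>-{{.KernelVersion}}:latest"   # resolved via the internal registry/ImageStream
```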
#### Special Resource Driver Downgrade
Having a look at the example CRs above we can see that one can provide a driver
version for a specific hardware aka DriverContainer. SRO will take care of updating
the BuildConfig and DriverContainer manifests. Tainting the node with
**specialresource.openshift.io/downgrade=true:NoExecute** will evict all running
Pods and the DriverContainer can be restarted. When the DriverContainer is again
up and running, the node can be un-tainted to allow workloads to be scheduled on
the node again.
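For illustration, the taint and un-taint could be applied with oc (the node name
is a placeholder):
```bash
# Taint the node so running workloads are evicted before the driver downgrade.
oc adm taint nodes <node-name> specialresource.openshift.io/downgrade=true:NoExecute

# Remove the taint once the downgraded DriverContainer is up and running again.
oc adm taint nodes <node-name> specialresource.openshift.io/downgrade=true:NoExecute-
```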
#### Update proactive DriverContainers
A preferable workflow for updates could also be to be proactive on updates. When
OpenShift is updated we would need a mechanism (notification, hook for updates)
to provide the kernel version of the next update before attempting the
upgrade and rebooting the nodes (needinfo installer/CVO team) .
This way DriverContainers are prebuilt and potential problems can be examined
before the update completes (e.g. no drivers for newer kernels, build errors,
etc).
Another major point is knowing the underlying RHEL version with major and minor
number. Many drivers have dependencies on RHEL8.0 or RHEL8.1, etc. Currently
there is no easy way to find out if RHCOS is based on RHEL8.0 or RHEL8.1
(OpenShift 4.3 e.g. changes from 8.0 to 8.1 depending on the z-stream).
#### Exception Handling
If there is no prebuilt DriverContainer and no source git repository is provided
to build the DriverContainer, the current behaviour is to wait until one of these
prerequisites is fulfilled: either a DriverContainer is pushed to a registry
known to the cluster, or a new updated CR is created. The current status is exposed in the
status field of the special resource.
To prevent such a state, the user/administrator should know the kernel version
upfront before an update happens. We need an obvious way to expose the
kernel version.
Even with the kernel version exposed it is hard to know if an update will break
the cluster. There are several constraints of the drivers and how they tie to a
kernel version.
In the simplest case there is one single source of drivers that can be compiled on all major RHEL
versions. Here it does not matter which kernel version we are running; we
can assume that the drivers work for all 3.xx.yy and 4.xx.yyy kernels.
One could also have drivers that are only dependent on the major RHEL version. We
would need to consider "only" upgrades from one major version to the other; here drivers
are sensitive to going from one major kernel version to the other.
Another case is where drivers are also sensitive to minor version changes, which
means there are driver changes for any kernel version.
#### DCI - Distributed CI Environment (RHEL)
### Version Skew Strategy
Some use-case scenarios rely on NFD labels, which are used as node
selectors for deploying the DriverContainers. NFD labels are not changed during
updates. The specific label is an input parameter for the CR of a hardware type.
NFD labels are integral parts of the node; if a label is not discovered then the
hardware is not available, and hence the node is not intended to be used as a deployment
target for DriverContainers.