---
title: Supporting out-of-tree drivers on OpenShift
- "@zvonkok"
reviewers:
- "@ashcrow"
- "@darkmuggle"
- "@cgwalter"
approvers:
- "@ashcrow"
- "@cgwalters"
- "@darkmuggle"
creation-date: 2020-04-03
last-updated: 2020-07-21
status: provisional
see-also:
- "/enhancements/TODO.md"
replaces:
- "/enhancements/TODO.md"
superseded-by:
- "/enhancements/TODO.md"
---
# Supporting out-of-tree drivers on OpenShift
## Release Signoff Checklist
- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
## Open Questions [optional]
1. Should SRO be CVO or OLM managed? SRO creates a ClusterOperator object for must-gather and better status reporting to customers/developers/users.
2. Should the driver-toolkit container be part of the payload? It should be accessible through registry.redhat.io/openshift4 without a cluster for out-of-tree driver development, testing on prereleases, ... If a customer has a pull-secret for OCP they should be able to pull it without a separate "login".
## Summary
OpenShift will support out-of-tree and third-party kernel drivers, and the
supporting software for the underlying operating systems, via containers.
## Terminology
### Day 0, Day 1, Day 2
The terms Day 0, Day 1, and Day 2 refer to different phases of the software life
cycle, and there are different interpretations of what Day \<Z\> means. In the context
of OpenShift:
- Day 1 covers all operations involved in installing an OpenShift cluster
- Day 2 covers all operations performed after an OpenShift cluster is installed
This enhancement concentrates solely on Day 2 operations.
### DriverContainers
DriverContainers are increasingly used in cloud-native environments, especially
on pure container operating systems, to deliver hardware drivers to the
host. DriverContainers are more than a delivery mechanism for the driver itself:
they extend the kernel stack beyond the out-of-box software and hardware
features of a specific kernel. Additionally, a DriverContainer can handle the
configuration of modules and start userland services.
DriverContainers work on various container-capable Linux distributions.
With DriverContainers the host always stays "clean" and does not clash with
different library versions or binaries on the host. Prototyping is easier, and
updates are done by pulling a new container, with loading and unloading handled
by the DriverContainer (including checks on /proc, /sys and other files to make
sure that all traces are removed).
## Current Solutions
Here are the current solutions in use today.
### Special Resource Operator
For any day-2 management of kernel modules we can leverage the Special Resource
Operator (SRO) features. SRO was written in such a way that it is highly
customizable and configurable to any hardware accelerator or out-of-tree kernel
module.
A detailed description of SRO and its inner workings can be found in the
following two blog posts:
- [https://red.ht/2JQuNwB](https://red.ht/2JQuNwB)
- [https://red.ht/34ubzq3](https://red.ht/34ubzq3)
SRO supports full lifecycle management of an accelerator stack, but it can also
be used in a stripped-down form to manage e.g. only one kernel module.
Furthermore, SRO can handle multiple kernel modules from different vendors and is
able to model dependencies between them.
Another important feature is the ability to consume build artifacts from other
kernel modules to build a more sophisticated DriverContainer. SRO is capable of
delivering out-of-tree drivers and supporting software stacks for kernel
features and hardware that is not shipped as part of the standard Fedora/RHEL
distribution.
Ideally, SRO would pull a prebuilt DriverContainer with precompiled drivers from
the vendor. Any module updates (and downgrades) will be delivered by container.
The Special Resource Operator (SRO) is currently only available in OperatorHub.
SRO has proven in the past to be the template for enabling hardware on
OpenShift. Its ability to handle several DriverContainers with only one
copy of SRO running makes it a preferable solution to tackle kmods on OpenShift.
SRO is going to be a core component of OpenShift and delivered/managed by CVO.
Here is an example of how one can use SRO + KVC to deliver a simple kernel module
via container in an OpenShift cluster:
[https://bit.ly/2EAlLEF](https://bit.ly/2EAlLEF)
### kmods-via-containers (KVC)
[kmods-via-containers](https://github.com/kmods-via-containers/) is a framework
for building and delivering kernel modules via containers. The implementation
for this framework was inspired by the work done by Joe Doss on
[atomic-wireguard](https://github.com/jdoss/atomic-wireguard). This framework
relies on 3 independently developed pieces.
1. [The kmods-via-containers code/config](https://github.com/kmods-via-containers/kmods-via-containers)
Delivers the stencil code and configuration files for building and delivering
kmods via containers. It also delivers a service `kmods-via-containers@.service`
that can be instantiated for each instance of the KVC framework.
2. The kernel module code that needs to be compiled
This repo contains the source code for building the kernel module. It can be
delivered by vendors and generally
knows *nothing* about containers. Most importantly, if someone wanted to deliver
this kernel module via the KVC framework, the owners of the code don't need to
be consulted. The project provides an
[example kmod repo](https://github.com/kmods-via-containers/simple-kmod).
3. A KVC framework repo for the kernel module to be delivered
This repo defines a container build configuration as well as a library,
userspace tools, and config files that need to be created on the host system.
This repo does not have to be developed by the owner of the kernel module being
delivered.
It must define a few functions in the bash library:
- `build_kmods()`
- Performs the kernel module container build
- `load_kmods()`
- Loads the kernel module(s)
- `unload_kmods()`
- Unloads the kernel module(s)
- `wrapper()`
- A wrapper function for userspace utilities
Customers can hook in their own procedures for how to build, load, unload, etc. their
kernel modules. We are providing only an interface, not the actual "complicated"
implementation of those steps (facade pattern). Customers can then use any
tool(s) (akmods, dkms, ..) they need to build their modules.
[This repo](https://github.com/kmods-via-containers/kvc-simple-kmod) houses
an example using `simple-kmod`.
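As a minimal, illustrative sketch (not the actual kvc-simple-kmod code), a KVC
library for a hypothetical module could implement the hooks roughly as below;
the module name, image name, and podman invocations are assumptions:
```bash
#!/bin/bash
# Illustrative KVC library for a hypothetical "simple-kmod" module.
# The real hooks are free to use whatever build tooling the vendor prefers.

KMOD_NAME="simple-kmod"
IMAGE="simple-kmod-container:latest"

build_kmods() {
    # Build a container image that compiles the module against the running kernel.
    podman build -t "${IMAGE}" --build-arg KVER="$(uname -r)" .
}

load_kmods() {
    # Load the module from inside the (privileged) container.
    podman run --rm --privileged "${IMAGE}" modprobe "${KMOD_NAME}"
}

unload_kmods() {
    # Unload the module on the host.
    modprobe -r "${KMOD_NAME}"
}

wrapper() {
    # Run a userspace utility that ships inside the container.
    podman run --rm --privileged "${IMAGE}" "$@"
}
```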
## Motivation
In OpenShift v3.x out-of-tree drivers could be installed easily on the nodes,
since each node was a full RHEL node with a subscription and the needed tools
could be installed with yum.
In OpenShift v4.x this changed with the introduction of RHCOS. There are
currently two different documented ways to enable out-of-tree drivers: one
is using an SRO-based operator and the other is using kmods-via-containers.
We want to come up with a unified solution that works for our customers across
RHCOS and RHEL. The solution should also help customers that are currently on
RHEL7 to consider moving to RHCOS which is fully managed and easier to support
in OpenShift.
### DriverContainer Management on OpenShift
More and more customers/partners want to enable hardware and/or software on
OpenShift that needs kernel drivers which are currently (and possibly forever)
out-of-tree, i.e. not upstream/in-tree.
We want to provide a unified way to support multiple out-of-tree kernel drivers
on OpenShift as a day-2 operation. It has to work in the same way for classical
(RHEL7, RHEL8, Fedora) and container-based operating systems (RHCOS, FCOS).
Many customers/partners are using [dkms](https://github.com/dell/dkms)/akmods as
the solution to build and rebuild modules on kernel changes. Adopting dkms/akmods
is not a workable solution for OpenShift/OKD; we need to create and own an
acceptable build and delivery mechanism for RHEL, Fedora, and OpenShift.
### Fill the gap of providing drivers that are not yet, or will never be, upstream
For the DriverContainer we need to cover several stages of driver packaging.
- **source repository or archive** The driver is available as source code, is
not packaged, and/or is required to be set up before the cluster is available;
this is where KVC can help
- **kmod-{vendor}-src.rpm** The next step is a source RPM package that can be
recompiled with rpmbuild. This is also the base for akmods, dkms
- **precompiled-{vendor}-{kernelversion}.rpm** Precompiled RPMs are the wishful
thinking for the future. DriverContainers could then be built easily just by
installing RPMs.
Some drivers will *never* be upstreamed and can be in any of the states
described above. The proposed solution needs to handle drivers in any state,
built by "any" tool.
The compilation of kernel modules was always anticipated to be the fallback
solution when dealing with kernel modules. Some kernel modules will always be
out-of-tree and are not going to be included upstream in the near future; for
others we are working with the vendors on upstreaming them to the mainline
kernel.
### Goals
- A unified way to deploy out-of-tree drivers on OpenShift 4.x on all supported
Red Hat Operating Systems
- The solution should avoid rebuilds on every node and allow for distribution of
drivers on a cluster using the cluster registry
- A solution for day-2 kernel modules
- Support upgrades of OpenShift for multiple kernel module providers
- Hierarchical initialization of kernel modules (modeling dependencies)
- Handle dependencies between kernel modules (depmod) in tree and out of tree
- Should support disconnected and proxy environments
- Support heterogeneous clusters:
- OpenShift with RHEL7, 8 and RHCOS
### Non-Goals
- The solution is not a replacement for the traditional way of delivering kernel
modules; customers/partners should be aware that we prefer they deliver the
drivers upstream
- We are not providing a way to build the drivers; this is the business logic of
a specific vendor. We are providing the interface to hook into specific stages
of a DriverContainer
- Extending customer support for third-party modules or implications of said
modules.
## Proposal
The SRO pattern showed how to enable hardware and the complete hardware
accelerator stack on OpenShift. The heavy lifting was the management of the
DriverContainer; only approximately 5% of the logic behind SRO was used for
deploying the remaining parts of the stack.
Based on the current SROv1alpha1 we're going to build a new version, SROv1beta1, that
has more functionality focusing on the out-of-tree driver aspect.
The new version of SRO will have an API update and will hence be called SROv1beta1 for
Tech Preview and SROv1 for GA.
### Combining both approaches
For managing the module in a container we are going to use KVC as the framework
of choice. Targeting RHCOS also solves the problem for RHEL7 and RHEL8. Those
KVC containers, aka DriverContainers, are managed by SRO.
### Day-2 DriverContainer Management on OpenShift
For any day-2 kernel module management or delivery we propose using SRO as the
building block on OpenShift.
We will run a single copy of SRO as part of OpenShift that is able to
handle multiple kernel module drivers using the proposed CRs shown below.
The following section will cover three kernel module instantiations: (1) a single
kernel module, (2) multiple kernel modules with build artifacts, and (3) full-stack
enablement.
There are three main parts involved in the enablement of a kernel module. We
have a specific (1) set of meta information needed for each kernel module, a (2)
set of manifests to deploy a DriverContainer plus enablement stack and lastly
(3) a framework running inside the container for managing the kernel module
(dkms like functions).
*(1) The metadata are encoded in the CR for a special resource*
*(2) The manifests with templating functions to inject runtime information
are the so called recipes*
*(3) This will be done by KVC and some enhancements that will be discussed later*
The following section will walk one through the enablement of the different
use-case scenarios. After deploying the operator the first step is to create an
instance of a special-resource. Following are some example CRs for how one would
instantiate SRO to manage a kernel module or hardware driver.
#### Example CR for a single kernel module #1
```yaml
apiVersion: sro.openshift.io/v1alpha1
kind: SpecialResource
metadata:
name: <vendor>-<kmod>
spec:
metadata:
version: <semver>
driverContainer:
- git:
ref: "release-4.3"
uri: "https://gitlab.com/<vendor>/<kmod>.git"
```
The second example below shows the combined capabilities of SRO for dealing with
multiple DriverContainers and artifacts. On the other hand, SRO can also be
used in a minimalistic form where we only deploy a simple kmod: the example CR
above would create only one DriverContainer from the git repository provided.
For each kernel module one would provide one CR with the needed information.
#### Example CR for a hardware vendor (all settings) #2
```yaml
apiVersion: sro.openshift.io/v1alpha1
kind: SpecialResource
metadata:
name: <vendor>-<hardware>
spec:
metadata:
version: <semver>
namespace: <vendor>-<driver>
machineConfigPool: <vendor>-<mcp>
matchLabels:
<vendor>-<label>: "true"
configuration:
- name: "key_id"
value: ["AWS_ACCESS_KEY_ID"]
- name: "access_key"
value: ["AWS_SECRET_ACCESS_KEY"]
driverContainer:
source:
git:
ref: "master"
uri: "https://gitlab.com/<vendor>/driver.git"
buildArgs:
- name: "DRIVER_VERSION"
value: "440.64.00"
- name: "USE_SPECIFIC_DRIVER_FEATURE"
value: "True"
runArgs:
- name: "LINK_TYPE_P1" # 1st Port
value: "2" #Ethernet
- name: "LINK_TYPE_P2" # 2nd Port
value: "2" #Ethernet
artifacts:
hostPaths:
- sourcePath: "/run/<vendor>/usr/src/<artifact>"
destinationDir: "/usr/src/"
images:
- name: "<vendor>-{{.KernelVersion}}:latest"
kind: ImageStreamTag
namespace: "<vendor>-<hardware>"
pullSecret: "vendor-secret"
paths:
- sourcePath: "/usr/src/<vendor>/<artifact>
destinationDir: "/usr/src/"
claims:
- name: "<vendor>-pvc"
mountPath: "/usr/src/<vendor>-<internal>"
nodeSelector:
key: "deployment-cluster"
values: ["frontend", "backend"]
dependsOn:
- name: <CR_NAME_VENDOR_ID_SRO>
- name: <CR_NAME_VENDOR_ID_KJI>
```
SRO will manage several special resources in different namespaces, hence
the CRD will have cluster scope. SRO can take care of creating and
deleting the namespace for the special resource, which makes cleanup of
a special resource easy: just delete the namespace. Otherwise one would have a
manual step of creating the new namespace before creating the CR for a special resource.
If no spec.metadata.namespace is supplied, SRO will set
the namespace to the CR name by default to separate the resources of each special resource.
With the above information SRO is capable of deducing everything needed
to build and manage a DriverContainer. All manifests in SRO are templates that
are rendered during reconciliation with runtime and meta information.
Recipes will have a version field to distinguish them across operator upgrades.
An operator upgrade will not create an updated version of the recipe; this is done by
editing the CR and updating the version field.
##### MachineConfigPools
There is also an optional field to set a MachineConfigPool per special resource.
A paused MCP will not be upgraded but all other workers, masters and operators will be.
An upgrade could introduce an incompatibility with the special resource and the kernel.
The production workload can stay in the paused MCP and an updated special resource
nodeSelector can be used to deploy the special resource to the upgraded nodes.
SRO can handle different kernel versions in a cluster see [OpenShift Rolling Updates](#OpenShift-Rolling-Updates)
This can reduce application downtime where we would have always a working version running
in the cluster. If the new upgraded Node can handle the special resoure the MPC can be unpaused
an the rolling upgrade can be finished.
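As a sketch, pausing such a dedicated MCP is a single field on the MachineConfigPool
object (the pool name is the placeholder from the CR example above):
```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: <vendor>-<mcp>
spec:
  paused: true   # MCO skips OS updates for nodes in this pool until unpaused
```
Returning to the CR itself, a reduced example using only the most common fields looks like this: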
```yaml
metadata:
name: <vendor>-<hardware>
spec:
metadata:
namespace: <vendor>-<hardware>
configuration:
- key: "key_id"
value: "ACCESS_KEY_ID"
- key: "access_key"
value: "SECRET_ACCESS_KEY"
driverContainer:
source:
git:
ref: "master"
uri: "https://gitlab.com/<vendor>/driver.git"
```
The name is used to prefix all resources (Pod, DaemonSet, RBAC,
ServiceAccount, Namespace, etc) created for this very specific
**{vendor}-{hardware}**. The DriverContainer section optionally takes the git
repository from a vendor. This repository has all tools and scripts to build the
kernel module. The base image for a DriverContainer is a UBI 7 or UBI 8 image with the KVC
(kmods-via-containers) framework installed. Simpler builds can be accomplished
by including the Dockerfile in the Build YAML.
KVC provides hooks to build, load, and unload the kernel modules, and a wrapper for
userspace utilities. We might extend the number of hooks to have an interface
similar to dkms.
The configuration section can be used to provide an arbitrary set of key-value
pairs that can later be templated into the manifests for any kind of information
needed in the enablement stack.
```yaml
buildArgs:
- name: "DRIVER_VERSION"
value: "440.64.00"
- name: "USE_SPECIFIC_DRIVER_FEATURE"
value: "True"
```
Another important field is the build arguments. We have often seen
incompatibilities between workloads and driver versions. Selecting a specific
version is sometimes the only way to have a workload run successfully on
OpenShift or bare metal. This field can also be used by an administrator to
upgrade or downgrade a kernel module due to CVEs, bug fixes or incompatibilities.
Some drivers also have flags to enable or disable specific features of the
driver.
```yaml
runArgs:
- name: "LINK_TYPE_P1" # 1st Port
value: "2" #Ethernet
- name: "LINK_TYPE_P2" # 2nd Port
value: "2" #Ethernet
```
Run arguments can be used to provide configuration settings for the driver.
Some hardware accelerators e.g. need to change specific attributes that are
only available after the DriverContainer is executed.
```yaml
artifacts:
hostPaths:
- sourcePath: "/run/<vendor>/usr/src/<artifact>"
destinationDir: "/usr/src/"
images:
- name: "<vendor>-{{.KernelVersion}}:latest"
kind: ImageStreamTag
namespace: "<vendor>-<hardware>"
pullSecret: "vendor-secret"
paths:
- sourcePath: "/usr/src/<vendor>/<artifact>
destinationDir: "/usr/src/"
claims:
- name: "<vendor>-pvc"
mountPath: "/usr/src/<vendor>-<internal>"
```
The next section is used to tell SRO where to find build artifacts from other
drivers. Some drivers need e.g. symbol information from kernel modules, header
files or the complete driver sources to be built successfully. We are providing
two ways for these artifacts to be consumed. (1) Some vendors expose the build
artifacts in a hostPath. The DriverContainer with KVC needs a hook for preparing
the sources, which means it would copy from sourcePath on the host to the
destinationDir in the DriverContainer. (2) The other way to get build artifacts
is to use a DriverContainer image that is already built to get the needed
artifacts (We are assuming here that the vendor is not exposing any artifacts
to the host). We can leverage those images in a multi-stage build for the
DriverContainer.
```yaml
nodeSelector:
key: "feature.../pci-<VENDOR_ID>.present"
values: ["val1", "val2"]
```
The next section is used to filter the nodes on which a kernel module or driver
should be deployed. It makes no sense to deploy drivers on nodes where the
hardware is not available. Furthermore, this can also be used to target
subsets of special nodes, either by creating labels manually or by leveraging NFD's
hook functionality.
To retrieve the correct image we are using SRO templating to inject the
correct runtime information; here we are using **{{.KernelVersion}}** as a
unique identifier for DriverContainer images.
For the case when no external or internal repository is available, or in a
disconnected environment, SRO can also consume sources from a PVC. This makes
it easy to provide SRO with packages or artifacts that are only available
offline.
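A minimal sketch of such a claim, matching the PVC name used in the claims
section above (size and access mode are assumptions):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <vendor>-pvc
  namespace: <vendor>-<hardware>
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi   # space for offline driver sources, RPMs or other artifacts
```
Dependencies between special resources are declared with the dependsOn field: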
```yaml
dependsOn:
- name: <CR_NAME_VENDOR_ID_SROv2>
imageReference: "true"
- name: <CR_NAME_VENDOR_ID_KJI>
```
There are kernel modules that rely on symbols exported by another kernel
module; this is also handled by SRO. We can model this dependency with the
dependsOn tag. Multiple SRO CR names can be provided that have to be done
(all states ready) before the current CR can be kicked off. CRs with no
dependsOn tag can be executed/created/handled simultaneously.
Users should usually deploy only the top-level CR and SRO will take care of
instantiating the dependencies. There is no need to create all the CRs in the
dependency chain; SRO will take care of it.
If special resource *A* uses a container image from another special resource *B*,
e.g. using it as a base container for a build, SRO will set up the correct RBAC
rules to make this work.
```yaml
buildArgs:
- name: "KVER"
value: "{{.KernelVersion}}" (1)
- name: "KMODVER"
value: "SRO"
```
One can also use template variables in the CR that are correctly rendered by
SROv2 in the final manifest. SRO does a two-pass templating: the first pass
injects the variable into the manifest, and the second pass renders this
injected variable. Even if we do not know a cluster's runtime information
beforehand, we can use it in a CR.
#### DriverContainer Manifests (recipes)
The third part of enablement are the manifests for the DriverContainer. SRO
provides a set of predefined manifests that are completely templatized and SRO
updates each tag with runtime and meta information. They can be used for any
kernel module. Each Pod has a ConfigMap as an entrypoint, this way custom
commands or modification can be easily added to any container running with
SRO. See [https://red.ht/34ubzq3](https://red.ht/34ubzq3) for a complete list
of annotations and template parameters.
To ensure that a DriverContainer is running successfully, SRO provides several
annotations to steer the behaviour of the deployment. We can enforce an ordered
startup of the different stages: if the drivers are not loaded, it makes no sense to
start up e.g. a DevicePlugin; it will simply fail, and so will all other dependent
resources.
DriverContainer manifests can be annotated to tell SROv2 to wait for full
deployment of DaemonSets or Pods; SROv2 watches the status of these resources.
Some DriverContainers can be in a running state but are still executing scripts
before being fully operational. SROv2 provides a special annotation for the
manifest to look for a specific regex in the container logs to match before
declaring a DriverContainer operational. This way we can guarantee that
drivers are loaded and subsequent resources are running successfully.
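Purely as an illustration, such an annotated DriverContainer manifest could look
like the sketch below; the annotation keys are assumptions, and the blog post
above documents the actual ones:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: <vendor>-driver-container
  annotations:
    # Assumed key: tell SRO to wait until this DaemonSet is fully deployed
    # before creating dependent resources (e.g. a DevicePlugin).
    specialresource.openshift.io/wait: "true"
    # Assumed key: only declare the DriverContainer operational once this
    # regex matches a line in the container logs.
    specialresource.openshift.io/wait-for-logs: '^.*driver loaded.*$'
spec:
  # ... DaemonSet spec (selector, template with the DriverContainer) ...
```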
#### Supporting Disconnected Environments
SRO will first try to pull a DriverContainer. If the DriverContainer does not
exist, SROv2 will kick off a BuildConfig to build the DriverContainer on the
cluster. Administrators could build a DriverContainer upfront and push it to an
internal registry. If is able to pull it, it will ignore the BuildConfig and try
to deploy another DriverContainer if specified (ImageContentSourcePolicy).
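For illustration, assuming a vendor registry mirrored into an internal registry
(registry names are placeholders), such a policy could look like this:
```yaml
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: <vendor>-drivercontainer-mirror
spec:
  repositoryDigestMirrors:
    - source: registry.example.com/<vendor>/drivercontainer
      mirrors:
        # Internal or disconnected mirror that actually serves the image
        - internal-registry.example.com/<vendor>/drivercontainer
```
Note that such mirrors apply only to pulls by digest, which also has to be
accommodated in the DriverContainer naming scheme (see Implementation Details below).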
### Operator Metrics & Alerts
Like DevicePlugins the new operator should provide metrics and alerts on the
status of the DriverContainers. Alerts could be used for update, installation or
runtime problems. Metrics could expose resource consumption, because some of the
DriverContainers are also shipping daemons and helper tools that are needed to
enable the hardware.
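A hedged sketch of such an alert, assuming a hypothetical metric exported by SRO
(the metric name and namespace are illustrative):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: special-resource-operator-rules
  namespace: openshift-special-resource-operator   # hypothetical namespace
spec:
  groups:
    - name: sro.drivercontainer.rules
      rules:
        - alert: DriverContainerNotReady
          # Hypothetical metric: 1 when the DriverContainer for a special
          # resource is fully operational, 0 otherwise.
          expr: sro_drivercontainer_ready == 0
          for: 15m
          labels:
            severity: warning
          annotations:
            message: "DriverContainer for special resource {{ $labels.specialresource }} is not ready."
```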
### User Stories [optional]
#### Story 1: Day-2 DriverContainer Kernel Module
As a vendor of kernel extensions I want a procedure to build modules on
OpenShift with all relevant dependencies. Before loading the module I may need
to do some housekeeping and start helper binaries and daemons. This procedure
should enable an easy way to interact with the module and startup and teardown
of the entity delivering the kernel extension. I should also be able to run
several instances of the kernel extension (A/B testing, stable and unstable
testing).
#### Story 2: Multiple DriverContainers
As an administrator I want to enable a hardware stack to enable a specific
functionality with several kernel modules. These kernel modules may have an
order and the procedure enabling them needs to expose a way to model this
dependency. It may even be the case that a specific module needs kernel modules
loaded that are already installed on the node. The intended clusters are either
behind a proxy or completely disconnected and hence the anticipated procedure
has to work in these environments too.
#### Story 3: Day-2 DriverContainer Accelerator
As a vendor of a hardware accelerator I want a procedure to enable the
accelerator on OpenShift no matter which underlying OS is running on the nodes.
The life-cycle of the drivers should be fully managed with the ability to
upgrade and downgrade drivers for the accelerator.
It should support all kernel versions (major, minor, z-stream) and handle driver
errors gracefully. Uninstalling the drivers should not leave any trace of the
previous installation (keep the node as clean as possible). The driver will not
be upstreamed to the mainline kernel, which means it will always be out-of-tree.
#### Story 4: Multiple Driver Containers with Artifacts
As an administrator I want to enable a specific vendor stack. I need to build
kernel modules that are dependent on each other during the build and at the time
of loading. Specific build artifacts need to be available during the build but
not for loading. These artifacts can be available in another DriverContainer or
extracted during runtime.
### Implementation Details/Notes/Constraints
DriverContainers need at least the following packages:
- kernel-devel-$(uname -r)
- kernel-headers-$(uname -r)
- kernel-core-$(uname -r)
kernel-core is needed for running `depmod` inside the DriverContainer to resolve
all symbols and load the dependent modules. The DriverContainer does not
install the kmods on the host, so the modules already installed on
the host are missing in the container.
These packages can be installed from different sources; SRO can currently handle
all three:
- from base repository
- from EUS repository
- from machine-os-content (missing kernel-core)
[BuildConfigs do not support volume mounts](https://issues.redhat.com/browse/DEVEXP-17),
so they cannot be used for artifacts on a hostPath. If we have all artifacts stored in
a container image, a BuildConfig can be leveraged for building. Where no host
build artifacts are needed, a BuildConfig is the first choice because of the
functionality it provides (triggers, source, output, etc).
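A sketch of such a BuildConfig with both triggers, assuming a Docker-strategy
build of the vendor repository on top of a driver-toolkit ImageStream (names,
the base ImageStreamTag, and the templated tag are placeholders):
```yaml
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: <vendor>-driver-container
  namespace: <vendor>-<hardware>
spec:
  source:
    git:
      uri: "https://gitlab.com/<vendor>/driver.git"
      ref: "master"
  strategy:
    dockerStrategy:
      from:
        kind: ImageStreamTag
        name: "driver-toolkit:latest"        # placeholder base image stream
      buildArgs:
        - name: KVER
          value: "{{.KernelVersion}}"        # rendered by SRO at reconcile time
  output:
    to:
      kind: ImageStreamTag
      name: "<vendor>-driver-container:{{.KernelVersion}}"
  triggers:
    - type: ConfigChange                     # rebuild when SRO updates the BuildConfig
    - type: ImageChange                      # rebuild when the base image changes
      imageChange: {}
```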
One of the most important points: we have no issues with SELinux, because we are
interacting with the same SELinux context (container_file_t). Accessing host
devices, libraries, or binaries from a container breaks the confinement,
as we would have to allow containers to access host labels.
For DriverContainers to build the kernel modules we need entitlements first. The
e2e story is described here: [https://bit.ly/2XjZq5D](https://bit.ly/2XjZq5D)
We need to provide an interface for vendors to hook in their business logic for
building the drivers.
For prebuilt containers pullable from a vendor's repository we're going to use an
ImageContentSourcePolicy; currently only pulling by digest works, we cannot pull
by tag. We need to accommodate this in the naming scheme of a
DriverContainer.
#### The driver-toolkit container
The handling of repositories and extracting RPMs from the machine-os-content
can be a complex task. To make this easier, SRO builds a driver-toolkit base container
for easier out-of-tree driver building. This base container has the right kernel versions
that are needed for a specific OpenShift release.
This container should preferably be built by ART and pushed to the
registry.redhat.io/openshift4 registry. The build should be done on all z-stream releases
and nightlies to cover customers' pre-release testing and to catch any changes of
the kernel between releases.
Customers that want to build out-of-tree drivers would not need entitlements per se and would
have all needed RPMs at hand. This container should be externally accessible so it can be used
in customer CI/CD pipelines that do not need a full cluster installation.
This base container could also be used as a prototyping and testing tool for developers
in the pre-release phase. Drivers tested against a pre-release would make sure that when an
OpenShift version goes GA the customer has already tested several versions before that date.
There could be several z-stream releases with the very same kernel, but there wouldn't be a
single z-stream with different kernels.
Currently the driver-toolkit built by ART can only be tagged with the OpenShift "full" version (x.y.z),
meaning it is currently not easy to relate a specific driver-toolkit:vX.Y.Z to a specific node,
since nodes can be on different versions in the cluster depending on the state of the MCPs.
For building the driver-toolkit on the cluster as a fallback solution, if we do not have a recent
build, the other problem is that we cannot easily relate the nodes to the correct machine-os-content.
The proposal is to create an annotation on the release payload pointing to the machine-os-content.
The machine-os-content already has the kernel version annotation.
```text
release-payload:4.7.2 -> moc:8.3 -> kernel-4.20
release-payload:4.7.0 -> moc:8.2 -> kernel-4.19
mcp0: node -> kernel-4.20
mcp1: node -> kernel-4.19
```
The *primary key* of those two datasets would be the kernel. This would also solve the issue of
finding the right machine-os-content for a specific release. The extensions are used to build on cluster as a
fallback solution if the driver-toolkit container is not available, e.g. for an early nightly build.
Otherwise one would need to do the following (the `oc adm ...` command literally
pulls the container, mounts it and reads the manifest to print out the osImage URL):
```bash
$ CNT=`buildah from registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-03-17-153948`
$ MNT=`buildah mount $CNT`
$ yq '.data.osImageURL' $MNT/release-manifests/0000_80_machine-config-operator_05_osimageurl.yaml
"registry.ci.openshift.org/ocp/4.8-2021-0 ... "
```
A simple inspect of the image should work in this case (https://issues.redhat.com/browse/ART-2763),
see also `Can we update os-release to reflect the "full" version of OpenShift?` on coreos-devel.
```bash
$ skopeo inspect docker://registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-03-17-153948 | grep os-image-url
"io.openshift.release.os-image-url": "registry.ci.openshift.org/ocp/4.8-2021-03-17-153948@sha256:cb00332da7d29f98990058cbe4376615905cf05857ff81c0cb408ca6365b4196"
```
From here we can use the annotations without pulling the image:
```bash
$ skopeo inspect docker://registry.ci.openshift.org/ocp/4.8-2021-03-17-153948@sha256:cb00332da7d29f98990058cbe4376615905cf05857ff81c0cb408ca6365b4196 | grep kernel
"com.coreos.os-extensions": "kernel-rt;kernel-devel;qemu-kiwi;usbguard",
"com.coreos.rpm.kernel": "4.18.0-240.15.1.el8_3.x86_64",
"com.coreos.rpm.kernel-rt-core": "4.18.0-240.15.1.rt7.69.el8_3.x86_64",
```
### Risks and Mitigations
What are the risks of this proposal and how do we mitigate. Think broadly. For
example, consider both security and how this will impact the larger OKD
ecosystem.
How will security be reviewed and by whom? How will UX be reviewed and by whom?
Consider including folks that also work outside your immediate sub-project.
## Design Details
### Test Plan
**Note:** *Section not required until targeted at a release.*
Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.
All code is expected to have adequate tests (eventually with coverage
expectations).
### Upgrade / Downgrade Strategy
#### Red Hat Kernel ABI
Red Hat kernels guarantee a stable kernel application binary interface (kABI). If modules
are only using whitelisted symbols then they can leverage weak-updates in the
case of an upgrade. A kmod that is built on 8.0 can easily be loaded on all
subsequent y-stream releases. The weak-update is nothing more than a symlink in
`/lib/modules/..../weak-updates` for the driver.
We will extend KVC to check if an out-of-tree driver is able to use weak-updates and
leverage the weak-modules script (part of RHEL) to create the correct symlinks.
If the driver is not kABI compatible, SRO will create an alert on the console for
awareness.
On some rare occasions the kABI can change (CVEs, bugs, etc), hence as a preflight
check SRO is going to compare the current kABI with the kABI coming with the update.
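For the kABI-compatible case, a hedged sketch of the weak-update step inside a
DriverContainer (the module path is a placeholder, and the exact weak-modules
invocation would need to be validated against the RHEL version in the image):
```bash
# Freshly built out-of-tree module for the kernel it was compiled against.
KMOD="/lib/modules/$(uname -r)/extra/<kmod>.ko"

# weak-modules (shipped with RHEL kmod packaging) reads module paths from
# stdin and creates the weak-updates symlinks for compatible installed kernels.
echo "${KMOD}" | /usr/sbin/weak-modules --add-modules
```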
#### Updates in OpenShift
Updates in OpenShift can happen in two ways:
1. Only the payload (operators and needed parts on top of the OS) is updated
2. The payload and the OS are simultaneously updated.
The first case is "easy" the new version of the operator will reconcile the
expected state and verify that all parts of the special resource stack are
working and then "do" nothing.
For the second case, the new operator will reconcile the expected state and see
that there is a mismatch regarding the kernel version of the DriverContainer and
the updated Node.
It will try to pull the new image with the correct kernel version.
If the correct DriverContainer cannot be pulled, will update the BuildConfig
with the right kernel version and OpenShift will reinitiate the build since we
have the ConfigChange trigger as described above.
Besides the ConfigChange trigger, we also added the ImageChange trigger, which
is important when the base image is updated due to CVEs or other bug fixes.
For this to happen automatically we are leveraging OpenShift ImageStreams; an
ImageStream is a collection of tags that is automatically updated with the
latest images. It is like a container repository that represents a virtual view
of related images.
To always stay up to date, another possibility would be to register a github/gitlab
webhook so that every time the DriverContainer code changes a new container is
built.
One just has to make sure that the webhook is triggered on a specific release
branch; it is not advisable to monitor a fast-moving branch (e.g. master) that
would trigger frequent builds.
#### OpenShift Rolling Updates
The OpenShift update can be split into two major parts. The first one being the
upgrade of all CVO managed operators (with OLM) and the second part the update
of the operating system upgrade.
An operator that is deploying a ClusterOperator object can signal CVO if it is
ready to be upgraded or not (Upgradeable=False). MCO and other operator will
check several cluster constraints and signal upgradebility.
SRO will use this very first phase to execute a preflight check to see if the
new kernel that is coming with the operating system is compatible with the
currently deployed out-of-tree drivers. SRO will set Upgradeable=False and only
set it to True if the special resources that it manages can either be pulled
(meaning a customer/partner CI/CD pipeline has already created them) or be
built successfully with the new kernel. This way one can guard the
special resource from being updated and prevent an upgrade to a non-working
kernel.
If the preflight checks are successful, SRO needs to take care of the operating
system upgrade. CVO will do the upgrade of MCO as the last step and create all
necessary manifests (e.g. an updated osimage ConfigMap with the new
URL pointing to the osImage) that are used by MCO to roll out the new OS.
By default there are two MachineConfigPools in OpenShift (master and worker);
when MCO starts updating the OS it will do it one machine at a time in each
MachineConfigPool. To prevent MCO from updating a specific MCP we can set it to
`paused: true`.
One way to handle this rolling update is to wait for all machines to be updated
in an MCP and then roll out the new drivers, but this would mean service or
application downtime, and depending on the number of nodes it could be very long.
SRO's goal here is to keep the downtime to a bare minimum.
To handle this situation SRO will create one DriverContainer DaemonSet for each
triplet of cluster, OS and kernel version. Supposing we have an MCP with 5
machines, and X is the current version and Y the version to be upgraded to, we
would have the following picture (N is the node):
```bash
Node: N:0 N:1 N:2 N:3 N:4
OS Version: V:X V:X V:X V:X V:X
DaemonSet: D:X D:X D:X D:X D:X
```
MCO will do a rolling update, pick the first node and apply version Y to it;
the action item for SRO is to now create a new DaemonSet with the new version Y
and provide the new drivers to the node. Since we ran the preflight check
in the first phase, SRO knows that it will work and creates the new DaemonSet.
```bash
Node: N:0 N:1 N:2 N:3 N:4
OS Version: V:Y V:X V:X V:X V:X
DaemonSet: D:Y D:X D:X D:X D:X
```
The DaemonSet will have a nodeSelector targeting the different kernel versions,
meaning a DaemonSet built for version X will only run on nodes with compatible
kernel X and a DaemonSet built for version Y will only run on nodes with compatible
kernel Y. This has the effect that the new DaemonSet (Y) will automatically scale
up to the new nodes updated to the new version Y by MCO, and the DaemonSet (X)
will automatically scale down.
```bash
Node: N:0 N:1 N:2 N:3 N:4
OS Version: V:Y V:Y V:Y V:Y V:X
DaemonSet: D:Y D:Y D:Y D:Y D:X
```
MCO will do the rolling update of all the nodes and the DaemonSets will scale
up or down automatically, making sure that a working version of the
out-of-tree driver is running on all nodes at any time.
This way SRO can support multiple version skews of OpenShift and handle several
MCPs running different versions. Depending on the cluster constraints SRO
will also delete obsolete, no longer "supported" DriverContainer DaemonSet versions.
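A sketch of the per-kernel-version selection, assuming the NFD kernel-version
label is used as the nodeSelector (DaemonSet name and image are placeholders):
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: <vendor>-driver-container-{{.KernelVersion}}
spec:
  selector:
    matchLabels:
      app: <vendor>-driver-container-{{.KernelVersion}}
  template:
    metadata:
      labels:
        app: <vendor>-driver-container-{{.KernelVersion}}
    spec:
      nodeSelector:
        # NFD exposes the running kernel version as a node label, so the
        # DaemonSet built for kernel X only lands on nodes running kernel X.
        feature.node.kubernetes.io/kernel-version.full: "{{.KernelVersion}}"
      containers:
        - name: driver-container
          image: "<vendor>-{{.KernelVersion}}:latest"   # resolved via the internal registry/ImageStream
```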
#### Special Resource Driver Downgrade
Having a look at the example CRs above we can see that one can provide a driver
version for a specific hardware aka DriverContainer. SRO will take care of updating
the BuildConfig and DriverContainer manifests. Tainting the node with
**specialresource.openshift.io/downgrade=true:NoExecute** will evict all running
Pods and the DriverContainer can be restarted. When the DriverContainer is again
up and running, the node can be un-tainted to allow workloads to be scheduled on
the node again.
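For illustration, the taint and un-taint could be applied with oc (the node name
is a placeholder):
```bash
# Taint the node so running workloads are evicted before the driver downgrade.
oc adm taint nodes <node-name> specialresource.openshift.io/downgrade=true:NoExecute

# Remove the taint once the downgraded DriverContainer is up and running again.
oc adm taint nodes <node-name> specialresource.openshift.io/downgrade=true:NoExecute-
```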
#### Update proactive DriverContainers
A preferable workflow for updates could also be to be proactive on updates. When
OpenShift is updated we would need a mechanism (notification, hook for updates)
to provide the kernel version of the next update before attempting the
upgrade and rebooting the nodes (needinfo installer/CVO team) .
This way DriverContainers are prebuilt and potential problems can be examined
before the update completes (e.g. no drivers for newer kernels, build errors,
etc).
Another major point is knowing the underlying RHEL version with major and minor
number. Many drivers have dependencies on RHEL8.0 or RHEL8.1, etc. Currently
there is no easy way to find out if RHCOS is based on RHEL8.0 or RHEL8.1
(OpenShift 4.3 e.g. changes from 8.0 to 8.1 depending on the z-stream).
#### Exception Handling
If there is no prebuilt DriverContainer and no source git repository is provided
to build the DriverContainer, the current behaviour is to wait until one of these
prerequisites is fulfilled: either a DriverContainer is pushed to a registry
known to the cluster, or a new updated CR is created. The current status is exposed in the
status field of the special resource.
To prevent such a state, the user/administrator should know the kernel version
upfront before an update happens. We need an obvious way to expose the
kernel version.
Even with the kernel version exposed it is hard to know if an update will break
the cluster. There are several constraints of the drivers and how they tie to a
kernel version.
In the simplest case there is one single source of drivers that can be compiled on all major RHEL
versions. Here it does not matter which kernel version we are running; we
can assume that the drivers work for all 3.xx.yy and 4.xx.yyy kernels.
One could also have drivers that are only dependent on the major RHEL version. We
would need to consider "only" upgrades from one major version to the other; here drivers
are sensitive to going from one major kernel version to the other.
Another case is where drivers are also sensitive to minor version changes, which
means there are driver changes for any kernel version.
#### DCI - Distributed CI Environment (RHEL)
### Version Skew Strategy
Some use-case scenarios rely on NFD labels, which are used as node
selectors for deploying the DriverContainers. NFD labels are not changed during
updates. The specific label is an input parameter for the CR of a hardware type.
NFD labels are integral parts of the node; if a label is not discovered then the
hardware is not available, and hence the node is not intended to be used as a deployment
target for DriverContainers.