# NOTE THIS MOVED TO https://hackmd.io/C-CnPdohTbOEigkvRKgoGA

---
title: OpenShift CoreOS Layering
authors:
  - "@cgwalters"
  - "@darkmuggle"
reviewers:
  - "@mrunalp"
approvers:
  - "@sinnykumari"
  - "@mrunalp"
creation-date: 2021-10-19
last-updated: 2021-10-19
tracking-link:
  - https://issues.redhat.com/browse/GRPA-4059
---

# OpenShift Layered CoreOS (PROVISIONAL)

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

**NOTE: Nothing in this proposal should be viewed as final. It is highly likely that details will change. It is quite possible that larger architectural changes will be made as well.**

Change CoreOS to be a "base image" that can be used in layered container builds and then booted. This will allow custom 3rd party agents delivered via RPMs to be installed in a container build. The MCO will roll out and monitor these custom builds the same way it does the "pristine" CoreOS image today.

This is the OpenShift extension of [ostree native containers](https://fedoraproject.org/wiki/Changes/OstreeNativeContainer), also known as [CoreOS layering](https://github.com/coreos/enhancements/pull/7).

<!--
A key feature of OpenShift 4 is that the cluster manages the operating system too. Today there is a hybrid management/update mechanism through rpm-ostree and the Machine Config Daemon; see [OSUpgrades.md](https://github.com/openshift/machine-config-operator/blob/master/docs/OSUpgrades.md). Add or replace the `machine-os-content` image to use the new [OSTree native container format](https://fedoraproject.org/wiki/Changes/OstreeNativeContainer).
-->

## Motivation

1. We want to support more customization, such as 3rd party security agents (often in RPM format).
2. It should be possible for users to fully control the configuration of the OS in the same way they add content.
3. It should be much easier to roll out a hotfix (kernel or userspace) by pushing a derived container and having the MCO use it.
4. We want transactionality, which is not possible with the existing hybrid MCD/rpm-ostree managed model:
   - A significant amount of mutation happens per node
   - [Filesystem modifications are not transactional](https://github.com/openshift/machine-config-operator/issues/1190)
   - Related to the above, `rpm-ostree rollback` does not do the right thing, and it is also hard to roll back at the cluster level
   - There is no way to validate a change without rebooting a node (and booting a new test node in the new configuration is not obvious or easy)

### Goals

- [ ] Administrators can add custom code alongside configuration via a familiar build system
- [ ] The output is a container image that can be pushed to a registry and processed by security scanners
- [ ] Transactional configuration changes
- [ ] Avoid breaking the existing MachineConfig workflow (including extensions)
- [ ] Avoid overwriting existing custom modifications (such as files managed by other operators) during upgrades

### Non-Goals

- While the base CoreOS layer/ostree-container functionality will be accessible outside of OpenShift, this enhancement does not cover or propose any in-cluster functionality for generating or using images outside of the OpenShift node use case.
- This proposal does not cover generating updated "bootimages"; see https://github.com/openshift/enhancements/pull/201
- This proposal does not change the existing workflow for RHEL worker nodes.

## Proposal

**NOTE: Nothing in this proposal should be viewed as final. It is highly likely that details will change. It is quite possible that larger architectural changes will be made as well.**

1. The `machine-os-content` shipped as part of the release payload will change format to the new "native ostree-container" format (and become runnable as a container directly for testing). For more information, see [ostree-rs-ext](https://github.com/ostreedev/ostree-rs-ext/) and [CoreOS layering](https://github.com/coreos/enhancements/pull/7). Internally, this will be an `imagestream` object named `openshift-machine-config-operator/coreos`, owned by the MCO.
2. Each machineconfig pool will have an associated `BuildConfig` object in the spec. The default install will have `mco-controlplane` and `mco-worker` objects in the `openshift-machine-config-operator` namespace. This is where most `MachineConfig` changes will be handled.
3. Each machineconfig pool will also support a `custom-coreos` `BuildConfig` object and imagestream. This build must use the `mco-coreos` imagestream as a base. Its result will be rolled out by the MCO to nodes.
4. Each machineconfig pool will also support a `custom-external-coreos` imagestream for pulling externally built images (PROVISIONAL).
5. The MCD continues to perform drains and reboots, but writes much less configuration per node.
6. The Machine Config Server (MCS) will only serve a "bootstrap" Ignition configuration (pull secret, network configuration) sufficient for the node to pull the target container image.

For clusters without any custom MachineConfig at all, the MCO will deploy the result of the `mco-coreos` build.

### User Stories

#### What works now continues to work

An OpenShift administrator at example.corp is happily using OpenShift 4 (with RHEL CoreOS) in several AWS clusters today, and has only a small custom MachineConfig object to tweak host-level auditing. They do not plan to use any complex derived builds, and just expect that upgrading their existing cluster continues to work and respect their small audit configuration change.
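As a concrete illustration, the kind of small day-2 `MachineConfig` that must continue to work unchanged looks roughly like the sketch below; the file path and audit rule are hypothetical, not taken from this proposal:

```yaml
# Hypothetical example of a small host-level auditing tweak carried as a
# day-2 MachineConfig today; it must keep working under the new model.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-custom-audit
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/audit/rules.d/99-custom.rules
          mode: 0644
          contents:
            # URL-encoded "-w /etc/kubernetes/ -p wa -k kubeconfig"
            source: data:,-w%20/etc/kubernetes/%20-p%20wa%20-k%20kubeconfig
```

Under this proposal, the contents of such an object would land in the image build for the pool rather than being written per node by the MCD.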
#### Adding a 3rd party security scanner/IDS

example.bank's security team requires a 3rd party security agent to be installed on the bare metal machines in their datacenter. The 3rd party agent ships as an RPM today and requires its own custom configuration. While the 3rd party vendor has support for execution as a privileged daemonset on their roadmap, it is not going to appear soon.

After initial cluster provisioning is complete, the administrators at example.bank supply a `BuildConfig` object named `custom-coreos-$pool-build` with an [inline Dockerfile](https://docs.openshift.com/container-platform/4.8/cicd/builds/creating-build-inputs.html#builds-dockerfile-source_creating-build-inputs) that adds a repo file to `/etc/yum.repos.d/agentvendor.repo` and invokes `RUN yum -y install some-3rdparty-security-agent` (here `$pool` = `worker`).

The MCO notices the build object creation and starts an initial build, which gets successfully pushed to the `custom-coreos-$pool-imagestream` imagestream. This gets added to both the control plane (master) and worker pools, and is rolled out in the same way the MCO performs configuration and OS updates today.

A few weeks later, after a cluster-level upgrade has started, a new base RHEL CoreOS image is updated in the `coreos` imagestream by the MCO. This triggers a rebuild of both `buildconfig/mco-coreos-controlplane` and `buildconfig/mco-coreos-worker`, which succeed. This in turn triggers a rebuild of the `buildconfig/custom-coreos-$pool-build` builds.

A month after that, the administrator wants to make a configuration change, and creates a `machineconfig` object targeting the `worker` pool. This triggers a new image build. But the 3rd party yum repository is down, and the image build fails. The operations team gets an alert and resolves the repository connectivity issue. They manually restart the build via `oc -n openshift-machine-config-operator start-build custom-coreos-worker`, which succeeds.

#### Kernel hotfix

example.corp runs OCP on aarch64 on bare metal. An important regression is found that only affects the aarch64 architecture on some bare metal platforms. While a fix is queued for a RHEL 8.x z-stream, there is also risk in fast-tracking the fix to *all* OCP platforms. Because this fix is important to example.corp, a hotfix is provided via a pre-release `kernel.rpm`.

The OCP admins at example.corp get a copy of this hotfix RPM into their internal data store, craft a `Dockerfile` that does `yum -y upgrade https://example.corp/mirror/kernel-5.x.y*.rpm`, and create the `buildconfig/custom-coreos-worker` object in their cluster. The MCO builds a derived image and rolls it out. (Note: this flow would likely be explained as a customer portal document, etc.)

Later, a fixed kernel with a newer version is released in the main OCP channels. As part of `oc adm upgrade`, the `yum -y upgrade` invocation above detects that a newer kernel is already in the base image, and returns an error. The example.corp administrators get an alert, and simply run `oc -n openshift-machine-config-operator delete buildconfig/custom-coreos-worker`. The MCO returns to deploying the `mco-coreos` image.

#### Externally built image

As we move towards having users manage many clusters (10, 100 or more), it will make sense to support building a node image centrally. This will allow submitting the image to a security scanner, or review by a security team, before deploying it to clusters.

Acme Corp has 300 clusters distributed across their manufacturing centers. They want to centralize their build system in their main data center, and just distribute those images to edge clusters. They provide a `custom-coreos-imagestream` object at installation time, and their node CoreOS image is deployed during the installation of each cluster without a build operation.

(Note some unanswered questions below.)
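To make the stories above more concrete, here is a rough sketch of what a user-supplied `custom-coreos` build for the `worker` pool might look like using the standard Build API; the object name, trigger wiring, vendor repo URL, and package name are all illustrative, and the exact shape is provisional:

```yaml
# Provisional sketch of a user-owned build layered on the MCO-owned
# mco-coreos imagestream; names and triggers are illustrative only.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: custom-coreos-worker-build
  namespace: openshift-machine-config-operator
spec:
  source:
    type: Dockerfile
    # Inline Dockerfile; the FROM below is overridden by dockerStrategy.from.
    dockerfile: |
      FROM mco-coreos
      RUN curl -o /etc/yum.repos.d/agentvendor.repo https://example.bank/agentvendor.repo && \
          yum -y install some-3rdparty-security-agent
  strategy:
    type: Docker
    dockerStrategy:
      from:
        kind: ImageStreamTag
        name: mco-coreos:latest
  output:
    to:
      kind: ImageStreamTag
      name: custom-coreos-worker-imagestream:latest
  triggers:
    # Rebuild automatically when the MCO pushes a new mco-coreos base.
    - type: ImageChange
      imageChange: {}
```

The MCO would watch the output imagestream and roll the resulting image out to the pool in the same way it rolls out OS updates today.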
### Implementation details

#### Preserving `MachineConfig`

We cannot just drop `MachineConfig` as an interface to node configuration. Hence, the MCO will be responsible for starting new builds on upgrades or when new machine config content is rendered.

For most configuration, instead of having the MCD write files on each node, the content will be added into the image build run on the cluster. To be more specific, most content from the Ignition `systemd/units` and `storage/files` sections (in general, files written into `/etc`) will instead be injected into an internally generated `Dockerfile` (or equivalent) that performs an effect similar to the example from the [CoreOS layering enhancement](https://github.com/coreos/enhancements/blob/main/os/coreos-layering.md#butane-as-a-declarative-input-format-for-layering):

```dockerfile=
FROM <coreos>
# This is needed
ADD mco-rendered-config.json /etc/mco-rendered-config.json
ADD ignition.json /tmp/ignition.json
RUN ignition-liveapply /tmp/ignition.json && rm -f /tmp/ignition.json
```

This build process will be tracked via an `mco-coreos-build` `BuildConfig` object which will be monitored by the operator. The output of this build process will be pushed to `imagestream/mco-coreos`, which should be used by further build processes.

#### Handling booting old nodes

We can't easily switch the format of the oscontainer, because older clusters may have older bootimages with an older `rpm-ostree` that won't understand the new container format; firstboot upgrades would just fail. Options:

- Double reboot; but we'd still need to ship the old image format in addition to the new one, and really the only sane way to ship both is to generate the old from the new. We could do that in-cluster, per node, or pre-generated as part of the payload.
- Try to run rpm-ostree itself as a container.
- Force bootimage updates (can't be a 100% solution due to UPI).

NOTE: Verify that we're doing node scaling post-upgrade in some e2e tests.

#### Preserving old MCD behaviour for RHEL nodes

RHEL 8 worker nodes in-cluster will require us to continue supporting the existing file/unit writing as well as the provisioning (`once-from`) workflows. See also [openshift-ansible and MCO](https://github.com/openshift/machine-config-operator/issues/1592).

#### Handling extensions

We need to preserve support for [extensions](https://github.com/openshift/enhancements/blob/master/enhancements/rhcos/extensions.md). For example, `kernel-rt` support is key to many OpenShift use cases.

Extensions move to a `machine-os-content-extensions` container that has RPMs. Concretely, switching to `kernel-rt` would look like e.g.:

```
FROM machine-os-extensions as extensions

FROM <machine-os-content>
WORKDIR /root
COPY --from=extensions /srv/extensions/*.rpm .
RUN rpm-ostree switch-kernel ./kernel-rt*.rpm
```

The RHCOS pipeline will produce the new `machine-os-content-extensions` and ensure that its content is tested with the main `machine-os-content`.

#### Kernel Arguments

Not currently in scope for CoreOS derivation. See also https://github.com/ostreedev/ostree/issues/479

For now, updating kernel arguments will continue to happen via the MCD on each node, executing `rpm-ostree kargs` as it does today.
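For reference, kernel arguments would keep flowing through the existing `MachineConfig` field and be applied per node by the MCD; a minimal sketch (the argument shown is only an example):

```yaml
# Kernel arguments stay on the existing MachineConfig path for now;
# the MCD applies them on each node (e.g. via `rpm-ostree kargs`).
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 05-worker-kernelarg-example
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - mitigations=auto   # illustrative argument only
```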
#### Ignition

Ignition will continue to handle the `disks` and `filesystems` sections - for example, LUKS will continue to be applied as it is today.

Further, it is likely that we will need to ship a targeted subset of the configuration via Ignition too - for example, the pull secret will be necessary to pull the build containers.

##### Per machine state, the pointer config

See [MCO issue 1720, "machine-specific machineconfigs"](https://github.com/openshift/machine-config-operator/issues/1720). We need to support per-machine/per-node state like static IP addresses and hostname.

##### 3 Ignition "levels"

- Pointer configuration: this stays unchanged
- Firstboot Ignition: contains the bits needed to perform the switch to the custom image
- Everything else: this all ends up in the `mco-coreos` container image, e.g. the `kubelet.service` systemd unit

#### Drain and reboot

The MCD will continue to perform drains and reboots.

#### Single Node OpenShift

Clearly this mechanism needs to work on single node too. It would be a bit silly to build a container image and push it to a registry on that node, only to pull it back to the host. But it would (should) work.

#### Reboots and live apply

The MCO has invested in performing some types of updates without rebooting, and we will need to retain that functionality. Today, `rpm-ostree` does have `apply-live`. One possibility is that if just e.g. the pull secret changes, the MCO still builds a new image with the change, but compares the node state (current, new) and executes a targeted command like `rpm-ostree apply-live --files /etc/kubernetes/pull-secret.json` that applies just that change live. Or, the MCD might handle live changes on its own, writing files instead to e.g. `/run/kubernetes/pull-secret.json` and telling the kubelet to switch to that.

Today the MCO supports [live updating](https://github.com/openshift/machine-config-operator/pull/2398) the [node certificate](https://docs.openshift.com/container-platform/4.9/security/certificate_types_descriptions/node-certificates.html).

#### Node firstboot/bootstrap

Today the MCO splits node bootstrapping into two locations: Ignition (which provisions all Ignition subfields of a MachineConfig) and `machine-config-daemon-firstboot.service`, which runs before the kubelet to provision the rest of the MachineConfig fields and reboots the node to complete provisioning.

We can't quite put *everything* configured via Ignition into our image build. At the least, we will need the pull secret (currently `/var/lib/kubelet/config.json`) in order to pull the image to the node at all. Further, we will also need things like the image stream for disconnected operation.

In our new model, Ignition will likely still have to perform the subsets of MachineConfig (e.g. disk partitioning) that we do not modify post-bootstrapping. It will also need to write certain credentials for the node to access relevant objects, such as the pull secret. The main focus of the served Ignition config, compared to today, will be setting up `machine-config-daemon-firstboot.service` to fetch and pivot to the layered image. The initial Ignition config we serve through the MCS will also contain all the files it wrote; the MCD firstboot then removes them, since we do not want to have any "manually written files".

We need to be mindful to preserve anything provided via the pointer config, because we need to support that for per-machine state. Alternatively, we could change the node firstboot join to use a pull secret that only allows pulling "base images" from inside the cluster.

Analyzing and splitting this "firstboot configuration" may turn out to be a nontrivial amount of work, particularly in corner cases. A mitigation here is to incrementally move over only the things we are *sure* can be done via the image build.
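As a rough illustration of how small the served configuration could become, here is a sketch expressed in Butane's `openshift` variant purely for readability; the `target-image` file is hypothetical (the actual mechanism for telling the firstboot service which image to pivot to is undecided), and the real config would be generated by the MCS:

```yaml
# Provisional sketch of the minimal "bootstrap" config served to a new node:
# just enough to pull and pivot to the layered image.
variant: openshift
version: 4.9.0
metadata:
  name: 00-worker-firstboot
  labels:
    machineconfiguration.openshift.io/role: worker
storage:
  files:
    # Pull secret, needed before the node can fetch any image at all.
    - path: /var/lib/kubelet/config.json
      contents:
        inline: '{"auths": {}}'   # placeholder; the real secret is injected by the MCS
    # Hypothetical hint consumed by machine-config-daemon-firstboot.service.
    - path: /etc/machine-config-daemon/target-image
      contents:
        inline: |
          image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/mco-coreos:latest
```

Per-machine pointer-config content (static IPs, hostname) would ride alongside this, as discussed above.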
##### Compatibility with openshift-ansible/windows containers

There are other things that pull Ignition:

- [openshift-ansible for workers](https://github.com/openshift/openshift-ansible/blob/c411571ae2a0b3518b4179cce09768bfc3cf50d5/roles/openshift_node/tasks/apply_machine_config.yml#L23)
- [openshift-ansible for bootstrap](https://github.com/openshift/openshift-ansible/blob/e3b38f9ffd8e954c0060ec6a62f141fbc6335354/roles/openshift_node/tasks/config.yml#L70) fetches the MCS
- [windows node for openshift](https://github.com/openshift/windows-machine-config-bootstrapper/blob/016f4c5f9bb814f47e142150da897b933cbff9f4/cmd/bootstrapper/initialize_kubelet.go#L33)

#### Intersection with https://github.com/openshift/enhancements/pull/201

In the future, we may also generate updated "bootimages" from the custom operating system container.

#### Intersection with https://github.com/openshift/os/issues/498

It would be very natural to split `machine-os-content` into e.g. `machine-coreos` and `machine-kubelet`, where the latter derives from the former.

#### Using RHEL packages - entitlements and bootstrapping

Today, installing OpenShift does not require RHEL entitlements - all that is necessary is a pull secret.

This CoreOS layering functionality will immediately raise the question of supporting `yum -y install $something` as part of a node, where `$something` is not part of our extensions that are available without entitlement.

For cluster-internal builds, it should work to do this "day 2" via [existing RHEL entitlement flows](https://docs.openshift.com/container-platform/4.9/cicd/builds/running-entitled-builds.html#builds-source-secrets-entitlements_running-entitled-builds). Another alternative will be providing an image built outside of the cluster.

It may be possible in the future to perform initial custom builds on the bootstrap node for "day 1" customized CoreOS flows, but that adds significant complexity around debugging failures. We suspect that most users who want this will be better served by out-of-cluster image builds.
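A sketch of what the entitled in-cluster flow could look like, following the documented entitled-builds pattern of exposing the entitlement keys as a build input secret; the secret name follows that documentation, while the BuildConfig name and package are placeholders:

```yaml
# Sketch: exposing RHEL entitlement certificates to a custom build via a
# build input secret, per the existing entitled-builds flow.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: custom-coreos-worker-build
  namespace: openshift-machine-config-operator
spec:
  source:
    type: Dockerfile
    secrets:
      - secret:
          name: etc-pki-entitlement   # created from the entitlement certs
        destinationDir: etc-pki-entitlement
    dockerfile: |
      FROM mco-coreos
      # Make the entitlement keys available to yum, then drop them from the image.
      COPY ./etc-pki-entitlement /etc/pki/entitlement
      RUN yum -y install some-entitled-package && \
          rm -rf /etc/pki/entitlement
  strategy:
    type: Docker
    dockerStrategy:
      from:
        kind: ImageStreamTag
        name: mco-coreos:latest
  output:
    to:
      kind: ImageStreamTag
      name: custom-coreos-worker-imagestream:latest
```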
### Risks and Mitigations

We're introducing a whole new level of customization for nodes, and because this functionality is new, we don't yet have significant experience with it. There are likely a number of potentially problematic "unknown unknowns".

To say this another way: until now we've mostly stuck to the model that user code should run in a container, keeping the host relatively small. This could be perceived as a major backtracking on that model. It also intersects heavily with things like [out of tree drivers](https://github.com/openshift/enhancements/pull/357). We will need some time to gain experience with what works, develop best practices, and build tooling and documentation.

It is likely that the initial version will be classified as "Tech Preview" from the OCP product perspective.

#### Supportability of two update mechanisms

If for some reason we cannot easily upgrade existing FCOS/RHCOS systems provisioned prior to the existence of this functionality, and hence need to support *two* ways to update CoreOS nodes, it will become an enormous burden. Relatedly, we would need to continue to support [openshift-ansible](https://github.com/openshift/openshift-ansible) for some time alongside the `once-from` functionality. See also [this issue](https://github.com/openshift/machine-config-operator/issues/1592).

#### Versioning of e.g. kubelet

We will need to ensure that we detect and handle the case where core components (e.g. the `kubelet` binary) come from the wrong place or are the wrong version.

#### Location of builds

Today, nodes are ideally isolated from each other. A compromised node can in theory only affect pods which land on that node. In particular, we want to avoid a compromised worker node being able to easily escalate to compromising the control plane.

#### Registry availability

If implemented in the obvious way, OS updates would fail if the cluster-internal registry is down. A strong mitigation is to use ostree's native ability to "stage" the update across all machines before starting any drain at all. However, we should probably still be careful to only stage the update on one node at a time (or `maxUnavailable`) in order to avoid "thundering herd" problems, particularly for the control plane with etcd.

Another mitigation here may be to support peer-to-peer upgrades, or to have the control plane host a "bootstrap registry" that just contains the pending OS update.

#### Manifest list support

We know we want heterogeneous clusters; right now that's not supported by the build and image stream APIs.

#### openshift-install bootstrap node process

A key question here is whether we need the OpenShift build API as part of the bootstrap node or not. One option is to do a `podman build` on the bootstrap node. Another possibility is that we initially use CoreOS layering only for worker nodes.

##### Single Node bootstrap in place

Today, [Single Node OpenShift](https://docs.openshift.com/container-platform/4.9/installing/installing_sno/install-sno-installing-sno.html) performs a "bootstrap in place" process that turns the bootstrap node into the combined control plane/worker node without requiring a separate (virtual/physical) machine. It may be that we need to support converting the built custom container image into a CoreOS metal image that would be directly writable to disk, to shave off an extra reboot.

## Design Details

### Open Questions

- Would we offer multiple base images, e.g. could users now choose RHEL 8.X "z-streams" versus RHEL 8.$latest?
- How will this work for a heterogeneous cluster?

#### Debugging custom layers (arbitrary images)

In this proposal so far, we support an arbitrary `BuildConfig`, which can do anything but would most likely be a `Dockerfile` build. Hence, we need to accept arbitrary images, but we will have the equivalent of `podman history` exposed to the cluster administrator and to us.

#### Exposing custom RPMs via butane (Ignition)

Right now we have extensions in MachineConfig; to support fully custom builds it might suffice to expose yum/rpm-md repos and an arbitrary set of packages to add. Note that Ignition is designed not to have distro-specific syntax. We'd need to either support RPM packages via Butane sugar, or think about a generic way to describe packages in the Ignition spec. This would be a custom container builder tool that drops the files from the Ignition config into a layer; it could also be used in the underlying CoreOS layering proposal.
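Purely to illustrate the open question, a hypothetical Butane extension for declaring packages might look like the following; no such `packages` section exists today in Butane or Ignition, and the field names are invented for this sketch:

```yaml
# Hypothetical sketch only: a distro-agnostic way to declare rpm-md repos and
# packages that would be lowered to a container build, not to Ignition.
variant: openshift
version: 4.9.0
metadata:
  name: 99-worker-packages
  labels:
    machineconfiguration.openshift.io/role: worker
# NOTE: this section does not exist in any released Butane spec.
packages:
  repos:
    - name: agentvendor
      baseurl: https://example.bank/repo
  install:
    - some-3rdparty-security-agent
```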
#### External images

This will need some design to make it work nicely to build images for a different target OCP version. The build cluster will need access to base images for multiple versions. Further, the MCO today dynamically templates some content based on the target platform, so the build process would need to support running the MCO's templating code to generate per-platform config at build time. Further, we have per-cluster data such as certificates. We may need to fall back to doing a minimal per-cluster build, effectively just supporting replacing the CoreOS image instead of replacing the `mco-base`.

### Test Plan

Attempting to convert as much of the default MachineConfig flow as possible to use this functionality will heavily exercise the code.

### Graduation Criteria

(TBD)

**Tech Preview**

**GA**

### Upgrade / Downgrade Strategy

See above - this is a large risk. Nontrivial work may need to land in the MCO to support transitioning nodes.

### Version Skew Strategy

Similar to above.

## Implementation History

There was a prior version of this proposal which was OpenShift-specific and called for a custom build strategy. Since then, the "CoreOS layering" effort has been initiated, and this proposal is now dedicated to the OpenShift-specific aspects of using this functionality, rather than also containing machinery to build custom images.

## Drawbacks

If we are successful, not many. If it turns out that e.g. upgrading existing RHCOS systems in place is difficult, that will be a problem.

## Alternatives

Continue as is - supporting both RHEL CoreOS and traditional RHEL (where it's more obvious how to make arbitrary changes, at the cost of upgrade reliability), for example.

<!--

---

# walters

Feedback/discussion:

- High level idea of generating an image and having the MCO apply it makes a lot of sense.
- I just don't want to lose the idea of having something like `rpm-ostree status` that *clearly* shows what the customer has layered versus our base image.
  - A: Agreed that we need to have something that shows the layers. Ideally, we would preserve a listing of the base layers.
- Big change to the MCO implementation - can we do this first *without* exposing lots of knobs to users too?
  - A: For the first pass, I'd envision that the rendered MCO configuration would build `machine-cluster-content` and apply that. Depending on how we choose to implement it, we could use the MCD to simply set the `machine-*-content` upstream image and use Zincati (attractive for Hyper Scale), or have the MCO trigger an upgrade when it detects a change to the image stream.
- Let's run through the example of "I know `runc` is fixed in 4.7.N and I want to cherry pick that". Another good example is kubelet builds in CI.
  - A: Since builds will be part of the cluster, it means that `runc` and others will be able to CI the change via Prow. In this case a tarball with the overrides would be provided via a binary input, i.e. `oc start-build machine-user-content/pool --from-file bits.tar`. The MCBS for the `machine-user-content` would know to pull the latest `machine-cluster-content` to apply the override. A cache layer would be pushed to the image stream and re-used for subsequent builds.
- *Requires solving* https://github.com/openshift/machine-config-operator/issues/1720 (Perhaps the end story here is that since we still support providing Ignition to nodes, any bits in filesystem/systemd units can become machine-specific config provided $however)
- Bootstrap flow needs design: presumably the bootstrap node builds the base image. Clearly we still need to provision the pull secret on the bootstrap node, so that would still run through Ignition etc.
  - A: Correct, the bootstrap workflow would change. The back-of-a-napkin design is that Ignition would handle the image pull, but in that case Ignition would only handle limited functions like disk and filesystem setup. Since the image is already constructed, the node would boot directly into its ready state, eliminating the multiple reboots.
- Ideally, we support something like fsverity/dm-verity in the future where e.g. we can enforce that all privileged OS content is actually *signed* (like iOS/Android). Should think a bit about how something should be signed here.
  - A: The security benefits are quite impressive in this world view. StackRox and other scanning software would be able to inspect the final contents, and companies would have the ability to inspect the output images to check for base-level compliance.

## miabbott notes

- OS Observability
  - Can we monitor the journal for SELinux denials, OOMs, core dumps?
  - Probably orthogonal to MachineContent/usr-os-content
- Diagnosability
  - "you've added this binary to the user-os-content and now it is dumping core"
- Linting
  - added SSH keys, but that is managed by something else
  - added a systemd unit, but it is misconfigured
  - trying to write a file to a non-writable location
  - etc.
- Intersection with RHEL for Edge/Image Builder
  - Image Builder blueprints seem very similar to the MachineContent spec
- Scalability
  - How will this be managed by an SRE team that is responsible for 1000s of clusters, i.e. IBM Cloud folks?

A: Thanks for highlighting the question of linting -- one of the complaints about the MCO is that a machine-config is somewhat dangerous. If the basic structure of the MCC is correct, then the MCO will pass through the operation to the entire cluster. Only on failure (and degradation of the pool) does someone get feedback, but they don't understand what went wrong. The choice of using a new build variant is deliberate because:
- it separates the build and application steps
- users can use standard tools to inspect the logs (`oc logs muc/pool-of-glory`) to see what went wrong
- users would be able to pull and inspect the resultant image

## eparis

- How do we update the ICSP without a reboot?
  - A: AFAIK, we will still need to do a reboot, unless we have the MCD apply it. In this world, the ICSP would be a secret ref that would be rendered by the MCD on the host.
  - walters: An entirely different approach would be using rpm-ostree's `apply-live` path to do rebootless updates.
- Is there value in separating the "machine-os-content" that comes from RHEL and what comes from OpenShift into their own layers?
  - A: The value is reducing scope from general to specific. Each layer is a reduction of the scope. The machine-os-content that comes from RHEL is understood to be a base OS that can work on ANY cluster, while the cluster content is specific to a node joining one cluster, and the machine-user-content can be scoped to a single pool or set of machines.
  - That isn't an answer to the question.
    Is there value in having separate layers from Red Hat, one with RHEL content and one with OpenShift binaries, like the kubelet and crio bits?
  - A: Oh. The OpenShift binaries would continue to ship as part of the `machine-os-content`. The `machine-cluster-content` would be composed from what the MCO renders today. The value then is a separation of the binaries and the configuration.
  - I get that separation. Wouldn't having a layer owned by RHCOS (RHEL bits) and a layer owned by node (kubelet/crio) make our teams work better together?
  - A: Now I feel like an idiot :) Perhaps. TBH, one thing that is not clear here is that having layers of content will help enable a CI chain for those tools and components (think E2E tests for runc or kubelet). So I do think there would be value, and the line can be somewhat arbitrary based on our needs. Realistically, we could have one `machine-os-content` per release and then add on the upgrades as separate layers.
  - Hey, that's how I usually feel!
- How does the kubelet configuration work?
  - A: The kubelet configuration would be embedded in the `machine-cluster-content` layer.
- Does the final set of layers need to be behind some set of auth? If I can get the final set of layers without any auth, how do I get a token to create a CSR to register as a node? (The token is available inside the ignition data today, and lots of people hate that, because ignition is not behind auth and can give access to create objects in the api.)
  - A: The final image would be pushed to an image stream in the cluster behind RBAC. The MCD would be responsible for writing the CSR as part of booting. The token would be written as part of Ignition. HOWEVER, the Machine Config Server may not be needed.
  - ok, no worse and no better than today, i think. gotta ponder just a bit :)
  - A: This is an area we definitely can improve on, for sure.

## jligon

- Can they bring containerd to the user content if they find someone else to support it?
  - A: Sure, in theory. We could put guard rails in place if we want to prevent some actions. We might separate the `machine-os-content` to be RHEL with overlays for RHEL for Edge and OpenShift.
- Is this all going to be built by COSA, or are we going to call out to the Image Builder service?
  - A: The Machine Content Build Strategy will have its own container for doing the build, since it's in-cluster. I'm sure we could use the same container elsewhere.
  - A2: COSA would not be used for in-cluster builds. The base `machine-os-content` as found in the release manifest will be used as the base layer and will be built as part of the regular build process.
- Will override content like the RT kernel have a forked/split image?
  - A: On upgrade/install, the cluster will build base images for each extension set as a `machine-os-content-*` image. The cluster settings will be layered on top of those base images, and then the user content, to form the final `machine-user-content` that is applied to the machines. Switching to the RT kernel will be as simple as changing the `rpm-ostree` backend.
- How much will the `machine-os-content-*` differ from the RHEL for Edge package set?
  - A: I would hope very little, other than the additions. TBH, I don't have much background on that. @cgwalters? Each `machine-os-content-*` will be based on the currently supported extensions, built on the base `machine-os-content`.

## kirsten

- How does the new machine-user-content differ from the current day-2 machine config changes that a user can make?
  (also noting there is potential here to streamline the UI, which got kind of messy via user-submitted machine configs)
  - A: Each machine-config change would trigger a build, and the MCO would monitor for the output. The build API would ensure that the machine-user-content is generated. The MCO/MCD would trigger the update by watching the `machine-user-content` image streams to know when the update is ready (or better, require the user to trigger the update?). Further, the MCO wouldn't have to loop. It could be configured to watch for image streams to trigger updates, OR it could be configured for the user to initiate the update.
- How inspectable would these images be? For example: "Oh wow, I messed up here; what in the world did I do?"
  - A: There would be two ways: the build API will show logs for all builds, including the `machine-cluster-content`. With another change that is being planned for the future, the rpm-ostree image will be replaced with an OCI container (probably a few cycles out), meaning that users will be able to do `oc exec -it -n machine-config-operator --image.....`
- This would get rid of the kubelet config & crio controllers, but would this simplify the main MCC and its many syncs? Feels like a yes?
  - A: I think it would. I would posit that we could stop exposing the MCC to end users and only allow MCCs from other operators.
- Overall, I think the idea holds a lot of promise in simplifying the MCO and its operations, hopefully resulting in a smoother UX, more dependable outcomes, and easier troubleshooting. Definitely will think on this more, but it feels like a good direction and potentially a lot cleaner than what we're doing today.

## Jerry

A lot of points have already been brought up, so I wanted to quickly go through some flows and check my understanding:

1. bootstrap flow
   - It sounds like from Colin's question we will have a bootstrap version of the MCBS that waits for the bootstrap MCO to finish rendering the initial master configs, after which it will build the required layers on top of the corresponding in-cluster MOC, and serve that to the master nodes. The master node ignition knows to pull the image and have the node perform an early pivot from the bootimage directly to the in-cluster MOC with all layers?
     - A: More or less, this would be the case.
   - In this flow, who would be responsible for applying e.g. FIPS (early boot configs)? Same flow as today? (in the initial boot ignition, not the pivoted image)
     - A: Same flow as today. Ignition would stage an OSTree (with kernel and kernel args preset) that enables FIPS. The MCBS won't include the CSR or any key material, so FIPS wouldn't matter until after the boot.
   - Does ignition also have the full ability to stage the full layered OS that is incoming? What does that look like?
     - A: Ignition would pivot into the full layered image. To Ignition and other observers, it would be one image. The MCBS will compose all the layers into a single image.
2. new node joining the cluster
   - Similar to above, basically either:
     - a secret contains the stub secret given to the MCS, the MCS serves a second stub with pointer ignition to pull the MCBS build, and ignition pulls the image and does its thing
     - no MCS, in which case the secret is constantly updated to contain the correct pointer ignition to pull the latest MCBS build
3. os config updates
   - Main question (like Eric mentioned) is for rebootless updates. Will the MCD in general have insight into the total change incoming?
     The MCD today has no way of knowing what changed from a controller perspective, and can only diff the current and desired configs. It sounds like the MCD won't have insight into most of the changes directly. (The key concern is cert rotation, as that is a constant thorn, but others also apply.)
     - A: As designed, the MCD shouldn't _care_. Certificates and secrets would remain the domain of the MCD to write on the node.
   - Thinking through edge cases (restoring system defaults, unit presets, etc.) seems to be fine (and easier) with the new flow.
4. upgrades
   - This seems like it would be the most seamless.
     - A: Correct. On upgrade, the cluster-content will be generated by the MCBS and then the machine-user-content will be composed from that layer. The MCD would simply apply the change, and the bulk upgrade operations like extensions or template upgrades will be composed in the MCBS. For upgrades, the MCD might only care about secrets/certificates and then applying the updated image with a reboot.
5. verification
   - The MCD today validates the state of the system. Will this change? (also see 3)
     - A: The MCD will only care about the few files it writes. We will need to do some design around validation, but most of the validation would be unneeded since the MCBS would compose the final image for most changes. The success or failure of the build stage would provide the information. I believe we could CI the final image by doing a mock boot to do basic smoke tests.
6. RHEL 8 nodes
   - The current RHEL 8 plan sounds like we are going to have the MCD manage the file/unit writing. What about in this new flow? Does the MCD have 2 modes of operation (current code to support RHEL 8 + new code to support the new MCBS)?
     - A: Out of scope. The MCD could stop caring about RHEL 8 vs RHCOS, other than knowing that on RHCOS it applies an image.
     - Q: I'm not sure I follow; who will be writing the "base" configs onto RHEL 8 nodes then? A new RHEL 8 operator? (also see point 8 on onceFrom)
7. windows nodes
   - If I understand correctly, windows nodes use their own version of the MCO. Will their workflow change?
     - A: Completely out of scope. The image-based workflow is RHCOS specific.
8. What would happen to the MCD onceFrom mode of operation?
9. What about Single Node OpenShift bootstrap-in-place?

## Sinny

This is a great proposal and has very nice ideas! Based on what I understood so far, I have some follow-up questions:

* machine-os-content-extensions: How does applying a day-2 extension look in this approach?
  - A: The cluster will auto-generate the base `machine-os-content-<extension>` to an image stream. The MCD will apply the ostree from the image stream with the appropriate tag, such as `machine-os-content-rt-kernel:pool` or `machine-user-content-rt-kernel:pool`.
* Does the MCBS in the end generate a single OSTree repo out of the different machine-*-content builds initiated? If not, does rpm-ostree already handle applying different unrelated image builds (OSTrees) as a single deployment?
  - A: The layers will be collapsed into a single repo. As far as the MCD is concerned, it will be tasked with applying a single repo.
* There are OpenShift operators like the NTO (Node Tuning Operator) that apply some MachineConfigs. In this approach, are they considered user content or cluster content?
  - A: All operators would be considered cluster content.
* Is the MCO responsible for requesting the build of machine-user-content and applying the image on the node?
  - A: No. The user would not write MCCs anymore.
    Rather, they would write a YAML configuration like a container build. The user would target the `machine-cluster-content` by a label such as `worker` or `master`.
* Do the MCBS builds for machine-*-content occur inside the cluster?
  - A: Except for the `machine-os-content` that comes from Red Hat, all other builds will be done inside the cluster.
  - If yes, on which node will the build run? Building the image on the fly would consume additional system resources.
    - A: Images would be built on build nodes (or infra nodes). The cost of building the image is mostly going to be in terms of space and storing the resultant images. CPU and memory costs will be shifted from individual nodes to a central location.
  - How will this look on other products like SNO, where they are trying to fit all OCP-consumed resources into 1 core?
    - A: The cost should be minimal since there would be fewer layers and the layers will be smaller (overlaying text files instead of RPMs).
  - If no, how will this build take place in disconnected or behind-proxy environments?
* As a lot of customers are time sensitive, what does the average time look like to build the different machine-*-content images?
  - A: Unknown. However, since the rendering of the MCO content, the building of the layers, and applying the layers are discrete steps, the work can be done before applying (pausing the pool). This should help with time-sensitive customers.
* How do we ensure that it will be a smooth transition for existing customers' clusters, including upgrade and node scale-up?
  - A: Node scale-up should be faster since new nodes will boot directly into their target state (no reboot).
  - A2: On upgrade, the MCD would remove the files it wrote, and then apply the new ostree.
* Maybe I am missing something, but how will the new approach solve single-node configuration without creating 1:1 node-to-pool configurations?
  - A: This does provide a path for getting rid of 1:1 pools to nodes. Pools would be for general configuration and the resultant `machine-cluster-content` will be for the general pool. When a user/admin needs to provide different content, they would define a new MCBS for that node. I think we would need a new mechanism for the MCD to select the appropriate image from a hierarchy (`machine-os-content:latest` < `machine-cluster-content:worker-latest` < `machine-user-content:worker-latest` < `machine-user-content:worker_<node>-latest`) -- I am not suggesting using arbitrary image tags; rather, I'm using image tags to demonstrate the idea.
* How does this approach improve node safety, i.e. ensure applied user or cluster configs are not buggy?
  - A: The user will be able to inspect the resultant configuration, see the build logs using the build API, and we could have a CI step. The MCBS approach won't solve for buggy content, but will provide a clear delineation of where the failure happens. Since only operators will use MCCs, rendering failures will happen up the stack, and users will be able to separate their changes from the cluster's by looking at the `machine-user-content` build.

## Derrick

* No plans to support yum/DNF repositories as input for user content? For users that want to layer packages that may have complex dependencies, it may be difficult for them to manage those packages and their updates manually. When it comes to RH content, I realize this may add complexity due to entitlements, but it may be possible to utilize some of the work that the Build team is doing to cover this use case.
  - A: Initially, there would be no plan for supporting yum/dnf repos, although we could easily do that.
    There are a couple of reasons I specifically ignored supporting repos to start: 1) RHCOS provides some content that is not available in RHEL; enabling arbitrary repos could put us down the path of having some core components replaced, such as crio, the kubelet, etc.; 2) entitlement problems.

# Notes from Aug 26 2021 Meeting

Attendees: cgwalters, jlebon, dcarr, sdodson, yuqizhang, mrunalp, gmarkley, jkyros, rphillips, skumari, miabbott, mrussell, mkrejci, aravindhp

[cgwalters] elevator pitch: imagine we deliver + boot the OS from a container image (phase 1). Imagine we can configure the OS like we build a container image (FROM quay.io/openshift/os) and additively put on layers (phase 2).

[yuqizhang] how do we execute the plan? in-place upgrade? only works for coreos; will we make it work for RHEL? will MCO be the unified interface for all OS variants? how will it interface with the SR-IOV operator, etc. that does not interact with MCO?

[cgwalters] initially just changing the implementation for CoreOS systems; would have to keep existing MCD support for non-CoreOS systems.

[skumari] what does the timeline look like for implementation? what does migration look like from existing MCs -> MCBS? can live updates happen w/out node drain?

[cgwalters] we want transactional updates between states, but want to avoid reboots for things. should be able to utilize the `rpm-ostree live` support. drain should not be needed for all live updates (depends on workloads). MCs could be translated to just files as a layer on top of the existing OS container image. kernel args are more difficult.

[dcarr] tweak to elevator pitch: clarifies the responsibility model for customers in terms of what they can modify on CoreOS systems. teams should not have to think about whether a change needs a reboot or not, but also have rollback/canary support. regardless of implementation, we need a better responsibility model for RHCOS.

[jlebon] unlikely that the kernel did not change between minor releases; would have to reboot. proposal covers non kernel/initramfs updates.

[jlebon] in-between phase: interface to customer doesn't change, but under the covers we are doing a migration/change to MCBS style

[sdodson] historically kernel updates every 3 z-streams

[skumari] if the image is being built in cluster, what kind of performance hits will we take? how will it affect the upgrade performance?

[cgwalters] adding layers to an existing OS image shouldn't be terrible for perf

[mrunalp] is there a way that users will be able to build a custom image locally and test it out, outside of the cluster? do we need additional tooling to support this?

[dcarr] telemetry to see that a user has changed the OS image would be useful/necessary. assumed that the build of the OS image would happen before the maintenance window?

[cgwalters] if the CVO pulls in an update that has a new RHCOS, who controls when the new OS image build happens? needs careful consideration.

[yuqizhang] would like to see the MCD be the interface to manage changes to RHCOS and RHEL nodes. fundamental problem?

[dcarr] MCs are opaque to the cluster. having them as an image layer allows for better introspection/better security model(?). k8s admins are familiar with Dockerfile-style builds; MCs are new and need to be learned. unclear messages to users about what can be edited. want to improve boundaries between teams so that we deal with "envelopes" (i.e. don't care about content). should be able to say "reboot with this envelope", "don't reboot with this envelope".
would consider deprecating the MC spec section that allows raw Ignition configs; make it internal only. other operators should be able to pass an image layer to the MCBS.

[mkrejci] what are the parameters? what are the timelines? what are the ambitions?

[dcarr] walk, crawl, run: delivery of the os image via container, apply a "Red Hat layer" of default configs...

[cgwalters] optimal to have k8s defaults as a separate layer. could have the MCD still writing out user content separately.

[mkrejci] what is unacceptable? support? time? resources? performance? what signals should we watch for?

[dcarr] reserve the time to consider that. operational safety of OCP is the single most important thing to the business; willing to devote lots of resources towards this. would like to see rollback of workers be safer. improving mean time to recovery. if this proposal doesn't improve operational safety, then we shouldn't do it. but we need an alternative to achieve that goal.

[mrunalp] user overlay could be built as an rpm to test out changes when building a custom OS image. would like to tackle stale-content-on-host types of issues (i.e. crio config). moved from crio.conf -> crio.d (https://bugzilla.redhat.com/show_bug.cgi?id=1995785)

[cgwalters] unacceptable outcome: if we have to support two ways to update/modify the OS (at least for RHCOS). need to handle node-specific configuration, which this proposal doesn't cover.

[dcarr] reputational risk that content is coming from MCs that we don't know about(?); moving to an image layer gives plausible deniability

[mrussel] can we rename MCBS to Machine Container Build Strategy?

[jkyros] I typed up some notes from my meeting with Colin on live-apply cases from my perspective; it was too long to just dump here: https://docs.google.com/document/d/16nPRS-9LKdrpaxgp6u7LX74AykHBWcmbqRK1W-K7rnw/edit#
-->