# NOTE THIS MOVED TO https://hackmd.io/C-CnPdohTbOEigkvRKgoGA

---
title: OpenShift CoreOS Layering
authors:
  - "@cgwalters"
  - "@darkmuggle"
reviewers:
  - "@mrunalp"
approvers:
  - "@sinnykumari"
  - "@mrunalp"
creation-date: 2021-10-19
last-updated: 2021-10-19
tracking-link:
  - https://issues.redhat.com/browse/GRPA-4059
---

# OpenShift Layered CoreOS (PROVISIONAL)

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

**NOTE: Nothing in this proposal should be viewed as final. It is highly likely that details will change. It is quite possible that larger architectural changes will be made as well.**

Change CoreOS to be a "base image" that can be used in layered container builds and then booted. This will allow custom 3rd party agents delivered via RPMs to be installed in a container build. The MCO will roll out and monitor these custom builds the same way it does the "pristine" CoreOS image today.

This is the OpenShift extension of [ostree native containers](https://fedoraproject.org/wiki/Changes/OstreeNativeContainer), also known as [CoreOS layering](https://github.com/coreos/enhancements/pull/7).

<!--
A key feature of OpenShift 4 is that the cluster manages the operating system too. Today there is a hybrid management/update mechanism through rpm-ostree and the Machine Config Daemon; see [OSUpgrades.md](https://github.com/openshift/machine-config-operator/blob/master/docs/OSUpgrades.md). Add or replace the `machine-os-content` image to use the new [OSTree native container format](https://fedoraproject.org/wiki/Changes/OstreeNativeContainer).
-->

## Motivation

1. We want to support more customization, such as 3rd party security agents (often in RPM format).
2. It should be possible for users to fully control the configuration of the OS in the same way they add content.
3. It should be much easier to roll out a hotfix (kernel or userspace) by pushing a derived container and having the MCO use it.
4. We want transactionality, which is not possible with the existing hybrid MCD/rpm-ostree managed model:
   - A significant amount of mutation happens per node
   - [Filesystem modifications are not transactional](https://github.com/openshift/machine-config-operator/issues/1190)
   - Related to the above, `rpm-ostree rollback` does not do the right thing, and it is also hard to roll back at the cluster level
   - There is no way to validate a change without rebooting a node (and booting a new test node in the new configuration is not obvious or easy)

### Goals

- [ ] Administrators can add custom code alongside configuration via a familiar build system
- [ ] The output is a container image that can be pushed to a registry and processed by security scanners
- [ ] Transactional configuration changes
- [ ] Avoid breaking the existing MachineConfig workflow (including extensions)
- [ ] Avoid overwriting existing custom modifications (such as files managed by other operators) during upgrades

### Non-Goals

- While the base CoreOS layer/ostree-container functionality will be accessible outside of OpenShift, this enhancement does not cover or propose any in-cluster functionality for generating or using images outside of the OpenShift node use case.
- This proposal does not cover generating updated "bootimages"; see https://github.com/openshift/enhancements/pull/201
- This proposal does not change the existing workflow for RHEL worker nodes.

## Proposal

**NOTE: Nothing in this proposal should be viewed as final. It is highly likely that details will change. It is quite possible that larger architectural changes will be made as well.**

1. The `machine-os-content` shipped as part of the release payload will change format to the new "native ostree-container" format (and become runnable as a container directly for testing). For more information, see [ostree-rs-ext](https://github.com/ostreedev/ostree-rs-ext/) and [CoreOS layering](https://github.com/coreos/enhancements/pull/7). Internally, this will be an `imagestream` object named `openshift-machine-config-operator/coreos`, owned by the MCO.
2. Each machineconfig pool will have an associated `BuildConfig` object in the spec. The default install will have `mco-controlplane` and `mco-worker` objects in the `openshift-machine-config-operator` namespace. This is where most `MachineConfig` changes will be handled.
3. Each machineconfig pool will also support a `custom-coreos` `BuildConfig` object and imagestream. This build must use the `mco-coreos` imagestream as a base. Its result will be rolled out by the MCO to nodes.
4. Each machineconfig pool will also support a `custom-external-coreos` imagestream for pulling externally built images (PROVISIONAL).
5. The MCD continues to perform drains and reboots, but writes much less configuration per node.
6. The Machine Config Server (MCS) will only serve a "bootstrap" Ignition configuration (pull secret, network configuration) sufficient for the node to pull the target container image.

For clusters without any custom MachineConfig at all, the MCO will deploy the result of the `mco-coreos` build.

### User Stories

#### What works now continues to work

An OpenShift administrator at example.corp is happily using OpenShift 4 (with RHEL CoreOS) in several AWS clusters today, and has only a small custom MachineConfig object to tweak host-level auditing. They do not plan to use any complex derived builds, and just expect that upgrading their existing cluster continues to work and respect their small audit configuration change.
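As a concrete illustration, the kind of small day-2 `MachineConfig` that must continue to work unchanged looks roughly like the sketch below; the file path and audit rule are hypothetical, not taken from this proposal:

```yaml
# Hypothetical example of a small host-level auditing tweak carried as a
# day-2 MachineConfig today; it must keep working under the new model.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-custom-audit
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/audit/rules.d/99-custom.rules
          mode: 0644
          contents:
            # URL-encoded "-w /etc/kubernetes/ -p wa -k kubeconfig"
            source: data:,-w%20/etc/kubernetes/%20-p%20wa%20-k%20kubeconfig
```

Under this proposal, the contents of such an object would land in the image build for the pool rather than being written per node by the MCD.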
#### Adding a 3rd party security scanner/IDS

example.bank's security team requires a 3rd party security agent to be installed on the bare metal machines in their datacenter. The 3rd party agent ships as an RPM today and requires its own custom configuration. While the 3rd party vendor has support for execution as a privileged daemonset on their roadmap, it is not going to appear soon.

After initial cluster provisioning is complete, the administrators at example.bank supply a `BuildConfig` object named `custom-coreos-$pool-build` with an [inline Dockerfile](https://docs.openshift.com/container-platform/4.8/cicd/builds/creating-build-inputs.html#builds-dockerfile-source_creating-build-inputs) that adds a repo file to `/etc/yum.repos.d/agentvendor.repo` and invokes `RUN yum -y install some-3rdparty-security-agent` (here `$pool` = `worker`).

The MCO notices the build object creation and starts an initial build, which gets successfully pushed to the `custom-coreos-$pool-imagestream` imagestream. This gets added to both the control plane (master) and worker pools, and is rolled out in the same way the MCO performs configuration and OS updates today.

A few weeks later, after a cluster-level upgrade has started, a new base RHEL CoreOS image is updated in the `coreos` imagestream by the MCO. This triggers a rebuild of both `buildconfig/mco-coreos-controlplane` and `buildconfig/mco-coreos-worker`, which succeed. This in turn triggers a rebuild of the `buildconfig/custom-coreos-$pool-build` builds.

A month after that, the administrator wants to make a configuration change, and creates a `machineconfig` object targeting the `worker` pool. This triggers a new image build. But the 3rd party yum repository is down, and the image build fails. The operations team gets an alert and resolves the repository connectivity issue. They manually restart the build via `oc -n openshift-machine-config-operator start-build custom-coreos-worker`, which succeeds.

#### Kernel hotfix

example.corp runs OCP on aarch64 on bare metal. An important regression is found that only affects the aarch64 architecture on some bare metal platforms. While a fix is queued for a RHEL 8.x z-stream, there is also risk in fast-tracking the fix to *all* OCP platforms. Because this fix is important to example.corp, a hotfix is provided via a pre-release `kernel.rpm`.

The OCP admins at example.corp get a copy of this hotfix RPM into their internal data store, craft a `Dockerfile` that does `yum -y upgrade https://example.corp/mirror/kernel-5.x.y*.rpm`, and create the `buildconfig/custom-coreos-worker` object in their cluster. The MCO builds a derived image and rolls it out. (Note: this flow would likely be explained as a customer portal document, etc.)

Later, a fixed kernel with a newer version is released in the main OCP channels. As part of `oc adm upgrade`, the `yum -y upgrade` invocation above detects that a newer kernel is already in the base image, and returns an error. The example.corp administrators get an alert, and simply run `oc -n openshift-machine-config-operator delete buildconfig/custom-coreos-worker`. The MCO returns to deploying the `mco-coreos` image.

#### Externally built image

As we move towards having users manage many clusters (10, 100 or more), it will make sense to support building a node image centrally. This will allow submitting the image to a security scanner, or review by a security team, before deploying it to clusters.

Acme Corp has 300 clusters distributed across their manufacturing centers. They want to centralize their build system in their main data center, and just distribute those images to edge clusters. They provide a `custom-coreos-imagestream` object at installation time, and their node CoreOS image is deployed during the installation of each cluster without a build operation.

(Note some unanswered questions below.)
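To make the stories above more concrete, here is a rough sketch of what a user-supplied `custom-coreos` build for the `worker` pool might look like using the standard Build API; the object name, trigger wiring, vendor repo URL, and package name are all illustrative, and the exact shape is provisional:

```yaml
# Provisional sketch of a user-owned build layered on the MCO-owned
# mco-coreos imagestream; names and triggers are illustrative only.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: custom-coreos-worker-build
  namespace: openshift-machine-config-operator
spec:
  source:
    type: Dockerfile
    # Inline Dockerfile; the FROM below is overridden by dockerStrategy.from.
    dockerfile: |
      FROM mco-coreos
      RUN curl -o /etc/yum.repos.d/agentvendor.repo https://example.bank/agentvendor.repo && \
          yum -y install some-3rdparty-security-agent
  strategy:
    type: Docker
    dockerStrategy:
      from:
        kind: ImageStreamTag
        name: mco-coreos:latest
  output:
    to:
      kind: ImageStreamTag
      name: custom-coreos-worker-imagestream:latest
  triggers:
    # Rebuild automatically when the MCO pushes a new mco-coreos base.
    - type: ImageChange
      imageChange: {}
```

The MCO would watch the output imagestream and roll the resulting image out to the pool in the same way it rolls out OS updates today.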
### Implementation details

#### Preserving `MachineConfig`

We cannot just drop `MachineConfig` as an interface to node configuration. Hence, the MCO will be responsible for starting new builds on upgrades or when new machine config content is rendered.

For most configuration, instead of having the MCD write files on each node, the content will be added into the image build run on the cluster. To be more specific, most content from the Ignition `systemd/units` and `storage/files` sections (in general, files written into `/etc`) will instead be injected into an internally generated `Dockerfile` (or equivalent) that performs an effect similar to the example from the [CoreOS layering enhancement](https://github.com/coreos/enhancements/blob/main/os/coreos-layering.md#butane-as-a-declarative-input-format-for-layering):

```dockerfile=
FROM <coreos>
# This is needed
ADD mco-rendered-config.json /etc/mco-rendered-config.json
ADD ignition.json /tmp/ignition.json
RUN ignition-liveapply /tmp/ignition.json && rm -f /tmp/ignition.json
```

This build process will be tracked via an `mco-coreos-build` `BuildConfig` object which will be monitored by the operator. The output of this build process will be pushed to `imagestream/mco-coreos`, which should be used by further build processes.

#### Handling booting old nodes

We can't easily switch the format of the oscontainer, because older clusters may have older bootimages with an older `rpm-ostree` that won't understand the new container format; firstboot upgrades would just fail. Options:

- Double reboot; but we'd still need to ship the old image format in addition to the new one, and really the only sane way to ship both is to generate the old from the new. We could do that in-cluster, per node, or pre-generated as part of the payload.
- Try to run rpm-ostree itself as a container.
- Force bootimage updates (can't be a 100% solution due to UPI).

NOTE: Verify that we're doing node scaling post-upgrade in some e2e tests.

#### Preserving old MCD behaviour for RHEL nodes

RHEL 8 worker nodes in-cluster will require us to continue supporting the existing file/unit writing as well as the provisioning (`once-from`) workflows. See also [openshift-ansible and MCO](https://github.com/openshift/machine-config-operator/issues/1592).

#### Handling extensions

We need to preserve support for [extensions](https://github.com/openshift/enhancements/blob/master/enhancements/rhcos/extensions.md). For example, `kernel-rt` support is key to many OpenShift use cases.

Extensions move to a `machine-os-content-extensions` container that has RPMs. Concretely, switching to `kernel-rt` would look like e.g.:

```
FROM machine-os-extensions as extensions

FROM <machine-os-content>
WORKDIR /root
COPY --from=extensions /srv/extensions/*.rpm .
RUN rpm-ostree switch-kernel ./kernel-rt*.rpm
```

The RHCOS pipeline will produce the new `machine-os-content-extensions` and ensure that its content is tested with the main `machine-os-content`.

#### Kernel Arguments

Not currently in scope for CoreOS derivation. See also https://github.com/ostreedev/ostree/issues/479

For now, updating kernel arguments will continue to happen via the MCD on each node, executing `rpm-ostree kargs` as it does today.
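For reference, kernel arguments would keep flowing through the existing `MachineConfig` field and be applied per node by the MCD; a minimal sketch (the argument shown is only an example):

```yaml
# Kernel arguments stay on the existing MachineConfig path for now;
# the MCD applies them on each node (e.g. via `rpm-ostree kargs`).
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 05-worker-kernelarg-example
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - mitigations=auto   # illustrative argument only
```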
#### Ignition

Ignition will continue to handle the `disks` and `filesystems` sections - for example, LUKS will continue to be applied as it is today.

Further, it is likely that we will need to ship a targeted subset of the configuration via Ignition too - for example, the pull secret will be necessary to pull the build containers.

##### Per machine state, the pointer config

See [MCO issue 1720, "machine-specific machineconfigs"](https://github.com/openshift/machine-config-operator/issues/1720). We need to support per-machine/per-node state like static IP addresses and hostname.

##### 3 Ignition "levels"

- Pointer configuration: this stays unchanged
- Firstboot Ignition: contains the bits needed to perform the switch to the custom image
- Everything else: this all ends up in the `mco-coreos` container image, e.g. the `kubelet.service` systemd unit

#### Drain and reboot

The MCD will continue to perform drains and reboots.

#### Single Node OpenShift

Clearly this mechanism needs to work on single node too. It would be a bit silly to build a container image and push it to a registry on that node, only to pull it back to the host. But it would (should) work.

#### Reboots and live apply

The MCO has invested in performing some types of updates without rebooting, and we will need to retain that functionality. Today, `rpm-ostree` does have `apply-live`. One possibility is that if just e.g. the pull secret changes, the MCO still builds a new image with the change, but compares the node state (current, new) and executes a targeted command like `rpm-ostree apply-live --files /etc/kubernetes/pull-secret.json` that applies just that change live. Or, the MCD might handle live changes on its own, writing files instead to e.g. `/run/kubernetes/pull-secret.json` and telling the kubelet to switch to that.

Today the MCO supports [live updating](https://github.com/openshift/machine-config-operator/pull/2398) the [node certificate](https://docs.openshift.com/container-platform/4.9/security/certificate_types_descriptions/node-certificates.html).

#### Node firstboot/bootstrap

Today the MCO splits node bootstrapping into two locations: Ignition (which provisions all Ignition subfields of a MachineConfig) and `machine-config-daemon-firstboot.service`, which runs before the kubelet to provision the rest of the MachineConfig fields and reboots the node to complete provisioning.

We can't quite put *everything* configured via Ignition into our image build. At the least, we will need the pull secret (currently `/var/lib/kubelet/config.json`) in order to pull the image to the node at all. Further, we will also need things like the image stream for disconnected operation.

In our new model, Ignition will likely still have to perform the subsets of MachineConfig (e.g. disk partitioning) that we do not modify post-bootstrapping. It will also need to write certain credentials for the node to access relevant objects, such as the pull secret. The main focus of the served Ignition config, compared to today, will be setting up `machine-config-daemon-firstboot.service` to fetch and pivot to the layered image. The initial Ignition config we serve through the MCS will also contain all the files it wrote; the MCD firstboot then removes them, since we do not want to have any "manually written files".

We need to be mindful to preserve anything provided via the pointer config, because we need to support that for per-machine state. Alternatively, we could change the node firstboot join to use a pull secret that only allows pulling "base images" from inside the cluster.

Analyzing and splitting this "firstboot configuration" may turn out to be a nontrivial amount of work, particularly in corner cases. A mitigation here is to incrementally move over only the things we are *sure* can be done via the image build.
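As a rough illustration of how small the served configuration could become, here is a sketch expressed in Butane's `openshift` variant purely for readability; the `target-image` file is hypothetical (the actual mechanism for telling the firstboot service which image to pivot to is undecided), and the real config would be generated by the MCS:

```yaml
# Provisional sketch of the minimal "bootstrap" config served to a new node:
# just enough to pull and pivot to the layered image.
variant: openshift
version: 4.9.0
metadata:
  name: 00-worker-firstboot
  labels:
    machineconfiguration.openshift.io/role: worker
storage:
  files:
    # Pull secret, needed before the node can fetch any image at all.
    - path: /var/lib/kubelet/config.json
      contents:
        inline: '{"auths": {}}'   # placeholder; the real secret is injected by the MCS
    # Hypothetical hint consumed by machine-config-daemon-firstboot.service.
    - path: /etc/machine-config-daemon/target-image
      contents:
        inline: |
          image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/mco-coreos:latest
```

Per-machine pointer-config content (static IPs, hostname) would ride alongside this, as discussed above.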
##### Compatibility with openshift-ansible/windows containers

There are other things that pull Ignition:

- [openshift-ansible for workers](https://github.com/openshift/openshift-ansible/blob/c411571ae2a0b3518b4179cce09768bfc3cf50d5/roles/openshift_node/tasks/apply_machine_config.yml#L23)
- [openshift-ansible for bootstrap](https://github.com/openshift/openshift-ansible/blob/e3b38f9ffd8e954c0060ec6a62f141fbc6335354/roles/openshift_node/tasks/config.yml#L70) fetches the MCS
- [windows node for openshift](https://github.com/openshift/windows-machine-config-bootstrapper/blob/016f4c5f9bb814f47e142150da897b933cbff9f4/cmd/bootstrapper/initialize_kubelet.go#L33)

#### Intersection with https://github.com/openshift/enhancements/pull/201

In the future, we may also generate updated "bootimages" from the custom operating system container.

#### Intersection with https://github.com/openshift/os/issues/498

It would be very natural to split `machine-os-content` into e.g. `machine-coreos` and `machine-kubelet`, where the latter derives from the former.

#### Using RHEL packages - entitlements and bootstrapping

Today, installing OpenShift does not require RHEL entitlements - all that is necessary is a pull secret.

This CoreOS layering functionality will immediately raise the question of supporting `yum -y install $something` as part of a node, where `$something` is not part of our extensions that are available without entitlement.

For cluster-internal builds, it should work to do this "day 2" via [existing RHEL entitlement flows](https://docs.openshift.com/container-platform/4.9/cicd/builds/running-entitled-builds.html#builds-source-secrets-entitlements_running-entitled-builds). Another alternative will be providing an image built outside of the cluster.

It may be possible in the future to perform initial custom builds on the bootstrap node for "day 1" customized CoreOS flows, but that adds significant complexity around debugging failures. We suspect that most users who want this will be better served by out-of-cluster image builds.
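A sketch of what the entitled in-cluster flow could look like, following the documented entitled-builds pattern of exposing the entitlement keys as a build input secret; the secret name follows that documentation, while the BuildConfig name and package are placeholders:

```yaml
# Sketch: exposing RHEL entitlement certificates to a custom build via a
# build input secret, per the existing entitled-builds flow.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: custom-coreos-worker-build
  namespace: openshift-machine-config-operator
spec:
  source:
    type: Dockerfile
    secrets:
      - secret:
          name: etc-pki-entitlement   # created from the entitlement certs
        destinationDir: etc-pki-entitlement
    dockerfile: |
      FROM mco-coreos
      # Make the entitlement keys available to yum, then drop them from the image.
      COPY ./etc-pki-entitlement /etc/pki/entitlement
      RUN yum -y install some-entitled-package && \
          rm -rf /etc/pki/entitlement
  strategy:
    type: Docker
    dockerStrategy:
      from:
        kind: ImageStreamTag
        name: mco-coreos:latest
  output:
    to:
      kind: ImageStreamTag
      name: custom-coreos-worker-imagestream:latest
```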
### Risks and Mitigations

We're introducing a whole new level of customization for nodes, and because this functionality is new, we don't yet have significant experience with it. There are likely a number of potentially problematic "unknown unknowns".

To say this another way: until now we've mostly stuck to the model that user code should run in a container, keeping the host relatively small. This could be perceived as a major backtracking on that model. It also intersects heavily with things like [out of tree drivers](https://github.com/openshift/enhancements/pull/357). We will need some time to gain experience with what works, develop best practices, and build tooling and documentation.

It is likely that the initial version will be classified as "Tech Preview" from the OCP product perspective.

#### Supportability of two update mechanisms

If for some reason we cannot easily upgrade existing FCOS/RHCOS systems provisioned prior to the existence of this functionality, and hence need to support *two* ways to update CoreOS nodes, it will become an enormous burden. Relatedly, we would need to continue to support [openshift-ansible](https://github.com/openshift/openshift-ansible) for some time alongside the `once-from` functionality. See also [this issue](https://github.com/openshift/machine-config-operator/issues/1592).

#### Versioning of e.g. kubelet

We will need to ensure that we detect and handle the case where core components (e.g. the `kubelet` binary) come from the wrong place or are the wrong version.

#### Location of builds

Today, nodes are ideally isolated from each other. A compromised node can in theory only affect pods which land on that node. In particular, we want to avoid a compromised worker node being able to easily escalate to compromising the control plane.

#### Registry availability

If implemented in the obvious way, OS updates would fail if the cluster-internal registry is down. A strong mitigation is to use ostree's native ability to "stage" the update across all machines before starting any drain at all. However, we should probably still be careful to only stage the update on one node at a time (or `maxUnavailable`) in order to avoid "thundering herd" problems, particularly for the control plane with etcd.

Another mitigation here may be to support peer-to-peer upgrades, or to have the control plane host a "bootstrap registry" that just contains the pending OS update.

#### Manifest list support

We know we want heterogeneous clusters; right now that's not supported by the build and image stream APIs.

#### openshift-install bootstrap node process

A key question here is whether we need the OpenShift build API as part of the bootstrap node or not. One option is to do a `podman build` on the bootstrap node. Another possibility is that we initially use CoreOS layering only for worker nodes.

##### Single Node bootstrap in place

Today, [Single Node OpenShift](https://docs.openshift.com/container-platform/4.9/installing/installing_sno/install-sno-installing-sno.html) performs a "bootstrap in place" process that turns the bootstrap node into the combined control plane/worker node without requiring a separate (virtual/physical) machine. It may be that we need to support converting the built custom container image into a CoreOS metal image that would be directly writable to disk, to shave off an extra reboot.

## Design Details

### Open Questions

- Would we offer multiple base images, e.g. could users now choose RHEL 8.X "z-streams" versus RHEL 8.$latest?
- How will this work for a heterogeneous cluster?

#### Debugging custom layers (arbitrary images)

In this proposal so far, we support an arbitrary `BuildConfig`, which can do anything but would most likely be a `Dockerfile` build. Hence, we need to accept arbitrary images, but we will have the equivalent of `podman history` exposed to the cluster administrator and to us.

#### Exposing custom RPMs via butane (Ignition)

Right now we have extensions in MachineConfig; to support fully custom builds it might suffice to expose yum/rpm-md repos and an arbitrary set of packages to add. Note that Ignition is designed not to have distro-specific syntax. We'd need to either support RPM packages via Butane sugar, or think about a generic way to describe packages in the Ignition spec. This would be a custom container builder tool that drops the files from the Ignition config into a layer; it could also be used in the underlying CoreOS layering proposal.
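Purely to illustrate the open question, a hypothetical Butane extension for declaring packages might look like the following; no such `packages` section exists today in Butane or Ignition, and the field names are invented for this sketch:

```yaml
# Hypothetical sketch only: a distro-agnostic way to declare rpm-md repos and
# packages that would be lowered to a container build, not to Ignition.
variant: openshift
version: 4.9.0
metadata:
  name: 99-worker-packages
  labels:
    machineconfiguration.openshift.io/role: worker
# NOTE: this section does not exist in any released Butane spec.
packages:
  repos:
    - name: agentvendor
      baseurl: https://example.bank/repo
  install:
    - some-3rdparty-security-agent
```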
#### External images

This will need some design to make it work nicely to build images for a different target OCP version. The build cluster will need access to base images for multiple versions. Further, the MCO today dynamically templates some content based on the target platform, so the build process would need to support running the MCO's templating code to generate per-platform config at build time. Further, we have per-cluster data such as certificates. We may need to fall back to doing a minimal per-cluster build, effectively just supporting replacing the CoreOS image instead of replacing the `mco-base`.

### Test Plan

Attempting to convert as much of the default MachineConfig flow as possible to use this functionality will heavily exercise the code.

### Graduation Criteria

(TBD)

**Tech Preview**

**GA**

### Upgrade / Downgrade Strategy

See above - this is a large risk. Nontrivial work may need to land in the MCO to support transitioning nodes.

### Version Skew Strategy

Similar to above.

## Implementation History

There was a prior version of this proposal which was OpenShift-specific and called for a custom build strategy. Since then, the "CoreOS layering" effort has been initiated, and this proposal is now dedicated to the OpenShift-specific aspects of using this functionality, rather than also containing machinery to build custom images.

## Drawbacks

If we are successful, not many. If it turns out that e.g. upgrading existing RHCOS systems in place is difficult, that will be a problem.

## Alternatives

Continue as is - supporting both RHEL CoreOS and traditional RHEL (where it's more obvious how to make arbitrary changes, at the cost of upgrade reliability), for example.

<!--

---

# walters

Feedback/discussion:

- High level idea of generating an image and having the MCO apply it makes a lot of sense.
- I just don't want to lose the idea of having something like `rpm-ostree status` that *clearly* shows what the customer has layered versus our base image.
  - A: Agreed that we need to have something that shows the layers. Ideally, we would preserve a listing of the base layers.
- Big change to the MCO implementation - can we do this first *without* exposing lots of knobs to users too?
  - A: For the first pass, I'd envision that the rendered MCO configuration would build `machine-cluster-content` and apply that. Depending on how we choose to implement it, we could use the MCD to simply set the `machine-*-content` upstream image and use Zincati (attractive for Hyper Scale), or have the MCO trigger an upgrade when it detects a change to the image stream.
- Let's run through the example of "I know `runc` is fixed in 4.7.N and I want to cherry pick that". Another good example is kubelet builds in CI.
  - A: Since builds will be part of the cluster, it means that `runc` and others will be able to CI the change via Prow. In this case a tarball with the overrides would be provided via a binary input, i.e. `oc start-build machine-user-content/pool --from-file bits.tar`. The MCBS for the `machine-user-content` would know to pull the latest `machine-cluster-content` to apply the override. A cache layer would be pushed to the image stream and re-used for subsequent builds.
- *Requires solving* https://github.com/openshift/machine-config-operator/issues/1720 (Perhaps the end story here is that since we still support providing Ignition to nodes, any bits in filesystem/systemd units can become machine-specific config provided $however)
- Bootstrap flow needs design: presumably the bootstrap node builds the base image. Clearly we still need to provision the pull secret on the bootstrap node, so that would still run through Ignition etc.
  - A: Correct, the bootstrap workflow would change. The back-of-a-napkin design is that Ignition would handle the image pull, but in that case Ignition would only handle limited functions like disk and filesystem setup. Since the image is already constructed, the node would boot directly into its ready state, eliminating the multiple reboots.
- Ideally, we support something like fsverity/dm-verity in the future where e.g. we can enforce that all privileged OS content is actually *signed* (like iOS/Android). Should think a bit about how something should be signed here.
  - A: The security benefits are quite impressive in this world view. StackRox and other scanning software would be able to inspect the final contents, and companies would have the ability to inspect the output images to check for base-level compliance.

## miabbott notes

- OS Observability
  - Can we monitor the journal for SELinux denials, OOMs, core dumps?
  - Probably orthogonal to MachineContent/usr-os-content
- Diagnosability
  - "you've added this binary to the user-os-content and now it is dumping core"
- Linting
  - added SSH keys, but that is managed by something else
  - added a systemd unit, but it is misconfigured
  - trying to write a file to a non-writable location
  - etc.
- Intersection with RHEL for Edge/Image Builder
  - Image Builder blueprints seem very similar to the MachineContent spec
- Scalability
  - How will this be managed by an SRE team that is responsible for 1000s of clusters, i.e. IBM Cloud folks?

A: Thanks for highlighting the question of linting -- one of the complaints about the MCO is that a machine-config is somewhat dangerous. If the basic structure of the MCC is correct, then the MCO will pass through the operation to the entire cluster. Only on failure (and degradation of the pool) does someone get feedback, but they don't understand what went wrong. The choice of using a new build variant is deliberate because:
- it separates the build and application steps
- users can use standard tools to inspect the logs (`oc logs muc/pool-of-glory`) to see what went wrong
- users would be able to pull and inspect the resultant image

## eparis

- How do we update the ICSP without a reboot?
  - A: AFAIK, we will still need to do a reboot, unless we have the MCD apply it. In this world, the ICSP would be a secret ref that would be rendered by the MCD on the host.
  - walters: An entirely different approach would be using rpm-ostree's `apply-live` path to do rebootless updates.
- Is there value in separating the "machine-os-content" that comes from RHEL and what comes from OpenShift into their own layers?
  - A: The value is reducing scope from general to specific. Each layer is a reduction of the scope. The machine-os-content that comes from RHEL is understood to be a base OS that can work on ANY cluster, while the cluster content is specific to a node joining one cluster, and the machine-user-content can be scoped to a single pool or set of machines.
  - That isn't an answer to the question.
    Is there value in having separate layers from Red Hat, one with RHEL content and one with OpenShift binaries, like the kubelet and crio bits?
  - A: Oh. The OpenShift binaries would continue to ship as part of the `machine-os-content`. The `machine-cluster-content` would be composed from what the MCO renders today. The value then is a separation of the binaries and the configuration.
  - I get that separation. Wouldn't having a layer owned by RHCOS (RHEL bits) and a layer owned by node (kubelet/crio) make our teams work better together?
  - A: Now I feel like an idiot :) Perhaps. TBH, one thing that is not clear here is that having layers of content will help enable a CI chain for those tools and components (think E2E tests for runc or kubelet). So I do think there would be value, and the line can be somewhat arbitrary based on our needs. Realistically, we could have one `machine-os-content` per release and then add on the upgrades as separate layers.
  - Hey, that's how I usually feel!
- How does the kubelet configuration work?
  - A: The kubelet configuration would be embedded in the `machine-cluster-content` layer.
- Does the final set of layers need to be behind some set of auth? If I can get the final set of layers without any auth, how do I get a token to create a CSR to register as a node? (The token is available inside the ignition data today, and lots of people hate that, because ignition is not behind auth and can give access to create objects in the api.)
  - A: The final image would be pushed to an image stream in the cluster behind RBAC. The MCD would be responsible for writing the CSR as part of booting. The token would be written as part of Ignition. HOWEVER, the Machine Config Server may not be needed.
  - ok, no worse and no better than today, i think. gotta ponder just a bit :)
  - A: This is an area we definitely can improve on, for sure.

## jligon

- Can they bring containerd to the user content if they find someone else to support it?
  - A: Sure, in theory. We could put guard rails in place if we want to prevent some actions. We might separate the `machine-os-content` to be RHEL with overlays for RHEL for Edge and OpenShift.
- Is this all going to be built by COSA, or are we going to call out to the Image Builder service?
  - A: The Machine Content Build Strategy will have its own container for doing the build, since it's in-cluster. I'm sure we could use the same container elsewhere.
  - A2: COSA would not be used for in-cluster builds. The base `machine-os-content` as found in the release manifest will be used as the base layer and will be built as part of the regular build process.
- Will override content like the RT kernel have a forked/split image?
  - A: On upgrade/install, the cluster will build base images for each extension set as a `machine-os-content-*` image. The cluster settings will be layered on top of those base images, and then the user content, to form the final `machine-user-content` that is applied to the machines. Switching to the RT kernel will be as simple as changing the `rpm-ostree` backend.
- How much will the `machine-os-content-*` differ from the RHEL for Edge package set?
  - A: I would hope very little, other than the additions. TBH, I don't have much background on that. @cgwalters? Each `machine-os-content-*` will be based on the currently supported extensions, built on the base `machine-os-content`.

## kirsten

- How does the new machine-user-content differ from the current day-2 machine config changes that a user can make?
  (also noting there is potential here to streamline the UI, which got kind of messy via user-submitted machine configs)
  - A: Each machine-config change would trigger a build, and the MCO would monitor for the output. The build API would ensure that the machine-user-content is generated. The MCO/MCD would trigger the update by watching the `machine-user-content` image streams to know when the update is ready (or better, require the user to trigger the update?). Further, the MCO wouldn't have to loop. It could be configured to watch for image streams to trigger updates, OR it could be configured for the user to initiate the update.
- How inspectable would these images be? For example: "Oh wow, I messed up here; what in the world did I do?"
  - A: There would be two ways: the build API will show logs for all builds, including the `machine-cluster-content`. With another change that is being planned for the future, the rpm-ostree image will be replaced with an OCI container (probably a few cycles out), meaning that users will be able to do `oc exec -it -n machine-config-operator --image.....`
- This would get rid of the kubelet config & crio controllers, but would this simplify the main MCC and its many syncs? Feels like a yes?
  - A: I think it would. I would posit that we could stop exposing the MCC to end users and only allow MCCs from other operators.
- Overall, I think the idea holds a lot of promise in simplifying the MCO and its operations, hopefully resulting in a smoother UX, more dependable outcomes, and easier troubleshooting. Definitely will think on this more, but it feels like a good direction and potentially a lot cleaner than what we're doing today.

## Jerry

A lot of points have already been brought up, so I wanted to quickly go through some flows and check my understanding:

1. bootstrap flow
   - It sounds like from Colin's question we will have a bootstrap version of the MCBS that waits for the bootstrap MCO to finish rendering the initial master configs, after which it will build the required layers on top of the corresponding in-cluster MOC, and serve that to the master nodes. The master node ignition knows to pull the image and have the node perform an early pivot from the bootimage directly to the in-cluster MOC with all layers?
     - A: More or less, this would be the case.
   - In this flow, who would be responsible for applying e.g. FIPS (early boot configs)? Same flow as today? (in the initial boot ignition, not the pivoted image)
     - A: Same flow as today. Ignition would stage an OSTree (with kernel and kernel args preset) that enables FIPS. The MCBS won't include the CSR or any key material, so FIPS wouldn't matter until after the boot.
   - Does ignition also have the full ability to stage the full layered OS that is incoming? What does that look like?
     - A: Ignition would pivot into the full layered image. To Ignition and other observers, it would be one image. The MCBS will compose all the layers into a single image.
2. new node joining the cluster
   - Similar to above, basically either:
     - a secret contains the stub secret given to the MCS, the MCS serves a second stub with pointer ignition to pull the MCBS build, and ignition pulls the image and does its thing
     - no MCS, in which case the secret is constantly updated to contain the correct pointer ignition to pull the latest MCBS build
3. os config updates
   - Main question (like Eric mentioned) is for rebootless updates. Will the MCD in general have insight into the total change incoming?
     The MCD today has no way of knowing what changed from a controller perspective, and can only diff the current and desired configs. It sounds like the MCD won't have insight into most of the changes directly. (The key concern is cert rotation, as that is a constant thorn, but others also apply.)
     - A: As designed, the MCD shouldn't _care_. Certificates and secrets would remain the domain of the MCD to write on the node.
   - Thinking through edge cases (restoring system defaults, unit presets, etc.) seems to be fine (and easier) with the new flow.
4. upgrades
   - This seems like it would be the most seamless.
     - A: Correct. On upgrade, the cluster-content will be generated by the MCBS and then the machine-user-content will be composed from that layer. The MCD would simply apply the change, and the bulk upgrade operations like extensions or template upgrades will be composed in the MCBS. For upgrades, the MCD might only care about secrets/certificates and then applying the updated image with a reboot.
5. verification
   - The MCD today validates the state of the system. Will this change? (also see 3)
     - A: The MCD will only care about the few files it writes. We will need to do some design around validation, but most of the validation would be unneeded since the MCBS would compose the final image for most changes. The success or failure of the build stage would provide the information. I believe we could CI the final image by doing a mock boot to do basic smoke tests.
6. RHEL 8 nodes
   - The current RHEL 8 plan sounds like we are going to have the MCD manage the file/unit writing. What about in this new flow? Does the MCD have 2 modes of operation (current code to support RHEL 8 + new code to support the new MCBS)?
     - A: Out of scope. The MCD could stop caring about RHEL 8 vs RHCOS, other than knowing that on RHCOS it applies an image.
     - Q: I'm not sure I follow; who will be writing the "base" configs onto RHEL 8 nodes then? A new RHEL 8 operator? (also see point 8 on onceFrom)
7. windows nodes
   - If I understand correctly, windows nodes use their own version of the MCO. Will their workflow change?
     - A: Completely out of scope. The image-based workflow is RHCOS specific.
8. What would happen to the MCD onceFrom mode of operation?
9. What about Single Node OpenShift bootstrap-in-place?

## Sinny

This is a great proposal and has very nice ideas! Based on what I understood so far, I have some follow-up questions:

* machine-os-content-extensions: How does applying a day-2 extension look in this approach?
  - A: The cluster will auto-generate the base `machine-os-content-<extension>` to an image stream. The MCD will apply the ostree from the image stream with the appropriate tag, such as `machine-os-content-rt-kernel:pool` or `machine-user-content-rt-kernel:pool`.
* Does the MCBS in the end generate a single OSTree repo out of the different machine-*-content builds initiated? If not, does rpm-ostree already handle applying different unrelated image builds (OSTrees) as a single deployment?
  - A: The layers will be collapsed into a single repo. As far as the MCD is concerned, it will be tasked with applying a single repo.
* There are OpenShift operators like the NTO (Node Tuning Operator) that apply some MachineConfigs. In this approach, are they considered user content or cluster content?
  - A: All operators would be considered cluster content.
* Is the MCO responsible for requesting the build of machine-user-content and applying the image on the node?
  - A: No. The user would not write MCCs anymore.
    Rather, they would write a YAML configuration like a container build. The user would target the `machine-cluster-content` by a label such as `worker` or `master`.
* Do the MCBS builds for machine-*-content occur inside the cluster?
  - A: Except for the `machine-os-content` that comes from Red Hat, all other builds will be done inside the cluster.
  - If yes, on which node will the build run? Building the image on the fly would consume additional system resources.
    - A: Images would be built on build nodes (or infra nodes). The cost of building the image is mostly going to be in terms of space and storing the resultant images. CPU and memory costs will be shifted from individual nodes to a central location.
  - How will this look on other products like SNO, where they are trying to fit all OCP-consumed resources into 1 core?
    - A: The cost should be minimal since there would be fewer layers and the layers will be smaller (overlaying text files instead of RPMs).
  - If no, how will this build take place in disconnected or behind-proxy environments?
* As a lot of customers are time sensitive, what does the average time look like to build the different machine-*-content images?
  - A: Unknown. However, since the rendering of the MCO content, the building of the layers, and applying the layers are discrete steps, the work can be done before applying (pausing the pool). This should help with time-sensitive customers.
* How do we ensure that it will be a smooth transition for existing customers' clusters, including upgrade and node scale-up?
  - A: Node scale-up should be faster since new nodes will boot directly into their target state (no reboot).
  - A2: On upgrade, the MCD would remove the files it wrote, and then apply the new ostree.
* Maybe I am missing something, but how will the new approach solve single-node configuration without creating 1:1 node-to-pool configurations?
  - A: This does provide a path for getting rid of 1:1 pools to nodes. Pools would be for general configuration and the resultant `machine-cluster-content` will be for the general pool. When a user/admin needs to provide different content, they would define a new MCBS for that node. I think we would need a new mechanism for the MCD to select the appropriate image from a hierarchy (`machine-os-content:latest` < `machine-cluster-content:worker-latest` < `machine-user-content:worker-latest` < `machine-user-content:worker_<node>-latest`) -- I am not suggesting using arbitrary image tags; rather, I'm using image tags to demonstrate the idea.
* How does this approach improve node safety, i.e. ensure applied user or cluster configs are not buggy?
  - A: The user will be able to inspect the resultant configuration, see the build logs using the build API, and we could have a CI step. The MCBS approach won't solve for buggy content, but will provide a clear delineation of where the failure happens. Since only operators will use MCCs, rendering failures will happen up the stack, and users will be able to separate their changes from the cluster's by looking at the `machine-user-content` build.

## Derrick

* No plans to support yum/DNF repositories as input for user content? For users that want to layer packages that may have complex dependencies, it may be difficult for them to manage those packages and their updates manually. When it comes to RH content, I realize this may add complexity due to entitlements, but it may be possible to utilize some of the work that the Build team is doing to cover this use case.
  - A: Initially, there would be no plan for supporting yum/dnf repos, although we could easily do that.
    There are a couple of reasons I specifically ignored supporting repos to start: 1) RHCOS provides some content that is not available in RHEL; enabling arbitrary repos could put us down the path of having some core components replaced, such as crio, the kubelet, etc.; 2) entitlement problems.

# Notes from Aug 26 2021 Meeting

Attendees: cgwalters, jlebon, dcarr, sdodson, yuqizhang, mrunalp, gmarkley, jkyros, rphillips, skumari, miabbott, mrussell, mkrejci, aravindhp

[cgwalters] elevator pitch: imagine we deliver + boot the OS from a container image (phase 1). Imagine we can configure the OS like we build a container image (FROM quay.io/openshift/os) and additively put on layers (phase 2).

[yuqizhang] how do we execute the plan? in-place upgrade? only works for coreos; will we make it work for RHEL? will MCO be the unified interface for all OS variants? how will it interface with the SR-IOV operator, etc. that does not interact with MCO?

[cgwalters] initially just changing the implementation for CoreOS systems; would have to keep existing MCD support for non-CoreOS systems.

[skumari] what does the timeline look like for implementation? what does migration look like from existing MCs -> MCBS? can live updates happen w/out node drain?

[cgwalters] we want transactional updates between states, but want to avoid reboots for things. should be able to utilize the `rpm-ostree live` support. drain should not be needed for all live updates (depends on workloads). MCs could be translated to just files as a layer on top of the existing OS container image. kernel args are more difficult.

[dcarr] tweak to elevator pitch: clarifies the responsibility model for customers in terms of what they can modify on CoreOS systems. teams should not have to think about whether a change needs a reboot or not, but also have rollback/canary support. regardless of implementation, we need a better responsibility model for RHCOS.

[jlebon] unlikely that the kernel did not change between minor releases; would have to reboot. proposal covers non kernel/initramfs updates.

[jlebon] in-between phase: interface to customer doesn't change, but under the covers we are doing a migration/change to MCBS style

[sdodson] historically kernel updates every 3 z-streams

[skumari] if the image is being built in cluster, what kind of performance hits will we take? how will it affect the upgrade performance?

[cgwalters] adding layers to an existing OS image shouldn't be terrible for perf

[mrunalp] is there a way that users will be able to build a custom image locally and test it out, outside of the cluster? do we need additional tooling to support this?

[dcarr] telemetry to see that a user has changed the OS image would be useful/necessary. assumed that the build of the OS image would happen before the maintenance window?

[cgwalters] if the CVO pulls in an update that has a new RHCOS, who controls when the new OS image build happens? needs careful consideration.

[yuqizhang] would like to see the MCD be the interface to manage changes to RHCOS and RHEL nodes. fundamental problem?

[dcarr] MCs are opaque to the cluster. having them as an image layer allows for better introspection/better security model(?). k8s admins are familiar with Dockerfile-style builds; MCs are new and need to be learned. unclear messages to users about what can be edited. want to improve boundaries between teams so that we deal with "envelopes" (i.e. don't care about content). should be able to say "reboot with this envelope", "don't reboot with this envelope".
would consider deprecating the MC spec section that allows raw Ignition configs; make it internal only. other operators should be able to pass an image layer to the MCBS.

[mkrejci] what are the parameters? what are the timelines? what are the ambitions?

[dcarr] walk, crawl, run: delivery of the os image via container, apply a "Red Hat layer" of default configs...

[cgwalters] optimal to have k8s defaults as a separate layer. could have the MCD still writing out user content separately.

[mkrejci] what is unacceptable? support? time? resources? performance? what signals should we watch for?

[dcarr] reserve the time to consider that. operational safety of OCP is the single most important thing to the business; willing to devote lots of resources towards this. would like to see rollback of workers be safer. improving mean time to recovery. if this proposal doesn't improve operational safety, then we shouldn't do it. but we need an alternative to achieve that goal.

[mrunalp] user overlay could be built as an rpm to test out changes when building a custom OS image. would like to tackle stale-content-on-host types of issues (i.e. crio config). moved from crio.conf -> crio.d (https://bugzilla.redhat.com/show_bug.cgi?id=1995785)

[cgwalters] unacceptable outcome: if we have to support two ways to update/modify the OS (at least for RHCOS). need to handle node-specific configuration, which this proposal doesn't cover.

[dcarr] reputational risk that content is coming from MCs that we don't know about(?); moving to an image layer gives plausible deniability

[mrussel] can we rename MCBS to Machine Container Build Strategy?

[jkyros] I typed up some notes from my meeting with Colin on live-apply cases from my perspective; it was too long to just dump here: https://docs.google.com/document/d/16nPRS-9LKdrpaxgp6u7LX74AykHBWcmbqRK1W-K7rnw/edit#
-->