---
title: Fallback on Failing Revisions of Static Pods
authors:
  - "@sttts"
reviewers:
  - "@p0lyn0mial"
  - "@mfojtik"
  - "@soltysh"
  - "@marun"
approvers:
  - "@mfojtik"
  - "@soltysh"
creation-date: 2021-06-21
last-updated: 2021-06-21
status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced|informational
# see-also:
#   - "/enhancements/this-other-neat-thing.md"
# replaces:
#   - "/enhancements/that-less-than-great-idea.md"
# superseded-by:
#   - "/enhancements/our-past-effort.md"
---

# Fallback on Failing Revisions of Static Pods

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Static pod operators use a revisioned pod manifest together with revisioned ConfigMaps and Secrets to roll out and start a new configuration of the operand atomically. Configuration can be bad in different ways. In HA OpenShift, the static pod operator waits for the operand on one node to start up and to become healthy and ready before continuing, and it stops rolling out a bad revision to further nodes. In single-node OpenShift (SNO), there is no other pod to ensure availability while a new revision rolls out. Hence, a bad revision can be fatal for the cluster, especially for kube-apiserver and etcd.

This enhancement describes a fallback mechanism that makes the operand start a previous revision when the new revision fails, and how the operator notices and reacts to this event.

The fallback mechanism is opt-in by the operator; the kube-apiserver and etcd operators (and potentially the kube-controller-manager and kube-scheduler operators) have to consult the deployment topology in the infrastructure resource to decide whether to enable it.

## Motivation

In SNO, we have only one chance to start the new static pod for kube-apiserver or etcd. When this fails, the cluster is bricked and there is no automatic recovery.

Bad configuration can have many causes:

- **the pod manifest can be wrong**: e.g. invalid YAML syntax, failure to validate as a pod, or some deeper semantic error.
- **each one of the [23 different config observers of kube-apiserver](https://github.com/openshift/cluster-kube-apiserver-operator/blob/ce4170b4a040fc03603f62d686a6e9ea0cadde34/pkg/operator/configobservation/configobservercontroller/observe_config_controller.go#L119), provided and owned by 9 different teams, can lead to bad configuration**: e.g. a config observer adds a flag `--foo` which was removed in the current upstream release, but due to lacking test coverage this did not show up in CI (a minimal illustration follows at the end of this section).
- **invalid data can be sourced from other places with incomplete validation**: e.g. a network mask given through some other config.openshift.io/v1 CR defined by the user is invalid.
- **some ConfigMap or Secret can be wrong**: e.g. the audit policy is a YAML file inside a ConfigMap. If it is invalid, the kube-apiserver will refuse to start up. ConfigMaps and Secrets are synced into the operand namespace from many different sources, and then copied as files onto the master node file system by the installer pod.
- **some ConfigMap or Secret is marked as optional but is actually required**: the installer will continue rolling out the pod after skipping the optional files. In HA OpenShift, this might not show up because the operator quickly creates another revision once the ConfigMap or Secret shows up. In SNO, this race might be fatal.

It is impossible to guarantee that none of these ever happens because the test permutations are infeasible to cover completely in CI, especially as many of them are disruptive to cluster behaviour and hence very expensive to test.
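For the config observer case above, here is a simplified, self-contained sketch of the pattern. The type, function name, and signature are illustrative assumptions, not the actual library-go API; the point is only that nothing in this code path verifies that an emitted flag still exists in the current kube-apiserver release.

```go
// Simplified, self-contained illustration of the config observer pattern.
// Names and the signature are hypothetical, not the actual library-go API.
package observers

// ObservedConfig stands in for the unstructured config fragment an observer
// returns; it is merged into the operand configuration and rendered into
// command-line arguments of the static pod.
type ObservedConfig map[string]interface{}

// observeExampleFeature is a hypothetical observer that unconditionally
// emits an argument. If the flag behind it ("--foo") was removed upstream,
// nothing here notices: the revision rolls out, kube-apiserver refuses to
// start, and in SNO there is no second node to keep the API available.
func observeExampleFeature(_ ObservedConfig) (ObservedConfig, []error) {
	return ObservedConfig{
		"apiServerArguments": map[string]interface{}{
			"foo": []interface{}{"true"}, // rendered as --foo=true
		},
	}, nil
}
```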
### Goals

- recover from a badly formatted or invalid pod manifest
- recover from bad config observer output
- recover from bad revisioned ConfigMaps and Secrets.

### Non-Goals

- recover from invalid non-revisioned certificates
- recover from missing or invalid files on the masters given by static paths outside of the `/etc/kubernetes` directory (e.g. some certs or kubeconfig the operand is consuming)
- optimize downtime of the API beyond the accepted 60s plus some waiting time to decide whether the operand is healthy. This mechanism is for the disaster case. If downtime is `O(5min)`, this is completely ok.

## Proposal

We propose to add a `<operand>-startup-monitor` static pod that watches the operand pod for readiness. It is created by the installer as another manifest in `/etc/kubernetes` and provided by the static pod controller the same way the operand pod manifest is created today.

If the operand pod does not start up within N minutes, e.g. because of:

- an invalid pod manifest
- a timeout in the free-port wait loop
- crash-looping
- not answering on the expected port (connection refused)
- being healthy but never reporting `readyz`

then the task of the startup-monitor is to fall back: when it detects problems with the new revision, the startup-monitor copies the pod manifest behind the `/etc/kubernetes/static-pods/last-known-good` link (or of the previous revision if the link does not exist, or does nothing if there is no previous revision, as during bootstrapping) into `/etc/kubernetes`. It adds the annotation `startup-monitor.static-pods.openshift.io/fallback-for-revision: <revision>` to the old revision's pod manifest. The kubelet copies this annotation into the mirror pod, so the operator knows that the fallback is due to a problem starting up the new revision.

If the operand becomes ready, the startup-monitor points the `/etc/kubernetes/static-pods/last-known-good` link at the new revision, and then removes its own pod manifest from `/etc/kubernetes`.

If the startup-monitor notices an operand pod manifest of a different revision than its own (by checking the operand manifest in `/etc/kubernetes` and its revision annotation), it does nothing and just keeps watching the operand pod manifest. This is important to avoid races on startup, and to avoid problems on downgrade to a version from before this mechanism was introduced.
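To make the fallback step concrete, the following is a minimal sketch of what the startup-monitor could do. The directory layout, the per-revision naming scheme, the file names, and the helper functions are illustrative assumptions, not the final implementation; only the annotation key comes from this proposal.

```go
// Minimal sketch of the fallback step performed by the startup-monitor.
// Layout, file names and function names are illustrative assumptions.
package startupmonitor

import (
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

const fallbackAnnotation = "startup-monitor.static-pods.openshift.io/fallback-for-revision"

// resolveFallbackManifest picks the manifest to fall back to: the
// last-known-good link if it exists, otherwise the previous revision,
// otherwise nothing (e.g. during bootstrapping).
func resolveFallbackManifest(staticPodDir string, failedRevision int) (string, bool) {
	if target, err := filepath.EvalSymlinks(filepath.Join(staticPodDir, "last-known-good")); err == nil {
		return target, true
	}
	if failedRevision <= 1 {
		return "", false // no previous revision to fall back to
	}
	// Illustrative per-revision layout; the real layout may differ.
	return filepath.Join(staticPodDir, fmt.Sprintf("revision-%d", failedRevision-1), "pod.yaml"), true
}

// fallbackToPreviousRevision copies the fallback manifest into the kubelet's
// static pod manifest directory and annotates it, so that the kubelet copies
// the annotation into the mirror pod and the operator can see that
// failedRevision did not start up.
func fallbackToPreviousRevision(manifestDir, staticPodDir string, failedRevision int) error {
	source, ok := resolveFallbackManifest(staticPodDir, failedRevision)
	if !ok {
		return nil
	}
	raw, err := os.ReadFile(source)
	if err != nil {
		return err
	}
	pod := &corev1.Pod{}
	if err := yaml.Unmarshal(raw, pod); err != nil {
		return err
	}
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations[fallbackAnnotation] = fmt.Sprintf("%d", failedRevision)
	out, err := yaml.Marshal(pod)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(manifestDir, filepath.Base(source)), out, 0644)
}
```

The operator-facing effect is only the annotation on the mirror pod; everything else stays local to the node.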
### Difficulties & Questions

- **too early fallback**: we have checks like the etcd smoke test on operand start-up that fail for external reasons and might eventually (with enough restarts) resolve themselves. When we fall back to the old pod, we might forfeit this chance.
- **hitting the old revision's port**: we must make sure that we hit the correct pod, not the one of the previous revision. We cannot just query the expected port because the previous revision might still be listening on it. We could add some header to the `/readyz` endpoint response (identity), we could add cri-o awareness of some kind, or we could look up which process is actually behind the port via netstat and friends.
- **operator behaviour**: how should the operator behave when `startup-monitor.static-pods.openshift.io/fallback-for-revision: <revision>` is found? Should we retry? Should we just wait? In HA OpenShift we just wait until the next reason for bumping the revision.
- **different readiness logic**: etcd readiness is different from kube-apiserver readiness.

### User Stories

1. As a cluster admin I **don't want to brick my cluster because of an invalid input** (e.g. some field in a `config.openshift.io/v1` CR).
2. As a cluster admin I want to **get notified that a revision was not rolled out successfully**, e.g. by seeing the operator being degraded.
3. As a cluster admin I want **every good configuration to eventually be rolled out** and not be delayed much longer than a temporary failure condition (e.g. etcd being down) persists.

### Risks and Mitigations

## Design Details

### Open Questions [optional]

### Test Plan

The usual e2e tests will verify the happy case. For the error case, we have to inject the different types of errors into an operand, similarly to how it is done for the installer (https://github.com/openshift/cluster-kube-apiserver-operator/blob/022057ed14a0f2b1eb98f7b5ccb76d100de011d6/pkg/operator/starter.go#L363):

- [ ] test case to recover from a badly formatted or invalid pod manifest
- [ ] test case to recover from bad config observer output
- [ ] test case to recover from bad revisioned ConfigMaps and Secrets.

### Graduation Criteria

This will graduate directly to GA as 4.9 is the target release we have to hit.

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

On upgrade, if that even matters for SNO pre-GA, the `last-known-good` link won't exist on the first rollout. We will then fall back to the previous revision.

On downgrade, the startup-monitor manifest might still exist in `/etc/kubernetes`. We will make it wait until it sees an operand pod manifest of the same revision before doing anything (see proposal).

### Version Skew Strategy

Not relevant.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation History`.

## Drawbacks

## Alternatives

- we have talked about starting the new apiserver in parallel and having some proxy or iptables mechanism to switch over only when the new revision is ready. This was rejected because we don't have the memory resources on nodes to run two apiservers in parallel. Instead, it was decided that 60s of API downtime in the happy case (= configuration is not bad) is acceptable. Hence, we changed direction to only cover the disaster recovery case described in this enhancement.
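As a complement to user story 2 and the "operator behaviour" question above, one possible shape of the operator-side detection is sketched below. Only the annotation key is taken from this proposal; the condition name, the package, and the lookup wiring are assumptions.

```go
// Sketch of operator-side detection of a fallback via the mirror pod
// annotation. The condition name and the lookup wiring are illustrative;
// only the annotation key comes from this proposal.
package fallbackdetection

import (
	"context"

	operatorv1 "github.com/openshift/api/operator/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

const fallbackAnnotation = "startup-monitor.static-pods.openshift.io/fallback-for-revision"

// checkFallback looks at the operand's mirror pod on the single node and
// returns a degraded-style condition when the startup-monitor fell back to
// a previous revision, matching user story 2.
func checkFallback(ctx context.Context, pods corev1client.PodInterface, mirrorPodName string) (*operatorv1.OperatorCondition, error) {
	pod, err := pods.Get(ctx, mirrorPodName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	failedRevision, ok := pod.Annotations[fallbackAnnotation]
	if !ok {
		return nil, nil // no fallback happened
	}
	return &operatorv1.OperatorCondition{
		Type:    "StartupMonitorFallbackDegraded", // illustrative condition name
		Status:  operatorv1.ConditionTrue,
		Reason:  "RevisionFailedToStart",
		Message: "static pod fell back to the last-known-good revision because revision " + failedRevision + " did not become ready",
	}, nil
}
```

Whether the operator should then retry or just wait remains an open question, as noted in Difficulties & Questions.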