---
title: Failed-Operator-Upgrades
authors:
  - "@ankithom"
  - "@agreene"
reviewers:
  - TBD
  - "@dmesser"
  - "@bparees"
  - "@jlanford"
  - "@njhale"
approvers:
  - TBD
creation-date: yyyy-mm-dd
last-updated: yyyy-mm-dd
status: provisional
tags: enhancement proposal
see-also:
  - "https://docs.google.com/document/d/1EMGGdUJBMly8LEr5jNo2A8P6y67XVr4toYXue3EhSok/"
---

# failed-operator-upgrades

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA

## Open Questions

<!--
- Upgrades to other operators when there is a failing CSV/InstallPlan from a different package without a replacement in the cluster
- Upgrades skipping multiple failed versions
-->

## Summary

## Motivation

Cluster administrators rely on the [Operator Lifecycle Manager (OLM)](https://github.com/operator-framework/operator-lifecycle-manager) to install and manage the lifecycle of operators on a cluster. An operator installation or upgrade fails when:

- The InstallPlan enters the FAILED state.
- The ClusterServiceVersion (CSV) oscillates between the PENDING and FAILED states.

If either of the above criteria is met, the installation will never succeed. The cluster administrator must then:

1. Update the catalog to include a version of the operator that addresses the failure and includes a skip-range.
2. Manually delete all traces of the failed installation and resubscribe to the operator.

The manual steps described in step 2 become increasingly untenable as the number of clusters being administered reaches into hundreds or thousands of instances. The goal of this enhancement is to propose a means to opt into automatic recovery from failed installations without the need for cluster admin intervention.
### Goals

- Allow OLM to automatically recover from failed installations if a new operator is added to a CatalogSource that replaces the last successfully installed operator.

### Non-Goals

- Support operator rollbacks.
- Allow cluster admins to perform unsafe/insane upgrades.
- Retry failed installations for operators.
  - This is supported in a limited capacity today, as OLM won't immediately fail a CSV if it finds that some requirements haven't been met.

## Proposal

OLM allows cluster admins to subscribe to operator updates through the use of Subscriptions in a namespace. OLM determines what to install based on the set of Subscriptions in a namespace. The list of resources that must be created is then placed within an InstallPlan. Once an InstallPlan is created, OLM attempts to create or update each resource listed in it.

Operator installs/upgrades generally follow the workflow given below:

1. A user creates a Subscription for a specific operator.
2. OLM calculates the latest generation of operators that should be installed in the namespace and creates an InstallPlan listing each resource that must exist on the cluster.
3. If the InstallPlan requires manual approval, a user must "approve" it.
4. OLM creates/updates the manifests defined in the InstallPlan.
5. If an older CSV exists on the cluster, OLM places it in the "Replacing" phase.
6. Once all resources are successfully applied, OLM marks the new InstallPlan as succeeded.
7. Once the new CSV reaches the "Succeeded" phase, the old CSV is removed from the cluster.

The main reasons that an upgrade may fail include:

- **Invalid CSV:** The CSV may come in the form of a failing CSV or, in cases of unmet requirements, a forever-pending CSV.
- **Invalid InstallPlan:** Usually occurs because a resource fails to be created or updated. An InstallPlan may fail independently of its CSV: it may fail even with a successfully applied CSV, or with a missing CSV.

So, a failed workflow might look like:

1. A user creates a Subscription for a specific operator.
2. OLM calculates the latest generation of operators that should be installed in the namespace and creates an InstallPlan listing each resource that must exist on the cluster.
3. If the InstallPlan requires manual approval, a user must "approve" it.
4. OLM creates/updates the manifests defined in the InstallPlan.
5. If an older CSV exists on the cluster, OLM places it in the "Replacing" phase.
6. One of the following failures occurs:
   - OLM fails to create/update a resource defined in the InstallPlan and moves the InstallPlan to the "Failed" state.
   - OLM installs a CSV and it enters the "Failed" state, or remains in the "Pending" state forever.

In either failure scenario, a cluster admin must perform manual steps to recover, which becomes increasingly untenable as the number of managed clusters increases.

> Note: OLM does not perform multiple attempts to resolve InstallPlans, meaning that a failed install blocks upgrades for all operators in the namespace, as resolution is performed on a per-namespace basis.

This enhancement proposes allowing cluster admins to opt into "fail forward" upgrades, thereby enabling OLM to recover from failed installations when a new upgrade path is discovered.

### User Stories

- As a cluster admin, I want to allow all operators to "fail forward" in a specific namespace.
- As a cluster admin, if an InstallPlan is in the FAILED state, I want OLM to create the next InstallPlan.
- As a cluster admin, if a CSV is stuck in the PENDING state, I want to allow it to upgrade to the next version.
- As a cluster admin, if a CSV is stuck in the FAILED state, I want to allow it to upgrade to the next version.
- As a catalog curator / operator author, I want to choose which versions of an operator may "fail forward" to the next version so my customers don't miss critical upgrades.
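The failure conditions that make a namespace eligible for fail-forward recovery can be sketched as follows. This is an illustrative Python model only, not OLM's actual implementation; the phase strings and dictionary shapes mirror the states named in this proposal but are assumptions about the sketch, not OLM's real types.

```python
# Hypothetical sketch of the conditions that gate "fail forward".
# Phase names mirror the proposal's terminology; structures are illustrative.

def install_plan_failed(install_plan: dict) -> bool:
    """An InstallPlan that reached the Failed phase will never succeed."""
    return install_plan.get("status", {}).get("phase") == "Failed"

def csv_stuck(csv: dict) -> bool:
    """A CSV that is Failed, or stays Pending forever, will never succeed."""
    return csv.get("status", {}).get("phase") in ("Failed", "Pending")

def eligible_for_fail_forward(install_plan: dict, csvs: list) -> bool:
    """Fail forward applies when the InstallPlan failed or any CSV is stuck."""
    return install_plan_failed(install_plan) or any(csv_stuck(c) for c in csvs)
```

In a real cluster, "Pending" would also need to account for CSVs that are legitimately progressing; the sketch treats any Pending CSV as stuck for simplicity.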
### Implementation Details/Notes/Constraints [optional]

#### Opting into Fail Forward Upgrades

Since resolution is namespaced, the toggle for allowing forced upgrades should also be namespaced. This avoids burdening the resolver with understanding which dependent operators can or cannot use the forced upgrade toggle. With this in mind, this enhancement proposes adding the `failForwardUpgrades` toggle to an already watched namespaced resource, the OperatorGroup.

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: <operator-group-name-here>
  namespace: <namespace-goes-here>
spec:
  failForwardUpgrades: true
```

With the toggle enabled, OLM will allow operators to "fail forward" if such upgrades are supported by the operator. If the toggle is disabled, OLM exhibits its existing behavior.

#### Supporting Fail Forward Upgrades

This enhancement proposes the introduction of the `fail-forward-supported` [Bundle Property](https://olm.operatorframework.io/docs/reference/file-based-catalogs/#properties). When the resolver determines the set of operators to install, OLM can place constraints on a bundle that must be met in the next generation of installables. To allow installs past the failed install, a catalog curator/operator author would need to rebuild the operator's upgrade graph to include a CSV that:

- Replaces the version of the operator that failed to install.
- Includes the `fail-forward-supported` Bundle property.

A bundle with the `fail-forward-supported` property can be seen below:

```yaml
image: quay.io/foo/bar:baz
name: foo-operator.v1.0.1
package: foo
properties:
- type: olm.gvk
  value:
    group: foo.io
    kind: bar
    version: v1alpha1
- type: olm.package
  value:
    packageName: foo
    version: 1.0.1
- type: olm.failForward
  value:
    supported: true # This bundle supports "failing forward".
schema: olm.bundle
```

With this property present, OLM can then communicate to the resolver when a particular operator failed to install.
The resolver can then check if an upgrade exists for the operator that supports "failing forward". Let's consider each of the scenarios OLM should support:

- **InstallPlan Failure:** If the InstallPlan has failed, the set of operators sent to the resolver should include:
  - Each CSV that successfully installed.
  - Each CSV that was not installed (OLM will include the "fail forward" constraint).

> Note: It might make sense just to recreate the InstallPlan here if the OperatorGroup is configured to "fail forward", thoughts?

- **CSV Failure:** If a CSV in the namespace is in the FAILED state, the set of operators sent to the resolver should include:
  - Each CSV that successfully installed.
  - Each CSV that failed to install (OLM will include the "fail forward" constraint).
  - Each CSV in the "Pending" state (OLM will include the "fail forward" constraint).

### Risks and Mitigations

- The solution defined in this proposal places additional complexity into the maintenance of the operator's upgrade path. Operator authors may benefit from a GUI that acts as a source of truth for the supported upgrade paths of a package, which could specifically call out nodes that support fail-forward upgrades. **This GUI does not block the delivery of this feature, but would improve the user experience.**

## Design Details

### Test Plan

The following tests are based on the high-level workflow described below, in namespaces where `OperatorGroup.spec.failForwardUpgrades` is set to `true`:

1. `Operator v1` is installed.
2. `Operator v1` is upgraded to `operator v2`, which fails to install.
3. OLM supports "failing forward" to `operator v3`.

- OLM can successfully upgrade to `operator v3` if the `operator v2` CSV is stuck in a PENDING state.
- OLM can successfully upgrade to `operator v3` if the `operator v2` CSV is stuck in a FAILED state.
- OLM can successfully upgrade to `operator v3` if the InstallPlan is stuck in a FAILED state.
- Existing behavior is observed when a user has not opted into the failForward feature.
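The per-scenario resolver inputs described under Implementation Details can be sketched as follows. This is an illustrative Python model under assumed phase names and an assumed constraint identifier, not the resolver's real API:

```python
# Illustrative sketch: building the set of operators sent to the resolver,
# attaching a hypothetical "fail forward" constraint to stuck entries.
# Phase strings and the constraint name are assumptions, not OLM's real API.

FAIL_FORWARD = "olm.failForward"

def resolver_input(csvs: list) -> list:
    """Return (csv_name, constraints) pairs to send to the resolver."""
    entries = []
    for csv in csvs:
        phase = csv.get("status", {}).get("phase")
        if phase == "Succeeded":
            # Installed fine: the CSV is sent with no extra constraint.
            entries.append((csv["name"], []))
        elif phase in ("Failed", "Pending"):
            # Stuck: the next generation must replace this CSV with a
            # bundle carrying the fail-forward-supported property.
            entries.append((csv["name"], [FAIL_FORWARD]))
    return entries
```

The point of the sketch is the asymmetry: healthy CSVs constrain resolution as they do today, while failed or pending CSVs carry an additional constraint that only a fail-forward-supporting replacement can satisfy.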
### Graduation Criteria

### Version Skew Strategy

The proposed behavior is enabled by a boolean field present in the spec of the OperatorGroup resource. Cluster admins can choose to opt into this mode as needed. As long as the operator-author-provided upgrade graphs are the same, there should be no issues with turning the toggle off and on in a cluster regardless of its health.

## Alternatives

### Alternative 1: Determine installables based on upgrade path

In this alternative approach, failed operator installations would be addressed by rebuilding the operator's upgrade path using the existing `skips` and `skipRange` features and distributing the new catalog. Consider the following high-level workflow:

1. `Operator v1` is installed.
2. `Operator v1` is upgraded to `operator v2`, which fails to install.
3. OLM supports "failing forward" to `operator v3` by **providing the resolver with a set of operators installed in the namespace**. `Operator v3` is identified as a valid upgrade and is installed successfully.

The workflow for catalog curators/operator authors is determined by the version of the operator sent to the resolver in step 3. In the scenario above, the operator version could be:

- The operator that initiated the upgrade, `operator v1`.
- The operator that failed to install, `operator v2`.

Let's consider what the user experience looks like depending on which version is sent to the resolver.

#### Alternative 1a: Determine installables based on the operator version that initiated the upgrade

By providing the resolver with `operator v1`, OLM would recalculate the list of installables using the set of operators that initiated the failed upgrade. OLM would only create a new InstallPlan once the catalog was updated to include a new upgrade path for `operator v1`.

##### Pros

- The catalog curator/operator author must be aware of the failed installation, allowing them to address the issue directly by providing a new upgrade path.
- Involving catalog curators/operator authors helps prevent unsafe/insane upgrades.
- Upgrade graph maintenance relies on tools/concepts embraced today.

##### Cons

- The catalog curator/operator author must manually address each failed installation.
- Cluster admins cannot roll forward until a new catalog is published.
- Deviates significantly from OLM's upgrade behavior today. By the time `operator v2` has failed, most resources have been updated to include `operator v2` as an owner. This impacts OLM's upgrade logic in multiple places and introduces additional complexity to the solution.

#### Alternative 1b: Determine installables based on the operator version that failed to install

By providing the resolver with `operator v2`, OLM would attempt to immediately upgrade to the next version of the operator in the case of a failed installation. The catalog curator/operator author would not need to make any changes to the upgrade path of the operator.

##### Pros

- Catalog curators/operator authors do not need to recreate upgrade paths if they release an operator version that can't be installed.
- Cluster admins can "fail forward" without consulting the catalog curator/operator author.
- Doesn't deviate from OLM's existing upgrade logic today and is therefore relatively simple to implement.

##### Cons

- The solution does not guarantee any involvement from the catalog curator/operator author. Many customers may hit the issue before the team is notified.
- Without involving the catalog curator/operator author, there is no guarantee that the upgrade is safe or sane. The operator version that failed to install might have been required due to a migration or some other critical reason.

### Alternative 2: Address with RukPak APIs

The RukPak APIs could provide this same toggle with fewer potential side effects, but would require more effort to implement.
However, while the InstallPlan APIs will eventually be deprecated in favor of RukPak, it would potentially take another release cycle or more to land those changes.
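For reference, the `skipRange` mechanism that Alternative 1 relies on is expressed as an annotation on the replacing CSV. A minimal sketch, with a hypothetical package name and version range, might look like:

```yaml
# Hypothetical CSV fragment: v1.0.2 replaces v1.0.0 directly and skips the
# broken v1.0.1 via the olm.skipRange annotation (semver range syntax).
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: foo-operator.v1.0.2
  annotations:
    olm.skipRange: '>=1.0.0 <1.0.2'
spec:
  replaces: foo-operator.v1.0.0
```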