<!--
**Note:** When your KEP is complete, all of these comment blocks should be removed.
To get started with this template:
- [ ] **Pick a hosting SIG.**
Make sure that the problem space is something the SIG is interested in taking
up. KEPs should not be checked in without a sponsoring SIG.
- [ ] **Create an issue in kubernetes/enhancements**
When filing an enhancement tracking issue, please make sure to complete all
fields in that template. One of the fields asks for a link to the KEP. You
can leave that blank until this KEP is filed, and then go back to the
enhancement and add the link.
- [ ] **Make a copy of this template directory.**
Copy this template into the owning SIG's directory and name it
`NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
leading-zero padding) assigned to your enhancement above.
- [ ] **Fill out as much of the kep.yaml file as you can.**
At minimum, you should fill in the "Title", "Authors", "Owning-sig",
"Status", and date-related fields.
- [ ] **Fill out this file as best you can.**
At minimum, you should fill in the "Summary" and "Motivation" sections.
These should be easy if you've preflighted the idea of the KEP with the
appropriate SIG(s).
- [ ] **Create a PR for this KEP.**
Assign it to people in the SIG who are sponsoring this process.
- [ ] **Merge early and iterate.**
Avoid getting hung up on specific details and instead aim to get the goals of
the KEP clarified and merged quickly. The best way to do this is to just
start with the high-level sections and fill out details incrementally in
subsequent PRs.
Just because a KEP is merged does not mean it is complete or approved. Any KEP
marked as `provisional` is a working document and subject to change. You can
denote sections that are under active debate as follows:
```
<<[UNRESOLVED optional short context or usernames ]>>
Stuff that is being argued.
<<[/UNRESOLVED]>>
```
When editing KEPs, aim for tightly-scoped, single-topic PRs to keep discussions
focused. If you disagree with what is already in a document, open a new PR
with suggested changes.
One KEP corresponds to one "feature" or "enhancement" for its whole lifecycle.
You do not need a new KEP to move from beta to GA, for example. If
new details emerge that belong in the KEP, edit the KEP. Once a feature has become
"implemented", major changes should get new KEPs.
The canonical place for the latest set of instructions (and the likely source
of this file) is [here](/keps/NNNN-kep-template/README.md).
**Note:** Any PRs to move a KEP to `implementable`, or significant changes once
it is marked `implementable`, must be approved by each of the KEP approvers.
If none of those approvers are still appropriate, then changes to that list
should be approved by the remaining approvers and/or the owning SIG (or
SIG Architecture for cross-cutting KEPs).
-->
# Automatic Upgrades For Failed Operator Installations
<!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
title can help communicate what the KEP is and should be considered as part of
any review.
-->
<!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.
Ensure the TOC is wrapped with
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->
<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories (Optional)](#user-stories-optional)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->
## Release Signoff Checklist
<!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.
For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->
Items marked with (R) are required *prior to targeting to a milestone / release*.
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
<!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website
## Summary
<!--
This section is incredibly important for producing high-quality, user-focused
documentation such as release notes or a development roadmap. It should be
possible to collect this information before implementation begins, in order to
avoid requiring implementors to split their attention between writing release
notes and implementing the feature itself. KEP editors and SIG Docs
should help to ensure that the tone and content of the `Summary` section is
useful for a wide audience.
A good summary is probably at least a paragraph in length.
Both in this section and below, follow the guidelines of the [documentation
style guide]. In particular, wrap lines to a reasonable length, to make it
easier for reviewers to cite specific portions, and to minimize diff churn on
updates.
[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->
This proposal describes modifications to the Operator Lifecycle Manager (OLM) that enable users to opt in to "fail forward" operator upgrades; that is, automatically upgrading an operator even when its current installation has failed.
## Motivation
<!--
This section is for explicitly listing the motivation, goals, and non-goals of
this KEP. Describe why the change is important and the benefits to users. The
motivation section can optionally provide links to [experience reports] to
demonstrate the interest in a KEP within the wider Kubernetes community.
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->
Today, when an operator's installation fails due to a defect, users must manually reconcile Kubernetes resources -- e.g. delete the operator -- before it can be upgraded to a working version. Two of the most common defects are:
1. An issue with an operator's manifests -- e.g. a typo -- prevents resources from being applied to a cluster
2. An operator crashes due to a bug
In both of these scenarios, OLM refuses to apply further upgrades even when there's potential to automatically resolve the issue.
### Goals
<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->
- Provide a mechanism to enable upgrades from failed operator installations without manual intervention
- Once automatic upgrades of a failed operator are enabled, no further intervention by the user is necessary to ensure available upgrades are applied
### Non-Goals
<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->
- Automatically upgrade from failed installations without user opt-in
- Provide typical upgrade safety guarantees; e.g. no data loss, CustomResourceDefinition (CRD) API schema compatibility, etc.
## Proposal
<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?
The "Design Details" section below is for the real
nitty-gritty.
-->
Introduce a field on the `Subscription` API that lets users opt in to "fail forward" behavior. When the field is enabled and an upgrade is available for a package, OLM will **always** resolve the upgrade and perform a potentially disruptive/destructive installation of the new version.
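A minimal sketch of what this opt-in could look like on a `Subscription`; the `failForward` field name and shape are illustrative assumptions, not a settled API:
```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: my-sub
  namespace: my-ns
spec:
  name: pkg-a
  channel: stable          # placeholder channel
  source: my-cs
  sourceNamespace: my-ns
  # Hypothetical opt-in field; the final name and shape are design details to be settled.
  failForward: true
```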
**Alternative:** Put a flag on the `OLMConfig` resource instead of on a `Subscription`. This flag could either make "fail forward" behavior global -- i.e. apply to all operator installations -- or accept a list of `Subscriptions` or `Namespaces` to apply it to.
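For comparison, a sketch of the `OLMConfig` alternative; the `failForward` feature block is hypothetical and only illustrates how a global or scoped opt-in might be expressed:
```yaml
apiVersion: operators.coreos.com/v1
kind: OLMConfig
metadata:
  name: cluster
spec:
  features:
    # Hypothetical feature block: enable fail forward globally,
    # or restrict it to the listed namespaces.
    failForward:
      enabled: true
      namespaces:
        - my-ns
```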
**Edge Cases:**
- Target version has an incompatible CRD schema
- More than one package (operator) installed to a namespace
- Existing version contains APIServices/Webhooks
**Open Questions:**
- Do transient errors constitute "install failure"; e.g. when network issues that blocked an installation are resolved, will installation eventually succeed?
TODO find home:
- Questions:
- What's the UX around failed rollouts?
- What's required in order to enable more control over failed rollouts, e.g. more advanced health checks/probes/etc.
- Can you configure the failed-rollout behavior, or is this a one-off for SD going forward? Does this design need to account for future asks around supporting extended deployment strategies (canary, b/g, etc.)?
- https://issues.redhat.com/browse/OLM-2311
- Do we need to support purging resources and/or provide trivial self healing mechanisms?
- If so: what happens when a vital namespace-scoped resource (e.g. a ServiceAccount) is deleted during a failed rollout and the result is a degraded operator installation?
- What kind of granularity do we need to expose?
- What are the downstream requirements for introducing new APIs that are opt-in?
**Implementation paths:**
There are two independent paths to implement this feature:
1. Extend the existing `InstallPlan` controllers
2. Introduce APIs and behaviors from `rukpak`
Details (rukpak):
- Introduce a new InstallPlan-esque API that manages Bundle/BundleInstance resources (a purely illustrative sketch of these resources appears after this list)
- Replace InstallPlans with Bundles/BundleInstances
- Provisioner looks at Subscriptions and OperatorGroups
- creates cluster-scoped Bundles/BundleInstances with manifests scoped to the OperatorGroup
- Provisioner is responsible for making pivoting decisions
- Extending the existing InstallPlan-related code to support the new APIs wouldn't be trivial, and the existing catalog-operator codebase is already due for a refactoring
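A purely illustrative sketch of the rukpak-style resources such a provisioner might create; the group, kinds, and fields below are assumptions based on early rukpak designs and are not a settled schema:
```yaml
apiVersion: core.rukpak.io/v1alpha1   # assumed group/version
kind: Bundle
metadata:
  name: pkg-a-bundle-vN
spec:
  provisionerClassName: core-rukpak-io-plain   # assumed provisioner name
  source:
    type: image
    image:
      ref: example.com/bundles/pkg-a:vN        # placeholder bundle image
---
apiVersion: core.rukpak.io/v1alpha1            # assumed group/version
kind: BundleInstance
metadata:
  name: pkg-a
spec:
  provisionerClassName: core-rukpak-io-plain
  bundleName: pkg-a-bundle-vN                  # assumed reference to the Bundle above
```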

Details (extend existing):
- Extend the current InstallPlan API so that, when the opt-in mechanism has been enabled, only the new APIs are created in the list of steps; implement a provisioner responsible for this use case
- Provide a mechanism for exposing scope to the BundleInstance API, e.g. an annotation that contains the desired operator installation namespace
- What existing OLM features would need to be present in an implementation centered around the new APIs?
- Safely rolling out CRD changes
### User Stories (Optional)
<!--
Detail the things that people will be able to do if this KEP is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.
-->
#### Story 1
#### Story 2
### Notes/Constraints/Caveats (Optional)
<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->
#### Happy Path Install
The "Happy Path Install" refers to the scenario where an operator is installed without any issues occuring.
Consider a namespace, `my-ns`, containing the following resources:
1. A `CatalogSource`, `my-cs`
- points to a valid catalog image
- contains a linear bundle upgrade graph for package `pkg-a`; i.e. `bundle-vN` replaces `bundle-v(N-1)` replaces `bundle-v(N-2)` replaces `...`
- all bundles in `pkg-a` support the same install modes
- all bundles in `pkg-a` provide the same CRD `foos.my.com`
2. An `OperatorGroup`, `my-og`
- configured to any one of the install modes supported by bundles in `pkg-a`
Assuming a clean cluster -- i.e. no other namespaces, CRDs, APIServices, what-have-you -- with OLM installed, the latest bundle in `pkg-a`, `bundle-vN`, can be installed by creating a `Subscription`, `my-sub` (example manifests for this setup follow this list):
- in `my-ns`
- with a spec requesting `pkg-a` from `my-cs`
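The setup above, expressed as abbreviated manifests; image references and channel names are placeholders:
```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: my-cs
  namespace: my-ns
spec:
  sourceType: grpc
  image: example.com/catalogs/my-catalog:latest   # placeholder catalog image
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: my-og
  namespace: my-ns
spec:
  targetNamespaces:
    - my-ns          # any install mode supported by the bundles in pkg-a works
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: my-sub
  namespace: my-ns
spec:
  name: pkg-a
  channel: stable    # placeholder channel
  source: my-cs
  sourceNamespace: my-ns
```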
When `my-sub` is created, OLM will perform the following sequence (roughly):
1. Gather all `Subscriptions` and `CSVs` in `my-ns`
- in this case, only `Subscription` `my-sub` is found
2. Gather all available bundles in `my-cs`
3. Massage the output of 1. and 2. into a filtered set of candidate bundles to install
4. Feed the output of 3. to a SAT-Solver, which determines _the_ set of bundles that satisfies all given constraints (if such a set exists)
- here, it resolves the latest bundle `bundle-vN`
5. Create an `InstallPlan` in `my-ns` containing instructions to apply new `Subscriptions` for, and manifests from, the yet-to-be-installed bundles resolved in 4. (an abbreviated example `InstallPlan` appears after this list)
- in this case the `Subscription` for `bundle-vN` in `pkg-a` already exists (`my-sub`), so only instructions to install manifests for `bundle-vN` are added
- this step also generates and adds RBAC resources to the `InstallPlan` based on the permissions fields of the CSVs in each bundle
6. Execute the InstallPlan created in 5.
- all `Subscriptions` and namespace-scoped bundle manifests are automatically namespaced to `my-ns`
- bundle resources are applied sequentially by a partial-order prioritizing CRDs, then CSVs, then all other kinds (TODO: verify this)
- this means CRD `foos.my.com` is created _before_ CSV `csv-vN`
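An abbreviated, illustrative `InstallPlan` for this scenario (most step detail omitted):
```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
metadata:
  name: install-example    # generated name in practice
  namespace: my-ns
spec:
  approval: Automatic
  approved: true
  clusterServiceVersionNames:
    - csv-vN
status:
  phase: Complete
  plan:
    - resolving: csv-vN
      resource:
        group: apiextensions.k8s.io
        kind: CustomResourceDefinition
        name: foos.my.com
      status: Created
    - resolving: csv-vN
      resource:
        group: operators.coreos.com
        kind: ClusterServiceVersion
        name: csv-vN
      status: Created
```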
While this is happening, a separate controller in OLM is concurrently performing an independent sequence:
1. Gather CSVs and `OperatorGroups` in `my-ns`
- only `csv-vN` and `my-og` are found
2. Determine if any API ownership conflicts exist between CSVs in `my-ns`
- an "API ownership conflict" exists when more than one unrelated CSV provides the same CRD, APIService, and/or Webhook
- CSVs are related if there exists a linear upgrade path between them; i.e. they are connected by direct or transitive "replaces" fields
- for `csv-vN` in this case, no conflict exists
3. Determine if any API ownership conflicts exist between CSVs in intersecting `OperatorGroups`
- `OperatorGroups` intersect when the sets of APIs they select overlap
- here, there are no other `OperatorGroups` on the cluster, so there's no conflict
4. Generate and apply `Deployments` from CSVs in `my-ns` (an excerpt of such a CSV follows)
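An excerpt of a CSV with the pieces this controller acts on -- the owned CRD, the deployment template, and the `replaces` link to the previous version; names and images are placeholders:
```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: csv-vN
  namespace: my-ns
spec:
  replaces: csv-vN-1                 # previous CSV in the upgrade graph
  customresourcedefinitions:
    owned:
      - name: foos.my.com
        kind: Foo
        version: v1
  install:
    strategy: deployment
    spec:
      deployments:
        - name: pkg-a-controller     # placeholder deployment name
          spec:
            replicas: 1
            selector:
              matchLabels:
                app: pkg-a
            template:
              metadata:
                labels:
                  app: pkg-a
              spec:
                containers:
                  - name: manager
                    image: example.com/operators/pkg-a:vN   # placeholder operator image
```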
### Risks and Mitigations
<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.
How will security be reviewed, and by whom?
How will UX be reviewed, and by whom?
Consider including folks who also work outside the SIG or subproject.
-->
#### Data Loss
## Design Details
<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
### Test Plan
<!--
**Note:** *Not required until targeted at a release.*
Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation, and anything particularly
challenging to test, should be called out.
All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->
### Graduation Criteria
<!--
**Note:** *Not required until targeted at a release.*
Define graduation milestones.
These may be defined in terms of API maturity, or as something else. The KEP
should keep this high-level with a focus on what signals will be looked at to
determine graduation.
Consider the following in developing the graduation criteria for this enhancement:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Deprecation policy][deprecation-policy]
Clearly define what graduation means by either linking to the [API doc
definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning)
or by redefining what graduation means.
In general we try to use the same stages (alpha, beta, GA), regardless of how the
functionality is accessed.
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].
#### Alpha
- Feature implemented behind a feature flag
- Initial e2e tests completed and enabled
#### Beta
- Gather feedback from developers and surveys
- Complete features A, B, C
- Additional tests are in Testgrid and linked in KEP
#### GA
- N examples of real-world usage
- N installs
- More rigorous forms of testing—e.g., downgrade tests and scalability tests
- Allowing time for feedback
**Note:** Generally we also wait at least two releases between beta and
GA/stable, because there's no opportunity for user feedback, or even bug reports,
in back-to-back releases.
**For non-optional features moving to GA, the graduation criteria must include
[conformance tests].**
[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
#### Deprecation
- Announce deprecation and support policy of the existing flag
- Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
- Address feedback on usage/changed behavior, provided on GitHub issues
- Deprecate the flag
-->
### Upgrade / Downgrade Strategy
<!--
If applicable, how will the component be upgraded and downgraded? Make sure
this is in the test plan.
Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to maintain previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to make use of the enhancement?
-->
### Version Skew Strategy
<!--
If applicable, how will the component handle version skew with other
components? What are the guarantees? Make sure this is in the test plan.
Consider the following in developing a version skew strategy for this
enhancement:
- Does this enhancement involve coordinating behavior in the control plane and
in the kubelet? How does an n-2 kubelet without this feature available behave
when this feature is used?
- Will any other components on the node change? For example, changes to CSI,
CRI or CNI may require updating that component before the kubelet.
-->
## Production Readiness Review Questionnaire
<!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have an approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->
### Feature Enablement and Rollback
<!--
This section must be completed when targeting alpha to a release.
-->
###### How can this feature be enabled / disabled in a live cluster?
<!--
Pick one of these and delete the rest.
-->
- [ ] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name:
- Components depending on the feature gate:
- [ ] Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control
plane?
- Will enabling / disabling the feature require downtime or reprovisioning
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
###### Does enabling the feature change any default behavior?
<!--
Any change of default behavior may be surprising to users or break existing
automations, so be extremely careful here.
-->
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
<!--
Describe the consequences on existing workloads (e.g., if this is a runtime
feature, can it break the existing applications?).
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
###### What happens if we reenable the feature if it was previously rolled back?
###### Are there any tests for feature enablement/disablement?
<!--
The e2e framework does not currently support enabling or disabling feature
gates. However, unit tests in each component dealing with managing data, created
with and without the feature, are necessary. At the very least, think about
conversion tests if API types are being modified.
-->
### Rollout, Upgrade and Rollback Planning
<!--
This section must be completed when targeting beta to a release.
-->
###### How can a rollout or rollback fail? Can it impact already running workloads?
<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
###### What specific metrics should inform a rollback?
<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
<!--
Even if applying deprecation policies, they may still surprise some users.
-->
### Monitoring Requirements
<!--
This section must be completed when targeting beta to a release.
-->
###### How can an operator determine if the feature is in use by workloads?
<!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
###### How can someone using this feature know that it is working for their instance?
<!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
Pick one more of these and delete the rest.
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->
- [ ] Events
- Event Reason:
- [ ] API .status
- Condition name:
- Other field:
- [ ] Other (treat as last resort)
- Details:
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
<!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.
It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors <= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99.9% of /health requests per day finish with 200 code
These goals will help you determine what you need to measure (SLIs) in the next
question.
-->
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
<!--
Pick one more of these and delete the rest.
-->
- [ ] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- [ ] Other (treat as last resort)
- Details:
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
### Dependencies
<!--
This section must be completed when targeting beta to a release.
-->
###### Does this feature depend on any specific services running in the cluster?
<!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->
### Scalability
<!--
For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
###### Will enabling / using this feature result in any new API calls?
<!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
(e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->
###### Will enabling / using this feature result in introducing new API types?
<!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->
###### Will enabling / using this feature result in any new calls to the cloud provider?
<!--
Describe them, providing:
- Which API(s):
- Estimated increase:
-->
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
<!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
<!--
Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
<!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.
Think through this both in small and large cases, again with respect to the
[supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->
### Troubleshooting
<!--
This section must be completed when targeting beta to a release.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
###### How does this feature react if the API server and/or etcd is unavailable?
###### What are other known failure modes?
<!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
###### What steps should be taken if SLOs are not being met to determine the problem?
## Implementation History
<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
## Drawbacks
<!--
Why should this KEP _not_ be implemented?
-->
## Alternatives
<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
## Infrastructure Needed (Optional)
<!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
-->