# OLMv1 Upgrade User Experience
Operator management is non-trivial, for various reasons:
* Operators are singletons on the cluster.
* Operators may have dependencies.
* Operator dependencies complicate creation, deletion and updates.
* Operators have multiple versions available.
* Operators manage custom resources defined by CRDs.
* CRDs are a cluster-wide resource.
Summary of Design Decisions for the operator-controller:
* Dependencies are reported, but not automatically installed.
* Operator CRs are not created by the operator-controller, nor are their specs updated by it; the operator-controller only reconciles existing Operator CRs and reports status.
* Users have explicit control over the version(s) to be installed.
* Upgrades are handled by a separate process that updates Operator CRs.
* The upgrade process may be on-cluster or off-cluster, automatic or manual.
* Automatic upgrades are opt-in, and the user has control.
## Goals
The goal of OLMv1 is to ease the installation of operators (as described above). This includes, but is not limited to:
* Ease of installation of both OLMv1 and the operators themselves. This means minimal configuration is necessary in order to install an operator.
* Managing exactly what versions of operators are on the cluster. This is done by a process separate from the operator-controller.
* Provide safety nets for operator installation. This is accomplished by pre-checks and ordered operations to ensure upgrades and installations can be performed.
* Provide stability for installation. This is accomplished by Policy dictating how (e.g. when) an operator is installed or upgraded.
The user experience for installation is broken down into two parts, each handled by a separate component.
```mermaid
flowchart LR
G[[generate]]-->A
A[Operator CR]-->B
C(<i>Current\nVersion</i>)-.->R
op[[operator-controller]]-->R(<b>Path</b>)-.->B(<b>Destination</b>\nVersion)
```
## Destination
The destination, which is the desired version of an operator, is specified via the `.spec.packageVersion` field in the Operator CR. This is determined by the “generate” component, which may take several forms:
1. The `kubectl operator generate` command, run manually or via a script or via a cronjob.
1. The `kubectl operator create` command.
1. The “generate” component (i.e. service/process/sidecar/cronjob) running on-cluster.
1. The OpenShift console.
1. A user-created Operator CR.
The first four options all share a common code base/library/functionality that can optionally take cluster state, the current Operator configuration and user input to determine the Operator CRs to be created. Because this component is separate from the **operator-controller** component (below), it can be run in different locations based on the users’ needs.
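For illustration, a minimal sketch of an Operator CR expressing a destination (field names follow the example under Implementation Details; the exact spec field for the version is still under discussion, see Future/Enhancements):
```yaml
apiVersion: operators.operatorframework.io/v1alpha1
kind: Operator
metadata:
  name: prometheus
spec:
  packageName: prometheus
  version: 1.2.3   # the "destination" version
```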
## Path
The path to that destination is determined by the **operator-controller** component. It is responsible for installing and upgrading the operators based on the `.spec.packageVersion` field in the Operator CR. The **operator-controller**’s job is to get the specified operators from point (version) A to point (version) B. It may check for compatibility between operators, but it does not try to resolve it beyond determining the order of installation.
# Actions
Fundamentally, there are three actions for the operator-controller:
1. Installation of operators
1. Upgrading of operators
1. Deletion of operators
These are all about the “routing” that the operator-controller performs.
## Installing
Installing an operator is the simplest case, as the path can be minimized to just the destination.
## Upgrading
Upgrading an operator is the “normal” case, and should not be considered a special case of installing. Installing should be considered a special case of upgrading, where the path is minimized.
## Deleting
Deleting is a very special case, in that the path and destination are to nowhere. It is also imperative, not declarative. As the operator-controller knows what CRDs an operator provides, the operator-controller can refuse to delete or uninstall an in-use operator. The delete behavior needs to be well-defined, and there should be mechanisms to override this behavior (e.g. force-delete, force-uninstall).
# User Types
There is a spectrum of users, and at the two extremes are:
1. Those who want to know and control exactly what’s on their cluster.
1. Those who want to “set and forget”.
And those in-between.
GitOps would fall into category 1; the cluster must match the canonical git repository.
Casual users would fall into category 2. But that category can be split into two camps:
1. Those who want to assign a version and not change it.
1. Those who want automatic updates.
Both of these groups could be called “one-and-done”, in that they want to configure things only once.
In this case, types 1 and 2.1 are similar; they don’t want any automated upgrades. This is likely to be a majority of the cases, especially with large clusters. More casual users with smaller clusters might prefer automatic upgrades. So, this would be a smaller camp.
This document will focus on the manual (GitOps-friendly) case, but an opt-in automatic upgrade process will be available to those who want it.
# General Case
The simplest case for Operator installation is a user defining an Operator CR and applying it to the cluster. This may only work for the simplest operators with no dependencies (e.g. Prometheus).
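As a sketch, this flow might look like the following (CR fields per the example under Implementation Details):
```shell
# Define a minimal Operator CR and apply it to the cluster
cat <<EOF | kubectl apply -f -
apiVersion: operators.operatorframework.io/v1alpha1
kind: Operator
metadata:
  name: prometheus
spec:
  packageName: prometheus
  version: 2.41.0
EOF
```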
# GitOps
GitOps, in its simplest form, is managing a cluster configuration, or part of it (e.g. a namespace, or specific resources), via a git repository. It is no different from the General Case or using the kubectl operator CLI to define an Operator CR.
GitOps users do not want to go through an iterative process to determine dependencies or version requirements. It makes their jobs easier to get a (hopefully) valid answer upfront, based on their current configuration, submit that as a pull/merge request, and then propagate it through the various test and staging clusters before production.
# Handling Dependencies
Dependencies will be reported as part of the `.status.dependencies` field of the Operator CR, as determined by the operator-controller. This field will only represent the immediate dependencies. External tools may be used to build and display a tree of second-level and below dependencies; it will not be part of the operator status.
The configuration of the operator may select whether to install regardless of its dependencies, or wait until dependencies are satisfied. Automated/automatic installation of dependencies is not supported. The user will need to explicitly configure dependent Operator CRs. This goes in line with GitOps, in that what’s configured in the cluster matches the git repository. In other words, Operator custom resources don’t “create themselves”.
Operators that are dependencies also know what is dependent on them. This makes it easier to recognize which operators are leaves, and could possibly be removed. Of course, many operators will fall into the leaf category. Administrators will have to be cognizant of any operator use (by cluster users) before removal.
This status will be reported as the `.status.usedBy` field of the Operator CR.
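As a sketch, assuming the status layout shown under Implementation Details, these fields could be inspected directly:
```shell
# List the immediate dependencies of the prometheus operator
kubectl get operator prometheus -o jsonpath='{.status.dependencies}'
# List the operators that depend on it; empty output suggests a leaf
kubectl get operator prometheus -o jsonpath='{.status.usedBy}'
```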
# Available Upgrades
Periodically (frequency TBD), the operator-controller will query via resolution for version updates to an operator. The `.status.versions.available` field will be updated with semantically new versions of the operator. A Kubernetes Event will be generated when this list changes.
Thus, available updates can be easily listed by just getting `.status.versions.available`.
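For example (a sketch assuming the status layout shown under Implementation Details):
```shell
# Show the available versions for each operator on the cluster
kubectl get operators -o custom-columns='NAME:.metadata.name,AVAILABLE:.status.versions.available'
```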
Automated installations of upgrades are not supported by default. The user will need to specify a version to upgrade to by updating the desired `.spec.packageVersion` in the Operator CR. However, an external process can automate updates to that field to provide automatic upgrades.
# Separation of Responsibilities
Because Operators are a cluster-wide resource, only the cluster administrator should be able to install them. Thus, there is no need for a separation of responsibilities between namespace tenants and the cluster administrator. This also eases GitOps procedures; once the configuration is applied, that’s it. No need for approvals on-cluster. The GitOps procedures generally use a pull/merge request approval mechanism, so there’s no need for an on-cluster approval mechanism.
# Upgrades
To help with upgrades, the generate component is used to determine what will be installed. From the users’ perspective, this is implemented by two mechanisms:
1. The Operator kubectl plug-in (`kubectl operator`), which will be extended to provide additional functionality. This functionality will help both GitOps and casual administrators.
1. An opt-in on-cluster "generate" component to perform automatic upgrades to installed Operators.
The kubectl plug-in will include deppy and provide the offline mechanism used to determine dependencies and to generate them. Generating the configuration offline allows for greater control, review and flexibility. Being a command line tool, interested parties can incorporate it into their own scripts and automation.
The on-cluster "generate" process allows for automatic upgrades. The initial implementation could just be a wrapper around the kubectl plug-in. This will be opt-in as part of the installation of OLMv1. A tool such as `kustomize` or the Operator SDK will be needed to install the on-cluster "generate" component.
## Current Commands
Ignoring OLMv0 commands, the current commands are:
* `kubectl operator olmv1 install <name>`
* `kubectl operator olmv1 uninstall <name>`
These two commands will be renamed to `create` and `delete` respectively (without the `olmv1` keyword), as that is what they do: they create and delete Operator CRs from the cluster. These are the simplest operations. A `--version` option will be added to install a version other than the latest.
**Note**: The current commands are underneath the `olmv1` keyword. If the commands are renamed, they can be pulled out from under the `olmv1` keyword. This could make migration easier.
**Note**: From here on out, the `kubectl` command is aliased to `k`.
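A sketch of the renamed commands (option names as proposed above):
```shell
# Create an Operator CR for the latest version of prometheus
k operator create prometheus
# Create an Operator CR pinned to a specific version
k operator create prometheus --version 1.2.3
# Delete the Operator CR
k operator delete prometheus
```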
## Generate Command
```mermaid
flowchart LR
C[(<b>Cluster\nConfig</b>)] --> G[[generate]]
OO(<b>Current Config:</b>\nOperator CRs\nCatSrc) --> G
NV[<b>Desired Config:</b>\nmanifest\nCLI options] --> G
G --> NO(Output Operator CRs)
```
A new command, `generate`, will be defined to assist in creating configuration that can be applied to the cluster for a new operator, either directly or via GitOps.
```shell
# Save the output to 'prometheus.yaml' then apply it
k operator generate prometheus > prometheus.yaml
k apply -f prometheus.yaml
# Save the output into separate files based on kind and name
# e.g. for creating separate files for GitOps
k operator generate prometheus | yq -s '.kind + "-" + .metadata.name'
# Apply the configuration directly to the cluster
k operator generate prometheus | k apply -f -
```
The `generate` command will use deppy to calculate and determine dependencies. The output will be Operator CRs in YAML document format (i.e. separated by three dashes, `---`) that need to be created in order to install the specified operator (prometheus in these examples). Note that this does not take into account the current state of the cluster.
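A sketch of the expected output shape, with two YAML documents and a hypothetical dependency (CR fields per the example under Implementation Details):
```yaml
apiVersion: operators.operatorframework.io/v1alpha1
kind: Operator
metadata:
  name: prometheus
spec:
  packageName: prometheus
  version: 1.2.3
---
apiVersion: operators.operatorframework.io/v1alpha1
kind: Operator
metadata:
  name: prometheus-dependency   # hypothetical dependency of prometheus
spec:
  packageName: prometheus-dependency
  version: 4.5.6
```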
When supplied with the `-f` option, the input (either stdin, a file, or directory) is used to determine the current state and what needs to happen to get the operator to the specified version. This is used to get a static configuration which the `generate` command updates.
```shell
# Update the configuration based on a set of files
k kustomize <dir> | k operator generate prometheus=1.2.4 -f -
# Update the configuration based on the current state of the cluster
k get operators,catsrc -o=yaml | k operator generate prometheus=1.2.4 -f -
# Update the configuration based on a directory/file
k operator generate prometheus=1.2.4 -f <dir/file>
```
The output consists of all the Operator CRs, regardless of their update status. Because there are a number of ways software can be updated, there are multiple options to specify the version.
The desired version may be explicitly specified by adding an equals sign followed by the version value. The version value may be one of:
* A semantic version number (e.g. `1.2.3`)
* `latest` (also the default)
* `latest-z-stream`
* `latest-y-stream`
The `latest-*-stream` options allow the user to update to the most recent version of the given stream, as determined by deppy. If there is no input (i.e. no `-f` option), these values are equivalent to `latest`.
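For example:
```shell
# Pin to an exact version
k operator generate prometheus=1.2.3
# Latest version (equivalent to omitting the version)
k operator generate prometheus=latest
# Latest patch release within the currently installed minor version
k get operators,catsrc -o=yaml | k operator generate prometheus=latest-z-stream -f -
```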
To cleanly remove an operator (e.g. to assist with GitOps), the `--delete` option will delete the appropriate operator and dependencies from the output.
```shell
# Delete the redis operator
k get operators,catsrc -o=yaml | k operator generate --delete redis -f -
# Delete the redis operator, update prometheus
k get operators,catsrc -o=yaml | k operator generate --delete redis prometheus=1.2.3 -f -
# Look at the current GitOps code base, and delete redis
k operator generate --delete redis -f . | yq -s '.kind + "-" + .metadata.name'
```
The above examples would delete the **redis** operator and any of its dependencies that are no longer used. This could be combined with other operator-version options.
Multiple operators may be specified on the command line to resolve multiple operators simultaneously (i.e. in parallel). Certain operator/version combinations may not be possible; if that’s the case, an error will be reported (on stderr).
```shell
# Create a configuration for two operators, latest version
k operator generate prometheus redis
# Create a configuration for two operators, with specific versions
k operator generate prometheus=1.2.3 redis=4.5.6
# Update the configuration for two operators
k get operators -o=yaml | k operator generate prometheus=latest redis=latest-z-stream -f -
# Pipeline to update the configuration for a single operator
# This won't delete an operator, however
k get operators,catsrc -o=yaml | k operator generate prometheus=latest | kubectl apply -f -
```
### Inputs
This section summarizes the types of inputs and how to get them into the `generate` command.
**Current Configuration** (and state) is input via the `-f` option, which can be done via stdin, as a single file or as a directory. This input mechanism is similar to the `apply` command. The input can be JSON or YAML formatted.
The current configuration may be retrieved from the cluster via `kubectl get operators,catsrc -o=yaml`. This configuration is what determines how to get to the next version of an operator.
The current state includes the deppy version used to determine available versions; this is evaluated by the `generate` command to determine whether there are any possible reconciliation incompatibility issues.
**Desired Configuration** is represented by the command line inputs, indicating the desired change. This includes a list of operators to install and/or update. Given the potential for hundreds of operators, this can be specified as a manifest-like input file. This configuration is reconciled with the current configuration to determine the output.
```yaml
operators:
  install:
  - name: prometheus
    version: 2.0.4
```
The desired configuration may be presented in two ways:
* **Command line:** This method is additive. The list of operators on the command line is added to the current configuration (if present). These operators (along with their dependencies) are what is put into the output.
* **Manifest file:** The list of operators in the file (along with their dependencies) is what is put into the output. Anything else is removed. This is likely the option to be used by GitOps: the manifest file is checked into git and used to generate the YAML files. This manifest could also be a Kubernetes API.
The command line options are as follows (a combined example appears after this list):
* `-f` | `--filename`: the existing operator and catalog source files.
* `-d` | `--destination`: the manifest file for the desired configuration.
* `--delete`: remove an operator from the current set of input operators. There is no short version of this option.
* `-a` | `--add`: add the specified operator to the current set of input operators. The add option is not required; any operator listed on the command line is considered to be an add operation.
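A sketch combining these options:
```shell
# Read the current CRs, apply a desired-configuration manifest,
# and drop the redis operator from the result
k operator generate -f current-operators.yaml -d desired-manifest.yaml --delete redis
```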
**Cluster Configuration** may be necessary to determine compatible operator versions. This might include: cluster version, OpenShift version, number of nodes, etc. This could be retrieved directly from the cluster in question, or provided via a file. TBD.
### Outputs
This section summarizes the types of outputs and how to get them from the generate command. The default output consists of JSON (or YAML) Operator and CatalogSource CRs. To generate different types of output, the following options are available:
* `--output`: one of `yaml` or `json` (possibly a `table` format?). The default output format will match the input format; if the input is somehow mixed, then the output will be JSON.
* `--diff`: generates differential output (see Enhancements)
Errors and warnings will be output on stderr. Any error will result in exit code 2; if there are warnings (but no errors), generate will exit with code 1. Success will exit with code 0.
The generated output will be sent to stdout, unless an error occurs.
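A sketch of how a script might branch on these exit codes:
```shell
k operator generate prometheus=1.2.4 -f current.yaml > out.yaml
case $? in
  0) k apply -f out.yaml ;;                                 # success: apply
  1) echo "generated with warnings; review out.yaml" >&2 ;; # warnings only
  *) echo "generate failed; nothing applied" >&2 ;;         # errors
esac
```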
### Enhancements
1. The version option could have `>` or `>=` to specify a minimum version number to upgrade to.
1. An option to label or annotate those Operator CRs that are user-specified. This will make it easier to separate leaf operators into “explicitly configured” and “unused dependencies”.
1. `--output` options (`json`, `yaml`, etc.)
1. Diff output to show what would change. This might be in YAML or JSON format, or might be tabular.
```
# Example --diff -o=json output
{"name":"prometheus","version":"1.1.5","previous":"1.1.1","available":["1.1.2","1.1.3","1.1.4"]},
{"name":"redis","version":"2.0.1","previous":1.5.2","available":["1.5.3"]}
# Example --diff -o=table ouput
Name Version Previous Available
--------------------------------------------------
prometheus 1.1.5 1.1.1 1.1.2, 1.1.3, 1.1.4
redis 2.0.1 1.5.2 1.5.3
```
# Version Compatibility
## Version Traversal & Intermediate Versions
As is often the case, an operator cannot simply go from version 1.2.3 to 2.3.4. The operator’s [update graph](https://olm.operatorframework.io/docs/concepts/olm-architecture/operator-catalog/creating-an-update-graph/) defines the path that an operator version can take. The `generate` command is used to determine the destination. The **operator-controller** is responsible for determining the path to get there.
The `generate` command will only ever specify a valid version from the catalogs. If a version is removed from the catalogs, or a user specifies an unknown version, it might be impossible for the **operator-controller** to reach the desired `.spec.packageVersion`. In such a case, the **operator-controller** will give up, indicating that the destination is unreachable. Otherwise, if the destination version can be reached, the **operator-controller** will install versions, according to the update graph, until the desired version is reached. The time between installs will be a reasonable and configurable default (e.g. 60 minutes?) to allow the cluster to settle to the new operator version.
The `generate` command will follow the update graph to ensure that the destination can be reached. If the destination cannot be reached, it may generate an error message with suggested actions.
The operator-controller will attempt to install operator versions to get to the configured version.
## Operator Version Compatibility
It may happen that two operator-versions are incompatible with each other, or that two operators need to be installed/updated simultaneously to satisfy operator-version compatibility. In this case, the **operator-controller** will start installs/updates of both operators to get them to a compatible state.
**Example:**
> Two operators must be updated in lockstep; they must both be at the same version. In this case, assuming each operators’ update graph allows the update to the next version, both would be updated nearly simultaneously. However, if the packageVersions of those operators do not match (i.e. operator A = 1.2.3, operator B = 1.2.4), then **operator-controller** will be able to get both operators to version 1.2.3, but then be unable to get both operators to version 1.2.4, as only one of them has version 1.2.4. An error would be reported in the status of operator B.
Manual intervention may be necessary, and can be accomplished by changing the `.spec.packageVersion` of the operators with incompatible versions. However, **operator-controller** should be able to find an appropriate path, assuming there are no version incompatibility loops.
## Deppy Compatibility
Deppy is used as a library by the **operator-controller** and the `generate` command. When using the current state of the cluster as input (via `k get operators,catsrc -o=yaml`), there is a possibility that deppy versions may be incompatible. In this case a warning will be generated.
# Parallel vs. Serial
Because the configuration input into `kubectl operator generate` is static, an end-state should be achievable, either with a successful configuration, or an error (stderr output only). In effect, this is parallel reconciliation, but the problem is easy to solve because of the static input.
If a user wants to prioritize one operator over another, this can be done in a serial fashion. In other words, the user will update one operator version, then the other.
```shell
# parallel reconciliation
k get operators -o=yaml | k operator generate prometheus=latest redis=latest | k apply -f -
# serial reconciliation - always try to update prometheus before redis
k get operators -o=yaml | k operator generate prometheus=latest | k apply -f -
k get operators -o=yaml | k operator generate redis=latest | k apply -f -
```
In the first example, the generate command will attempt to update both operators simultaneously. If it can’t, then an error occurs.
In the second example, the generate command attempts to update the prometheus operator first, and if that succeeds, applies the result. Then the generate command attempts to update the redis operator. Depending on how the script is implemented (e.g. is `set -e` used?), if the first command fails, then the second command is not run. The second serial command is almost the same as the parallel command: its input says that both operators should be at the latest version, except that prometheus is already there.
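A sketch of the serial case as a fail-fast script:
```shell
#!/usr/bin/env bash
# Stop at the first failure: redis is not attempted if prometheus fails
set -eo pipefail
k() { kubectl "$@"; }   # the doc's `k` alias, as a function for scripts
k get operators -o=yaml | k operator generate prometheus=latest | k apply -f -
k get operators -o=yaml | k operator generate redis=latest | k apply -f -
```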
# Automatic Updates
For those users that want automatic updates, the `kubectl operator generate` command (or functional equivalent) can be run in a way that provides this functionality. Fundamentally, running the kubectl plugin periodically will provide the automatic update capability. This can be accomplished in a number of ways, but the most common methods would be:
1. Offline
1. Online
1. On-cluster
These are described below.
## Offline
A GitOps flow could have a periodic process that reads the canonical state of the cluster, runs the kubectl plugin to determine available updates, and then automatically creates a PR with those updates.
The configuration consumed by the script that runs the kubectl plugin would itself be the canonical operator configuration: the script would first be used to generate the initial configuration, which is then read back in to update the repository. A sample script/example will be made available; adapting it would be an exercise left to the user.
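A minimal sketch of such a script, assuming the GitHub CLI (`gh`) and a repository in which `operators.yaml` is the canonical configuration:
```shell
#!/usr/bin/env bash
set -eo pipefail
k() { kubectl "$@"; }   # the doc's `k` alias, as a function for scripts
git checkout -b operator-updates
# Re-generate the Operator CRs from the canonical configuration in git
k operator generate prometheus=latest -f operators.yaml > operators.yaml.new
mv operators.yaml.new operators.yaml
git commit -am "Update operator versions"
gh pr create --fill   # assumes the GitHub CLI is configured for this repo
```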
## Online
A periodic process (e.g. cronjob) runs external to the cluster. The process reads the Operator configuration of the cluster, runs the kubectl plugin to determine available updates, and automatically applies those updates to the cluster. A sample script/example will be made available. Implementation of the sample script would be an exercise left to the user.
## On-cluster
A CronJob or operator-controller sidecar runs periodically (or is triggered by operator-controller status updates/Events; implementation TBD) to read the Operator configuration of the cluster, run the generate process (e.g. the kubectl plugin or similar) to determine available updates, and then automatically apply those updates to the cluster. This method would be implemented by the OLM team as an opt-in piece of functionality. (GitOps users would certainly appreciate knowing that automatic updates can be explicitly removed.) Configuration could be provided via a ConfigMap, or similar resource. Parallel and serial reconciliation should be available options.
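A first PoC could be a CronJob wrapping the kubectl plugin; a sketch, where the image name, schedule, and service account are purely illustrative:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: operator-auto-update
spec:
  schedule: "0 2 * * *"   # nightly; illustrative only
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: operator-updater   # needs RBAC on Operator CRs
          restartPolicy: OnFailure
          containers:
          - name: generate
            image: example.com/kubectl-operator:latest   # hypothetical image with kubectl + plugin
            command: ["/bin/sh", "-c"]
            args:
            - kubectl get operators,catsrc -o=yaml |
              kubectl operator generate prometheus=latest -f - |
              kubectl apply -f -
```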
In all cases, configuration would need to be provided to the kubectl plugin. In the case of offline or online methods, the author of the script would include that in the script or via some other means.
## Failures
There are a number of points during an upgrade (or installation) that could fail. This is an incomplete list of failure points, with safety and recovery notes.
1. Resolution (e.g. `generate`) fails in some way. Something is missing, or dependencies cannot be determined, etc. In this case, no configuration update is performed. An on-cluster solution could generate a log and update a Condition on the Operator CR. Off-cluster solutions would have to provide their own logging or error messages.
1. Reconciliation by the installer (OLM) fails in some way. Something is missing, or dependencies cannot be determined, etc. In this case, no installation occurs. A Condition would be added to the Operator CR status.
1. Constraints: there could be a number of constraints/policies defined that preclude installation (e.g. time-of-day), and these may be transient. A Condition would be added to the Operator CR status. This is not necessarily a fatal error, and may simply delay the final installation.
1. Pre-conditions/checks fail. A set of checks performed by OLM before actual installation to maximize installation success (e.g. did the bundle download and successfully extract, are all operator pre-conditions met). A condition would be added to the Operator CR status.
1. Installation fails. In this case, the operator would have to be rolled back, if that's possible. Not much different than today. A manual intervention might be necessary (e.g. update the packageVersion in the Operator CR).
# Operator-controller
The operator-controller’s purpose has a smaller scope than that of OLMv0. The operator-controller is responsible for the Operator CR and for initiating installs, along with determining a safe order for installing those operators. It has a responsibility to ensure that configured operator versions are compatible with each other, but it does not need to resolve incompatibilities; it needs to report the problem with enough information. While the output from `kubectl operator generate` can be used to generate Operator CRs, it doesn’t have to be the source of the Operator CRs. Users are still able to create Operator CRs manually, and the operator-controller must be able to deal with any conflicts that this might generate.
# Management Console
The `kubectl operator generate` functionality should be able to be incorporated into the Management Console. Either the kubectl plug-in could be invoked indirectly, or the golang code could be incorporated directly into the console. There are a number of possible UIs that could be developed, but the console could:
1. List the current operators-versions and indicate if an upgrade was available.
1. Allow the user to select an upgrade and/or version, and see what operator-versions would be installed.
1. Allow the user to add a new operator, and see what operator-versions would be installed.
1. Allow the user to delete an operator, and see what operator-versions would be removed.
1. Enable/disable automatic updating and what that configuration should be.
# Key Takeaways
* No separation of responsibilities (authorization).
* Installation and Version checking are separated, simplifying each component.
* Cluster administrators are responsible for explicitly installing Operators.
* Upgrades and updates are done via explicit configuration.
* Operator-controller’s job is to attempt to reconcile the installation of Operators. If the configuration cannot be reconciled, then the Operator `.status` will indicate the issues.
* Operator configuration can be “pre-reconciled” via the kubectl plugin before deployment.
* Automatic upgrades can be done by running the manual “pre-reconciliation” process automatically.
# Implementation Details
## Operator CR
The Operator CR is created and owned by the Cluster admin. Because it’s the cluster admin, there’s no need for a separate approval.
What do we need to add to the Operator:
### Spec
* Desired package: the name of the operator
* Desired version: optional. When not specified, the behavior is to reconcile once; more details below. Otherwise, valid values are any semantic version parse range.
* Dependencies: how to handle dependency requirements, defaults to “require” (only install if dependencies are to be installed), could also be “ignore” (install regardless of dependencies). This is basically the “force install” option.
#### Version
The `version` field is a "first-order" constraint that has extra meaning beyond specifying reconciliation behavior. By "first-order", we mean that its value may override other constraints; specifically when it is blank (unspecified). The value may either be unspecified (blank/empty) or a [parse range](https://pkg.go.dev/github.com/blang/semver#ParseRange).
If blank, it informs the operator-controller to _successfully_ reconcile the Operator CR once, until its value is changed. Thus:
* When not specified at creation, the reconciliation occurs without respect to the version; "any" version may be applicable. Once successfully reconciled and installed, no other updates or installations occur unless the `version` value changes.
* When not specified (changed to empty) on a CR update:
  * If the CR had been successfully reconciled and an operator installed, no more reconciliations occur; the operator is now "fixed" at the installed version.
  * If the CR had _not_ been successfully reconciled, then the behavior is the same as "not specified at creation", above.
Examples of parse ranges:
* The equivalent of "any" or "latest" is `>=0.0.0`.
* Pin to version "1.2.3": `1.2.3`.
* Follow the z-stream (1.1.x): `>=1.1.0 <1.2.0`.
* Follow the y-stream (1.x): `>=1.0.0 <2.0.0`.
The semantic version is specified by [this regex](https://semver.org/#is-there-a-suggested-regular-expression-regex-to-check-a-semver-string), which will need to be expanded to support parse ranges.
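For example, a z-stream follower expressed as an Operator CR (a sketch using the fields from the example below):
```yaml
apiVersion: operators.operatorframework.io/v1alpha1
kind: Operator
metadata:
  name: prometheus
spec:
  packageName: prometheus
  version: ">=1.1.0 <1.2.0"   # follow the 1.1.x z-stream
```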
### Status
* Conditions: indicates the installation status of this operator, including if the operator-controller is in the process of installing.
* Installed version: currently installed version, if different from spec.packageVersion, then there’s some upgrading to do. The `.status.conditions` fields will indicate why the resource is not at the `.spec.packageVersion`.
* Available version(s): a list of available versions that could be installed. What is potentially installed next depends on the available upgrade paths to `.spec.packageVersion`. These are all semantically higher versions than the `.status.versions.installed` value (regardless of the `.spec.packageVersion`).
* List of other operator dependencies: this is a single-level list that can be used by external tools to build a full dependency tree.
* List of other operators that depend directly on this operator.
* Versions of deppy and operator-controller managing this resource: These values are of interest to the `generate` command to determine if there are any incompatibilities between deppy versions.
```yaml
apiVersion: operators.operatorframework.io/v1alpha1
kind: Operator
metadata:
  name: prometheus
spec:
  packageName: prometheus
  version: 2.41.0
  dependencies: require
status:
  conditions:
  - type: Ready
    status: "True"
    observedGeneration: 1
    reason: Installed
    message: "prometheus installed"
  - type: Installing
    status: "False"
    observedGeneration: 2
    reason: TimeRestricted
    message: "waiting for the weekend"
  dependencies:
  - name: alpha
    minimumVersion: 1.2.3
  usedBy:
  - name: omega
    version: 6.7.8
  versions:
    installed: 2.40.6
    available:
    - 2.42.0
    - 2.41.0
    - 2.40.7
  deppy: 1.2.3
  controller: 1.2.3
```
## Separated Policies
Separated policies simplify the Operator CRs, and means that they can be more easily updated without accidentally deleting Policy information. If Policy were defined in Operator, then that Policy would have to be replicated when an Operator is updated, otherwise a patch would need to be done. This might work for GitOps (where presumably the policy is checked in with the Operator CR), but makes non-GitOps updates more challenging.
## Policy CR
Policy CRs limit or extend what OLMv1 can do with an Operator. Separating Policy from the Operator allows rich Policy to be defined.
### Approaches
There are many dimensions when defining Policies. The first dimension is deciding how Policy is semantically defined:
1. **Open**. With no bound Policy, the Operator has complete freedom of installation. This provides the most ease of use, and Policies ***restrict*** what can be done (subtractive).
1. **Closed**. With no bound Policy, the Operator cannot be installed. This provides the most security (control) and Policies ***enable*** what can be done (additive, like RBAC).
Another dimension is deciding how Policy is syntactically defined:
1. **Well-defined fields** (can only define what is in the CR) limit what can be defined, but ease implementation and syntax checking.
1. **Generic fields** allow for great flexibility and minimizes CRD churn. It also means that the generic fields need to be interpreted, and a status for the Policy needs to indicate whether it was parsed correctly (or simply disallow it during creation/update).
Regardless of the approach, the Policies may translate to Constraints within the operator-controller, rukpak and deppy.
Example with well-defined fields:
```yaml
apiVersion: operators.operatorframework.io/v1alpha1
kind: Policy
metadata:
  name: install-on-weekends
spec:
  schedule:
  - start: "01 00 * * 6"
    end: "59 23 * * 0"
```
For generic fields, I’m going to suggest OPA (Open Policy Agent).
```yaml
apiVersion: operators.operatorframework.io/v1alpha1
kind: Policy
metadata:
  name: install-on-weekends
spec:
  opaPolicy: 'default allow := false; allow if { weekend := {"Sunday", "Saturday"}; weekend[time.weekday(time.now_ns())] }'
```
## PolicyBinding CR
A CR that is used to link a Policy CR to an Operator CR. Very similar to RBAC’s RoleBinding or ClusterRoleBinding. Permits many-to-many associations of Policy and Operator CRs.
**QUESTION**: Do we allow wildcard Operator names in the links?
**QUESTION**: Keep the subject/policyRef linkage simple or similar to RBAC?
It’s possible to have “broken” bindings (where one of the subjects and/or policyRefs is not present). This nullifies the Binding, such that it has no effect.
```yaml
apiVersion: operators.operatorframework.io/v1alpha1
kind: PolicyBinding
metadata:
  name: default-install-time
spec:
  subjects:
  - apiGroup: operators.operatorframework.io
    kind: Operator
    name: "*"
  policyRef:
  - apiGroup: operators.operatorframework.io
    kind: Policy
    name: install-on-weekends
```
# Catalog Source Changes
There might need to be an annotation of the operator-version image's SHA hash to determine if it had changed between configuration reconciliation and installation reconciliation. This could prevent issues if catalog references change.
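A sketch of what such an annotation might look like, with a purely hypothetical key and placeholder digest:
```yaml
metadata:
  annotations:
    # Hypothetical: records the bundle image digest at resolution time
    operators.operatorframework.io/resolved-image-digest: sha256:<digest>
```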
# Incidents
This section is here to determine if this UX would prevent (or decrease the severity of) an incident.
## Staging Catalog Copied Erroneously to Production
Automatic upgrades are not enabled by default. Upgrades are checked via a separate process (offline or on-cluster), and if the checks in that process fail, then the Operator CR is not updated; consequently, upgrades could not happen.
Installation of operators should not start to occur until all bundles are available and validated.
# Future/Enhancements
1. Supporting [ApplySets](https://github.com/kubernetes/enhancements/tree/master/keps/sig-cli/3659-kubectl-apply-prune) in kubernetes.
1. `spec.packageVersion` vs. `spec.packageVersionRange`, and/or migrating `spec.packageVersion` to support ranges?
# References
[Open Policy Agent | Documentation](https://www.openpolicyagent.org/docs/latest/)
OLM V1 MVP
[yq](https://mikefarah.gitbook.io/yq/)
[`spec.version` behavior](https://github.com/operator-framework/operator-controller/discussions/165)
# Comments
## Actions > Deleting
**Joe Lanford: 12:42 PM Mar 21**
This is an interesting problem. If Operator v1 depends on API A and Operator v2 depends on an equivalent but different API B, you can't simply delete API A's CRD and create API B's CRD. The Operator needs a v1.5 that depends on both A and B and does the migration to B.
**Todd Short: 2:12 PM Mar 21**
Yes it is. We can't block k8s from deleting an Operator CR (ok we can, but finalizers are messy). OLMv1 should be able to recognize that some destinations can't be routed to, and will make the best attempts before giving up. "Giving up" is perfectly acceptable, if it keeps the system in a stable state and the user is notified of this condition.
**Joe Lanford: 2:21 PM Mar 21**
+1 that finalizers are messy
I would say that permanently giving up is absolutely unacceptable.
Giving up in the face of conceivably temporary errors is unacceptable.
But otherwise, giving up until my input changes is absolutely acceptable.
**Todd Short: 2:45 PM Mar 21**
The hope is that "giving up" occurs (in generate) BEFORE the configuration is applied. So, we never end up in a state like this. That being said, users can completely ignore the tools we provide, and manually create Operator CRs that can't be reconciled, and we need to handle that.
**Joe Lanford: 4:29 PM Mar 21**
Yeah I've got another absolutely for you: the on-cluster reconciler absolutely can't make assumptions about the client tooling used to create/update/delete Operator CRs.
**Todd Short: 3:54 PM Mar 22**
Exactly. Although the operator-controller can't express hope, the authors of the operator-controller can at least hope that the user wants to accomplish something. :)
**Joe Lanford: 12:47 PM Mar 21**
It can refuse, but should it? Perhaps we should delete the in-use operator because the cluster admin has told us they no longer need it to be in-use. (or it is no longer a transitive dependency of the operators they care about)
I realize this is a very provocative suggestion, but it is -- I think -- the most aligned with this being a declarative API.
It boils down to the question: "should cluster admins be forced to tell us everything they explicitly care about?" If so, then anything installed that hasn't been declared can also be automatically removed when it is no longer depended upon.
**Joe Lanford: 12:47 PM Mar 21**
This could be a footgun for cluster admins, but it is a very straightforward and easy to describe behavior.
**Todd Short: 2:21 PM Mar 21**
So, everything is explicitly declared. An operator won't be installed unless there's an Operator CR for it. Because all operators are defined as an Operator CR, and OLMv1 doesn't create/modify the spec of Operator CRs, OLMv1 won't automatically install/create dependencies. That is what the separate "kubectl operator generate" command (and possibly the on-cluster updater) is for.
This behavior (to delete or not to delete, that is the question) could be configurable in the OLM deployment.
I do have a book titled "Enough Rope To Shoot Yourself in the Foot".
I do tend to lean toward simpler designs that can be combined to create something bigger.
**Joe Lanford: 2:25 PM Mar 21**
Yeah, I think it has to be either: "you tell us everything including dependencies, explicitly" or "we auto-manage dependencies always"
There are pros/cons both ways. For instance, if we auto-manage dependencies, we can upgrade you through spots in the graph that pull a new dependency in and then dump that dependency in a later version.
**Todd Short: 2:50 PM Mar 21**
The approach I'm suggesting is sorta hybrid. One tool to help you determine the destination (generate), and another that gets you there (operator-controller). In this case however, "generate" may need to specify the "rest stops" along the way, which may include additional operators.
So, "generate" the tool for "we auto-manage dependencies always", and operator-controller is the tool for "you tell us everything including dependencies, explicitly." And they work together.
## Handling Dependencies
**Joe Lanford: 12:53 PM Mar 21**
Should there be a way to distinguish between an intent to use an Operator's API directly vs an intent that an Operator is only present to satisfy a dependency?
For example, consider cert-manager. It is pretty likely to be both a "root" desired operator AND one depended on by other operators. How is that intent made apparent?
**Todd Short: 2:27 PM Mar 21**
Yes, we ought to be able to distinguish between the two (or the combination). The implicit method, using "status.usedBy" and "status.dependencies", is one way. If something isn't "usedBy", it's implied that it's there for an explicit reason. The tricky one is when something is both explicit and a dependency. In this case, I believe I had mentioned adding an annotation.
**Joe Lanford: 4:36 PM Mar 21**
Yeah, that could work. Then the plugin could automatically spit out that annotation in CRs that were explicitly added via the generation command.
I wonder if there's a way to somehow do this without requiring the client to co-operate. (i.e. if I directly run `kubectl create -f` on a file with an Operator CR in it). Maybe an admission webhook/cel expression could add the annotation on create events if usedBy is empty? But when is usedBy populated? :thinking_face:
**Ben Parees: 5:05PM Mar 30**
how about an artificial "usedBy" value of "user" or "explicit" or something?
that would cover the case of "it was installed directly, but now other things are dependent on it too"
what it doesn't cover is the case of "it was installed as a dependency, but now people are using it directly". That's the hard one.
## Available Upgrades
**Joe Lanford: 12:57 PM Mar 21**
How will this work when there is interplay between other Operators?
E.g.
- two operators need to update in lock step
- if one operator upgrades, then another operator's available versions changes.
If I have ten operators installed, and all of them are listing available upgrades, does that mean that any combination of those upgrades satisfies the resolver?
**Todd Short: 2:30 PM Mar 21**
No. This just lists available upgrades. Because you haven't attempted to upgrade by updating the Operator CR, nothing happens. When an upgrade is attempted, the upgrade process will determine what's possible based on current installs and where you want to go. This could fail!
**Joe Lanford: 4:39 PM Mar 21**
Okay so this field is based purely on what the catalog says are the possible upgrade edges? I think the terminology talking about "resolution" threw me off. Maybe clarify this in the doc?
**Joe Lanford: 5:32 PM Mar 21**
I think this is the correct thing to do btw. It will give admins the ability to see possibilities beyond their current cluster state/constraints. But we'll have to be super clear with the API/UX of this field so as to not confuse people (i.e. you said it was available, but now i tried it and i'm getting an error about resolution! this is a bug! fix it!)
## Separation of Responsibilities
**Joe Lanford: 12:58 PM Mar 21**
This is definitely a nice outcome of this design.
**Todd Short: 2:30 PM Mar 21**
Thanks
**Joe Lanford: 4:40 PM Mar 21**
The whole on-cluster approval workflow always felt so off to me. It's essentially this imperative thing inserted into the middle of a declarative API.
## Upgrades > Generate Command (1)
**Joe Lanford: 1:02 PM Mar 21**
What is the source of the catalog information it pulls from?
**Joe Lanford: 1:10 PM Mar 21**
Assuming catalog sources defined on the cluster, how will `kubectl operator generate` connect to them? Or is there a different off-cluster catalog design not mentioned here?
**Joe Lanford: 1:11 PM Mar 21**
Ah - maybe it's the on-cluster apiserver accessible package API? The stuff Anik has been working on?
**Todd Short: 2:31 PM Mar 21**
I haven't spoken to Anik about this. But reading further, the CatSrcs are used as input to the generate command.
**Joe Lanford: 4:44 PM Mar 21**
Yeah, but if those are anything like OLMv0 catsrcs, then you can't query their content unless you've got a route to the pod in the cluster. So you'd need an ingress or a port-forward or something.
**Todd Short 4:08 PM Mar 22**
It should be an accessible package API.
## Upgrades > Generate Command (2)
**Joe Lanford: 5:00 PM Mar 21**
I'm not sure I agree with this method of inputting cluster state, for a few reasons.
1. It means the generate command has some magic logic for interpreting the YAML it receives as input and essentially expanding those to resolver inputs in a switch statement based on pre-defined hard coded GVKs it knows about
2. It's not very obvious what's supported. What if I want node architecture to be accounted for? Can I just add nodes to the list?
3. It's easy for different admins to accidentally provide different input
Maybe we should have a pluggable aggregated API (or similar) that various input providers can contribute to. Then the generate command just talks to that API, and all users get the same view? That option feels a little funny too.
Overall question is: what is the right way to:
1. Get potentially arbitrary cluster state into the generate command?
2. Ensure that when different cluster admins run the generate command, they can't accidentally provide different input, given the exact same cluster.
Are those reasonable things to solve for?
## Upgrades > Generate Command (3)
**Joe Lanford: 1:13 PM Mar 21**
If 0.1.2 is installed, and I ask for 1.2.3, but there's no direct upgrade from 0.1.2 to 1.2.3, what happens?
Does generate give me an intermediate version even though I asked for 1.2.3? Or does it fail? Or something else?
**Joe Lanford: 1:17 PM Mar 21**
Kept reading... generate gives me the destination. That begs further questions, but I'll keep reading. :)
**Todd Short: 9:45 AM Today**
👍
## Upgrades > Generate Command > Inputs (1)
**Joe Lanford: 8:33 AM Mar 22**
Is there a three way merge problem here?
1. Existing state of Operator manifests stored in git
2. Existing state of Operator CRs in the cluster
3. The new desired Operator manifests.
What happens when 1 and 2 are different (ignoring status and auto-generated metadata)?
**Todd Short: 9:20 AM Mar 22**
There shouldn't be. Per gitops, 1 and 2 should be the same. If they are not, then #1 takes precedence. The proper state of the cluster is stored in git. Deviations from that shouldn't happen, but it's possible.
It's also up to the SRE to determine where to take the input from. In a gitops scenario, the existing state of the Operators CRs in the cluster are ignored; git is used as the input. In a non-gitops scenario, the existing state of the Operator CRs in the cluster are used.
**Joe Lanford: 9:28 AM Mar 22**
Maybe I'm missing something, but it seems like "generate" will always need to at least see the status of the on-cluster Operator CRs (or maybe rukpak BDs?) to understand what's already installed?
So in a gitops context, it seems like the Operator inputs would be:
1. Git: Operator metadata/spec
2. Cluster: Operator status
3. Desired metadata/spec
So perhaps no 3-way merge concern in a happy path scenario. But I don't think we can just blindly ignore the cluster Operator metadata/spec. Maybe we should fail out or ask if cluster spec changes should be copied back to the manifests?
**Joe Lanford: 9:38 AM Mar 22**
My concern is along the lines of this scenario:
1. I'm an SRE and I setup a gitops flow such that changes in my git repo are reflected into the cluster
2. Some outage happens, and in a rush, I resolve the outage by manually applying changes to the cluster Operator CRs.
3. I forget to go back to my git repo and make the same changes
4. A new change comes into my git repo, it gets applied, and all of a sudden there's another outage because my live changes were never captured in git.
**Joe Lanford: 9:40 AM Mar 22**
If, as part of step 4, I used the generate command to build the patch, I'm suggesting that it should do something (e.g. present an interactive warning prompt) to make sure the user knows something unexpected is going on.
## Upgrades > Generate Command > Inputs (2)
**Joe Lanford: 1:16 PM Mar 21**
What's the API for this file? It seems like it would need to be almost identical to the Operator API itself?
For example, if there's a new field we add to the Operator API that allows a cluster admin to plumb something through to rukpak, this manifest would need to capture that, right?
**Todd Short: 2:55 PM Mar 21**
It's meant to be simpler, and not a k8s API, but relatively simple YAML to be read by the generate process. Could be stored in a ConfigMap. I should be able to take this file, use it for the "generate" command line, and dump it directly into a ConfigMap.
If we wanted to define it as an API (maybe that would make things simpler) it might be defined as "UpgradePolicy".
**Joe Lanford: 4:49 PM Mar 21**
So maybe a different way to ask my question. Is the intention here to give users a manifest input that can idempotently (assuming the same input otherwise), produce everything you'd want to apply to the cluster?
Or is this just the minimum set of stuff required to generate some minimal Operator CRs and then the user is expected to do further manipulation to fill in other fields that are necessarily cluster-admin defined?
**Todd Short: 9:21 AM Mar 22**
The first; and assumes CatSrcs don't change.
**Joe Lanford: 9:34 AM Mar 22**
Ok, so if it's the first, then it needs to essentially very closely follow the Operator API to make sure users have the full fidelity necessary to generate the desired Operator CRs?
**Todd Short: 9:50 AM Today**
The primary job here is version specification. Other parameters of the Operator CR should not change if the version changes. That does imply that one could update the output with labels, annotations and other fields. My intent was not to duplicate the contents of the Operator CR (i.e. _very_ closely follow), but merely indicate the version to go to. Admittedly, this won't provide all the possible options, but this ought to be the 90% case, and those 10% that need extra parameters can add them to the output before application. These parameters would be included in subsequent output if they are pulled into the generate command again.
## Upgrades > Generate Command > Inputs (3)
**Todd Short: 2:05 PM Feb 14**
It would be great if there was a single command to gather the information, but there doesn't appear to be.
**Alex Greene: 12:19 PM Feb 28**
This can be automated where it makes sense on cluster flavors that we support.
## Version Compatibility > Version Traversal & Intermediate Versions
**Alex Greene: 12:25 PM Feb 28**
This is a very interesting concept. Seems like users would like being able to say "upgrade until version x", which isn't supported by OLM v0
**Joe Lanford: 1:20 PM Mar 21**
I think we'll need to do this as quickly as possible. And to do it as quickly as possible, I think we'll need some signal from operators to tell us they're ready.
We have this concept in OLMv0 (OperatorConditions), but I think we should re-evaluate that design as well.
**Joe Lanford: 2:29 PM Mar 21**
We could also make this an opt-in sort of thing. For example, first you tell us in advance that you provide an upgrade readiness signal, then we look for it.
If you don't provide one, we have a $wellExplainedAndSimpleHeuristic for determining readiness, e.g. all of the core workload objects in your manifest are passing their health checks.
**Joe Lanford: 1:24 PM Mar 21**
I think this answers a question I had, but want to clarify. If I want a version that is 10 hops away from the currently installed version, the generate command will only return a result if a) it finds a path, and b) each hop on that path is satisfiable.
What if states 1 and 10 are both individually satisfiable given the current cluster, but states 4 and 5 need an extra operator installed?
**Todd Short: 2:39 PM Mar 21**
The goal will be to get to your destination; the generate command will create additional Operator CRs to get you to where you need to go.
The tricky part is if states 4 and 5 need operator B, BUT states 6-10 don't require operator B. Do we define operator B, even though it's not needed at the endpoint 10? I would say, only if state 4 and 5 could not be skipped, would operator B be defined.
**Joe Lanford: 4:21 PM Mar 21**
Yeah, and maybe we have a separate command that finds Operator CRs where `status.usedBy` is empty, and then we interactively (or not depending on a flag), ask if those CRs should be deleted? Basically a "yum autoremove" equivalent?
Or perhaps it's a slightly different mode of the generate command, in which you just ask it to clean up unused stuff? The problem I guess is the timing of when to run it. You'd have to re-run generate only once you get to state 6 or greater.
## Automatic Updates > On Cluster
**Kevin Rizza: 3:50 PM Mar 14**
is there a reason we would have to delegate this to a binary? why wouldn't we want to write a service / operator that explicitly knows how to do this and keeps track of state rather than a run once process? what happens when an automatic update fails?
**Todd Short: 4:05 PM Mar 14**
(Answering 2nd question first.)
There are four ways the automatic updates could fail:
1. Resolution (e.g. generate) fails in some way. Something is missing, or dependencies cannot be determined, etc. In this case, no configuration update is performed, and a Condition would be added to the Operator CR.
2. Pre-conditions/checks fail. A set of checks performed before actual installation to maximize installation success (e.g. did the bundle download and successfully extract, are all operator pre-conditions met). A condition would be added to the Operator CR status.
3. Constraints: there could be a number of constraints/policies defined that preclude installation (e.g. time-of-day), and these may be transient. A Condition would be added to the Operator CR status.
4. Installation fails, in this case, the operator would have to be rolled back, if that's possible. Not much different than today.
**Todd Short: 4:17 PM Mar 14**
1st question: code reuse. One could argue the benefits of having a long-lived service vs. a periodic process. This could be implemented as a library used by both the kubernetes plugin and the on-cluster mechanism. But TBH, the actual implementation doesn't really matter in this (UX) context, as long as the end result (updating of configuration) is the same.
That being said, a long-lived process that is basically maintaining the same state as the k8s configuration risks getting out-of-sync (I recall there was a recent problem, and the fix was to kill the pod because it had a bad cache?). In the grand scheme of things, it's more reliable to read the canonical configuration, and the benefit gained by a cache (which could be added later!) is minuscule compared to the time spent waiting for results of remote manifest queries.
I do envision the configuration updater to be idle 99% of the time, as it should only be triggered when new versions are available, and that is the task - at least described here - of OLM to discover these new versions.
**Todd Short: 4:23 PM Mar 14**
Using the kubectl plug-in would allow the creation of a quick PoC as a (Cron)Job.
## Management Console
**Todd Short: 4:25 PM Mar 14**
(In response to Kevin Rizza) I had already mentioned here that the functionality could be included in the management console in one of two different ways; so no reason why this wouldn't also apply to the on-cluster option.
## Implementation Details > Operator CR > Spec
**Joe Lanford: 9:32 AM Mar 22**
I think there's a separate requirement to be able to directly specify a bundle to install, rather than do a catalog lookup.
I don't think we need to handle this requirement immediately, but I'm curious if you think this design could be easily extended to handle that requirement in the future?
**Joe Lanford: 5:03 PM Mar 21**
If the field emptiness is interpreted exactly once at create time, then we need some defaulting logic to fill it in before it hits etcd, I think.
**Joe Lanford: 5:12 PM Mar 21**
Trying to understand how this works.
1. "latest" is not a valid value. So valid values are empty string or a semver version
2. If empty string, that means "latest at the time of install"? What if I patch an existing CR with an empty string? Do I get the new latest at patch time?
I'm a little uneasy about this latest default mechanic unless latest is always re-evaluated at reconcile time.
**Joe Lanford: 5:15 PM Mar 21**
I think the options are:
1. Always allow empty string, always re-evaluate at reconcile time.
2. Never allow empty string - intercept CR at admission time, lookup latest on create and use that as the value, fail updates that unset version.
(2) feels off to me on failing updates that unset the version, but it also feels off for the API to require a field to be unset to re-pull the latest version.
**Joe Lanford: 5:18 PM Mar 21**
I think I go back to my preference for:
- spec.versionRange (optional)
- if set, constrain possible solutions to contain a bundle in this version
- if unset, solution can contain any version of this operator.
**Todd Short: 9:39 AM Mar 22**
So, given that this is the "destination" as I've described it, it really needs to be defined. I'm less keen on an optional versionRange; gitops won't use it. The generate mechanism will calculate the latest. It becomes a problem if unset by a user manually creating the CR, which can be rejected.
**Joe Lanford: 9:50 AM Mar 22**
> I'm less keen on an optional versionRange; gitops won't use it
Gitops could use it. Just because I'm gitops-ing my Operator CRs doesn't mean I want to prescribe the exact versions that must be running. If I truly want to control exactly what's running at all times, I'd probably gitops rukpak bundle deployments directly.
IMO with Operator CRs, it's totally reasonable to gitops a policy of "keep me on major version 3" or "keep me on minor version 3.1"
I see the versionRange as a superset of the current proposal. The current proposal is essentially a range right? To do the routing, we need to tell the resolver that its allowed to install stuff less than or equal to spec.packageVersion. So we convert "spec.packageVersion: v1.2.3" to the range "<=v1.2.3" when we hand things off to the resolver during routing.
**Joe Lanford: 5:05 PM Mar 21**
What about force downgrade? Or force direct upgrade? Or ignoring conflicts with other Operators?
Maybe we need a whole section about enabling cluster admins to override the will of the operator controller.
**Todd Short: 9:43 AM Mar 22**
I don't disagree. Most of these problems occur when manually creating the CR, which ought to be discouraged (but we still need to handle). Maybe we also need to put some intelligence into the create/delete kubectl actions (but that's a separate point).
**Joe Lanford: 9:58 AM Mar 22**
> Maybe we also need to put some intelligence into the create/delete
I'm going to hard disagree with this. If we do that, then we inch toward the OLMv0 experience where install/update requires a special client or expert knowledge of the API interactions.
I think we need to treat "vi operators.yaml && kubectl apply -f operators.yaml" as a completely anticipated, first-class, and expected flow that people will use. Doesn't mean we encourage it, but the minute we say it's discouraged, it'll be too easy for us to handwave over special sauce we put into client tooling.
What I'm getting at is that I think we need the Operator API to have some fields like:
- spec.unsafeAllowDirectUpgrades
- spec.unsafeAllowDowngrades
- spec.unsafeIgnoreConflicts
These would be understood by the resolver to add possible next versions, so resolution would still pass, but now with cluster-admin-provided overrides that violate the intent of operator authors (which is why I include the unsafe prefix in these; probably need that for the noDeps option too)
## Implementation Details > PolicyBinding Details
**Alex Greene: 1:59 PM Feb 28**
My inclination is to adhere to the design of {cluster}RoleBindings as closely as possible, but the only way I could imagine implementing "groups" of operators is by introducing an additional API.
Could we do this with label selectors?
**Todd Short: 2:41 PM Feb 28**
I was using RBAC as a basis for this, but there are subtleties to work out (such as the above)
**Todd Short: 10:25 AM Mar 6**
Bindings could instead be done via label selectors, either as part of PolicyBinding or Policy CR. But right now this is as close to RBAC as possible. However, RBAC has to deal with multiple kinds of subjects and policies, where this does not.