# OLM v1 UX
## Tenets
* Declarative: resource creation and deletion have predictable side-effects.
* GitOps: The cluster operator loadout can be managed completely through GitOps.
* Control: automation should only help and never hinder.
    * Any automation can be manually executed through the CLI.
* Self-service: In multitenant clusters, operations (e.g. installation/deletion) _can_ be self-service.
* Transparency: Users can have full visibility into the impact of an operation.
* Extensible: Admins/users can extend the system to implement automation or flows/tasks suitable for their environment/organization without OLM team involvement
## Summary
This document proposes a flexible approach for the OLM v1 user experience that enables workflows ranging from full `GitOps` management to cluster-user self-service. The approach is composed of two APIs: `Operator` and `OperatorAction`, and a CLI that helps the user build their desired cluster state.
The `Operator API` is as simple and predictable as possible, with as little automation as possible. There is a one-to-one mapping between the `Operator` CR and a package that is installed on the cluster. Dependencies are checked, but not installed automatically. If the required dependencies are not present on the cluster, installation will continue (although warning events will be generated). Deletion has a wide blast radius: it will delete CRDs and, with them, any workloads represented by those CRDs. Deletion has the potential to break other `Operator` installations if the deleted operator is a dependency of another. This approach gives maximum control to the admin to set the cluster state as they want, without any automation getting in the way.
Through a CLI (`kubectl` plug-in), the user can get the help they need to inspect packages, pick dependencies, and generate (and, optionally, apply) the `Operator` resources needed for a successful installation of a particular package in a particular cluster. The CLI can be used in a GitOps flow to generate and lint `Operator` sets that define the packages the user wants on cluster. Any operation can be dry-run prior to execution, giving the user a clear understanding of what the change set is: which resources will be deleted, which will be created, whether any orphaned dependencies are left behind, or whether there are any running workloads in the deletion chain. This helps admins/SREs create predictable, sanity-checked cluster states without necessarily affecting the cluster directly, instead leaving a pipeline to apply and test the state before pushing to production. At the same time, it enables developers and power users to quickly and directly affect the cluster state, i.e. install/test their packages, etc.
Lastly, an `Operator Actions` API will provide an extensible mechanism for cluster-user facing workflows. Actions function similarly to Jobs and represent a request for a cluster state change (e.g. operator installation, update, or deletion). Different workflows can be envisioned here with different levels of automation. Best-effort self-service flows can be created that safeguard the cluster and only require admin engagement in specific cases or, alternatively, through a request/approval process.
## Operator API
The `Operator API` is responsible for installing and uninstalling a single operator. Its behavior should be predictable and follow the `GitOps` philosophy: the state of the cluster is represented by the state in the repository. The `Operator` API makes very few assumptions and has little automation. It gives admins full control over the cluster state with predictable side-effects. An admin can, for instance, easily sacrifice cluster stability to ensure security (e.g. by deleting a risky/compromised dependency at the cost of the parent operator's stability).
### Interaction
* Installation: after applying a new `Operator` CR, the `operator-controller` will attempt to install the corresponding package. If there are dependencies and they are not currently installed on the cluster, installation will fail.
* Deletion: when an `Operator` CR is deleted, the corresponding package manifests, including CRDs and any associated workloads, will be deleted. Deletion will not cascade to dependencies. If the deleted `Operator` is a dependency of another `Operator`, the other `Operator` installation will break (i.e. it will still be installed, but will likely no longer work).
* Upgrade: Similarly to installation, an upgrade will only succeed if all requirements are met a priori. Updates do not cascade to dependencies. Orphans will not be cleaned up (e.g. if v1 has a dependency, and v2 drops that dependency, upgrade progresses but the dependency still hangs around until its `Operator` CR is deleted). RukPak will be in charge of pivoting to the new version and all CR migrations.
## Operator CLI
The `Operator CLI` will be shipped as a `kubectl` plug-in. It can be used to automate `Operator CR` generation. It can either apply the generated manifests directly to the cluster, or simply generate the manifest files for a `GitOps` flow. It can include local `Operator` resources and the cluster's current state in its state calculation.
### Interaction
* Installation: the CLI can engage with the resolver to calculate dependencies and generate all of the required `Operator` CRs for the installation. Additional flags can be used to nudge the resolver towards a particular solution (e.g. choosing the dependencies that should be used, etc.)
* Deletion: flags can be passed to the client to instruct it to cascade deletion to dependencies. Additional safety can be added to deletion in case the user is attempting to delete an operator that is a dependency of another.
* Upgrades: the client can calculate all changes required for upgrading, including finding orphan dependencies.
* Lint: Checks that a directory of `Operator` CRs will work when applied to the cluster. It can issue warnings, e.g. for orphan packages, unmet dependencies, etc.
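To make the lint flow concrete, an illustrative run over a directory of `Operator` CRs might look like the sketch below (the command name, flags, and messages are hypothetical, not settled CLI design):
```bash=
$ kubectl operator lint ./operators/
WARNING: bar-operator is not required by any other Operator in ./operators/ (orphaned dependency)
ERROR: foo-operator depends on baz-operator >= 1.2.3, but no matching Operator CR was found
lint finished: 1 error, 1 warning
```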
### Challenges
* How do we ensure that the CLI is operating over the right state (e.g. the right sources are being considered, the correct cluster state, etc.)?
## Operator Action API
The `Operator Action` API follows a similar pattern to the `Ingress` API: an `actionClass` is specified, and any controller that can handle that class reconciles the action. The `Operator Action` API then becomes an extensible interface where different operations and automations can be modeled. For instance, we can imagine the following classes:
* `operator-deletion`: request to delete an operator (or dry-run a deletion). It can carry with it additional attributes to define what ought to be deleted, how, and under which conditions: e.g. delete operator X, cascade to dependencies, don't delete any operator that has any running workload
* `operator-installation`: request to install an operator (or dry-run an installation). It can carry with it additional attributes to define what ought to be installed and how. For instance, a request to install operator X at version Y, installing all dependencies, such that one of the dependencies should be filled with operator Z, and the others should prefer packages from the same repository as Operator X
* `resolution-debug`: request that the resolver execute a particular resolution query
* `autoremove`: request that all orphaned operators be deleted (or dry-run)
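For illustration, a hypothetical `autoremove` request could look like the sketch below (the field names are assumptions, not settled API):
```yaml=
apiVersion: olm.io/v1
kind: Action
metadata:
  name: remove-orphaned-operators
spec:
  actionClass: autoremove
  dryRun: true   # only report which orphaned operators would be deleted
```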
Similarly to a Kubernetes `Job`, an `Operator Action` can be configured with a number of retries. We could also consider adding a `CronOperatorAction` API for actions that should occur on a regular cadence. Additional `Operator Actions` can be created (by us or the community) and distributed/installed through OLM. We could also consider event-based triggering for certain actions. This API provides admins with the building blocks they need to create their own user experience for their cluster and to customize the levels of automation to their needs.
A nice side-effect of interacting with OLM through a request resource is that we (and the admin) get full visibility into which actions were taken, in which order, and the corresponding status of those actions. It also fits well with console operations, where forms and buttons can be used to dispatch certain actions.
A `kubectl` plug-in could also be provided to facilitate interactions with the API, though we'd need to figure out how to extend it with plug-ins for different (new) action classes that might be developed.
We could also imagine an `Operator Action Admission Controller`, which allows admins to configure who can run which actions and with which attribute values. That is, it enables as much self-service as possible, up to the point where admin intervention is required in order to protect the cluster.
Note that the `Operator Actions` API is completely optional; it wouldn't be necessary for clusters purely driven by GitOps.
## Examples
*Disclaimer: all examples shown here are for illustrative purposes only; any similarity with real persons, APIs, or flows is purely coincidental.*
### Control Freak
As a control freak, I don't want any automation. I want to be able to define the state that I want (whether it be consistent or not) because I know best!
#### Installation
```yaml=
cat <<EOF | kubectl apply -f -
apiVersion: olm.io/v1
kind: Operator
metadata:
  name: foo-operator
  labels:
    app: foo-operator
spec:
  version: 1.2.3
  # other attributes the user might set:
  # channel: stable
  # repository: foo-repo
EOF
```
```yaml=
$ kubectl get operator foo-operator -o yaml
apiVersion: olm.io/v1
kind: Operator
metadata:
  name: foo-operator
  labels:
    app: foo-operator
spec:
  version: 1.2.3
status:
  conditions:
  - lastTransitionTime: "2022-12-16T15:08:03Z"
    message: "foo-operator depends on bar-operator >= 1.2.3 but bar-operator is not installed" # _OR_ "but bar-operator v1.2.1 is installed"
    reason: UnmetDependency
    status: "False"
    type: ConstraintsSatisfied
  - lastTransitionTime: "2022-12-16T15:08:03Z"
    message: "bundle foov1.2.3 successfully unpacked"
    reason: UnpackSuccessful
    status: "True"
    type: HasValidBundle
  - lastTransitionTime: "2022-12-16T15:08:03Z"
    message: Instantiated bundle foov1.2.3 successfully
    reason: InstallationSucceeded
    status: "True"
    type: Installed
```
To bring the system to a consistent state, the user would need to understand the dependencies of the package they are trying to install. The OLM CLI would afford them the necessary functions to inspect the bundle and query the catalogs for bundles that can fulfill the dependency. The user would then create a new `Operator` CR for `bar-operator`, which once applied and in the correct version range, would bring the system to a consistent state. When inspecting the `foo-operator` again, the `ConstraintsSatisfied` condition would be `True`.
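For illustration, the `bar-operator` CR could mirror the `foo-operator` example above (the pinned version 1.2.5 is an assumed value that satisfies the `>= 1.2.3` requirement):
```yaml=
cat <<EOF | kubectl apply -f -
apiVersion: olm.io/v1
kind: Operator
metadata:
  name: bar-operator
  labels:
    app: bar-operator
spec:
  version: 1.2.5
EOF
```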
#### Deletion
```bash=
$ kubectl delete operator bar-operator
WARNING: deleting bar-operator will lead to an inconsistent state: foo-operator depends on bar-operator. Would you like to continue? [N/y]: y
```
Deleting the `bar-operator` resource would result in all `bar-operator` resources and workloads being nuked from the system. It would also bring the `foo-operator` `ConstraintsSatisfied` condition back to `False`.
#### Update
Because I'm a control freak, all of my package versions are pinned. Under these circumstances, an update would be a no-op.
```bash=
$ kubectl operator update
All operator versions are pinned. Nothing to do. If you want to bring all your operators to the latest version try with --latest
$ kubectl operator update --latest
foo-operator@v1.2.2 -> foo-operator@v2.0.1
bar-operator@v1.2.5 -> bar-operator@v1.2.5 (orphaned)
new operator -> baz-operator@1.5.6
fiz-operator@v2.9.2 -> fiz-operator@v3.0.1
...
Would you like to continue? [N/y]: n
$ kubectl operator update --for-range foo-operator@^1.2.x
foo-operator@v1.2.2 -> foo-operator@v1.2.9
bar-operator@v1.2.5 -> bar-operator@v1.2.7
fiz-operator@v2.9.2
buz-operator@v0.2.4
...
Would you like to continue? [N/y]: n
$ kubectl operator update --for-range "foo-operator@^1.2.x;fiz-operator@<3.0.0"
foo-operator@v1.2.2 -> foo-operator@v1.2.9
bar-operator@v1.2.5 -> bar-operator@v1.2.7
fiz-operator@v2.9.2 -> fiz-operator@v2.9.6
buz-operator@v0.2.4
...
Would you like to continue? [N/y]: y
```
### Power User
The power user is a Ronald Reagan style keyboard cowboy that trusts but verifies.
#### Installation
```bash=
$ kubectl operator resolve foo-operator
* foo-operator v1.2.3 depends on bar-operator >= v1.2.3
* bar-operator v1.2.5 depends on baz-operator >= 1.2.3
Generate Operator resources for:
- foo-operator@v1.2.3
- bar-operator@v1.2.5
- baz-operator@v1.2.4
[Y/n]: Y
$ ls
foo-operator.yaml
bar-operator.yaml
baz-operator.yaml
```
In an alternative reality:
```bash=
$ ls
fiz-operator.yaml
$ kubectl operator resolve foo-operator
* foo-operator v1.2.3 depends on bar-operator >= v1.2.3
* bar-operator v1.2.5 depends on baz-operator >= 1.2.3
* baz-operator conflicts with fiz-operator and fiz-operator is currently installed
Error: installing foo-operator could leave your system in an inconsistent state
$ ls
fiz-operator.yaml
```
The CLI could still offer overrides to force the generation of the foo-operator + dependencies `Operator` CRs. The user could delete fiz-operator.yaml or choose to roll the dice.
If the user wants foo-operator to always update to the latest z-release:
```bash=
$ kubectl operator resolve foo-operator --versionRange ^1.2.x
* foo-operator v1.2.3 depends on bar-operator >= v1.2.3
* bar-operator v1.2.5 depends on baz-operator >= 1.2.3
```
#### Deletion
```bash=
$ kubectl delete operator bar-operator --safe
Error: could not delete bar-operator: foo-operator depends on bar-operator
```
```bash=
$ kubectl delete operator foo-operator --safe
WARNING: foo-operator still has running workloads. Would you like to continue? [N/y]: y
WARNING: bar-operator will become an orphaned dependency of foo-operator. Would you also like to delete bar-operator? [n/Y]: y
WARNING: bar-operator has running workloads. Would you still like to delete bar-operator? [n/Y]: y
foo-operator deleted
bar-operator deleted
```
```bash=
$ kubectl delete operator foo-operator --cascade --force
foo-operator deleted
bar-operator deleted
```
#### Update
Update follows a similar mechanism as in the control freak example. If versions are pinned, it's a no-op. If our power user has installed operators using version ranges, updates would be a possibility.
```bash=
$ kubectl operator update
foo-operator ^1.2.x (@v1.2.5) -> foo-operator ^1.2.x (@v1.2.9)
...
```
### Cluster User
A cluster user is a non-admin user: maybe a namespace admin, or just a regular user within a namespace. The admin doesn't want to be bothered with every single operator installation request. They know they might have to manually intervene in some situations, but for the 80% case they want users to be able to self-serve. As long as an installation request won't put the system in an inconsistent state, they are happy to let users install anything from the catalog (standard or curated) to their hearts' content.
```yaml=
cat <<EOF | kubectl apply -f -
apiVersion: olm.io/v1
kind: RetryAction
metadata:
  name: install-foo-operator
spec:
  backoffLimit: 3
  template:
    actionClass: operator-install
    package: foo-operator
    version: ^1.2.x
    # other attributes the user might set:
    # channel: stable
    # repository: foo-repo
    installDependencies: true
    dryRun: false
EOF
```
```yaml=
$ kubectl get retryactions
NAME                   SUCCEEDED   FAILED   BACKOFFLIMIT
install-foo-operator   1           1        3
$ kubectl get actions
install-foo-operator-aj3ef   FAILED
install-foo-operator-bn4jg   SUCCEEDED
$ kubectl get action install-foo-operator-aj3ef -o yaml
apiVersion: olm.io/v1
kind: Action
metadata:
  name: install-foo-operator-aj3ef
spec:
  actionClass: operator-install
  package: foo-operator
  version: ^1.2.x
  installDependencies: true
  dryRun: false
status:
  conditions:
  - lastTransitionTime: "2022-12-16T15:08:03Z"
    message: Solar flare has wreaked havoc with the electronics
    reason: SolarFlare
    status: "False"
    type: ActionSucceeded
$ kubectl get action install-foo-operator-bn4jg -o yaml
apiVersion: olm.io/v1
kind: Action
metadata:
  name: install-foo-operator-bn4jg
spec:
  actionClass: operator-install
  package: foo-operator
  version: ^1.2.x
  installDependencies: true
  ensureConstraintSatisfiability: true
  dryRun: false
status:
  conditions:
  - lastTransitionTime: "2022-12-16T15:08:03Z"
    message: foo-operator@1.2.5 was successfully installed
    reason: ActionSucceeded
    status: "True"
    type: ActionSucceeded
  - lastTransitionTime: "2022-12-16T15:08:03Z"
    message: bar-operator@1.2.5 was successfully installed
    reason: DependencyCreated
    status: "True"
    type: ConstraintSatisfied
  - lastTransitionTime: "2022-12-16T15:08:03Z"
    message: baz-operator@1.2.2 was successfully installed
    reason: DependencyCreated
    status: "True"
    type: ConstraintSatisfied
$ kubectl get operators
NAME           SATISFIED   VERSION
...
foo-operator   True        ^1.2.x (@v1.2.5)
bar-operator   True        >=1.2.5,<1.3 (@v1.2.5)
baz-operator   True        ^1.2.x (@1.2.2)
```
As an admin, I might not want the self-service flow to allow users to install operators that have dependencies. Maybe I don't trust the resolver's default behaviour and want full control over which dependencies get selected for installation. To solve this, we could have a PSA-style webhook that always enforces `ensureConstraintSatisfiability: true` and `installDependencies: false` for install actions from users under a particular role.
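As a sketch of what such enforcement could look like, a CEL-based `ValidatingAdmissionPolicy` could stand in for a custom webhook, assuming the `Action` schema from the example above; the group name is purely illustrative, and a `ValidatingAdmissionPolicyBinding` (not shown) would still be needed to put it into effect:
```yaml=
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: restrict-self-service-installs
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["olm.io"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["actions"]
  matchConditions:
  # only apply the policy to requests from the (hypothetical) self-service group
  - name: self-service-users-only
    expression: "'self-service-users' in request.userInfo.groups"
  validations:
  - expression: >-
      object.spec.actionClass != 'operator-install' ||
      (has(object.spec.installDependencies) && object.spec.installDependencies == false &&
      has(object.spec.ensureConstraintSatisfiability) && object.spec.ensureConstraintSatisfiability == true)
    message: "self-service installs must set installDependencies: false and ensureConstraintSatisfiability: true"
```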
The deletion and update flows work with similar action classes with their own checks and attributes. E.g. an admin could determine that anyone can delete an operator as long as there are no workloads running under it. If there are, then that's a situation the admin ought to be involved in, or the user has to delete those workloads (assuming they can) and then re-submit the deletion request.
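To illustrate, a deletion request under such a policy might be expressed along these lines (the attributes beyond `actionClass`, `package`, and `dryRun` are hypothetical):
```yaml=
apiVersion: olm.io/v1
kind: Action
metadata:
  name: delete-foo-operator
spec:
  actionClass: operator-deletion
  package: foo-operator
  cascade: true                  # hypothetical: also delete orphaned dependencies
  failOnRunningWorkloads: true   # hypothetical: refuse to delete operators with running workloads
  dryRun: false
```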
### Automation
`Actions` represent small configurable automations. They'll try to fulfill a request as best they can and might fail due to configurable pre-conditions. The scope of an action can be limited for particular users through an admission webhook. `Actions` can be created and executed as a one-off, or additional APIs could be envisioned, e.g. `CronOperatorAction`, or `EventOperatorAction`, which trigger actions on a cadence or when certain events occur. For instance, there could be events for changes in the Operator CRs, or changes in the content sources.
`Actions` can be mixed and matched in any cluster to enable admins to provide the user experience and level of automation that they want for their cluster. `Actions` can be distributed through OLM registries.
#### Auto-update
An update action could be scheduled every time there is a change in content sources - or through a cron job, every Friday at 16:50, so we can ruin the admins' lives. The admin can configure the pre-conditions that should be met before an update is performed. For instance, only apply updates if resolution shows consistency and if no operators with running workloads will be removed; otherwise, fail. Monitoring can be set up to generate a ticket every time this action fails. The action should include all relevant information to help the admin rectify the situation.
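A possible shape for the scheduled variant, assuming a `CronOperatorAction` API, an `operator-update` action class, and spec fields for the pre-conditions (all assumptions, not settled design):
```yaml=
apiVersion: olm.io/v1
kind: CronOperatorAction
metadata:
  name: weekly-auto-update
spec:
  schedule: "50 16 * * 5"               # Fridays at 16:50
  template:
    actionClass: operator-update        # hypothetical class name for updates
    requireConsistentResolution: true   # hypothetical pre-condition: only update if resolution is consistent
    failOnWorkloadRemoval: true         # hypothetical pre-condition: never remove operators with running workloads
    dryRun: false
```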
## Challenges
One of the main challenges of this approach is state management as it relates to input into the resolver. Should the CLI take into consideration the state of the cluster? Or not? I think much of this comes down to a pets vs cattle discussion. If you treat your cluster as cattle, you might want to have completely external state management. We could imagine state files (or requirements.txt) that could be solely used as input into the CLI resolver. We could also imagine that you'd want to take all of the state from the target cluster and use the on-cluster resolver to derive the future intended states. Or even a hybrid approach, where you want to take the cluster properties (kube version, topology, nodes and architectures, etc.) but the package list and version ranges are supplied externally. I think all of these modes of operation should be supported. However, special care needs to be taken to ensure that the user knows what they are doing.
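As one illustration of the externally-managed extreme, a state file fed to the CLI resolver might look something like this (a hypothetical format; the actual schema would need its own design):
```yaml=
# desired-operators.yaml: external input to the CLI resolver
cluster:
  kubeVersion: "1.26"                 # cluster properties can be supplied here or read from the target cluster
  architectures: ["amd64", "arm64"]
packages:
- name: foo-operator
  versionRange: "^1.2.x"
- name: fiz-operator
  versionRange: "<3.0.0"
```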
## Debugging
Debugging resolver behavior is an important part of troubleshooting. Additionally, having CLI access to the on-cluster resolver is required to support certain usage strategies. This means we'd need to expose the resolver somehow. A couple of ways of doing so might be:
* Have a `resolution` action class, where the on-cluster resolver can be queried in an asynchronous fashion (see the sketch after this list). The CLI can submit resolution actions to the cluster and operate over the results. This increases CLI latency, but also guarantees fidelity. We may also run into CR size issues. If we're feeling hacky, we could truncate the response and make it available in a well-formatted log line that can then be grepped for.
* Expose the resolver via an HTTP endpoint. This might restrict the set of users that can speak to the resolver (e.g. if they can't set up a port-forward and the admin doesn't want to expose the service via ingress).
* The on-cluster resolver exposes enough configuration state that the CLI resolver can configure itself to behave exactly the same way.
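For the first option, a `resolution` action might look like the sketch below (the query and result shapes are illustrative only):
```yaml=
apiVersion: olm.io/v1
kind: Action
metadata:
  name: resolve-foo-operator
spec:
  actionClass: resolution
  query:                             # hypothetical query shape
    package: foo-operator
    versionRange: "^1.2.x"
status:
  conditions:
  - lastTransitionTime: "2022-12-16T15:08:03Z"
    message: "foo-operator@v1.2.3, bar-operator@v1.2.5, baz-operator@v1.2.4"
    reason: ResolutionSucceeded
    status: "True"
    type: ResolutionSucceeded
```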