Descoped Operator: An operator that is expected to be the sole owner of an API in a cluster.
Scoped Operator: An operator that has been installed under a previous version of OLM and includes scoping assumptions (including metadata like `installMode`).
When OLM was first written, CRDs defined only the existence of a GVK in a cluster. Operators developed for OLM could only install in a namespace, watching that namespace - this delivered on the self-service, operational-encoding story of operators. The same operator could be installed in every namespace of a cluster.
Privilege escalation became a concern - since operators are run with a service account in a namespace, anyone with the ability to create workloads in that namespace could escalate to the permissions of the operator. This made service provider/consumer relationships a difficult sell for operators in OLM.
At the same time, CRDs continued to add features. With version schemas and admission and conversion webhooks, CRDs no longer simply registered a global name for a type, and operators in separate namespaces had lots of options to interfere with one another if they shared the same CRD. OLM also expanded to support `APIService`s in addition to operators based on CRDs, and so required a notion of cluster-wide operators.
To address these concerns, a notion of scoping operators was introduced via the `OperatorGroup` object. An `OperatorGroup` would specify a set of namespaces within a cluster in which all installed operators would share the same scope. OLM would ensure that only one operator within a namespace owned a particular CRD to avoid collision problems, and more installation options were provided to allow separating operators from their managed workloads.
But `OperatorGroups` do not alter the fundamental problem: APIs in a Kubernetes cluster are cluster-scoped. They are visible via discovery to any user that wishes to see them. Even operators that agree on a particular GVK may have differences of opinion in how those objects should be admitted to a cluster, or how conversion between API versions should happen.
With Operator Framework, we want to build an ecosystem of high-quality operators that can be re-used across different projects, whether they're in the same cluster or not. But re-using operators compounds the scoping problems within a cluster - it increases the likelihood that more than one "opinion" about an API exists in the cluster.
For these reasons we are looking to entirely remove the notion of scoping from OLM.
It means that (in the near future), for any operator installed via OLM, we expect that the operator is the sole owner of the APIs it provides in the cluster, and that it can read and update the `status` of each of those APIs (if the API has a `status` section, as most do). It does not mean that the operator's permissions are automatically bound in every namespace - those bindings become the administrator's responsibility, as described below.
If you are an operator author and the above statements are concerning, please review the Operator Patterns section for suggestions on how to achieve your goals in a descoped world.
For the most part, the final state for de-scoping will be a transition away from namespace-scoped APIs like `Subscription`, `InstallPlan`, and `ClusterServiceVersion` to a different set of cluster-scoped APIs like `Operator` and `Install`.
These newer APIs avoid much of the complexity introduced by scoping, and are already in progress. The `Operator` API is available as a read-only API in 4.6.
The document fully describing this end state is the Simplify OLM APIs enhancement. Note: the enhancement is currently pending an update to call out the scoping issues. The only change as a result of the decision to de-scope is to always assume that operators are de-scoped in the new APIs.
At some point, it is likely that OLM will introduce some namespaced APIs again for the installation of non-operator content. But this will be accompanied by its own enhancement.
Operators may still request the cluster- and namespace-scoped permissions they need to run within their installation namespace at install time (i.e. today, via `clusterPermissions` and `permissions`). But unlike with scoped operators, the namespace-scoped permissions do not get copied to a pre-defined set of namespaces, and no bindings are created by default.
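As a rough sketch of what such a request looks like in a CSV today - the operator name, the rules, and the service account `foo-sa` are placeholders, and the deployment spec is elided:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: foo-operator.v1.0.0
spec:
  install:
    strategy: deployment
    spec:
      # Namespace-scoped permissions requested for the installation namespace.
      permissions:
        - serviceAccountName: foo-sa
          rules:
            - apiGroups: [""]
              resources: ["configmaps", "secrets"]
              verbs: ["get", "list", "watch"]
      # Cluster-scoped permissions.
      clusterPermissions:
        - serviceAccountName: foo-sa
          rules:
            - apiGroups: ["example.com"]
              resources: ["foos", "foos/status"]
              verbs: ["get", "list", "watch", "update", "patch"]
      deployments: []   # operator Deployment spec elided for brevity
```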
De-scoped operators need to be concerned about RBAC in two general areas:
A descoped operator will not be allowed to install any bindings - this is the job of an administrator.
Operators will provide a set of `ClusterRoles` for the work that they need to do.
For example, an operator with serviceaccount `foo-sa` might provide this `ClusterRole`:
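(A sketch only - the API group `example.com`, the resource `foos`, and the exact rules are placeholders chosen for illustration.)

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: foo-operator
rules:
  # Manage the operator's own API.
  - apiGroups: ["example.com"]
    resources: ["foos"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Manage the workloads the operator creates for each Foo.
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```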
An administrator can bind this with a `ClusterRoleBinding` if they want the operator to be able to perform its tasks in all namespaces. Or, they may drop a `RoleBinding` for this role, binding the operator's `ServiceAccount` in each desired namespace.
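A sketch of both options, assuming the `ClusterRole` above is named `foo-operator` and the operator's `foo-sa` ServiceAccount lives in a hypothetical `operators` namespace:

```yaml
# Option 1: allow the operator to act in all namespaces.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: foo-operator-all-namespaces
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: foo-operator
subjects:
  - kind: ServiceAccount
    name: foo-sa
    namespace: operators
---
# Option 2: allow the operator to act only in the "team-a" namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: foo-operator
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: foo-operator
subjects:
  - kind: ServiceAccount
    name: foo-sa
    namespace: operators
```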
An operator author may provide more or less granular `ClusterRoles`. Most operators likely just need one per serviceaccount, but others may wish to provide granular feature-based `ClusterRoles` so that an administrator can enable/disable portions of the operator's functionality.
A convention will be established: a single ClusterRole for a ServiceAccount will be assumed to be required for the operator to operate. Multiple ClusterRoles will be considered optional, such that creating/deleting them will enable/disable certain aspects of the operator.
One exception: de-scoped operators are always expected to have read and update permission on the `/status` subresource of the APIs they own in all namespaces. This is to ensure that the operator has a communication channel with users (to communicate, for example, that it does not have the proper permission to do work in a particular namespace). OLM will raise alerts when there are CRs in a cluster with no controller capable of updating their status.
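As a sketch, that always-present permission for a hypothetical `foos.example.com` API could be expressed as a rule on the `/status` subresource:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: foo-operator-status
rules:
  # Lets the operator report conditions on the CRs it owns, in every namespace.
  - apiGroups: ["example.com"]
    resources: ["foos/status"]
    verbs: ["get", "update", "patch"]
```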
Note: This is a degradation of the install experience, see below for the interim solution.
For scoped operators, OLM automatically generates ClusterRoles and automatically aggregates them to the default `admin`, `edit`, and `view` ClusterRoles, based on the availability of an operator in a particular namespace.
For de-scoped operators, OLM will leave this to the administrator.
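An administrator who wants to restore that behavior by hand can use the standard aggregation labels - a sketch, again with placeholder resource names:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: foo-view
  labels:
    # Rules here are folded into the built-in "view" ClusterRole; the
    # aggregate-to-edit and aggregate-to-admin labels work the same way.
    rbac.authorization.k8s.io/aggregate-to-view: "true"
rules:
  - apiGroups: ["example.com"]
    resources: ["foos"]
    verbs: ["get", "list", "watch"]
```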
Note: This is a degradation of the install experience, see below for the interim solution.
The above changes for de-scoped operators, without any additional tooling, degrades the operator installation experience. There is no declarative way to indicate that an operator should be permitted to work in a set of namespaces (it becomes a two step process: install, bind).
For now, we will repurpose `OperatorGroups` for RBAC management. An `OperatorGroup` will not be required for the installation of a de-scoped operator (as they are today for scoped operators).
If a de-scoped operator is installed in an `OperatorGroup`, the namespace list on the OperatorGroup is used to determine which namespaces will get automatic bindings for the operator's serviceaccounts, and OLM will generate and aggregate API access RBAC roles to `view`, `edit`, and `admin`.
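For illustration, such an `OperatorGroup` might look like the following (the group name and namespace list are placeholders):

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: foo-operatorgroup
  namespace: operators
spec:
  # Namespaces that receive automatic bindings for the operator's
  # serviceaccounts and aggregated view/edit/admin access to its APIs.
  targetNamespaces:
    - team-a
    - team-b
```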
This should be considered an interim solution. A more configurable and more supported API may look something like RBACManager.
Since CSV-less bundles are not yet available, we will indicate that an operator is descoped by marking `installModes` as optional.
Any CSV with an empty `installModes` block will be treated as a de-scoped operator.
Any CSV with an `installModes` block that supports AllNamespace mode only will be treated as a de-scoped operator that provides APIs to all users by default (i.e. OLM will generate an AllNamespace `OperatorGroup` for it).
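As a sketch, an `installModes` block supporting only AllNamespace mode (and therefore treated as de-scoped, with APIs provided to all users by default) looks roughly like this CSV excerpt:

```yaml
spec:
  installModes:
    - type: OwnNamespace
      supported: false
    - type: SingleNamespace
      supported: false
    - type: MultiNamespace
      supported: false
    - type: AllNamespaces
      supported: true
```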
The `Operator` API is a cluster-scoped API to reflect the cluster-scoped nature of API extensions.
Users of the provided APIs can tell that they exist via discovery, but admins may be hesitant to grant `read` on the `Operator` API itself so that those users can learn more about the operator and its services.
Discovery of available operators for UI will flow through discovery:
TODO: this will not work well for APIs shared between operators, and makes it easier to hit etcd key limits.
Dependency resolution will continue to take place at the namespace scope for operators installed via `Subscriptions`. Any operators installed this way will have an `Operator` object created automatically for visibility, but the `Operator` may begin to emit warnings about the scoped nature of the installation.
Resolution will also take place among de-scoped operators at the cluster scope. Note that resolution or updates of de-scoped operators can be blocked by issues with scoped operators.
TODO
Affects: OwnNamespace, SingleNamespace, MultiNamespace
Transition:

Affects: OwnNamespace, SingleNamespace, MultiNamespace
Transition:

Affects: OwnNamespace, SingleNamespace, MultiNamespace
Transition:
Many of the use-cases for scoped operators are better suited as features within the operator itself.
One of the primary reasons for scoping operators is to reason about their Blast Radius - i.e. in the case of a bug or malicious control, limit the worst-case scenario for the cluster.
Limiting blast radius for de-scoped operators is generally simpler to reason about, because it relies heavily on auditable RBAC policy.
In this example, the de-scoped operator is installed with a ClusterRole only, and an administrator must explicitly bind each namespace that the operator should be allowed to use via a RoleBinding:
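(A minimal sketch, reusing the hypothetical `foo-operator` ClusterRole and `foo-sa` ServiceAccount from earlier; the `team-a` namespace is a placeholder.)

```yaml
# Opt in the "team-a" namespace only; the operator has no other bindings,
# so a bug or compromise is contained to the namespaces bound this way.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: foo-operator
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: foo-operator
subjects:
  - kind: ServiceAccount
    name: foo-sa
    namespace: operators
```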
Operators scoped to a single namespace are often used to provide internal APIs that should not be available anywhere else in the cluster. These may be config apis that configure cluster operation or APIs that are otherwise sensitive.
This differs from the example above of limiting an operator's blast radius - the operator author doesn't want it to be possible for an administrator to expose the APIs to other users and namespaces, or wants a guarantee that the operator is not given permission outside of its installation namespace.
In this example, an operator is granted restricted permissions for a single namespace, and provides a Cluster-Scoped API. This ensures that anyone with access to write the API has been vetted, and that the operator itself can only perform operations within its own namespace.
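A sketch of this pattern with placeholder names: the CRD the operator owns is registered as cluster-scoped, the operator's cluster-wide access is limited to that one API, and every other permission is bound only inside its installation namespace.

```yaml
# The provided API is cluster-scoped, so its instances live outside any namespace.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: clusterconfigs.example.com
spec:
  group: example.com
  scope: Cluster
  names:
    kind: ClusterConfig
    plural: clusterconfigs
    singular: clusterconfig
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              x-kubernetes-preserve-unknown-fields: true
            status:
              type: object
              x-kubernetes-preserve-unknown-fields: true
---
# Cluster-wide access is limited to that single API.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: clusterconfig-operator
rules:
  - apiGroups: ["example.com"]
    resources: ["clusterconfigs", "clusterconfigs/status"]
    verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: clusterconfig-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: clusterconfig-operator
subjects:
  - kind: ServiceAccount
    name: foo-sa
    namespace: operators
```

All other permissions (for example, to manage workloads) would be granted only via a Role/RoleBinding inside the operator's installation namespace.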
Scoping is often desired in order to deal with tenant isolation and prevent noisy-neighbor effects.
Without multiple clusters or a first-class tenancy effort within Kubernetes, this will never be truly possible (i.e. scoped operators may provide isolation for their APIs, but not for underlying Kubernetes APIs, etcd access, or physical cluster resources). For operators that still wish to isolate tenants, however, this is possible by having a single parent operator spin up multiple control loops or even operator pods.
These architectures may also be a strategy to scale operators horizontally.
Spin up controllers per tenant within a running process.
This is similar to the above, but spins up controllers in their own pods. This may be valuable to leverage the cluster scheduler, or to have specific tenants managed with an (auditable) reduction in permission scope.
This is a similar pattern, but places the pods near to the workloads they manage. This has different visibility and permission implications that may be desirable depending on the tasks being performed.
Scoping has also been considered as a solution for canary rollouts of new operator versions. A new version may be released to manage one namespace, while the previous release manages the rest of the cluster's namespaces.
During an upgrade, there will be two operators running at one time. This provides an opportunity for both operators to transfer ownership, handoff locks, or otherwise coordinate the rollout of operator resources - combined with a strategy for limiting the blast radius and use of operator conditions, effective and domain-specific canary or other rollout strategies can be implemented.
The precise mechanism for determining when and how the new version should take control can be determined by the operator (or a separate or external tool), which can allow automated rollout based on domain-specific metrics. For example, an operator might use owner references or labels on CRs to indicate which version is currently managing that instance, with ownership flipped manually. Additional automation can be added to automatically flip management based on metrics / percent rollout, i.e. "for every 2 hours that metrics still look healthy, increase rollout to management by the new operator by 10%".
This is a strategy for limiting the impact of rolling out the operator itself, but similarly domain-specific rollout strategies may be defined for any operands (rules about which versions can be upgraded to which others, etc).
The only part of this that OLM needs to be informed of is that the rollout is taking place - it is important to set the `Upgradeable=false` `OperatorCondition` so that OLM does not attempt to interfere with the handoff.
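A sketch of how an operator might report this, assuming the `OperatorCondition` that OLM creates for it is named after the CSV and lives in the install namespace (field layout shown for the `operators.coreos.com/v2` API; the name, namespace, reason, and message are illustrative):

```yaml
apiVersion: operators.coreos.com/v2
kind: OperatorCondition
metadata:
  name: foo-operator.v2.0.0
  namespace: operators
spec:
  conditions:
    # Tells OLM not to upgrade or otherwise disturb the operator while the
    # canary handoff is in progress.
    - type: Upgradeable
      status: "False"
      reason: CanaryRolloutInProgress
      message: Handing off ownership of CRs from the previous version
      lastTransitionTime: "2021-06-01T00:00:00Z"
```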