---
title: On-cluster operator package delivery mechanism for OLM v1
authors:
- "@anik120"
reviewers:
- "@joelanford"
- "@zesus"
- "@everettraven"
approvers:
-
creation-date: 2022-10-01
last-updated: 2023-01-11
tags: operator catalogs
---
# On-cluster operator package delivery mechanism for OLM v1
## Summary
This document aims to discuss (and ultimately help decide) the on-cluster operator package (i.e. catalogs of operators) delivery mechanism for OLM v1. It proposes a new architecture that implements the existing OLM v0 concept of `CatalogSource` using an aggregated apiserver, and an alternative architecture that resembles the current implementation of `CatalogSource` but with a few important modifications. Both architectures leverage the new [File-based catalogs](https://olm.operatorframework.io/docs/reference/file-based-catalogs/#olmbundleobject-alpha) packaging mechanism, and have the main goal of being more compatible with other OLM v1 components (e.g. rukpak) than the v0 implementation. The proposed architectures also aim to significantly improve the human user UX over their predecessor.
## Motivation
At present, OLM delivers the content of operator catalogs via the `CatalogSource` CRD and the controller that reconciles it. However, `CatalogSource` CRs are namespace-scoped, with additional information hardcoded in the controller to make some catalogs behave as if they were cluster-scoped (e.g. Openshift hardcodes the logic "all Catalogs in the openshift-marketplace namespace are global"), in order to support features like namespace-scoped dependency resolution of packages and facilitate multi-tenancy. This is at odds with OLM v1, where [operator bundles and their dependencies will be installed cluster-scoped](https://github.com/operator-framework/rukpak/blob/main/manifests/apis/crds/core.rukpak.io_bundles.yaml#L16). It also introduces requirements such as, for an Openshift installation, "the openshift-marketplace namespace must be created as step 0, and all catalogs must be created in that namespace to facilitate cluster-scoped package discovery and dependency resolution", which is redundant in the context of OLM v1. Therefore, porting over the existing concept of namespace-scoped catalogs only introduces additional (unnecessary) costly complexities in OLM v1.

Additionally, the controller that reconciles the current `CatalogSource` CRD also reconciles other OLM v0 CRDs. Even with everything but the part that reconciles the `CatalogSource` CRD stripped out, it has historical tech debt baked into it. An example of that is the [packageserver](https://github.com/operator-framework/operator-lifecycle-manager/tree/master/pkg/package-server), a custom apiserver that was needed to parse the [SQLite database that stored catalog metadata](https://github.com/operator-framework/operator-registry/blob/master/pkg/containertools/dockerfilegenerator.go#L52-L55) and re-create package-level metadata to make [`PackageManifests`](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/package-server/apis/operators/packagemanifest.go) ([example CR](https://gist.github.com/anik120/50099e5b35067ab2fd1936ff6c4f0348)) available on cluster. That in turn needed a [packageserver-manager](https://github.com/openshift/operator-framework-olm/tree/master/pkg/package-server-manager) downstream, widening the fork between what's available upstream vs downstream, and thereby increasing the effort needed to maintain the architecture. This suggests that either a complete port of the controller would introduce unnecessary maintenance cost to OLM v1, or an audit and stripping down of the controller would likely be too costly to be justified anyway.

The registry pods have also encountered multiple sources of high CPU/memory consumption [(example)](https://bugzilla.redhat.com/show_bug.cgi?id=2015814), with several patches only getting them to a state where the registry pods created by the CRs "only intermittently" experience CPU/memory consumption spikes (instead of constant exaggerated consumption). This is in addition to the duplicated extra CPU/memory consumed on cluster by components like the package-server and the resolver (which builds in-memory [caches](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/registry/resolver/cache/cache.go) of the catalogs' metadata to reduce the time taken to connect to and retrieve data from the pods' service endpoints).
Creating bare pods without using workload objects such as `Deployments`/`StatefulSets`/`Jobs` [is not the norm](https://kubernetes.io/docs/concepts/workloads/pods/#working-with-pods), and has burdened the `CatalogSource` controller with high maintenance cost in an effort to do some of the job that the workload resources are already designed to do. This was felt clearly in some of the high-priority/high-fix-cost bugs filed by customers, and in epics that internal stakeholders needed (not an exhaustive list):
* [Inability to customize registry pod configurations to allow customers to schedule/lifecycle pods in their cluster (e.g. adding tolerations for tainted nodes)](https://bugzilla.redhat.com/show_bug.cgi?id=1927478). This ultimately led to the need for introducing the `spec.grpcPodConfig` field, which has set the stage for a long cycle of API surface expansion, one field at a time.
* [Pods crashlooping without saving termination logs, making it difficult to debug urgent customer cases](https://bugzilla.redhat.com/show_bug.cgi?id=1952238). Customers had no way to customize the Pod config to set the termination log path, and had to wait for a long-winded release cycle for the fix to reach their cluster before they could finally start debugging crashlooping pods. Meanwhile, alerts kept firing on the cluster while they waited for the fix to land.
* A [~30 story points](https://issues.redhat.com/browse/OLM-2600) epic to address Openshift changing the default Pod Security Admission level to `restricted`.
In order for any Kubernetes distribution to deliver these catalogs of extensions on cluster with OLM v1, with a goal of reducing the cost of running the control plane, the catalogs themselves must not be a source of high cost for customers.
Therefore, there needs to be a way for the content of these catalogs to be delivered on cluster in a manner that's compatible with OLM v1, while staying lean both in executing the delivery and over the OLM v1 API's maintenance lifecycle.
## Requirements
### User stories
* As an operator author (e.g. the etcd operator), I want the ability to include my operator in catalogs being built by multiple authors of operator catalogs (e.g. cncf-community-operators, redhat-community-operators etc) for various Kubernetes distributions (e.g. Openshift, IBM Cloud etc), so that my operator can be made available for installation by customers of these distributions.
* As an author of a catalog, I want a mechanism for the list of operator packages I have curated and vetted to be made available on clusters, so that the customers of the Kubernetes distribution I have authored the catalog for can use my catalog to select operators they want installed/lifecycled with OLM v1. I also want to be able to continuously make newer versions of the operators I've packaged available to customer clusters, when authors of those operators release newer versions of their operators, patches etc.
* As a cluster admin of a cluster that is lifecycling operators with OLM v1, I want OLM's package delivery mechanism to incur minimal control-plane cost.
### Goals
* Provide a cluster-scoped API that helps expose operator packages' metadata from index images (OCI images in image repositories that contain [fbc metadata folders](https://olm.operatorframework.io/docs/reference/file-based-catalogs/)) on cluster, i.e. re-build the [existing CatalogSource CRD](https://github.com/operator-framework/api/blob/master/crds/operators.coreos.com_catalogsources.yaml#L19) as cluster-scoped.
* Provide a streamlined API surface that leverages the new [file-based catalog (fbc) format](https://olm.operatorframework.io/docs/reference/file-based-catalogs/) for organizing catalog metadata, to expose the content on cluster for consumption by on-cluster clients (e.g. the OLM dependency resolver, the Openshift console etc), i.e. streamline the [current API surface](https://github.com/operator-framework/operator-registry/blob/master/pkg/server/server.go). That API was designed to expose catalog content stored in SQLite databases, but has evolved to carry the burden of historical context that is one of the major contributors to high resource consumption in the architecture (e.g. the need to send the entire CSV for each bundle in the response for `ListBundles`), with hardly any justification to be carried over to OLM v1 with the new fbc format.
### Non goals
* Discuss the mechanism for packaging/storage/shipping of fbc in OCI registries. This doc assumes that there is a standard way of pulling and unpacking an image to get an fbc directory in the file system of the component that's ultimately going to serve the content on cluster. The packaging and shipping mechanism for fbc is discussed in a separate document: </ placeholder for link when the doc is available >
## Proposal
### Problem statement
Given an [index image](https://olm.operatorframework.io/docs/glossary/#catalog-image), make the content of the image (i.e. the operator packages shipped in that image) available on cluster via APIs that:
1. Provide an unprivileged client (e.g. a non-admin user, the Openshift console etc) information about:
* the operator packages available in the index
* the channels available in each package, the default channel for each package, and the bundles that belong to each channel
* the display metadata used by the Openshift console, e.g. the `description` of the package, the catalogsource publisher, maintainers information etc
2. Provide a privileged client (eg OLM dependency resolver) information about:
* the bundles present in each package
* the provided/required properties of each bundle
* the upgrade graph information of each channel in the package
Project name: `catalogd`
### Architecture
`catalogd` has the unique opportunity of providing all of the information stated above in a more Kubernetes-native way than the OLM v0 `CatalogSource`, with the help of the already-defined fbc format. All of the information mentioned above is already stored in Go types in the [declcfg library](https://github.com/operator-framework/api/pull/266) used by the fbc format. These Go types can be exposed as CRs of two CRDs, `catalogd.operatorframework.io/Package` and `catalogd.operatorframework.io/BundleMetadata`, allowing for a "what you put in the catalog is what you can reliably expect to see on cluster, without any conversion/translation layer in between" UX. The two new CRDs will be in addition to a revamped `catalogd` `CatalogSource` CRD that will not include any of the OLM v0 implementation-detail-specific fields present in the v0 CRD (more details in the next section). To ensure that the kube-system apiserver and etcd instances are not misused as the `catalogd` datastore, `catalogd` will use an aggregated apiserver like its OLM v0 counterpart, but with dedicated custom storage to store the `Package` and `BundleMetadata` objects.

The two CRDs introduced will be cluster scoped.
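To illustrate the "what you put in the catalog is what you see on cluster" mapping, the sketch below shows what the spec types for the two CRDs could look like if they mirror the fbc `olm.package`, `olm.channel` and `olm.bundle` schemas. This is a minimal sketch only; the type and field names (`PackageSpec`, `BundleMetadataSpec`, etc.) are illustrative assumptions, not the final API.
```go
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// PackageSpec mirrors the package-level fbc metadata (olm.package + olm.channel),
// so the on-cluster representation matches the catalog content directly.
type PackageSpec struct {
	SourceDisplayName string    `json:"sourceDisplayName,omitempty"`
	Description       string    `json:"description,omitempty"`
	Icon              *Icon     `json:"icon,omitempty"`
	DefaultChannel    string    `json:"defaultChannel"`
	Channels          []Channel `json:"channels"`
}

type Icon struct {
	Base64Data string `json:"base64data"`
	MediaType  string `json:"mediatype"`
}

type Channel struct {
	Name    string         `json:"name"`
	Entries []ChannelEntry `json:"entries"`
}

// ChannelEntry carries the upgrade-graph information for a bundle in a channel.
type ChannelEntry struct {
	Name      string   `json:"name"`
	Replaces  string   `json:"replaces,omitempty"`
	Skips     []string `json:"skips,omitempty"`
	SkipRange string   `json:"skipRange,omitempty"`
}

// BundleMetadataSpec mirrors the fbc olm.bundle schema.
type BundleMetadataSpec struct {
	Image      string     `json:"image"`
	Properties []Property `json:"properties,omitempty"`
}

// Property is an opaque type/value pair, exactly as it appears in the fbc.
type Property struct {
	Type  string               `json:"type"`
	Value runtime.RawExtension `json:"value"`
}

// Package is the cluster-scoped resource exposing package-level metadata.
type Package struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              PackageSpec `json:"spec,omitempty"`
}

// BundleMetadata is the cluster-scoped resource exposing bundle-level metadata.
type BundleMetadata struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              BundleMetadataSpec `json:"spec,omitempty"`
}
```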
#### CatalogSource CRD
The revamped, cluster-scoped `catalogd` `CatalogSource` CRD will start out with the following fields in its spec:
* `image`, which will contain the index image reference
* `priority`, to indicate the priority of the catalog when its packages are considered during installation/dependency resolution
Since the `catalogd` `CatalogSource` is not one of the existing v0 CatalogSource types (`grpc`/`address`/`configmap`), it will not contain any of the type-specific fields.
An example `CatalogSource` CR looks like the following:
```yaml=
apiVersion: catalogd.operatorframework.io/v1beta1
kind: CatalogSource
metadata:
  name: community-operators
spec:
  image: quay.io/operatorhubio/catalog:latest
  priority: 0
```
#### Package resource
The `Package` resource will be used to expose all the package-level metadata for each package in a `CatalogSource`. This includes (but may not be limited to, during implementation):
* the package name
(console-specific requirements:)
* the catalogSource the package is available in, and its display name
* the maintainers of the package
* the description of the package provided by the author
* the icon for the package
* keywords, links, maturity, provider
(other general client (including deppy) requirements:)
* the default channel of the package
* the channels available in the package
* the bundles present in each channel, and the related upgrade graph information
An example Package resource will look like the following:
```yaml=
apiVersion: catalogd.operatorframework.io/v1beta1
kind: Package
metadata:
  name: amq-online
  source: redhat-platform-operators
spec:
  sourceDisplayName: Red Hat Platform Operators
  icon:
    base64data: iVBORw0KGgoAAAANSU......
    mediatype: image/png
  defaultChannel: stable
  channels:
  - name: stable
    entries:
    - name: amq-online.1.6.2
      skipRange: '>=1.5.4 <1.6.0'
    - name: amq-online.1.7.0
      replaces: amq-online.1.6.2
    - name: amq-online.1.7.1
    - name: amq-online.1.7.1-0.1628610187.p
      replaces: amq-online.1.7.0
      skips:
      - amq-online.1.7.1
    - name: amq-online.1.7.2
      replaces: amq-online.1.7.1-0.1628610187.p
```
This allows for a UX where kubectl and other existing tools can answer a lot of questions for users. For example, the command to see the channels available in a package would be
```bash=
$ kubectl get package amq-online -o yaml | grep entries -B 1
  - name: stable
    entries:
```
#### BundleMetadata resource
The `BundleMetadata` resource will be used to expose the bundle metadata for each bundle in the catalog. It will include the [fields](https://github.com/operator-framework/operator-registry/blob/master/alpha/declcfg/declcfg.go#L62-L65) present in an fbc `olm.bundle` object.
An example `BundleMetadata` resource will look like the following:
```yaml=
apiVersion: catalogd.operatorframework.io/v1beta1
kind: BundleMetadata
metadata:
  name: amq-online.1.6.2
  package: amq-online
  source: redhat-platform-operators
spec:
  image: quay.io/amq7/amq-online-controller-manager@sha256:debf79ef45e0fdd229bdfae29ff9a40926854278b39b4aa0d364f5ae9c02c6dc
  properties:
  - type: olm.deprecated
    value: {}
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: AddressPlan
      version: v1alpha1
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: AddressPlan
      version: v1beta1
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: AddressPlan
      version: v1beta2
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: AddressSpacePlan
      version: v1alpha1
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: AddressSpacePlan
      version: v1beta1
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: AddressSpacePlan
      version: v1beta2
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: AuthenticationService
      version: v1beta1
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: BrokeredInfraConfig
      version: v1alpha1
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: BrokeredInfraConfig
      version: v1beta1
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: ConsoleService
      version: v1beta1
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: StandardInfraConfig
      version: v1alpha1
  - type: olm.gvk
    value:
      group: admin.enmasse.io
      kind: StandardInfraConfig
      version: v1beta1
  - type: olm.gvk
    value:
      group: enmasse.io
      kind: Address
      version: v1beta1
  - type: olm.gvk
    value:
      group: enmasse.io
      kind: AddressSpace
      version: v1beta1
  - type: olm.gvk
    value:
      group: enmasse.io
      kind: AddressSpaceSchema
      version: v1beta1
  - type: olm.gvk
    value:
      group: user.enmasse.io
      kind: MessagingUser
      version: v1beta1
  - type: olm.maxOpenShiftVersion
    value: 4.8
  - type: olm.package
    value:
      packageName: amq-online
      version: 1.6.2
```
### Alternative Architecture
Using an aggregated apiserver (as opposed to pods exposing grpc endpoints) comes with added complexities. In that context, any bug that affects core functionality of the cluster (e.g. via a possible indirect performance degradation of the core kube-apiserver) has the potential of being escalated to an urgent-priority/blocker bug for OCP/OLM v1.
An alternative to using an aggregated apiserver would be to keep the current `registry pod` architecture, but expose an http server in those pods instead of grpc server endpoints (a simple http endpoint is sufficient to expose the [olm defined schemas](https://olm.operatorframework.io/docs/reference/file-based-catalogs/#olm-defined-schemas) of fbc). The current v0 architecture has multiple layers of APIs (e.g. the SQLite db/fbc API, which is wrapped by the [grpc endpoint APIs](https://github.com/operator-framework/operator-registry/blob/master/pkg/server/server.go#L21-L98), which in turn are processed by the v0 packageserver to facilitate the `PackageManifest` API), a consequence of the SQLite db being the storage mechanism in the initial implementation of `registry pods`. With fbc being a declarative API itself, there's an opportunity to work with the fbc API as the sole API for exposing catalog metadata on cluster.
A likely candidate is a query language like [graphQL](https://graphql.org), which has libraries to help [serve the content](https://graphql.org/learn/serving-over-http/) of fbc over an http endpoint. The http endpoint can then be queried for highly customizable responses that reduce resource consumption for clients: less network bandwidth is needed (since clients can ask for only the data they care about, instead of e.g. `listbundles` returning the entire content of every bundle each time), and fewer memory and CPU cycles are spent processing the responses.
A prototype showcasing how fbc can be served over an http endpoint, and how clients can customize the queries to that endpoint to get only the relevant content, can be found here: https://github.com/everettraven/catsrc-gql
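As a rough illustration of this idea (and not a description of the prototype above), the sketch below serves a pre-loaded list of fbc packages over a single GraphQL-over-HTTP endpoint, assuming the `github.com/graphql-go/graphql` library; the `/catalog` path, the `catalogPackage` type, and the in-memory `packages` slice are illustrative assumptions.
```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/graphql-go/graphql"
)

// catalogPackage is a pared-down, illustrative view of an fbc olm.package blob.
type catalogPackage struct {
	Name           string `json:"name"`
	DefaultChannel string `json:"defaultChannel"`
}

// packages stands in for content unpacked from a catalog image.
var packages = []catalogPackage{
	{Name: "amq-online", DefaultChannel: "stable"},
	{Name: "camel-k", DefaultChannel: "stable"},
}

func main() {
	packageType := graphql.NewObject(graphql.ObjectConfig{
		Name: "Package",
		Fields: graphql.Fields{
			"name":           &graphql.Field{Type: graphql.String},
			"defaultChannel": &graphql.Field{Type: graphql.String},
		},
	})

	schema, err := graphql.NewSchema(graphql.SchemaConfig{
		Query: graphql.NewObject(graphql.ObjectConfig{
			Name: "Query",
			Fields: graphql.Fields{
				// packages returns only the fields the client asks for,
				// instead of the whole catalog blob.
				"packages": &graphql.Field{
					Type: graphql.NewList(packageType),
					Resolve: func(p graphql.ResolveParams) (interface{}, error) {
						return packages, nil
					},
				},
			},
		}),
	})
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/catalog", func(w http.ResponseWriter, r *http.Request) {
		result := graphql.Do(graphql.Params{
			Schema:        schema,
			RequestString: r.URL.Query().Get("query"),
		})
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(result)
	})
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```
A client could then query, for example, `/catalog?query={packages{name}}` and receive only the package names, which is where the bandwidth and processing savings described above come from.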
However, this only switches out the grpc server that v0 utilizes for an http server, and although we can reduce the number of APIs needed end to end, as well as the resource consumption by clients, the architecture wrapping the binary would still be very similar:
* The v0 `CatalogSource` CRD would have to be re-released as the OLM v1 (`catalogd`) `CatalogSource` CRD. It would have an almost identical API surface to the v0 CatalogSource (i.e. it would have to carry over fields such as `updateStrategy`, and `grpcPodConfig` renamed to `httpPodConfig`, etc), with only a few modifications such as:
- The new CRD will be cluster scoped
For example, [the current Red Hat community operator catalog CR](https://github.com/operator-framework/operator-marketplace/blob/master/defaults/03_community_operators.yaml) would only have two changes: the `apiVersion` field, and (deletion of) the `metadata.namespace` field
```yaml=
apiVersion: "catalogd.operatorframework.io/v1beta1"
kind: "CatalogSource"
metadata:
  name: "community-operators"
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
spec:
  sourceType: grpc
  image: registry.redhat.io/redhat/community-operator-index:v4.12
  displayName: "Community Operators"
  publisher: "Red Hat"
  priority: -400
  updateStrategy:
    registryPoll:
      interval: 10m
  grpcPodConfig:
    nodeSelector:
      node-role.kubernetes.io/master: ""
      kubernetes.io/os: "linux"
    priorityClassName: "system-cluster-critical"
    tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: Exists
        effect: "NoSchedule"
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 120
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 120
```
Note that this CR has grown over time via multiple (significant-effort) updates ([1](https://github.com/operator-framework/operator-marketplace/commit/20337880ba6adb3a48a659a095e8726bfecd602b#diff-68d22dd5c803e04736b0ead56614426969ffd4ef66b27c3e8527a1c13220b0beR11-R13),[2](https://github.com/operator-framework/operator-marketplace/commit/cda16eb9d0b85e49ca3688fe8891c8a771faf06f)), caused by the problem, discussed above, of the `CatalogSource` controller creating and managing its own pods.
* All pods would have to be installed in a "global" namespace, most likely the `catalogd` namespace.
* Clients such as the Openshift console, deppy etc would have to connect to all of the pods in the "global" namespace by discovering all CatalogSource pod services (**a major breaking change for the console**). To allow unprivileged tenants to discover the operators available for installation on cluster, a kubectl plugin like https://github.com/operator-framework/kubectl-operator (`kubectl catalog`?) would have to be provided for the new CatalogSource CRs, allowing for the same UX mentioned in the original proposal (e.g. a package's channels could be listed using `kubectl catalog get channels -A | grep package:<package-name>`).
* The existing API surface exposed by `opm serve` would be simplified to only four new endpoints that merely expose the fbc schemas wholesale (without any post-processing), as sketched after this list:
- listBundles: exposes the `olm.bundle` objects
- listPackages: exposes the `olm.package` objects
- listChannels: exposes the `olm.channel` objects
- listCustomObjects: exposes all user-defined objects (non olm.* objects)
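A minimal sketch of what these endpoints could look like is shown below, assuming the fbc blobs have already been unpacked and loaded into memory as raw JSON objects keyed by their `schema` field; the handler, paths, and data-loading names are illustrative assumptions.
```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// blob is a single fbc object, kept as raw JSON plus its schema, so the
// server can expose it wholesale without any post-processing.
type blob struct {
	Schema string `json:"schema"`
	Raw    json.RawMessage
}

// catalog stands in for the fbc content unpacked from the catalog image.
var catalog []blob

// listHandler returns every fbc object whose schema matches (or, for the
// custom-objects endpoint, every object with a non-olm.* schema).
func listHandler(match func(schema string) bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		out := []json.RawMessage{}
		for _, b := range catalog {
			if match(b.Schema) {
				out = append(out, b.Raw)
			}
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(out)
	}
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/listPackages", listHandler(func(s string) bool { return s == "olm.package" }))
	mux.HandleFunc("/listChannels", listHandler(func(s string) bool { return s == "olm.channel" }))
	mux.HandleFunc("/listBundles", listHandler(func(s string) bool { return s == "olm.bundle" }))
	mux.HandleFunc("/listCustomObjects", listHandler(func(s string) bool {
		return s != "olm.package" && s != "olm.channel" && s != "olm.bundle"
	}))
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```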
Not including a custom apiserver like the olm packageserver will finally make the [inclusion of `olm.bundle` objects unnecessary](https://olm.operatorframework.io/docs/reference/file-based-catalogs/#olmbundleobject-alpha), which will help bring down the memory consumption of the registry pods.
However, this architecture is being proposed only as an alternative to the main proposed custom apiserver architecture, for several reasons:
* There is already a good number of production-grade custom apiservers running in the wild, including the olm v0 `packageserver`, which the maintainers of OLM already have experience with.
For example, Openshift currently deploys the following aggregated apiservers:
```
$ kubectl get apiservice | grep -v Local
NAME                               SERVICE                                                       AVAILABLE   AGE
v1.apps.openshift.io               openshift-apiserver/api                                       True        46m
v1.authorization.openshift.io      openshift-apiserver/api                                       True        46m
v1.build.openshift.io              openshift-apiserver/api                                       True        46m
v1.image.openshift.io              openshift-apiserver/api                                       True        46m
v1.oauth.openshift.io              openshift-oauth-apiserver/api                                 True        46m
v1.packages.operators.coreos.com   openshift-operator-lifecycle-manager/packageserver-service   True        48m
v1.project.openshift.io            openshift-apiserver/api                                       True        46m
v1.quota.openshift.io              openshift-apiserver/api                                       True        46m
v1.route.openshift.io              openshift-apiserver/api                                       True        46m
v1.security.openshift.io           openshift-apiserver/api                                       True        46m
v1.template.openshift.io           openshift-apiserver/api                                       True        46m
v1.user.openshift.io               openshift-oauth-apiserver/api                                 True        46m
v1beta1.metrics.k8s.io             openshift-monitoring/prometheus-adapter                       True        37m
```
* As mentioned above, creating and maintaining pods directly (without using workload objects like `Deployments`/`StatefulSets`) has been the root cause of significant maintenance work, and is likely to continue causing even more maintenance burden.
* Existing external clients of OLM that depend on resources exposed by the current `packageserver` (e.g. the Openshift console, the kubectl operator extension plugin, etc) will need to undergo minimal migration effort, thereby increasing the chance of faster OLM v1 adoption.
#### First Design meeting notes
https://docs.google.com/document/d/1jhZl5yRammoCqMqGh9NslwFEzzBmSX8mAMmzlbNU9ZA
## Comparisons of solutions
### Summary of problems we're trying to solve in OLM V1
* Creating and running (registry) pods ourselves has led to a lot of problems. We ideally need a way to serve the content to clients without running specialized pods. (henceforth referred to as the **pods** problem)
* We need a way to serve the content to clients like the Openshift console and deppy on cluster. (henceforth referred to as the **content availability on cluster** problem)
* We need a kube-native way for off-cluster clients (e.g. kubectl) to query for basic information about the content. (henceforth referred to as the **kube-native content exploration off cluster** problem)
* Optimization of resource consumption while serving the said content: we need to deliver all the content while using minimal memory/CPU/network. (henceforth referred to as the **cluster resource consumption** problem)
### Comparison of the two solutions proposed above
| Solution | Solves *pods* problem| Solves *content availability on cluster* problem | Solves *kube-native content exploration off cluster* problem | Solves *cluster resource consumption* problem | Promises ease of maintenance|
| -------- | -------- | -------- | ---------| ------------|------------|
| Aggregated API Service | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| pods with graphQL server http endpoints | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: |
## A (happy?) union of the two solutions to hit all the check boxes
Walking back from the problems we're trying to solve, with the help of the ideas discussed in the two solutions above, this section discusses one architecture that can hit all the marks.
If we focus on the checkboxes that are not being hit, a few design choices become apparent:
### Mark "Promises ease of maintenance"
An aggregated apiservice is not easy to maintain compared to the ubiquitous CRD+controller extension mechanism for Kubernetes. That becomes especially apparent when
a) we need to provide a way to store the fbc content on cluster without overloading the main etcd store of the cluster, and
b) any change in the fbc API needs plumbing through the CRDs that the aggregated apiservice is responsible for.
**Ideal design choice**: Have a controller that owns CRDs that can be divided into two categories.
###### Category I: (Cluster-scoped) API that users can use to indicate the location of the fbc. Think "here's the container/OCI artifact image that you can pull from a remote image registry (and unpack, in the case of a container image) to retrieve the fbc content from".
i.e. the `CatalogSource` CRD, for which a CR would look like:
```yaml=
apiVersion: catalogd.operatorframework.io/v1beta1
kind: CatalogSource
metadata:
  name: community-operators
spec:
  image: docker.io/operator-framework/community-operators:v4.14
  pollingInterval: 45m
  priority: 0
```
###### Category II: APIs that will be generated from the retrieved/unpacked fbc, to provide the basic information about the packages provided by the catalog that clients have already been conditioned to expect in a kube-native way.
Think `PackageManifests` (ideally renamed to `OperatorPackages`, to draw the analogy of "software packages" available for installation).
This gives clients like the Openshift console the same information they use today, in the exact same way they consume it today (i.e. they'd only need to switch out `client.List(&PackageManifests{})` with `client.List(&OperatorPackages{})`), as sketched below.
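For example, with a controller-runtime client, the console's list call could change roughly as follows. This is a minimal sketch; the `catalogdv1beta1` import path and the `OperatorPackageList` type name are assumptions for illustration.
```go
package main

import (
	"context"
	"fmt"
	"log"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"

	// Hypothetical API group for the catalogd-generated resources.
	catalogdv1beta1 "github.com/operator-framework/catalogd/api/v1beta1"
)

func main() {
	// Build a client from the local kubeconfig / in-cluster config.
	cfg, err := config.GetConfig()
	if err != nil {
		log.Fatal(err)
	}

	// Register the (hypothetical) catalogd types with the scheme.
	scheme := runtime.NewScheme()
	if err := catalogdv1beta1.AddToScheme(scheme); err != nil {
		log.Fatal(err)
	}
	c, err := client.New(cfg, client.Options{Scheme: scheme})
	if err != nil {
		log.Fatal(err)
	}

	// Today: c.List(ctx, &PackageManifestList{}) against the packageserver.
	// With catalogd: list the cluster-scoped OperatorPackage resources instead.
	var pkgs catalogdv1beta1.OperatorPackageList
	if err := c.List(context.TODO(), &pkgs); err != nil {
		log.Fatal(err)
	}
	for _, p := range pkgs.Items {
		fmt.Println(p.Name)
	}
}
```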
Users with just the `kubectl` CLI can query for the same basic information they can query for today, e.g.:
```bash
$ kubectl get operatorpackages
akka-cluster-operator              Community Operators   2m40s
pystol                             Community Operators   2m40s
cert-manager                       Community Operators   2m40s
percona-postgresql-operator        Community Operators   2m40s
camel-karavan-operator             Community Operators   2m40s
camel-k                            Community Operators   2m40s
ibmcloud-operator                  Community Operators   2m40s
self-node-remediation              Community Operators   2m40s
ack-opensearchservice-controller   Community Operators   2m40s
.
.
$ kubectl get operatorpackage camel-k -o yaml
apiVersion: packages.operators.coreos.com/v1
kind: PackageManifest
metadata:
  creationTimestamp: "2023-01-11T17:24:20Z"
  labels:
    catalog: operatorhubio-catalog
    catalog-namespace: olm
    operatorframework.io/arch.amd64: supported
    operatorframework.io/os.linux: supported
    provider: The Apache Software Foundation
    provider-url: ""
  name: camel-k
  namespace: default
spec: {}
status:
  catalogSource: operatorhubio-catalog
  catalogSourceDisplayName: Community Operators
  catalogSourceNamespace: olm
  catalogSourcePublisher: OperatorHub.io
  channels:
  - currentCSV: camel-k-operator.v1.6.1
    currentCSVDesc:
      annotations:
        .
        .
        .
```
**(Which means this also checks the "kube-native content exploration off cluster problem" mark)**
### Mark "pods problem"
We don't want to port over the problems of maintaining workload APIs to expose the fbc content on cluster.
But also, in order for the on-cluster clients to use the fbc content stored in, e.g., remote registries, we don't want to have to tell the clients to preface every request with "pull (IfNotPresent) and unpack onto the file system the content of this image" for the catalogs available to the cluster. This is especially important when we consider the fact that persistent disk storage is much cheaper than network operations. Think "I'm deppy and I've been operational for 3 days now, but you've asked me to pull (IfNotPresent) this new catalog image you want to install an operator from, and I have to fail because I can't reach the image registry right now."
That leads us to the fact that we need to store at least one copy of the fbc from each catalog somewhere "locally".
**Ideal design choice**: The v0 architecture does this by unpacking the SQLite db/fbc onto the cluster in each registry pod. But since we've come to associate those pods with nightmares, we could instead have the CatalogSource controller pod do the unpacking of the FBCs into its mounted filesystem path, for all the catalogs present on cluster. Those unpacked FBCs can then be exposed by the graphQL server on a dedicated port (besides the 8080 port that exposes the /healthz and /metrics endpoints), as sketched below.
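A rough reconcile sketch for that design is shown below, assuming a controller-runtime controller; the `CatalogSource` Go type, the `unpackImage` helper, and the `/var/cache/catalogs` mount path are illustrative assumptions.
```go
package controllers

import (
	"context"
	"path/filepath"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Hypothetical API package for the cluster-scoped CatalogSource CRD.
	catalogdv1beta1 "github.com/operator-framework/catalogd/api/v1beta1"
)

// catalogCacheDir is the filesystem path mounted into the controller pod
// where every catalog's fbc gets unpacked.
const catalogCacheDir = "/var/cache/catalogs"

type CatalogSourceReconciler struct {
	client.Client
	// Reload tells the in-process graphQL server (listening on its dedicated
	// port) to re-read the unpacked fbc directories.
	Reload func() error
}

func (r *CatalogSourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var catalog catalogdv1beta1.CatalogSource
	if err := r.Get(ctx, req.NamespacedName, &catalog); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Pull the index image and unpack its fbc directory into the controller's
	// local cache, instead of running a dedicated registry pod per catalog.
	dest := filepath.Join(catalogCacheDir, catalog.Name)
	if err := unpackImage(ctx, catalog.Spec.Image, dest); err != nil {
		return ctrl.Result{}, err
	}

	// Let the graphQL server pick up the new/updated content.
	if err := r.Reload(); err != nil {
		return ctrl.Result{}, err
	}

	// Re-pull on the catalog's polling interval to pick up new catalog content.
	return ctrl.Result{RequeueAfter: catalog.Spec.PollingInterval.Duration}, nil
}

// unpackImage is a placeholder for the image pull + unpack step that this
// document treats as out of scope (see the Non goals section).
func unpackImage(ctx context.Context, image, dest string) error {
	// ...
	return nil
}
```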
Clients that will use this endpoint to query for fbc content metadata:
a) Deppy (or an associated adapter that will create `EntitySources` from the responses): instead of querying the `listBundle` endpoints of multiple pods to get information about all the catalogs available on cluster, only one endpoint (exposed via a singular hardcoded service) from the controller pod would need to be queried. Using the graphQL server gives deppy (/the adapter) the luxury of customizing the query so that responses contain only the relevant information.
b) A kubectl plugin, for advanced use cases in the **kube-native content exploration off cluster** workflow. While an `OperatorPackage` (or `PackageManifest`) API is sufficient for basic discovery of packages on cluster, we've had requests from users for the ability to retrieve more refined results. An example of that is "I want to know all the channels available in package Foo, along with the head of each channel". This is then just a plugin sub-command that wraps "connect to the service + query the graphQL server", so that users don't have to do that manually (see the client-side sketch below).
**(Again checks the "kube-native content exploration off cluster problem" mark)**
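The client side of such a plugin sub-command could be as simple as the following sketch, which sends a GraphQL query to the controller's hardcoded service endpoint; the service URL, port, and query shape are illustrative assumptions.
```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

// catalogdEndpoint is the (assumed) singular, hardcoded service exposed by the
// CatalogSource controller pod for fbc queries on its dedicated port.
const catalogdEndpoint = "http://catalogd-service.catalogd.svc:8081/catalog"

func main() {
	// Ask only for the channels in package "foo" and the head of each channel:
	// the refined-query use case described above. The exact schema is assumed.
	query := `{ package(name: "foo") { channels { name head } } }`

	resp, err := http.Get(catalogdEndpoint + "?query=" + url.QueryEscape(query))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Print the (already filtered) response; no client-side post-processing of
	// full bundle blobs is needed.
	var result map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		log.Fatal(err)
	}
	out, _ := json.MarshalIndent(result, "", "  ")
	fmt.Println(string(out))
}
```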
### Architecture of the union

### Comparison of all three design choices
| Solution | Solves *pods* problem| Solves *content availability on cluster* problem | Solves *kube-native content exploration off cluster* problem | Solves *cluster resource consumption* problem | Promises ease of maintenance|
| -------- | -------- | -------- | ---------| ------------|------------|
| Aggregated API Service | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| pods with graphQL server http endpoints | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: |
| controller that owns the `CatalogSource` and `OperatorPackage` CRDs + a (graphQL) http server endpoint + a kubectl plugin | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |