[WIP] Platform Operators Strawman

Summary

Jira Epic - https://issues.redhat.com/browse/OLM-2513

As described in the issue above, the PlatformOperators (PO) concept is being built with the intention of doing two things:

Reducing the core installation size of OpenShift
Allowing customers to have more control over what is installed

This document serves as a place to direct discussion, questions, design, and ultimately consensus around PlatformOperators. For more information about requirements, constraints, and scenarios, please reference the Jira epic.

Design

Content Sourcing

When creating a PlatformOperator, a content source is required in order to determine what is being unpacked. Currently, this content source takes the form of a CatalogSource: an image that, when run, serves Bundles over a gRPC connection. A Bundle is a collection of manifests coupled with metadata that a consumer can install. A PlatformOperator (PO) is a consumer of these Bundles. To that end, there are two Bundle formats to be aware of:

registry+v1 - The format used by legacy OLM. It contains legacy OLM-specific manifests and metadata, such as the ClusterServiceVersion (CSV).
plain+v0 - A format containing regular Kubernetes manifests. It is somewhat immature, as its metadata has not been solidified.

Discussion on how to handle these two Bundle formats is ongoing and is covered extensively in this document. The core experience, however, is that a user creates a PlatformOperator that points to a Bundle from some content source (likely a CatalogSource). The PlatformOperator controller retrieves that Bundle and stamps out its content into a BundleInstance that carries an owner reference back to the PlatformOperator. This ensures that deleting a PlatformOperator results in all associated resources being deleted.
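
The ownership relationship described above can be sketched as follows. This is a minimal illustration that builds plain manifest dictionaries, not the controller implementation. The API group/version strings, the spec field bundleName, and the helper bundle_instance_for are assumptions made for illustration; only the ownerReferences fields are standard Kubernetes object metadata.

```python
from typing import Any, Dict

# Assumed group/version for the generated BundleInstance (illustration only).
BUNDLE_INSTANCE_API_VERSION = "core.rukpak.io/v1alpha1"


def bundle_instance_for(po: Dict[str, Any], bundle_name: str) -> Dict[str, Any]:
    """Build a BundleInstance manifest owned by the given PlatformOperator.

    Hypothetical helper: the real controller may shape the object differently.
    """
    meta = po["metadata"]
    # Standard Kubernetes owner reference. The garbage collector removes the
    # dependent object once the owner it points to has been deleted.
    owner_ref = {
        "apiVersion": po["apiVersion"],
        "kind": po["kind"],
        "name": meta["name"],
        "uid": meta["uid"],
        "controller": True,
        "blockOwnerDeletion": True,
    }
    return {
        "apiVersion": BUNDLE_INSTANCE_API_VERSION,
        "kind": "BundleInstance",
        "metadata": {"name": meta["name"], "ownerReferences": [owner_ref]},
        # Assumed spec layout: reference the unpacked Bundle by name.
        "spec": {"bundleName": bundle_name},
    }


if __name__ == "__main__":
    # Example PlatformOperator; the name, uid, and apiVersion are made up.
    platform_operator = {
        "apiVersion": "platform.openshift.io/v1alpha1",
        "kind": "PlatformOperator",
        "metadata": {"name": "example-operator", "uid": "0000-1111"},
    }
    print(bundle_instance_for(platform_operator, "example-bundle"))
```

Because the BundleInstance records the PlatformOperator's uid in its owner reference, deleting the PlatformOperator is enough for Kubernetes garbage collection to clean up the BundleInstance.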

Figure 1

Discussion Topics

What's the use case for legacy OLM when the marketplace component is disabled?

Users provide their own catalog content.

What's the use case for legacy OLM to be installed through this PO mechanism?

Better support for resource-constrained environments, e.g. edge devices and cluster variants like MicroShift.

What's the use case for removing a PO in day 2 operations?

Without support for removing an operator after cluster installation, cluster administrators get a poor UX: they would need to spin down the cluster and reinstall it using a new openshift-installer configuration.

Can we introduce "internal" APIs into the core payload?

https://github.com/openshift/openshift-docs/pull/41018#issuecomment-1027327520

If there is something we want to be an internal API, we make it v1alpha1 and then never plan to promote it. However, we still must not break people who upgrade their clusters, and we must handle migration as we evolve the API.

What happens if another operator in the catalog has a dependency on a platform operator?

We expect OLM to be aware of all the operators installed, whether or not they are installed and managed as PlatformOperators. Ultimately, they are still OLM operators that are installed and running on the cluster. To do this, OLM will need to grow a central registry of which operators (potential dependencies) are installed on a cluster.
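
One way to picture such a registry is the minimal in-memory sketch below. It is not a proposed design: the class names, the managed_by label, and the operator names in the example are all hypothetical, and a real registry would presumably be backed by cluster state rather than a Python dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstalledOperator:
    name: str
    # Hypothetical label recording which mechanism installed the operator,
    # e.g. "PlatformOperator" or "OLM".
    managed_by: str


@dataclass
class OperatorRegistry:
    """Single view of every operator on the cluster, however it was installed."""

    installed: Dict[str, InstalledOperator] = field(default_factory=dict)

    def record(self, operator: InstalledOperator) -> None:
        # Later records for the same name replace earlier ones.
        self.installed[operator.name] = operator

    def missing_dependencies(self, required: List[str]) -> List[str]:
        # A dependency is satisfied if any installed operator has that name,
        # regardless of whether it is managed as a PlatformOperator.
        return [name for name in required if name not in self.installed]


if __name__ == "__main__":
    registry = OperatorRegistry()
    registry.record(InstalledOperator("example-platform-op", "PlatformOperator"))
    registry.record(InstalledOperator("example-olm-op", "OLM"))
    # An operator from the catalog that depends on a platform operator
    # and on something that is not installed yet.
    print(registry.missing_dependencies(["example-platform-op", "example-other-op"]))
```

The point of the sketch is that dependency resolution consults one list covering both installation paths, so a dependency on a platform operator is treated like any other installed dependency.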
