---
title: (platform operators) 4.12 redhat-operators catalog audit
author: timflannagan
---
# Overview
Home for an initial audit of how many packages in the `redhat-operators` 4.12 catalog source can be successfully installed through the phase 0 platform operators framework.
## Summary (tl;dr)
Around 36% (25/69) of the current 4.12 redhat-operators catalog can be successfully installed through the phase 0 platform operators mechanism.
> Note: the findings didn't verify whether those 25 successful platform operator installations contained any runtime failures, which is a notable limitation in the current model. See the [note in the phase 0 EP for more information on this behavior](https://github.com/openshift/enhancements/blob/6e1697418be807d0ae567a9f83ac654a1fd0ee9a/enhancements/olm/platform-operators.md#reconcile-pom-aggregate-clusteroperator-object).
The failed platform operator installations can be bucketed into the following categories:
- Unpack failures when attempting to source registry+v1 bundle content.
- Runtime failures when attempting to persist the unpacked bundle contents to the cluster.
For the former category, there are restrictions on which OLM packages are valid to be installed as platform operators. Notably, platform operators must support the "AllNamespace" install mode, and their registry+v1 bundle contents cannot specify any APIServer or webhook definitions. This is important to note, as 100% of the platform operators that failed during the unpacking process failed due to one of those documented restrictions.
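As an illustration of those restrictions, the following sketch shows a pre-flight check one could run against a bundle's ClusterServiceVersion (CSV) before creating a PlatformOperator. The `check_phase0_restrictions` helper and the sample CSV fragment are hypothetical, but the two conditions mirror the documented phase 0 restrictions (note that the CSV spells the install mode type `AllNamespaces`, while the rukpak error messages say "AllNamespace"):

```bash
#!/bin/bash
# Hypothetical pre-flight check against a bundle's CSV manifest: flag the
# two documented phase 0 restrictions before creating a PlatformOperator.
# The sample CSV fragment below is illustrative, not from a real bundle.
check_phase0_restrictions() {
    local csv="$1"
    # Restriction 1: the AllNamespaces install mode must be supported.
    if ! grep -A1 'type: AllNamespaces' "$csv" | grep -q 'supported: true'; then
        echo "FAIL: AllNamespaces install mode is not enabled"
    fi
    # Restriction 2: webhook definitions are not supported.
    if grep -q 'webhookdefinitions:' "$csv"; then
        echo "FAIL: webhook definitions are not supported"
    fi
}

# Sample CSV fragment that violates both restrictions.
cat > /tmp/sample-csv.yaml <<'EOF'
spec:
  installModes:
  - type: OwnNamespace
    supported: true
  - type: AllNamespaces
    supported: false
  webhookdefinitions:
  - type: ValidatingAdmissionWebhook
EOF

check_phase0_restrictions /tmp/sample-csv.yaml
```

Running this against the sample fragment prints one FAIL line per violated restriction.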
<!-- TODO: explain why these restrictions are in place given it affects OLM 1.x -->
For the latter category, more investigation is warranted. A detailed log of those individual failures is shown below, and it's unclear whether these failures are easily reproducible right now, or whether they point to a larger problem with the core rukpak provisioner implementations.
Lastly, we gained minor visibility into the overall footprint of the phase 0 implementation. During steady state operations, the core rukpak provisioner consumed around 300-600Mi of memory. Further investigation into this memory consumption is warranted.
Introducing pprof profiling into rukpak is a [roadmap item](https://github.com/operator-framework/rukpak/issues/267) that still needs to be prioritized. Exposing this endpoint, along with introducing custom metrics, can help improve visibility into the internal controller operations over time. Extending this framework with third-party continuous profiling and/or observability tooling also still needs to be done.
Overall, the phase 0 implementation was able to roll out a large number of platform operators all at once, while correctly proxying individual platform operators' failing states back to the CVO.
## Investigation / In-depth Findings
Note: This section describes how to reproduce these findings using a simple bash script for the initial setup, along with a working document on the findings.
### Pre-requisites
- Install a 4.12 cluster that has the "TechPreviewNoUpgrade" feature set enabled
- Install the [grpcurl](https://github.com/fullstorydev/grpcurl) tooling
### Initial Setup
After a 4.12 tech preview enabled cluster has been spun up, create the following script locally:
```bash
$ cat bootstrap-platform-operators.sh
#!/bin/bash

set -o pipefail
set -o nounset
set -o errexit

FILE=packages.json

# Query the port-forwarded catalog source gRPC API for the list of
# package names, caching the result locally.
function registry_get_packages() {
    if [[ ! -f $FILE ]]; then
        grpcurl -plaintext localhost:50051 api.Registry/ListPackages | jq -r '.name' > "$FILE"
    fi
}

# Create a PlatformOperator resource for each package name in the file.
function create_platform_operators() {
    local file=$1

    while IFS="" read -r p || [ -n "$p" ]; do
        cat <<EOF | kubectl apply -f -
---
apiVersion: platform.openshift.io/v1alpha1
kind: PlatformOperator
metadata:
  name: $p
spec:
  package:
    name: $p
EOF
    done <"$file"
}

registry_get_packages
create_platform_operators "$FILE"
```
Before running that script, establish a local connection to the "redhat-operators" catalog source Kubernetes Service:
```bash=
$ kubectl -n openshift-marketplace port-forward svc/redhat-operators 50051:50051
Forwarding from 127.0.0.1:50051 -> 50051
Forwarding from [::1]:50051 -> 50051
...
```
And in another terminal, run the script detailed above to start creating an individual PlatformOperator resource per package listed in the redhat-operators catalog:
```bash=
$ chmod +x ./bootstrap-platform-operators.sh
$ ./bootstrap-platform-operators.sh
...
```
And optionally, add the "INSTALL STATE" custom printer column to the PlatformOperator CRD; inspecting the patched CRD shows the column definition:
```bash=
$ kubectl get crds platformoperators.platform.openshift.io -o yaml | faq -f yaml '.spec.versions[].additionalPrinterColumns'
- jsonPath: .status.conditions[?(@.type=="Installed")].reason
  name: Install State
  type: string
```
### In-depth Findings
First, wait until all the PlatformOperator resources have been created on the cluster:
> Note: Click the "details" tab for the full command output.
<details>
```bash=
$ kubectl get platformoperators
NAME INSTALL STATE
3scale-operator InstallSuccessful
advanced-cluster-management UnpackFailed
amq-online InstallSuccessful
amq-streams InstallSuccessful
amq7-interconnect-operator UnpackFailed
apicast-operator InstallSuccessful
aws-load-balancer-operator UnpackFailed
bamoe-businessautomation-operator UnpackFailed
bamoe-kogito-operator InstallSuccessful
businessautomation-operator UnpackFailed
cincinnati-operator UnpackFailed
cluster-logging UnpackFailed
compliance-operator InstallSuccessful
container-security-operator InstallSuccessful
costmanagement-metrics-operator UnpackFailed
cryostat-operator UnpackFailed
datagrid UnpackFailed
devspaces UnpackFailed
devworkspace-operator UnpackFailed
eap InstallSuccessful
elasticsearch-operator InstallFailed
external-dns-operator UnpackFailed
file-integrity-operator InstallSuccessful
fuse-apicurito UnpackFailed
fuse-console UnpackFailed
fuse-online UnpackFailed
gatekeeper-operator-product InstallSuccessful
integration-operator InstallSuccessful
jaeger-product UnpackFailed
jws-operator InstallSuccessful
kiali-ossm InstallSuccessful
kubevirt-hyperconverged UnpackFailed
loki-operator UnpackFailed
mcg-operator UnpackFailed
mtc-operator UnpackFailed
mtv-operator UnpackFailed
multicluster-engine UnpackFailed
node-healthcheck-operator InstallSuccessful
node-maintenance-operator UnpackFailed
node-observability-operator UnpackFailed
ocs-operator UnpackFailed
odf-csi-addons-operator InstallSuccessful
odf-lvm-operator UnpackFailed
odf-multicluster-orchestrator UnpackFailed
odf-operator UnpackFailed
odr-cluster-operator InstallSuccessful
odr-hub-operator InstallFailed
openshift-cert-manager-operator InstallSuccessful
openshift-custom-metrics-autoscaler-operator InstallSuccessful
openshift-gitops-operator InstallSuccessful
openshift-pipelines-operator-rh InstallSuccessful
openshift-secondary-scheduler-operator UnpackFailed
opentelemetry-product UnpackFailed
quay-bridge-operator UnpackFailed
quay-operator InstallSuccessful
red-hat-camel-k InstallSuccessful
redhat-oadp-operator UnpackFailed
rh-service-binding-operator UnpackFailed
rhacs-operator UnpackFailed
rhpam-kogito-operator InstallFailed
rhsso-operator UnpackFailed
sandboxed-containers-operator UnpackFailed
self-node-remediation UnpackFailed
serverless-operator UnpackFailed
service-binding-operator UnpackFailed
service-registry-operator InstallSuccessful
servicemeshoperator InstallSuccessful
skupper-operator InstallSuccessful
web-terminal InstallFailed
```
</details>
Looking at the above output, we can see that only a subset of the 69 packages in the redhat-operators catalog can be successfully installed:
```bash=
$ kubectl get platformoperators --no-headers | wc -l
69
...
$ kubectl get platformoperators | grep "InstallSuccessful" | wc -l
25
```
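For reference, the success rate quoted in the summary is simple arithmetic over those two counts:

```bash
# 25 successful installs out of 69 total PlatformOperator resources.
awk 'BEGIN { printf "%.0f%% (25/69)\n", 25 / 69 * 100 }'
# → 36% (25/69)
```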
Most of the platform operator resources that failed to install did so due to Bundle unpack failures at the rukpak layer:
> Note: The following output uses a patched Bundle CRD that has been extended with additional printer columns that aren't present in the upstream/downstream configurations.
```bash=
$ kubectl get bundles | grep -v "UnpackSuccessful"
NAME UNPACK STATE UNPACK MESSAGE TYPE PHASE AGE
advanced-cluster-management-624cwb UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 28m
amq7-interconnect-operator-d4w7wf UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 30m
aws-load-balancer-operator-v7cxff UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 29m
bamoe-businessautomation-operator-z46ccf UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 29m
businessautomation-operator-fx9f9d UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 25m
cincinnati-operator-72zz4w UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 24m
cluster-logging-f49c4f UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 25m
costmanagement-metrics-operator-9988xx UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 24m
cryostat-operator-vvwb4z UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 23m
datagrid-z28w45 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 29m
devspaces-wz9wz2 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 30m
devworkspace-operator-8b56b7 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 23m
external-dns-operator-b2588z UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 25m
fuse-apicurito-d5857b UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 27m
fuse-console-bz9d47 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 30m
fuse-online-4xx9zz UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 25m
jaeger-product-76z889 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 28m
kubevirt-hyperconverged-xxw94f UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 24m
loki-operator-cdc7b5 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 30m
mcg-operator-vzwd5b UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 25m
mtc-operator-x27bwd UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 24m
mtv-operator-vz4479 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 25m
multicluster-engine-8459d6 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 30m
node-maintenance-operator-zbb44w UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 27m
node-observability-operator-w87z44 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 25m
ocs-operator-c49bvw UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 27m
odf-lvm-operator-v7b722 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 24m
odf-multicluster-orchestrator-2f845c UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 27m
odf-operator-w5x745 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 24m
openshift-secondary-scheduler-operator-682b79 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 25m
opentelemetry-product-98c4c2 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 29m
quay-bridge-operator-7v58c4 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 27m
redhat-oadp-operator-dzx47f UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 24m
rh-service-binding-operator-d4ww48 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 24m
rhacs-operator-4c99v4 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 24m
rhsso-operator-cd7xwb UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 26m
sandboxed-containers-operator-f9424c UnpackFailed convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled image Failing 23m
self-node-remediation-z9wxcz UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 30m
serverless-operator-b8d5x6 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 28m
service-binding-operator-d4ww48 UnpackFailed convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported image Failing 51m
```
We can see that the majority of the Bundle unpack failures are due to explicit restrictions codified in the phase 0 implementation. These restrictions prevent a platform operator from declaring any webhook/APIServer definitions, and require that valid packages support the AllNamespace install mode. The rationale for these restrictions is outlined in the phase 0 EP.
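These failure buckets can also be tallied mechanically from the unpack messages. The snippet below is a sketch that operates on a two-line stand-in for the full `kubectl get bundles` listing; against a real capture of that output, the same pipeline would produce a count per restriction:

```bash
# Tally Bundle unpack-failure reasons. The sample file is a hypothetical
# two-line stand-in for a saved copy of the `kubectl get bundles` output.
cat > /tmp/bundles.txt <<'EOF'
advanced-cluster-management-624cwb   UnpackFailed   convert registry+v1 bundle to plain+v0 bundle: AllNamespace install mode must be enabled
datagrid-z28w45                      UnpackFailed   convert registry+v1 bundle to plain+v0 bundle: webhookDefiniions are not supported
EOF

# Strip everything up to the failure reason, then count unique reasons.
grep "UnpackFailed" /tmp/bundles.txt \
  | sed 's/.*plain+v0 bundle: //' \
  | sort | uniq -c | sort -rn
```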
Next, diving into the platform operators that were successfully sourced and unpacked by the POM component, but failed when persisting the unpacked Bundle contents at runtime:
<details>
```bash=
$ kubectl get bd $(kubectl get bundledeployments | grep "InstallFailed" | awk '{ print $1 }') -o yaml | faq -f yaml '.items[].status'
conditions:
- lastTransitionTime: "2022-10-24T18:23:34Z"
message: Successfully unpacked the elasticsearch-operator-fw28fx Bundle
reason: UnpackSuccessful
status: "True"
type: HasValidBundle
- lastTransitionTime: "2022-10-24T18:25:38Z"
message: 'rendered manifests contain a resource that already exists. Unable to continue
with install: Role "leader-election-role" in namespace "openshift-platform-operators"
exists and cannot be imported into the current release: invalid ownership metadata;
annotation validation error: key "meta.helm.sh/release-name" must equal "elasticsearch-operator":
current value is "file-integrity-operator"'
reason: InstallFailed
status: "False"
type: Installed
observedGeneration: 1
---
conditions:
- lastTransitionTime: "2022-10-24T18:25:39Z"
message: Successfully unpacked the odr-hub-operator-w6bwb7 Bundle
reason: UnpackSuccessful
status: "True"
type: HasValidBundle
- lastTransitionTime: "2022-10-24T18:28:03Z"
message: 'rendered manifests contain a resource that already exists. Unable to continue
with install: Namespace "openshift-dr-system" in namespace "" exists and cannot
be imported into the current release: invalid ownership metadata; annotation validation
error: key "meta.helm.sh/release-name" must equal "odr-hub-operator": current
value is "odr-cluster-operator"'
reason: InstallFailed
status: "False"
type: Installed
observedGeneration: 1
---
conditions:
- lastTransitionTime: "2022-10-24T18:24:04Z"
message: Successfully unpacked the rhpam-kogito-operator-4x4f45 Bundle
reason: UnpackSuccessful
status: "True"
type: HasValidBundle
- lastTransitionTime: "2022-10-24T18:26:03Z"
message: 'rendered manifests contain a resource that already exists. Unable to continue
with install: CustomResourceDefinition "kogitoinfras.rhpam.kiegroup.org" in namespace
"" exists and cannot be imported into the current release: invalid ownership metadata;
annotation validation error: key "meta.helm.sh/release-name" must equal "rhpam-kogito-operator":
current value is "bamoe-kogito-operator"'
reason: InstallFailed
status: "False"
type: Installed
observedGeneration: 1
---
conditions:
- lastTransitionTime: "2022-10-24T18:21:48Z"
message: Successfully unpacked the web-terminal-454zw8 Bundle
reason: UnpackSuccessful
status: "True"
type: HasValidBundle
- lastTransitionTime: "2022-10-24T18:22:16Z"
message: 'rendered manifests contain a resource that already exists. Unable to continue
with install: Namespace "openshift-operators" in namespace "" exists and cannot
be imported into the current release: invalid ownership metadata; label validation
error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation
validation error: missing key "meta.helm.sh/release-name": must be set to "web-terminal";
annotation validation error: missing key "meta.helm.sh/release-namespace": must
be set to "openshift-platform-operators"'
reason: InstallFailed
status: "False"
type: Installed
observedGeneration: 1
```
</details>
Those installation failures warrant further investigation. At a glance, most of them stem from rukpak using helm as its underlying installation engine: helm relies on ownership metadata to distinguish resources belonging to the current chart release from pre-existing ones, and refuses to adopt resources that carry another release's metadata (or none at all).
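Concretely, the ownership metadata that helm's adoption check expects can be reconstructed from the error messages above; an existing resource is only adopted when it carries the following label and annotations (the release name here is a placeholder):

```yaml
metadata:
  labels:
    app.kubernetes.io/managed-by: Helm
  annotations:
    meta.helm.sh/release-name: <bundle-deployment-release-name>
    meta.helm.sh/release-namespace: openshift-platform-operators
```

This explains the collisions seen above: a shared resource such as the `openshift-operators` Namespace or a common `leader-election-role` Role can only carry one release's metadata at a time, so the second platform operator to claim it fails.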
Next, we'll dive into the state of the "platform-operators-aggregated" ClusterOperator used to proxy individual platform operator state back to the CVO:
```bash=
$ kubectl get co platform-operators-aggregated
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
platform-operators-aggregated 4.12.0-0.nightly-2022-10-24-103753 False True False 14m [encountered the failing quay-bridge-operator platform operator with reason "UnpackFailed", encountered the failing rhpam-kogito-operator platform operator with reason "InstallFailed", encountered the failing mtv-operator platform operator with reason "UnpackFailed", encountered the failing devworkspace-operator platform operator with reason "UnpackFailed", encountered the failing serverless-operator platform operator with reason "UnpackFailed", encountered the failing self-node-remediation platform operator with reason "UnpackFailed", encountered the failing bamoe-businessautomation-operator platform operator with reason "UnpackFailed", encountered the failing advanced-cluster-management platform operator with reason "UnpackFailed", encountered the failing elasticsearch-operator platform operator with reason "InstallFailed", encountered the failing odf-lvm-operator platform operator with reason "UnpackFailed", encountered the failing cryostat-operator platform operator with reason "UnpackFailed", encountered the failing odf-multicluster-orchestrator platform operator with reason "UnpackFailed", encountered the failing fuse-online platform operator with reason "UnpackFailed", encountered the failing mcg-operator platform operator with reason "UnpackFailed", encountered the failing odf-operator platform operator with reason "UnpackFailed", encountered the failing amq7-interconnect-operator platform operator with reason "UnpackFailed", encountered the failing web-terminal platform operator with reason "InstallFailed", encountered the failing ocs-operator platform operator with reason "UnpackFailed", encountered the failing rhsso-operator platform operator with reason "UnpackFailed", encountered the failing rh-service-binding-operator platform operator with reason "UnpackFailed", encountered the failing fuse-apicurito platform operator with reason "UnpackFailed", encountered the 
failing mtc-operator platform operator with reason "UnpackFailed", encountered the failing sandboxed-containers-operator platform operator with reason "UnpackFailed", encountered the failing node-observability-operator platform operator with reason "UnpackFailed", encountered the failing redhat-oadp-operator platform operator with reason "UnpackFailed", encountered the failing odr-hub-operator platform operator with reason "InstallFailed", encountered the failing aws-load-balancer-operator platform operator with reason "UnpackFailed", encountered the failing opentelemetry-product platform operator with reason "UnpackFailed", encountered the failing jaeger-product platform operator with reason "UnpackFailed", encountered the failing node-maintenance-operator platform operator with reason "UnpackFailed", encountered the failing businessautomation-operator platform operator with reason "UnpackFailed", encountered the failing cincinnati-operator platform operator with reason "UnpackFailed", encountered the failing multicluster-engine platform operator with reason "UnpackFailed", encountered the failing loki-operator platform operator with reason "UnpackFailed", encountered the failing fuse-console platform operator with reason "UnpackFailed", encountered the failing openshift-secondary-scheduler-operator platform operator with reason "UnpackFailed", encountered the failing cluster-logging platform operator with reason "UnpackFailed", encountered the failing rhacs-operator platform operator with reason "UnpackFailed", encountered the failing datagrid platform operator with reason "UnpackFailed", encountered the failing service-binding-operator platform operator with reason "UnpackFailed", encountered the failing costmanagement-metrics-operator platform operator with reason "UnpackFailed", encountered the failing kubevirt-hyperconverged platform operator with reason "UnpackFailed", encountered the failing devspaces platform operator with reason "UnpackFailed", 
encountered the failing external-dns-operator platform operator with reason "UnpackFailed"]
```
We can see that the ClusterOperator is reporting an unavailable status, which will prevent the cluster from being upgraded by the CVO. The populated status condition message is fairly verbose given the number of individual platform operator failures.
And lastly, a snapshot into the steady state footprint for the phase 0 stack:
```bash=
$ kubectl -n openshift-platform-operators top pod
NAME CPU(cores) MEMORY(bytes)
platform-operators-controller-manager-7d4bdc975b-zc4b5 2m 59Mi
platform-operators-rukpak-core-844f9f848f-shl2g 24m 304Mi
platform-operators-rukpak-webhooks-5b95674b99-8mmhn 0m 19Mi
platform-operators-rukpak-webhooks-5b95674b99-9g2sq 0m 19Mi
```
At a glance, the bundle cache accounted for only a fraction of that memory usage:
```bash=
$ du -h /var/cache/
728K /var/cache/bundles
0 /var/cache/uploads
728K /var/cache/
```
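Some quick arithmetic shows how small that cache is relative to the rukpak core pod's resident memory (304Mi at the time of the snapshot):

```bash
# 728Ki of bundle cache vs. 304Mi (304 * 1024 Ki) of resident memory.
awk 'BEGIN { printf "%.1f%%\n", 728 / (304 * 1024) * 100 }'
# → 0.2%
```

In other words, well over 99% of the provisioner's memory usage is unaccounted for by the bundle cache.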
Introducing pprof (i.e. exposing a /debug/pprof endpoint) should help improve visibility into the internal controller state. Using this in combination with third-party continuous profiling tooling (e.g. parca) can help us get a better feel for what's happening during QPS spikes and steady state behavior, and how to make the implementation more robust. Initial findings may point towards the dynamic watch implementation used throughout the rukpak provisioners, or the usage of helm's installation engine under the hood, as the cause of the non-bundle-cache memory usage. Evaluating CVO-based polling instead of a dynamic watch implementation is a known roadmap item.
Another thing worth calling out is that the current rukpak Bundle controllers unpack bundle contents asynchronously using a Kubernetes Pod. After the desired bundle contents have been successfully unpacked, exposed through the Pod's log sub-resource, and persisted in an in-memory cache that clients can access, the Pod sticks around in the "Completed" state:
```bash=
$ kubectl -n openshift-platform-operators get pods --no-headers | grep Completed | wc -l
69
```
This design may have scaling implications: the number of completed Pods grows without bound over time as new bundles are rolled out during pivoting events. This behavior should be revisited as rukpak works towards a GA release.