---

**do not promote this to a publicly-available KCS. This workaround requires an SE and active collaboration to prosecute**

---

# Workaround procedure

*Thesis*: OLM has a problem with the operator upgrade procedure, in that it attempts to list every CR on the cluster before it evaluates each one against the proposed CRD. A fix is being propagated, but customers have also requested some significant support exceptions to remediate in the meantime. Instead of increasing the general KubeApiServer request-timeout to 5 minutes (which promises knock-on effects related to scale), we propose a set of manual steps that replace the automatic CRD compatibility checking OLM normally performs. This should be viewed as a stopgap procedure until [OCPBUGS-35358](https://issues.redhat.com/browse/OCPBUGS-35358) is backported to the OCP versions in customers' range.

# Format and Conventions

The sequence of steps to perform is represented as a numbered list. Where possible, an example snippet is provided which illustrates _one possible approach_ to performing the step. Actual steps can be informed by the illustration, but require a collaborative session with the customer to determine the appropriate details.

Note: All examples presuppose a fictional scenario where the `argocd-operator` has been installed, and a sufficient number of `Application` CR instances have been created in the `openshift-operators` namespace that a non-chunking list of that resource would time out at the KubeApiServer. This **deliberately does not align** with any known customer case, to prevent blind re-use of the examples without consideration of data retention policies and existing workloads in the cluster.

# Resolution Steps

1. (Optional) back up all CRDs/CRs related to the operator
   1. CRDs
      ```sh!
      for crd in $(oc get crd | grep argo | awk '{print $1}'); do
          oc get crd $crd -o yaml > $crd-crd.yaml
      done
      ```
   2. CRs
      ```sh!
      for app in $(oc get application -n openshift-operators --no-headers | awk '{print $1}'); do
          oc get application -n openshift-operators $app -o yaml > $app-application.yaml
      done
      ```
2. perform a manual audit of the delta between the old and new CRDs to ensure compatibility (see the `dyff` sketch after this list)
   - a tool like [dyff](https://github.com/homeport/dyff) or the very alpha [crd-upgrade-checker](https://github.com/perdasilva/crd-upgrade-checker) can help
   - **Note:** the manual workaround skips the CRD validation, so the customer needs to be comfortable that there are no incompatibilities
3. delete the old CSV, InstallPlans, and Subscriptions related to the old operator version ([link](https://docs.openshift.com/dedicated/operators/understanding/olm/olm-understanding-olm.html#olm-installplan_olm-understanding-olm)) (NB: this will not impact workloads)
   ```sh!
   #!/bin/bash
   NAMESPACE="${1}"
   INSTALL_PLAN="${2}"
   CURRENT_UNIX_TIME=$(date +%s)
   OUTPUT_DIR="backup-${CURRENT_UNIX_TIME}"

   mkdir -p $OUTPUT_DIR

   # Iterate over every CRD referenced by the InstallPlan and back up its instances
   for crd in $(oc get installplan $INSTALL_PLAN -n $NAMESPACE -o jsonpath='{range .status.plan[?(@.resource.kind=="CustomResourceDefinition")]}{.resource.name}{"\n"}{end}'); do
       echo "Processing CRD: $crd"

       # Check if the CRD exists
       if ! oc get crd "$crd" &> /dev/null; then
           echo "CRD $crd does not exist."
           continue
       fi

       # Get the scope of the CRD
       scope=$(oc get crd "$crd" -o jsonpath='{.spec.scope}')

       if [ "$scope" == "Namespaced" ]; then
           oc get "$crd" --all-namespaces --no-headers | while read -r row; do
               namespace="$(echo $row | awk '{ print $1 }')"
               name="$(echo $row | awk '{ print $2 }')"
               oc get "${crd}" "${name}" -n "${namespace}" -o yaml > "${OUTPUT_DIR}/${crd}-${namespace}-${name}.yaml"
           done
       else
           oc get "$crd" --no-headers | while read -r row; do
               name=$(echo $row | awk '{ print $1 }')
               oc get "${crd}" "${name}" -o yaml > "${OUTPUT_DIR}/${crd}-${name}.yaml"
           done
       fi
   done
   ```
4. re-create the Subscription(s) in manual approval mode, e.g.:
   ```yaml=
   apiVersion: operators.coreos.com/v1alpha1
   kind: Subscription
   metadata:
     name: argocd-operator
     namespace: openshift-operators
   spec:
     channel: alpha
     installPlanApproval: Manual
     name: argocd-operator
     source: redhat-operators
     sourceNamespace: openshift-marketplace
   ```
5. manually apply the new CRDs
6. manually update the installed-alongside annotation on the new CRDs to point to the new CSV
   ```sh=
   for crd in $(oc get crd | grep argo | awk '{ print $1 }'); do
       oc annotate --overwrite crd/$crd operatorframework.io/installed-alongside-7b166f3de2426153=openshift-operators/argocd-operator.v0.12.0
   done
   ```
7. remove all CRDs from the InstallPlan that is created as a result of the new Subscription, so the upgrade to the new operator version can be prosecuted without OLM re-processing the manually-applied CRDs
   1. drop the CRD entries from the target object
      ```sh!
      oc get installplan <install_plan_name> -n <namespace> -o json | jq '.status.plan |= map(select(.resource.group != "apiextensions.k8s.io" or .resource.kind != "CustomResourceDefinition"))' > new-ip.json
      ```
   2. write the updated InstallPlan back through its `status` subresource
      ```sh!
      curl -k -X PUT \
        -H "Authorization: Bearer $(oc whoami --show-token)" \
        -H "Content-Type: application/json" \
        --data-binary @new-ip.json \
        $(oc whoami --show-server)/apis/operators.coreos.com/v1alpha1/namespaces/<namespace>/installplans/<install_plan_name>/status
      ```
8. approve the InstallPlan
   ```sh
   oc -n openshift-operators patch installplan <install_plan_name> -p '{"spec":{"approved":true}}' --type merge
   ```
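For the compatibility audit in step 2, one possible approach is to diff the CRD currently served by the cluster against the candidate CRD extracted from the new bundle (see the Snippets appendix). A minimal sketch, using the fictional argocd scenario; the file names are hypothetical and `applications-crd-new.yaml` is assumed to have been extracted separately:

```sh!
# dump the currently-installed CRD from the cluster
oc get crd applications.argoproj.io -o yaml > applications-crd-old.yaml

# applications-crd-new.yaml is assumed to come from the new bundle (see Snippets);
# dyff prints a structured delta between the two documents
dyff between applications-crd-old.yaml applications-crd-new.yaml
```

Pay particular attention to removed or no-longer-served versions and tightened validation schemas, since those are the kinds of incompatibilities the skipped OLM check would normally surface.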
# Appendices

## Reproducer Scenario

1. save out the `cr-spammer.sh` script below
2. cluster-bot something 4.15 or earlier (since it doesn't have the fix in it)
3. in the config, install the argocd community operator from the catalog (v0.9.1)
4. `oc create namespace argo-test`
5. run the script; at around 300K applications we should start getting apiserver timeout errors (NB: this script will beat your laptop up if you're not careful! Recommend a parallel rate of 25-50 for a buildup rate of 90K+/hr.):
   ```sh
   cr-spammer.sh --start 1 --finish 300000 --parallel 25
   ```
6. we get an error from the apiserver when we attempt a chunk-less list of the CRs: `oc get applications -A -o json --chunk-size=0 > log1`
   ```sh!
   Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get applications.argoproj.io)
   ```
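For contrast, it can be worth confirming that a paginated list of the same resource still completes; a minimal sketch, assuming the default chunk size of 500:

```sh
# a chunked (paginated) list should still succeed while the chunk-less list
# above times out, and gives a rough count of the reproducer's Applications
oc get applications -n openshift-operators --chunk-size=500 -o name | wc -l
```

This helps distinguish the OLM symptom (a single, unpaginated list exceeding the apiserver request timeout) from a general apiserver problem.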
The `cr-spammer.sh` script referenced in step 1 of the reproducer:

```bash=
#!/usr/bin/env bash
# cr-spammer.sh
# before running this script
# 1 - start a cluster-bot 4.15 or earlier (since fix is landing in 4.16 now)
# 2 - install the argocd operator from community catalog
# 3 - create the namespace `argo-test`

# Namespace where the Application resources will be created
NAMESPACE="openshift-operators"
DEST_NAMESPACE="argo-test"
PROJECT_NAME="argo-proj"
DEST_NAME="test"

# Number of Applications to create (sequence end; override with --finish)
TOTAL_APPLICATIONS=100000

# Number of parallel jobs
PARALLEL_JOBS=50

# sequence start (default; override with --start)
FLOOR=1

# Function to generate a unique application YAML and apply it
create_application() {
    local app_name=$1
    cat <<EOF | oc apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ${app_name}
  namespace: ${NAMESPACE}
spec:
  destination:
    name: ${DEST_NAME}
    namespace: ${DEST_NAMESPACE}
  project: ${PROJECT_NAME}
  source:
    repoURL: https://github.com/jianzhangbjz/learn-operator/tree/master/manifests
EOF
}

usage() {
    echo "$0 [OPTIONS] where OPTIONS is one/more of"
    echo "  --start    -s  the application number to start at"
    echo "  --parallel -p  the number of parallel jobs to run at one time"
    echo "  --finish   -f  the application number to end at"
    exit 255
}

while [ ! -z "$1" ]; do
    case "$1" in
        --start|-s)
            shift
            FLOOR=$1
            ;;
        --parallel|-p)
            shift
            PARALLEL_JOBS=$1
            ;;
        --finish|-f)
            shift
            TOTAL_APPLICATIONS=$1
            ;;
        *)
            usage
            ;;
    esac
    shift
done

# Main loop to create the specified number of applications in parallel
count=0
for i in $(seq -w $FLOOR $TOTAL_APPLICATIONS); do
    app_name="example3-${i}"
    echo "Creating Application: ${app_name}"
    create_application "${app_name}" &
    count=$((count + 1))

    # If we've reached the parallel jobs limit, wait for all jobs to complete
    if [[ $count -ge $PARALLEL_JOBS ]]; then
        wait
        count=0
    fi
done

# Wait for any remaining background jobs to complete
wait

DELTA=$((TOTAL_APPLICATIONS - FLOOR))
echo "Successfully created $DELTA Applications [$FLOOR .. $TOTAL_APPLICATIONS]."
```

## Snippets

- extract operator-specific content from a RH catalog image and stash it to a file:
  ```sh!
  ~/devel/operator-registry/bin/opm render -o yaml registry.redhat.io/redhat/community-operator-index:v4.16 | yq 'select((.schema == "olm.package" and .name == "argocd-operator") or (.package == "argocd-operator"))' | tee /tmp/arcd-operator-index.yaml
  ```
- pull the CRDs from a catalog `opm render`-ed out as yaml and decode them:
  ```sh!
  yq 'select(.schema == "olm.bundle").properties[] | select(.type == "olm.bundle.object").value.data |= @base64d' /tmp/arcd-operator-index.yaml > /tmp/argocd-crds.yaml
  ```
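- (sketch) confirm a manually-applied CRD from step 5 is accepted and served before approving the InstallPlan; the resource name below is from the fictional argocd scenario, not a customer cluster:
  ```sh!
  # Established=True means the apiextensions controller has accepted the new
  # CRD and is serving it, so the upgrade can proceed without OLM's check
  oc wait --for condition=established --timeout=60s crd/applications.argoproj.io
  ```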