--- **do not promote this to a publicly-available KCS. This workaround requires an SE and active collaboration to prosecute** ---
# Workaround procedure
*Thesis*: OLM has a problem with the operator upgrade procedure: it attempts to list all CRs on a cluster before it evaluates each one against the proposed CRD. A fix is being propagated, but customers have also requested some fairly significant support exceptions to remediate in the meantime.
Instead of increasing the general KubeApiServer request-timeout to 5 minutes (which promises knock-on effects related to scale), we propose a set of manual steps to replace the automatic CRD compatibility checking that OLM normally performs. This should be viewed as a stopgap procedure until [OCPBUGS-35358](https://issues.redhat.com/browse/OCPBUGS-35358) is backported to OCP versions in customers' range.
# Format and Conventions
The sequence of steps to perform is represented as a numbered list. When possible, an example snippet is provided which illustrates _one possible approach_ to performing the step. Actual steps can be informed by the illustration, but require a collaborative session with the customer to determine the appropriate details.
Note: All examples presuppose a fictional scenario where the `argocd-operator` has been installed, and where enough `Application` CR instances have been created in the `openshift-operators` namespace that a non-chunked list of that resource would time out at the KubeApiServer. This **deliberately does not align** with any known customer case, to prevent blind re-use of the examples without consideration of data retention policies and existing workloads in the cluster.
# Resolution Steps
1. (Optional) backup all CRDs/CRs related to operator
1. CRDs
```sh!
for crd in $(oc get crd | grep argo | awk '{print $1}'); do
  oc get crd $crd -o yaml > $crd-crd.yaml
done
```
2. CRs
```sh!
for app in $(oc get application -n openshift-operators --no-headers | awk '{print $1}'); do
  oc get application -n openshift-operators $app -o yaml > $app-application.yaml
done
```
2. perform a manual audit of the old/new CRDs' delta to ensure compatibility
    - a tool like [dyff](https://github.com/homeport/dyff) can help, as can an early-stage (alpha) utility called [crd-upgrade-checker](https://github.com/perdasilva/crd-upgrade-checker); see the example below
    - **Note:** the manual workaround skips OLM's CRD validation, so the customer needs to be comfortable that there are no incompatibilities between the old and new CRD versions
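One possible dyff invocation, assuming the old CRD was backed up in step 1 and the new CRD has been pulled from the catalog (see the Snippets appendix); the file names here are illustrative only:
```sh!
# Illustrative file names: the first was produced by the step-1 backup,
# the second extracted from the new catalog bundle.
dyff between applications.argoproj.io-crd.yaml new-applications.argoproj.io-crd.yaml
```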
3. delete the old CSV, InstallPlans, and Subscriptions related to the old operator version [link](https://docs.openshift.com/dedicated/operators/understanding/olm/olm-understanding-olm.html#olm-installplan_olm-understanding-olm) (NB: this will not impact workloads); the script below backs up all CR instances referenced by an InstallPlan's CRDs before anything is deleted, and a deletion sketch follows it
```sh!
#!/bin/bash
# Back up every CR instance referenced by the given InstallPlan's CRDs
# before the CSV/InstallPlan/Subscription are deleted.
NAMESPACE="${1}"
INSTALL_PLAN="${2}"
CURRENT_UNIX_TIME=$(date +%s)
OUTPUT_DIR="backup-${CURRENT_UNIX_TIME}"
mkdir -p "$OUTPUT_DIR"
for crd in $(oc get installplan "$INSTALL_PLAN" -n "$NAMESPACE" -o jsonpath='{range .status.plan[?(@.resource.kind=="CustomResourceDefinition")]}{.resource.name}{"\n"}{end}'); do
  echo "Processing CRD: $crd"
  # Check if the CRD exists
  if ! oc get crd "$crd" &> /dev/null; then
    echo "CRD $crd does not exist."
    continue
  fi
  # Get the scope of the CRD
  scope=$(oc get crd "$crd" -o jsonpath='{.spec.scope}')
  if [ "$scope" == "Namespaced" ]; then
    oc get "$crd" --all-namespaces --no-headers | while read -r row; do
      namespace="$(echo $row | awk '{ print $1 }')"
      name="$(echo $row | awk '{ print $2 }')"
      oc get "${crd}" "${name}" -n "${namespace}" -o yaml > "${OUTPUT_DIR}/${crd}-${namespace}-${name}.yaml"
    done
  else
    oc get "$crd" --no-headers | while read -r row; do
      name=$(echo $row | awk '{ print $1 }')
      oc get "${crd}" "${name}" -o yaml > "${OUTPUT_DIR}/${crd}-${name}.yaml"
    done
  fi
done
```
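Once the backups are confirmed, the deletion itself might look like the sketch below. The Subscription name and CSV version are taken from the fictional argocd-operator scenario and are assumptions; confirm the real names with the customer before removing anything.
```sh!
# Illustrative only: the names below belong to the fictional argocd-operator
# scenario, not to any real customer cluster.
oc delete subscription argocd-operator -n openshift-operators
oc delete csv argocd-operator.v0.9.1 -n openshift-operators
oc delete installplan <install_plan_name> -n openshift-operators
```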
4. re-create subscriptions in manual approval mode, e.g.:
```yaml=
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: argocd-operator
  namespace: openshift-operators
spec:
  channel: alpha
  installPlanApproval: Manual
  name: argocd-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```
5. manually apply the new CRDs, e.g. as sketched below
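A minimal sketch, assuming the new CRD manifests have already been saved locally (for example, extracted and decoded from the catalog as in the Snippets appendix); the `new-crds/` directory is purely illustrative:
```sh!
# Hypothetical layout: one manifest per CRD under ./new-crds/
for crd_file in new-crds/*.yaml; do
  oc apply -f "$crd_file"
done
```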
6. manually update the installed-alongside annotation on the new CRDs to point to the new CSV
```sh=
for crd in $(oc get crd | grep argo | awk '{ print $1 }'); do
  oc annotate --overwrite crd/$crd operatorframework.io/installed-alongside-7b166f3de2426153=openshift-operators/argocd-operator.v0.12.0
done
```
7. remove all CRDs from the InstallPlans that will be created as a result of the new CRD application, in order to prosecute the upgrade to the new operator version
    1. remove the CRD entries from the InstallPlan's `.status.plan`
```sh!
oc get installplan <install_plan_name> -n <namespace> -o json | jq '.status.plan |= map(select(.resource.group != "apiextensions.k8s.io" or .resource.kind != "CustomResourceDefinition"))' > new-ip.json
```
    2. write the updated InstallPlan back via a PUT to its `status` subresource
```sh!
curl -k -X PUT \
-H "Authorization: Bearer $(oc whoami --show-token)" \
-H "Content-Type: application/json" \
--data-binary @new-ip.json \
$(oc whoami --show-server)/apis/operators.coreos.com/v1alpha1/namespaces/<namespace>/installplans/<install_plan_name>/status
```
8. approve the InstallPlans
```sh
oc -n openshift-operators patch installplan <install_plan_name> -p '{"spec":{"approved":true}}' --type merge
```
# Appendices
## Reproducer Scenario
1. save out the script below as `cr-spammer.sh`
2. launch a cluster-bot cluster on 4.15 or earlier (those versions do not have the fix)
3. in config, install the argocd community operator from the catalog (v0.9.1)
4. `oc create namespace argo-test`
5. run the script; at around 300K Applications we should start getting apiserver timeout errors (NB: this script will beat your laptop up if you're not careful! A parallel rate of 25-50 gives a build-up rate of 90K+/hr.):
```sh
cr-spammer.sh --start 1 --finish 300000 --parallel 25
```
6. we get an error from the apiserver when we attempt a chunk-less list of the CRs:
`oc get applications -A -o json --chunk-size=0 > log1`
```sh!
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get applications.argoproj.io)
```
```bash=
#!/usr/bin/env bash
# cr-spammer.sh
# before running this script
# 1 - start a cluster-bot 4.15 or earlier (since fix is landing in 4.16 now)
# 2 - install the argocd operator from community catalog
# 3 - create the namespace `argo-test`
# Namespace where the Application resources will be created
NAMESPACE="openshift-operators"
DEST_NAMESPACE="argo-test"
PROJECT_NAME="argo-proj"
DEST_NAME="test"
# Number of Applications to create
TOTAL_APPLICATIONS=100000
# Number of parallel jobs
PARALLEL_JOBS=50
# sequence start
FLOOR=$1
# Function to generate a unique application YAML and apply it
create_application() {
  local app_name=$1
  cat <<EOF | oc apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ${app_name}
  namespace: ${NAMESPACE}
spec:
  destination:
    name: ${DEST_NAME}
    namespace: ${DEST_NAMESPACE}
  project: ${PROJECT_NAME}
  source:
    repoURL: https://github.com/jianzhangbjz/learn-operator/tree/master/manifests
EOF
}
usage() {
  echo "$0 [OPTIONS] where OPTIONS is one/more of"
  echo " --start -s the application number to start at"
  echo " --parallel -p the number of parallel jobs to run at one time"
  echo " --finish -f the application number to end at"
  exit 255
}
while [ ! -z "$1" ]; do
  case "$1" in
    --start|-s)
      shift
      FLOOR=$1
      ;;
    --parallel|-p)
      shift
      PARALLEL_JOBS=$1
      ;;
    --finish|-f)
      shift
      TOTAL_APPLICATIONS=$1
      ;;
    *)
      usage
      ;;
  esac
  shift
done
# Main loop to create the specified number of applications in parallel
count=0
for i in $(seq -w $FLOOR $TOTAL_APPLICATIONS); do
  app_name="example3-${i}"
  echo "Creating Application: ${app_name}"
  create_application "${app_name}" &
  count=$((count + 1))
  # If we've reached the parallel jobs limit, wait for all jobs to complete
  if [[ $count -ge $PARALLEL_JOBS ]]; then
    wait
    count=0
  fi
done
# Wait for any remaining background jobs to complete
wait
DELTA=$((TOTAL_APPLICATIONS - FLOOR))
echo "Successfully created $DELTA Applications [$FLOOR .. $TOTAL_APPLICATIONS]."
```
## Snippets
- extract operator-specific content from a RH catalog image and stash to a file:
```sh!
~/devel/operator-registry/bin/opm render -o yaml registry.redhat.io/redhat/community-operator-index:v4.16 | yq 'select((.schema == "olm.package" and .name == "argocd-operator") or (.package == "argocd-operator"))' | tee /tmp/arcd-operator-index.yaml
```
- pull CRDs from a catalog `opm render`-ed out as yaml and decode them:
```sh!
yq 'select(.schema == "olm.bundle").properties[]| select(.type == "olm.bundle.object").value.data |= @base64d ' /tmp/arcd-operator-index.yaml > /tmp/argocd-crds.yaml
```