---
title: diving into olm e2e catalog failures
authors:
  - "@grokspawn"
reviewers:
  - TBD
approvers:
  - TBD
tags: test-debugging
---

###### tags: test-debugging

# diving into olm e2e catalog failures

original thread: https://coreos.slack.com/archives/G3T7N42NP/p1666358415190779

jordank  Today at 8:20 AM
languishing non-deppy PRs:
- api OWNER changes: https://github.com/operator-framework/api/pull/268
- api absorbing declcfg to make it easier for folks to build pieces that "speak FBC": https://github.com/operator-framework/api/pull/266

In addition, I'd like to write a de-flake e2e-gcp-olm story on the OLM component to resolve its many, many, many failures, and to /override e2e-gcp-olm for this PR.

42 replies

tim :spiral_calendar_pad:  1 hour ago
wait... this is the master branch. we should be having green CI runs on that branch. seems like we either regressed in the past release or so, or catherine has terrible luck.

tim :spiral_calendar_pad:  1 hour ago
have we dug into those e2e failures? from a glance, it looks like the same installplan tests have been consistently failing.

tim :spiral_calendar_pad:  1 hour ago
```log
Using the /logs/artifacts/install-plan-e2e-px5gj output directory
Storing the test artifact output in the /logs/artifacts/install-plan-e2e-px5gj directory
Collecting get catalogsources -o yaml output...
./collect-ci-artifacts.sh: line 29: oc: command not found
failed to collect namespace artifacts: exit status 127
```
I'll throw up a PR now to fix this...

jordank  1 hour ago
I did, to my limits: https://coreos.slack.com/archives/G3T7N42NP/p1666274571614449
> jordank
> I need a lease-an-eyeball service, with my propensity to ask for them, but anyway…
> need some help with a persistent downstream e2e failure for this PR, failing here. My MO would be to extract the steps and replay them manually, but that's where I run into my ignorance.
> e2e-gcp-olm has failed every time for this PR. If I check previous PRs' results, it fails far more often than it succeeds, but eventually passes after a handful of retries. For contrast, this test has been executed (and failed) 17 times in the current PR.
>
> Thread in team-olm | Yesterday at 9:02 AM

jordank  1 hour ago
```log
Using the /logs/artifacts/install-plan-e2e-px5gj output directory
Storing the test artifact output in the /logs/artifacts/install-plan-e2e-px5gj directory
Collecting get catalogsources -o yaml output...
./collect-ci-artifacts.sh: line 29: oc: command not found
failed to collect namespace artifacts: exit status 127
```
Where did you find this? I don't see it anywhere in the prow data (https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_operator-fr[…]rator-framework-olm-master-e2e-gcp-olm/1583156312722640896).

tim :spiral_calendar_pad:  1 hour ago
https://github.com/operator-framework/operator-lifecycle-manager/pull/2876
PR for fixing this

tim :spiral_calendar_pad:  1 hour ago
I saw it in the e2e logs

tim :spiral_calendar_pad:  1 hour ago
tl;dr: the oc binary exists downstream and is present in $PATH in the CI environment, but when spawning the new process to run the CI artifacts script, we override the $PATH value when we need to be retaining it.

tim :spiral_calendar_pad:  1 hour ago
have we tried running the downstream OLM e2e testing suite locally to figure out whether this is simply a flake?

tim :spiral_calendar_pad:  1 hour ago
because this looks unrelated to the changes in the PR

jordank  1 hour ago
I ran it locally yesterday. This failure did not repeat, but there were others. I was focused on this test only.
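Editor's aside on the `oc: command not found` failure above: the fix tim describes in PR 2876 is about preserving the parent environment when the e2e suite spawns `collect-ci-artifacts.sh`. The sketch below is illustrative only — it assumes the suite shells out with `os/exec`, and the helper and variable names are made up, not the actual OLM code — but it shows the difference between dropping and retaining `$PATH`:

```go
package e2e

import (
	"fmt"
	"os"
	"os/exec"
)

// runArtifactScript is an illustrative helper (not the real OLM code) showing
// the $PATH pitfall described above: if cmd.Env is assigned a hand-built slice,
// the child process loses the CI environment's $PATH, `oc` cannot be resolved,
// and the script fails with "command not found" / exit status 127.
func runArtifactScript(scriptPath, testNamespace, artifactsDir string) error {
	cmd := exec.Command(scriptPath)

	// Buggy shape (what the failure mode looks like): the parent $PATH is dropped.
	// cmd.Env = []string{
	// 	"TEST_NAMESPACE=" + testNamespace,
	// 	"ARTIFACT_DIR=" + artifactsDir,
	// }

	// Fixed shape: start from the parent environment so $PATH (and therefore `oc`)
	// survives, then append the script-specific variables.
	cmd.Env = append(os.Environ(),
		"TEST_NAMESPACE="+testNamespace,
		"ARTIFACT_DIR="+artifactsDir,
	)

	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to collect namespace artifacts: %v\n%s", err, out)
	}
	return nil
}
```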
tim :spiral_calendar_pad:  1 hour ago
oof, i thought we fixed the test pollution issue with the custom client

jordank  1 hour ago
It's totally unrelated to the PR changes.

tim :spiral_calendar_pad:  1 hour ago
(assuming this is related to tests polluting unrelated tests due to poor cleanup)

tim :spiral_calendar_pad:  1 hour ago
it looks like we're failing in the BeforeEach clause, which is weird: https://github.com/openshift/operator-framework-olm/blob/master/staging/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go#L161-L164

jordank  1 hour ago
Yep. It expects to see the installplan complete… and it was timing out trying to collect the namespace artifacts (which I read as "trying to find out if we got there").

tim :spiral_calendar_pad:  1 hour ago
when an individual test case fails, we try to collect the relevant OLM testing artifacts to improve debuggability. so in this context, the individual test failed and we also silently failed to collect the relevant debug artifacts.

tim :spiral_calendar_pad:  1 hour ago
I threw up a PR that should fix the latter; we'll need to port it downstream. (edited)

tim :spiral_calendar_pad:  1 hour ago
policy/v1beta1 PodDisruptionBudget is no longer served in 1.25 (https://kubernetes.io/docs/reference/using-api/deprecation-guide/#poddisruptionbudget-v125), and kube 1.25 just landed downstream

tim :spiral_calendar_pad:  1 hour ago
```log
time="2022-10-19T18:27:56Z" level=info msg="skip processing installplan without status - subscription sync responsible for initial status" id=s8ZOy ip=test-plan-api namespace=deprecated-e2e-bb8hb phase=
time="2022-10-19T18:27:56Z" level=info msg=syncing id=h780a ip=test-plan-api namespace=deprecated-e2e-bb8hb phase=Installing
time="2022-10-19T18:27:56Z" level=info msg="could not query for GVK in api discovery" err="the server could not find the requested resource" group=verticalpodautoscalers.autoscaling.k8s.io kind=VerticalPodAutoscaler version=v1
E1019 18:27:56.748459       1 queueinformer_operator.go:298] sync {"update" "deprecated-e2e-bb8hb/test-plan-api"} failed: api-server resource not found installing VerticalPodAutoscaler my.thing: GroupVersionKind verticalpodautoscalers.autoscaling.k8s.io/v1, Kind=VerticalPodAutoscaler not found on the cluster. This API may have been deprecated and removed, see https://kubernetes.io/docs/reference/using-api/deprecation-guide/ for more information.
time="2022-10-19T18:27:56Z" level=info msg=syncing id=mt2GN ip=test-plan-api namespace=deprecated-e2e-bb8hb phase=Installing
```

tim :spiral_calendar_pad:  1 hour ago
seems like a brittle test here
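Editor's aside on the failure mode in that log: the deprecated-API test installs manifests whose GroupVersionKinds the 1.25 API server no longer serves, and OLM finds that out through API discovery ("could not query for GVK in api discovery"). Below is a minimal sketch of that kind of lookup, assuming a standard client-go discovery client against the default kubeconfig; the function is illustrative and is not OLM's actual code path:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

// gvkIsServed reports whether the cluster's API discovery still serves the
// given GroupVersionKind.
func gvkIsServed(dc discovery.DiscoveryInterface, gvk schema.GroupVersionKind) (bool, error) {
	resources, err := dc.ServerResourcesForGroupVersion(gvk.GroupVersion().String())
	if err != nil {
		// An entirely missing group/version (e.g. policy/v1beta1 on kube 1.25)
		// surfaces as an error rather than an empty list.
		return false, err
	}
	for _, r := range resources.APIResources {
		if r.Kind == gvk.Kind {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Assumes a reachable cluster via the default kubeconfig location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pdbV1beta1 := schema.GroupVersionKind{Group: "policy", Version: "v1beta1", Kind: "PodDisruptionBudget"}
	served, err := gvkIsServed(dc, pdbV1beta1)
	// On a 1.25+ cluster this reports false (with a not-found error), which is
	// why install plans that reference the removed GVK can never complete.
	fmt.Println(served, err)
}
```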
jordank  1 hour ago
Where did you find that log?

tim :spiral_calendar_pad:  1 hour ago
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_oper[…]-operator-7b998d9968-89rg6_catalog-operator.log

tim :spiral_calendar_pad:  1 hour ago
if you're asking how I found that log:
- clicked on the failed e2e test case dropdown and pressed the "open stderr" window
- looked for the namespace this resource was deployed in and found the "tearing down the install-plan-e2e-9w7zt namespace" message
- went back to the original window, and clicked the "artifacts" tab in the top right
- navigated to the artifacts -> e2e-gcp-olm -> gather-extra -> artifacts -> pods directory hierarchy and searched for the catalog-operator logs
- ctrl+f for the "install-plan-e2e-9w7zt" namespace to find any relevant logs
- saw the "sync {"update" "install-plan-e2e-9w7zt/test-plan-l6qzw"} failed: api-server resource not found installing PodDisruptionBudget test-pdb-47rhz: GroupVersionKind /policy/v1beta1, Kind=PodDisruptionBudget not found on the cluster. This API may have been deprecated and removed, see https://kubernetes.io/docs/reference/using-api/deprecation-guide/ for more information." log
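Editor's note, tying the pieces together: the BeforeEach linked earlier waits for the generated InstallPlan to reach Complete before the specs run. The sketch below is a simplified stand-in for that wait, not the actual test code — it assumes the OLM clientset and Gomega are available in the suite, and the helper name is made up. On a kube 1.25 cluster, a plan that tries to install a policy/v1beta1 PodDisruptionBudget never leaves Installing, so a wait like this times out and every spec in the container fails, which matches the symptom in the thread.

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"

	"github.com/operator-framework/api/pkg/operators/v1alpha1"
	versioned "github.com/operator-framework/operator-lifecycle-manager/pkg/api/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// awaitInstallPlanComplete is an illustrative stand-in for the wait performed
// in the installplan e2e BeforeEach: poll the InstallPlan until it reports
// Complete. If one of the plan's steps references a GVK the server no longer
// serves (e.g. policy/v1beta1 PodDisruptionBudget on kube 1.25), the plan
// stays in Installing and this wait times out, failing the whole container.
// Intended to run inside a Ginkgo/Gomega suite with a registered fail handler.
func awaitInstallPlanComplete(crc versioned.Interface, namespace, name string) {
	Eventually(func() v1alpha1.InstallPlanPhase {
		ip, err := crc.OperatorsV1alpha1().InstallPlans(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			// Treat transient get errors as "not there yet" and keep polling.
			return ""
		}
		return ip.Status.Phase
	}, 5*time.Minute, 10*time.Second).Should(Equal(v1alpha1.InstallPlanPhaseComplete))
}
```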