---
title: diving into olm e2e catalog failures
authors:
- "@grokspawn"
reviewers:
- TBD
approvers:
- TBD
tags: test-debugging
---
###### tags: test-debugging
# diving into olm e2e catalog failures
original thread: https://coreos.slack.com/archives/G3T7N42NP/p1666358415190779
jordank
Today at 8:20 AM
languishing non-deppy PRs:
api OWNER changes: https://github.com/operator-framework/api/pull/268
api absorbing declcfg to make it easier for folks to build pieces that “speak FBC”: https://github.com/operator-framework/api/pull/266
In addition, I’d like to write a de-flake e2e-gcp-olm story on the OLM component to resolve its many, many, many failures and to /override e2e-gcp-olm for this PR
tim
:spiral_calendar_pad: 1 hour ago
wait... this is the master branch. we should be seeing green CI runs on that branch. seems like either we regressed in the past release or so, or catherine has terrible luck.
tim
:spiral_calendar_pad: 1 hour ago
have we dug into those e2e failures? at a glance, it looks like the same installplan tests have been consistently failing.
tim
:spiral_calendar_pad: 1 hour ago
```log
Using the /logs/artifacts/install-plan-e2e-px5gj output directory
Storing the test artifact output in the /logs/artifacts/install-plan-e2e-px5gj directory
Collecting get catalogsources -o yaml output...
./collect-ci-artifacts.sh: line 29: oc: command not found
failed to collect namespace artifacts: exit status 127
```
I'll throw up a PR now to fix this...
jordank
1 hour ago
I did, to my limits: https://coreos.slack.com/archives/G3T7N42NP/p1666274571614449
> jordank
> I need a lease-an-eyeball service, with my propensity to ask for them, but anyway …
> need some help with a persistent downstream e2e failure for this PR, failing here. My MO would be to extract the steps and replay them manually, but that's where I run into my ignorance.
> e2e-gcp-olm has failed every time for this PR, but if I check previous PRs' results, this test fails far more often than it succeeds, though it eventually passes after a handful of tries. For contrast, this test has been executed (and has failed) 17 times in the current PR.
> Thread in team-olm | Yesterday at 9:02 AM
jordank
1 hour ago
```log
Using the /logs/artifacts/install-plan-e2e-px5gj output directory
Storing the test artifact output in the /logs/artifacts/install-plan-e2e-px5gj directory
Collecting get catalogsources -o yaml output...
./collect-ci-artifacts.sh: line 29: oc: command not found
failed to collect namespace artifacts: exit status 127
```
Where did you find this?
I don’t see it anywhere in the prow data (https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_operator-fr[…]rator-framework-olm-master-e2e-gcp-olm/1583156312722640896).
tim
:spiral_calendar_pad: 1 hour ago
https://github.com/operator-framework/operator-lifecycle-manager/pull/2876 PR for fixing this
tim
:spiral_calendar_pad: 1 hour ago
I saw it in the e2e logs
tim
:spiral_calendar_pad: 1 hour ago
tl;dr: the oc binary exists downstream and is present in $PATH in the CI environment, but when we spawn the new process to run the CI artifacts script, we override the $PATH value when we should be retaining it.
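For illustration, a minimal Go sketch of the kind of bug described here (the script path, variable names, and helper are placeholders, not the actual OLM code): building `cmd.Env` from scratch drops the inherited $PATH, so the child process can no longer resolve `oc`.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runCollectScript spawns the CI artifact-collection script.
func runCollectScript(outputDir string) error {
	cmd := exec.Command("./collect-ci-artifacts.sh", outputDir)

	// Broken: handing the child a hand-built environment means only these
	// variables are visible, so $PATH is gone and `oc` lookups fail with
	// "command not found" (exit status 127).
	// cmd.Env = []string{"TEST_NAMESPACE=" + outputDir}

	// Fix: start from the inherited environment and append to it, so $PATH
	// (and therefore `oc`) is still resolvable in the child process.
	cmd.Env = append(os.Environ(), "TEST_NAMESPACE="+outputDir)

	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	if err := runCollectScript("/logs/artifacts/install-plan-e2e"); err != nil {
		fmt.Fprintf(os.Stderr, "failed to collect namespace artifacts: %v\n", err)
	}
}
```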
tim
:spiral_calendar_pad: 1 hour ago
have we tried running the downstream OLM e2e testing suite locally to figure out whether this is simply a flake?
tim
:spiral_calendar_pad: 1 hour ago
because this looks unrelated to the changes in the PR
jordank
1 hour ago
I ran it locally yesterday. This failure did not repeat, but there were others. I was focused on this test only.
tim
:spiral_calendar_pad: 1 hour ago
oof i thought we fixed the test pollution issue with the custom client
jordank
1 hour ago
It’s totally unrelated to the PR changes.
tim
:spiral_calendar_pad: 1 hour ago
(assuming this is related to tests polluting unrelated tests due to poor cleanup)
tim
:spiral_calendar_pad: 1 hour ago
it looks like we're failing in the BeforeEach clause, which is weird: https://github.com/openshift/operator-framework-olm/blob/master/staging/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go#L161-L164
jordank
1 hour ago
Yep. It expects to see the installplan complete… and it was timing out trying to collect the namespace artifacts (which I read as "trying to find out whether we got there").
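That setup step amounts to something like the following (a hedged sketch, not the actual test code; the client wiring and helper name are assumptions): poll until the InstallPlan reports phase Complete, and fail the setup if it never does within the timeout.

```go
package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"

	operatorsv1alpha1 "github.com/operator-framework/api/pkg/operators/v1alpha1"
	"github.com/operator-framework/operator-lifecycle-manager/pkg/api/client/clientset/versioned"
)

// waitForInstallPlanComplete polls until the named InstallPlan reports phase
// Complete, or the timeout expires. The helper itself is illustrative of what
// the BeforeEach setup waits for, not a copy of it.
func waitForInstallPlanComplete(ctx context.Context, crc versioned.Interface, ns, name string) error {
	return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		ip, err := crc.OperatorsV1alpha1().InstallPlans(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			// Tolerate transient errors; the outer timeout bounds the wait.
			return false, nil
		}
		return ip.Status.Phase == operatorsv1alpha1.InstallPlanPhaseComplete, nil
	})
}
```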
tim
:spiral_calendar_pad: 1 hour ago
when an individual test case fails, we try to collect the relevant OLM testing artifacts to improve debuggability. so in this context the individual test case failed, and we also silently failed to collect the relevant debug artifacts.
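That failure hook is roughly of this shape (a sketch assuming Ginkgo v2; collectNamespaceArtifacts and the namespace name are placeholders, not the real helpers): artifacts are collected only when the spec failed, and a collection error is merely written to the Ginkgo writer, which is why it can fail silently alongside the test.

```go
package e2e

import (
	"fmt"
	"os/exec"

	"github.com/onsi/ginkgo/v2"
)

// collectNamespaceArtifacts is a placeholder for the helper that shells out
// to collect-ci-artifacts.sh for the spec's namespace.
func collectNamespaceArtifacts(namespace string) error {
	return exec.Command("./collect-ci-artifacts.sh", namespace).Run()
}

// Collect debug artifacts only when a spec fails.
var _ = ginkgo.AfterEach(func() {
	if !ginkgo.CurrentSpecReport().Failed() {
		return
	}
	if err := collectNamespaceArtifacts("install-plan-e2e"); err != nil {
		// The error is only reported to the Ginkgo writer, so a broken
		// collection step (e.g. the missing `oc` binary above) is easy to miss.
		fmt.Fprintf(ginkgo.GinkgoWriter, "failed to collect namespace artifacts: %v\n", err)
	}
})
```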
tim
:spiral_calendar_pad: 1 hour ago
I threw up a PR that should fix the latter; we'll need to port it downstream.
tim
:spiral_calendar_pad: 1 hour ago
policy/v1beta1 PodDisruptionBudget is no longer served in Kubernetes 1.25 (https://kubernetes.io/docs/reference/using-api/deprecation-guide/#poddisruptionbudget-v125), and kube 1.25 just landed downstream
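One quick way to confirm what the cluster actually serves (a sketch using client-go discovery; the kubeconfig handling is generic boilerplate and not part of the test suite): query the discovery API for policy/v1beta1 and policy/v1.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig the standard way; adjust for your environment.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// On Kubernetes 1.25+ this returns an error: policy/v1beta1 is no longer
	// served, so a bundle that still ships a v1beta1 PodDisruptionBudget
	// cannot be installed.
	if _, err := dc.ServerResourcesForGroupVersion("policy/v1beta1"); err != nil {
		fmt.Println("policy/v1beta1 is not served:", err)
	}
	if _, err := dc.ServerResourcesForGroupVersion("policy/v1"); err == nil {
		fmt.Println("policy/v1 is served; PodDisruptionBudget manifests should target it")
	}
}
```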
tim
:spiral_calendar_pad: 1 hour ago
```log
time="2022-10-19T18:27:56Z" level=info msg="skip processing installplan without status - subscription sync responsible for initial status" id=s8ZOy ip=test-plan-api namespace=deprecated-e2e-bb8hb phase=
time="2022-10-19T18:27:56Z" level=info msg=syncing id=h780a ip=test-plan-api namespace=deprecated-e2e-bb8hb phase=Installing
time="2022-10-19T18:27:56Z" level=info msg="could not query for GVK in api discovery" err="the server could not find the requested resource" group=verticalpodautoscalers.autoscaling.k8s.io kind=VerticalPodAutoscaler version=v1
E1019 18:27:56.748459 1 queueinformer_operator.go:298] sync {"update" "deprecated-e2e-bb8hb/test-plan-api"} failed: api-server resource not found installing VerticalPodAutoscaler my.thing: GroupVersionKind verticalpodautoscalers.autoscaling.k8s.io/v1, Kind=VerticalPodAutoscaler not found on the cluster. This API may have been deprecated and removed, see https://kubernetes.io/docs/reference/using-api/deprecation-guide/ for more information.
time="2022-10-19T18:27:56Z" level=info msg=syncing id=mt2GN ip=test-plan-api namespace=deprecated-e2e-bb8hb phase=Installing
```
tim
:spiral_calendar_pad: 1 hour ago
seems like a brittle test here
jordank
1 hour ago
Where did you find that log?
tim
:spiral_calendar_pad: 1 hour ago
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_oper[…]-operator-7b998d9968-89rg6_catalog-operator.log
tim
:spiral_calendar_pad: 1 hour ago
if you're asking how I found that log:
- clicked on the failed e2e test case dropdown and opened the "open stderr" view
- looked for the namespace this resource was deployed in and found the "tearing down the install-plan-e2e-9w7zt namespace" message
- went back to the original window, and clicked the "artifacts" tab in the top right
- navigated to the artifacts -> e2e-gcp-olm -> gather-extra -> artifacts -> pods directory hierarchy and searched for the catalog-operators logs
- ctrl+f'd for the "install-plan-e2e-9w7zt" namespace to find any relevant logs
- saw the "sync {"update" "install-plan-e2e-9w7zt/test-plan-l6qzw"} failed: api-server resource not found installing PodDisruptionBudget test-pdb-47rhz: GroupVersionKind /policy/v1beta1, Kind=PodDisruptionBudget not found on the cluster. This API may have been deprecated and removed, see https://kubernetes.io/docs/reference/using-api/deprecation-guide/ for more information." log