## Why are E2E flakes currently hard to debug?

#### Tests are cleaned up as soon as they fail

Maybe we can implement `if isFailed && isDevelopmentMode { os.Exit(1) }` in a `JustAfterEach()` for every test spec (see the sketch after the flake table below).

#### Takes a long time to run

This is normal for a big project, but here is what I think we can do to speed things up where possible:

- [Namespaces take a minimum of 5 seconds to delete](https://github.com/kubernetes/kubernetes/pull/37431). This is probably our biggest wait time for tests, and the source of uncleaned-up artifacts where deletion is not confirmed/waited for. This is a Kubernetes issue. I wonder if we can fix it, because it seems like no one has looked at it since that workaround was implemented.
- Polling can be replaced with watches? https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.15.0/pkg/client#WithWatch (sketch after the flake table below)
- Increase timeouts across the board and introduce benchmarks: https://onsi.github.io/ginkgo/#benchmarking-code

### Why does E2E flake?

List of flake sources:

| Cause | Potential Mitigation | Example |
| ----- | -------------------- | ------- |
| Next test starts before cleanup of the old test has finished | Use a common function with retries and wait for deletion of objects; register objects to be cleaned up when creating them with a special client (see the sketch below) | Run 4 |
| Timed out waiting for condition | Preload the required images (vcluster, nginx, busybox, etc.); increase timeouts? Avoid cleanup on fail | Runs 1, 2, 3 |
| Every API call can fail, including `Create()`, if there are intermittent connectivity issues | Init the client with [hashicorp/go-retryablehttp](https://pkg.go.dev/github.com/hashicorp/go-retryablehttp?utm_source=godoc) or implement some other form of retrying client? | I think I've seen this a few times |
| `AfterEach` waits for the project to terminate, but doesn't wait for the member namespace to terminate | Should project termination wait for member namespace termination? | Run 4 |
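A minimal sketch of the skip-cleanup-on-failure idea from the first section, assuming Ginkgo v2; the `E2E_DEV_MODE` env var is a made-up switch we would have to wire up ourselves. Registered once at the top level, it runs after each spec body and exits before any `AfterEach` cleanup can delete the evidence:

```go
import (
	"fmt"
	"os"

	"github.com/onsi/ginkgo/v2"
)

// Hypothetical switch: only skip cleanup when a developer runs the suite locally, never in CI.
var isDevelopmentMode = os.Getenv("E2E_DEV_MODE") == "true"

// JustAfterEach runs after the spec body but before any AfterEach cleanup nodes.
var _ = ginkgo.JustAfterEach(func() {
	if ginkgo.CurrentSpecReport().Failed() && isDevelopmentMode {
		fmt.Fprintln(ginkgo.GinkgoWriter, "spec failed; exiting before cleanup so the broken state stays on the cluster")
		os.Exit(1)
	}
})
```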
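For the first row of the table (the next test racing the previous test's cleanup), a sketch of what a shared delete-and-wait helper could look like, in the same `wait.PollUntilContextTimeout` style the suite already uses; the function name and timeouts are made up:

```go
import (
	"context"
	"time"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAndWait deletes an object and only returns once it is actually gone,
// so the next spec never races a namespace or instance that is still terminating.
func deleteAndWait(ctx context.Context, c client.Client, obj client.Object) error {
	if err := c.Delete(ctx, obj); err != nil && !kerrors.IsNotFound(err) {
		return err
	}

	key := client.ObjectKeyFromObject(obj)
	return wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		err := c.Get(ctx, key, obj)
		if kerrors.IsNotFound(err) {
			return true, nil // fully deleted
		}
		// still terminating, or a transient error: keep polling until the timeout
		return false, nil
	})
}
```

Every helper that creates an object could then register a `DeferCleanup` that calls this, which would also cover the "register object to be cleaned up when creating with a special client" idea.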
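And the "polling can be replaced with watches?" idea from the list further up, applied to the wait we do most often: waiting for a namespace to disappear. Sketch only, assuming a `client.WithWatch` from `client.NewWithWatch`; a generic version would need the matching `ObjectList` type per kind, and a real implementation would have to close the small race between the initial `Get` and the watch starting (e.g. via a list and its `resourceVersion`):

```go
import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	kerrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/watch"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForNamespaceGone blocks until the named namespace is deleted, without
// polling: it opens a watch filtered to that one namespace and returns on the Deleted event.
func waitForNamespaceGone(ctx context.Context, c client.WithWatch, name string) error {
	// If it is already gone there will never be a Deleted event.
	ns := &corev1.Namespace{}
	if err := c.Get(ctx, client.ObjectKey{Name: name}, ns); kerrors.IsNotFound(err) {
		return nil
	}

	w, err := c.Watch(ctx, &corev1.NamespaceList{}, client.MatchingFields{"metadata.name": name})
	if err != nil {
		return err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before namespace %q was deleted", name)
			}
			if ev.Type == watch.Deleted {
				return nil
			}
		}
	}
}
```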
### History of runs and RCA

#### Ran full suite #1

Got 1 flake: the assertion timed out waiting for an nginx pod to come up. Waiting for the pod in the space to run timed out after one minute.

Mitigation: preload the image, increase the timeout.

```
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/exec/exec.go:83
  [FAILED] in [It] - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/exec/exec.go:90 @ 06/23/23 07:14:12.48
• [FAILED] [60.080 seconds]
Kubectl exec [It] exec in space
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/exec/exec.go:83

  [FAILED] Unexpected error:
      <wait.errInterrupted>: timed out waiting for the condition
      {
          cause: <*errors.errorString | 0xc0001e4160>{
              s: "timed out waiting for the condition",
          },
      }
  occurred
  In [It] at: /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/exec/exec.go:90 @ 06/23/23 07:14:12.48

[...]

Summarizing 1 Failure:
  [FAIL] Kubectl exec [It] exec in space
  /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/exec/exec.go:90

Ran 105 of 105 Specs in 870.269 seconds
FAIL! -- 104 Passed | 1 Failed | 0 Pending | 0 Skipped
--- FAIL: TestRunE2ETests (875.77s)
FAIL

Ginkgo ran 4 suites in 14m44.178474076s

There were failures detected in the following suites:
  test_core ./test_core
```

#### Ran full suite #2

Previous flake theory: the pod image pull took a long time.
Mitigation: dumped the list of all images in the kind cluster and preloaded them into this fresh cluster.

This run timed out waiting for the virtualclusterinstance to become ready. The AfterSuite then failed, so there were additional failures in the same ginkgo node, `Project Virtual Cluster Import`.

```
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:136
  [FAILED] in [It] - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:181 @ 06/23/23 08:36:27.961
• [FAILED] [66.144 seconds]
Project Virtual Cluster Import [It] Test basic project virtual cluster importing
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:136

  [FAILED] Timed out after 60.002s.
  Expected
      <bool>: false
  to be true
  In [It] at: /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:181 @ 06/23/23 08:36:27.961
------------------------------
```

The additional failures were because of failed cleanup:

```
[FAILED] Unexpected error:
    <*errors.StatusError | 0xc0013f60a0>: object is being deleted: namespaces "e2e-legacy-vcluster-ns" already exists
    {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "object is being deleted: namespaces \"e2e-legacy-vcluster-ns\" already exists",
            Reason: "AlreadyExists",
```

#### Run 3

Timed out waiting for the virtualclusterinstance to be ready after import (seen 3-4 times before).

```
For more information, please take a look at the vcluster docs at https://www.vcluster.com/docs

  [FAILED] Timed out after 60.000s.
  Expected
      <bool>: false
  to be true
  In [It] at: /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:439 @ 06/23/23 09:48:02.205
  < Exit [It] Should import vcluster created using vcluster cli command - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:384 @ 06/23/23 09:48:02.206 (1m11.787s)
  > Enter [AfterEach] Project Virtual Cluster Import - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:88 @ 06/23/23 09:48:02.206
  < Exit [AfterEach] Project Virtual Cluster Import - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:88 @ 06/23/23 09:48:24.531 (22.325s)
• [FAILED] [94.131 seconds]
Project Virtual Cluster Import
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:28
  [It] Should import vcluster created using vcluster cli command
```

#### Run 4

The previous cleanup was not complete; the `AfterEach` waits for project deletion.
The project was deleted, but the namespace it owned was still terminating:

```
[FAILED] Unexpected error:
    <*errors.StatusError | 0xc0001a1a40>: virtualclusterinstances.storage.loft.sh "virtual-cluster-instance-1" is forbidden: unable to create new content in namespace loft-p-virtual-cluster-active-quotas because it is being terminated
    {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "virtualclusterinstances.storage.loft.sh \"virtual-cluster-instance-1\" is forbidden: unable to create new content in namespace loft-p-virtual-cluster-active-quotas because it is being terminated",
            Reason: "Forbidden",
            Details: {
                Name: "virtual-cluster-instance-1",
                Group: "storage.loft.sh",
                Kind: "virtualclusterinstances",
                UID: "",
                Causes: [
                    {
                        Type: "NamespaceTerminating",
                        Message: "namespace loft-p-virtual-cluster-active-quotas is being terminated",
                        Field: "metadata.namespace",
                    },
                ],
                RetryAfterSeconds: 0,
            },
            Code: 403,
        },
    }
occurred
In [It] at: /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/quota/virtual_cluster_instances.go:641 @ 06/23/23 10:50:22.567
```

[...]

#### Run 34

(Lots of the same and partial runs in between.) Timed out waiting for the space:

```
> Enter [BeforeEach] ProjectSecret - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:37 @ 06/23/23 16:51:11.449
< Exit [BeforeEach] ProjectSecret - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:37 @ 06/23/23 16:51:13.478 (2.03s)
> Enter [BeforeEach] Space sync with project - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:254 @ 06/23/23 16:51:13.478
[FAILED] Timed out after 30.000s.
Unexpected error:
    <*errors.StatusError | 0xc000f37cc0>: namespaces "project-secret" not found
    {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "namespaces \"project-secret\" not found",
            Reason: "NotFound",
            Details: {
                Name: "project-secret",
                Group: "",
                Kind: "namespaces",
                UID: "",
                Causes: nil,
                RetryAfterSeconds: 0,
            },
            Code: 404,
        },
    }
occurred
In [BeforeEach] at: /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:292 @ 06/23/23 16:51:43.508
```

#### Run 36

Still hitting the timeout after doubling it, so possibly something else: could not find template "example-template". The example template was not being created in the test.

```
• [FAILED] [248.146 seconds]
ProjectSecret
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:24
  Space sync with project [BeforeEach]
  /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:254
    does not synchronize space Secret from unauthorized space
    /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:320
------------------------------
```

#### Run 81

(Yeah, the jump in numbering is just how many runs it felt like in between.) Found a panic!
I thought I hit one before, but didn't find it again until I cleared the flakes above. So many goroutines that HackMD refused to let me paste the whole thing:

```
I0627 05:37:53.336348     772 logr.go:278] vault-project-secrets-sync-controller: secret vault-sync-xzdgl: reconcile succeeded
fatal error: concurrent map read and map write

goroutine 975 [running]:
reflect.mapaccess_faststr(0x4944480?, 0x18?, {0xc004a4e7b0?, 0x4944480?})
	/usr/local/go/src/runtime/map.go:1343 +0x1e
reflect.Value.MapIndex({0x4ad7d00?, 0xc0038de690?, 0x0?}, {0x4944480, 0xc0052caf40, 0x98})
	/usr/local/go/src/reflect/value.go:1738 +0xc5
reflect.deepValueEqual({0x4ad7d00?, 0xc0057be930?, 0x0?}, {0x4ad7d00?, 0xc0038de690?, 0x0?}, 0x0?)
	/usr/local/go/src/reflect/deepequal.go:147 +0x1446
reflect.DeepEqual({0x4ad7d00?, 0xc0057be930?}, {0x4ad7d00?, 0xc0038de690?})
	/usr/local/go/src/reflect/deepequal.go:237 +0x2c5
sigs.k8s.io/controller-runtime/pkg/predicate.AnnotationChangedPredicate.Update({{0x58ea8f8, 0xc0050d6d70, 0x0, 0x0}}, {{0x5948cb0, 0xc0054ae000}, {0x5948cb0, 0xc003b7f7c0}})
	/loft/vendor/sigs.k8s.io/controller-runtime/pkg/predicate/predicate.go:201 +0x153
sigs.k8s.io/controller-runtime/pkg/predicate.or.Update(...)
```

Another apparent flake from going through the e2e log: https://github.com/loft-sh/loft-enterprise/actions/runs/5477220436/jobs/9976165206?pr=1673

Apparent source:

```
func WaitForClusterAccess(ctx context.Context, loftClient loftclient.Client, cluster string) error {
	clusterClient, err := loftClient.Cluster(cluster)
	if err != nil {
		return err
	}

	return wait.PollUntilContextTimeout(ctx, time.Second, time.Minute*2, true, func(ctx context.Context) (bool, error) {
		_, err := clusterClient.Agent().ClusterV1().Spaces().List(ctx, metav1.ListOptions{})
		if err != nil {
			if kerrors.IsUnauthorized(err) || kerrors.IsForbidden(err) {
				return false, nil
			}
			// any error other than 401/403 (e.g. a transient connection failure) aborts the whole poll immediately
			return false, err
		}

		return true, nil
	})
}
```
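Related to the "every API call can fail" row in the flake table, and to polls like the one above that abort on the first unexpected error: one way to get a retrying client would be to wrap the rest config's transport with hashicorp/go-retryablehttp. A rough sketch with a made-up function name; request-body buffering and non-idempotent verbs would need a closer look before we adopt something like this:

```go
import (
	"net/http"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
	"k8s.io/client-go/rest"
)

// withRetryingTransport returns a copy of the rest.Config whose transport
// retries transient HTTP failures before they bubble up as test errors.
func withRetryingTransport(cfg *rest.Config) *rest.Config {
	cfg = rest.CopyConfig(cfg)
	cfg.Wrap(func(rt http.RoundTripper) http.RoundTripper {
		rc := retryablehttp.NewClient()
		rc.RetryMax = 3
		rc.Logger = nil              // keep the default retry logging out of test output
		rc.HTTPClient.Transport = rt // chain to the transport client-go was going to use
		return &retryablehttp.RoundTripper{Client: rc}
	})
	return cfg
}
```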