## Why are E2E flakes currently hard to debug?
#### Tests are cleaned up as soon as they fail
Maybe we can implement `if isFailed && isDevelopmentMode { os.Exit(1) }` in a `JustAfterEach()` for every test spec, so a failed spec halts the run and leaves the cluster state around for inspection (rough sketch below).
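A minimal sketch of what that could look like with Ginkgo v2, assuming a suite-level `developmentMode` flag (here read from a hypothetical `E2E_DEV_MODE` env var):
```
package e2e

import (
	"os"

	ginkgo "github.com/onsi/ginkgo/v2"
)

// developmentMode is an assumed suite flag; E2E_DEV_MODE is a hypothetical env var.
var developmentMode = os.Getenv("E2E_DEV_MODE") != ""

// In development mode, stop the whole run on the first failed spec so the
// cluster is left as-is for debugging instead of being cleaned up.
var _ = ginkgo.JustAfterEach(func() {
	if ginkgo.CurrentSpecReport().Failed() && developmentMode {
		os.Exit(1)
	}
})
```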
#### Takes a long time to run
This is normal for a big project, but I'll add what I think we can do to speed things up where possible:
- [namespaces take a minimum of 5 seconds to delete](https://github.com/kubernetes/kubernetes/pull/37431). This is probably our biggest wait time in tests, and the source of uncleaned-up artifacts wherever deletion is not confirmed or waited for. This is a Kubernetes issue; I wonder if we can fix it upstream, because it seems like no one has looked at it since that workaround was implemented.
- polling can be replaced with watches? https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.15.0/pkg/client#WithWatch (see the sketch after this list)
- increase timeouts across the board and introduce benchmarks: https://onsi.github.io/ginkgo/#benchmarking-code
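A rough sketch of the watch idea, assuming a controller-runtime `WithWatch` client; `waitForNamespaceGone` and the 2-minute timeout are illustrative, not existing helpers:
```
package e2e

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForNamespaceGone blocks until the named namespace is deleted, using a
// watch instead of a poll loop. A production version would also handle the
// namespace already being gone before the watch starts (e.g. with an initial Get).
func waitForNamespaceGone(ctx context.Context, cfg *rest.Config, name string) error {
	c, err := client.NewWithWatch(cfg, client.Options{})
	if err != nil {
		return err
	}

	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()

	w, err := c.Watch(ctx, &corev1.NamespaceList{}, client.MatchingFields{"metadata.name": name})
	if err != nil {
		return err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for namespace %s to be deleted", name)
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before namespace %s was deleted", name)
			}
			if ev.Type == watch.Deleted {
				return nil
			}
		}
	}
}
```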
### Why does E2E flake?
List of flake sources:
| Cause | Potential Mitigation | Example |
| -------- | -------- | -------- |
| next task starts before cleanup of the old task has finished | use a common function with retries that waits for deletion of objects; register objects for cleanup with a special client when creating them (see the sketch after this table) | Run 4 |
| timed out waiting for condition | preload required images: vcluster, nginx, busybox, etc.; increase timeouts? avoid cleanup on fail | Runs 1, 2, 3 |
| every API call can fail, including `Create()`, if there are intermittent connectivity issues | init the client with [hashicorp/go-retryablehttp](https://pkg.go.dev/github.com/hashicorp/go-retryablehttp?utm_source=godoc) or implement some other form of retrying client? | I think I've seen this a few times |
| AfterEach waits for the project to terminate, but doesn't wait for the member namespace to terminate | should project termination wait for member namespace termination? | Run 4 |
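For the first row, a minimal sketch of a shared cleanup helper (the name `deleteAndWait` and the timeouts are assumptions, not existing code):
```
package e2e

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAndWait deletes obj and only returns once the object is actually gone,
// so the next spec never races against a half-finished cleanup.
func deleteAndWait(ctx context.Context, c client.Client, obj client.Object) error {
	if err := c.Delete(ctx, obj); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	key := client.ObjectKeyFromObject(obj)
	return wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		err := c.Get(ctx, key, obj)
		if apierrors.IsNotFound(err) {
			return true, nil // fully deleted
		}
		// keep polling while the object still exists or on transient errors
		return false, nil
	})
}
```
The idea is that every AfterEach/cleanup path goes through something like this instead of a bare `Delete()`.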
### History of runs and RCA:
#### Ran full suite #1
Got 1 flake: the assertion timed out waiting for an nginx pod to come up.
Waiting for the pod in the space to run timed out after one minute.
Mitigation: preload the image, increase the timeout.
```
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/exec/exec.go:83
[FAILED] in [It] - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/exec/exec.go:90 @ 06/23/23 07:14:12.48
• [FAILED] [60.080 seconds]
Kubectl exec [It] exec in space
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/exec/exec.go:83
[FAILED] Unexpected error:
<wait.errInterrupted>:
timed out waiting for the condition
{
cause: <*errors.errorString | 0xc0001e4160>{
s: "timed out waiting for the condition",
},
}
occurred
In [It] at: /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/exec/exec.go:90 @ 06/23/23 07:14:12.48
[...]
Summarizing 1 Failure:
[FAIL] Kubectl exec [It] exec in space
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/exec/exec.go:90
Ran 105 of 105 Specs in 870.269 seconds
FAIL! -- 104 Passed | 1 Failed | 0 Pending | 0 Skipped
--- FAIL: TestRunE2ETests (875.77s)
FAIL
Ginkgo ran 4 suites in 14m44.178474076s
There were failures detected in the following suites:
test_core ./test_core
```
#### Ran full suite #2
Previous flake theory: the pod image pull took a long time.
Mitigation: dumped the list of all images in the kind cluster and preloaded them into this fresh cluster.
Timed out waiting for the virtualclusterinstance to become ready. AfterSuite failed, so there were additional failures in the same Ginkgo node, `Project Virtual Cluster Import`.
```
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:136
[FAILED] in [It] - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:181 @ 06/23/23 08:36:27.961
• [FAILED] [66.144 seconds]
Project Virtual Cluster Import [It] Test basic project virtual cluster importing
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:136
[FAILED] Timed out after 60.002s.
Expected
<bool>: false
to be true
In [It] at: /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:181 @ 06/23/23 08:36:27.961
------------------------------
```
additional failures were because of failed cleanup:
```
[FAILED] Unexpected error:
<*errors.StatusError | 0xc0013f60a0>:
object is being deleted: namespaces "e2e-legacy-vcluster-ns" already exists
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "object is being deleted: namespaces \"e2e-legacy-vcluster-ns\" already exists",
Reason: "AlreadyExists",
```
#### Run 3
timed out waiting for virtualclusterinstance to be ready after import (seen 3-4 times before)
```
For more information, please take a look at the vcluster docs at https://www.vcluster.com/docs
[FAILED] Timed out after 60.000s.
Expected
<bool>: false
to be true
In [It] at: /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:439 @ 06/23/23 09:48:02.205
< Exit [It] Should import vcluster created using vcluster cli command - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:384 @ 06/23/23 09:48:02.206 (1m11.787s)
> Enter [AfterEach] Project Virtual Cluster Import - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:88 @ 06/23/23 09:48:02.206
< Exit [AfterEach] Project Virtual Cluster Import - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:88 @ 06/23/23 09:48:24.531 (22.325s)
• [FAILED] [94.131 seconds]
Project Virtual Cluster Import
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectimportvirtualcluster/projectimportvirtualcluster.go:28
[It] Should import vcluster created using vcluster cli command
```
#### Run 4
Previous cleanup was not complete: AfterEach waits for project deletion, and the project was deleted, but the namespace it owned was still terminating.
```
[FAILED] Unexpected error:
<*errors.StatusError | 0xc0001a1a40>:
virtualclusterinstances.storage.loft.sh "virtual-cluster-instance-1" is forbidden: unable to create new content in namespace loft-p-virtual-cluster-active-quotas because it is being terminated
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "virtualclusterinstances.storage.loft.sh \"virtual-cluster-instance-1\" is forbidden: unable to create new content in namespace loft-p-virtual-cluster-active-quotas because it is being terminated",
Reason: "Forbidden",
Details: {
Name: "virtual-cluster-instance-1",
Group: "storage.loft.sh",
Kind: "virtualclusterinstances",
UID: "",
Causes: [
{
Type: "NamespaceTerminating",
Message: "namespace loft-p-virtual-cluster-active-quotas is being terminated",
Field: "metadata.namespace",
},
],
RetryAfterSeconds: 0,
},
Code: 403,
},
}
occurred
In [It] at: /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_core/quota/virtual_cluster_instances.go:641 @ 06/23/23 10:50:22.567
```
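One way to make specs tolerate this, sketched under the assumption that the namespace eventually finishes terminating (or is recreated) within the spec's own timeout; `createWhenNamespaceReady` is a hypothetical helper, not existing code:
```
package e2e

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createWhenNamespaceReady retries a Create instead of failing the spec when
// the target namespace is still terminating from the previous spec's cleanup.
func createWhenNamespaceReady(ctx context.Context, c client.Client, obj client.Object) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		err := c.Create(ctx, obj)
		if err == nil {
			return true, nil
		}
		// "unable to create new content in namespace ... because it is being
		// terminated" carries a NamespaceTerminating status cause; keep waiting.
		if apierrors.HasStatusCause(err, corev1.NamespaceTerminatingCause) {
			return false, nil
		}
		return false, err
	})
}
```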
[...]
#### Run 34 (lots of the same, partial runs, etc)
timed out waiting for space
```
> Enter [BeforeEach] ProjectSecret - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:37 @ 06/23/23 16:51:11.449
< Exit [BeforeEach] ProjectSecret - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:37 @ 06/23/23 16:51:13.478 (2.03s)
> Enter [BeforeEach] Space sync with project - /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:254 @ 06/23/23 16:51:13.478
[FAILED] Timed out after 30.000s.
Unexpected error:
<*errors.StatusError | 0xc000f37cc0>:
namespaces "project-secret" not found
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "namespaces \"project-secret\" not found",
Reason: "NotFound",
Details: {
Name: "project-secret",
Group: "",
Kind: "namespaces",
UID: "",
Causes: nil,
RetryAfterSeconds: 0,
},
Code: 404,
},
}
occurred
In [BeforeEach] at: /home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:292 @ 06/23/23 16:51:43.508
```
#### Run 36
Still hitting the timeout after doubling it; possibly something else.
Could not find template "example-template": the example template was not being created in the test.
```
• [FAILED] [248.146 seconds]
ProjectSecret
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:24
Space sync with project [BeforeEach]
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:254
does not synchronize space Secret from unauthorized space
/home/tpjsm/go/src/github.com/loft-sh/loft-enterprise/e2e/test_project/projectsecret/space_secret_sync.go:320
------------------------------
```
#### Run 81
Yeah, just adding how many runs that felt like in between.
Found a panic! Thought I hit one before, but didn't find it again until I cleared the flakes above.
So many goroutines that HackMD refused to let me paste the whole thing:
```
I0627 05:37:53.336348 772 logr.go:278] vault-project-secrets-sync-controller: secret vault-sync-xzdgl: reconcile
succeeded
fatal error: concurrent map read and map write
goroutine 975 [running]:
reflect.mapaccess_faststr(0x4944480?, 0x18?, {0xc004a4e7b0?, 0x4944480?})
/usr/local/go/src/runtime/map.go:1343 +0x1e
reflect.Value.MapIndex({0x4ad7d00?, 0xc0038de690?, 0x0?}, {0x4944480, 0xc0052caf40, 0x98})
/usr/local/go/src/reflect/value.go:1738 +0xc5
reflect.deepValueEqual({0x4ad7d00?, 0xc0057be930?, 0x0?}, {0x4ad7d00?, 0xc0038de690?, 0x0?}, 0x0?)
/usr/local/go/src/reflect/deepequal.go:147 +0x1446
reflect.DeepEqual({0x4ad7d00?, 0xc0057be930?}, {0x4ad7d00?, 0xc0038de690?})
/usr/local/go/src/reflect/deepequal.go:237 +0x2c5
sigs.k8s.io/controller-runtime/pkg/predicate.AnnotationChangedPredicate.Update({{0x58ea8f8, 0xc0050d6d70, 0x0, 0x0}},
{{0x5948cb0, 0xc0054ae000}, {0x5948cb0, 0xc003b7f7c0}})
/loft/vendor/sigs.k8s.io/controller-runtime/pkg/predicate/predicate.go:201 +0x153
sigs.k8s.io/controller-runtime/pkg/predicate.or.Update(...)
```
Another apparent flake found while going through the e2e log:
https://github.com/loft-sh/loft-enterprise/actions/runs/5477220436/jobs/9976165206?pr=1673
Apparent source:
```
// Polls the agent API until the cluster becomes accessible.
func WaitForClusterAccess(ctx context.Context, loftClient loftclient.Client, cluster string) error {
	clusterClient, err := loftClient.Cluster(cluster)
	if err != nil {
		return err
	}
	return wait.PollUntilContextTimeout(ctx, time.Second, time.Minute*2, true, func(ctx context.Context) (bool, error) {
		_, err := clusterClient.Agent().ClusterV1().Spaces().List(ctx, metav1.ListOptions{})
		if err != nil {
			// Only auth errors are treated as "not ready yet"; any other error
			// (e.g. a transient connection problem) aborts the poll immediately.
			if kerrors.IsUnauthorized(err) || kerrors.IsForbidden(err) {
				return false, nil
			}
			return false, err
		}
		return true, nil
	})
}
```