# Cluster API: v1.4 Test failures in e2e tests

## High priority

\-

## Low priority

### Rate limit for github api has been reached

* Let's ignore for now: known root cause and it's not our code
* Fails because of rate limiting when downloading cert-manager.yaml
* Found in: (multiple times)
  * e2e-main: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1625685968952496128
    * When testing clusterctl upgrades (v0.4=>current)
  * Also in other test cases

### hook BeforeClusterUpgrade call not recorded in configMap

* Let's ignore for now, didn't happen often
* Found in:
  * e2e-main: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1626715521942556672
    * When upgrading a workload cluster using ClusterClass with RuntimeSDK

### Old version instances remain

* Let's ignore for now, didn't happen often
* Found in:
  * e2e-main: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1626109800913833984
    * When upgrading a workload cluster using ClusterClass with RuntimeSDK

### No Control Plane machines came into existence.

* Let's ignore for now, didn't happen often
* Found in: (multiple times)
  * e2e-mink8s-main:
    * [When testing clusterctl upgrades (v0.4=>current)](https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1626662673397583872)
    * [When testing MachinePools Should successfully create a cluster with machine pool machines](https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1627856554810150912)
  * e2e-main:
    * [When testing clusterctl upgrades using ClusterClass (v1.2=>current)](https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1626019200340332544)

### Self-hosted timeout

* Saw it a few times already
* Found in: (multiple times)
  * [When testing Cluster API working on self-hosted clusters using ClusterClass with a HA control plane [ClusterClass]](https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1628108464884551680)

### Found unexpected running containers

* Let's ignore for now, didn't happen often
* Found in:
  * e2e-mink8s-main: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1626723071853334528
    * Quick start

### Timed out waiting for Cluster machine-pool-zoga1n/machine-pool-83gevp to provision

* Let's ignore for now, didn't happen often
* Found in:
  * e2e-main: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1625716170122334208
    * When testing MachinePools

## Follow-ups

* Retro:
  * We need a better overview of all flakes, e.g. via an umbrella issue.
  * There should be no implicitly accepted, unlisted flakes.

## Done

### Machines should remain the same after the upgrade (KCP triggers rollout after upgrade to main)

* Fails very often
* Found in: (multiple times)
  * e2e-mink8s-main: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1626692872642236416
    * When testing clusterctl upgrades (v0.3=>current)
  * e2e-main: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1625655518167044096
    * When testing clusterctl upgrades (v0.3=>current)
* PR merged: https://github.com/kubernetes-sigs/cluster-api/pull/8125

### ClusterClass is not up to date (selfhost / clusterctl move)

* Basically always happens
* Found in: (multiple times)
  * e2e-main: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1626624923994689536
    * When testing Cluster API working on self-hosted clusters using ClusterClass with a HA control plane [ClusterClass]
* Analysis: clusterctl move
  * We create the ClusterClass but not the referenced templates in the target cluster (in the first moveSequence.group)
    * => Variable discovery works, but reconcileExternalReferences fails (huge log, nothing about it in the conditions)
    * => Because of that, VariablesReconciled is true, but observedGeneration is not set
    * => In the second moveSequence.group we try to deploy the Cluster, which fails because variable defaulting fails, because observedGeneration has not been set
  * We try to pause / unpause the ClusterClass in mover.go, but it doesn't work, so at least it doesn't lead to even more issues right now. The bug is that we compare the annotations map against itself in mover.go:653 (see the sketch below this section); this also affects unpause. If the ClusterClass were actually paused, the variables would not be reconciled at all.
* Plan:
  * Killian:
    * [x] PR: [Make Cluster webhook less strict for out of date ClusterClasses](https://github.com/kubernetes-sigs/cluster-api/pull/8136)
      * Defaulting and validation in webhooks are best-effort (in the controllers they are guaranteed)
      * If not possible in webhooks => return a warning (only godoc for now)
    * Fix the bug in pause/unpause
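For illustration, here is a minimal, hypothetical Go sketch of the bug pattern described above (not the actual mover.go code). Go maps are reference types, so taking the annotations map, mutating it, and then comparing it against a variable that still points at the same map always reports "no change"; a caller that only patches when something changed will therefore never send the pause annotation to the API server.

```go
package main

import (
	"fmt"
	"reflect"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// setPausedAnnotation is a hypothetical helper showing the pitfall: it reports
// whether the annotations changed, but oldAnnotations and annotations point to
// the same underlying map, so the comparison never detects a change.
func setPausedAnnotation(obj *unstructured.Unstructured) (changed bool) {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	oldAnnotations := annotations // BUG: same map, not a copy of it.

	annotations["cluster.x-k8s.io/paused"] = "true"
	obj.SetAnnotations(annotations)

	// Always false: both variables reference the one (already mutated) map.
	return !reflect.DeepEqual(oldAnnotations, annotations)
}

func main() {
	obj := &unstructured.Unstructured{Object: map[string]interface{}{}}
	// Prints "changed: false", so a caller gating its patch on this result
	// would never actually pause the object on the API server.
	fmt.Println("changed:", setPausedAnnotation(obj))
}
```

The fix is to snapshot the annotations (or the whole object) before mutating, and compare against that copy.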
### Machines should remain the same after the upgrade (Cluster topology triggers rollout after upgrade to main)

* Almost always happens
* Found in:
  * e2e-main: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api/8120/pull-cluster-api-e2e-full-main/1626475653224206336
    * When testing clusterctl upgrades using ClusterClass (v1.3=>current)
  * (e2e-mink8s-main not affected, because it doesn't use ClusterClass)
* Analysis:
  * The Cluster topology controller triggers a MachineDeployment rollout after the upgrade because of a difference in the defaulted ImagePullPolicy field in the KubeadmConfigTemplate.
* Options:
  * 1. Disabling defaulting in webhooks for dry-run:
    * Doesn't work. Consider what happens if:
      * defaulting was run on the KubeadmConfigTemplate referenced in the ClusterClass, but
      * defaulting was *not* run on the KubeadmConfigTemplate used in an existing MachineDeployment
    * => SSA dry-run won't do any defaulting anymore, so you end up with a diff because only one of them has the field
    * => The only way to end up with identical objects is to run defaulting on both (like in KCP)
  * 2. Running the dry-run also on originalUnstructured in SSA dry-run (see the sketch at the end of these notes):
    * Initial state of the objects:
      * the MachineDeployment's KubeadmConfigTemplate doesn't have ImagePullPolicy
      * the ClusterClass' KubeadmConfigTemplate has ImagePullPolicy
    * Both originalUnstructured and dryRunUnstructured will end up with ImagePullPolicy
    * But there is a difference in managed fields:
      * originalUnstructured: capi-topology does not get ownership of ImagePullPolicy because the field is set by the defaulting webhook
      * dryRunUnstructured: capi-topology gets ownership as it actively sets the ImagePullPolicy field
    * => There is only a diff in managed fields, which is fine as we just patch them inline without rotation.
    * => I think this is a viable solution. We might have to add some caching as a follow-up to reduce SSA calls, e.g. if there is no diff, store the originalUnstructured resourceVersion + a hash of modifiedUnstructured as "no diff".
* Plan:
  * Stefan:
    * [x] Stopgap: skip the rollout check after validation
      * PR: [test/e2e: disable rollout check for ClusterClass-based cluster in clusterctl upgrade test](https://github.com/kubernetes-sigs/cluster-api/pull/8138)
    * [x] Option 2: dry-run for both
      * PR: [ClusterClass: run dry-run on original and modified object](https://github.com/kubernetes-sigs/cluster-api/pull/8139)
    * [x] Follow-up: SSA cache (10 min TTL)
    * [x] Follow-up: double-check upstream efforts for an SSA library => doesn't help, as mutating webhooks are also relevant
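To make Option 2 concrete, here is a hedged Go sketch using controller-runtime (hypothetical helper name and structure, not the code from PR #8139): both the original and the modified object go through a server-side-apply dry-run under the same field owner, so webhook and SSA defaulting is applied to both sides before comparing them, and the fields that legitimately differ between the two dry-run results (managedFields, resourceVersion) are excluded from the comparison.

```go
package topology

import (
	"context"
	"reflect"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// hasDiff is a hypothetical helper: it dry-run-applies both the original and
// the modified object with the same field owner, so defaulting (webhooks and
// SSA) runs on both sides, then compares the results while ignoring fields
// that are expected to differ between the two dry-run responses.
// Both objects are assumed to have apiVersion, kind, name and namespace set.
func hasDiff(ctx context.Context, c client.Client, original, modified *unstructured.Unstructured) (bool, error) {
	dryRunOriginal := original.DeepCopy()
	if err := c.Patch(ctx, dryRunOriginal, client.Apply,
		client.DryRunAll, client.FieldOwner("capi-topology"), client.ForceOwnership); err != nil {
		return false, err
	}

	dryRunModified := modified.DeepCopy()
	if err := c.Patch(ctx, dryRunModified, client.Apply,
		client.DryRunAll, client.FieldOwner("capi-topology"), client.ForceOwnership); err != nil {
		return false, err
	}

	// managedFields differ by construction (only the modified dry-run gives
	// capi-topology ownership of the defaulted fields), so drop them and the
	// resourceVersion before the semantic comparison.
	for _, u := range []*unstructured.Unstructured{dryRunOriginal, dryRunModified} {
		u.SetManagedFields(nil)
		u.SetResourceVersion("")
	}

	return !reflect.DeepEqual(dryRunOriginal.Object, dryRunModified.Object), nil
}
```

The SSA-cache follow-up fits on top of this: when the comparison finds no diff, a cache entry keyed on the original's resourceVersion plus a hash of the modified object can be stored with a TTL, so subsequent reconciles skip the two dry-run requests while nothing has changed.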