# General notes
This document is meant to give a general idea of the work areas and activities of the CI team for the v1.4.0 release cycle.
## Responsibilities
1. **Signal:**
* Responsibility for the quality of the release
* Continuously monitor CI signal, so a release can be cut at any time
* Add CI signal for new release branches
1. **Bug Triage:**
* Make sure blocking issues and bugs are triaged and dealt with in a timely fashion
1. **Automation:**
* Maintain and improve release automation, tooling & related developer docs
## Tasks
-----
### [Signal]
**Set up jobs and dashboards for a new release branch (week 15)**
The goal of this task is to have test coverage for the new release branch and to have the results show up in testgrid.
- [ ] Create new jobs based on the jobs running against our main branch:
* Copy test-infra/config/jobs/kubernetes-sigs/cluster-api/cluster-api-periodics-main.yaml to test-infra/config/jobs/kubernetes-sigs/cluster-api/cluster-api-periodics-release-1-4.yaml.
* Copy test-infra/config/jobs/kubernetes-sigs/cluster-api/cluster-api-periodics-main-upgrades.yaml to test-infra/config/jobs/kubernetes-sigs/cluster-api/cluster-api-periodics-release-1-4-upgrades.yaml.
* Copy test-infra/config/jobs/kubernetes-sigs/cluster-api/cluster-api-presubmits-main.yaml to test-infra/config/jobs/kubernetes-sigs/cluster-api/cluster-api-presubmits-release-1-4.yaml.
* Modify the following (a sketch of the resulting changes follows this checklist):
* Rename the jobs, e.g.: periodic-cluster-api-test-main => periodic-cluster-api-test-release-1-4.
* Change annotations.testgrid-dashboards to sig-cluster-lifecycle-cluster-api-1.4.
* Change annotations.testgrid-tab-name, e.g. capi-test-main => capi-test-release-1-4.
* For periodics additionally:
* Change extra_refs[].base_ref to release-1.4 (for repo: cluster-api).
* Change interval (let's use the same as for 1.3).
* For presubmits additionally: Adjust branches: ^main$ => ^release-1.4$.
- [ ] Create a new dashboard for the new branch in: test-infra/config/testgrids/kubernetes/sig-cluster-lifecycle/config.yaml (dashboard_groups and dashboards).
- [ ] Verify the jobs and dashboards a day later by taking a look at: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.4
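To make the renames above more concrete, here is a minimal, illustrative sketch of the fields that change in the copied periodic job and of the new testgrid dashboard entries. All other values (image, command, args, resources, exact intervals) should be taken verbatim from the existing main-branch files; nothing below is meant to be copied as-is.

```yaml
# config/jobs/kubernetes-sigs/cluster-api/cluster-api-periodics-release-1-4.yaml (abridged sketch)
periodics:
- name: periodic-cluster-api-test-release-1-4   # renamed from periodic-cluster-api-test-main
  interval: 2h                                  # use the same interval as the 1.3 jobs
  decorate: true
  extra_refs:
  - org: kubernetes-sigs
    repo: cluster-api
    base_ref: release-1.4                       # was: main
    path_alias: sigs.k8s.io/cluster-api
  annotations:
    testgrid-dashboards: sig-cluster-lifecycle-cluster-api-1.4
    testgrid-tab-name: capi-test-release-1-4    # was: capi-test-main
  # spec (image, command, args, resources) stays identical to the main-branch job
```

For the presubmit file, the analogous change is in the `branches` field (`^main$` becomes `^release-1.4$`). The new dashboard is then registered in the testgrid config, roughly like this:

```yaml
# config/testgrids/kubernetes/sig-cluster-lifecycle/config.yaml (abridged sketch)
dashboard_groups:
- name: sig-cluster-lifecycle-cluster-api       # existing group name, taken from the file
  dashboard_names:
  # ... existing entries stay as they are ...
  - sig-cluster-lifecycle-cluster-api-1.4       # add the new dashboard to the group
dashboards:
# ... existing dashboards stay as they are ...
- name: sig-cluster-lifecycle-cluster-api-1.4   # new dashboard for the release-1.4 branch
```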
Prior art:
* Add jobs for CAPI release 1.3: https://github.com/kubernetes/test-infra/pull/28010
* Add jobs for CAPI release 1.2: https://github.com/kubernetes/test-infra/pull/26621
**Also, modify the clusterctl upgrade jobs to test the new release**
Prior art:
* Issue: [Adjust clusterctl upgrade jobs for v1.3](https://github.com/kubernetes-sigs/cluster-api/issues/6835)
* PR updating to v1.2.0-beta.2: [Update CAPI version for periodic upgrade jobs](https://github.com/kubernetes/test-infra/pull/26750)
* Follow-up PR to update to v1.2.0 once the tag was ready: [use clusterctl v1.2.0 for upgrade tests](https://github.com/kubernetes/test-infra/pull/26779)
#### [Continuously] Monitor CI signal
1. Add yourself to the Cluster API alert mailing list. <br>Note: An alternative to the alert mailing list is manually monitoring the [testgrid dashboards](https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#Summary) (including the dashboards of previous releases).
- [x] PR to add CI signal members to the mailing list: https://github.com/kubernetes/k8s.io/pull/4535
1. Triage CI failures reported by mail alerts or found by monitoring the testgrid dashboards:
* Create an issue in the Cluster API repository to surface the CI failure (see the example skeleton below).
* Identify whether it is a known issue, a new issue, or a regression.
* Mark the issue as release-blocking if applicable.
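For illustration, a CI failure issue could follow a simple skeleton like the one below; the structure is only a suggestion and all placeholders (job names, links) are hypothetical.

```text
Title: [Failing test] <job name>: <test / spec name>

Which job(s) are failing or flaking:
  <e.g. a periodic e2e job for the affected branch>
Which test(s):
  <Ginkgo spec name(s)>
Since when / how often:
  <link to the testgrid tab and/or prow job history>
Triage notes:
  <known issue, new issue, or regression? related PRs/changes?>
Release blocking:
  <yes/no – if yes, add the issue to the current milestone>
```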
#### [Continuously] Reduce the amount of flaky tests
The Cluster API tests are pretty stable, but there are still some flaky tests from time to time.
To reduce the amount of flakes, please periodically:
1. Take a look at recent CI failures via k8s-triage and the testgrid dashboards:
* [sig-cluster-lifecycle-cluster-api](https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api)
* [sig-cluster-lifecycle-cluster-api-0.3](https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-0.3)
* [sig-cluster-lifecycle-cluster-api-0.4](https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-0.4)
* [sig-cluster-lifecycle-cluster-api-1.0](https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.0)
* [sig-cluster-lifecycle-cluster-api-1.1](https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.1)
* [sig-cluster-lifecycle-cluster-api-1.2](https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.2)
* [sig-cluster-lifecycle-cluster-api-1.3](https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.3)
2. Open issues for flakes that occur and ideally fix them or find someone who can. Note: Given resource limitations in the Prow cluster, it might not be possible to fix all flakes; let's just try to pragmatically keep the amount of flakes low.
**NOTE:** Please check a separate doc for keeping track of failing/flaking tests [here](https://hackmd.io/TiD0jUqUQ3OYKDx4JC0GLw).
-----
### [Bug triage] Continuously
The goal of bug triage is to triage incoming issues and, if necessary, flag them as release-blocking and add them to the milestone of the current release.
**Ask the CAPI maintainers to clarify the CI signal team's responsibilities, because not all issues can be triaged by the team and there seems to be a shared responsibility between the two parties.**
**Answer:** It is not our responsibility to triage newly created issues; that is the maintainers' job. Our main job here is to track open issues that are labelled or milestoned for the specific release but are left without attention or are not moving forward. So we have to collect them all and try to make sure they are worked on in a timely manner or, if not, raise them to the repo maintainers' attention.
**NOTE:** Please check a separate doc for keeping track of bug triages [here](https://hackmd.io/1rX71tsFRoys_c8T8I2H2A?view).
### [Automation]
**NOTE:** Please check improvements docs [here](https://hackmd.io/ivRNeBF5Sg2DPeyokdMZZA).
-----
## What is expected of the team member responsible for a specific week?
1. Mostly follow the [Continuously-Monitor-CI-signal](https://hackmd.io/Nul46KJARES5tIbOoYY1vw?both#Continuously-Monitor-CI-signal) and [Continuously-Reduce-the-amount-of-flaky-tests](https://hackmd.io/Nul46KJARES5tIbOoYY1vw?both#Continuously-Reduce-the-amount-of-flaky-tests) notes described above. Please remember to keep track of the CI signal every day in the separate document available [here](https://hackmd.io/TiD0jUqUQ3OYKDx4JC0GLw). It is also a good idea to check the CI status for the previous week and take over / track the open issues reported there.
2. Although bug triaging is a shared responsibility of the maintainers and the CI team, it is good if we keep track of it as much as we can. For that, please check the repo for open, unattended, or untracked [issues](https://github.com/kubernetes-sigs/cluster-api/issues?q=is%3Aopen+is%3Aissue+milestone%3Av1.4) and [PRs](https://github.com/kubernetes-sigs/cluster-api/pulls?q=is%3Aopen+is%3Apr+milestone%3Av1.4) milestoned for the v1.4 release; if you find one that is not yet in the bug triage tracking document available [here](https://hackmd.io/1rX71tsFRoys_c8T8I2H2A), please add the issue/PR to that document.
3. Give a weekly community update in the cluster-api Slack channel.
**Please note:** The person watching CI for a specific week (e.g. week 4) should give a short update about that week's CI signal status in the CAPI Slack channel the following week (week 5), preferably on Monday.
**WEEKLY CI UPDATE example:**
Just to give a better idea, the k8s release team CI signal members use a format along the following lines, though it is up to the responsible person which format to use for the weekly update:
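For illustration, a short weekly update could look roughly like this; all job names, numbers, and issue references below are placeholders.

```text
CI Signal update – week 4 (v1.4.0 cycle)

- main and release-1.3 testgrid boards are green overall
- 1 failing job: <job name> – tracked in <link to issue>
- 2 flakes observed this week: <test names> – issues opened and added to the CI tracking doc
- No release-blocking issues at the moment
```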

**FAILING JOB ANNOUNCEMENT example:**
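Similarly, a failing-job announcement could look roughly like this (again, all details are placeholders):

```text
:rotating_light: <job name> has been failing since <date>.

Failure: <short error summary + link to the prow job / testgrid tab>
Tracking issue: <link to the cluster-api issue>
Release blocking: <yes/no>
Help wanted: anyone familiar with <affected area>, please take a look.
```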

-----
### Useful links:
* SIG-CI-Signal pre-recorded videos: https://www.youtube.com/playlist?list=PL69nYSiGNLP2Lzsjir9W7S8u0UsQeeW71
* K8s release team bi-weekly meeting recordings: https://www.youtube.com/watch?v=8SAN2eOkI3o&list=PL69nYSiGNLP3QKkOsDsO6A0Y1rhgP84iZ&index=2