2023Q2: Keep infrastructure efficient – Project definitions

# 2023Q2: Keep infrastructure efficient – Project definitions #

## Pre-existing metrics

- M3.1: Number of failing opam jobs per scheduled pipeline
  - This can be read from the GitLab web interface
- M4: Recorded AWS cost
  - This can be deduced from the bills they send us ;) Alternatively, use the AWS cost exporter

## Project 1: Pipeline Monitoring and Simulation ##

Title
: Pipeline-centric Infrastructure Monitoring

Description
: Given the OKR, we need tools to monitor and measure the pipeline performance. We also need tools to help us evaluate the changes we propose.

Objective
: Establish metrics to measure progress on Axes 3 & 4.

Estimated effort (in nb weeks)
: --

Associated KRs
: CI runs consistently under 20 minutes and cost is not higher than the current one

Dependencies
: None

Deliverables
: The set of metrics necessary to measure progress on Axes 3 & 4.

### Task breakdown ###

| Task                                                             | Impact | Complexity | Who could do it    | Estimated effort (d) |
|------------------------------------------------------------------|--------|------------|--------------------|----------------------|
| Figure out a way to measure progress modulo evolution            |        |            | All                | 7d                   |
| Pipeline monitoring dashboard as proposed on #nl-devops by Arvid |        |            | Charles, Corentin  | 3d                   |
| Profile all jobs to understand resource requirements             |        |            | Charles? Corentin? | 3d                   |
| (Pipeline cost estimation tool)                                  |        |            | Arvid, Charles?    | 1d                   |
| (Measure flakiness over time)                                    |        |            | Arvid / Pietro     | 2d                   |

### Deliverables

The deliverables for this project will be two tools producing the metrics necessary to measure progress on Projects 3 & 4.

- Tool: Pipeline simulator (M1)
  - For a set of pipeline types, disregarding evolutions in the code-base unrelated to our changes:
    - Metric M1.1: Projected Pipeline wall-time and sequential duration by type
    - Metric M1.2: Projected Total Sequential Time (obtained by multiplying M1.1 by the pipeline type frequencies)
    - Metric M1.3: Projected Sequential Time per Pipeline (obtained by dividing M1.2 by the sum of the pipeline type frequencies)
- Tool: Pipeline monitoring dashboard (M2)
  - For a given project & given time period & pipeline type:
    - Metric M2.1: Recorded Pipeline average/worst-case wall-time and sequential duration
    - Metric M2.2: Recorded Number of launched pipelines for a given project
    - Metric M2.3: Recorded Sequential Time per Pipeline

### Figure out a way to measure progress modulo evolution ###

<details>
<summary>Breakdown of task 1</summary>

1. Identify a set of pipeline types which we would like to measure
   - assignee: arvid
   - estimation: 1d
   - description: types & (frequency over a given period)
3. Make a mechanism for running the set of pipelines for a given reference commit + a set of configuration changes
   - Do we need a separate infrastructure?
     - assignee: charles
     - estimation: 1d
     - description:
   - tool: input: baseline commit + modification commits
     - assignee: pietro?
     - estimation: 5d
     - check out the baseline commit
     - cherry-pick the modification commits
     - automatically modify the CI config (`.gitlab-ci.yml`) to add a tag that corresponds to the dedicated infrastructure
       - `.gitlab-ci.yml`: `default: tag: MACHIN`
       - how to handle arm64?
     - launch the pipeline types
       - instrumentation of `.gitlab-ci.yml` is needed to simulate pipeline types
     - in a separate project (nomadic-labs/tezos-ci-measures)
5. (Make it possible to do this every week, monitor the result and synthesize)

</details>

### Pipeline monitoring dashboard ###

Ref: [pipeline monitoring dashboard](https://tezos-dev.slack.com/archives/GTKLTTZU2/p1679472872216589)

## Project 2: Pipeline profiling and other investigating projects ##

Title
: Pipeline profiling and other investigating projects

Description
: It'd be a good idea to understand how resources are spent outside `{tezos,nomadic-labs}/tezos`.
Also, it would be good to have the resource requirements of each specific job to understand what kind of machine is best adapted for it.

Estimated effort (in nb weeks):
: --

Associated KRs:
: CI runs consistently under 20 minutes and cost is not higher than the current one

Dependencies:
: Weak dependency on Project 1 (for pipeline types)

### Task breakdown

- [ ] Measure the cost of side-projects (!= `*/tezos`) @(Charles, Corentin?) [1d]
- [ ] Make sure `interruptible` is used in side-projects @(Charles, Corentin, Arvid) [1d]

### Deliverables:

- A report with a resource characterization of each job in the set of pipeline types defined in Project 1.
- A report on the load and cost associated with each project in `{nomadic-labs,tezos}/*`
- Pipeline configurations in side-projects with interruptible pipelines

## Project 3: Reducing overall workload in `{nomadic-labs,tezos}/tezos` ##

Title
: Reducing overall CI workload in tezos/tezos

Description
: The idea is to reduce the overall workload. This will save money, which can be used for faster hardware to speed up the marge-bot pipeline wall time. The major contributor to workload at the moment is opam tests, so most ideas are centered around those tests.

Estimated effort (in nb weeks)
: 1 week

Associated KRs
: CI runs consistently under 20 minutes and cost is not higher than the current one

Dependencies
: Project 1

### Metrics & Objectives

- Goal metrics (measured by):
  - Projected Sequential Time per Pipeline (M1.3)
  - Recorded Sequential Time per Pipeline (M2.3)
- Guardrail metric (measured by):
  - Number of opam failures in scheduled pipelines (M3.1)
  - Recorded / projected AWS cost (implied by the goal metric if sequential time is a sufficient proxy for cost and by M4)

Objectives:

- Goal metric:
  - Reduce projected sequential time per pipeline to 230 minutes.
  - Reduce recorded sequential time per pipeline to 250 minutes.
- Guardrail metric:
  - The number of opam failures in scheduled pipelines should be similar to that of a time period before our changes (±5%).
  - Recorded / projected AWS cost does not increase (this is implied by the goal metric if sequential time is a sufficient proxy for cost)

### Task breakdown

| Task                                             | Wall-time impact best/worst-case | Sequential impact | Cost impact potential | Complexity     | Who could do it | Estimated effort (d) |
|--------------------------------------------------|----------------------------------|-------------------|-----------------------|----------------|-----------------|----------------------|
| Only run opam jobs for leaves/top-level packages |                                  | -45%              | Savings               | Medium         | Pietro, Arvid   | 3d                   |
| (Only run opam jobs for rev-deps)                | 1-27 minutes / 0                 |                   | Savings               | Medium         | Pietro, Arvid   | 10d                  |
| (Reduce number of opam packages)                 | 0 / 0                            |                   | Savings               | Hard / tedious | Pietro, Arvid   | ?                    |
| (Use runtest image instead of prebuild)          | ? / 0                            |                   | Savings               | Easy           | Arvid, Pietro   | 2-3d                 |
| (Cache `_opam` directory)                        | ? / ?                            |                   | Savings               | Easy           | Arvid, Pietro   | 1d                   |

### Deliverables

- A pipeline configuration that runs fewer opam jobs

### How the objective was computed

<details>
<summary>Nerdy details</summary>

- In the period 2023-03-19T06:00:43.623Z to 2023-04-19T06:35:09.999Z:
  - We had 473252 jobs in 4552 pipelines, for a total sequential time of 32033 hours. This gives a sequential time per pipeline of 422 minutes.
  - 19157 hours in 63091 jobs were spent on opam jobs (~60%)
- We say that the necessary opam jobs are those that:
  - Marge-bot launched
  - That ran on master (scheduled pipelines)
  - Or were one of the leaf jobs (definition fuzzy for the moment):
    - `opam:octez-accuser-PtLimaPt`, `opam:octez-accuser-PtMumbai`, `opam:octez-accuser-PtNairob`, `opam:octez-baker-PtLimaPt`, `opam:octez-baker-PtMumbai`, `opam:octez-baker-PtNairob`, `opam:octez-client`, `opam:octez-codec`, `opam:octez-node`, `opam:octez-protocol-compiler`, `opam:octez-proxy-server`, `opam:octez-signer`, `opam:octez-smart-rollup-client-PtLimaPt`, `opam:octez-smart-rollup-client-PtMumbai`, `opam:octez-smart-rollup-client-PtNairob`, `opam:octez-smart-rollup-node-PtLimaPt`, `opam:octez-smart-rollup-node-PtMumbai`, `opam:octez-smart-rollup-node-PtNairob`
- By only running the necessary opam jobs, we go down to 17385 jobs in 5121 hours from 63091 jobs in 19157 hours, a 73% reduction of opam job time.
- In total, we go down from 473252 jobs to 348149 jobs (a 26% reduction) and from 32033 hours to 17463 hours (a 45% reduction). This gives a sequential time per pipeline of 230 minutes.
- To account for evolutions of the code base, we add a margin of 20%, so we aim for a reduction to 250 minutes from 422.

</details>

## Project 4: Reducing marge-bot pipeline wall-time on `tezos/tezos` ##

Title
: Reducing marge-bot pipeline wall-time on `tezos/tezos`

Description
: This is the end-goal: make the marge-bot pipelines shorter so that devs spend less time waiting.

Estimated effort (in nb weeks):
: --

Associated KRs
: CI runs consistently under 20 minutes and cost is not higher than the current one

Dependencies
: Axis 1

### Metrics & objective

- Goal metrics (measured by):
  - Projected Average/Worst-case Wall-time of Marge-bot pipelines (M1.1)
  - Recorded Average/Worst-case Wall-time of Marge-bot pipelines (M2.1)
- Guardrail metrics (measured by):
  - Quality assurance level (??)
    - This is very hard to quantify in a measurable manner
  - Projected / recorded AWS cost (M1.2 / M4)

Objectives:

- Goal metrics:
  - Projected average/worst-case wall-time of marge-bot is under 20 minutes/30 minutes
- Guardrail metrics:
  - Quality assurance level: should be maintained
  - Recorded / projected AWS cost: should not increase

### Task breakdown

| Task                                                                 | Potential gains | Complexity | Who could do it   | Estimated effort (d) |
|----------------------------------------------------------------------|-----------------|------------|-------------------|----------------------|
| `build_x86_64`: batch dune calls                                     | minor           | major      | Arvid, Pietro     | 1d                   |
| Increase the number of Tezt jobs                                     | 2-3 mins        | minor      | Arvid, Pietro     | 1d                   |
| `script:snapshot_alpha_and_link`: combine into `build_x86_64`        | ??              |            | Arvid             | 2-3d                 |
| `unified_coverage`: move to `master` pipeline on marge-bot pipelines | 2-3 mins        |            | Arvid             | 2d                   |
| Flaky tests: make and implement policy                               |                 |            | Arvid, Pietro     | 2d                   |
| Flaky tests: ward off                                                |                 |            | Pietro, Arvid     | 2d                   |
| Finish the flakiness detection pipeline                              |                 |            | Arvid             | 1d                   |
| Flaky tests: fix                                                     |                 |            | Pietro, Arvid     | ??                   |
| Unit tezts: only test modified + revdeps                             |                 |            |                   | 10d                  |
| Per job-specific machines as per profile                             |                 |            | Charles, Corentin | ??                   |
| `build_x86_64`: investigate dedicated build machine(s)               | major           | major      | Charles, Corentin | 1d                   |
| `build_x86_64`: implement dedicated build machine(s)                 | major           | major      | Charles, Corentin | ??                   |
| Figure out what to do with old protocols                             |                 |            | Pietro, Arvid     | ??                   |

### Deliverables

- A pipeline configuration with shorter average/worst-case marge-bot pipeline wall-times
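
The ratios quoted under "How the objective was computed" (Project 3) can be re-checked with a short script. This is only a sketch: the constants are the totals copied from that section, while the helper names `minutes_per_pipeline` and `reduction_pct` are hypothetical and not part of any existing tool. It recomputes the stated percentages from the stated totals and does not derive the totals themselves.

```python
# Totals copied from "How the objective was computed" (Project 3).
PIPELINES = 4552
JOBS_BEFORE, JOBS_AFTER = 473252, 348149
HOURS_BEFORE, HOURS_AFTER = 32033, 17463
OPAM_HOURS_BEFORE, OPAM_HOURS_AFTER = 19157, 5121


def minutes_per_pipeline(total_hours: float, pipelines: int) -> float:
    """Sequential time per pipeline, in minutes (hypothetical helper)."""
    return total_hours * 60 / pipelines


def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent (hypothetical helper)."""
    return (1 - after / before) * 100


if __name__ == "__main__":
    opam_share = OPAM_HOURS_BEFORE / HOURS_BEFORE * 100
    print(f"opam share of sequential time: {opam_share:.0f}%")
    print(f"before: {minutes_per_pipeline(HOURS_BEFORE, PIPELINES):.0f} min/pipeline")
    print(f"after:  {minutes_per_pipeline(HOURS_AFTER, PIPELINES):.0f} min/pipeline")
    print(f"opam time reduction:  {reduction_pct(OPAM_HOURS_BEFORE, OPAM_HOURS_AFTER):.0f}%")
    print(f"job count reduction:  {reduction_pct(JOBS_BEFORE, JOBS_AFTER):.0f}%")
    print(f"total time reduction: {reduction_pct(HOURS_BEFORE, HOURS_AFTER):.0f}%")
```

The same two helpers could be pointed at fresh totals from a later measurement period to track the recorded metric (M2.3) against the objective.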
