# 2023Q2: Keep infrastructure efficient – Project definitions #
## Pre-existing metrics
- M3.1: Number of failing opam jobs per scheduled pipeline
- This can be read from the GitLab web interface
- M4: Recorded AWS cost
- This can be deduced from the bills AWS sends us ;) or, alternatively,
from the AWS cost exporter
## Project 1: Pipeline Monitoring and Simulation ##
Title
: Pipeline-centric Infrastructure Monitoring
Description
: Given the OKR, we need tools to monitor and measure pipeline performance. We also need tools to help us evaluate the changes we propose.
Objective
: Establish metrics to measure progress on Axes 3 & 4.
Estimated effort (in nb weeks)
: --
Associated KRs
: CI runs consistently under 20 minutes and cost is not higher than the current one
Dependencies
: None
Deliverables
: The set of metrics necessary to measure progress on Axes 3 & 4.
## Task breakdown ##
| Task | Impact | Complexity | Who could do it | Estimated effort (d) |
|------------------------------------------------------------------|--------|------------|--------------------|----------------------|
| Figure out a way to measure progress modulo evolution | | | All | 7d |
| Pipeline monitoring dashboard as proposed on #nl-devops by Arvid | | | Charles, Corentin | 3d |
| Profile all jobs to understand resources requirements | | | Charles? Corentin? | 3d |
| (Pipeline cost estimation tool) | | | Arvid, Charles? | 1d |
| (Measure flakiness over time) | | | Arvid / Pietro | 2d |
## Deliverables
The deliverables for this project will be two tools producing the metrics
necessary to measure progress on Projects 3 & 4.
- Tool: Pipeline simulator (M1)
- For a set of pipeline types, disregarding evolutions in the code-base unrelated to our changes.
- Metric M1.1: Projected Pipeline wall-time and sequential duration by type
- Metric M1.2: Projected Total Sequential Time (obtained by multiplying M1.1 by the pipeline type frequencies)
- Metric M1.3: Projected Sequential Time per Pipeline (obtained by dividing M1.2 by the sum of the pipeline type frequencies)
- Tool: Pipeline monitoring dashboard (M2)
- For a given project & given time period & pipeline type:
- Metric M2.1: Recorded Pipeline average/worst-case wall-time and sequential duration
- Metric M2.2: Recorded Number of launched pipelines for a given project
- Metric M2.3: Recorded Sequential Time per Pipeline
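As a minimal sketch of how M1.2 and M1.3 follow from M1.1, assuming M1.2 is the frequency-weighted sum over pipeline types: the type names, durations and frequencies below are hypothetical placeholders, not measured values.

```python
from typing import Dict


def projected_total_sequential_time(
    seq_minutes_by_type: Dict[str, float], frequency_by_type: Dict[str, int]
) -> float:
    """M1.2: sum over pipeline types of (M1.1 sequential duration x frequency)."""
    return sum(
        seq_minutes_by_type[ptype] * freq
        for ptype, freq in frequency_by_type.items()
    )


def projected_sequential_time_per_pipeline(
    seq_minutes_by_type: Dict[str, float], frequency_by_type: Dict[str, int]
) -> float:
    """M1.3: M1.2 divided by the sum of the pipeline type frequencies."""
    total = projected_total_sequential_time(seq_minutes_by_type, frequency_by_type)
    return total / sum(frequency_by_type.values())


# Hypothetical inputs: sequential minutes per pipeline type (M1.1) and
# number of pipelines of each type over the reference period.
SEQ_MINUTES = {"marge_bot": 300.0, "scheduled": 600.0, "merge_request": 200.0}
FREQUENCIES = {"marge_bot": 100, "scheduled": 10, "merge_request": 390}

if __name__ == "__main__":
    print(projected_total_sequential_time(SEQ_MINUTES, FREQUENCIES))
    print(projected_sequential_time_per_pipeline(SEQ_MINUTES, FREQUENCIES))
```

M1.3 is therefore a weighted average of the per-type sequential durations, so a rare but expensive pipeline type moves it less than a frequent one.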
### Figure out a way to measure progress modulo evolution ###
<details>
<summary>Breakdown of task 1</summary>
1. Identify a set of pipeline types which we would like to measure
- assignee: arvid
- estimation: 1d
- description: types & (frequency over a given period)
3. Make a mechanism for running the set of pipelines for a given reference commit + a set of configuration changes
- Do we need a separate infrastructure?
- assignee: charles
- estimation: 1d
- description:
- tool: input: baseline commit + modification commits
- assignee: pietro?
- estimation: 5d
- checkout baseline commit
- cherry-pick the modification commits
- modify the CI config (.gitlab-ci.yml) automatically to add a tag that corresponds to the dedicated infrastructure
- .gitlab-ci.yml: default: tag: MACHIN (placeholder tag name)
- how to handle arm64?
- launch the pipeline types
- instrumentation of .gitlab-ci.yml needed to simulate the pipeline types
- in a separate project (nomadic-labs/tezos-ci-measures)
5. (Make it possible to do this every week, monitor the result and synthesize)
</details>
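The tool outlined in the breakdown above (check out a baseline commit, cherry-pick the modification commits, route jobs to the dedicated infrastructure) could start from something like the following sketch. The function names, the commit identifiers and the `ci-measures` tag are hypothetical. The sketch only plans the git invocations instead of running them, and it assumes the tag is injected through GitLab's `default: tags:` keyword with a naive text prepend rather than a real YAML merge.

```python
from typing import List


def plan_git_commands(baseline: str, modifications: List[str]) -> List[List[str]]:
    """Return the git invocations for the checkout and cherry-pick steps.

    Nothing is executed here; the caller decides how to run the commands.
    """
    commands = [["git", "checkout", "--detach", baseline]]
    if modifications:
        commands.append(["git", "cherry-pick", *modifications])
    return commands


def add_default_tag(ci_config: str, tag: str) -> str:
    """Prepend a `default: tags:` section so untagged jobs run on `tag` runners.

    Assumption: the config has no top-level `default:` section yet. If it
    does, we refuse rather than attempt a textual merge.
    """
    if any(line.startswith("default:") for line in ci_config.splitlines()):
        raise ValueError("config already has a `default:` section; merge by hand")
    return f"default:\n  tags:\n    - {tag}\n\n{ci_config}"


if __name__ == "__main__":
    for command in plan_git_commands("abc123", ["def456", "789abc"]):
        print(" ".join(command))
    print(add_default_tag("stages:\n  - build\n", "ci-measures"))
```

Jobs that declare their own `tags:` override the default, so they would still need individual treatment; this is also where the open arm64 question would have to be handled.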
### Pipeline monitoring dashboard ###
Ref: [pipeline monitoring dashboard](https://tezos-dev.slack.com/archives/GTKLTTZU2/p1679472872216589)
Title
: Pipeline profiling and other investigating projects
Description
: It'd be a good idea to understand how resources are spent outside
`{tezos,nomadic-labs}/tezos`. Also, it would be good to have the
resource requirements of each specific job to understand what kind
of machine is best adapted for it.
Estimated effort (in nb weeks):
:
Associated KRs:
: CI runs consistently under 20 minutes and cost is not higher than the current one
Dependencies:
: Weak dependency on Project 1 (for pipeline types)
### Task breakdown
- [ ] Measure the cost of side-projects (!= */tezos) @(Charles, Corentin?) [1d]
- [ ] Make sure interruptible is used in side-projects @(Charles, Corentin, Arvid) [1d]
### Deliverables:
- A report with a resource characterization of each job in the
set of pipeline types defined in Project 1.
- A report on the load and cost associated with each project in `{nomadic-labs,tezos}/*`
- Pipeline configurations in side-projects with interruptible pipelines
## Project 3: Reducing overall workload in `{nomadic-labs,tezos}/tezos` ##
Title
: Reducing overall CI workload tezos/tezos
Description
: The idea is to reduce the overall workload. This will save money,
which can be used to buy faster hardware to reduce the marge-bot
pipeline wall time. The major contributor to workload at the moment is
opam tests, so most ideas are centered around those tests.
Estimated effort (in nb weeks)
: 1 week
Associated KRs
: CI runs consistently under 20 minutes and cost is not higher than the current one
Dependencies
: Project 1
### Metrics & Objectives
- Goal metrics (measured by):
- Projected Sequential Time per Pipeline (M1.3)
- Recorded Sequential Time per Pipeline (M2.3)
- Guardrail metric (measured by):
- Number of opam failures in scheduled pipelines (M3.1)
- Recorded / projected AWS cost (implied by the goal metric if sequential time is a sufficient proxy for cost and by M4)
Objectives:
- Goal metric:
- Reduce projected sequential time per pipeline to 230 minutes.
- Reduce recorded sequential time per pipeline to 250 minutes.
- Guardrail metric:
- The number of opam failures in scheduled pipelines should be
similar to that of a comparable time period before our changes (±5%).
- Recorded / projected AWS cost does not increase (this is implied by the goal
metric if sequential time is a sufficient proxy for cost)
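A minimal sketch of how these objectives could be checked mechanically, assuming "similar (±5%)" means a relative difference of at most 5% against the earlier period. The function names and the example inputs are hypothetical; only the 230-minute, 250-minute and 5% thresholds come from the objectives above.

```python
from typing import Dict


def within_relative_band(baseline: float, observed: float, tolerance: float = 0.05) -> bool:
    """True if `observed` deviates from `baseline` by at most `tolerance` (relative)."""
    if baseline == 0:
        return observed == 0
    return abs(observed - baseline) / baseline <= tolerance


def objectives_met(
    projected_minutes: float,
    recorded_minutes: float,
    opam_failures_before: int,
    opam_failures_after: int,
) -> Dict[str, bool]:
    """Evaluate the goal and guardrail objectives of this project."""
    return {
        "projected_sequential_time": projected_minutes <= 230,
        "recorded_sequential_time": recorded_minutes <= 250,
        "opam_failures_similar": within_relative_band(
            opam_failures_before, opam_failures_after
        ),
    }


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    print(objectives_met(228.0, 255.0, 100, 104))
```

The AWS cost guardrail is left out on purpose: it depends on M4 and on whether sequential time is a sufficient proxy for cost.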
### Task breakdown
| Task | Wall-time impact best/worst-case | Sequential impact | Cost impact potential | Complexity | Who could do it | Estimated effort (d) |
|--------------------------------------------------|----------------------------------|-------------------|-----------------------|----------------|-----------------|----------------------|
| Only run opam jobs for leaves/top-level packages | | -45% | Savings | Medium | Pietro, Arvid | 3d |
| (Only run opam jobs for rev-deps) | 1-27 minutes / 0 | | Savings | Medium | Pietro, Arvid | 10d |
| (Reduce number of opam packages) | 0 / 0 | | Savings | Hard / tedious | Pietro, Arvid | ? |
| (Use runtest image instead of prebuild) | ? / 0 | | Savings | Easy | Arvid, Pietro | 2-3d |
| (Cache _opam directory) | ? / ? | | Savings | Easy | Arvid, Pietro | 1d |
### Deliverables
- A pipeline configuration that runs fewer opam jobs
### How the objective was computed
<details>
<summary>Nerdy details</summary>
- In the period 2023-03-19T06:00:43.623Z to 2023-04-19T06:35:09.999Z
- We had 473252 jobs in 4552 pipelines, for a total sequential time
of 32033 hours. This gives a sequential time per pipeline of 422 minutes.
- 19157 hours in 63091 jobs were spent on opam jobs (~60%)
- We say that the necessary opam jobs are those that:
- were launched by marge-bot,
- ran on master (scheduled pipelines),
- or were one of the leaf jobs (definition fuzzy for the moment):
- opam:octez-accuser-PtLimaPt, opam:octez-accuser-PtMumbai,
opam:octez-accuser-PtNairob, opam:octez-baker-PtLimaPt,
opam:octez-baker-PtMumbai, opam:octez-baker-PtNairob,
opam:octez-client, opam:octez-codec, opam:octez-node,
opam:octez-protocol-compiler, opam:octez-proxy-server,
opam:octez-signer, opam:octez-smart-rollup-client-PtLimaPt,
opam:octez-smart-rollup-client-PtMumbai,
opam:octez-smart-rollup-client-PtNairob,
opam:octez-smart-rollup-node-PtLimaPt,
opam:octez-smart-rollup-node-PtMumbai,
opam:octez-smart-rollup-node-PtNairob
- By only running the necessary opam jobs, we go down to 17385
jobs in 5121 hours from 63091 jobs in 19157 hours, a 73%
reduction of opam job time.
- In total, we go down from 473252 jobs to 348149 jobs (a 26%
reduction) and from 32033 hours to 17463 hours (a 45%
reduction). This gives a sequential time per pipeline of 230 minutes.
- To account for code-base evolution, we add a margin
of 20%, so we aim for a reduction to 250 minutes from 422.
</details>
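The ratios in the details above can be re-derived from the quoted totals with a few lines of arithmetic. This is only a sanity-check sketch: the helper names are hypothetical, and the reduced totals (348149 jobs, 17463 hours) are taken as given from the note rather than recomputed from job data.

```python
def minutes_per_pipeline(total_hours: float, pipelines: int) -> float:
    """Sequential time per pipeline, in minutes."""
    return total_hours * 60 / pipelines


def reduction(before: float, after: float) -> float:
    """Relative reduction when going from `before` to `after`."""
    return 1 - after / before


PIPELINES = 4552

if __name__ == "__main__":
    print(round(minutes_per_pipeline(32033, PIPELINES)))  # 422 minutes today
    print(round(minutes_per_pipeline(17463, PIPELINES)))  # 230 minutes projected
    print(round(100 * 19157 / 32033))                     # 60 (% of time in opam jobs)
    print(round(100 * reduction(19157, 5121)))            # 73 (% less opam job time)
    print(round(100 * reduction(32033, 17463)))           # 45 (% less total time)
    print(round(100 * reduction(473252, 348149)))         # 26 (% fewer jobs)
```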
## Project 4: Reducing marge-bot pipeline wall-time on `tezos/tezos` ##
Title
: Reducing marge-bot pipeline wall-time on `tezos/tezos`
Description
: This is the end goal: make the marge-bot pipelines shorter so that developers spend less time waiting.
Estimated effort (in nb weeks):
: --
Associated KRs
: CI runs consistently under 20 minutes and cost is not higher than the current one
Dependencies
: Axis 1
### Metrics & objective
- Goal metrics (measured by):
- Projected Average/Worst-case Wall-time of Marge-bot pipelines (M1.1)
- Recorded Average/Worst-case Wall-time of Marge-bot pipelines (M2.1)
- Guardrail metrics (measured by):
- Quality assurance level (??)
- This is very hard to quantify in a measurable manner
- Projected / recorded AWS cost (M1.2 / M4)
Objectives:
- Goal metrics:
- Projected average/worst-case wall-time of marge-bot is under
20 minutes/30 minutes
- Guardrail metrics
- Quality assurance level: should be maintained
- Recorded / projected AWS cost: should not increase
### Task breakdown
| Task | Potential gains | Complexity | Who could do it | Estimated effort (d) |
|----------------------------------------------------------------------|-------------------------|--------------------|-------------------|----------------------|
| `build_x86_64`: batch dune calls | minor | major | Arvid, Pietro | 1d |
| Increase the number of Tezt jobs | 2-3 mins | minor | Arvid, Pietro | 1d |
| `script:snapshot_alpha_and_link`: combine into `build_x86_64` | ?? | | Arvid | 2-3d |
| `unified_coverage`: move to `master` pipeline on marge-bot pipelines | 2-3 mins | | Arvid | 2d |
| Flaky tests: make and implement policy | | | Arvid, Pietro | 2d |
| Flaky tests: ward off | | | Pietro, Arvid | 2d |
| Finish the flakiness detection pipeline | | | Arvid | 1d |
| Flaky tests: fix | | | Pietro, Arvid | ?? |
| Unit tezts: only test modified + revdeps | | | | 10d |
| Per job-specific machines as per profile | | | Charles, Corentin | ?? |
| `build_x86_64`: investigate dedicated build machine(s) | major | major | Charles, Corentin | 1d |
| `build_x86_64`: implement dedicated build machine(s) | major | major | Charles, Corentin | ?? |
| Figure out what to do with old protocols | | | Pietro, Arvid | ?? |
### Deliverables
- A pipeline configuration with shorter average/worst-case marge-bot
pipeline wall-times