---
tags: CI
---

# OCluster Carbon Footprint

## Introduction

Various ideas, thoughts and notes about making the cluster more carbon-aware.

### Solver Service

#### Opam Action Graph

Currently most builds follow some pattern of `setup -> install -> build/test`. The `install` step encompasses a lot of actions: installing system dependencies, OCaml packages etc. Installing a package's dependencies in one step leaves a lot of room for missed caching opportunities.

For example, consider packages `A` and `B` with dependencies `dune.2.9.1, yaml.3.0.0` and `dune.2.9.1, yaml.2.9.9` respectively. Naïvely doing `opam install dune.2.9.1 yaml.3.0.0` (and similarly for `B`) will produce different hashes. If the build instead did `opam install dune.2.9.1` followed by `opam install yaml.3.0.0`, then the build for `B` could restore the cached `dune.2.9.1` step, saving some computation. In practice this may not be quite so straightforward, as our caching is often broken by something like a `copy` of `opam-repository` or the `*.opam` files before getting to this stage...

#### Abort unavailable packages early

By having a solver service we could be clever (I think) and immediately check if the package is available, preventing any extra unnecessary build steps. For example, currently in `opam-repo-ci` we `opam update --depexts` before checking if we can even install, which seems wasteful (particularly given that build step will never be re-used afaict).

- I believe dra27 is extracting the compiler-making step at the very least between the 2/3 `make cold` opam steps :))

### Other Builds

#### Docker base images

Are these being rebuilt (including `make cold` opams) every single time? Is this necessary for the images to be correct, or are we missing a layer of caching here? Perhaps we should be tracking the commit hash of the various opam branches that we build into these images so we know whether or not anything has changed.
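The split-install caching argument for `A` and `B` above can be sketched by hashing each build step's inputs, Docker-layer style. This is a minimal illustration, not OCluster's actual cache-key scheme: a combined `opam install` of both packages yields no shared keys, while per-package steps let `B` hit `A`'s cached `dune.2.9.1` layer.

```python
import hashlib


def step_hash(parent: str, command: str) -> str:
    """Cache key for one build step: hash of the parent layer's key plus the command."""
    return hashlib.sha256(f"{parent}:{command}".encode()).hexdigest()[:12]


def build_layers(commands):
    """Chain of cache keys produced by running the given steps in order."""
    parent, layers = "base", []
    for cmd in commands:
        parent = step_hash(parent, cmd)
        layers.append(parent)
    return layers


# Combined install: A and B share no cache keys at all.
a_combined = build_layers(["opam install dune.2.9.1 yaml.3.0.0"])
b_combined = build_layers(["opam install dune.2.9.1 yaml.2.9.9"])
assert not set(a_combined) & set(b_combined)

# Split install: the first (dune.2.9.1) layer key is identical, so B
# could restore A's cached build of dune before diverging on yaml.
a_split = build_layers(["opam install dune.2.9.1", "opam install yaml.3.0.0"])
b_split = build_layers(["opam install dune.2.9.1", "opam install yaml.2.9.9"])
assert a_split[0] == b_split[0]
```

The same model also shows why an early `copy` of `opam-repository` breaks caching: any step whose inputs change alters its key and, transitively, every key after it.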
### Carbon-aware Cluster

All electricity is not created equally. There is a staggering number of things to take into account (these thoughts relate to the UK National Grid):

- What time of day is it? This impacts the relative carbon intensity of the electricity being generated.
- Which country, and what part of the country, are the machines located in? See [the regional map of carbon intensity](https://carbonintensity.org.uk/#regional).
- Should we be using the electricity at all? Are jobs a high enough priority to be worth scheduling -- do we need a weekly opam-health-check etc.?

#### Carbon-aware Scheduling

Combining machine wattage with data from the [Carbon Intensity API][carbon-intensity-api] (provided the machines are located in Great Britain), we can have a carbon-aware scheduler. Jobs are only scheduled at times when the carbon intensity is going to be below a threshold, and we can let users know in advance if there will be a down period -- very high priority jobs can still be scheduled manually, but this comes at some carbon-offsetting cost. Jobs could also have an associated cost proportional to their carbon usage (a combination of run time, power usage and carbon intensity of the power). This information could also help with the desire to schedule jobs based on how long they are likely to take.

#### Machine-aware Scheduling

Some machines may (after some empirical research) have a better throughput/wattage ratio for particular jobs -- the scheduler could incorporate a feedback loop about the performance of certain jobs on machines and learn which machines to schedule which kinds of jobs on. This would work together with the caching policy.

[carbon-intensity-api]: https://api.carbonintensity.org.uk/
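The threshold-based scheduling idea above can be sketched as follows. The forecast records mimic the window shape of Carbon Intensity API intensity responses (an assumption here; check the API docs for the real schema), and the values, threshold, and `job_carbon_cost` formula are illustrative, not a real scheduling policy.

```python
# Made-up half-hour forecast windows, shaped loosely like the Carbon
# Intensity API's JSON (assumed, not verified against the real API).
forecast = [
    {"from": "2023-01-01T00:00Z", "to": "2023-01-01T00:30Z", "intensity": {"forecast": 310}},
    {"from": "2023-01-01T00:30Z", "to": "2023-01-01T01:00Z", "intensity": {"forecast": 180}},
    {"from": "2023-01-01T01:00Z", "to": "2023-01-01T01:30Z", "intensity": {"forecast": 90}},
]

THRESHOLD = 200  # gCO2/kWh; an arbitrary cut-off for this sketch


def next_green_window(forecast, threshold=THRESHOLD):
    """First window whose forecast intensity is below the threshold, or None
    if the whole period is above it -- the 'down period' that could be
    reported to users in advance."""
    for window in forecast:
        if window["intensity"]["forecast"] < threshold:
            return window
    return None


def job_carbon_cost(runtime_hours, avg_power_kw, intensity_g_per_kwh):
    """Carbon cost of a job in grams of CO2: run time x power x intensity."""
    return runtime_hours * avg_power_kw * intensity_g_per_kwh


window = next_green_window(forecast)
# A 2 h job on a 0.5 kW machine during that window (180 gCO2/kWh) emits
# job_carbon_cost(2, 0.5, 180) = 180 g of CO2.
```

The same per-job cost figure could feed the manual-override path: a high-priority job scheduled into an above-threshold window simply carries a larger `job_carbon_cost` to offset.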