# Tiger team - GPU software in EESSI
Tiger team lead: Kenneth
# Sync meeting 20250507
attending:
- Richard, Thomas (UiB)
- Caspar (SURF)
- Lara (UGent)
## agenda
- action points from the weeks since the last meeting:
- Caspar: CUDA sanity check in EasyBuild, access to service account @ SURF, help with arch_target_map in bot
- CUDA sanity check (https://github.com/easybuilders/easybuild-framework/pull/4692)
- access to service account @ SURF: "everybody" (works for Kenneth, Lara, Thomas; unclear for Richard) -- access means access to files, not access to the service account
- https://gitlab.com/eessi/support/-/wikis/GPU-SURF-cluster
- help with arch_target_map in bot (draft https://github.com/EESSI/eessi-bot-software-layer/pull/312)
- Lara: full matrix test builds for CUDA-samples & co
- did not get to it
- postpone to next meeting
- Richard already did some of this (see his item below)
- Richard: full matrix test builds for CUDA-samples & co
- CUDA + a bunch of CUDA packages
- cc70: https://github.com/EESSI/software-layer/pull/1030
- some failures building CUDA-Samples
- options:
- don't build CUDA-Samples, or at least not this version --> not really a good option
- build a newer version of CUDA-Samples, keep the older version, and provide a module file that points users to the newer version
- cc80: https://github.com/EESSI/software-layer/pull/1076
- cc90: https://github.com/EESSI/software-layer/pull/1077
- go ahead building CUDA, UCX, UCC and OSU without CUDA-Samples
- build newer version of CUDA-Samples for all ccXX
- build older version of CUDA-Samples for all ccXX, where cc70 modules have a warning/info that it didn't build and users can use a newer version
- [Kenneth] open an issue to update the contribution policy for what is acceptable if a software (version) cannot be built for some specific hardware configurations (e.g., CUDA-Samples 12.1 for cc70)
- Thomas: arch_target_map in bot, checking in bot
- looked briefly at Caspar's draft PR
- PyTorch 2.1.2 with CUDA
- add ReFrame tests to list of tests that are run
- zen3+cc80 (UGent) built
- zen4+cc90 (Snellius) failed (timeout? in testing)
- grace+cc90 (JURECA) "failed" with ~20 failing tests instead of ~10 -> ask Flamefire whether any of them are concerning
- cc70
- create known issue and go ahead
- Kenneth: vacation (11-21 April), butterfly mode (help where needed)
- very little
- Alan: EasyBuild hook + implement fallback from cc89 to cc80 in init script
- check with Alan
- bot improvement
- UGent setup: `scontrol` needs to know the cluster (a wrapper could get the job ID via `squeue`, then call the actual `scontrol`; see the sketch at the end of this section)
- LAMMPS on NVIDIA Grace/Hopper
- didn't build on JURECA nodes
- Richard talks to Lara about it
- Try PyTorch 2.6 or even 2.7
- work with Flamefire
- next meeting: "doodle" for after last webinar not before june 5
- for Kenneth: not before 5th of June
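Regarding the `scontrol` wrapper idea under the bot improvement item above, a minimal sketch of what such a wrapper could look like; the path of the real `scontrol` binary is an assumption, and `sacct` is used here as one possible way to look up the cluster (the notes suggest going via `squeue`):

```bash
#!/bin/bash
# hypothetical scontrol wrapper: look up which cluster a job belongs to,
# then call the real scontrol with the matching --clusters option
REAL_SCONTROL=/usr/bin/scontrol   # assumed location of the real binary

jobid="$1"
# ask the accounting database which cluster this job runs on
cluster=$(sacct --clusters=all --jobs="${jobid}" --allocations --noheader --format=Cluster | head -n 1 | tr -d ' ')

exec "${REAL_SCONTROL}" --clusters="${cluster}" show job "${jobid}"
```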
# Sync meeting 20250409
attending:
- Thomas, Richard (UiB)
- Lara, Kenneth (UGent)
- Caspar (SURF)
- Alan (CECAM/UB)
## Agenda
### Prepared notes (Thomas)
- Formalize supported CUDA compute capability & CPU architecture combinations
- Decide on which CPU+CCx combinations we want to build for: please, everyone read up on https://gitlab.com/eessi/support/-/issues/142 and make up your mind
- so far we have CUDA 12.1.1 (with foss/2023a) and CUDA 12.4.0 (with foss/2023b)
- unclear why CUDA 11 is discussed in the issue?
- Blackwell requires CUDA 12.8
- have to keep that in mind for EESSI/2025.04 + toolchains bundled with that
- unclear what "auto-upgrade" means, not clear why we need that
- proposal for supported CPU+ccXY combinations (regardless of toolchain?)
- EESSI/2023.06: any-CPU+cc[70,80,90]
- EESSI/2025.04: any-CPU+cc[70,80,90,100]
- does CUDA 12.8 still fully support cc70? just curious
- guarantees that cc70 builds work on any cc7x device, cc80 builds work on any cc8x device, ...
- could/should change archdetect so it only returns the major version (however, we then need to change that if we change our mind and build for minor cc-versions; maybe better to keep archdetect as is and decide which build to use where archdetect is used?)
- PTX sounds interesting, but maybe we look into that later?
- proposal for build workflow/strategy
- first build on nodes with GPUs
- if successful, build other combinations
- add note to module file to inform whether package has been built on GPU/ccXY or not
- Log / document which combinations are supported natively, and which are cross-compiled
- we should have a strategy in case something doesn't work
- (CUDA sanity check)
- Other topics (bot support, access to build nodes with GPUs)
### Meeting notes
- things to decide
- which CUDA versions per EESSI version
- CUDA 12.1.1 (2023a) + 12.4.0 (2023b) in EESSI 2023.06
- CUDA 12.8.0 (Blackwell support, 2024a) in EESSI 2025.0X
- which NVIDIA GPU targets (ccXY) per EESSI version (+ toolchain?)
- cc80 (A100) would be sufficient to also cover L4, L40, A40, etc.
- major versions (cc70, cc80, cc90) for all supported CPU targets
- specific minor versions (cc89) only with CPU targets that "make sense"
- cover combos available in AWS/Azure
- like cc75 (T4) + Cascade Lake => AWS G4 instances (https://aws.amazon.com/ec2/instance-types/g4/)
- which CPU+GPU target combos
- wait for CUDA sanity check feature in EasyBuild
- requires EasyBuild 5.0.X
- see https://github.com/easybuilders/easybuild-framework/pull/4692
- can also verify afterwards with `--sanity-check-only`
- needs extra option to error out in case CUDA sanity check fails
- GPU build nodes vs "cross-compiling"
- procedure
- for each PR
- first test builds on GPU build nodes
- @UGent: zen3+cc80 (service account)
- can also do cascadelake+cc70 and zen4+cc90
- @SURF: zen4+cc90 (service account)
- can also do icelake+cc80
- @JSC: grace+cc90 (service account, when available)
- @AWS: cascadelake+cc75 (G4 instances)
- only once cascadelake CPU target is fully supported
- then build for all major GPU targets (ccX0), cross-compiling
- then build for specific CPU+GPU combos, cross-compiling
- initially, do only subset:
- cascadelake+cc70 (UGent V100)
- zen2+cc80 (A100 @ Vega, Karolina, ...)
- zen3+cc80 (UGent A100)
- zen4+cc90 (SURF + UGent H100)
- nvidia/grace+cc90 (JSC)
- icelake+cc80 (SURF) -- when CPU target is available
- how to deal with explosion of bot build jobs + ensure everything is there
- + also improve ingestion
- add support for wildcards for CPU/GPU targets so all targets get built automatically
- `bot: check-deploy` => each bot verifies whether PR is ready to deploy
- only set label if each bot replies positively
- eventually also let 1 bot verify overall status?
- start with CUDA-Samples, OSU, NCCL, GROMACS, LAMMPS, ESPResSo, ...
- (try) full matrix for CUDA-Samples + OSU + required dependencies
- to fix what we broke when removing CUDA & co from CPU-only locations
- `CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1`
- `OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1`
- `OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0`
- subset for apps like GROMACS
- dedicated issue to keep track of progress
- archdetect needs to be enhanced to make it aware of major GPU target version
- should actually be done by init script (cc89 -> cc89+cc80)
- EasyBuild hook should know when it's expected to have a GPU available
- based on whether `nvidia-smi` is available
- if `$EESSI_ACCELERATOR_TARGET` is specified, verify whether it matches what `nvidia-smi` reports (see the sketch at the end of this section)
- to enforce testing + inject info to generated module file on whether GPU testing was done or not
- make sure that no CPU-only modules get installed in `accel` subdirectory?
- enforce in `EESSI-extend` only for non-project/non-user installations
- action points for the coming weeks:
- Caspar: CUDA sanity check in EasyBuild, access to service account @ SURF, help with arch_target_map in bot
- Lara: full matrix test builds for CUDA-samples & co
- Thomas: arch_target_map in bot, checking in bot
- Richard: full matrix test builds for CUDA-samples & co
- Kenneth: vacation (11-21 April), butterfly mode (help where needed)
- Alan: EasyBuild hook + implement fallback from cc89 to cc80 in init script
- next meeting
- Wed 7 May 2025 13:00 CEST: OK for all
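A minimal sketch of the `nvidia-smi` check for the EasyBuild hook / build script described above (referenced from the `$EESSI_ACCELERATOR_TARGET` item); the exact format of `$EESSI_ACCELERATOR_TARGET` and the warning behaviour are assumptions:

```bash
# hedged sketch: detect whether a GPU is available and whether it matches
# the accelerator target the build is being done for
if command -v nvidia-smi &> /dev/null; then
    # e.g. "8.0" on an A100
    compute_cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
    detected_cc="cc${compute_cap//./}"
    if [ -n "${EESSI_ACCELERATOR_TARGET}" ]; then
        # assuming $EESSI_ACCELERATOR_TARGET looks like "accel/nvidia/cc80"
        expected_cc="${EESSI_ACCELERATOR_TARGET##*/}"
        if [ "${detected_cc}" != "${expected_cc}" ]; then
            echo "WARNING: building for ${expected_cc}, but nvidia-smi reports ${detected_cc}" >&2
        fi
    fi
else
    echo "no nvidia-smi found: cross-compiling, GPU tests will be skipped" >&2
fi
```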
---
# Sync meeting 20250114
attending: Richard, Lara, Kenneth, Alan, Bob, Caspar, Thomas
## Agenda
- status update
- build for specific CPU microarch + CUDA CC (zen2 + cc80, zen3 + cc80)
- some software is tricky to build (PyTorch, TensorFlow, ...)
- next steps
- in the process of setting up bots on clusters with GPUs (UGent, SURF, ...)
- keep focus on A100+Zen2/Zen3 for now (covers most of EuroHPC systems)
- A100+Zen3 as GPU build node
- A100+Zen2 on non-GPU build node (and then test on A100+Zen3 node)
- also start supporting Intel Ice Lake as extra CPU generation between Skylake & Sapphire Rapids?
- Intel Ice Lake for SURF A100 nodes + Leonardo GPU partition
- general discussion
- not feasible to really use GPU nodes for all combinations of CPU+GPU
- BUT do at least 1 (?) build on a GPU node for each software package and for each GPU generation
- for service account for EESSI at UGent, supported GPUs:
- V100 (+ Intel Cascade Lake)
- A100 (+ Zen3, also + Zen2 on Tier-1 system)
- H100 (+ Zen4)
- would be helpful if bot knows what "baseline" is for testing GPU builds, like `bot: build arch=gpu-baseline`
- as a shorthand for build command for V100+A100+H100
- would be useful for CPUs too
- way for ensuring that all required builds have been done
- some CI or bot checks
- build for generic+cc80 on CPU only nodes (e.g., on AWS), may be tested on more modern CPUs that have A100 or newer
- need changes to bot / overall test setup
1. build on resource A, but only run minimal tests if any
2. upload tarball to some S3
3. run test on resource B
- software-layer
- stacks for new architectures: GH200 (OpenMPI/MPICH)
- important for Olivia (NRIS/Sigma2, Norway), JUPITER
- could run build bot at JSC
- CUDA 12.0 might be minimum version
- sync meeting with JSC (Sebastian) to understand status of builds for Grace?
- need to re-generate the module file for CUDA
- see [easyblocks PR #3516](https://github.com/easybuilders/easybuild-easyblocks/pull/3516)
- PyTorch w/ CUDA
- handful of failing tests, maybe the usual suspects?
- action point (TR): send list of failing tests to Kenneth
- TensorFlow w/ CUDA
- host stuff leaking into build of Bazel/TensorFlow, leading to glibc compat issues
- maybe [easyblocks #3486](https://github.com/easybuilders/easybuild-easyblocks/pull/3486) is relevant here
- AMD GPUs
- check if support is needed for deliverables in June
- good connection to AMD
- currently only relevant for LUMI
- toolchain issue
- detection of libraries, drivers (how different is it to what we do for NVIDIA)
- sanity check for compute capabilities: check if the device code in builds matches what the build was intended for (see the `cuobjdump` sketch at the end of this section)
- add to `bot/check-build.sh` script
- see [support issue #92](https://gitlab.com/eessi/support/-/issues/92) in GitLab
- EESSI test suite
- some issue with test with locally deployed bot https://gitlab.com/eessi/support/-/issues/114
- what GPU tests do we already have
- eessi-bot-software-layer
- support for `node` filter (to send job to GPU nodes)
- change way jobs are created (to avoid that many jobs are created by accident)
- think about baseline feature
- think about letting ReFrame itself schedule test jobs
- add support for silencing the bot :-)
- next meetings: schedule via Slack (separately for AMD and for tiger-NVIDIA)
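For the compute-capability sanity check above (the `cuobjdump` sketch), one possible way to verify the device code in installed binaries; `installdir` and `expected_sm` are hypothetical placeholders:

```bash
# hedged sketch for bot/check-build.sh: check that CUDA binaries contain
# device code for the compute capability the build was intended for
installdir=/path/to/installation   # hypothetical
expected_sm="sm_80"                # e.g. derived from the cc80 accelerator target

for binary in "${installdir}"/bin/*; do
    # cuobjdump --list-elf prints one line per embedded cubin, e.g. "*.sm_80.cubin"
    if cuobjdump --list-elf "${binary}" 2> /dev/null | grep -q "${expected_sm}"; then
        echo "OK: ${binary} contains ${expected_sm} device code"
    else
        echo "WARNING: no ${expected_sm} device code found in ${binary}" >&2
    fi
done
```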
# Sync meeting 20241009
attending: Kenneth, Alan, Caspar, Richard, Lara, Pedro
## Merged PRs
- *(none)*
## Open PRs
- Allow Nvidia driver script to set `$LD_PRELOAD` ([software-layer PR #754](https://github.com/EESSI/software-layer/pull/754))
- list of CUDA compat libraries is a lot shorter, are those actually enough?
- we should try and get this merged as is, additional features can be implemented in follow-up PRs
- Kenneth will review PR
- enhance archdetect to support detection of NVIDIA GPUs + using that in EESSI init script ([software-layer PR #767](https://github.com/EESSI/software-layer/pull/767))
- only covers init script, not `EESSI` module yet
- being deployed right now
- cuDNN 8.9.2.26 w/ CUDA 12.1.1 (part 1) ([software-layer PR #772](https://github.com/EESSI/software-layer/pull/772))
- minor open discussion point w.r.t. `--accept-cuda-eula` option, we should either make it more generic (`--accept-eula=CUDA,cuDNN`), or also add `--accept-cudnn-eula`
- open question on whether `--rebuild` is required in `install_cuda_and_libraries.sh`
- easystack file that specifies what to strip down to runtime stuff + install under `host_injections` should be shipped with EESSI
- need to update `install_cuda_and_libraries.sh` script to load a specific EasyBuild module version before loading EESSI-extend module
- path to easystack file(s) should be hardcoded in `install_cuda_and_libraries.sh` script, not passed as an argument
- no need to use `--dry-run --rebuild` to figure out what needs to be installed, let EB figure it out based on modules in hidden path `.modules/all`
- enhance cuDNN easyblock to verify that EULA is accepted before installing it ([easyblocks PR #3473](https://github.com/easybuilders/easybuild-easyblocks/pull/3473))
- waLBerla v6.1 w/ CUDA 12.1.1 ([software-layer PR #780](https://github.com/EESSI/software-layer/pull/780))
- on hold/blocked PRs:
- TensorFlow v2.15.1 w/ CUDA 12.1.1 ([software-layer PR #717](https://github.com/EESSI/software-layer/pull/717)): requires `cuDNN`
- PyTorch v2.1.2 w/ CUDA 12.1.1 ([software-layer PR #718](https://github.com/EESSI/software-layer/pull/718)): requires `cuDNN`
## Discussion
- additional `accel/nvidia/common` subdirectory for CUDA, CUDA-Samples, OSU-MicroBenchmarks, cuDNN, ... ?
- useful when specific type of GPU is not supported yet (currently only A100)
- have to ensure fat installations of CUDA-Samples & OSU-MicroBenchmarks that can be used on any type of NVIDIA GPU
- older CUDA versions will not work for newer NVIDIA GPUs, that's a counter argument against introducing `/common`
- CUDA-Samples currently specifies `SMS`, so it is built for specific CUDA compute capabilities (+ corresponding PTX code, so it can be JIT-compiled on newer GPUs, but not older ones); see the sketch after this list
- fallback installations with support for a range of GPUs under `x86_64/generic`?
- best way would be to implement `nvcc` (& co) compiler wrappers in EasyBuild so we can easily inject the correct options (and remove any others added by whatever build system is being used)
- sanity check for CUDA compute capabilities being used [support issue #92](https://gitlab.com/eessi/support/-/issues/92)
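To illustrate the fat-binary idea from the discussion above: embedding SASS for several compute capabilities plus PTX for the newest one lets newer GPUs JIT-compile the PTX. This only shows the underlying `nvcc` mechanism; how EasyBuild translates `--cuda-compute-capabilities` into these flags may differ in detail.

```bash
# device code (SASS) for cc70/cc80/cc90, plus PTX for compute_90 so that
# newer GPUs can JIT-compile it; older GPUs are not covered
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_90,code=compute_90 \
     -o vectorAdd vectorAdd.cu
```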
## TODO
- GPU build node `[Kenneth]`
- GROMACS w/ CUDA
- see [easyconfigs PR #21549](https://github.com/easybuilders/easybuild-easyconfigs/pull/21549)
- no libfabric with CUDA dependency (doesn't exist yet)
- in EESSI, workaround could be to inject an OpenMPI that has support for CUDA-aware libfabric
- GROMACS docs ~~only~~ mostly mention UCX, not libfabric: https://manual.gromacs.org/current/install-guide/index.html#gpu-aware-mpi-support
- waLBerla w/ CUDA
- see [easyconfigs PR #21601](https://github.com/easybuilders/easybuild-easyconfigs/pull/21601)
## Next sync meeting
- Wed 16 Oct'24 15:00 CEST
---
# Sync meeting 20241002
attending: Kenneth, Pedro, Lara, Caspar, Thomas
## Status
- GPU installations in place for ESPResSo, LAMMPS, CUDA-Samples, OSU-Microbenchmarks
## Merged PRs
- CUDA 12.1.1 (**rebuild**) ([PR #720](https://github.com/EESSI/software-layer/pull/720))
- UCX-CUDA v1.14.1 (**rebuild**) ([PR #719](https://github.com/EESSI/software-layer/pull/719))
- CUDA-Samples v12.1 (**rebuild**) ([PR #715](https://github.com/EESSI/software-layer/pull/715))
- NCCL 2.18.3 (**rebuild**) ([PR #741](https://github.com/EESSI/software-layer/pull/741))
- UCC-CUDA ([PR #750](https://github.com/EESSI/software-layer/pull/750))
- OSU-Microbenchmarks v7.2 w/ CUDA 12.1.1 (**rebuild**) ([PR #716](https://github.com/EESSI/software-layer/pull/716))
- LAMMPS 2Aug2023 CUDA 12.1.1 ([PR #711](https://github.com/EESSI/software-layer/pull/711))
- ESPResSo w/ CUDA ([PR #748](https://github.com/EESSI/software-layer/pull/748))
- Take accelerator builds into account in CI that checks for missing installations ([PR #753](https://github.com/EESSI/software-layer/pull/753))
## Open PRs
- enhance archdetect to support detection of NVIDIA GPUs + using that in EESSI init script ([PR #767](https://github.com/EESSI/software-layer/pull/767))
- [Thomas] cuDNN 8.9.2.26 ([PR #581](https://github.com/EESSI/software-layer/pull/581))
- some time spent on it, but no new PR yet
- this PR does a bunch of things:
- install cuDNN correctly in EESSI (only files we're allowed to re-distribute)
- see updated EasyBuild hooks
- use easystack file for CUDA + cuDNN
- generic script to install CUDA + cuDNN under `host_injections` and unbreak the symlinks
- required when installing stuff on top of cuDNN (like TensorFlow)
- update Lmod hooks to also throw error when cuDNN module is loaded without symlinks being unbroken first via `host_injections`
- => separate PR for this
- update EasyBuild hooks to downgrade cuDNN to build-only dependency
- Thomas will take a look this week, Caspar+Kenneth can continue the effort next week
- [Thomas/Caspar] PyTorch v2.1.2 w/ CUDA 12.1.1 ([PR #586](https://github.com/EESSI/software-layer/pull/586) or [PR #718](https://github.com/EESSI/software-layer/pull/718))
- requires ~~UCX-CUDA~~ + cuDNN
- [Thomas/Caspar] TensorFlow v2.15.1 w/ CUDA 12.1.1 ([PR #717](https://github.com/EESSI/software-layer/pull/717))
- requires ~~CUDA + NCCL~~ + cuDNN
## TODO
- [Kenneth] proper GPU node
- will require bot + software-layer script changes too, to support something like `bot build arch=zen3:cc80`
- or implement Sam's idea to be able to provide additional options
- more (MultiXscale) software w/ GPU support:
- [Pedro] waLBerla
- also move to newer toolchain
- [Kenneth] PyStencils
- [Davide?] MetalWalls (=> Kenneth?)
- [Kenneth/Bob?] GROMACS (=> Kenneth?)
- see also CPU-only GROMACS [PR #709](https://github.com/EESSI/software-layer/pull/709)
- for GROMACS, we would definitely/ideally need a GPU node
- more `accel/*` targets?
- currently:
- `zen2` + `cc80` (A100) (Vega, Deucalion)
- `zen3` + `cc80` (A100) (Karolina)
- Snellius: Ice Lake + A100 (cc80), Zen4 + H100 (cc90)
- HPC-UGent Tier-2 joltik: Intel Cascade Lake (`intel/skylake_avx512`) + V100 (cc70)
- HPC-UGent Tier-2 donphan: Intel Cascade Lake (`intel/skylake_avx512`) + A2 (cc86)
- RUG: `intel/icelake` + A100 (cc80) (may be in too short supply to use), `intel/skylake_avx512` + V100 (cc70)
- => for now, stick to `zen2`+`cc80` and `zen3`+`cc80`, make sure we have a GPU build node first
- clean up GPU installations in CPU path of OSU-Microbenchmarks + CUDA-Samples
- produce a warning first before removing
- or also install these for `zen4` (just to keep things consistent)
- provide `{x86_64,aarch64}/generic` builds of these, and pick up on them automatically?
- could affect CPU-only optimized builds
- build OSU-Microbenchmarks + CUDA-Samples in `accel/nvidia/generic` (fat binaries)?
- there was a problem with building for multiple CUDA compute capabilities, but fixed by Caspar in [easyconfigs PR #21031](https://github.com/easybuilders/easybuild-easyconfigs/pull/21031)
- get GPU builds in software overview @ https://www.eessi.io/docs/available_software
- separate table on software page, specific to GPU?
## Next meeting
- (Thomas is on leave 7-11 Oct)
---
# Sync meeting 20240925
## Merged PRs
- take into account accelerator target when configuring EasyBuild ([PR #710](https://github.com/EESSI/software-layer/pull/710))
- update `EESSI-remove-software.sh` script to support removal of GPU installations (in `accel/*`) ([PR #721](https://github.com/EESSI/software-layer/pull/721))
- Avoid `create_tarball.sh` exiting on non-matching grep ([PR #723](https://github.com/EESSI/software-layer/pull/723))
- Make the `bot/bot-check.sh` script correctly pick up builds for accelerators ([PR #726](https://github.com/EESSI/software-layer/pull/726))
## Open PRs
- CUDA 12.1.1 (**rebuild**) ([PR #720](https://github.com/EESSI/software-layer/pull/720))
- requires changes to EasyBuild hook for CUDA in [PR #735](https://github.com/EESSI/software-layer/pull/735)
- built for `x86_64/amd/zen2/accel/nvidia/cc80` + `x86_64/amd/zen3/accel/nvidia/cc80`
- deploy triggered, will be ingested soon
- [Kenneth] UCX-CUDA v1.14.1 (**rebuild**) ([PR #719](https://github.com/EESSI/software-layer/pull/719))
- requires CUDA
- [Thomas] cuDNN 8.9.2.26 ([PR #581](https://github.com/EESSI/software-layer/pull/581))
- requires CUDA
- [Caspar] CUDA-Samples v12.1 (**rebuild**) ([PR #715](https://github.com/EESSI/software-layer/pull/715))
- requires CUDA
- [Kenneth] NCCL 2.18.3 (**rebuild**) ([PR #741](https://github.com/EESSI/software-layer/pull/741))
- requires CUDA + UCX-CUDA (+ ~~EasyBuild v4.9.4, see [PR #740](https://github.com/EESSI/software-layer/pull/740)~~)
- [Lara] UCC-CUDA ([PR #750](https://github.com/EESSI/software-layer/pull/750))
- requires CUDA + UCX-CUDA + NCCL
- [Lara] OSU-Microbenchmarks v7.2 w/ CUDA 12.1.1 (**rebuild**) ([PR #716](https://github.com/EESSI/software-layer/pull/716))
- requires CUDA + UCX-CUDA + UCC-CUDA
- [Thomas/Caspar] PyTorch v2.1.2 w/ CUDA 12.1.1 ([PR #586](https://github.com/EESSI/software-layer/pull/586) or [PR #718](https://github.com/EESSI/software-layer/pull/718))
- requires UCX-CUDA + cuDNN
- [Thomas/Caspar] TensorFlow v2.15.1 w/ CUDA 12.1.1 ([PR #717](https://github.com/EESSI/software-layer/pull/717))
- requires CUDA + NCCL + cuDNN
- [Kenneth/Lara] LAMMPS 2Aug2023 CUDA 12.1.1 ([PR #711](https://github.com/EESSI/software-layer/pull/711))
- requires CUDA + UCX-CUDA + NCCL
- [Pedro] ESPResSo w/ CUDA ([PR #748](https://github.com/EESSI/software-layer/pull/748))
- requires only CUDA
- [easyconfig PR #21440](https://github.com/easybuilders/easybuild-easyconfigs/pull/21440)
## TODO
- `create_tarball.sh` might not include Lmod cache and config files correctly for GPU builds ([issue #722](https://github.com/EESSI/software-layer/issues/722))
- we need to determine whether we'll need separate Lmod hooks
- [Bob] make Lmod cache aware of modules in `/accel/*`
- separate cache for each `accel/*` subdirectory + extra `lmodrc.lua` that is added to `$LMOD_RC`
- can start looking into this as soon as `CUDA-samples` is deployed
- [Bob] make CI that checks for missing modules aware of GPU installations
- check for each CPU+GPU combo separately for now
- [Alan?] add NVIDIA GPU detection to `archdetect` + make sure init script & EESSI module uses it
- add `gpupath` function to `archdetect` script
- only exact match for `cc*` for now, determine via `nvidia-smi`, see also https://gitlab.com/eessi/support/-/issues/59#note_1909332878
## Notes
- (Richard) libfabric + CUDA
- (Caspar) we should add technical information on how EESSI is built
- incl. how/why we strip down CUDA installations
- starting point could be (outdated) ["Behind the Scenes" talk](https://github.com/EESSI/docs/blob/main/talks/20210119_EESSI_behind_the_scenes/EESSI-behind-the-scenes-20210119.pdf)
## Next meeting
- Tue 1 Oct 2024 09:00 CEST for GPU tiger team
- 10:00 CEST for dev.eessi.io
---
## Sync meeting 20240918
attending: Kenneth, Lara, Alan, Richard, Pedro, Caspar, Thomas
### Done
- pass down `accelerator` filter in `build` command for bot into `job.cfg` file ([merged bot PR #280](https://github.com/EESSI/eessi-bot-software-layer/pull/280))
```ini
# from working directory for build job:
$ cat cfg/job.cfg
...
[architecture]
software_subdir = x86_64/amd/zen2
os_type = linux
accelerator = nvidia/cc80
```
- mark `with_accelerator` setting in `[submitted_job_comments]` section as required ([merged bot PR #282](https://github.com/EESSI/eessi-bot-software-layer/pull/282))
### Open PRs
- take into account accelerator target when configuring EasyBuild ([software-layer PR #710](https://github.com/EESSI/software-layer/pull/710))
- tested via [PR to boegel's fork with `beagle-lib`](https://github.com/boegel/software-layer/pull/33)
- seems to be working, produces `2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all/beagle-lib/4.0.1-GCC-12.3.0-CUDA-12.1.1.lua`
- (CPU-only) GROMACS 2024.3 ([software-layer PR #709](https://github.com/EESSI/software-layer/pull/709))
### Next steps
- [Thomas,Kenneth] new bot release (v0.5.0?), so we can build GPU software with production bots @ AWS+Azure
- [Kenneth] update `bot/check-build.sh` to be aware of `/accel/` paths
- just so reporting of files included in tarball to deploy looks nicer
- => should be done in a separate PR
- [???] verify `accelerator` value so only allowed values (`cc80`, `cc90`) can be used
- mainly to avoid silly mistakes that would be easy to overlook
- hard mapping of `cc80` to `8.0` instead of the hacky approach with `sed` in `configure_easybuild` to define `$EASYBUILD_CUDA_COMPUTE_CAPABILITIES`? (see the sketch after this list)
- [???] verify CPU + accelerator target combos, only select combos should be used:
- separate small (Python) script to call from `bot/build.sh`?
- valid combos for EuroHPC systems:
- `x86_64/amd/zen2/accel/nvidia/cc80` (AMD Rome + A100 => Vega + Meluxina + Deucalion) => **MAIN TARGET**
- should be possible to find as AWS instance
- could use `arch:x86_64/amd/zen2+cc80`
- this would also work for Hortense @ UGent, Snellius @ SURF, ...
- `x86_64/amd/zen3/accel/nvidia/cc80` (AMD Milan + A100 => Karolina)
- `x86_64/intel/icelake/accel/nvidia/cc80` (Intel Ice Lake + A100 => Leonardo)
- or `/skylake_avx512/` since we don't have `icelake` CPU target?
- `x86_64/intel/sapphirerapids/accel/nvidia/cc90` (Intel Sapphire Rapids + H100 => MareNostrum 5)
- or `/skylake_avx512/` (for now) since no `sapphirerapids` CPU target (yet)?
- `x86_64/amd/zen3/accel/amd/gfx90a` (AMD Milan + MI250X => LUMI)
- `aarch64/nvidia/grace/accel/nvidia/cc90` (NVIDIA Grace + H100 => JUPITER)
- => should be `/neoverse_v2` instead of `/grace/`
- (no GPUs in Discoverer)
- [???] refuse to install CPU-only modules under `accel/*`
- missing CPU-only deps must be installed through separate `software-layer` PR
- [Alan?] verify CPU(+GPU) target at start of build job, to catch misconfigurations
- *** [Kenneth/Thomas/Alan] build nodes with GPU (relevant for test suites)
- needs enhancement in the bot so build jobs can be sent to GPU nodes
- `bot build ... node:zen2-cc80` which adds additional `sbatch` arguments
- or via `bot build ... arch:zen2+cc80`
- easy to add GPU partition to Magic Castle cluster
- also needs node image with GPU drivers
- Magic Castle auto-installs GPU drivers when it sees a GPU, we probably don't want that
- [Caspar] we need a way to pick a different ReFrame configuration file based on having a GPU or not
- [Alan?] let bot or GitHub Actions auto-report missing modules for PRs to `software-layer` PRs
- build & deploy GPU software
- [Caspar] CUDA-samples (moved from CPU-only install path)
- [Caspar] OSU-Microbenchmarks (moved from CPU-only install path)
- also for `{x86_64,aarch64}/generic/accel/nvidia/generic` as fallback?
- [Caspar] UCX-CUDA (moved from CPU-only install path?)
- [Caspar] CUDA itself (moved from CPU-only install path?)
- [Caspar] NCCL (moved from CPU-only install path?)
- [Pedro] ESPResSo
- OK for `foss/2023a` (not for `foss/2023b`)
- PR for easyconfigs repo + software-layer PR to test build for `accel:nvidia/cc80`
- to dedicated easystack file: `eessi-2023.06-eb-4.9.3-2023a-CUDA.yml`
- [Bob?] GROMACS
- [Lara] LAMMPS
- using existing `LAMMPS-2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1.eb`
- TODO: software-layer PR + test `accel` build
- [Caspar] TensorFlow
- [Caspar/Thomas] PyTorch
- only build for now, don't deploy yet; ideally we use GPU nodes so GPU tests are also run
- to test the GPU software installations, we also need the `host_injections` stuff in place
- see https://www.eessi.io/docs/site_specific_config/gpu
- [???] Figure out if `.lmod/XYZ` files should be shipped in accel prefix, see [issue #722](https://github.com/EESSI/software-layer/issues/722)
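A minimal sketch of the accelerator validation and `cc80` -> `8.0` mapping mentioned above (see the sketch references in the "verify `accelerator` value" items); the allow-lists are illustrative, not the agreed-upon set:

```bash
# hedged sketch: validate the accelerator value and the CPU+GPU combo,
# and derive $EASYBUILD_CUDA_COMPUTE_CAPABILITIES without the sed hack
accelerator="nvidia/cc80"      # e.g. taken from cfg/job.cfg
cpu_target="x86_64/amd/zen2"   # e.g. $EESSI_SOFTWARE_SUBDIR_OVERRIDE

# only accept known accelerator values
case "${accelerator}" in
    nvidia/cc80|nvidia/cc90) ;;
    *) echo "ERROR: unsupported accelerator '${accelerator}'" >&2; exit 1 ;;
esac

# hard mapping of ccXY to X.Y (works for two-digit ccXY values)
cc="${accelerator##*cc}"
export EASYBUILD_CUDA_COMPUTE_CAPABILITIES="${cc:0:1}.${cc:1}"

# only allow select CPU + accelerator combos (illustrative subset)
case "${cpu_target}+${accelerator}" in
    x86_64/amd/zen2+nvidia/cc80|x86_64/amd/zen3+nvidia/cc80) ;;
    *) echo "ERROR: unsupported combo '${cpu_target}+${accelerator}'" >&2; exit 1 ;;
esac
```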
### Notes
- is OpenMPI in EESSI CUDA-aware?
- yes, via UCX-CUDA
### Next sync meeting
- Wed 25 Sept 11:00 CEST (Kenneth, Pedro, Alan, Thomas, Richard?)
- clashes with [AI-Friendly EuroHPC Systems EuroHPC session](https://eurohpc-ju.europa.eu/news-events/events/ai-friendly-eurohpc-systems-virtual-2024-09-25_en) (09:30-12:45 CEST)
- can we reschedule to 13:00 CEST (GPU tiger team) + 14:00 CEST (dev.eessi.io tiger team)?
- OK
- (17:00 CEST is bi-weekly EasyBuild conf call, KH needs time to prepare it 15:00-17:00)
---
## Kickoff meeting 20240913
attending: Alan, Thomas, Richard, Pedro, Bob, Kenneth
### Goal
- more software installations in EESSI that support NVIDIA GPUs
- `accel/*` subdirectories (see [support issue #59](https://gitlab.com/eessi/support/-/issues/59#note_1924348179))
### Plan
#### software-layer scripts
- enhance scripts to support building for a specified accelerator (see `job.cfg` file)
- similar to how `$EESSI_SOFTWARE_SUBDIR_OVERRIDE` is defined in `bot/build.sh`
- check whether `accelerator` value is "valid" + whether the combo of CPU target + accelerator is supported
- have to modify EasyBuild hooks to either:
- "select" correct software installation prefix for each installation;
- maybe opens the door to making mistakes easily, with stuff installed in the wrong place;
- refuse installing non-CUDA stuff in `/accel/` path;
- probably less error-prone, so let's do this;
- map `accel=nvidia/cc80` to:
- "`module use $EASYBUILD_INSTALLPATH/modules/all`" to make CPU module available
- `$EASYBUILD_INSTALLPATH=${EASYBUILD_INSTALLPATH}/accel/nvidia/cc80`
- `$EASYBUILD_CUDA_COMPUTE_CAPABILITIES=8.0`
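A minimal sketch of the mapping just described, assuming it would be done in `bot/build.sh` or an EasyBuild hook (variable handling is illustrative):

```bash
accel="nvidia/cc80"

# keep the CPU-only modules available as dependencies
module use "${EASYBUILD_INSTALLPATH}/modules/all"

# install into the accelerator-specific prefix
export EASYBUILD_INSTALLPATH="${EASYBUILD_INSTALLPATH}/accel/${accel}"

# derive "8.0" from "cc80"
cc="${accel##*cc}"
export EASYBUILD_CUDA_COMPUTE_CAPABILITIES="${cc:0:1}.${cc:1}"
```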
#### archdetect
- auto-detect NVIDIA GPU + compute capability
- via `nvidia-smi --query-gpu=compute_cap`
- to let init script also do `module use .../accel/nvidia/cc80`
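A hedged sketch of the detection plus a possible fallback from a minor compute capability (e.g. cc86/cc89) to the corresponding major one; the EESSI prefix layout used here is an assumption:

```bash
if command -v nvidia-smi &> /dev/null; then
    compute_cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
    cc="cc${compute_cap//./}"           # e.g. 8.6 -> cc86
    cc_major="cc${compute_cap%%.*}0"    # e.g. 8.6 -> cc80
    for candidate in "${cc}" "${cc_major}"; do
        accel_dir="${EESSI_SOFTWARE_PATH}/accel/nvidia/${candidate}"
        if [ -d "${accel_dir}" ]; then
            module use "${accel_dir}/modules/all"
            break
        fi
    done
fi
```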
#### bot
- see [open PR #280](https://github.com/EESSI/eessi-bot-software-layer/pull/280) to add `accelerator=xxx` to `job.cfg` file
- new bot release after merging PR #280
#### software
- x CUDA-samples (reinstall)
- NCCL (reinstall + add license to installation)
- UCC-CUDA (reinstall)
- x UCX-CUDA (reinstall)
- x OSU Microbenchmarks (reinstall)
- ESPResSo
- no easyconfigs yet for ESPResSo with GPU support => task
- GROMACS
- LAMMPS
- PyTorch
- TensorFlow
#### CPU/GPU combo targets
- AMD Rome + A100 (Vega) => `amd/zen2/accel/nvidia/cc80`
- AMD Milan + A100 (Karolina) => `amd/zen3/accel/nvidia/cc80`
- Deucalion: Rome? + NVIDIA => `???`
- Meluxina: ???
- JUPITER: NVIDIA Grace + H100 => `aarch64/grace/accel/nvidia/cc??`
#### Hardware for build nodes
- manual test build on target hardware via `EESSI-extend`
- or by setting up bot in your own account and make test PRs to your fork of `software-layer` repo
- which AWS instance for AMD Rome + A100?
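For the manual test build via `EESSI-extend` mentioned above, a hedged example (module version and easyconfig name are assumptions based on these notes):

```bash
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load EESSI-extend/2023.06-easybuild
# build CUDA-Samples for the A100 (cc80) on the target hardware
eb --cuda-compute-capabilities=8.0 --robot CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb
```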
#### Ideas/future work
- script that reports contents of generated tarball may need some work (Bob?)
- also build for `accel/nvidia/generic` (or very low compute capability like `/cc60`)
### Tasks
(by next meeting)
- [Pedro] easyconfig for ESPResSo with CUDA support
- [Kenneth] re-review + merge of [bot PR #280](https://github.com/EESSI/eessi-bot-software-layer/pull/280)
- [Kenneth] update software-layer scripts
- [Alan] let archdetect detect GPU compute capability + produce list of all valid combos in sensible order
- [Alan] enhance init script/module file to implement auto-detect
### Next meetings
- Wed 18 Sept 11:00 CEST (Kenneth, Pedro, Alan, Richard?)
- Wed 25 Sept 11:00 CEST (Kenneth, Pedro, Alan, Thomas, Richard?)