# Tiger team - GPU software in EESSI

Tiger team lead: Kenneth

# Sync meeting 20250507

attending:
- Richard, Thomas (UiB)
- Caspar (SURF)
- Lara (UGent)

## Agenda

- action points for the weeks since last meeting:
    - Caspar: CUDA sanity check in EasyBuild, access to service account @ SURF, help with `arch_target_map` in bot
        - CUDA sanity check (https://github.com/easybuilders/easybuild-framework/pull/4692)
        - access to service account @ SURF: "everybody" (works for Kenneth, Lara, Thomas; unclear for Richard)
            - access means access to files, not access to the service account
            - https://gitlab.com/eessi/support/-/wikis/GPU-SURF-cluster
        - help with `arch_target_map` in bot (draft https://github.com/EESSI/eessi-bot-software-layer/pull/312)
    - Lara: full matrix test builds for CUDA-Samples & co
        - did not get to it
        - postpone to next meeting
        - Richard made some progress on this (see next item)
    - Richard: full matrix test builds for CUDA-Samples & co
        - CUDA + a bunch of CUDA packages
        - cc70: https://github.com/EESSI/software-layer/pull/1030
            - some failure building CUDA-Samples
            - options:
                - don't build CUDA-Samples, or not this version --> not really a good option
                - build a newer version of CUDA-Samples and keep the older version, plus provide a module file that points users to the newer version
        - cc80: https://github.com/EESSI/software-layer/pull/1076
        - cc90: https://github.com/EESSI/software-layer/pull/1077
        - go ahead building CUDA, UCX, UCC and OSU without CUDA-Samples
        - build newer version of CUDA-Samples for all ccXX
        - build older version of CUDA-Samples for all ccXX, where cc70 modules have a warning/info that it didn't build and users can use a newer version
        - [Kenneth] open an issue to update the contribution policy for what is acceptable if a software (version) cannot be built for some specific hardware configurations (e.g., CUDA-Samples 12.1 for cc70)
    - Thomas: `arch_target_map` in bot, checking in bot
        - looked briefly at Caspar's draft PR
    - PyTorch 2.1.2 with CUDA
        - add ReFrame tests to the list of tests that are run
        - zen3+cc80 (UGent) built
        - zen4+cc90 (Snellius) failed (timeout? in testing)
        - grace+cc90 (JURECA) "failed" with ~20 tests instead of ~10 -> ask Flamefire if anything is concerning
        - cc70
            - create known issue and go ahead
    - Kenneth: vacation (11-21 April), butterfly mode (help where needed)
        - very little
    - Alan: EasyBuild hook + implement fallback from cc89 to cc80 in init script
        - check with Alan
- bot improvement
    - UGent setup: `scontrol` needs to know the cluster (a wrapper could get the job's cluster via `squeue`, then use the actual `scontrol`; see the sketch below)
- LAMMPS on NVIDIA Grace/Hopper
    - didn't build on JURECA nodes
    - Richard talks to Lara about it
- try PyTorch 2.6 or even 2.7
    - work with Flamefire
- next meeting: "doodle" for after the last webinar, not before June 5
    - for Kenneth: not before 5th of June
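A minimal sketch of the `scontrol` wrapper idea from the bot-improvement item above, assuming the job ID is enough to look up the owning cluster first; exact `squeue` output handling depends on the Slurm setup, so treat this as illustrative:

```bash
#!/bin/bash
# scontrol wrapper for a multi-cluster setup: ask squeue which cluster owns
# the job, then forward to the real scontrol with --clusters set accordingly.
job_id="$1"
cluster=$(squeue --clusters=all --jobs="${job_id}" --noheader --Format=cluster | head -n1 | tr -d '[:space:]')
exec scontrol --clusters="${cluster}" show job "${job_id}"
```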
# Sync meeting 20250409

attending:
- Thomas, Richard (UiB)
- Lara, Kenneth (UGent)
- Caspar (SURF)
- Alan (CECAM/UB)

## Agenda

### Prepared notes (Thomas)

- Formalize supported CUDA compute capability & CPU architecture combinations
    - Decide on which CPU+ccXY combinations we want to build for: please, everyone, read up on https://gitlab.com/eessi/support/-/issues/142 and make up your mind
        - so far we have CUDA 12.1.1 (with foss/2023a) and CUDA 12.4.0 (with foss/2023b)
        - unclear why CUDA 11 is discussed in the issue?
        - Blackwell requires CUDA 12.8 - have to keep that in mind for EESSI/2025.04 + toolchains bundled with that
        - unclear what "auto-upgrade" means, not clear why we need that
    - proposal for supported CPU+ccXY combinations (regardless of toolchain?)
        - EESSI/2023.06: any-CPU+cc[70,80,90]
        - EESSI/2025.04: any-CPU+cc[70,80,90,100]
        - does CUDA 12.8 still fully support cc70? just curious
        - guarantees that cc70 builds work on any cc7x device, cc80 builds work on any cc8x device, ...
        - could/should change archdetect so it only returns the major version (however, we then need to change that if we change our mind and build for minor cc-versions; maybe better to keep archdetect as is and decide which build to use where archdetect is used?)
        - PTX sounds interesting, but maybe we look into that later?
    - proposal for build workflow/strategy
        - first build on nodes with GPUs
        - if successful, build other combinations
        - add note to module file to inform whether the package has been built on GPU/ccXY or not
        - log / document which combinations are supported natively, and which are cross-compiled
        - we should have a strategy in case something doesn't work
- (CUDA sanity check)
- Other topics (bot support, access to build nodes with GPUs)

### Meeting notes

- things to decide
    - which CUDA versions per EESSI version
        - CUDA 12.1.1 (2023a) + 12.4.0 (2023b) in EESSI 2023.06
        - CUDA 12.8.0 (Blackwell support, 2024a) in EESSI 2025.0X
    - which NVIDIA GPU targets (ccXY) per EESSI version (+ toolchain?)
        - cc80 (A100) would be sufficient to also cover L4, L40, A40, etc.
        - major versions (cc70, cc80, cc90) for all supported CPU targets
        - specific minor versions (cc89) only with CPU targets that "make sense"
        - cover combos available in AWS/Azure
            - like cc75 (T4) + Cascade Lake => AWS G4 instances (https://aws.amazon.com/ec2/instance-types/g4/)
    - which CPU+GPU target combos
- wait for CUDA sanity check feature in EasyBuild
    - requires EasyBuild 5.0.X
    - see https://github.com/easybuilders/easybuild-framework/pull/4692
    - can also verify afterwards with `--sanity-check-only`
    - needs extra option to error out in case the CUDA sanity check fails
- GPU build nodes vs "cross-compiling"
    - procedure, for each PR:
        - first test builds on GPU build nodes
            - @UGent: zen3+cc80 (service account)
                - can also do cascadelake+cc70 and zen4+cc90
            - @SURF: zen4+cc90 (service account)
                - can also do icelake+cc90
            - @JSC: grace+cc90 (service account, when available)
            - @AWS: cascadelake+cc75 (G4 instances)
                - only once cascadelake CPU target is fully supported
        - then build for all major GPU targets (ccX0), cross-compiling
        - then build for specific CPU+GPU combos, cross-compiling
            - initially, do only a subset:
                - cascadelake+cc70 (UGent V100)
                - zen2+cc80 (A100 @ Vega, Karolina, ...)
                - zen3+cc80 (UGent A100)
                - zen4+cc90 (SURF + UGent H100)
                - nvidia/grace+cc90 (JSC)
                - icelake+cc80 (SURF) -- when CPU target is available
- how to deal with explosion of bot build jobs + ensure everything is there
    - + also improve ingestion
    - add support for wildcards for CPU/GPU targets so all targets get built automatically
    - `bot: check-deploy` => each bot verifies whether the PR is ready to deploy
        - only set label if each bot replies positively
        - eventually also let 1 bot verify overall status?
- start with CUDA-Samples, OSU, NCCL, GROMACS, LAMMPS, ESPResSo, ...
    - (try) full matrix for CUDA-Samples + OSU + required dependencies
        - to fix what we broke when removing CUDA & co from CPU-only locations
        - `CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1`
        - `OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1`
        - `OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0`
    - subset for apps like GROMACS
    - dedicated issue to keep track of progress
- archdetect needs to be enhanced to make it aware of the major GPU target version
    - should actually be done by the init script (cc89 -> cc89+cc80)
- EasyBuild hook should know when it's expected to have a GPU available (see the sketch after these notes)
    - based on whether `nvidia-smi` is available
    - if `$EESSI_ACCELERATOR_TARGET` is specified, verify whether it matches what `nvidia-smi` reports
    - to enforce testing + inject info into the generated module file on whether GPU testing was done or not
- make sure that no CPU-only modules get installed in `accel` subdirectory?
    - enforce in `EESSI-extend` only for non-project/non-user installations
- action points for the coming weeks:
    - Caspar: CUDA sanity check in EasyBuild, access to service account @ SURF, help with `arch_target_map` in bot
    - Lara: full matrix test builds for CUDA-Samples & co
    - Thomas: `arch_target_map` in bot, checking in bot
    - Richard: full matrix test builds for CUDA-Samples & co
    - Kenneth: vacation (11-21 April), butterfly mode (help where needed)
    - Alan: EasyBuild hook + implement fallback from cc89 to cc80 in init script
- next meeting
    - Wed 7 May 2025 13:00 CEST: OK for all
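A minimal sketch of the GPU-presence check described above for the EasyBuild hook / build script, assuming accelerator targets are spelled `nvidia/ccXY` as elsewhere in these notes (error handling and the `gpu_testing` variable are illustrative):

```bash
if command -v nvidia-smi &>/dev/null; then
    # nvidia-smi reports e.g. "8.0"; map that to the EESSI-style "nvidia/cc80"
    detected="nvidia/cc$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1 | tr -d '.')"
    if [ -n "${EESSI_ACCELERATOR_TARGET}" ] && [ "${EESSI_ACCELERATOR_TARGET}" != "${detected}" ]; then
        echo "ERROR: building for ${EESSI_ACCELERATOR_TARGET}, but this node has ${detected}" >&2
        exit 1
    fi
    gpu_testing='yes'  # record in the generated module file that GPU testing was done
else
    gpu_testing='no'   # cross-compiling: no GPU available, sanity checks limited
fi
```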
---

# Sync meeting 20250114

attending: Richard, Lara, Kenneth, Alan, Bob, Caspar, Thomas

## Agenda

- status update
    - build for specific CPU microarch + CUDA CC (zen2 + cc80, zen3 + cc80)
    - some software is tricky to build (PyTorch, TensorFlow, ...)
- next steps
    - in the process of setting up bots on clusters with GPUs (UGent, SURF, ...)
    - keep focus on A100+Zen2/Zen3 for now (covers most of the EuroHPC systems)
        - A100+Zen3 as GPU build node
        - A100+Zen2 on non-GPU build node (and then test on A100+Zen3 node)
    - also start supporting Intel Ice Lake as extra CPU generation between Skylake & Sapphire Rapids?
        - Intel Ice Lake for SURF A100 nodes + Leonardo GPU partition
- general discussion
    - not feasible to really use GPU nodes for all combinations of CPU+GPU
        - BUT do at least 1 (?) build on a GPU node for each software package, also for each GPU generation
    - for the EESSI service account at UGent, supported GPUs:
        - V100 (+ Intel Cascade Lake)
        - A100 (+ Zen3, also + Zen2 on Tier-1 system)
        - H100 (+ Zen4)
    - would be helpful if the bot knows what the "baseline" is for testing GPU builds, like `bot: build arch=gpu-baseline`
        - as a shorthand for the build command for V100+A100+H100
        - would be useful for CPUs too
    - way of ensuring that all required builds have been done
        - some CI or bot checks
    - build for generic+cc80 on CPU-only nodes (e.g., on AWS), may be tested on more modern CPUs that have A100 or newer
        - needs changes to bot / overall test setup:
            1. build on resource A, but only run minimal tests, if any
            2. upload tarball to some S3
            3. run tests on resource B
- software-layer
    - stacks for new architectures: GH200 (OpenMPI/MPICH)
        - important for Olivia (NRIS/Sigma2, Norway), JUPITER
        - could run build bot at JSC
        - CUDA 12.0 might be the minimum version
        - sync meeting with JSC (Sebastian) to understand status of builds for Grace?
    - need to re-generate module files for CUDA module file
        - see [easyblocks PR #3516](https://github.com/easybuilders/easybuild-easyblocks/pull/3516)
    - PyTorch w/ CUDA
        - handful of failing tests, maybe the usual suspects?
        - AP TR: send list to Kenneth
    - TensorFlow w/ CUDA
        - host stuff leaking into build of Bazel/TensorFlow, leading to glibc compat issues
        - maybe [easyblocks PR #3486](https://github.com/easybuilders/easybuild-easyblocks/pull/3486) is relevant here
- AMD GPUs
    - check if support is needed for deliverables in June
    - good connection to AMD
    - currently only relevant for LUMI
    - toolchain issue
    - detection of libraries, drivers (how different is it from what we do for NVIDIA?)
- sanity check for compute capabilities: check if device code in builds is what the build was intended for (see the sketch after these notes)
    - add to `bot/check-build.sh` script
    - see [support issue #92](https://gitlab.com/eessi/support/-/issues/92) in GitLab
- EESSI test suite
    - some issue with test with locally deployed bot: https://gitlab.com/eessi/support/-/issues/114
    - what GPU tests do we already have?
- eessi-bot-software-layer
    - support for `node` filter (to send job to GPU nodes)
    - change way jobs are created (to avoid that many jobs are created by accident)
    - think about baseline feature
    - think about letting ReFrame itself schedule test jobs
    - add support for silencing the bot :-)
- next meetings: schedule via Slack
    - AMD:
    - tiger-NVIDIA
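A possible shape for the compute-capability sanity check mentioned above (support issue #92), using `cuobjdump` from the CUDA toolkit to list the device code embedded in an installed binary; the expected value and argument handling are illustrative:

```bash
# verify that a binary contains device code for the intended build target
expected="sm_80"   # e.g. derived from the accel target nvidia/cc80
binary="$1"
# cuobjdump lists the embedded ELF images, one per device architecture
found=$(cuobjdump --list-elf "${binary}" 2>/dev/null | grep -o 'sm_[0-9]*' | sort -u)
if ! grep -qx "${expected}" <<< "${found}"; then
    echo "ERROR: ${binary} has device code for [${found//$'\n'/ }], expected ${expected}" >&2
    exit 1
fi
```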
---

# Sync meeting 20241009

attending: Kenneth, Alan, Caspar, Richard, Lara, Pedro

## Merged PRs

- *(none)*

## Open PRs

- Allow Nvidia driver script to set `$LD_PRELOAD` ([software-layer PR #754](https://github.com/EESSI/software-layer/pull/754))
    - list of CUDA compat libraries is a lot shorter, are those actually enough?
    - we should try and get this merged as is, additional features can be implemented in follow-up PRs
    - Kenneth will review the PR
- enhance archdetect to support detection of NVIDIA GPUs + using that in EESSI init script ([software-layer PR #767](https://github.com/EESSI/software-layer/pull/767))
    - only covers init script, not `EESSI` module yet
    - being deployed right now
- cuDNN 8.9.2.26 w/ CUDA 12.1.1 (part 1) ([software-layer PR #772](https://github.com/EESSI/software-layer/pull/772))
    - minor open discussion point w.r.t. `--accept-cuda-eula` option; we should either make it more generic (`--accept-eula=CUDA,cuDNN`), or also add `--accept-cudnn-eula`
    - open question on whether `--rebuild` is required in `install_cuda_and_libraries.sh`
    - easystack file that specifies what to strip down to runtime stuff + install under `host_injections` should be shipped with EESSI
    - need to update `install_cuda_and_libraries.sh` script to load a specific EasyBuild module version before loading the EESSI-extend module
    - path to easystack file(s) should be hardcoded in `install_cuda_and_libraries.sh` script, not passed as an argument
    - no need to use `--dry-run --rebuild` to figure out what needs to be installed, let EasyBuild figure it out based on modules in hidden path `.modules/all`
- enhance cuDNN easyblock to verify that EULA is accepted before installing it ([easyblocks PR #3473](https://github.com/easybuilders/easybuild-easyblocks/pull/3473))
- waLBerla v6.1 w/ CUDA 12.1.1 ([software-layer PR #780](https://github.com/EESSI/software-layer/pull/780))
- on hold/blocked PRs:
    - TensorFlow v2.15.1 w/ CUDA 12.1.1 ([software-layer PR #717](https://github.com/EESSI/software-layer/pull/717)): requires `cuDNN`
    - PyTorch v2.1.2 w/ CUDA 12.1.1 ([software-layer PR #718](https://github.com/EESSI/software-layer/pull/718)): requires `cuDNN`

## Discussion

- additional `accel/nvidia/common` subdirectory for CUDA, CUDA-Samples, OSU-Micro-Benchmarks, cuDNN, ...?
    - useful when a specific type of GPU is not supported yet (currently only A100)
    - have to ensure fat installations of CUDA-Samples & OSU-Micro-Benchmarks that can be used on any type of NVIDIA GPU (see the sketch after these notes)
    - older CUDA versions will not work on newer NVIDIA GPUs, that's a counter-argument against introducing `/common`
    - CUDA-Samples currently specifies `SMS`, so it is built for specific CUDA compute capabilities (+ corresponding PTX code, so it can be JIT-compiled on newer GPUs, not on older GPUs)
    - fallback installations with support for a range of GPUs under `x86_64/generic`?
    - best way would be by implementing `nvcc` (& co) compiler wrappers in EasyBuild so we can easily inject the correct options (and remove any others being added by whatever build system is being used)
- sanity check for CUDA compute capabilities being used: [support issue #92](https://gitlab.com/eessi/support/-/issues/92)

## TODO

- GPU build node `[Kenneth]`
- GROMACS w/ CUDA
    - see [easyconfigs PR #21549](https://github.com/easybuilders/easybuild-easyconfigs/pull/21549)
    - no libfabric with CUDA dependency (doesn't exist yet)
        - in EESSI, a workaround could be to inject an OpenMPI that has support for CUDA-aware libfabric
        - GROMACS docs ~~only~~ mostly mention UCX, not libfabric: https://manual.gromacs.org/current/install-guide/index.html#gpu-aware-mpi-support
- waLBerla w/ CUDA
    - see [easyconfigs PR #21601](https://github.com/easybuilders/easybuild-easyconfigs/pull/21601)

## Next sync meeting

- Wed 16 Oct'24 15:00 CEST
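For reference, a fat build along the lines discussed above passes several `-gencode` flags to `nvcc`, plus PTX for the newest target so the binary can still be JIT-compiled on future GPUs; a minimal sketch (the source file name is hypothetical, the flags are standard `nvcc`):

```bash
# build device code for cc70/cc80/cc90 in one binary
nvcc -o saxpy saxpy.cu \
    -gencode arch=compute_70,code=sm_70 \
    -gencode arch=compute_80,code=sm_80 \
    -gencode arch=compute_90,code=sm_90 \
    -gencode arch=compute_90,code=compute_90   # also embed PTX for JIT on newer GPUs
```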
---

# Sync meeting 20241002

attending: Kenneth, Pedro, Lara, Caspar, Thomas

## Status

- GPU installations in place for ESPResSo, LAMMPS, CUDA-Samples, OSU-Micro-Benchmarks

## Merged PRs

- CUDA 12.1.1 (**rebuild**) ([PR #720](https://github.com/EESSI/software-layer/pull/720))
- UCX-CUDA v1.14.1 (**rebuild**) ([PR #719](https://github.com/EESSI/software-layer/pull/719))
- CUDA-Samples v12.1 (**rebuild**) ([PR #715](https://github.com/EESSI/software-layer/pull/715))
- NCCL 2.18.3 (**rebuild**) ([PR #741](https://github.com/EESSI/software-layer/pull/741))
- UCC-CUDA ([PR #750](https://github.com/EESSI/software-layer/pull/750))
- OSU-Micro-Benchmarks v7.2 w/ CUDA 12.1.1 (**rebuild**) ([PR #716](https://github.com/EESSI/software-layer/pull/716))
- LAMMPS 2Aug2023 w/ CUDA 12.1.1 ([PR #711](https://github.com/EESSI/software-layer/pull/711))
- ESPResSo w/ CUDA ([PR #748](https://github.com/EESSI/software-layer/pull/748))
- Take accelerator builds into account in CI that checks for missing installations ([PR #753](https://github.com/EESSI/software-layer/pull/753))

## Open PRs

- enhance archdetect to support detection of NVIDIA GPUs + using that in EESSI init script ([PR #767](https://github.com/EESSI/software-layer/pull/767))
- [Thomas] cuDNN 8.9.2.26 ([PR #581](https://github.com/EESSI/software-layer/pull/581))
    - some time spent on it, but no new PR yet
    - this PR does a bunch of things:
        - install cuDNN correctly in EESSI (only files we're allowed to redistribute)
            - see updated EasyBuild hooks
        - use easystack file for CUDA + cuDNN
        - generic script to install CUDA + cuDNN under `host_injections` and unbreak the symlinks (see the sketch after this section)
            - required when installing stuff on top of cuDNN (like TensorFlow)
        - update Lmod hooks to also throw an error when the cuDNN module is loaded without the symlinks being unbroken first via `host_injections`
            - => separate PR for this
        - update EasyBuild hooks to downgrade cuDNN to a build-only dependency
    - Thomas will take a look this week, Caspar+Kenneth can continue the effort next week
- [Thomas/Caspar] PyTorch v2.1.2 w/ CUDA 12.1.1 ([PR #586](https://github.com/EESSI/software-layer/pull/586) or [PR #718](https://github.com/EESSI/software-layer/pull/718))
    - requires ~~UCX-CUDA~~ + cuDNN
- [Thomas/Caspar] TensorFlow v2.15.1 w/ CUDA 12.1.1 ([PR #717](https://github.com/EESSI/software-layer/pull/717))
    - requires ~~CUDA + NCCL~~ + cuDNN
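A rough sketch of the "unbroken symlinks" check that the Lmod hook or install script could perform, assuming the EESSI layout where cuDNN ships as symlinks that must resolve into `host_injections` (the prefix path is illustrative; `$EESSI_SOFTWARE_PATH` is the standard EESSI variable):

```bash
# fail early if the cuDNN installation still contains dangling symlinks,
# i.e. the full installation was not yet provided under host_injections
cudnn_prefix="${EESSI_SOFTWARE_PATH}/software/cuDNN/8.9.2.26-CUDA-12.1.1"
if find "${cudnn_prefix}" -xtype l | grep -q .; then   # -xtype l: broken symlinks
    echo "ERROR: dangling symlinks under ${cudnn_prefix};" >&2
    echo "install cuDNN under host_injections first (see EESSI docs)" >&2
    exit 1
fi
```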
## TODO

- [Kenneth] proper GPU node
    - will require bot + software-layer script changes too, to support something like `bot build arch=zen3:cc80`
    - or implement Sam's idea to be able to provide additional options
- more (MultiXscale) software w/ GPU support:
    - [Pedro] waLBerla
        - also move to newer toolchain
    - [Kenneth] PyStencils
    - [Davide?] MetalWalls (=> Kenneth?)
    - [Kenneth/Bob?] GROMACS (=> Kenneth?)
        - see also CPU-only GROMACS [PR #709](https://github.com/EESSI/software-layer/pull/709)
        - for GROMACS, we would definitely/ideally need a GPU node
- more `accel/*` targets?
    - currently:
        - `zen2` + `cc80` (A100) (Vega, Deucalion)
        - `zen3` + `cc80` (A100) (Karolina)
    - Snellius: Ice Lake + A100 (cc80), Zen4 + H100 (cc90)
    - HPC-UGent Tier-2 joltik: Intel Cascade Lake (`intel/skylake_avx512`) + V100 (cc70)
    - HPC-UGent Tier-2 donphan: Intel Cascade Lake (`intel/skylake_avx512`) + A2 (cc86)
    - RUG: `intel/icelake` + A100 (cc80) (may be in too short supply to use), `intel/skylake_avx512` + V100 (cc70)
    - => for now, stick to `zen2`+`cc80` and `zen3`+`cc80`, make sure we have a GPU build node first
- clean up GPU installations in CPU path of OSU-Micro-Benchmarks + CUDA-Samples
    - produce a warning first before removing
    - or also install these for `zen4` (just to keep things consistent)
    - provide `{x86_64,aarch64}/generic` builds of these, and pick up on them automatically?
        - could affect CPU-only optimized builds
    - build OSU-Micro-Benchmarks + CUDA-Samples in `accel/nvidia/generic` (fat binaries)?
        - there was a problem with building for multiple CUDA compute capabilities, but fixed by Caspar in [easyconfigs PR #21031](https://github.com/easybuilders/easybuild-easyconfigs/pull/21031) (see the example below)
- get GPU builds in software overview @ https://www.eessi.io/docs/available_software
    - separate table on software page, specific to GPU?
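For the fat-binary builds mentioned above, EasyBuild can be asked for several CUDA compute capabilities in one go; a hedged example (the easyconfig name is derived from the module name listed earlier in these notes):

```bash
# request device code for cc70/cc80/cc90 in a single installation
eb CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb \
    --cuda-compute-capabilities=7.0,8.0,9.0 --robot
```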
## Next meeting

- (Thomas is on leave 7-11 Oct)

---

# Sync meeting 20240925

## Merged PRs

- take into account accelerator target when configuring EasyBuild ([PR #710](https://github.com/EESSI/software-layer/pull/710))
- update `EESSI-remove-software.sh` script to support removal of GPU installations (in `accel/*`) ([PR #721](https://github.com/EESSI/software-layer/pull/721))
- Avoid `create_tarball.sh` exiting on non-matching grep ([PR #723](https://github.com/EESSI/software-layer/pull/723))
- Make the `bot/check-build.sh` script correctly pick up builds for accelerators ([PR #726](https://github.com/EESSI/software-layer/pull/726))

## Open PRs

- CUDA 12.1.1 (**rebuild**) ([PR #720](https://github.com/EESSI/software-layer/pull/720))
    - requires changes to EasyBuild hook for CUDA in [PR #735](https://github.com/EESSI/software-layer/pull/735)
    - built for `x86_64/amd/zen2/accel/nvidia/cc80` + `x86_64/amd/zen3/accel/nvidia/cc80`
    - deploy triggered, will be ingested soon
- [Kenneth] UCX-CUDA v1.14.1 (**rebuild**) ([PR #719](https://github.com/EESSI/software-layer/pull/719))
    - requires CUDA
- [Thomas] cuDNN 8.9.2.26 ([PR #581](https://github.com/EESSI/software-layer/pull/581))
    - requires CUDA
- [Caspar] CUDA-Samples v12.1 (**rebuild**) ([PR #715](https://github.com/EESSI/software-layer/pull/715))
    - requires CUDA
- [Kenneth] NCCL 2.18.3 (**rebuild**) ([PR #741](https://github.com/EESSI/software-layer/pull/741))
    - requires CUDA + UCX-CUDA (+ ~~EasyBuild v4.9.4, see [PR #740](https://github.com/EESSI/software-layer/pull/740)~~)
- [Lara] UCC-CUDA ([PR #750](https://github.com/EESSI/software-layer/pull/750))
    - requires CUDA + UCX-CUDA + NCCL
- [Lara] OSU-Micro-Benchmarks v7.2 w/ CUDA 12.1.1 (**rebuild**) ([PR #716](https://github.com/EESSI/software-layer/pull/716))
    - requires CUDA + UCX-CUDA + UCC-CUDA
- [Thomas/Caspar] PyTorch v2.1.2 w/ CUDA 12.1.1 ([PR #586](https://github.com/EESSI/software-layer/pull/586) or [PR #718](https://github.com/EESSI/software-layer/pull/718))
    - requires UCX-CUDA + cuDNN
- [Thomas/Caspar] TensorFlow v2.15.1 w/ CUDA 12.1.1 ([PR #717](https://github.com/EESSI/software-layer/pull/717))
    - requires CUDA + NCCL + cuDNN
- [Kenneth/Lara] LAMMPS 2Aug2023 w/ CUDA 12.1.1 ([PR #711](https://github.com/EESSI/software-layer/pull/711))
    - requires CUDA + UCX-CUDA + NCCL
- [Pedro] ESPResSo w/ CUDA ([PR #748](https://github.com/EESSI/software-layer/pull/748))
    - requires only CUDA
    - [easyconfigs PR #21440](https://github.com/easybuilders/easybuild-easyconfigs/pull/21440)

## TODO

- `create_tarball.sh` might not include Lmod cache and config files correctly for GPU builds ([issue #722](https://github.com/EESSI/software-layer/issues/722))
    - we need to determine whether we'll need separate Lmod hooks
- [Bob] make Lmod cache aware of modules in `/accel/*`
    - separate cache for each `accel/*` subdirectory + extra `lmodrc.lua` that is added to `$LMOD_RC`
    - can start looking into this as soon as `CUDA-Samples` is deployed
- [Bob] make CI that checks for missing modules aware of GPU installations
    - check each CPU+GPU combo separately for now
- [Alan?] add NVIDIA GPU detection to `archdetect` + make sure init script & EESSI module use it
    - add `gpupath` function to `archdetect` script (see the sketch after these notes)
    - only exact match for `cc*` for now, determine via `nvidia-smi`, see also https://gitlab.com/eessi/support/-/issues/59#note_1909332878

## Notes

- (Richard) libfabric + CUDA
- (Caspar) we should add technical information on how EESSI is built, incl. how/why we strip down CUDA installations
    - starting point could be the (outdated) ["Behind the Scenes" talk](https://github.com/EESSI/docs/blob/main/talks/20210119_EESSI_behind_the_scenes/EESSI-behind-the-scenes-20210119.pdf)

## Next meeting

- Tue 1 Oct 2024 09:00 CEST for GPU tiger team
    - 10:00 CEST for dev.eessi.io
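A minimal sketch of such a `gpupath` helper, assuming detection via `nvidia-smi --query-gpu=compute_cap` as noted above (function name and output format are illustrative; exact match only, no fallback yet):

```bash
gpupath() {
    # no NVIDIA driver available: no accel path to report
    command -v nvidia-smi &>/dev/null || return 1
    local cc
    cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)
    [ -n "${cc}" ] || return 1
    # e.g. "8.0" -> "accel/nvidia/cc80"
    echo "accel/nvidia/cc${cc//./}"
}
```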
---

## Sync meeting 20240918

attending: Kenneth, Lara, Alan, Richard, Pedro, Caspar, Thomas

### Done

- pass down `accelerator` filter in `build` command for bot into `job.cfg` file ([merged bot PR #280](https://github.com/EESSI/eessi-bot-software-layer/pull/280))

  ```ini
  # from working directory for build job:
  # $ cat cfg/job.cfg
  ...
  [architecture]
  software_subdir = x86_64/amd/zen2
  os_type = linux
  accelerator = nvidia/cc80
  ```

- mark `with_accelerator` setting in `[submitted_job_comments]` section as required ([merged bot PR #282](https://github.com/EESSI/eessi-bot-software-layer/pull/282))

### Open PRs

- take into account accelerator target when configuring EasyBuild ([software-layer PR #710](https://github.com/EESSI/software-layer/pull/710))
    - tested via [PR to boegel's fork with `beagle-lib`](https://github.com/boegel/software-layer/pull/33)
    - seems to be working, produces `2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all/beagle-lib/4.0.1-GCC-12.3.0-CUDA-12.1.1.lua`
- (CPU-only) GROMACS 2024.3 ([software-layer PR #709](https://github.com/EESSI/software-layer/pull/709))

### Next steps

- [Thomas,Kenneth] new bot release (v0.5.0?), so we can build GPU software with production bots @ AWS+Azure
- [Kenneth] update `bot/check-build.sh` to be aware of `/accel/` paths
    - just so reporting of files included in the tarball to deploy looks nicer
    - => should be done in a separate PR
- [???] verify `accelerator` value so only allowed values (`cc80`, `cc90`) can be used
    - mainly to avoid silly mistakes that would be easy to overlook
    - hard mapping of `cc80` to `8.0` instead of hacky approach with `sed` in `configure_easybuild` to define `$EASYBUILD_CUDA_COMPUTE_CAPABILITIES`? (see the sketch after this list)
- [???] verify CPU + accelerator target combos, only select combos should be used:
    - separate small (Python) script to call from `bot/build.sh`?
    - valid combos for EuroHPC systems:
        - `x86_64/amd/zen2/accel/nvidia/cc80` (AMD Rome + A100 => Vega + Meluxina + Deucalion) => **MAIN TARGET**
            - should be possible to find as AWS instance
            - could use `arch:x86_64/amd/zen2+cc80`
            - this would also work for Hortense @ UGent, Snellius @ SURF, ...
        - `x86_64/amd/zen3/accel/nvidia/cc80` (AMD Milan + A100 => Karolina)
        - `x86_64/intel/icelake/accel/nvidia/cc80` (Intel Ice Lake + A100 => Leonardo)
            - or `/skylake_avx512/` since we don't have `icelake` CPU target?
        - `x86_64/intel/sapphirerapids/accel/nvidia/cc90` (Intel Sapphire Rapids + H100 => MareNostrum 5)
            - or `/skylake_avx512/` (for now) since no `sapphirerapids` CPU target (yet)?
        - `x86_64/amd/zen3/accel/amd/gfx90a` (AMD Milan + MI250X => LUMI)
        - `aarch64/nvidia/grace/accel/nvidia/cc90` (NVIDIA Grace + H100 => JUPITER)
            - => should be `/neoverse_v2` instead of `/grace/`
        - (no GPUs in Discoverer)
- [???] refuse to install CPU-only modules under `accel/*`
    - missing CPU-only deps must be installed through a separate `software-layer` PR
- [Alan?] verify CPU(+GPU) target at start of build job, to catch misconfigurations
- *** [Kenneth/Thomas/Alan] build nodes with GPU (relevant for test suites)
    - needs enhancement in the bot so build jobs can be sent to GPU nodes
        - `bot build ... node:zen2-cc80` which adds additional `sbatch` arguments
        - or via `bot build ... arch:zen2+cc80`
    - easy to add GPU partition to Magic Castle cluster
        - also needs node image with GPU drivers
        - Magic Castle auto-installs GPU drivers when it sees a GPU, we probably don't want that
- [Caspar] we need a way to pick a different ReFrame configuration file based on having a GPU or not
- [Alan?] let bot or GitHub Actions auto-report missing modules for PRs to the `software-layer` repo
- build & deploy GPU software
    - [Caspar] CUDA-Samples (moved from CPU-only install path)
    - [Caspar] OSU-Micro-Benchmarks (moved from CPU-only install path)
        - also for `{x86_64,aarch64}/generic/accel/nvidia/generic` as fallback?
    - [Caspar] UCX-CUDA (moved from CPU-only install path?)
    - [Caspar] CUDA itself (moved from CPU-only install path?)
    - [Caspar] NCCL (moved from CPU-only install path?)
    - [Pedro] ESPResSo
        - OK for `foss/2023a` (not for `foss/2023b`)
        - PR for easyconfigs repo + software-layer PR to test build for `accel:nvidia/cc80`
            - to dedicated easystack file: `eessi-2023.06-eb-4.9.3-2023a-CUDA.yml`
    - [Bob?] GROMACS
    - [Lara] LAMMPS
        - using existing `LAMMPS-2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1.eb`
        - TODO: software-layer PR + test `accel` build
    - [Caspar] TensorFlow
    - [Caspar/Thomas] PyTorch
    - only build for now, don't deploy yet; ideally we use GPU nodes so GPU tests are also run
    - to test the GPU software installations, we also need the `host_injections` stuff in place
        - see https://www.eessi.io/docs/site_specific_config/gpu
- [???] figure out if `.lmod/XYZ` files should be shipped in accel prefix, see [issue #722](https://github.com/EESSI/software-layer/issues/722)
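A possible shape for that hard mapping, assuming accelerator values like `nvidia/cc80` as in the `job.cfg` example above (pure bash, no `sed`):

```bash
# derive $EASYBUILD_CUDA_COMPUTE_CAPABILITIES from the accelerator value
accel="nvidia/cc80"
cc="${accel##*cc}"       # "80"
# insert a dot before the last digit: "80" -> "8.0" (also "100" -> "10.0")
export EASYBUILD_CUDA_COMPUTE_CAPABILITIES="${cc%?}.${cc: -1}"
```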
### Notes

- is OpenMPI in EESSI CUDA-aware?
    - yes, via UCX-CUDA (see the check below)

### Next sync meeting

- Wed 25 Sept 11:00 CEST (Kenneth, Pedro, Alan, Thomas, Richard?)
    - clashes with [AI-Friendly EuroHPC Systems EuroHPC session](https://eurohpc-ju.europa.eu/news-events/events/ai-friendly-eurohpc-systems-virtual-2024-09-25_en) (09:30-12:45 CEST)
    - can we reschedule to 13:00 CEST (GPU tiger team) + 14:00 CEST (dev.eessi.io tiger team)?
        - OK
    - (17:00 CEST is the bi-weekly EasyBuild conf call, KH needs time to prepare it 15:00-17:00)
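One way to verify this from a user session, sketched under the assumption that the CUDA support comes in via the UCX-CUDA module (module name taken from the PR list above; whether OpenMPI actually picks it up also depends on the runtime configuration):

```bash
# list UCX transports once UCX-CUDA is loaded; cuda_copy / cuda_ipc / gdr_copy
# indicate CUDA-aware communication paths that OpenMPI can use via UCX
module load UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1
ucx_info -d | grep -i cuda
```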
---

## Kickoff meeting 20240913

attending: Alan, Thomas, Richard, Pedro, Bob, Kenneth

### Goal

- more software installations in EESSI that support NVIDIA GPUs
- `accel/*` subdirectories (see [support issue #59](https://gitlab.com/eessi/support/-/issues/59#note_1924348179))

### Plan

#### software-layer scripts

- enhance scripts to support building for a specified accelerator (see `job.cfg` script)
    - similar to how `$EESSI_SOFTWARE_SUBDIR_OVERRIDE` is defined in `bot/build.sh`
    - check whether the `accelerator` value is "valid" + whether the combo of CPU target + accelerator is valid
- have to modify EasyBuild hooks to either:
    - "select" correct software installation prefix for each installation;
        - maybe opens the door to making mistakes easily, having stuff installed in the wrong place;
    - refuse installing non-CUDA stuff in `/accel/` path;
        - probably less error-prone, so let's do this;
- map `accel=nvidia/cc80` to (see the sketch at the end of these notes):
    - "`module use $EASYBUILD_INSTALLPATH/modules/all`" to make CPU modules available
    - `$EASYBUILD_INSTALLPATH=${EASYBUILD_INSTALLPATH}/accel/nvidia/cc80`
    - `$EASYBUILD_CUDA_COMPUTE_CAPABILITIES=8.0`

#### archdetect

- auto-detect NVIDIA GPU + compute capability
    - via `nvidia-smi --query-gpu=compute_cap`
    - to let init script also do `module use .../accel/nvidia/cc80`

#### bot

- see [open PR #280](https://github.com/EESSI/eessi-bot-software-layer/pull/280) to add `accelerator=xxx` to `job.cfg` file
- new bot release after merging PR #280

#### software

- x CUDA-Samples (reinstall)
- NCCL (reinstall + add license to installation)
- UCC-CUDA (reinstall)
- x UCX-CUDA (reinstall)
- x OSU-Micro-Benchmarks (reinstall)
- ESPResSo
    - no easyconfigs yet for ESPResSo with GPU support => task
- GROMACS
- LAMMPS
- PyTorch
- TensorFlow

#### CPU/GPU combo targets

- AMD Rome + A100 (Vega) => `amd/zen2/accel/nvidia/cc80`
- AMD Milan + A100 (Karolina) => `amd/zen3/accel/nvidia/cc80`
- Deucalion: Rome? + NVIDIA => `???`
- Meluxina: ???
- JUPITER: NVIDIA Grace + H100 => `aarch64/grace/accel/nvidia/cc??`

#### Hardware for build nodes

- manual test build on target hardware via `EESSI-extend`
- or by setting up the bot in your own account and making test PRs to your fork of the `software-layer` repo
- which AWS instance for AMD Rome + A100?

#### Ideas/future work

- script that reports contents of generated tarball may need some work (Bob?)
- also build for `accel/nvidia/generic` (or a very low compute capability like `/cc60`)

### Tasks (by next meeting)

- [Pedro] easyconfig for ESPResSo with CUDA support
- [Kenneth] re-review + merge of [bot PR #280](https://github.com/EESSI/eessi-bot-software-layer/pull/280)
- [Kenneth] update software-layer scripts
- [Alan] let archdetect detect GPU compute capability + produce list of all valid combos in sensible order
- [Alan] enhance init script/module file to implement auto-detect

### Next meetings

- Wed 18 Sept 11:00 CEST (Kenneth, Pedro, Alan, Richard?)
- Wed 25 Sept 11:00 CEST (Kenneth, Pedro, Alan, Thomas, Richard?)
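A minimal sketch of the `accel=nvidia/cc80` mapping described under "software-layer scripts" above (environment handling is illustrative, the variables are the ones named in the notes):

```bash
accel="nvidia/cc80"
# keep CPU-level modules visible, so they can be used as dependencies
module use "${EASYBUILD_INSTALLPATH}/modules/all"
# redirect installations into the accelerator-specific prefix
export EASYBUILD_INSTALLPATH="${EASYBUILD_INSTALLPATH}/accel/${accel}"
export EASYBUILD_CUDA_COMPUTE_CAPABILITIES="8.0"
```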