# Tiger team - GPU software in EESSI

Tiger team lead: Kenneth

# Sync meeting 20250507

attending:
- Richard, Thomas (UiB)
- Caspar (SURF)
- Lara (UGent)

## agenda

- action points for weeks since last meeting:
  - Caspar: CUDA sanity check in EasyBuild, access to service account @ SURF, help with arch_target_map in bot
    - CUDA sanity check (https://github.com/easybuilders/easybuild-framework/pull/4692)
    - access to service account @ SURF: "everybody" (works for Kenneth, Lara, Thomas; unclear for Richard) -- access means access to files, not access to the service account
      - https://gitlab.com/eessi/support/-/wikis/GPU-SURF-cluster
    - help with arch_target_map in bot (draft https://github.com/EESSI/eessi-bot-software-layer/pull/312)
  - Lara: full matrix test builds for CUDA-samples & co
    - did not get to it, postpone to next meeting
    - Richard did something
  - Richard: full matrix test builds for CUDA-samples & co
    - CUDA + a bunch of CUDA packages
    - cc70: https://github.com/EESSI/software-layer/pull/1030
      - some failure building CUDA-Samples
      - options:
        - don't build CUDA-Samples, or not this version --> not really a good option
        - build a newer version of CUDA-Samples and keep the older version, plus provide a module file that points users to the newer version
    - cc80: https://github.com/EESSI/software-layer/pull/1076
    - cc90: https://github.com/EESSI/software-layer/pull/1077
    - go ahead building CUDA, UCX, UCC and OSU without CUDA-Samples
    - build newer version of CUDA-Samples for all ccXX
    - build older version of CUDA-Samples for all ccXX, where cc70 modules have a warning/info that it didn't build and users can use a newer version
    - [Kenneth] open an issue to update contribution policy for what is acceptable if a software (version) cannot be built for some specific hardware configurations (e.g., CUDA-Samples 12.1 for cc70)
  - Thomas: arch_target_map in bot, checking in bot
    - looked briefly at Caspar's draft PR
  - PyTorch 2.1.2 with CUDA
    - add ReFrame tests to list of tests that are run
    - zen3+cc80 (UGent) built
    - zen4+cc90 (Snellius) failed (timeout? in testing)
    - grace+cc90 (Jureca) "failed" with ~20 tests instead of ~10 -> ask Flamefire if anything is concerning
    - cc70
      - create known issue and go ahead
  - Kenneth: vacation (11-21 April), butterfly mode (help where needed)
    - very little
  - Alan: EasyBuild hook + implement fallback from cc89 to cc80 in init script
    - check with Alan
- bot improvement
  - UGent setup: scontrol needs to know the cluster (a wrapper could get the job id via squeue, then use the actual scontrol; a sketch follows below)
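  Such a wrapper could look roughly like this (a minimal, untested bash sketch; the wrapper's calling convention `scontrol show job <jobid>` and the path to the real scontrol are assumptions):

  ```bash
  #!/bin/bash
  # hypothetical scontrol wrapper for a multi-cluster setup: figure out which
  # cluster knows about the job via squeue, then invoke the real scontrol
  # with an explicit --clusters option
  REAL_SCONTROL=/usr/bin/scontrol    # assumed location of the actual scontrol

  # assume the job id is the last argument, as in 'scontrol show job <jobid>'
  jobid="${@: -1}"

  # query all clusters for this job id; strip 'CLUSTER:' banner lines, if any
  cluster=$(squeue --clusters=all --noheader --Format=cluster --jobs="${jobid}" 2>/dev/null \
      | grep -v '^CLUSTER' | head -n 1 | tr -d '[:space:]')

  if [ -n "${cluster}" ]; then
      exec "${REAL_SCONTROL}" --clusters="${cluster}" "$@"
  else
      exec "${REAL_SCONTROL}" "$@"    # fall back to the default cluster
  fi
  ```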
- LAMMPS on NVIDIA Grace/Hopper
  - didn't build on jureca nodes
  - Richard talks to Lara about it
- try PyTorch 2.6 or even 2.7
  - work with Flamefire
- next meeting: "doodle" for after the last webinar, not before June 5
  - for Kenneth: not before 5th of June

# Sync meeting 20250409

attending:
- Thomas, Richard (UiB)
- Lara, Kenneth (UGent)
- Caspar (SURF)
- Alan (CECAM/UB)

## Agenda

### Prepared notes (Thomas)

- Formalize supported CUDA Compute Capability & CPU architecture combinations
  - Decide on which CPU+CCx combinations we want to build for: please, everyone read up on https://gitlab.com/eessi/support/-/issues/142 and make up your mind
  - so far we have CUDA 12.1.1 (with foss/2023a) and CUDA 12.4.0 (with foss/2023b)
  - unclear why CUDA 11 is discussed in the issue?
  - Blackwell requires CUDA 12.8 - have to keep that in mind for EESSI/2025.04 + toolchains bundled with that
  - unclear what "auto-upgrade" means, not clear why we need that
  - proposal for supported CPU+ccXY combinations (regardless of toolchain?)
    - EESSI/2023.06: any-CPU+cc[70,80,90]
    - EESSI/2025.04: any-CPU+cc[70,80,90,100]
    - does CUDA 12.8 still fully support cc70? just curious
    - guarantees that cc70 builds work on any cc7x device, cc80 builds work on any cc8x device, ...
    - could/should change archdetect so it only returns the major version (however, we then need to change that if we change our mind and build for minor cc-versions; maybe better to keep archdetect as is and decide which build to use where archdetect is used?)
    - PTX sounds interesting, but maybe we look into that later?
  - proposal for build workflow/strategy
    - first build on nodes with GPUs
    - if successful, build other combinations
    - add note to module file to inform whether package has been built on GPU/ccXY or not
    - log / document which combinations are supported natively, and which are cross-compiled
    - we should have a strategy in case something doesn't work
- (CUDA sanity check)
- other topics (bot support, access to build nodes with GPUs)

### Meeting notes

- things to decide
  - which CUDA versions per EESSI version
    - CUDA 12.1.1 (2023a) + 12.4.0 (2023b) in EESSI 2023.06
    - CUDA 12.8.0 (Blackwell support, 2024a) in EESSI 2025.0X
  - which NVIDIA GPU targets (ccXY) per EESSI version (+ toolchain?)
    - cc80 (A100) would be sufficient to also cover L4, L40, A40, etc.
    - major versions (cc70, cc80, cc90) for all supported CPU targets
    - specific minor versions (cc89) only with CPU targets that "make sense"
      - cover combos available in AWS/Azure
      - like cc75 (T4) + Cascade Lake => AWS G4 instances (https://aws.amazon.com/ec2/instance-types/g4/)
  - which CPU+GPU target combos
- wait for CUDA sanity check feature in EasyBuild
  - requires EasyBuild 5.0.X
  - see https://github.com/easybuilders/easybuild-framework/pull/4692
  - can also verify afterwards with `--sanity-check-only`
  - needs extra option to error out in case CUDA sanity check fails
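  Until that feature lands, a check along these lines could be scripted by hand (a hedged sketch; the expected compute capability and the file selection are assumptions, and `cuobjdump` from the CUDA toolkit must be on `$PATH`):

  ```bash
  # scan binaries/libraries in an installation prefix and report which device
  # code (sm_XY) they embed, to compare against the intended CUDA compute
  # capability
  EXPECTED="sm_80"    # assumption: this build targets cc80
  PREFIX="$1"         # installation prefix to check

  find "${PREFIX}" -type f \( -executable -o -name '*.so*' \) | while read -r f; do
      # '--list-elf' lists embedded cubins, e.g. '[...].sm_80.cubin'
      sms=$(cuobjdump --list-elf "$f" 2>/dev/null | grep -o 'sm_[0-9]*' | sort -u)
      [ -z "${sms}" ] && continue    # no device code embedded, skip
      if ! echo "${sms}" | grep -qx "${EXPECTED}"; then
          echo "WARNING: $f lacks ${EXPECTED} device code (found: ${sms})" >&2
      fi
  done
  ```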
- GPU build nodes vs "cross-compiling"
  - procedure, for each PR:
    - first, test builds on GPU build nodes
      - @UGent: zen3+cc80 (service account)
        - can also do cascadelake+cc70 and zen4+cc90
      - @SURF: zen4+cc90 (service account)
        - can also do icelake+cc90
      - @JSC: grace+cc90 (service account, when available)
      - @AWS: cascadelake+cc75 (G4 instances)
        - only once cascadelake CPU target is fully supported
    - then build for all major GPU targets (ccX0), cross-compiling
    - then build for specific CPU+GPU combos, cross-compiling
      - initially, do only a subset:
        - cascadelake+cc70 (UGent V100)
        - zen2+cc80 (A100 @ Vega, Karolina, ...)
        - zen3+cc80 (UGent A100)
        - zen4+cc90 (SURF + UGent H100)
        - nvidia/grace+cc90 (JSC)
        - icelake+cc80 (SURF) -- when CPU target is available
- how to deal with explosion of bot build jobs + ensure everything is there
  - also improve ingestion
  - add support for wildcards for CPU/GPU targets so all targets get built automatically
  - `bot: check-deploy` => each bot verifies whether PR is ready to deploy
    - only set label if each bot replies positively
    - eventually also let 1 bot verify overall status?
- start with CUDA-Samples, OSU, NCCL, GROMACS, LAMMPS, ESPResSo, ...
- (try) full matrix for CUDA-Samples + OSU + required dependencies
  - to fix what we broke when removing CUDA & co from CPU-only locations
  - `CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1`
  - `OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1`
  - `OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0`
- subset for apps like GROMACS
- dedicated issue to keep track of progress
- archdetect needs to be enhanced to make it aware of major GPU target version
  - should actually be done by init script (cc89 -> cc89+cc80)
- EasyBuild hook should know when it's expected to have a GPU available
  - based on whether `nvidia-smi` is available
  - if `$EESSI_ACCELERATOR_TARGET` is specified, verify whether it matches what `nvidia-smi` reports (sketched below)
  - to enforce testing + inject info into the generated module file on whether GPU testing was done or not
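  A minimal sketch of that verification (how it hooks into the EasyBuild hooks is an assumption; `nvidia-smi --query-gpu=compute_cap` reports e.g. `8.0`):

  ```bash
  # compare $EESSI_ACCELERATOR_TARGET (e.g. 'nvidia/cc80') against the GPU
  # that is actually present on the build node
  if [ -n "${EESSI_ACCELERATOR_TARGET}" ]; then
      if command -v nvidia-smi >/dev/null 2>&1; then
          cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
          detected="nvidia/cc${cap//./}"    # '8.0' -> 'nvidia/cc80'
          if [ "${detected}" != "${EESSI_ACCELERATOR_TARGET}" ]; then
              echo "ERROR: building for ${EESSI_ACCELERATOR_TARGET}, but node has ${detected}" >&2
              exit 1
          fi
      else
          echo "INFO: no nvidia-smi found, assuming GPU-less (cross-compiling) build node" >&2
      fi
  fi
  ```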
- make sure that no CPU-only modules get installed in `accel` subdirectory?
  - enforce in `EESSI-extend` only for non-project/non-user installations
- action points for coming weeks:
  - Caspar: CUDA sanity check in EasyBuild, access to service account @ SURF, help with arch_target_map in bot
  - Lara: full matrix test builds for CUDA-samples & co
  - Thomas: arch_target_map in bot, checking in bot
  - Richard: full matrix test builds for CUDA-samples & co
  - Kenneth: vacation (11-21 April), butterfly mode (help where needed)
  - Alan: EasyBuild hook + implement fallback from cc89 to cc80 in init script
- next meeting
  - Wed 7 May 2025 13:00 CEST: OK for all

---

# Sync meeting 20250114

attending: Richard, Lara, Kenneth, Alan, Bob, Caspar, Thomas

## Agenda

- status update
  - built for specific CPU microarch + CUDA CC (zen2 + cc80, zen3 + cc80)
  - some software is tricky to build (PyTorch, TensorFlow, ...)
- next steps
  - in the process of setting up bots on clusters with GPUs (UGent, SURF, ...)
  - keep focus on A100+Zen2/Zen3 for now (covers most of EuroHPC systems)
    - A100+Zen3 as GPU build node
    - A100+Zen2 on non-GPU build node (and then test on A100+Zen3 node)
  - also start supporting Intel Ice Lake as extra CPU generation between Skylake & Sapphire Rapids?
    - Intel Ice Lake for SURF A100 nodes + Leonardo GPU partition
- general discussion
  - not feasible to really use GPU nodes for all combinations of CPU+GPU
    - BUT do at least 1 (?) build on a GPU node for each software package, also for each GPU generation
  - for service account for EESSI at UGent, supported GPUs:
    - V100 (+ Intel Cascade Lake)
    - A100 (+ Zen3, also + Zen2 on Tier-1 system)
    - H100 (+ Zen4)
  - would be helpful if bot knows what "baseline" is for testing GPU builds, like `bot: build arch=gpu-baseline`
    - as a shorthand for build command for V100+A100+H100
    - would be useful for CPUs too
  - way for ensuring that all required builds have been done
    - some CI or bot checks
  - build for generic+cc80 on CPU-only nodes (e.g., on AWS), may be tested on more modern CPUs that have A100 or newer
  - need changes to bot / overall test setup:
    1. build on resource A, but only run minimal tests, if any
    2. upload tarball to some S3
    3. run test on resource B
- software-layer
  - stacks for new architectures: GH200 (OpenMPI/MPICH)
    - important for Olivia (NRIS/Sigma2, Norway), JUPITER
    - could run build bot at JSC
    - CUDA 12.0 might be minimum version
    - sync meeting with JSC (Sebastian) to understand status of builds for Grace?
- need to re-generate module files for CUDA module file
  - see [easyblocks PR #3516](https://github.com/easybuilders/easybuild-easyblocks/pull/3516)
- PyTorch w/ CUDA
  - handful of failing tests, maybe the usual suspects?
  - AP TR: send list to Kenneth
- TensorFlow w/ CUDA
  - host stuff leaking into build of Bazel/TensorFlow, leading to glibc compat issues
  - maybe [easyblocks #3486](https://github.com/easybuilders/easybuild-easyblocks/pull/3486) is relevant here
- AMD GPUs
  - check if support is needed for deliverables in June
  - good connection to AMD
  - currently only relevant for LUMI
  - toolchain issue
  - detection of libraries, drivers (how different is it to what we do for NVIDIA?)
- sanity check for compute capabilities: check if device code in builds is what the build was intended for
  - add to `bot/check-build.sh` script
  - see [support issue #92](https://gitlab.com/eessi/support/-/issues/92) in GitLab
- EESSI test suite
  - some issue with test with locally deployed bot: https://gitlab.com/eessi/support/-/issues/114
  - what GPU tests do we already have?
- eessi-bot-software-layer
  - support for `node` filter (to send job to GPU nodes)
  - change way jobs are created (to avoid that many jobs are created by accident)
  - think about baseline feature
  - think about letting ReFrame itself schedule test jobs
  - add support for silencing the bot :-)
- next meetings: schedule via Slack
  - AMD:
  - tiger-NVIDIA

# Sync meeting 20241009

attending: Kenneth, Alan, Caspar, Richard, Lara, Pedro

## Merged PRs

- *(none)*

## Open PRs

- Allow Nvidia driver script to set `$LD_PRELOAD` ([software-layer PR #754](https://github.com/EESSI/software-layer/pull/754))
  - list of CUDA compat libraries is a lot shorter, are those actually enough?
  - we should try and get this merged as is, additional features can be implemented in follow-up PRs
  - Kenneth will review PR
- enhance archdetect to support detection of NVIDIA GPUs + using that in EESSI init script ([software-layer PR #767](https://github.com/EESSI/software-layer/pull/767))
  - only covers init script, not `EESSI` module yet
  - being deployed right now
- cuDNN 8.9.2.26 w/ CUDA 12.1.1 (part 1) ([software-layer PR #772](https://github.com/EESSI/software-layer/pull/772))
  - minor open discussion point w.r.t. `--accept-cuda-eula` option, we should either make it more generic (`--accept-eula=CUDA,cuDNN`), or also add `--accept-cudnn-eula`
  - open question on whether `--rebuild` is required in `install_cuda_and_libraries.sh`
  - easystack file that specifies what to strip down to runtime stuff + install under `host_injections` should be shipped with EESSI
  - need to update `install_cuda_and_libraries.sh` script to load a specific EasyBuild module version before loading the EESSI-extend module (see the sketch below)
  - path to easystack file(s) should be hardcoded in `install_cuda_and_libraries.sh` script, not passed as an argument
  - no need to use `--dry-run --rebuild` to figure out what needs to be installed, let EB figure it out based on modules in hidden path `.modules/all`
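  The intended flow could look like this (a sketch; the pinned EasyBuild version and the easystack path are illustrative assumptions):

  ```bash
  # pin a specific EasyBuild version before setting up EESSI-extend
  module load EasyBuild/4.9.4
  module load EESSI-extend

  # hardcoded easystack describing what to install under host_injections
  easystack="${EESSI_PREFIX}/scripts/easystacks/eessi-cuda-host-injections.yml"

  # no --dry-run/--rebuild needed: EasyBuild skips what is already installed
  eb --easystack "${easystack}"
  ```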
- enhance cuDNN easyblock to verify that EULA is accepted before installing it ([easyblocks PR #3473](https://github.com/easybuilders/easybuild-easyblocks/pull/3473))
- waLBerla v6.1 w/ CUDA 12.1.1 ([software-layer PR #780](https://github.com/EESSI/software-layer/pull/780))
- on hold/blocked PRs:
  - TensorFlow v2.15.1 w/ CUDA 12.1.1 ([software-layer PR #717](https://github.com/EESSI/software-layer/pull/717)): requires `cuDNN`
  - PyTorch v2.1.2 w/ CUDA 12.1.1 ([software-layer PR #718](https://github.com/EESSI/software-layer/pull/718)): requires `cuDNN`

## Discussion

- additional `accel/nvidia/common` subdirectory for CUDA, CUDA-Samples, OSU-MicroBenchmarks, cuDNN, ... ?
  - useful when specific type of GPU is not supported yet (currently only A100)
  - have to ensure fat installations of CUDA-Samples & OSU-MicroBenchmarks that can be used on any type of NVIDIA GPU
  - older CUDA versions will not work for newer NVIDIA GPUs, that's a counter-argument against introducing `/common`
  - CUDA-Samples currently specifies `SMS`, so is built for specific CUDA compute capabilities (+ corresponding PTX code, so can be JIT-compiled on newer GPUs, not older GPUs)
  - fallback installations with support for a range of GPUs under `x86_64/generic`?
  - best way would be by implementing `nvcc` (& co) compiler wrappers in EasyBuild, so we can easily inject the correct options (and remove any others being added by whatever build system is being used); see the wrapper sketch below
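  A hypothetical wrapper sketch (the injected compute capabilities and the use of `$EBROOTCUDA` are assumptions):

  ```bash
  #!/bin/bash
  # nvcc wrapper: drop any -arch/-gencode options added by the build system,
  # and inject device code for cc80 plus matching PTX (so newer GPUs can JIT)
  REAL_NVCC="${EBROOTCUDA}/bin/nvcc"    # CUDA installation loaded via module

  args=()
  skip_next=0
  for arg in "$@"; do
      if [ "${skip_next}" -eq 1 ]; then skip_next=0; continue; fi
      case "${arg}" in
          -arch|-gencode|--generate-code) skip_next=1 ;;    # drop option + value
          -arch=*|-gencode=*|--generate-code=*) ;;          # drop combined form
          *) args+=("${arg}") ;;
      esac
  done

  exec "${REAL_NVCC}" \
      -gencode arch=compute_80,code=sm_80 \
      -gencode arch=compute_80,code=compute_80 \
      "${args[@]}"
  ```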
- sanity check for CUDA compute capabilities being used: [support issue #92](https://gitlab.com/eessi/support/-/issues/92)

## TODO

- GPU build node `[Kenneth]`
- GROMACS w/ CUDA
  - see [easyconfigs PR #21549](https://github.com/easybuilders/easybuild-easyconfigs/pull/21549)
  - no libfabric with CUDA dependency (doesn't exist yet)
    - in EESSI, workaround could be to inject an OpenMPI that has support for CUDA-aware libfabric
  - GROMACS docs ~~only~~ mostly mention UCX, not libfabric: https://manual.gromacs.org/current/install-guide/index.html#gpu-aware-mpi-support
- waLBerla w/ CUDA
  - see [easyconfigs PR #21601](https://github.com/easybuilders/easybuild-easyconfigs/pull/21601)

## Next sync meeting

- Wed 16 Oct'24 15:00 CEST

---

# Sync meeting 20241002

attending: Kenneth, Pedro, Lara, Caspar, Thomas

## Status

- GPU installations in place for ESPResSo, LAMMPS, CUDA-Samples, OSU-Microbenchmarks

## Merged PRs

- CUDA 12.1.1 (**rebuild**) ([PR #720](https://github.com/EESSI/software-layer/pull/720))
- UCX-CUDA v1.14.1 (**rebuild**) ([PR #719](https://github.com/EESSI/software-layer/pull/719))
- CUDA-Samples v12.1 (**rebuild**) ([PR #715](https://github.com/EESSI/software-layer/pull/715))
- NCCL 2.18.3 (**rebuild**) ([PR #741](https://github.com/EESSI/software-layer/pull/741))
- UCC-CUDA ([PR #750](https://github.com/EESSI/software-layer/pull/750))
- OSU-Microbenchmarks v7.2 w/ CUDA 12.1.1 (**rebuild**) ([PR #716](https://github.com/EESSI/software-layer/pull/716))
- LAMMPS 2Aug2023 CUDA 12.1.1 ([PR #711](https://github.com/EESSI/software-layer/pull/711))
- ESPResSo w/ CUDA ([PR #748](https://github.com/EESSI/software-layer/pull/748))
- Take accelerator builds into account in CI that checks for missing installations ([PR #753](https://github.com/EESSI/software-layer/pull/753))

## Open PRs

- enhance archdetect to support detection of NVIDIA GPUs + using that in EESSI init script ([PR #767](https://github.com/EESSI/software-layer/pull/767))
- [Thomas] cuDNN 8.9.2.26 ([PR #581](https://github.com/EESSI/software-layer/pull/581))
  - some time spent on it, but no new PR yet
  - this PR does a bunch of things:
    - install cuDNN correctly in EESSI (only files we're allowed to re-distribute)
      - see updated EasyBuild hooks
    - use easystack file for CUDA + cuDNN
    - generic script to install CUDA + cuDNN under `host_injections` and unbreak the symlinks
      - required when installing stuff on top of cuDNN (like TensorFlow)
    - update Lmod hooks to also throw an error when cuDNN module is loaded without symlinks being unbroken first via `host_injections`
      - => separate PR for this
    - update EasyBuild hooks to downgrade cuDNN to build-only dependency
  - Thomas will take a look this week, Caspar+Kenneth can continue the effort next week
- [Thomas/Caspar] PyTorch v2.1.2 w/ CUDA 12.1.1 ([PR #586](https://github.com/EESSI/software-layer/pull/586) or [PR #718](https://github.com/EESSI/software-layer/pull/718))
  - requires ~~UCX-CUDA~~ + cuDNN
- [Thomas/Caspar] TensorFlow v2.15.1 w/ CUDA 12.1.1 ([PR #717](https://github.com/EESSI/software-layer/pull/717))
  - requires ~~CUDA + NCCL~~ + cuDNN

## TODO

- [Kenneth] proper GPU node
  - will require bot + software-layer script changes too, to support something like `bot build arch=zen3:cc80`
  - or implement Sam's idea to be able to provide additional options
- more (MultiXscale) software w/ GPU support:
  - [Pedro] waLBerla
    - also move to newer toolchain
  - [Kenneth] PyStencils
  - [Davide?] MetalWalls (=> Kenneth?)
  - [Kenneth/Bob?] GROMACS (=> Kenneth?)
    - see also CPU-only GROMACS [PR #709](https://github.com/EESSI/software-layer/pull/709)
    - for GROMACS, we would definitely/ideally need a GPU node
- more `accel/*` targets?
  - currently:
    - `zen2` + `cc80` (A100) (Vega, Deucalion)
    - `zen3` + `cc80` (A100) (Karolina)
  - Snellius: Ice Lake + A100 (cc80), Zen4 + H100 (cc90)
  - HPC-UGent Tier-2 joltik: Intel Cascade Lake (`intel/skylake_avx512`) + V100 (cc70)
  - HPC-UGent Tier-2 donphan: Intel Cascade Lake (`intel/skylake_avx512`) + A2 (cc86)
  - RUG: `intel/icelake` + A100 (cc80) (may be in too short supply to use), `intel/skylake_avx512` + V100 (cc70)
  - => for now, stick to `zen2`+`cc80` and `zen3`+`cc80`, make sure we have a GPU build node first
- clean up GPU installations in CPU path of OSU-Microbenchmarks + CUDA-Samples
  - produce a warning first before removing
  - or also install these for `zen4` (just to keep things consistent)
  - provide `{x86_64,aarch64}/generic` builds of these, and pick up on them automatically?
    - could affect CPU-only optimized builds
  - build OSU-Microbenchmarks + CUDA-Samples in `accel/nvidia/generic` (fat binaries)?
    - there was a problem with building for multiple CUDA compute capabilities, but fixed by Caspar in [easyconfigs PR #21031](https://github.com/easybuilders/easybuild-easyconfigs/pull/21031)
- get GPU builds in software overview @ https://www.eessi.io/docs/available_software
  - separate table on software page, specific to GPU?

## Next meeting

- (Thomas is on leave 7-11 Oct)

---

# Sync meeting 20240925

## Merged PRs

- take into account accelerator target when configuring EasyBuild ([PR #710](https://github.com/EESSI/software-layer/pull/710))
- update `EESSI-remove-software.sh` script to support removal of GPU installations (in `accel/*`) ([PR #721](https://github.com/EESSI/software-layer/pull/721))
- Avoid `create_tarball.sh` exiting on non-matching grep ([PR #723](https://github.com/EESSI/software-layer/pull/723))
- Make the `bot/bot-check.sh` script correctly pick up builds for accelerators ([PR #726](https://github.com/EESSI/software-layer/pull/726))

## Open PRs

- CUDA 12.1.1 (**rebuild**) ([PR #720](https://github.com/EESSI/software-layer/pull/720))
  - requires changes to EasyBuild hook for CUDA in [PR #735](https://github.com/EESSI/software-layer/pull/735)
  - built for `x86_64/amd/zen2/accel/nvidia/cc80` + `x86_64/amd/zen3/accel/nvidia/cc80`
  - deploy triggered, will be ingested soon
- [Kenneth] UCX-CUDA v1.14.1 (**rebuild**) ([PR #719](https://github.com/EESSI/software-layer/pull/719))
  - requires CUDA
- [Thomas] cuDNN 8.9.2.26 ([PR #581](https://github.com/EESSI/software-layer/pull/581))
  - requires CUDA
- [Caspar] CUDA-Samples v12.1 (**rebuild**) ([PR #715](https://github.com/EESSI/software-layer/pull/715))
  - requires CUDA
- [Kenneth] NCCL 2.18.3 (**rebuild**) ([PR #741](https://github.com/EESSI/software-layer/pull/741))
  - requires CUDA + UCX-CUDA (+ ~~EasyBuild v4.9.4, see [PR #740](https://github.com/EESSI/software-layer/pull/740)~~)
- [Lara] UCC-CUDA ([PR #750](https://github.com/EESSI/software-layer/pull/750))
  - requires CUDA + UCX-CUDA + NCCL
- [Lara] OSU-Microbenchmarks v7.2 w/ CUDA 12.1.1 (**rebuild**) ([PR #716](https://github.com/EESSI/software-layer/pull/716))
  - requires CUDA + UCX-CUDA + UCC-CUDA
- [Thomas/Caspar] PyTorch v2.1.2 w/ CUDA 12.1.1 ([PR #586](https://github.com/EESSI/software-layer/pull/586) or [PR #718](https://github.com/EESSI/software-layer/pull/718))
  - requires UCX-CUDA + cuDNN
- [Thomas/Caspar] TensorFlow v2.15.1 w/ CUDA 12.1.1 ([PR #717](https://github.com/EESSI/software-layer/pull/717))
  - requires CUDA + NCCL + cuDNN
- [Kenneth/Lara] LAMMPS 2Aug2023 CUDA 12.1.1 ([PR #711](https://github.com/EESSI/software-layer/pull/711))
  - requires CUDA + UCX-CUDA + NCCL
- [Pedro] ESPResSo w/ CUDA ([PR #748](https://github.com/EESSI/software-layer/pull/748))
  - requires only CUDA
  - [easyconfig PR #21440](https://github.com/easybuilders/easybuild-easyconfigs/pull/21440)

## TODO

- `create_tarball.sh` might not include Lmod cache and config files correctly for GPU builds ([issue #722](https://github.com/EESSI/software-layer/issues/722))
  - we need to determine whether we'll need separate Lmod hooks
- [Bob] make Lmod cache aware of modules in `/accel/*`
  - separate cache for each `accel/*` subdirectory + extra `lmodrc.lua` that is added to `$LMOD_RC` (see the sketch below)
  - can start looking into this as soon as `CUDA-samples` is deployed
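  A hedged sketch of per-subdirectory cache generation (the paths are illustrative assumptions; `update_lmod_system_cache_files` ships with Lmod under `$LMOD_DIR`):

  ```bash
  # generate a separate Lmod spider cache for each accel/* module tree
  arch_dir="/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2"

  for accel in "${arch_dir}"/accel/nvidia/cc*; do
      cache_dir="${accel}/.lmod/cache"
      mkdir -p "${cache_dir}"
      "${LMOD_DIR}/update_lmod_system_cache_files" \
          -d "${cache_dir}" -t "${cache_dir}/timestamp" "${accel}/modules/all"
  done
  ```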
- [Bob] make CI that checks for missing modules aware of GPU installations
  - check for each CPU+GPU combo separately for now
- [Alan?] add NVIDIA GPU detection to `archdetect` + make sure init script & EESSI module use it
  - add `gpupath` function to `archdetect` script (a sketch follows below)
  - only exact match for `cc*` for now, determine via `nvidia-smi`, see also https://gitlab.com/eessi/support/-/issues/59#note_1909332878
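  A minimal bash sketch of such a `gpupath` helper (the function name is from the notes; its exact output format is an assumption):

  ```bash
  # report the accel/* subdirectory matching the NVIDIA GPU in this node,
  # based on the compute capability that nvidia-smi reports (e.g. '8.0')
  gpupath() {
      command -v nvidia-smi >/dev/null 2>&1 || return 1    # no NVIDIA driver
      local cap
      cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
      [ -z "${cap}" ] && return 1
      echo "accel/nvidia/cc${cap//./}"    # exact match only, for now
  }
  ```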
## Notes

- (Richard) libfabric + CUDA
- (Caspar) we should add technical information on how EESSI is built, incl. how/why we strip down CUDA installations
  - starting point could be (outdated) ["Behind the Scenes" talk](https://github.com/EESSI/docs/blob/main/talks/20210119_EESSI_behind_the_scenes/EESSI-behind-the-scenes-20210119.pdf)

## Next meeting

- Tue 1 Oct 2024 09:00 CEST for GPU tiger team
  - 10:00 CEST for dev.eessi.io

---

## Sync meeting 20240918

attending: Kenneth, Lara, Alan, Richard, Pedro, Caspar, Thomas

### Done

- pass down `accelerator` filter in `build` command for bot into `job.cfg` file ([merged bot PR #280](https://github.com/EESSI/eessi-bot-software-layer/pull/280))

  ```ini
  # from working directory for build job: $ cat cfg/job.cfg
  ...
  [architecture]
  software_subdir = x86_64/amd/zen2
  os_type = linux
  accelerator = nvidia/cc80
  ```

- mark `with_accelerator` setting in `[submitted_job_comments]` section as required ([merged bot PR #282](https://github.com/EESSI/eessi-bot-software-layer/pull/282))

### Open PRs

- take into account accelerator target when configuring EasyBuild ([software-layer PR #710](https://github.com/EESSI/software-layer/pull/710))
  - tested via [PR to boegel's fork with `beagle-lib`](https://github.com/boegel/software-layer/pull/33)
  - seems to be working, produces `2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all/beagle-lib/4.0.1-GCC-12.3.0-CUDA-12.1.1.lua`
- (CPU-only) GROMACS 2024.3 ([software-layer PR #709](https://github.com/EESSI/software-layer/pull/709))

### Next steps

- [Thomas,Kenneth] new bot release (v0.5.0?), so we can build GPU software with production bots @ AWS+Azure
- [Kenneth] update `bot/check-build.sh` to be aware of `/accel/` paths
  - just so reporting of files included in tarball to deploy looks nicer
  - => should be done in a separate PR
- [???] verify `accelerator` value so only allowed values (`cc80`, `cc90`) can be used
  - mainly to avoid silly mistakes that would be easy to overlook
  - hard mapping of `cc80` to `8.0` instead of hacky approach with `sed` in `configure_easybuild` to define `$EASYBUILD_CUDA_COMPUTE_CAPABILITIES`? (see the sketch below)
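  The hard mapping could be as simple as this (a sketch; where exactly it gets called from in the build scripts is an assumption):

  ```bash
  # validate the accelerator value from cfg/job.cfg and derive
  # $EASYBUILD_CUDA_COMPUTE_CAPABILITIES without resorting to sed
  accelerator="$1"    # e.g. 'nvidia/cc80'

  case "${accelerator}" in
      nvidia/cc80) export EASYBUILD_CUDA_COMPUTE_CAPABILITIES="8.0" ;;
      nvidia/cc90) export EASYBUILD_CUDA_COMPUTE_CAPABILITIES="9.0" ;;
      *)
          echo "ERROR: unsupported accelerator value '${accelerator}'" >&2
          exit 1
          ;;
  esac
  ```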
- [???] verify CPU + accelerator target combos, only select combos should be used:
  - separate small (Python) script to call from `bot/build.sh`?
  - valid combos for EuroHPC systems:
    - `x86_64/amd/zen2/accel/nvidia/cc80` (AMD Rome + A100 => Vega + Meluxina + Deucalion) => **MAIN TARGET**
      - should be possible to find as AWS instance
      - could use `arch:x86_64/amd/zen2+cc80`
      - this would also work for Hortense @ UGent, Snellius @ SURF, ...
    - `x86_64/amd/zen3/accel/nvidia/cc80` (AMD Milan + A100 => Karolina)
    - `x86_64/intel/icelake/accel/nvidia/cc90` (Intel Icelake + H100 => Leonardo)
      - or `/skylake_avx512/` since we don't have `icelake` CPU target?
    - `x86_64/intel/sapphirerapids/accel/nvidia/cc80` (Intel Sapphire Rapids + A100 => MareNostrum 5)
      - or `/skylake_avx512/` (for now) since no `sapphirerapids` CPU target (yet)?
    - `x86_64/amd/zen3/accel/amd/gfx90a` (AMD Milan + MI250X => LUMI)
    - `aarch64/nvidia/grace/accel/nvidia/cc90` (NVIDIA Grace + H100 => JUPITER)
      - => should be `/neoverse_v2` instead of `/grace/`
    - (no GPUs in Discoverer)
- [???] refuse to install CPU-only modules under `accel/*`
  - missing CPU-only deps must be installed through separate `software-layer` PR
- [Alan?] verify CPU(+GPU) target at start of build job, to catch misconfigurations
- *** [Kenneth/Thomas/Alan] build nodes with GPU (relevant for test suites)
  - needs enhancement in the bot so build jobs can be sent to GPU nodes
    - `bot build ... node:zen2-cc80`, which adds additional `sbatch` arguments
    - or via `bot build ... arch:zen2+cc80`
  - easy to add GPU partition to Magic Castle cluster
    - also needs node image with GPU drivers
    - Magic Castle auto-installs GPU drivers when it sees a GPU, we probably don't want that
- [Caspar] we need a way to pick a different ReFrame configuration file based on having a GPU or not
- [Alan?] let bot or GitHub Actions auto-report missing modules for PRs to `software-layer`
- build & deploy GPU software
  - [Caspar] CUDA-samples (moved from CPU-only install path)
  - [Caspar] OSU-Microbenchmarks (moved from CPU-only install path)
    - also for `{x86_64,aarch64}/generic/accel/nvidia/generic` as fallback?
  - [Caspar] UCX-CUDA (moved from CPU-only install path?)
  - [Caspar] CUDA itself (moved from CPU-only install path?)
  - [Caspar] NCCL (moved from CPU-only install path?)
  - [Pedro] ESPResSo
    - OK for `foss/2023a` (not for `foss/2023b`)
    - PR for easyconfigs repo + software-layer PR to test build for `accel:nvidia/cc80`
    - to dedicated easystack file: `eessi-2023.06-eb-4.9.3-2023a-CUDA.yml`
  - [Bob?] GROMACS
  - [Lara] LAMMPS
    - using existing `LAMMPS-2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1.eb`
    - TODO: software-layer PR + test `accel` build
  - [Caspar] TensorFlow
  - [Caspar/Thomas] PyTorch
  - only build for now, don't deploy yet; ideally we use GPU nodes so GPU tests are also run
- to test the GPU software installations, we also need the `host_injections` stuff in place
  - see https://www.eessi.io/docs/site_specific_config/gpu
- [???] figure out if `.lmod/XYZ` files should be shipped in accel prefix, see [issue #722](https://github.com/EESSI/software-layer/issues/722)

### Notes

- is OpenMPI in EESSI CUDA-aware?
  - yes, via UCX-CUDA

### Next sync meeting

- Wed 25 Sept 11:00 CEST (Kenneth, Pedro, Alan, Thomas, Richard?)
  - clashes with [AI-Friendly EuroHPC Systems EuroHPC session](https://eurohpc-ju.europa.eu/news-events/events/ai-friendly-eurohpc-systems-virtual-2024-09-25_en) (09:30-12:45 CEST)
  - can we reschedule to 13:00 CEST (GPU tiger team) + 14:00 CEST (dev.eessi.io tiger team)?
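  That mapping, spelled out as shell commands (a sketch; where exactly this happens in the software-layer scripts is still to be decided):

  ```bash
  # keep CPU modules visible, then redirect installations to the accel subdir
  module use "${EASYBUILD_INSTALLPATH}/modules/all"
  export EASYBUILD_INSTALLPATH="${EASYBUILD_INSTALLPATH}/accel/nvidia/cc80"
  export EASYBUILD_CUDA_COMPUTE_CAPABILITIES="8.0"
  ```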
    - OK
  - (17:00 CEST is bi-weekly EasyBuild conf call, KH needs time to prepare it 15:00-17:00)

---

## Kickoff meeting 20240913

attending: Alan, Thomas, Richard, Pedro, Bob, Kenneth

### Goal

- more software installations in EESSI that support NVIDIA GPUs
- `accel/*` subdirectories (see [support issue #59](https://gitlab.com/eessi/support/-/issues/59#note_1924348179))

### Plan

#### software-layer scripts

- enhance scripts to support building for specified accelerator (see `job.cfg` script)
  - similar to how `$EESSI_SOFTWARE_SUBDIR_OVERRIDE` is defined in `bot/build.sh`
  - check whether `accelerator` value is "valid" + whether the combo of CPU target + accelerator is valid
- have to modify EasyBuild hooks to either:
  - "select" correct software installation prefix for each installation
    - maybe opens the door to making mistakes easily, having stuff installed in the wrong place
  - refuse installing non-CUDA stuff in `/accel/` path
    - probably less error-prone, so let's do this
- map `accel=nvidia/cc80` to:
  - "`module use $EASYBUILD_INSTALLPATH/modules/all`" to make CPU modules available
  - `$EASYBUILD_INSTALLPATH=${EASYBUILD_INSTALLPATH}/accel/nvidia/cc80`
  - `$EASYBUILD_CUDA_COMPUTE_CAPABILITIES=8.0`

#### archdetect

- auto-detect NVIDIA GPU + compute capability
  - via `nvidia-smi --query-gpu=compute_cap`
  - to let init script also do `module use .../accel/nvidia/cc80`

#### bot

- see [open PR #280](https://github.com/EESSI/eessi-bot-software-layer/pull/280) to add `accelerator=xxx` to `job.cfg` file
- new bot release after merging PR #280

#### software

- x CUDA-samples (reinstall)
- NCCL (reinstall + add license to installation)
- UCC-CUDA (reinstall)
- x UCX-CUDA (reinstall)
- x OSU Microbenchmarks (reinstall)
- ESPResSo
  - no easyconfigs yet for ESPResSo with GPU support => task
- GROMACS
- LAMMPS
- PyTorch
- TensorFlow

#### CPU/GPU combo targets

- AMD Rome + A100 (Vega) => `amd/zen2/accel/nvidia/cc80`
- AMD Milan + A100 (Karolina) => `amd/zen3/accel/nvidia/cc80`
- Deucalion: Rome? + NVIDIA => `???`
- Meluxina: ???
- JUPITER: NVIDIA Grace + H100 => `aarch64/grace/accel/nvidia/cc??`

#### Hardware for build nodes

- manual test build on target hardware via `EESSI-extend`
  - or by setting up a bot in your own account and making test PRs to your fork of the `software-layer` repo
- which AWS instance for AMD Rome + A100?

#### Ideas/future work

- script that reports contents of generated tarball may need some work (Bob?)
- also build for `accel/nvidia/generic` (or very low compute capability like `/cc60`)

### Tasks (by next meeting)

- [Pedro] easyconfig for ESPResSo with CUDA support
- [Kenneth] re-review + merge of [bot PR #280](https://github.com/EESSI/eessi-bot-software-layer/pull/280)
- [Kenneth] update software-layer scripts
- [Alan] let archdetect detect GPU compute capability + produce list of all valid combos in sensible order
- [Alan] enhance init script/module file to implement auto-detect

### Next meetings

- Wed 18 Sept 11:00 CEST (Kenneth, Pedro, Alan, Richard?)
- Wed 25 Sept 11:00 CEST (Kenneth, Pedro, Alan, Thomas, Richard?)
