# EESSI software layer sync meeting

## planning

- next meeting
  - Tue 10 Oct at 09:00 CEST

## Meeting (2023-10-10)

- attending: Kenneth, Lara, Alan, Bob, Pedro, Richard, Maxim, Caspar, Julián
- status update eessi.io
  - multiple Stratum-1's set up
    - 1 with S3 backend, other with regular storage backend
  - old setup is using eessi-infra.org (owned by Terje)
  - meeting planned for Fri 13 Oct to discuss this
    - what does the ComputeCanada setup look like
      - w.r.t. CDN
  - eessi.io domain is registered with CloudFlare (by Alan)
    - we also have eessi.science
- bot
  - v0.1.0 of bot has been tested in different context (not a setup like EESSI)
    - to automate installation of software in central software stack on HPC-UGent systems
    - only build phase (no deploy, direct installations into `/apps` NFS dir)
    - only small bug fix was needed (see [PR #220](https://github.com/EESSI/eessi-bot-software-layer/pull/220))
    - goal is to try and use bot to better automate and track software installation for new HPC-UGent cluster
  - next big step (v0.2.0): add testing step in between build & deploy
    - leverage EESSI test suite, run tests for software installations produced during build phase (ideally in different OS container, on different Slurm cluster)
- software-layer
  - new Slurm clusters are being set up (in AWS + Azure)
    - using [Magic Castle](https://github.com/ComputeCanada/magic_castle) instead of CitC
    - **bot for EESSI pilot 2023.06 will be migrated to new Slurm cluster soon!**
    - should set up a DNS entry like mc-aws.eessi.io?
  - CI for testing if EESSI stack is available is only checking single architecture ([issue #349](https://github.com/EESSI/software-layer/issues/349))
  - merged PRs
    - determine easystack files to process via PR patch file (PRs [#351](https://github.com/EESSI/software-layer/pull/351) + [#354](https://github.com/EESSI/software-layer/pull/354))
      - helps to avoid hitting GitHub rate limit
    - remove bot configuration (2023.06) (PR #356)
      - moved to private repo EESSI/bot-configs
    - show active EasyBuild configuration when checking for missing installations (PR #358)
      - to help debug [issue #349](https://github.com/EESSI/software-layer/issues/349)
    - simplify CI workflow to check for missing installations, just loop over easystack files per CPU target (PR #359)
    - BAMM (PR #350)
    - SciPy-bundle 2022.05 w/ foss/2022a (PR #352)
    - WRF v4.4.1 w/ foss/2022b (PR #336)
  - open PRs
    - TensorFlow
      - stuck for different reasons (testing fails on aarch64/neoverse_v1 for older versions, build )
      - may need to mark these installations as missing for aarch64
        - via *missing*.yaml
          - should have pointer to issue that has more info on failing build
          - and also mention which alternative modules are available
        - can add "fake" property to generated module for these missing installations
    - matplotlib
      - stuck due to installation problem with Pillow
        - setup.py looks for zlib.h, doesn't consider compat layer
        - do we add path to /include in compat layer to `$CPATH`?
          - Pillow's setup.py does consider `$CPATH` and `$LIBRARY_PATH`
        - Richard is looking into this, Kenneth can help
    - [MXS] ESPResSo
      - we need to use `-DPython3_EXECUTABLE` to control which Python is being used, cfr.
        [fix for VTK](https://github.com/easybuilders/easybuild-easyconfigs/pull/16741)
      - Maxim can't access Slurm cluster yet to access build logs
    - [MXS] LAMMPS
      - still building (first attempt)
    - [MXS] waLBerla
      - [easyconfig PR #18932](https://github.com/easybuilders/easybuild-easyconfigs/pull/18932) is open
      - currently only header files are being installed
      - Maxim is helping out Xin with this
    - RStudio-Server
      - should be synced with `2023.06` since required R dependency is now in place in EESSI 2023.06
      - ComputeCanada patches RStudio a bit to make it work in user space
- other
  - Pedro is interested in helping out with the bot
    - looking for easy tasks to get started with
    - planning to look into [issue #212](https://github.com/EESSI/eessi-bot-software-layer/issues/212)
- GPU support
  - really need to get started with this...
  - we should set up a meeting on this, get a plan worked out, and get it done (definitely involving Alan, Kenneth, ...)
  - important for MultiXscale milestones due end of 2023!
  - steps
    - add CUDA compat libs to EESSI (in compat layer?)
    - make sure linker can find system libs
    - check via Lmod hook to make sure that libcuda.so is available, or whether script needs to be run to get required symlinks in place
    - to build CUDA software in build container, we need a full CUDA install in a local dir
    - we're only allowed to ship runtime stuff of CUDA in EESSI (via hook available in ...)
      - includes broken links to local CUDA install

---

## Meeting (2023-10-03)

- attending: Kenneth, Lara, Bob, Julián, Pedro, Thomas, Richard, Caspar, Alan
- status update eessi.io
  - Stratum-0 is set up at RUG
  - single Stratum-1 running in AWS (using S3 backend)
    - test setup
    - required lots of manual work (create VM + S3 bucket) because Atlantis wasn't working
    - Ansible playbooks sort of worked, but do not support S3 buckets yet
      - see [WIP filesystem-layer PR #160](https://github.com/EESSI/filesystem-layer/pull/160)
    - GeoAPI doesn't work well with S3 buckets, clients go straight to S3 bucket
  - need to figure out:
    - how many Stratum-1's do we want (initially)?
      - currently we have 4 for eessi-hpc.org (AWS, Azure, RUG in NL, BGO in Norway)
    - how to deal with S3 buckets vs GeoAPI
    - who should have admin access?
    - DNS
    - using CDN (CloudFlare)?
  - sync meeting being planned
- bot
  - merged PRs:
    - add `shared_fs_path` configuration setting ([PR #214](https://github.com/EESSI/eessi-bot-software-layer/pull/214))
    - README updated ([PR #215](https://github.com/EESSI/eessi-bot-software-layer/pull/215))
  - v0.1.0 released: https://github.com/EESSI/eessi-bot-software-layer/releases/tag/v0.1.0
  - `develop` branch
    - for active development (PRs)
  - `main` branch always corresponds to latest release
  - open PRs:
    - script to clean up tarballs of jobs given a PR number ([PR #217](https://github.com/EESSI/eessi-bot-software-layer/pull/217))
      - can let bot use this when a PR is merged/closed
      - only cleans up large "checkpoint" tarballs for now, should eventually clean up *everything* related to a PR?
  - next steps
    - test step in between build & deploy
    - make deploy step agnostic of EESSI
  - new Slurm clusters for bot
    - new Slurm clusters are being set up with [Magic Castle](https://github.com/ComputeCanada/magic_castle)
    - in AWS: set up, need to test bot there
      - will (very) soon replace current CitC cluster...
      - next steps
        - create more accounts
        - increase disk space to couple of TBs (no EFS used there)
    - in Azure: to set up, need to figure out account/API stuff
- software-layer
  - merged PRs
    - foss/2023a ([PR #334](https://github.com/EESSI/software-layer/pull/334))
      - ignore flaky failing FFTW.MPI tests (see [issue #325](https://github.com/EESSI/software-layer/issues/325))
      - use patch to fix detection of Neoverse V1 in OpenBLAS (cfr. [easyconfigs PR #18870](https://github.com/easybuilders/easybuild-easyconfigs/pull/18870))
    - foss/2022a ([PR #310](https://github.com/EESSI/software-layer/pull/310))
    - R v4.1.0 w/ foss/2021a ([PR #328](https://github.com/EESSI/software-layer/pull/328))
    - add YAML file to keep track of known issues in EESSI pilot 2023.06 ([PR #340](https://github.com/EESSI/software-layer/pull/340))
    - only increase limit for numerical test failures for OpenBLAS for aarch64/neoverse_v1 ([merged PR #345](https://github.com/EESSI/software-layer/pull/345))
  - open PRs
    - TensorFlow
      - TensorFlow v2.7.1 with `foss/2021b` ([PR #321](https://github.com/EESSI/software-layer/pull/321))
        - several test failures in `aarch64/*` targets
        - may be fixable by backporting a couple of patches, but maybe not worth the trouble?
      - TensorFlow v2.8.4 with `foss/2021b` ([PR #343](https://github.com/EESSI/software-layer/pull/343))
        - assembler errors on `aarch64/*` when building XNNPACK
          - due to use of `-mcpu=native` which clashes with custom `-march=...` options used by XNNPACK build procedure
          - see also [easyconfigs issue #18899](https://github.com/easybuilders/easybuild-easyconfigs/issues/18899)
          - should be fixed by making sure that `-mcpu=...` is not used when building XNNPACK, see [easyblocks PR #3011](https://github.com/easybuilders/easybuild-easyblocks/pull/3011)
      - TensorFlow v2.11.0 with `foss/2022a` ([PR #346](https://github.com/EESSI/software-layer/pull/346))
        - assembler errors on `aarch64/*` when building XNNPACK, fixed with [easyblocks PR #3011](https://github.com/easybuilders/easybuild-easyblocks/pull/3011)
      - TensorFlow v2.13.0 with `foss/2022b` ([PR #347](https://github.com/EESSI/software-layer/pull/347))
        - 928 failing `scipy` tests on `aarch64/neoverse_v1`...
        - build error on `x86_64/intel/haswell` because `/usr/include/stdio.h` is picked up
          - need to set `$TF_SYSROOT`?
    - matplotlib v3.4.3 with `foss/2021b` ([PR #339](https://github.com/EESSI/software-layer/pull/339))
      - open PR for Pillow in EasyBuild: [PR #18881](https://github.com/easybuilders/easybuild-easyconfigs/pull/18881)
    - ESPResSo
      - with `foss/2021a` ([PR #332](https://github.com/EESSI/software-layer/pull/332))
      - with `foss/2022a` ([PR #331](https://github.com/EESSI/software-layer/pull/331))
      - wrong Python installation is picked up
    - WRF ([PR #336](https://github.com/EESSI/software-layer/pull/336))
      - failing netCDF tests due to RPATH issue
        - seems to be caused by `-DCMAKE_SKIP_RPATH=ON` that was added in https://github.com/easybuilders/easybuild-easyblocks/pull/1031 (Nov 2016)
          - maybe needed due to a bug in old CMake versions
  - notes
    - should add "missing" YAML file (like for old TensorFlow versions on `aarch64/*`)
  - next packages
    - OpenFOAM
    - newer R
    - Bioconductor
    - AlphaFold (GPU)
- GPU
  - we should set up a meeting to figure out the right steps...
  - plan is to look into supporting GPUs in `software.eessi.io` CVMFS repo
  - is ldconfig OK with non-existing paths (to system paths)? also, order matters
    - Apptainer also uses ldconfig to figure out paths to required libraries
    - CUDA compat libs (could be avoided, only needed as a fallback)
    - last location: Apptainer libs
  - first step should be to get it working on assumption that GPU driver is sufficiently recent
  - Alan will look into planning a sync meeting on GPU support

# Previous meetings

* https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-Software-layer-(2023-10-03)
* https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-Software-layer-(2023-09-26)
* https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-Software-layer-(2023-09-20)
* https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-Software-layer-(2023-09-12)
* https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-Software-layer-(2023-09-05)