# MultiXscale WP1+WP5 sync meetings

- Monthly, every 2nd Tuesday of the month at 10:00 CE(S)T
- Notes of previous meetings at https://github.com/multixscale/meetings/wiki

---------------------------

## Next meetings

- Tue 12 March 2024
- Tue 9 April 2024

---------------------------

## Agenda/notes 2024-02-13

attending:

- Caspar van Leeuwen, Satish Kamath, Maxim Masterov (SURF)
- Alan O'Cais (CECAM, UB)
- Kenneth Hoste, Lara Peeters (HPC-UGent)
- Thomas Röblitz (UiB)
- Julián Morillo (BSC)
- Danilo Gonzalez, Helena Vela Beltran, Susana Hernandez, Nadia Martins, Elisabeth Ortega (HPCNow!)
- Pedro Santos Neves, Bob Dröge (UGroningen)
- Neja Šamec (NIC)

project planning overview: https://github.com/orgs/multixscale/projects/1

### EU 1st period report & review (19-02-2024)

* M12 deliverables have all been submitted
  * available via https://www.multixscale.eu/deliverables
* Technical Report B
  * Finalized?
  * The financial part still needs to be finalized
* Preparing presentations for the review
  * "Scientific software development of MultiXscale CoE" (WP2+WP3+WP4)
  * "Technical aspects of MultiXscale CoE" (WP1+WP5)
  * "Training" (WP6)
  * "Dissemination" (WP7)
  * "Impact" (WP8)
  * Practice presentations this Friday

### WP1

#### WP1 Goals for 2024

- All MultiXscale-related software in EESSI
- Stable NVIDIA GPU support
- Monitoring for CVMFS repositories in place
  - E.g. e-mails to relevant people when something goes down
  - Should include _performance_ of Stratum 1s
  - Could use the CASTIEL2 CI infra for this: scrape JSON from the monitoring page & run performance tests (see the sketch at the end of this WP1 section)
  - Prometheus on Stratum 1?
- Clear maintenance policy for EESSI infrastructure
  - What is done, when, and by whom?
- Use of host MPI tested and documented
  - Functionality is brittle, e.g. issues when hwloc versions do not match
  - May be a reason to add an MPICH-based/compatible toolchain?
  - LUMI could be a relevant use case to try
- Performance evaluation vs local software stacks (?)
  - Publish in docs
  - OS jitter aspect of CernVM-FS
  - If we see poor EESSI performance, that is also valuable
    - E.g. do you recover performance if you inject host MPI?
- Initial steps for AMD GPU support
  - What is needed? Do we need similar steps as for NVIDIA, or different ones?
- Initial steps for RISC-V support
  - Can we get a bot on some RISC-V hardware? Can we build a toolchain? Does CVMFS work? Can we build a compat layer?
  - Technical development to improve/expand the RISC-V software ecosystem (compilers, MPI, low-level libraries)
    - Focused on the LLVM compiler to improve vectorization support
    - OpenMPI is working on RISC-V support
  - Julián is looking into answering these questions, using HiFive Unmatched hardware @ BSC
- D1.3 "Report on stable, shared software stack" (due M24, end of 2024)
  - "Final report on the development of a stable, shared software stack"
  - should cover: included software, where EESSI is available, monitoring (!), ...

#### Issues

* installing MultiXscale apps in EESSI ([planning issue #3](https://github.com/multixscale/planning/issues/3))
  * target milestone moved to M15
  * only 2-3 out of 5 installed in `software.eessi.io`
  * OpenFOAM v10 failing, v11 installed
    * Didn't v11 have the same issue @Bob?
    * problem with an OpenFOAM example case, but the underlying cause is a bug in OpenMPI (PR for a workaround is in place)
  * OpenFOAM v10 + LAMMPS (only) installed in EESSI pilot
  * waLBerla is WIP, hanging sanity check (MPI issue)
    * @Pedro: can we do header-only for now? Then plan a meeting with Leonardo to hear how they use it / whether this is useful at all?
* [Expanding on GPU support](https://github.com/multixscale/planning/issues/136)
  * See also [this software-layer issue](https://github.com/EESSI/software-layer/issues/375#issuecomment-1864599433)
  * Do we always compile natively?
    * What about cross-compiling on a non-GPU node?
  * How do we deal with CPU + GPU architecture combinations we don't have access to?
  * We can at least build OSU with GPU support
  * We can also _try_ GROMACS with GPU support and see where we get
  * Low-level GPU tests [issue](https://github.com/multixscale/planning/issues/111)
    * OSU
    * CUDA samples
    * GPU burn?
* need for a separate EESSI development repository (`dev.eessi.io`)
  * on top of the `software.eessi.io` repository
  * => should have a dedicated planning issue on this

**Deliverables this year: 1.3, "Report on stable, shared software stack" (UB)**
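As a concrete starting point for the monitoring goal above ("scrape JSON from the monitoring page & run performance tests"), here is a minimal sketch of such a check. It assumes a hypothetical monitoring endpoint that exposes a JSON list of Stratum 1 replicas with the revision each one serves next to the Stratum 0 revision; the URL, JSON schema, SMTP host and recipients below are all placeholders, not decisions.

```python
import smtplib
from email.message import EmailMessage

import requests

# Placeholders: the real monitoring page, its JSON schema, and who gets
# alerted are still to be decided (see the WP1 monitoring goal above).
MONITOR_URL = "https://monitoring.example.org/eessi/status.json"
RECIPIENTS = ["eessi-ops@example.org"]
MAX_REVISION_LAG = 3  # alert when a Stratum 1 lags this many revisions behind Stratum 0


def find_problems():
    """Scrape the (hypothetical) monitoring JSON and return a list of problems."""
    try:
        status = requests.get(MONITOR_URL, timeout=30).json()
    except requests.RequestException as exc:
        return [f"monitoring page unreachable: {exc}"]

    problems = []
    stratum0_revision = status.get("stratum0", {}).get("revision", 0)
    for replica in status.get("stratum1", []):
        name = replica.get("name", "unknown")
        if not replica.get("reachable", False):
            problems.append(f"{name}: not reachable")
        elif stratum0_revision - replica.get("revision", 0) > MAX_REVISION_LAG:
            problems.append(f"{name}: lagging behind Stratum 0 (revision {replica.get('revision')})")
    return problems


def send_alert(problems):
    """E-mail the problems to the relevant people (placeholder SMTP setup)."""
    msg = EmailMessage()
    msg["Subject"] = "EESSI CVMFS monitoring: problems detected"
    msg["From"] = "monitoring@example.org"
    msg["To"] = ", ".join(RECIPIENTS)
    msg.set_content("\n".join(problems))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    issues = find_problems()
    if issues:
        send_alert(issues)
```

Whether such a pull-based check runs in the CASTIEL2 CI infra, or a Prometheus exporter is installed on the Stratum 1s themselves, is still an open question.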
### WP5

#### WP5 Goals for 2024

- Tests implemented for all MultiXscale key software
- Expanding overall application test coverage
- Separate (multi-node) test step in the build-and-deploy workflow
  - Could use `dev.eessi.io`, before deploying to `software.eessi.io`
- Decide how we deal with security updates in the compat layer
  - We could provide a new compat layer. Decide on what the default is: existing, or new?
- Dashboard showing test results from regular runs on (at least) Vega and Karolina
- Connect better with the scientific WPs
  - POC for one of the key applications using the GitHub Action in CI (?)
  - POC for getting devs of key applications to update their software in EESSI (?)
- Improvements to the bot (as needed)
- Handling issues from regular support

#### Issues

- Start setting up dashboard [#10](https://github.com/multixscale/planning/issues/10) (see the data-collection sketch at the end of this WP5 section)
  - Set up a meeting with the SURF visualization team (and regular meetings for follow-up?)
- Set up infra for periodic testing [#36](https://github.com/multixscale/planning/issues/36)
  - Waiting for CASTIEL2 to facilitate this
    - may be heading towards EESSI for CD + Jacamar CI?
    - may not work on all sites (like BSC, which is reluctant to allow outgoing connections)
- Integrate testing in the build-and-deploy workflow [#133](https://github.com/multixscale/planning/issues/133)
  - Implemented in [software-layer PR #467](https://github.com/EESSI/software-layer/pull/467)
  - Needs some feedback from @Thomas on @Kenneth's review
- next steps for bot implementation
  - GitHub App (v0.3): improve ingestion workflow ([planning issue #99](https://github.com/multixscale/planning/issues/99))
    - cfr. effort in NESSI to bundle tarballs before deploying
    - support for additional deployment actions like 'overwrite', 'remove'
  - make the bot fully agnostic to EESSI
    - no hardcoding of the EESSI software-layer in the deploy step
  - separate test step (in between build + deploy)
  - also support GitLab repos (would be interesting in scope of EuroHPC, CASTIEL2, CI/CD)
  - additional effort on the bot fits under Task 5.4 (providing support to EESSI)
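As a first step towards the dashboard mentioned in the WP5 goals and issues above, results from the regular test runs on Vega and Karolina could be aggregated into per-system pass/fail counts. The sketch below is only illustrative: it assumes a hypothetical layout with one JSON report per run (e.g. `reports/vega/2024-02-12.json`), each holding a list of test cases with `name` and `result` fields; the actual report format and storage location still have to be decided.

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical layout: reports/<system>/<date>.json, each file holding a list
# of test cases like {"name": "GROMACS_check", "result": "pass"}.
REPORTS_DIR = Path("reports")


def summarise_reports(reports_dir: Path) -> dict:
    """Aggregate pass/fail counts per system, as input for a dashboard."""
    summary = {}
    for report_file in sorted(reports_dir.glob("*/*.json")):
        system = report_file.parent.name  # e.g. "vega" or "karolina"
        testcases = json.loads(report_file.read_text())
        counts = summary.setdefault(system, Counter())
        for case in testcases:
            counts[case.get("result", "unknown")] += 1
    return summary


if __name__ == "__main__":
    for system, counts in sorted(summarise_reports(REPORTS_DIR).items()):
        print(f"{system}: {counts.get('pass', 0)} passed, {counts.get('fail', 0)} failed")
```

How the resulting summary is visualized can then be decided together with the SURF visualization team.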
### WP6

- None of the three ISC tutorial submissions were accepted, despite positive reviews
  - EESSI tutorial: two reviews 7/7 and 6/7, one review 3/7 (motivation: overlaps with CVMFS)
- EuroHPC Summit:
  - poster on MultiXscale (Eli)
  - Vega, Karolina, maybe MeluXina have half-day demo-labs (https://www.eurohpcsummit.eu/demo-lab); trying to make sure EESSI is part of that
  - maybe also an informal "BoF" session during the lunch break (Barbara @ Vega informed us this may be possible)
  - speaker on user support (Kenneth has been put forward)
- **Deliverables this year: 6.2, "Training Activity Technical Support Infrastructure" (UiB)** (due M24)
  - Magic Castle, EESSI, Lhumos training portal, ...
  - cfr. Alan's tutorial last week, incl. Snakemake

### WP7

- Starting Task 7.1: Scientific applications provisioned on demand (ends in M48)
  - Use Terraform as a cloud-agnostic solution to generate instances on supported cloud providers
- Starting Task 7.4: Industry-oriented training activities (ends in M48)
  - Two training sessions targeted at the HPC industry
- Task 7.3 starts in Q3: Sustainability (ends in M42)
  - Business plan
  - Certification program
    - "Design of certification program for external partners that would be providing official support"

**Deliverables this year: 7.2, "Intermediate report on Dissemination, Communication and Exploitation" (HPCNow!) [M24]**

### WP8

- First periodic report [#130](https://github.com/multixscale/planning/issues/130)
- **Deliverables this year: 8.5, "Project Data Management Plan - final" (NIC)**

### CASTIEL2

- ongoing discussion on financial liability for COLA
- Alan and Kenneth are involved in meetings with the EuroHPC hosting sites via CASTIEL2
  - these meetings are about figuring out CI/CD; leaning towards EESSI for CD
  - next meeting to be planned (probably before the EuroHPC Summit), trying to get the CernVM-FS dev team involved
  - Alan & Kenneth will present an overview of the required effort/infrastructure, and a workaround if a non-native CernVM-FS installation is required
  - SURF has ISO certification for Snellius and has CernVM-FS deployed!
    - see https://www.surf.nl/files/2022-11/iso-certificaat-surf-2022.pdf (in Dutch) and "SURF services under ISO 27001 certification" at https://www.surf.nl/en/information-security-surf-services

---------------------------

## Notes of previous meetings

- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-01-09
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-12-12
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-11-14
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-10-10
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-09-12
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-08-08
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-07-11
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-06-13
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-05-09
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-04-11
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-03-14
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-02-14
- https://github.com/multixscale/meetings/wiki/sync-meeting-2023-01-10

----------------------------

## Template for sync meeting notes

TO COPY-PASTE

- overview of MultiXscale planning
  - https://github.com/orgs/multixscale/projects/1/views/1
- WP status updates
  - [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
    - [UGent] T1.1 Stable (EESSI) - due M12+M24
      - ...
    - [RUG] T1.2 Extending support (starts M9, due M30)
    - [SURF] T1.3 Test suite - due M12+M24
      - ...
    - [BSC] T1.4 RISC-V (starts M13)
    - [SURF] T1.5 Consolidation (starts M25)
  - [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
    - [UGent] T5.1 Support portal - due M12
      - ...
    - [SURF] T5.2 Monitoring/testing (starts M9)
    - [UiB] T5.3 community contributions (bot) - due M12
      - ...
    - [UGent] T5.4 support/maintenance (starts M13)
  - [UB] WP6 Community outreach, education, and training
    - ...
  - [HPCNow] WP7 Dissemination, Exploitation & Communication
    - ...