# MultiXscale WP1+WP5 sync meetings

- Monthly, every 2nd Tuesday of the month at 10:00 CE(S)T
- Notes of previous meetings at https://github.com/multixscale/meetings/wiki

---------------------------

## Next meetings

- Tue 11 June 2024 10:00 CEST
- Tue 9 July 2024 10:00 CEST
    - planning to attend: Caspar, Bob
    - on summer break: Kenneth, Lara, Thomas
- Tue 13 Aug 2024 10:00 CEST
    - planning to attend: Kenneth, Lara, Caspar, Thomas
    - on summer break: Bob

---------------------------

## Agenda/notes 2024-06-11

attending:

* Neja (NIC)
* Alan (UB)
* Kenneth, Lara (UGent)
* Caspar, Casper, Maxim (SURF)
* Thomas & Richard (UiB)
* Bob, Pedro (RUG)
* Jean-Noël (UStutt)
* Julián (BSC)
* Eli, Susana, Nadia (HPCNow!)

- problems with shared drive
    - cfr. incomplete progress reports 2024Q1 for WP1 (see drafts in drive uploaded by Satish) + WP5 (see Lara's email 6 May)
    - works for Alan when logging in through incognito browser + logging in with personal Microsoft account
    - who else has this problem?
        - Kenneth, Susana, Jean-Noël, Rudolph, Pedro
    - we can try to take a copy and create a new OneDrive
- 2024Q2 quarterly report
    - try to get info in place end of June/early July
    - Lara & Caspar will be mostly available in July
    - if problems with OneDrive persist, send PMs info + bullet points with tasks to Caspar/Lara via email/Slack/HackMD
- Milestone 3 (M18 - June 2024, lead: UStuttgart)
    - Milestone name: "First portable test run on two systems with different architectures (e.g. with and without accelerators)"
    - Means of validation: "Performance and scalability plots available for the application on the two architectures"
    - working on this using ESPResSo, as extra test in EESSI test suite
        - see https://github.com/EESSI/test-suite/pull/144, using FFT test case
        - working with JSC to make FFT communication 8-16x faster
        - scalability for LJ test case should improve this year
            - should also be added to ESPResSo test
- WP status updates
    - [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
        - [UGent] T1.1 Stable (EESSI) - **D1.3 due M24 (Dec'24)**
            - more software, incl. Espresso 4.2.2
            - `dev.eessi.io` => see [notes](https://github.com/EESSI/meetings/wiki/dev.eessi.io-discussion-(2024-05-24)) + [support issue #61](https://gitlab.com/eessi/support/-/issues/61)
                - would be very interesting service for developers in scientific WPs => cross-cutting across technical & scientific PRs
                - "looser" policy compared to `software.eessi.io` production repo
                    - Devs can trigger their own builds
                    - pre-release builds accepted (specific commits)
                - initially focused on Espresso & co
                - could also be used as "dev" environment for `software.eessi.io` features (e.g. GPU support)
                - if we're doing this on Azure, we should do it in a new subscription
                    - needs to be created by Martin @ SURF
                - if done in AWS, Alan can do it
            - GPU software => see [notes](https://github.com/EESSI/meetings/wiki/meeting-GPU-support-(2024-05-27)) + [support issue #59](https://gitlab.com/eessi/support/-/issues/59)
                - Update bot to have GPU support [Thomas]
                - Update archdetect to support CUDA compute capability [???]
                - directory structure in `software.eessi.io`, for example `software/x86_64/amd/zen2/accel/nvidia/cc80` [???]
                - blocked by `dev.eessi.io`?
                    - we want to use this as a playground for GPU builds
                - => can look into this during hackathon on Tue 18 June
                    - **needs to be planned**
            - need to review description of Task 1.1, make sure all subtasks are covered
                - => need to update [project planning](https://github.com/orgs/multixscale/projects/1) (Caspar, Kenneth)
                - "we will benchmark software from the shared software stack and compare the performance against on-premise software stacks to identify potential performance limitations, ..."
                    - Espresso + LAMMPS + OpenFOAM + ALL(?) (MultiXscale), GROMACS (BioExcel)
                - "increase stability of the shared software stack ... pro-actively by developing monitoring tools"
                    - proper monitoring for CVMFS network (S0 + S1s)
                    - for RUG?
        - [RUG] T1.2 Extending support - D1.4 due M30 (June'25)
            - Arm support fits here
            - zen4 + sapphirerapids
            - AMD ROCm
                - lower impact, should we limit our efforts here?
                - select apps, like PyTorch/TensorFlow
            - should also look into Grace Hopper (JUPITER)
        - [SURF] T1.3 Test suite - D1.5 due M30 (June'25)
            - Milestone 3 for Espresso test
        - [BSC] T1.4 RISC-V (starts M13)
            - cfr. efforts by Bob & Julián, incl. `riscv.eessi.io`
            - actively looking into adding more software, incl. Extrae
            - lot of interest from EUPILOT project @ BSC
        - [SURF] T1.5 Consolidation (starts M25 - Jan'25)
            - (not started yet)
    - [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
        - _(FINISHED M12 [UGent] T5.1 Support portal)_
        - [SURF] T5.2 Monitoring/testing, D5.3 due M30 (June'25)
            - discussions with SURF + initial work done on dashboard
            - working on two dashboards: one detailed, one with overview
        - _(FINISHED M12 [UiB] T5.3 community contributions (bot))_
        - [UGent] T5.4 support/maintenance - D5.4 due M48 (Dec'26)
            - support portal + rotation working well
            - support issues in April+May
                - Opened: 12 issues
                - Closed: 10 issues
                - total: 69 issues (26 open, 43 closed)
            - bot release
    - [UB] WP6 Community outreach, education, and training
        - [Kenneth, Lara, Pedro] EasyBuild User Meeting (EUM'24), 23-25 April 2024 @ Umeå, Sweden
        - [Kenneth, Lara, Eli] activity at ISC'24, see https://eessi.io/docs/blog/2024/05/17/isc24
        - [Eli] Teratec (29-30 May'24)
            - poster
            - demo for Sanofi, were quite interested
        - [Thomas] presentation @ Norwegian Bioinformatic Days on making bioinformatics workflows easy (using Nextflow)
            - they use a lot of containers, but can also use different backends
            - backend for EESSI could be interesting
            - similar work was done in BioHackathon Europe (https://biohackathon-europe.org)
        - [Lara] EESSI promotion @ DH Benelux in Leuven (Belgium), 4-7 June'24
            - some people were interested, like getting students easy access to software installations
        - [Matej] presenting poster at ASHPC this week
        - [Alan] invited speaker for Nordic Industry Days (early Sept'24)
        - submit BoF proposal on EESSI for SC24 (Atlanta, US)
            - HPCNow! will be attending
            - tutorial submission done
        - CernVM-FS workshop (Sept'24, Geneva)
            - submission due this month
            - EESSI is in default CernVM-FS configuration
            - could cover work on `dev.eessi.io`
        - deliverable due: D6.2 (M24 - Dec'24), D6.3 (M30 - June'25)
    - [HPCNow] WP7 Dissemination, Exploitation & Communication
        - T7.1 Scientific applications provisioned on demand (lead: HPCNow)
            - ...
        - Task 7.2 - Dissemination and communication activities (lead: NIC)
            - more EESSI stickers
                - via HPCNow?
                - Neja will ask at NIC
            - new section in MultiXscale website: https://www.multixscale.eu/dissemination
            - interview with Matej being worked on by Susana
                - will try to include it in newsletter of July
        - Task 7.3 - Sustainability (lead: NIC, started M18)
            - Legal entity for EESSI needs to be looked into?
                - subcontracting money available for this
                - we should explore options ourselves a bit first
        - Task 7.4 - Industry-oriented training activities (lead: HPCNow)
            - ...
    - [NIC] WP8 (Management and Coordination)
        - reply to review report (see Word doc in shared drive, `1st periodic report | Results of the Review`)
        - amendment in the works?
            - Neja will start looking into that after holiday in July
        - next General Assembly meeting
            - 23-24 Jan'25 in Barcelona/Sitges
            - coupled to HiPEAC'25 (20-22 Jan 2025)
                - https://www.hipeac.net/2025/barcelona
            - call for workshops/tutorials at HiPEAC'25
                - https://www.hipeac.net/2025/barcelona/#/call/
                - deadline: 1 July
                - Eli working on workshop submission for Women in HPC/CoE's
        - two deliverables due 5th of July (in response to project review)
            - one on co-design (by Alan)
                - focus on collaborating with projects like EUPILOT, EPI, EUPEX (rather than contacting vendors directly)
            - one for scientific WPs

### Notes

- CI/CD call for EuroHPC
    - is 100% funded (not 50/50 EU/countries)
    - not published yet
- request for success story by CASTIEL2
    - ideally end of June, by latest at end of August
    - involvement of SKA in EESSI
        - Neja is talking to Caspar on this
    - deployment of EESSI on Vega/Karolina
        - maybe something on Deucalion
        - at best by mid Aug'24
    - collaboration with AWS/Azure
        - getting EESSI in AWS ParallelCluster
- next general MultiXscale meeting
    - Tue 25 June 2024, 10:00-11:00 CEST
    - hosted by Alan
    - agenda point: update on pairing of technical + scientific WPs
- (Susana) suggestions for blog are welcome
    - something on leveraging EESSI on GitHub Actions to run CI
        - using GROMACS?
    - we should also have something on CD aspect
        - Alan has something that may be useful
    - something on progress in RISC-V

---

## Agenda/notes 2024-05-14

attending:

- Neja
- Alan
- Richard
- Bob
- Thomas
- Pedro
- Satish
- Xin
- Julián
- Caspar
- Nadia

project planning overview: https://github.com/orgs/multixscale/projects/1

### Notes

- overview of MultiXscale planning
    - https://github.com/orgs/multixscale/projects/1/views/1
- WP status updates
    - [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
        - [UGent] T1.1 Stable (EESSI) - due M12+M24
            - Need to start working on proper monitoring of the CVMFS infrastructure
                - Prometheus + Grafana dashboard + alerting?
                - healthy state of infrastructure (mostly server-side)
                - bandwidth tests
                - should start with a list of metrics to collect
                - check with what Terje (UiO) has done (status.eessi.io)
                - one page for users (notifications about incidents)
                    - changelog on documentation?
                    - use yml file for known issues?
                    - integrate into init script?
                - one for admins (CERN has some detection of sites that don't use a proxy)
                - set up dedicated meeting
                - be clear about what is important to whom (us, EuroHPC, ...)
        - [RUG] T1.2 Extending support (starts M9, due M30)
            - Our Arm Neoverse V1 builds revealed a bug (and, apparently, another one while the developers were trying to fix it) in GROMACS: https://gitlab.com/gromacs/gromacs/-/issues/5057
            - started building for `zen4`
            - may look into AMD GPUs, Neoverse V2, ...
            - may also look into Clang and MPICH
        - [SURF] T1.3 Test suite - due M12+M24
            - Espresso test MultiXscale added (WIP). Deadline Milestone: End of June. [#144](https://github.com/EESSI/test-suite/pull/144)
            - CP2K [#133](https://github.com/EESSI/test-suite/pull/133), LAMMPS [#131](https://github.com/EESSI/test-suite/pull/131), PyTorch [#130](https://github.com/EESSI/test-suite/pull/130) and QE [#128](https://github.com/EESSI/test-suite/pull/128).
            - Fixed process binding within the test-suite which was not really compact. [#137](https://github.com/EESSI/test-suite/pull/137)
            - Certain small fixes:
                - Renaming of `1_cpn_2_nodes` tags [#140](https://github.com/EESSI/test-suite/pull/140)
                - set `SRUN_CPUS_PER_TASK` (needed on SLURM >= 22.05 < 23.11) [#141](https://github.com/EESSI/test-suite/pull/141)
                - Temporary fix for libfabric problems on Karolina [#142](https://github.com/EESSI/test-suite/pull/142).
            - OpenFOAM may not be relevant w.r.t. MultiXscale anymore but still relevant within EESSI and development is going on for a test.
                - A repository for saving large input files such as meshes needed for the test.
                    - Kenneth's suggestion: S3 bucket AWS?
            - Skip certain tests to save time in build jobs particularly
                - use some lookup table
                - analyse contents of tarball
        - [BSC] T1.4 RISC-V (starts M13)
            - Development repository `riscv.eessi.io`
                - Documentation: https://www.eessi.io/docs/repositories/riscv.eessi.io/
            - Prerequisites have been made available: CernVM-FS client, build containers, RISC-V support in compatibility layer installation scripts, etc
            - Compatibility layer available in `/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/`
            - Working on software layer (manually, no bot involved yet):
                - Notes in https://github.com/EESSI/software-layer/issues/552
                - Only `riscv64/generic` for now
                - Solving lots of issues with easyconfigs, mostly adding/enabling/backporting RISC-V support
                - Managed to install foss/2023b toolchain, now trying real software on top of it:
                    - Successfully built R 4.3.3 and dlb 3.4
                    - Currently trying GROMACS, which compiles, but fails in the test step (1 of 91 tests fails with segmentation fault)
            - Clang is needed/provides better support for RISC-V (BSC, SiPearl)
        - [SURF] T1.5 Consolidation (starts M25)
    - [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
        - _FINISHED [UGent] T5.1 Support portal - due M12_
        - [SURF] T5.2 Monitoring/testing (starts M9)
            - UiB: ongoing work to use test-suite on national HPC systems in Norway + low-level CVMFS availability tests (likely 2 stages: 1st simple test, 2nd adding feature to Slurm which is only set when EESSI is available on node + jobs can request that feature)
                - or even better, only start CVMFS if it is requested by job
            - Initial meeting to discuss public dashboard: https://github.com/EESSI/meetings/wiki/meeting-public-dashboard-2024-05-03
                - next meeting planned for mid-June
        - _FINISHED [UiB] T5.3 community contributions (bot) - due M12_
        - [UGent] T5.4 support/maintenance (starts M13)
            - working rotation, something noteworthy?
                - rotation schedule until October agreed
            - bot release around the corner
    - [UB] WP6 Community outreach, education, and training
        - Lots of EESSI/MultiXscale activity at ISC as we speak
        - UiB: preparing presentation "Making it EESSI to run bioinformatics workflows" at Norwegian Bioinformatics days (workshop about data management), May 29
            - nextflow repository, uses .direnv (see https://github.com/EESSI/eessi-nextflow-example)
        - UiB: preparing webinar introducing EESSI/NESSI to users on national HPCs, date: tbd
            - also market this to NCC (ask Castiel2 for budget if in-person)
        - discussion within scientific WPs about trainings to offer
            - series of webinars with CECAM
            - application to CECAM for a flagship course
        - should we look into repeating EESSI-only-related tutorials
            - Alan finalising dates with two NCCs (Austria/Slovenia) and two CECAM nodes (running MPI application, running GPU application)
            - instructor training with NCC Sweden (about how to prepare and deliver a lecture/tutorial)
            - one NCC Slovenia event planned in December, Slovenia Supercomputer Days
    - [HPCNow] WP7 Dissemination, Exploitation & Communication
        - Task 7.1 Scientific applications provisioned on demand
            - Initial discussion with HPCNow but we need a dedicated meeting
        - Task 7.2 - Dissemination and communication activities
            - Overlap with previous discussion in WP6
            - ASHPC in June (Matej is Program Chair)
                - MultiXscale poster
            - ESPResSo workshop currently being disseminated
                - Includes waLBerla
                - Can disseminate in CASTIEL2 newsletter (used to be NCC only but now includes CoEs)
            - Website needs some updating based on review feedback
        - Task 7.3 - Sustainability (NIC + HPCNow!)
            - due to start in June
            - Legal entity for EESSI needs to be looked into
        - Task 7.4 - Industry-oriented training activities (HPCNow and Leonardo)
            - Subject of a meeting next week
    - [NIC] WP8 (Management and Coordination)
        - something about the review? :scream:
            - Working on a response to the letter
            - 2 additional deliverables, one relevant to us on co-design
                - Could be good to focus on Clang and work with vendors to help them deliver/test their customisations
                - Should also start looking at Neoverse-V2 (NVIDIA Grace has this)
                - Connect with

#### CASTIEL2

- Decision from EuroHPC for CI/CD call is out
- Requested to collaborate more with CASTIEL2
- Can substitute in a technical collaboration task focussing on CI/CD

### Overview progress per WP

#### WP1 (Developing a Central Platform for Scientific Software on Emerging Exascale Technologies)

- Test suite is developing at a decent pace. Can be better w.r.t. applications, as well as mid-level software such as BLAS libraries.
- Getting and displaying scaling information from the reported performance numbers.
- We had an initial meeting w.r.t. the dashboard but some urgent work is required and going on since we are already 7 months into the task.
    - Next meeting planned mid-June.
    - Working towards a prototype with already existing data.
    - Maksim is already testing various databases where the performance logs can be collected.

#### WP5 (Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations)

- ...

#### WP6 (Community outreach, education, and training)

- ...

#### WP7 (Dissemination, Exploitation & Communication)

- ...
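
Side note on the `SRUN_CPUS_PER_TASK` fix listed under T1.3 in the notes above (test-suite PR #141): the version window quoted there (SLURM >= 22.05 and < 23.11) can be written down as a small check. This is only an illustrative sketch and is not code from the EESSI test suite; the function names `parse_slurm_version` and `needs_srun_cpus_per_task`, and the assumption that the version arrives as a string such as `slurm 23.02.7`, are hypothetical choices made for this example.

```python
import re


def parse_slurm_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a string such as 'slurm 23.02.7' or '22.05'.

    Hypothetical helper for this sketch; the input format is an assumption.
    """
    match = re.search(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse Slurm version from {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_srun_cpus_per_task(version: str) -> bool:
    """Return True if the version falls in the window quoted in the notes:
    >= 22.05 and < 23.11 (tuples compare element by element)."""
    return (22, 5) <= parse_slurm_version(version) < (23, 11)


if __name__ == "__main__":
    # Print the decision for a few example version strings.
    for example in ("slurm 21.08.8", "slurm 22.05.2", "slurm 23.02.7", "slurm 23.11.0"):
        print(example, needs_srun_cpus_per_task(example))
```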

---------------------------

## Notes of previous meetings

- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-05-14
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-04-09
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-03-12
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-02-13
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-01-09
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-12-12
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-11-14
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-10-10
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-09-12
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-08-08
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-07-11
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-06-13
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-05-09
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-04-11
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-03-14
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-02-14
- https://github.com/multixscale/meetings/wiki/sync-meeting-2023-01-10

----------------------------

## Template for sync meeting notes

TO COPY-PASTE

- overview of MultiXscale planning
    - https://github.com/orgs/multixscale/projects/1/views/1
- WP status updates
    - [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
        - [UGent] T1.1 Stable (EESSI) - due M12+M24
            - ...
        - [RUG] T1.2 Extending support (starts M9, due M30)
        - [SURF] T1.3 Test suite - due M12+M24
            - ...
        - [BSC] T1.4 RISC-V (starts M13)
        - [SURF] T1.5 Consolidation (starts M25)
    - [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
        - [UGent] T5.1 Support portal - due M12
            - ...
        - [SURF] T5.2 Monitoring/testing (starts M9)
        - [UiB] T5.3 community contributions (bot) - due M12
            - ...
        - [UGent] T5.4 support/maintenance (starts M13)
    - [UB] WP6 Community outreach, education, and training
        - ...
    - [HPCNow] WP7 Dissemination, Exploitation & Communication
        - ...