# MultiXscale WP1+WP5 sync meetings
- Monthly, every 2nd Tuesday of the month at 10:00 CE(S)T
- Notes of previous meetings at https://github.com/multixscale/meetings/wiki
---------------------------
## Next meetings
- Tue 11 June 2024 10:00 CEST
- Tue 9 July 2024 10:00 CEST
- planning to attend: Caspar, Bob
- on summer break: Kenneth, Lara, Thomas
- Tue 13 Aug 2024 10:00 CEST
- planning to attend: Kenneth, Lara, Caspar, Thomas
- on summer break: Bob
---------------------------
## Agenda/notes 2024-06-11
attending:
* Neja (NIC)
* Alan (UB)
* Kenneth, Lara (UGent)
* Caspar, Casper, Maxim (SURF)
* Thomas & Richard (UiB)
* Bob, Pedro (RUG)
* Jean-Noël (UStutt)
* Julián (BSC)
* Eli, Susana, Nadia (HPCNow!)
- problems with shared drive
- cfr. incomplete progress reports 2024Q1 for WP1 (see drafts in drive uploaded by Satish) + WP5 (see Lara's email 6 May)
- works for Alan when logging in through incognito browser + logging in with personal Microsoft account
- who else has this problem?
- Kenneth, Susana, Jean-Noël, Rudolph, Pedro
- we can try to take a copy and create a new OneDrive
- 2024Q2 quarterly report
- try to get info in place end of June/early July
- Lara & Caspar will be mostly available in July
- if problems with OneDrive persist, send PMs info + bullet points with tasks to Caspar/Lara via email/Slack/HackMD
- Milestone 3 (M18 - June 2024, lead: UStuttgart)
- Milestone name: "First portable test run on two systems with different architectures (e.g. with and without accelerators)"
- Means of validation: "Performance and scalability plots available for the application on the two architectures"
- working on this using ESPResSo, as extra test in EESSI test suite
- see https://github.com/EESSI/test-suite/pull/144, using FFT test case
- working with JSC to make FFT communication 8-16x faster
- scalability for LJ test case should improve this year
- should also be added to ESPResSo test
- WP status updates
- [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
- [UGent] T1.1 Stable (EESSI) - **D1.3 due M24 (Dec'24)**
- more software, incl. Espresso 4.2.2
- `dev.eessi.io` => see [notes](https://github.com/EESSI/meetings/wiki/dev.eessi.io-discussion-(2024-05-24)) + [support issue #61](https://gitlab.com/eessi/support/-/issues/61)
- would be very interesting service for developers in scientific WPs => cross-cutting across technical & scientific PRs
- "looser" policy compared to `software.eessi.io` production repo
- Devs can trigger their own builds
- pre-release builds accepted (specific commits)
- initially focused on Espresso & co
- could also be used as "dev" environment for `software.eessi.io` features (e.g. GPU support)
- if we're doing this on Azure, we should do it in a new subscription
- needs to be created by Martin @ SURF
- if done in AWS, Alan can do it
- GPU software => see [notes](https://github.com/EESSI/meetings/wiki/meeting-GPU-support-(2024-05-27)) + [support issue #59](https://gitlab.com/eessi/support/-/issues/59)
- Update bot to have GPU support [Thomas]
- Update archdetect to support CUDA compute capability [???]
- directory structure in `software.eessi.io`, for example `software/x86_64/amd/zen2/accel/nvidia/cc80` [???]
- blocked by `dev.eessi.io`?
- we want to use this as a playground for GPU builds
- => can look into this during hackathon on Tue 18 June
- **needs to be planned**
- need to review description of Task 1.1, make sure all subtasks are covered
- => need to update [project planning](https://github.com/orgs/multixscale/projects/1) (Caspar, Kenneth)
- "we will benchmark software from the shared software stack and compare the performance against on-premise software stacks to identify potential performance limitations, ..."
- Espresso + LAMMPS + OpenFOAM + ALL(?) (MultiXscale), GROMACS (BioExcel)
- "increase stability of the shared software stack ... pro-actively by developing monitoring tools"
- proper monitoring for CVMFS network (S0 + S1s)
- for RUG?
- [RUG] T1.2 Extending support - D1.4 due M30 (June'25)
- Arm support fits here
- zen4 + sapphirerapids
- AMD ROCm
- lower impact, should we limit our efforts here?
- select apps, like PyTorch/TensorFlow
- should also look into Grace Hopper (JUPITER)
- [SURF] T1.3 Test suite - D1.5 due M30 (June'25)
- Milestone 3 for Espresso test
- [BSC] T1.4 RISC-V (starts M13)
- cfr. efforts by Bob & Julian, incl. `riscv.eessi.io`
- actively looking into adding more software, incl. Extrae
- lot of interest from EUPILOT project @ BSC
- [SURF] T1.5 Consolidation (starts M25 - Jan'25)
- (not started yet)
- [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
- _(FINISHED M12 [UGent] T5.1 Support portal)_
- [SURF] T5.2 Monitoring/testing, D5.3 due M30 (June'25)
- discussions with SURF + initial work done on dashboard
- working on two dashboards: one detailed, one with overview
- _(FINISHED M12 [UiB] T5.3 community contributions (bot))_
- [UGent] T5.4 support/maintenance - D5.4 due M48 (Dec'26)
- support portal + rotation working well
- support issues in April+May
- Opened: 12 issues
- Closed: 10 issues
- total: 69 issues (26 open, 43 closed)
- bot release
- [UB] WP6 Community outreach, education, and training
- [Kenneth, Lara, Pedro] EasyBuild User Meeting (EUM'24), 23-25 April 2024 @ Umeå, Sweden
- [Kenneth, Lara, Eli] activity at ISC'24, see https://eessi.io/docs/blog/2024/05/17/isc24
- [Eli] Teratec (29-30 May'24)
- poster
- demo for Sanofi, who were quite interested
- [Thomas] presentation @ Norwegian Bioinformatics Days on making bioinformatics workflows easy (using Nextflow)
- they use a lot of containers, but can also use different backends
- backend for EESSI could be interesting
- similar work was done in BioHackathon Europe (https://biohackathon-europe.org)
- [Lara] EESSI promotion @ DH Benelux in Leuven (Belgium), 4-7 June'24
- some people were interested, like getting students easy access to software installations
- [Matej] presenting poster at ASHPC this week
- [Alan] invited speaker for Nordic Industry Days (early Sept'24)
- submit BoF proposal on EESSI for SC24 (Atlanta, US)
- HPCNow! will be attending
- tutorial submission done
- CernVM-FS workshop (Sept'24, Geneva)
- submission due this month
- EESSI is in default CernVM-FS configuration
- could cover work on `dev.eessi.io`
- deliverable due: D6.2 (M24 - Dec'24), D6.3 (M30 - June'25)
- [HPCNow] WP7 Dissemination, Exploitation & Communication
- T7.1 Scientific applications provisioned on demand (lead: HPCNow)
- ...
- Task 7.2 - Dissemination and communication activities (lead: NIC)
- more EESSI stickers
- via HPCNow?
- Neja will ask at NIC
- new section in MultiXscale website: https://www.multixscale.eu/dissemination
- interview with Matej being worked on by Susana
- will try to include it in newsletter of July
- Task 7.3 - Sustainability (lead: NIC, started M18)
- Legal entity for EESSI needs to be looked into?
- subcontracting money available for this
- we should explore options ourselves a bit first
- Task 7.4 - Industry-oriented training activities (lead: HPCNow)
- ...
- [NIC] WP8 (Management and Coordination)
- reply to review report (see Word doc in shared drive, `1st periodic report | Results of the Review`)
- amendment in the works?
- Neja will start looking into that after holiday in July
- next General Assembly meeting
- 23-24 Jan'25 in Barcelona/Sitges
- coupled to HiPEAC'25 (20-22 Jan 2025)
- https://www.hipeac.net/2025/barcelona
- call for workshops/tutorials at HiPEAC'25
- https://www.hipeac.net/2025/barcelona/#/call/
- deadline: 1 July
- Eli working on workshop submission for Women in HPC/CoE's
- two deliverables due 5th of July (in response to project review)
- one on co-design (by Alan)
- focus on collaborating with projects like EUPILOT, EPI, EUPEX (rather than contacting vendors directly)
- one for scientific WPs
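
The accelerator directory layout mentioned under the GPU software item above (`software/x86_64/amd/zen2/accel/nvidia/cc80`) could be derived from the CPU target and the CUDA compute capability roughly as follows. This is only a sketch: the function name `accel_subdir` and its parameters are hypothetical and not part of archdetect or any EESSI script, and the layout itself was still marked as open in these notes.

```python
def accel_subdir(cpu_target: str, compute_capability: str, vendor: str = "nvidia") -> str:
    """Build an accelerator subdirectory path from a CPU target and a
    CUDA compute capability such as "8.0".

    Hypothetical helper for illustration only; the layout follows the
    example path given in the notes.
    """
    major, minor = compute_capability.split(".")
    # int() rejects non-numeric input such as "8.x" with a ValueError
    cc = f"cc{int(major)}{int(minor)}"
    return f"software/{cpu_target}/accel/{vendor}/{cc}"


if __name__ == "__main__":
    print(accel_subdir("x86_64/amd/zen2", "8.0"))
    # prints: software/x86_64/amd/zen2/accel/nvidia/cc80
```

Keeping the accelerator part below the CPU target means existing CPU-only paths stay unchanged, which matches the example path in the notes.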
### Notes
- CI/CD call for EuroHPC
- is 100% funded (not 50/50 EU/countries)
- not published yet
- request for success story by CASTIEL2
- ideally end of June, at the latest by end of August
- involvement of SKA in EESSI
- Neja is talking to Caspar on this
- deployment of EESSI on Vega/Karolina
- maybe something on Deucalion
- at best by mid Aug'24
- collaboration with AWS/Azure
- getting EESSI in AWS ParallelCluster
- next general MultiXscale meeting
- Tue 25 June 2024, 10:00-11:00 CEST
- hosted by Alan
- agenda point: update on pairing of technical + scientific WPs
- (Susana) suggestions for blog are welcome
- something on leveraging EESSI on GitHub Actions to run CI
- using GROMACS?
- we should also have something on CD aspect
- Alan has something that may be useful
- something on progress in RISC-V
---
## Agenda/notes 2024-05-14
attending:
- Neja
- Alan
- Richard
- Bob
- Thomas
- Pedro
- Satish
- Xin
- Julián
- Caspar
- Nadia
project planning overview: https://github.com/orgs/multixscale/projects/1
### Notes
- overview of MultiXscale planning
- https://github.com/orgs/multixscale/projects/1/views/1
- WP status updates
- [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
- [UGent] T1.1 Stable (EESSI) - due M12+M24
- Need to start working on proper monitoring of the CVMFS infrastructure
- Prometheus + Grafana dashboard + alerting?
- healthy state of infrastructure (mostly server-side)
- bandwidth tests
- should start with a list of metrics to collect
- check with what Terje (UiO) has done (status.eessi.io)
- one page for users (notifications about incidents)
- changelog on documentation?
- use yml file for known issues?
- integrate into init script?
- one for admins (CERN has some detection of sites that don't use a proxy)
- setup dedicated meeting
- be clear about what is important to whom (us, EuroHPC, ...)
- [RUG] T1.2 Extending support (starts M9, due M30)
- Our Arm Neoverse V1 builds revealed a bug (and, apparently, another one while the developers were trying to fix it) in GROMACS: https://gitlab.com/gromacs/gromacs/-/issues/5057
- started building for `zen4`
- may look into AMD GPUs, Neoverse V2, ...
- may also look into Clang and MPICH
- [SURF] T1.3 Test suite - due M12+M24
- Espresso test MultiXscale added (WIP). Deadline Milestone: End of June. [#144](https://github.com/EESSI/test-suite/pull/144)
- CP2K [#133](https://github.com/EESSI/test-suite/pull/133), LAMMPS [#131](https://github.com/EESSI/test-suite/pull/131), PyTorch [#130](https://github.com/EESSI/test-suite/pull/130) and QE [#128](https://github.com/EESSI/test-suite/pull/128).
- Fixed process binding within the test-suite which was not really compact. [#137](https://github.com/EESSI/test-suite/pull/137)
- Certain small fixes:
- Renaming of `1_cpn_2_nodes` tags [#140](https://github.com/EESSI/test-suite/pull/140)
- set `SRUN_CPUS_PER_TASK` (needed on SLURM >= 22.05 < 23.11) [#141](https://github.com/EESSI/test-suite/pull/141)
- Temporary fix for libfabric problems on Karolina [#142](https://github.com/EESSI/test-suite/pull/142).
- OpenFOAM may not be relevant w.r.t. MultiXscale anymore but still relevant within EESSI and development is going on for a test.
- A repository is needed for storing large input files (such as meshes) required by the test.
- Kenneth's suggestion: S3 bucket AWS?
- Skip certain tests to save time, particularly in build jobs
- use some lookup table
- analyse contents of tarball
- [BSC] T1.4 RISC-V (starts M13)
- Development repository `riscv.eessi.io`
- Documentation: https://www.eessi.io/docs/repositories/riscv.eessi.io/
- Prerequisites have been made available: CernVM-FS client, build containers, RISC-V support in compatibility layer installation scripts, etc.
- Compatibility layer available in `/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/`
- Working on software layer (manually, no bot involved yet):
- Notes in https://github.com/EESSI/software-layer/issues/552
- Only `riscv64/generic` for now
- Solving lots of issues with easyconfigs, mostly adding/enabling/backporting RISC-V support
- Managed to install foss/2023b toolchain, now trying real software on top of it:
- Successfully built R 4.3.3 and dlb 3.4
- Currently trying GROMACS, which compiles, but fails in the test step (1 of 91 tests fails with segmentation fault)
- Clang is needed/provides better support for RISC-V (BSC, SiPearl)
- [SURF] T1.5 Consolidation (starts M25)
- [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
- _FINISHED [UGent] T5.1 Support portal - due M12_
- [SURF] T5.2 Monitoring/testing (starts M9)
- UiB: ongoing work to use test-suite on national HPC systems in Norway + low-level CVMFS availability tests (likely 2 stages: 1st simple test, 2nd adding feature to Slurm which is only set when EESSI is available on node + jobs can request that feature)
- or even better, only start CVMFS if it is requested by job
- Initial meeting to discuss public dashboard: https://github.com/EESSI/meetings/wiki/meeting-public-dashboard-2024-05-03
- next meeting planned for mid-June
- _FINISHED [UiB] T5.3 community contributions (bot) - due M12_
- [UGent] T5.4 support/maintenance (starts M13)
- working rotation, something noteworthy?
- rotation schedule until October agreed
- bot release around the corner
- [UB] WP6 Community outreach, education, and training
- Lots of EESSI/MultiXscale activity at ISC as we speak
- UiB: preparing presentation "Making it EESSI to run bioinformatics workflows" at Norwegian Bioinformatics days (workshop about data management), May 29
- nextflow repository, uses .direnv (see https://github.com/EESSI/eessi-nextflow-example)
- UiB: preparing webinar introducing EESSI/NESSI to users on national HPCs, date:tbd
- also market this to NCC (ask Castiel2 for budget if in-person)
- discussion within scientific WPs about trainings to offer
- series of webinars with CECAM
- application to CECAM for a flagship course
- should we look into repeating EESSI-only-related tutorials?
- Alan finalising dates with two NCCs (Austria/Slovenia) and two CECAM nodes (running MPI application, running GPU application)
- instructor training with NCC Sweden (about how to prepare and deliver a lecture/tutorial)
- one NCC Slovenia event planned in December, Slovenia Supercomputer Days
- [HPCNow] WP7 Dissemination, Exploitation & Communication
- Task 7.1 Scientific applications provisioned on demand
- Initial discussion with HPCNow but we need a dedicated meeting
- Task 7.2 - Dissemination and communication activities
- Overlap with previous discussion in WP6
- ASHPC in June (Matej is Program Chair)
- MultiXscale poster
- ESPResSo workshop currently being disseminated
- Includes waLBerla
- Can disseminate in CASTIEL2 newsletter (used to be NCC only but now includes CoEs)
- Website needs some updating based on review feedback
- Task 7.3 - Sustainability (NIC + HPCNow!)
- due to start in June
- Legal entity for EESSI needs to be looked into
- Task 7.4 - Industry-oriented training activities (HPCNow and Leonardo)
- Subject of a meeting next week
- [NIC] WP8 (Management and Coordination)
- something about the review? :scream:
- Working on a response to the letter
- 2 additional deliverables, one relevant to us on co-design
- Could be good to focus on Clang and work with vendors to help them deliver/test their customisations
- Should also start looking at Neoverse-V2 (NVIDIA GRACE has this)
- Connect with
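
The T1.3 item above notes that `SRUN_CPUS_PER_TASK` needs to be set on SLURM >= 22.05 and < 23.11. A minimal sketch of that version condition is shown below. The helper names (`parse_slurm_version`, `needs_srun_cpus_per_task`, `srun_env`) are hypothetical and this is not how the test suite implements it (see PR #141 for the actual change); only the version range is taken from the notes.

```python
from __future__ import annotations

import os


def parse_slurm_version(version: str) -> tuple[int, int]:
    """Turn a version string like "22.05.8" into (22, 5)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def needs_srun_cpus_per_task(version: str) -> bool:
    """True for the range given in the notes: >= 22.05 and < 23.11."""
    return (22, 5) <= parse_slurm_version(version) < (23, 11)


def srun_env(version: str, cpus_per_task: int, base_env: dict | None = None) -> dict:
    """Return a copy of the environment, adding SRUN_CPUS_PER_TASK
    only when the SLURM version falls in the affected range."""
    env = dict(os.environ if base_env is None else base_env)
    if needs_srun_cpus_per_task(version):
        env["SRUN_CPUS_PER_TASK"] = str(cpus_per_task)
    return env


if __name__ == "__main__":
    for v in ("21.08.5", "22.05.8", "23.02.1", "23.11.0"):
        print(v, needs_srun_cpus_per_task(v))
```

Comparing `(major, minor)` tuples avoids the pitfalls of comparing version strings directly (for example `"22.05" < "9.0"` as strings).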
#### CASTIEL2
- Decision from EuroHPC for CI/CD call is out
- Requested to collaborate more with CASTIEL2
- Can substitute in a technical collaboration task focussing on CI/CD
### Overview progress per WP
#### WP1 (Developing a Central Platform for Scientific Software on Emerging Exascale Technologies)
- Test suite is developing at a decent pace. Coverage of applications could be better, including mid-level software such as BLAS libraries.
- Getting and displaying scaling information from the reported performance numbers.
- We had an initial meeting on the dashboard, but urgent work is required (and ongoing), since we are already 7 months into the task.
- Next meeting planned mid-June.
- Working towards a prototype with already existing data.
- Maksim is already testing various databases in which the performance logs can be collected.
#### WP5 (Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations)
- ...
#### WP6 (Community outreach, education, and training)
- ...
#### WP7 (Dissemination, Exploitation & Communication)
- ...
---------------------------
## Notes of previous meetings
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-05-14
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-04-09
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-03-12
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-02-13
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-01-09
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-12-12
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-11-14
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-10-10
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-09-12
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-08-08
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-07-11
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-06-13
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-05-09
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-04-11
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-03-14
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-02-14
- https://github.com/multixscale/meetings/wiki/sync-meeting-2023-01-10
----------------------------
## Template for sync meeting notes
TO COPY-PASTE
- overview of MultiXscale planning
- https://github.com/orgs/multixscale/projects/1/views/1
- WP status updates
- [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
- [UGent] T1.1 Stable (EESSI) - due M12+M24
- ...
- [RUG] T1.2 Extending support (starts M9, due M30)
- [SURF] T1.3 Test suite - due M12+M24
- ...
- [BSC] T1.4 RISC-V (starts M13)
- [SURF] T1.5 Consolidation (starts M25)
- [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
- [UGent] T5.1 Support portal - due M12
- ...
- [SURF] T5.2 Monitoring/testing (starts M9)
- [UiB] T5.3 community contributions (bot) - due M12
- ...
- [UGent] T5.4 support/maintenance (starts M13)
- [UB] WP6 Community outreach, education, and training
- ...
- [HPCNow] WP7 Dissemination, Exploitation & Communication
- ...