Sync meeting on MultiXscale deliverables due in M30 (June 2025) (20250528)
Present: Alan, Pedro, Satish, Kenneth, Lara, Maksim, Thomas
D1.4 Support for emerging system architectures
- Rhea v1 (Neoverse V1) already covered, Rhea v2 (Neoverse V2) very close to NVIDIA Grace which is already supported
- not building yet for Graviton 4 (only Graviton 2 + 3), but we could/should
- we could get in touch with SiPearl (via Estella -> Alan)
- ROCm compatibility guarantee only from ROCm 6.4 onwards
- Overview of ROCm ecosystem (external contribution with our input)
- LLVM - see if we have enough for a full section (maybe out of scope); external contribution (with internal input). We run and pass the LLVM test suite. Important for future architectures.
- Future RISC-V work will be referenced in future deliverable
D1.5
- Caspar tackled all comments, to be reviewed again by Kenneth
- OpenFOAM test is close to being ready, so that should be reflected in deliverable
D5.3
- placeholder page on dashboard to be added to EESSI docs
- w.r.t. performance comparisons: just mention that no attempt was made to tune the tests for a particular system; it's not a benchmark suite
- mention that the dashboard will need to be actively maintained to keep the service up
Next sync
- during WP1+WP5 sync meeting on Tue 10 June
- aim to have deliverables 100% ready for final review by SC + submission by Petra
Sync meeting on MultiXscale deliverables due in M30 (June 2025) (20250509)
Present: Pedro, Bob, Petra, Richard, Thomas, Maksim, Kenneth, Satish
D1.4 Support for emerging system architectures
- Written by Pedro and Bob
- Thomas will review the document
- Some things are still missing or need to be updated
- E.g. the metadata on the first page and the RISC-V LLVM section
- Numbers of available apps need to be filled in just before submitting the deliverable
- Naming and references may not be consistent throughout the document
- A64FX is not an emerging microarchitecture, it has been around for many years
- we could name it "additional" or "interesting for EuroHPC"
- The additional targets should match the aforementioned criteria
- Reason for adding Cascade Lake could be that we now have a system that allows us to build for GPUs
- Is the NVIDIA GPU section too detailed?
- The ROCm section could stress a bit more how complex and fast evolving the ROCm stack is
- LLVM can be removed from the RISC-V section (or just mention it in the first subsection)
D1.5 Portable test suite for shared software stack
- List of authors needs to be fixed/updated
- Satish has reviewed the document
- Should we show a bit of relevant code of the EESSI Mixin class?
- This deliverable is close to done, just need to address some details (see comments in Overleaf)
- More or less complete; only details remain.
- Write some more details about the tests.
- Mention community contributions.
D5.3 Report on testing provided software
- Remove IP addresses from figure 2
- Combine figure 4 and 5 (with the same layout as on the website)
- Section 4.6 (connected systems) could move to section 3 (Periodic testing).
- But the dashboard only collects info on some systems. Otherwise refer to D1.5
- Refer to the list of systems table in periodic testing.
- 5.1 title (sanity checks -> test step of install procedure)
- 5.2.2 figure 11, include timepoints before the performance drop
- Hardware-based comparison -> show a plot comparing the Arm systems
- Also more or less complete, gives a great overview.
Next meetings
- Wed 28 May 2025 10:30-12:00 CEST
- goal: have deliverables reviewed + camera-ready for handover for final review to Steering Committee
Sync meeting on MultiXscale deliverables due in M30 (June 2025) (20250407)
D1.4 Support for emerging system architectures
writing effort led by: Bob + Pedro (RUG)
- process for identifying emerging targets
- new systems (within EuroHPC context, national systems)
- support requests (e.g. https://gitlab.com/eessi/support/-/issues/68 for Sapphire Rapids)
- supported instructions + (expected) performance difference
- also Intel Cascade Lake + Ice Lake
- overview of procedure to provide installations for additional CPU target
- reproduce installation order
- step-by-step overview: bot on system where CPU target is supported, etc.
- tension between doing installations in same order as they were done for existing CPU targets vs required bug fixes only present in newer EasyBuild versions
- lessons learned from adding additional CPU targets to existing EESSI version
- TODO
- A64FX: currently ~1/3rd of modules available
- set up bot in service account + give others access to it
- use EasyBuild 4.9.4 to install missing bits?
- use Bob's script to generate easystack files from existing installations
- NVIDIA Grace CPU: close to ready?
- AMD ROCm: quite a lot of work to do
- also make progress on NVIDIA GPU?
- workflow is missing to put more software installations in place
- no workflow that includes testing
- no fixed set of GPU targets (CUDA compute capabilities)
- expose GPU software in overview in docs
- capture whether modules were built on GPU build node, or not (in description or module-load-message)
- should have sanity check for CUDA compute capabilities in place, so we can rely on it
- more relevant for progress report (due in Aug'25, work done by June'25)
- tiger team should convene again! => Thomas
- timeline
- [Pedro] early draft by end of April
- [Thomas] review done of the draft by mid May
- can start reviewing on Monday, May 12
- [Pedro] camera-ready by end of May
- June as buffer
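The easystack generation step mentioned in the TODO above could look roughly like the sketch below. This is a minimal illustration only, not Bob's actual script: the input format (module names as reported by an existing EESSI installation) and the mapping to EasyBuild easystack YAML are assumptions.

```python
# Minimal sketch: turn a list of existing module names ('name/versionsuffix')
# into an EasyBuild easystack file. The input list is a stand-in for whatever
# an existing EESSI installation reports; the real script may work differently.

def modules_to_easystack(modules):
    """Render module names like 'GROMACS/2024.1-foss-2023b' as easystack YAML."""
    lines = ["easyconfigs:"]
    for mod in sorted(modules):
        name, _, version = mod.partition("/")
        # EasyBuild easyconfig file names follow the <name>-<version>.eb pattern
        lines.append(f"  - {name}-{version}.eb")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    mods = ["GROMACS/2024.1-foss-2023b", "OpenFOAM/v2312-foss-2023a"]
    print(modules_to_easystack(mods))
```

Feeding the resulting easystack file to `eb --easystack` would then reproduce the installations on the new CPU target.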
D1.5 Portable test suite for shared software stack
writing effort led by Caspar (SURF)
- focus on EESSI test suite itself
- Supported software in EESSI test suite
- How it's used (daily runs, test step as part of deployment procedure, and even on local software stacks)
- EESSI mixin class: extraction of common logic in portable tests to a single mixin class
- Check for and discuss other substantial improvements in the test suite repo (go through release notes?)
- Community building (EB user meeting / docs & tutorial on how to contribute)
- EESSI mixin facilitates writing new tests, as it points you to missing keywords etc.
- timeline
- [Caspar+Satish] early draft by end of April
- [Kenneth] review done of the draft by mid May
- [Caspar+Satish] camera-ready by end of May
- June as buffer
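The mixin idea discussed above (common logic extracted into a single class that points test authors at missing keywords) can be illustrated in plain Python. This is a generic sketch of the design pattern only, not the actual EESSI mixin API; class and attribute names here are made up for illustration.

```python
# Generic illustration of the mixin pattern: shared validation logic lives in
# one mixin class, so individual tests stay short, and authors of new tests
# get told which required attributes they forgot to set.
# NOTE: this is NOT the real EESSI test suite API, just a sketch of the idea.

class CommonTestMixin:
    # attributes every test is expected to define
    required = ("executable", "scale")

    def validate(self):
        """Point the test author at any missing required attribute."""
        missing = [attr for attr in self.required if not hasattr(self, attr)]
        if missing:
            raise AttributeError(f"test is missing required attributes: {missing}")
        return True

class GromacsTest(CommonTestMixin):
    executable = "gmx mdrun"
    scale = "1_node"

class IncompleteTest(CommonTestMixin):
    executable = "foo"  # 'scale' deliberately left out
```

Calling `IncompleteTest().validate()` raises an error naming the missing keyword, which mirrors how the mixin facilitates writing new tests.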
D5.3 Report on testing provided software
writing effort led by:
- Lara (UGent) + Satish (SURF) for daily runs + performance results
- Maksim (SURF) on dashboard
- Daily runs: which systems? Improvements to configuration (decoupling test version and config version), automatically use latest test suite release
- Lessons learned
- alerting based on sudden change of performance?
- Study performance results
- more variation for e.g. TensorFlow tests
- RHEL8 vs RHEL9 (Snellius, HPC-UGent Tier-2)
- interesting patterns in Vega?
- OSU Microbenchmarks faster with 2023a compared to 2023b?
- performance drop in early version of OpenMPI 5.0.x (2024a toolchain)
- Performance variations due to change in system software, changes to test suite, etc.
- lack of impact from changes to EESSI (which is a good thing!)
- Dashboard
- Inclusion of more sites on the dashboard?
- Challenges? (e.g. permission to publish?)
- timeline
- early draft by end of April
- Lara+Kenneth on daily runs of test suite
- Satish on performance results
- Maksim on dashboard
- [???] review done of the draft by mid May
- [Lara/Satish/Maksim] camera-ready by end of May
- June as buffer
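The alerting idea raised under lessons learned (flagging a sudden change in performance) could be sketched as a simple baseline comparison. This is an illustrative sketch only; the window size and threshold are assumptions, not the project's actual monitoring setup.

```python
# Sketch of simple performance-drop alerting: compare the newest measurement
# against the mean of a recent baseline window and flag it when it drops by
# more than a relative threshold. Window and threshold are illustrative
# choices, not project settings.

def performance_alert(samples, window=5, threshold=0.10):
    """Return True if the last sample is more than 'threshold' below the
    baseline mean. 'samples' are higher-is-better numbers (e.g. ns/day)."""
    if len(samples) <= window:
        return False  # not enough history for a baseline
    baseline = sum(samples[-window - 1:-1]) / window
    return samples[-1] < baseline * (1.0 - threshold)
```

A check like this, run after each daily test-suite run, would have surfaced the OpenMPI 5.0.x performance drop mentioned above without manual inspection of the plots.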
Next meetings
- also invite Petra to these meetings?
- Wed 30 April 2025 10:30-12:00 CEST
- goal: have drafts ready for review
- Wed 28 May 2025 10:30-12:00 CEST
- goals:
- have camera-ready versions ready
- assess whether deliverables are ready for review by MultiXscale Steering Committee