# Slurm at NAISS - Planning Document

This document: https://hackmd.io/@NAISS-Training/Slurm-at-NAISS
Repository: https://github.com/UPPMAX/NAISS_Slurm
Rendered page: https://uppmax.github.io/NAISS_Slurm/
Video recording: https://youtu.be/5fllUd5PyZc?si=M1nOGeWg-WhVv13k

### Who?

Currently involved:

* UPPMAX: Diana
* HPC2N: Birgitte
* LUNARC: Joachim, (Rebecca)
* C3SE: Sahar
* PDC, NSC - not actively involved but can provide input if needed

## Agenda, Fri, Jan 16, 15:00 - 16:00

### Feedback from participants

### Reflections from the previous instance of the course: what worked well, what didn't?

- consider having more exercises instead of type-alongs
- improve the advertisement to make clear that this is a beginner course
- **Suggestion for a different event/workshop**: possible improvements for more intermediate users: a hackathon with bring-your-own job scripts; or new modules/lectures added to the existing course covering more advanced topics: hybrid jobs, task farms

### Teachers and helpers: who's available on Feb 3 afternoon?

- Sahar
- Joachim
- Diana
- Birgitte

### Schedule, spring 2026: needs updates?

13:00 - 13:05 Intro to the course **Sahar**
13:05 - 13:25 Intro to clusters **Sahar**
13:25 - 13:40 Batch system concepts (job scheduling) **Joachim**
13:40 - 14:20 Intro to Slurm (sbatch, squeue, scontrol, SBATCH options: -J/-o/-e/--mail), including 10 min of exercise time **Birgitte**
14:20 - 14:22 Interactive jobs - mention as self-study **Birgitte**
14:22 - 14:35 BREAK
14:35 - 15:45 Additional sample scripts, including job arrays, extra memory, node-local disk (demos during lectures + a few exercises per student)
- 14:35 - 15:10 serial/basic, OpenMP, MPI, I/O-intensive **Joachim**
- 15:10 - 15:45 memory, job arrays, GPU **Diana**
15:45 - 15:47 mention job efficiency **Diana**
15:47 - 16:00 Summary **Diana**

### Changes in material?

- add scripts for a given cluster if missing
- focus on either Tetralith or Dardel, explaining both where they differ
- encourage users to ask questions about other clusters where relevant

### Communication with participants?

- anything special needed for this course module that cannot be specified in the initial email to participants of the NAISS intro week?
- same Zoom as for the other course modules, or a different one?

---

## Schedule (2025)

09:00 - 09:05 Intro to course **RP**
09:05 - 09:25 Intro to clusters **RP**
09:25 - 09:40 Batch system concepts (job scheduling) **JH**
09:40 - 10:20 Intro to Slurm (sbatch, squeue, scontrol, SBATCH options: -J/-o/-e/--mail), including 10 min of exercise time **BB**
10:20 - 10:22 Interactive jobs - mention as self-study **BB**
10:22 - 10:35 BREAK
10:35 - 11:45 Additional sample scripts, including job arrays, extra memory, node-local disk, task farms??? (demos during lectures + a few exercises per student) **JH** and **BB**
- 10:35 - 11:10 **JH**: serial/basic, OpenMP, MPI, I/O-intensive
- 11:10 - 11:45 **BB**: memory, job arrays, GPU
~~11:20 - 11:50~~ Job monitoring and efficiency (self-reading material)
11:45 - 12:00 Summary **JH**

## Lesson material issues

- MPI https://uppmax.github.io/NAISS_Slurm/cluster/#mpi__message__passing__interface - shorten?
- keep https://uppmax.github.io/NAISS_Slurm/cluster/#which__programs__can__be__run__effectively__on__a__computer__cluster where it is, or move it to Intro to Slurm / sample job scripts
- move salloc into a separate section after "Intro to Slurm", as "Interactive jobs / Open OnDemand"
  - copy & trim the On-Demand doc from the R-Matlab-Julia course
- directory per centre with exercises, instructions on how to compile, and binaries (tarball). Maybe a makefile. Cover this during setup / course preparations, maybe in the intro
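For quick reference while revising the "Intro to Slurm" material, a minimal sketch of the kind of job script that lecture block covers, using the SBATCH options named in the schedule (`-J`/`-o`/`-e`/`--mail`, where `--mail` stands for the `--mail-type`/`--mail-user` pair); the project ID and mail address are placeholders, not tied to any specific cluster:

```bash
#!/bin/bash
#SBATCH -A naiss2025-22-xxx          # placeholder project ID (find yours via projinfo or SUPR)
#SBATCH -J hello_slurm               # job name (-J)
#SBATCH -o hello_slurm.%j.out        # stdout file (-o); %j expands to the job ID
#SBATCH -e hello_slurm.%j.err        # stderr file (-e)
#SBATCH --mail-type=END,FAIL         # send mail when the job ends or fails
#SBATCH --mail-user=user@example.se  # placeholder address
#SBATCH -t 00:05:00                  # requested walltime (-t)
#SBATCH -n 1                         # one task (-n)
# Note: some clusters also require a partition, e.g. -p shared on Dardel.

echo "Hello from $(hostname)"
```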
## Planning meetings

### Next meeting: :calendar: Mon, Nov 24, 10:00 - if needed

Zoom link: see calendar invite or room description at #slurm:naiss.se.

Previous meetings:

- Tue, Nov 18, 13:00 - 14:00
- Wed, Nov 12, 11:00 - 12:00
- Wed, Nov 5, 13:30 - 14:30
- Tue, Oct 28, 13:00 - 14:00
- Thu, Oct 23, 11:00 - 12:00
- Thu, Oct 2, 15:00 - 16:00
- Thu, Sep 18, 14:00 - 15:00
- Thu, Aug 21, 14:00 - 15:00
- Wed, June 4, 14:00 - 15:00
- Mon, May 5, 15:00 - 16:00
- Fri, April 11, 11:00 - 12:00
- Thu, Mar 13, 11:00 - 12:15
- Wed, Jan 29, 09:00 - 10:00

## Next instance of the course

November 25, 2025, 13:00 - 14:00

## Meeting notes 2025-11-18

- registrations so far: 32
  - 3 Tetralith
  - 4 Dardel
  - 2 Alvis
  - 1 Vera
  - 3 Kebnekaise
  - 4 Cosmos
  - 2 Pelle
  - 11 without an account
- share the prerequisites with participants: https://uppmax.github.io/NAISS_Slurm/#prerequisites
- what about the interactive session? Who does that and when? BB, for 2 min
- exercise structure: NAISS_Slurm/exercises/resource/ex_type; DI fixes the structure
- Summary has no text: JH writes the summary
- BB sends a letter template to JH

## Meeting notes 2025-11-12

- registrations so far: 28
- potentially use the NAISS course project `naiss2025-22-934`, to be decided by JH and/or BB close to the registration deadline
- type-along or exercises for some sections
- [x] BB to fix https://uppmax.github.io/NAISS_Slurm/intro/#prepare__the__exercise__environment (step 6)
- note: skip the scheduling figure in the ==Intro to clusters== part for the sake of time; the teacher decides
- [x] remove the ThinLinc part for Alvis from https://uppmax.github.io/NAISS_Slurm/intro/#login__info
- [x] DI adds Sahar to the course repo
- [x] BB: add a tab for Dardel for the simple job script, as that requires the `shared` partition, https://uppmax.github.io/NAISS_Slurm/slurm/#simplest__job
- [x] BB: add `projinfo` and the SUPR link for getting the project number, https://uppmax.github.io/NAISS_Slurm/intro/#project__number__and__project__directory
- [x] BB: in the description text, make walltime, number of tasks, and other important keywords in the job script bold; alternatively, add back the paragraph on `-A`/`-N`/`-t`/`-n`/`-c` that Birgitte already has
- in https://uppmax.github.io/NAISS_Slurm/slurm/#simplest__job, add "it is typically a good idea to overbook your job time by 30-50%"; possibly rewrite the other bullet points under "Time/walltime"
- [x] BB: https://uppmax.github.io/NAISS_Slurm/slurm/#dependencies - the text after "Generally" needs formatting
- [x] rewrite https://uppmax.github.io/NAISS_Slurm/slurm/#script__example
- rename "Information about jobs" to "How to monitor jobs?" in https://uppmax.github.io/NAISS_Slurm/slurm/#information__about__jobs
- add time to the usage examples in https://uppmax.github.io/NAISS_Slurm/interactive/#examples
- maybe join the paragraphs in https://uppmax.github.io/NAISS_Slurm/interactive/#salloc__and__interactive together under cluster tabs instead?
- [x] BB: add the `-J`, `-o` and `-e` options under "Introduction to Slurm"

## Meeting notes 2025-11-05

- extend registration for the course until the 16th of Nov
- final closing of the registration: Nov 23rd at midnight
- potential course project: naiss2025-22-934
- add prereqs: login course and modules course. Give links in the prereqs on the course material site and then also in the info mail (where the Zoom info is)
- Diana not available for teaching on Nov 25 due to a different course
- inspiration for instructions to participants: https://hackmd.io/@UPPMAX/UPPMAX-login
- job monitoring and efficiency
  - have tabs for Tetralith / Dardel / Alvis / more if possible
  - cover the info for 1 or 2 clusters, but suggest that participants read the info relevant to the other clusters
  - keep it as self-reading material
- clarify when using the scratch disk for I/O-intensive jobs is useful and when not
  - add a disclaimer: using scratch is not always useful
- JH wants feedback on the "batch system concepts" section
- BB adds dependency example(s) in the section "Intro to Slurm" (a possible starting point is sketched after these notes)
- teach task farms / Slurm job steps in the next instance
- have "Job monitoring and efficiency" as optional material that people can read on their own
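A possible starting point for the dependency example(s) mentioned above: submit one job, then chain a second job that only starts if the first one succeeds. The script names `preprocess.sh` and `analysis.sh` are hypothetical:

```bash
# Submit the first job; --parsable makes sbatch print just the job ID.
jobid=$(sbatch --parsable preprocess.sh)

# Queue the second job so that it starts only after the first
# one has finished successfully (exit code 0).
sbatch --dependency=afterok:${jobid} analysis.sh
```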
## Meeting notes 2025-10-28

- headlines of the tables in the intro sorted
- Sahar to have a look at the intro material for the C3SE-hosted machines
- Diana adds information for Bianca and Pelle
- add the `-J` / `-o` / `-e` / `--mail` options to "Intro to Slurm"
- memory, I/O, and arrays go to "Additional examples"
- add examples for `--dependency`
- add MPI and OpenMP codes that people can use during demos/exercises; have them compiled in the course directory, but add details on how they were compiled on the different clusters (a generic sketch follows below)
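As a starting point for the OpenMP/MPI demo codes mentioned above, a generic sketch of an OpenMP job script; the project ID and the `omp_demo` binary are placeholders, and the module/compilation steps that differ per cluster are deliberately left out:

```bash
#!/bin/bash
#SBATCH -A naiss2025-22-xxx   # placeholder project ID
#SBATCH -J omp_demo
#SBATCH -t 00:10:00
#SBATCH -n 1                  # a single task...
#SBATCH -c 8                  # ...with 8 cores for the OpenMP threads

# Match the number of OpenMP threads to the cores Slurm allocated.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./omp_demo                    # hypothetical pre-compiled binary from the course directory
```

An MPI variant would instead request several tasks (e.g. `-n 16`) and launch the binary with `srun ./mpi_demo`.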
## Meeting notes 2025-10-23

- part of the material from "intro to clusters" should be moved to other sections or to "extra"
- other ideas and pictures: https://uppmax.github.io/HPC-python/common/understanding_clusters.html - mainly like the mermaid picture, from something like Cosmos but made generic
- make a bullet-point box per section (at the top) that can be covered during the lecture; the full text is then there for self-study afterwards
- enlarge the picture of the login and compute nodes
- include admonitions (refer to https://mrkeo.github.io/reference/admonitions/#supported-types or maybe https://sphinx-immaterial.readthedocs.io/en/latest/admonitions.html)
  - use: abstract or tldr, depending on what works
- Parallelism in its own section, or to "extra"
- Who does what (material):
  - Intro to course
  - Intro to clusters **RP**
  - Batch system concepts / job scheduling **JH**
  - Parallelism?
  - Intro to Slurm (sbatch, squeue, scontrol, ...) **BB**
  - BREAK
  - Additional sample scripts, including job arrays, task farms??? **all?**
  - Job monitoring and efficiency **DI**
  - Summary

## Meeting notes 2025-09-18

- include details on partitions at the different NAISS clusters (Birgitte)
- update the schematic figures for clusters and nodes
- remove the NAISS_Slurm `images` dir, keep `docs/images`

## Software for graphics/animations?

- Inkscape
- GIMP
- Blender

## Various presentations and material

- image source slides: https://docs.google.com/presentation/d/1nOfJC8rJqRCVYKP9TAsIYllkaUD-L-47eB_87EEdcGA/edit?usp=sharing

## Decisions

- We will use the new material for the November 2025 instance of the course
- The title! (Running jobs on clusters)

## Working repo, including previously-presented material

* Current repo: https://github.com/UPPMAX/NAISS_Slurm
* :heavy_check_mark: Diana working on configuring it as a gh-pages site using mkdocs (readthedocs theme)
  * rendered page: https://uppmax.github.io/NAISS_Slurm/
* UPPMAX and LUNARC: https://github.com/UPPMAX/NAISS_Slurm/tree/main/presentations (to be converted to .md format)
* HPC2N: [Batch system documentation (slurm) on our documentation pages](https://docs.hpc2n.umu.se/documentation/batchsystem/intro/) and [batch intro material from our Kebnekaise intro course](https://hpc2n.github.io/intro-course/batch/)

## Aim

- two workshops of 3h each, covering beginner-intermediate and intermediate-advanced topics
- ask participants for feedback on what they wish to know more about after the course

## Workshop title: Discussion over!

- :heavy_check_mark: Running jobs on clusters

## Headings for the course webpage

- Intro (3h)
  - Home
    - Introduction to the course
    - Practical details: login info, project number, project storage
    - Prerequisites
    - Schedule
  - Introduction to clusters (Rebecca)
    - Example 1: need to replace "slave" with "worker"; may need to replace "master" with "manager"
    - *Rebecca* to redraw the schematic based on JH's HPC schematic, but updated with more cores, nodes with thin and fat memory, and GPU nodes (8, 4, 4, + FE)
  - Batch system concepts (Joachim)
  - Introduction to Slurm (Birgitte)
    - Partitions (wherever it fits; also mention whether they are required or not)
    - Sbatch options (start off with a simple script)
    - sbatch / scancel / squeue / sinfo
  - Sample job scripts
    - Simple script / serial? (Rebecca)
    - OpenMP and MPI (Joachim)
    - Memory-intensive or I/O-intensive jobs (Diana)
    - Running on GPUs (cluster-specific tabs) (Birgitte)
  - Job monitoring and efficiency
    - sinfo / sacct (Diana)
      - `sacct -l -j JOBID -o jobname,NTasks,nodelist,MaxRSS,MaxVMSize | less -S`
    - job-usage (HPC2N), jobstats and finishedjobinfo (UPPMAX) (Diana)
    - https://www.c3se.chalmers.se/documentation/submitting_jobs/monitoring/ (job_stats.py)

## To be covered in the workshop

Proposed:

### Beginner (3h)

* discuss cluster architecture and login/compute nodes
  * explain the difference between the front-end/login node and the back-end/compute nodes
  * explain what cores and nodes are (+ schematics)
  * memory of different nodes and local disks?
* `sbatch` options for CPU job scripts
  * proj. number, time, partitions, ...
* sample job scripts for
  * I/O-intensive jobs
  * OpenMP and MPI jobs
  * job arrays (a minimal sketch follows after this list)
  * a simple example of task farming
* section on memory-hungry jobs and ways of increasing the memory per task (either by asking for a fatter node or by increasing the memory per task)
* running on GPUs
* job monitoring / efficiency
  * job-usage: https://hpc2n.github.io/intro-course/images/job-usage.png
    * [job-usage command](https://docs.hpc2n.umu.se/documentation/batchsystem/basic_commands/#job-usage__-__get__url__to__see__details__of__your__job__not__a__slurm__command)
  * jobstats
  * to-do: ask Alex about the job efficiency monitoring UI on LUNARC
  * `sacct -l -j JOBID -o jobname,NTasks,nodelist,MaxRSS,MaxVMSize | less -S`
* run an executable using different numbers of cores/nodes - an intro to finding the optimal SBATCH options for a particular simulation
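For the job-array item in the list above, a minimal sketch of what such a sample script could look like; the project ID, the binary, and the input-file naming scheme are all placeholders:

```bash
#!/bin/bash
#SBATCH -A naiss2025-22-xxx   # placeholder project ID
#SBATCH -J array_demo
#SBATCH -t 00:05:00
#SBATCH -n 1
#SBATCH --array=1-10          # ten independent array tasks, indices 1..10

# Each array task picks its own input file via the index Slurm sets for it.
./process_data input_${SLURM_ARRAY_TASK_ID}.dat   # hypothetical names
```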
**Hands-on sessions:**

* demo the exercises on a specific cluster and have tabs for the others, so that participants may choose

### Intermediate (2-3h):

* job efficiency
* job dependencies
* task farming (a job-step sketch is at the end of this document)
  * Dask
  * job arrays
  * Slurm task farm
  * HyperQueue
* checkpointing (saving the state of a job, for very long jobs)
* run an executable using different numbers of cores/nodes - more of it

#### Note Diana:

My wish is to have tabs in the course material for the different NAISS clusters for the parts which are not identical (e.g. partition names, QOS, scripts for job monitoring, etc.).

## Format

* presentations or mkdocs or something else?
* [suggested mkdocs template](https://uppmax.github.io/naiss_course_template/)
* readthedocs - DI to create the repo

## SUPR course project

Resources at:

- ...
- DI to create a course project with resources at all SNIC/NAISS centers (later)
- reservation of a few nodes at different centers

## Dates, duration, and frequency

* aim for fall 2025 as a first instance
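For the task-farming topic flagged for the intermediate instance, a minimal sketch using Slurm job steps inside a single allocation, assuming a recent Slurm version where `srun --exact` is available; the `work_item` binary is hypothetical:

```bash
#!/bin/bash
#SBATCH -A naiss2025-22-xxx   # placeholder project ID
#SBATCH -J taskfarm_demo
#SBATCH -t 00:30:00
#SBATCH -n 4                  # four task slots available to the farm

# Launch 8 single-task job steps in the background; at most 4 run
# at once, the rest wait until a slot in the allocation frees up.
for i in $(seq 1 8); do
    srun -n 1 --exact ./work_item ${i} &   # hypothetical binary
done
wait   # do not exit the job script until all steps have finished
```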