# Slurm at NAISS - Planning Document
This document: https://hackmd.io/@NAISS-Training/Slurm-at-NAISS
Repository: https://github.com/UPPMAX/NAISS_Slurm
Rendered page: https://uppmax.github.io/NAISS_Slurm/
Video recording: https://youtu.be/5fllUd5PyZc?si=M1nOGeWg-WhVv13k
### Who?
Currently involved:
* UPPMAX: Diana
* HPC2N: Birgitte
* LUNARC: Joachim, (Rebecca)
* C3SE: Sahar
* PDC, NSC - not actively involved but can provide input if needed
## Agenda, Fri, Jan 16, 15:00 - 16:00
### Feedback from participants
### Reflections from the previous instance of the course: what worked well, what didn't?
- consider having more exercises instead of type-along
- improve advertisement to make clear this is a beginner course
- **Suggestion for a different event/workshop**: possible improvements for more intermediate users: a hackathon with bring-your-own job scripts, or new modules/lectures added to the existing course covering more advanced topics (hybrid jobs, task farms)
### Teachers and helpers: who's available on Feb 3 afternoon?
- Sahar
- Joachim
- Diana
- Birgitte
### Schedule, spring 2026: needs updates?
13:00 - 13:05 Intro to the course **Sahar**
13:05 - 13:25 Intro to clusters **Sahar**
13:25 - 13:40 Batch system concepts (job scheduling) **Joachim**
13:40 - 14:20 Intro to Slurm (sbatch, squeue, scontrol, SBATCH options: -J/-o/-e/--mail), including 10 min of exercise time **Birgitte**
14:20 - 14:22 Interactive jobs - mention as self-study **Birgitte**
14:22 - 14:35 BREAK
14:35 - 15:45 Additional sample scripts, including job arrays, extra memory, node local disk (demos during lectures + a few exercises per student)
- 14:35 - 15:10 serial/basic, OpenMP, MPI, I/O-intensive **Joachim**
- 15:10 - 15:45 memory, job arrays, GPU **Diana**
15:45 - 15:47 mention job efficiency **Diana**
15:47 - 16:00 Summary **Diana**
### Changes in material?
- add scripts for a given cluster if missing
- focus on Tetralith and Dardel; explain both where they differ
- encourage users to ask questions about other clusters where relevant
### Communication with participants?
- anything special needed for this course module that cannot be specified in the initial email to participants of the NAISS intro week?
- same Zoom as for the other course modules or different?
---
## Schedule (2025)
09:00 - 09:05 Intro to course **RP**
09:05 - 09:25 Intro to clusters **RP**
09:25 - 09:40 Batch system concepts (job scheduling) **JH**
09:40 - 10:20 Intro to Slurm (sbatch, squeue, scontrol, SBATCH options: -J/-o/-e/--mail), including 10 min of exercise time **BB**
10:20 - 10:22 Interactive jobs - mention as self-study **BB**
10:22 - 10:35 BREAK
10:35 - 11:45 Additional sample scripts, including job arrays, extra memory, node local disk, task farms??? (demos during lectures + a few exercises per student) **JH** and **BB**
10:35 - 11:10 **JH**: serial/basic, OpenMP, MPI, I/O-intensive
11:10 - 11:45 **BB**: memory, job arrays, GPU
~~11:20 - 11:50~~ Job monitoring and efficiency (self-reading material)
11:45 - 12:00 Summary **JH**
## Lesson material issues
- MPI https://uppmax.github.io/NAISS_Slurm/cluster/#mpi__message__passing__interface - shorten?
- keep https://uppmax.github.io/NAISS_Slurm/cluster/#which__programs__can__be__run__effectively__on__a__computer__cluster where it is or move to Intro to Slurm / sample job scripts
- move salloc into a separate section after "Intro to Slurm", as "Interactive jobs / Open OnDemand" (see the sketch after this list)
- copy & trim On-Demand doc from R-Matlab-Julia course
- directory per centre with exercises and how to compile + binaries (tarball). Maybe a makefile. Cover this during setup / course preparations maybe in the intro
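
For the planned "Interactive jobs" section, a minimal salloc sketch could look like the following; the project number is a placeholder taken from the notes below, and the flags/times are examples only:

```bash
# Request a 1-hour interactive allocation with one task
# (project number is a placeholder; use your own)
salloc -A naiss2025-22-934 -n 1 -t 01:00:00

# Once the allocation is granted, run commands on the compute node with srun
srun hostname

# Leave the allocation when done
exit
```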
## Planning meetings
### Next meeting:
:calendar: Mon, Nov 24, 10:00 - if needed
Zoom link: see calendar invite or room description at #slurm:naiss.se.
Previous meetings:
- Tue, Nov 18, 13:00 - 14:00
- Wed, Nov 12, 11:00 - 12:00
- Wed, Nov 5, 13:30 - 14:30
- Tue, Oct 28, 13:00 - 14:00
- Thu, Oct 23, 11:00 - 12:00
- Thu, Oct 2, 15:00 - 16:00
- Thu, Sep 18, 14:00 - 15:00
- Thu, Aug 21, 14:00 - 15:00
- Wed, June 4, 14:00 - 15:00
- Mon, May 5, 15:00 - 16:00
- Fri, April 11, 11:00 - 12:00
- Thu, Mar 13, 11:00 - 12:15
- Wed, Jan 29, 09:00 - 10:00
## Next instance for the course
November 25, 2025, 13:00 - 14:00
## Meeting notes 2025-11-18
- registrations so far: 32
    - 3 Tetralith
    - 4 Dardel
    - 2 Alvis
    - 1 Vera
    - 3 Kebnekaise
    - 4 Cosmos
    - 2 Pelle
    - 11 without an account
- share prerequisites with participants: https://uppmax.github.io/NAISS_Slurm/#prerequisites
- who covers the interactive session, and when? BB, for 2 min
- exercise structure: NAISS_Slurm/exercises/resource/ex_type; DI fixes the structure
- Summary has no text: JH writes the summary
- BB sends a letter template to JH
## Meeting notes 2025-11-12
- registration up to now: 28 registrations
- potentially use the NAISS course project `naiss2025-22-934`, to be decided by JH and/or BB close to registration deadline
- type-along or exercises for some sections
- [x] BB to fix https://uppmax.github.io/NAISS_Slurm/intro/#prepare__the__exercise__environment (step 6)
- note: skip the scheduling figure for the sake of time in the ==Intro to clusters== part; teacher decides
- [x] remove ThinLic part for Alvis from https://uppmax.github.io/NAISS_Slurm/intro/#login__info
- [x] DI adds Sahar to course repo
- [x] BB: add tab for Dardel for the simple job script as that requires the `shared` partition, https://uppmax.github.io/NAISS_Slurm/slurm/#simplest__job
- [x] BB: add `projinfo` and SUPR link to get project number https://uppmax.github.io/NAISS_Slurm/intro/#project__number__and__project__directory
- [x] BB: in the description text, make walltime, number of tasks, and other important keywords in the job script bold; alternatively: add back the paragraph on `-A/-N/-t/-n/-c` that Birgitte already has
- in https://uppmax.github.io/NAISS_Slurm/slurm/#simplest__job, add "it is typically a good idea to overbook your job time by 30-50%"; possibly re-write the other bullet points under "Time/walltime"
- [x] BB: the text after "Generally" in https://uppmax.github.io/NAISS_Slurm/slurm/#dependencies needs formatting
- [x] rewrite https://uppmax.github.io/NAISS_Slurm/slurm/#script__example
- rename "Information about jobs" to "How to monitor jobs?" in https://uppmax.github.io/NAISS_Slurm/slurm/#information__about__jobs
- add time to the usage examples at https://uppmax.github.io/NAISS_Slurm/interactive/#examples
- maybe join together paragraphs https://uppmax.github.io/NAISS_Slurm/interactive/#salloc__and__interactive under cluster tabs instead?
- [x] BB: add `-J`, `-o`, and `-e` options under "Introduction to Slurm" (see the sketch after this list)
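
A minimal sketch of the simple job script these items refer to, pulling together the `-A/-t/-n` and `-J/-o/-e` options and the Dardel `shared` partition note; the project number is a placeholder, and partition details differ per cluster:

```bash
#!/bin/bash
#SBATCH -A naiss2025-22-934   # project number (-A); placeholder, use your own
#SBATCH -J hello              # job name (-J), shown in squeue
#SBATCH -o hello.%j.out       # stdout file (-o); %j expands to the job ID
#SBATCH -e hello.%j.err       # stderr file (-e)
#SBATCH -t 00:10:00           # walltime (-t) as HH:MM:SS
#SBATCH -n 1                  # number of tasks (-n)
#SBATCH -p shared             # Dardel only: sub-node jobs go to the shared partition

echo "Hello from $(hostname)"
```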
## Meeting notes 2025-11-05
- extend registration for the course until the 16th of Nov
- final closing of the registration: Nov 23rd at midnight
- Potential course project: naiss2025-22-934
- Add prerequisites: the login course and the modules course. Give links in the prerequisites on the course material site and also in the info mail (which has the Zoom info)
- Diana not available for teaching on Nov 25 due to a different course
- inspiration for instructions to participants: https://hackmd.io/@UPPMAX/UPPMAX-login
- job monitoring and efficiency
- have tabs for Tetralith / Dardel / Alvis / more if possible
- cover info for 1 or 2 clusters, but suggest that participants read the info relevant to other clusters
- keeping it as self-reading material
- clarify when using scratch disk for I/O intensive jobs is useful and when not
- add a disclaimer: using scratch is not always useful
- JH wants feedback on the "batch system concepts" section
- BB adds dependency example(s) in the section "Intro to Slurm" (see the sketch after this list)
- teach task farms / Slurm job steps for next instance
- have "Job monitoring and efficiency" as optional material that people can read on their on
## Meeting notes 2025-10-28
- Headings of tables in the intro sorted
- Sahar to have a look at the intro material for the C3SE-hosted machines
- Diana adds information for Bianca and Pelle
- add `-J` / `-o` / `-e` / `--mail` options to "Intro to Slurm"
- memory and I/O and arrays go to "Additional examples"
- add examples for `--dependency`
- add MPI and OpenMP codes that people can use during demos/exercises; have them compiled in the course directory, but add details on how they were compiled on the different clusters (see the sketch after this list)
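
A sketch of what the per-cluster compile notes could look like; the module name below is an assumption and differs between clusters:

```bash
# Load a compiler + MPI toolchain (module name is an assumption; varies per cluster)
module load foss

# MPI example code
mpicc -O2 -o mpi_hello mpi_hello.c

# OpenMP example code
gcc -O2 -fopenmp -o omp_hello omp_hello.c
```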
## Meeting notes 2025-10-23
- Part of the material from "intro to clusters" should be moved to other sections or "extra".
- Other ideas and pictures: https://uppmax.github.io/HPC-python/common/understanding_clusters.html - mainly the Mermaid picture (from something like Cosmos) but made generic
- Make a bullet-point box per section (at the top) that can be covered during the lecture and then the text is there for self-study afterwards
- enlarge picture for the login and compute nodes
- include admonitions (refer to https://mrkeo.github.io/reference/admonitions/#supported-types or maybe https://sphinx-immaterial.readthedocs.io/en/latest/admonitions.html)
- use: abstract or tldr depending on what works
- Parallelism in its own section or to extra
- Who does what (material):
- Intro to course
- Intro to clusters **RP**
- Batch system concepts / job scheduling **JH**
- Parallelism?
- Intro to Slurm (sbatch, squeue, scontrol, ...) **BB**
- BREAK
- Additional sample scripts, including job arrays, task farms??? **all?**
- Job monitoring and efficiency **DI**
- Summary
## Meeting notes 2025-09-18
- include details on partitions at different NAISS clusters (Birgitte)
- update the schematic figures for clusters and nodes
- remove the NAISS_Slurm `images` dir, keep `docs/images`
## Software for graphics/animations?
- InkScape
- Gimp
- Blender
## Various presentations and material
- image source slides: https://docs.google.com/presentation/d/1nOfJC8rJqRCVYKP9TAsIYllkaUD-L-47eB_87EEdcGA/edit?usp=sharing
## Decisions
- We will use the new material for the November 2025 instance of the course
- The Title! (Running jobs on clusters)
## Working repo, including previously-presented material
* Current repo: https://github.com/UPPMAX/NAISS_Slurm
* :heavy_check_mark: Diana working on configuring it as a GitHub Pages site using mkdocs (readthedocs theme)
* rendered page: https://uppmax.github.io/NAISS_Slurm/
* UPPMAX and LUNARC: https://github.com/UPPMAX/NAISS_Slurm/tree/main/presentations (to be converted to .md format)
* HPC2N: [Batch system documentation (slurm) on our documentation pages](https://docs.hpc2n.umu.se/documentation/batchsystem/intro/) and [batch intro material from our Kebnekaise intro course](https://hpc2n.github.io/intro-course/batch/)
## Aim
- two workshops of 3h each, covering beginner-intermediate and intermediate-advanced topics
- ask participants for feedback on what they wish to know more about after the course
## Workshop title:
Discussion over!
- :heavy_check_mark: Running jobs on clusters
## Headings for the course webpage - Intro (3h)
- Home
- Introduction to the course
- Practical details: Login info, project number, project storage
- Prerequisites
- Schedule
- Introduction to clusters (Rebecca)
- Example 1: need to replace "slave" with "worker", may need to replace "master" with "manager"
- *Rebecca* to redraw schematic based on JH's HPC schematic, but update with more cores, nodes with thin and fat memory, GPU nodes (8, 4, 4, + FE)
- Batch system concepts (Joachim)
- Introduction to Slurm (Birgitte)
- Partitions (wherever it fits, also mention if they are required or not)
- Sbatch options (start off with simple script)
- sbatch / scancel / squeue / sinfo (see the command sketch at the end of this section)
- Sample job scripts
- Simple script / serial? (Rebecca)
- OpenMP and MPI (Joachim)
- Memory-intensive jobs or I/O intensive (Diana)
- Running on GPUs (cluster-specific tabs) (Birgitte)
- Job monitoring and efficiency
- sinfo / sacct (Diana)
- sacct -l -j JOBID -o jobname,NTasks,nodelist,MaxRSS,MaxVMSize | less -S
- job-usage (HPC2N)
    - jobstats and finishedjobinfo (UPPMAX) (Diana)
- https://www.c3se.chalmers.se/documentation/submitting_jobs/monitoring/ (job_stats.py)
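
For the sbatch/scancel/squeue/sinfo and monitoring bullets above, the basic command flow could be demoed roughly like this (the job ID is hypothetical):

```bash
sbatch job.sh            # submit a job script; prints "Submitted batch job <jobid>"
squeue -u $USER          # list your own pending/running jobs
sinfo                    # show partitions and node states
scancel 123456           # cancel a job by ID (hypothetical ID)

# After the job has finished, inspect resource usage:
sacct -l -j 123456 -o jobname,NTasks,nodelist,MaxRSS,MaxVMSize | less -S
```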
## To be covered in the workshop
Proposed:
### Beginner (3h)
* discuss cluster architecture and login/compute nodes
* explain difference between front-end/login node and back-end/compute nodes
* explain what cores and nodes are (+ schematics)
* memory of different nodes and local disks?
* `sbatch` options for CPU job scripts
* proj. number, time, partitions, ...
* sample job scripts for
* I/O intensive jobs
* OpenMP and MPI jobs
* job arrays
* simple example for task farming (job array and task farm sketches follow this list)
* section on memory-hungry jobs and ways of increasing the memory per task (either via asking for a fatter node or increasing the memory/task)
* running on GPUs
* job monitoring / efficiency
* job-usage: https://hpc2n.github.io/intro-course/images/job-usage.png
* [job-usage command](https://docs.hpc2n.umu.se/documentation/batchsystem/basic_commands/#job-usage__-__get__url__to__see__details__of__your__job__not__a__slurm__command)
* jobstats
* to-do: ask Alex about job efficiency monitoring UI on LUNARC
* `sacct -l -j JOBID -o jobname,NTasks,nodelist,MaxRSS,MaxVMSize | less -S`
* run the executable using different numbers of cores/nodes - an intro to finding optimal SBATCH options for a particular simulation
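
Minimal sketches for the job array and task farming bullets above; the project number is a placeholder and the input file names are hypothetical:

```bash
#!/bin/bash
# Job array sketch: the same script runs once per array task ID
#SBATCH -A naiss2025-22-934    # placeholder project number
#SBATCH -t 00:30:00
#SBATCH --array=1-10           # spawns tasks with SLURM_ARRAY_TASK_ID = 1..10

./my_program input_"${SLURM_ARRAY_TASK_ID}".dat
```

And a simple task farm using job steps within a single allocation:

```bash
#!/bin/bash
# Task farm sketch: independent job steps share one 4-task allocation
#SBATCH -A naiss2025-22-934    # placeholder project number
#SBATCH -t 00:30:00
#SBATCH -n 4

for i in 1 2 3 4; do
    # one task per step, started in the background
    # (--exact in recent Slurm; older versions used --exclusive here)
    srun -n 1 --exact ./my_program input_"$i".dat &
done
wait   # wait for all steps to finish before the job ends
```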
**Hands-ons:**
* demo exercises on a specific cluster and have tabs for the others so that participants may choose
### Intermediate (2-3h):
* job efficiency
* job dependency
* task farming
* dask
* job arrays
* Slurm task farm
* hyperqueue
* checkpointing (save the state of a job; for very long jobs)
* run the executable using different numbers of cores/nodes - in more depth
#### Note
Diana: My wish is to have tabs in the course material for different clusters under NAISS for the parts which are not identical (e.g. partition names, qos, scripts for job monitoring, etc.)
## Format
* presentations, mkdocs, or something else?
* [suggested mkdocs template](https://uppmax.github.io/naiss_course_template/)
* readthedocs - DI to create repo
## SUPR course project
Resources at:
- ...
- DI to create a course project with resources at all SNIC/NAISS centers (later)
- reservation for a few nodes at different centers
## Dates, duration, and frequency
* aim for fall 2025 as a first instance