# Lecture 1.1

This document: https://hackmd.io/@coderefinery/TTT4HPC-Lecture-1-1

Schedule:

- Intro (5 min)
- Slurm basics (10 min)
- Number of cores (15 min)
- Memory (15 min)
- Buffer/Q&A (5 min)
- After lunch: exercises

**Learning outcomes**

- Can check what resources their HPC job uses and request appropriate resources
- Understands the meaning of different resource types
- Can anticipate load on the file system and store files in an appropriate format

## Slurm basics

**What is Slurm?**

- provides a framework for starting, executing, and monitoring jobs on the compute nodes
- schedules the jobs on the clusters
- allocates the required resources (compute cores or nodes, memory)
- free, open source, lightweight -> popular
- available at most supercomputing centres
- https://slurm.schedmd.com

**How to submit a job to the Slurm manager?**

`sbatch <job script>`

Example: `sbatch job_script.sh`. The job script contains the details of the simulation to be performed and the resources to be allocated for it.

### SBATCH flags

`-A <project>`
The project (account) to charge for the job.

`-p <partition>`
The partition (queue) to submit the job to.

`-t <max walltime>`
Maximum wall time for the job. The general format is `-t days-hours:minutes:seconds`.
- while testing new software or inputs, use a short time limit, 10 min - 1 h
- if you have an idea of how long the program will take to run, overbook by 25-50%
- if you have no idea how long the program will take to run, you may book a long time, e.g. `2-00:00:00`

`-n <number of tasks>`
Number of tasks, typically the number of cores.

`-N <number of nodes>`
Number of nodes to allocate.

`-J <job name>`
A name for the job, shown e.g. in the `squeue` output.

`--output=slurm-%j.out`
File for the standard output; `%j` expands to the job ID.

`--error=slurm-%j.err`
File for the standard error.

`--mail-type=BEGIN,END,FAIL`
Notifies the user by email when a certain event occurs: the job starts, ends, or fails.
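The flags above are usually collected as `#SBATCH` lines at the top of a job script; a minimal sketch (the project name, partition, and executable are placeholders, not part of the lecture's examples):

```shell
#!/bin/bash
#SBATCH -A <project>            # project/account to charge
#SBATCH -p <partition>          # partition (queue) to submit to
#SBATCH -t 00-01:00:00          # max wall time: 1 h
#SBATCH -n 4                    # number of tasks, here 4 cores
#SBATCH -J mytest               # job name shown in squeue
#SBATCH --output=slurm-%j.out   # stdout; %j expands to the job ID
#SBATCH --error=slurm-%j.err    # stderr
#SBATCH --mail-type=BEGIN,END,FAIL

srun ./my_program               # placeholder for the actual program
```

Submit it with `sbatch jobscript.sh`; Slurm reads the `#SBATCH` lines (which bash ignores as comments) and queues the job with those resources.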
**Sample MPI job script**

```bash
#!/bin/bash
#SBATCH -J dhcpNd
#SBATCH -A naiss2024-22-49
#SBATCH -t 00-07:00:00
#SBATCH -p node
#SBATCH -N 4

module load RSPt/2023-10-04
export RSPT_SCRATCH=$SNIC_TMP

srun -n 80 rspt
```

The example above is for an RSPt simulation running on 80 cores spread over 4 nodes for a maximum of 7 h.

**Sample OpenMP job script**

```bash
#!/bin/bash
#SBATCH -A naiss2024-22-49
#SBATCH --exclusive
#SBATCH -t 01:00:00
#SBATCH -p node
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20

module load uppasd
export OMP_NUM_THREADS=20

sd > out.log
```

**OpenMP or MPI code?**

- read the software documentation
- search the source code for **OMP** and/or **MPI**

**Slurm commands**

`sbatch`
Submits a job script to the queue.

`squeue`
Lists jobs in the queue.
- `squeue --me` lists the running and pending jobs of the current user
- `squeue -u <username>`
- `squeue -A <project>`
- `squeue -u <username> --state=running`
- `squeue -u <username> --state=pending`

`scancel`
- `scancel <jobid>` cancels the job with `<jobid>`
- `scancel -u <username>` cancels all the jobs of user `<username>` (your own jobs)
- `scancel --state=PENDING --user <username>` cancels pending jobs
- `scancel --state=RUNNING --user <username>` cancels running jobs
- `scancel --name <jobname>` cancels jobs with a given `<jobname>`
- `-i` asks for confirmation

`sinfo`
Lists the partitions and the state of the nodes. Example: `sinfo` or `sinfo -p <partition>`.

`scontrol`
Example: `scontrol show job <jobid>` lists all the Slurm parameters of a job: number of cores and nodes, partition, submit directory, ...

`scontrol` can be used to modify the job details after the job has been submitted, though not all SBATCH parameters may be modified by regular users. Example: `scontrol update JobID=<jobid> TimeLimit=0-01:00:00` decreases the wall time to 1 h.
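The source-code search for **OMP**/**MPI** markers can be done with `grep`; a small sketch (the `src/` path is a placeholder for the software's source tree):

```shell
# -r: recurse into the tree, -i: case-insensitive, -l: print file names only
grep -ril "omp"  src/   # OpenMP code: pragmas such as '#pragma omp parallel'
grep -ril "mpi_" src/   # MPI code: calls such as MPI_Init, MPI_Comm_rank
```

If the first search matches, an `OMP_NUM_THREADS`-style script is appropriate; if the second matches, launch the program with `srun -n <tasks>`. Some codes are hybrid and match both.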
`salloc`
- allocates resources for an interactive session
- handy for debugging a code or a script, or for using programs with a graphical user interface
- useful together with the `--begin=<time>` flag

Example: `salloc -A naiss2024-22-49 -n 20 -t 03:00:00 --begin=2024-04-17T09:00:00` asks for 20 cores for 3 h in an interactive session, which starts no earlier than April 17th at 9 o'clock.

**Parameters in the job script or the command line?**

Flags given on the command line override the corresponding `#SBATCH` lines in the job script. Example: `sbatch -p devel -t 00:15:00 jobscript.sh` submits the job to the `devel` partition with a 15-minute time limit, regardless of what the script specifies.
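A common pattern that follows: keep production defaults in the script and override them on the command line for quick tests. A sketch (the project name, partition names, and executable are illustrative):

```shell
#!/bin/bash
# jobscript.sh -- production defaults
#SBATCH -A <project>
#SBATCH -p node          # production partition
#SBATCH -t 04:00:00      # production wall time
#SBATCH -n 20

srun ./my_program        # placeholder for the actual program
```

For a short test, submit the same script as `sbatch -p devel -t 00:15:00 jobscript.sh`: the command-line flags take precedence over the `#SBATCH` lines, so no editing of the script is needed.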