SLURM at HPCFS

Partition rome

20 compute nodes with Rocky Linux 8.5 have the following characteristics:

  • 2 x AMD EPYC 7402 24-Core Processor in multithreaded configuration totaling 96 compute cores per node. Note that for MPI jobs --ntasks-per-core=1 should be used.
  • max --mem=125G RAM can be used per node
  • max --time=72:0:0 limit can be used per job. Longer jobs can use --signal=USR1 or similar to trigger a graceful shutdown and restart (see the sketch after this list).
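
As a sketch (the job name, task layout, signal lead time and executable name are placeholders), an MPI batch job that uses these limits could be submitted with a script like:

#!/bin/bash
#SBATCH --job-name=mpi-job            # illustrative job name
#SBATCH --partition=rome
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48          # one MPI rank per physical core
#SBATCH --ntasks-per-core=1
#SBATCH --mem=125G
#SBATCH --time=72:0:0
#SBATCH --signal=USR1@300             # deliver SIGUSR1 about 5 minutes before the time limit
srun ./my_mpi_program                 # replace with your MPI executable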

Partition haswell

20 compute nodes with Rocky Linux 8.5 have the following characteristics:

  • 2 x Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz in multithreaded configuration totaling 48 compute cores per node (24 cores × 2 threads).
  • max --mem=60G RAM can be used per node (see the example request below)
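
A minimal batch request that stays within these limits could look like the following sketch (wall time, task layout and executable name are illustrative):

#!/bin/bash
#SBATCH --partition=haswell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24          # one task per physical core
#SBATCH --mem=60G
#SBATCH --time=12:0:0                 # illustrative wall time
srun ./my_program                     # replace with your executable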

Interactive X11 jobs

Software rendering can be used on compute nodes.

MATLAB example

Limit the job to 4 hours and allocate a full rome node (96 tasks, 50 GB of RAM):

[leon@viz ~]$ salloc --nodes=1 --ntasks=96 \
--partition=rome --mem=50G --time=4:00:00
salloc: Granted job allocation 57759
salloc: Waiting for resource configuration
salloc: Nodes cn41 are ready for job
[leon@viz ~]$ ssh -X cn41
Warning: Permanently added 'cn41,10.0.2.141' (ECDSA) to the list of known hosts.
Last login: Wed Oct 20 11:21:27 2021 from 10.0.2.99
[leon@cn41 ~]$ module load MATLAB
[leon@cn41 ~]$ matlab
MATLAB is selecting SOFTWARE OPENGL rendering.
[leon@cn41 ~]$ exit
logout
Connection to cn41 closed.
[leon@viz ~]$ exit
salloc: Relinquishing job allocation 57759
[leon@viz ~]$

MATLAB is usually run single-threaded rather than in parallel, so the following is the recommended way to start MATLAB for a maximum of 4 hours:

[leon@viz ~]$ ml MATLAB    
[leon@viz ~]$ srun --nodes=1 --ntasks=2 \
--ntasks-per-core=2 --mem=8G --partition=rome --time=4:0:0 --x11 --pty matlab

Note that prefixing the command with env --unset=LD_PRELOAD reduces meaningless warnings in the logfiles when submitting from the display nodes. It prevents forwarding of the VirtualGL interposer libraries used for virtual hardware graphics rendering, so that you do not see

ERROR: ld.so: object '/usr/NX/scripts/vgl/librrfaker.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

messages when the session is redirected to a compute node. The VGL libraries are also installed on the compute nodes, so the warnings are suppressed even if LD_PRELOAD is not unset at submission.
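
For example (a sketch combining the srun command above with the env prefix used in the interactive-shell example below), MATLAB can be started as:

[leon@viz ~]$ ml MATLAB
[leon@viz ~]$ env --unset=LD_PRELOAD srun --nodes=1 --ntasks=2 \
--ntasks-per-core=2 --mem=8G --partition=rome --time=4:0:0 --x11 --pty matlab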

Interactive shell

In the following example we additionally set the bash shell timeout to 600 seconds (10 minutes), which will automatically log out from the compute node if no command is typed for that time. The maximum time for the interactive job is set to 4 hours.

[leon@viz ~]$ env --unset=LD_PRELOAD TMOUT=600 \
srun --mem=100G --time=4:0:0 -p rome --x11 --pty bash -i
[leon@cn41 ~]$ timed out waiting for input: auto-logout

Useful SLURM job information commands

List detailed information for a job (useful for troubleshooting):

scontrol show jobid -dd <jobid>

List status info for a currently running job:

sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps

Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.
To get statistics on completed jobs by jobID:

sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

Running an R script with Rmpi under SLURM

Rmpi is demonstrated with the following R example:

library(Rmpi)

size <- Rmpi::mpi.comm.size(0)
rank <- Rmpi::mpi.comm.rank(0)
host <- Rmpi::mpi.get.processor.name()

if (rank == 0){
        print('I am the master')
} else {
        print(paste("I am", rank, "of", size, "running on", host))
}

# Cleanly finalize MPI and quit R at the end of the script
mpi.quit()

and the corresponding sbatch script:

#!/bin/bash
#SBATCH --export=ALL,LD_PRELOAD=
#SBATCH --job-name MyR
#SBATCH --partition=haswell --mem=24GB --time=02:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
module load R
srun  Rscript rmpi-test.R
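
Assuming the script above is saved as, for example, rmpi-test.sbatch (the filename is illustrative), submit it and check the queue with:

sbatch rmpi-test.sbatch
squeue -u $USER
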
tags: HPCFS SLURM