# SLURM at HPCFS

## Partition `rome`

20 compute nodes with Rocky Linux 8.5 have the following characteristics:

- 2 x AMD EPYC 7402 24-Core Processor in multithreaded (SMT) configuration, totaling 96 logical CPUs (48 physical cores) per node. Note that MPI jobs should use `--ntasks-per-core=1`.
- max `--mem=125G` RAM can be used per node
- max `--time=72:0:0` limit can be used per job. Longer jobs can use `--signal=USR1` or similar to start a graceful shutdown and restart, as shown in the sketch below.
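For jobs that need more than the 72-hour limit, SLURM can deliver a warning signal shortly before the limit so that the application can write a checkpoint and the job can be resubmitted. The following is only a minimal sketch of this pattern, assuming a hypothetical application `./myprog` that checkpoints itself when it receives `SIGUSR1`; adjust the resources, signal timing, and resubmission step to your own workflow.

~~~bash
#!/bin/bash
#SBATCH --job-name=long-run
#SBATCH --partition=rome
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --ntasks-per-core=1
#SBATCH --mem=125G
#SBATCH --time=72:0:0
# send SIGUSR1 to the batch shell (B:) 10 minutes before the time limit
#SBATCH --signal=B:USR1@600

requeue_handler() {
    echo "Time limit approaching: asking the application to checkpoint"
    # forward the signal to the tasks of the running step (step 0)
    scancel --signal=USR1 "${SLURM_JOB_ID}.0"
    wait            # let the step finish writing its checkpoint
    sbatch "$0"     # resubmit this script to continue from the checkpoint
}
trap requeue_handler USR1

# run the (hypothetical) application in the background so the trap can fire
srun ./myprog &
wait
~~~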
## Partition `haswell`

20 compute nodes with Rocky Linux 8.5:

- 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz in multithreaded configuration, totaling 48 logical CPUs (24 physical cores) per node
- max `--mem=60G` can be used per node

## Interactive X11 jobs

Software rendering can be used on compute nodes.

#### MATLAB example

Limit the job to 4 hours and allocate a whole node (96 tasks) with 50 GB of memory:

~~~ bash
[leon@viz ~]$ salloc --nodes=1 --ntasks=96 \
    --partition=rome --mem=50G --time=4:0:0
salloc: Granted job allocation 57759
salloc: Waiting for resource configuration
salloc: Nodes cn41 are ready for job
[leon@viz ~]$ ssh -X cn41
Warning: Permanently added 'cn41,10.0.2.141' (ECDSA) to the list of known hosts.
Last login: Wed Oct 20 11:21:27 2021 from 10.0.2.99
[leon@cn41 ~]$ module load MATLAB
[leon@cn41 ~]$ matlab
MATLAB is selecting SOFTWARE OPENGL rendering.
[leon@cn41 ~]$ exit
logout
Connection to cn41 closed.
[leon@viz ~]$ exit
salloc: Relinquishing job allocation 57759
[leon@viz ~]$
~~~

MATLAB is usually run single-threaded rather than in parallel, so the following is the recommended way to start MATLAB for up to 4 hours:

~~~bash
[leon@viz ~]$ ml MATLAB
[leon@viz ~]$ srun --nodes=1 --ntasks=2 --ntasks-per-core=2 \
    --mem=8G --partition=rome --time=4:0:0 --x11 --pty matlab
~~~

#### Jupyter notebook example

~~~ bash
ml Python
python3 -m venv jupyter
jupyter/bin/pip install jupyter
env --unset=LD_PRELOAD --unset=SESSION_MANAGER \
    srun --partition=haswell --mem=0 --time=2:00:00 --pty \
    jupyter/bin/jupyter-notebook --no-browser --ip=0.0.0.0
~~~

Under the NoMachine desktop, use a browser to open the link reported by the Jupyter server on the compute node, such as:

~~~
Jupyter Server 2.15.0 is running at:
http://cn72.hpc:8888/tree?token=1fbc9e418c3c40341c9e48f41c77631ace9c91e89cd08e58
~~~

:::info
:zap: Note that `env --unset=LD_PRELOAD` reduces meaningless warnings in the logfiles when submitting from display nodes. It prevents forwarding the VirtualGL interposer libraries used for virtual hardware graphics rendering, so that you do not see

    ERROR: ld.so: object '/usr/NX/scripts/vgl/librrfaker.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.

messages when redirected to a compute node. The VGL libraries are also installed on the compute nodes, so the warnings do not appear even when `LD_PRELOAD` is not unset at submission.
:::

### Interactive shell

In the following example we additionally set the bash shell timeout to 600 seconds (10 minutes), which automatically logs out from the compute node if no command is typed for that long. The maximum time for the interactive job is set to 4 hours.

~~~bash
[leon@viz ~]$ env --unset=LD_PRELOAD TMOUT=600 \
    srun --mem=100G --time=4:0:0 -p rome --x11 --pty bash -i
[leon@cn41 ~]$ timed out waiting for input: auto-logout
~~~

## Useful SLURM job information commands

List detailed information for a job (useful for troubleshooting):

    scontrol show jobid -dd <jobid>

List status info for a currently running job:

    sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps

Once your job has completed, you can get additional information that was not available during the run, such as run time and memory used. To get statistics on a completed job by job ID:

    sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

## Running an R script with Rmpi under SLURM

Rmpi is demonstrated with the following R example:

~~~ R
library(Rmpi)

size <- Rmpi::mpi.comm.size(0)
rank <- Rmpi::mpi.comm.rank(0)
host <- Rmpi::mpi.get.processor.name()

if (rank == 0) {
    print('I am the master')
} else {
    print(paste("I am", rank, "of", size, "running on", host))
}
~~~

and the following sbatch script:

~~~ bash
#!/bin/bash
#SBATCH --export=ALL,LD_PRELOAD=
#SBATCH --job-name MyR
#SBATCH --partition=haswell --mem=24GB --time=02:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24

module load R
srun Rscript rmpi-test.R
~~~
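To run this example, save the R code as `rmpi-test.R` (the file name used in the `srun Rscript` line) and submit the batch script with `sbatch`. The snippet below is a brief usage sketch; the batch script name `rmpi-test.sh` is an assumption, and the output file follows SLURM's default `slurm-<jobid>.out` naming.

~~~bash
sbatch rmpi-test.sh      # submit the job script (assumed file name)
squeue -u $USER          # check that the job is pending or running
# after completion, inspect the output written by the R tasks
cat slurm-<jobid>.out
~~~

###### tags: `HPCFS` `SLURM`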