SLURM
This page lists the commonly used commands for SLURM. A few advanced ones are included as well; as you start making significant use of the cluster, you'll find those advanced commands essential!
Get documentation on a command:
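Each SLURM client tool ships with a manual page and a built-in usage summary. Using `sbatch` as the example command:

```shell
# Full manual page for a command:
man sbatch

# Brief usage summary printed to the terminal:
sbatch --help
```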
Try the following commands:
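A sketch of the basic command set (the script name `myscript.sh` and `<jobid>` are placeholders; substitute your own):

```shell
sinfo                 # view information about nodes and partitions
squeue                # view the job queue
squeue -u $USER       # view only your own jobs
sbatch myscript.sh    # submit a batch script
scancel <jobid>       # cancel a job
sacct                 # view accounting data for completed jobs
```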
The following example script specifies a partition, time limit, memory allocation, and number of cores. All your scripts should specify values for these four parameters. You can also set additional parameters as shown, such as the job name and output file. This script performs a simple task: it generates a file of random numbers and then sorts it. A detailed explanation of the script is available here.
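A minimal sketch of such a script. The partition name (`general`), job name, and file names here are assumptions; substitute values appropriate for your cluster:

```shell
#!/bin/bash
#SBATCH --partition=general        # partition (queue) -- assumed name; check sinfo
#SBATCH --time=0-00:10             # time limit (D-HH:MM)
#SBATCH --mem=100                  # memory per node, in MB
#SBATCH --cpus-per-task=1          # number of cores
#SBATCH --job-name=randsort        # job name
#SBATCH --output=randsort_%j.out   # stdout file (%j expands to the job ID)

# Generate a file of 1000 random numbers, then sort it numerically.
for i in $(seq 1000); do echo $RANDOM; done > rand.txt
sort -n rand.txt > rand_sorted.txt
```

Note that the `#SBATCH` lines are shell comments, so the script also runs as an ordinary shell script; SLURM reads the directives only at submission time.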
Now you can submit your job with the command:
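Assuming the script above was saved as `myscript.sh` (a placeholder name):

```shell
sbatch myscript.sh
```

On success, `sbatch` prints the ID assigned to the job, which you use with the other commands below.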
If you want to test your job and find out when your job is estimated to run use (note this does not actually submit the job):
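The `--test-only` flag validates the script and reports when the job would be expected to start, without actually submitting it:

```shell
sbatch --test-only myscript.sh
```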
Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.
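One way to pull this out of the accounting database is `sacct` with a custom field list (`<jobid>` is a placeholder; the fields shown are a common selection, not an exhaustive list):

```shell
sacct -j <jobid> --format=JobID,JobName,Partition,Elapsed,MaxRSS,State
```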
As shown in the commands above, it's easy to refer to one job by its job ID, or to all your jobs by your username. What if you want to refer to a subset of your jobs? The answer is to submit the set as a job array. You can then use the job array ID to refer to the whole set when running SLURM commands. See the following excellent resources for further information:
Running Jobs: Job Arrays
SLURM job arrays
e.g.
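A sketch of submitting and inspecting a job array (script name and output pattern are assumptions):

```shell
# Submit myscript.sh as a 10-task array.
# In the output name, %A expands to the array's master job ID, %a to the task index.
sbatch --array=1-10 --output=myjob_%A_%a.out myscript.sh

# Inside the script, each task can read its own index from:
#   $SLURM_ARRAY_TASK_ID
```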
The following commands work for individual jobs and for job arrays, and allow easy manipulation of large numbers of jobs. You can combine these commands with the parameters shown above for great flexibility and precision in job control. (Note that each of these commands is entered on a single line.)
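For instance, `scontrol` and `scancel` accept a plain job ID, an array ID, or an array expression (the IDs below are placeholders):

```shell
scontrol hold <jobid>        # hold a pending job (or an entire array)
scontrol release <jobid>     # release it again
scancel 1234_[5-10]          # cancel only tasks 5-10 of array 1234
scancel -u $USER -t PENDING  # cancel all of your pending jobs
```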
Suspend all running jobs for a user (takes into account job arrays):
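A sketch of one way to do this, by piping job IDs from `squeue` into `scontrol` (`<username>` is a placeholder):

```shell
# -h suppresses the header, -o %A prints only the job ID
# (for array tasks this is the per-task ID), -t R selects running jobs.
squeue -u <username> -h -o %A -t R | xargs -n 1 scontrol suspend

# The matching resume, selecting suspended jobs with -t S:
squeue -u <username> -h -o %A -t S | xargs -n 1 scontrol resume
```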
Note that while node 03 has free cores, all of its memory is in use, so those cores are necessarily idle.
Node 02 has a little free memory but all the cores are in use.
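One way to see this per-node breakdown yourself is `sinfo` with a custom format string:

```shell
# Node name, CPU usage, configured memory (MB), and free memory (MB).
# In the %C column, A/I/O/T = allocated/idle/other/total CPUs.
sinfo -N -o "%N %C %m %e"
```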
The scheduler aims for 100% utilization, but jobs are stochastic: they begin and end at different times, releasing and requesting unpredictable amounts of CPU and RAM.