[TOC]
## Useful SLURM commands
### View information about SLURM nodes and partitions: `sinfo`
```bash!
BH ps_app_team@bcm-head-01:~/demo$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 15 down* dgx-[006-020]
defq* up infinite 1 mix dgx-001
defq* up infinite 4 idle dgx-[002-005]
```
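Since several nodes show up as `down*`, it can be handy to also list why they are unavailable. A small extra example (not part of the original session):
```bash
# Show the reason, user, and timestamp for every node that is down, drained, or failing
sinfo -R
```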
### Detailed info about your SLURM jobs: `squeue --me`
```bash!
BH ps_app_team@bcm-head-01:~/demo$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1579 defq bash ps_app_t R 41:29 1 dgx-001
```
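The default columns truncate long job and user names. If you want wider, custom columns, `squeue` accepts a format string; the fields and widths below are just an example:
```bash
# JOBID, partition, job name, state, elapsed time, node count, nodelist/reason
squeue --me -o "%.10A %.9P %.30j %.8T %.10M %.5D %R"
```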
### Even more detailed info about your SLURM jobs: `scontrol show job`
Take the JOBID from the `squeue` output above (1579 in this example) and pass it to `scontrol`:
```bash!
scontrol show job 1579
JobId=1579 JobName=bash
UserId=ps_app_team(1011) GroupId=ps_app_team(1014) MCS_label=N/A
Priority=4294901510 Nice=0 Account=safana-nlp QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:42:52 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-06-08T13:24:30 EligibleTime=2023-06-08T13:24:30
AccrueTime=Unknown
StartTime=2023-06-08T13:24:30 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-08T13:24:30 Scheduler=Main
Partition=defq AllocNode:Sid=bcm-head-01:587789
ReqNodeList=(null) ExcNodeList=(null)
NodeList=dgx-001
BatchHost=dgx-001
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,node=1,billing=1,gres/gpu=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/home/ps_app_team
Power=
TresPerNode=gres:gpu:2
```
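`scontrol` can also show details about the node your job landed on (allocated TRES, node state, and the reason if it is drained). A quick example using the node from the output above:
```bash
scontrol show node dgx-001
```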
### Terminate a specific job: `scancel`
`scancel <YOUR_JOBID>`
`scancel 1579`
You can create an alias to get your last job id:
```
alias my_last_jobid="squeue --me -o %A | tail -n 1"
```
If you put the text above into `~/.bashrc`, this alias will become available in all your terminal sessions. After that, you can cancel the last job:
```
scancel $(my_last_jobid)
```
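If you need to clean up everything at once, `scancel` can also filter by user or job state. Use with care:
```bash
# Cancel ALL of your jobs
scancel -u $USER
# Cancel only your pending (not yet running) jobs
scancel -u $USER --state=PENDING
```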
### Show recent job history for all users: `sacct -a`
```bash!
ps_app_team@bcm-head-01:~/demo$ sacct -a
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1494 bash defq safana-nlp 1 COMPLETED 0:0
1517 bash defq safana-nlp 1 FAILED 0:2
1518 bash defq safana-nlp 1 FAILED 6:0
1519 bash defq safana-nlp 2 COMPLETED 0:0
1520 bash defq safana-nlp 2 COMPLETED 0:0
```
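By default `sacct` only covers the current day with a fixed set of columns. To look further back or pick the columns yourself, something like the following works (the date and field list are only examples):
```bash
sacct -u $USER --starttime=2023-06-01 \
      --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed,ExitCode
```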
## Submitting a job: multi-GPU & converting srun → sbatch
### sbatch
1. Create a file named `multi-gpu.sh` with the following contents:
```bash!
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:8
#SBATCH --job-name=nvidia_smi
#SBATCH --output=out_multi-gpu_%j.txt
srun nvidia-smi -L
```
Make the script executable and submit the job:
```bash
chmod +x multi-gpu.sh
sbatch multi-gpu.sh
```
Check the output file:
`cat out_multi-gpu*`
```bash!
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-9479cbc2-c65a-9393-264f-e67a50300019)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-eb7d783b-2b63-acb4-663e-e91e1bb395ec)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-a81383d8-f3c8-17bf-a3e3-aac6d0bf6e10)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-62b5120e-e1c1-fd37-31a4-773a33bdca6a)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-2873df40-d01e-bb33-6862-6a9fc6f8cdac)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-111701b5-8b97-e177-9aba-bd1778f8b317)
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-be59c44b-f01c-682e-c8b5-5ed5da0c7a8c)
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-0e01616e-2671-8bd9-c0a3-c3da3f37ead9)
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-51d6615a-1789-bdc5-2dd3-c95d1acc8d44)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-394dab2c-5f8e-0273-06ed-049ac172f5b4)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-a57b3089-c861-9a45-39bf-77b77b2a10a5)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-9c1cf23f-c433-7514-db9d-eb261fcde228)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-097e1798-0dc7-9de5-8031-74b2998bb94a)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-73db5afe-1811-5795-bd21-9fcd00ecee53)
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-21de40f5-855f-91f8-254f-8ee9a3585ac3)
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-ae76a7ac-456b-e711-e1c2-560376a94feb)
```
### srun
`srun --nodes=2 --gres=gpu:8 --ntasks-per-node=1 --cpus-per-task=4 nvidia-smi -L `
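The same allocation flags can be combined with a job name and a time limit, which makes the job easier to find in `squeue` and keeps forgotten jobs from running forever; the values below are only examples:
```bash
srun --nodes=2 --gres=gpu:8 --ntasks-per-node=1 --cpus-per-task=4 \
     --job-name=nvidia_smi --time=00:10:00 nvidia-smi -L
```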
## Pyxis — use containers as an environment for the job
```bash!
srun -N 1 --gres=gpu:8 --container-image="nvcr.io#nvidia/tensorflow:23.07-tf2-py3" \
--container-mount-home --pty bash
```
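`--container-mount-home` only mounts your home directory. If your data lives elsewhere, Pyxis can bind-mount additional host paths into the container via `--container-mounts`; the paths below are placeholders:
```bash!
srun -N 1 --gres=gpu:8 --container-image="nvcr.io#nvidia/tensorflow:23.07-tf2-py3" \
--container-mount-home --container-mounts=/raid/datasets:/datasets --pty bash
```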
## Pyxis + ENROOT — import the container to speed up the start
1. Load SLURM module
`module load slurm`
1. Start an auxiliary interactive bash session
`srun -N 1 --pty bash`
1. Download a recent PyTorch image from NGC. This command creates the file `nvidia+pytorch+23.07-py3.sqsh` in the current directory:
```bash!
cd ~
enroot import "docker://nvcr.io#nvidia/pytorch:23.07-py3"
```
1. Exit the interactive session with Ctrl+d
1. Now you can run an interactive session inside the PyTorch container with 1 GPU
```bash!
srun -N 1 --gpus-per-node 1 --container-image ~/nvidia+pytorch+23.07-py3.sqsh --container-mount-home --pty bash
```
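The same imported image also works for non-interactive, one-off commands. For example, a quick sanity check of the PyTorch/CUDA setup (a minimal sketch reusing the flags above):
```bash!
srun -N 1 --gpus-per-node 1 --container-image ~/nvidia+pytorch+23.07-py3.sqsh \
--container-mount-home \
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```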
## ENROOT — how to add new packages to an image
To add new layers to an ENROOT image, you have to run it manually, without Pyxis. The example below installs the `rolldice` package.
1. Start an interactive session on the machine with ENROOT
```bash!
srun -N 1 --gpus-per-node 1 --pty bash
```
2. Now inside the interactive job, create a container from the image
```bash!
enroot create --name pytorch ~/nvidia+pytorch+23.07-py3.sqsh
```
3. Start the interactive job inside the container
```bash!
enroot start --root --rw pytorch bash
```
4. Install everything you need in an interactive mode
```bash!
apt update && apt install rolldice
```
5. Test the installed packages
```bash!
/usr/games/rolldice 6
```
6. Exit the interactive tty from the container with `Ctrl+D`
7. Export the container
```bash!
enroot export --output ~/nvidia+pytorch+23.07-py3+dice.sqsh pytorch
```
8. Exit the interactive srun session with `Ctrl+D`.
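The exported `.sqsh` file can now be used with Pyxis exactly like the original image. A quick way to verify that the new package survived the export (a sketch, reusing the flags from the sections above):
```bash!
srun -N 1 --gpus-per-node 1 --container-image ~/nvidia+pytorch+23.07-py3+dice.sqsh \
--container-mount-home /usr/games/rolldice 6
```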
## NCCL tests — an example of multi-node container run, testing the interconnect
1. Complete all the steps from the [container import section above](#Pyxis--ENROOT-—-import-the-container-to-speed-up-the-start)
1. Run an interactive session inside the PyTorch container with 1 GPU
```bash!
srun -N 1 --gpus-per-node 1 --container-image ~/nvidia+pytorch+23.07-py3.sqsh --container-mount-home --pty bash
```
1. Inside the container, go to your home directory:
`cd ~`
1. Create the directory and clone the NCCL-test repo:
```bash
mkdir nccl_test
cd nccl_test
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
```
5. Compile the executables with MPI support:
```bash
make clean
make MPI=1 MPI_HOME=/usr/local/mpi -j 20
```
6. Try running one example. It is expected to fail with an error: avoid combining interactive jobs with MPI.
`./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8`
7. Exit the interactive job with `Ctrl+D`.
8. Run the 1-node all-reduce test:
```bash!
srun -N 1 --gpus-per-node 8 --exclusive --mpi=pmi2 \
--container-image ~/nvidia+pytorch+23.07-py3.sqsh \
--container-mount-home \
~/nccl_test/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```
9. Run the 2-node all-reduce test:
```bash!
srun -N 2 -w dgx-00[2-3] --gpus-per-node 8 --exclusive --mpi=pmi2 \
--container-image ~/nvidia+pytorch+23.07-py3.sqsh \
--container-mount-home \
~/nccl_test/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 \
| tee 2-node.log
```
10. Re-run with NCCL debug output enabled to check the communication topology:
```bash!
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH \
srun -N 2 -w dgx-00[2-3] --gpus-per-node 8 --exclusive --mpi=pmi2 \
--container-image ~/nvidia+pytorch+23.07-py3.sqsh \
--container-mount-home \
~/nccl_test/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 \
| tee 2-node-graph.log
```
11. A similar thing can be achieved via sbatch; see the session below.
```bash
BH ps_app_team@bcm-head-01:~$ cat ~/nccl_test/2-node.sbatch
```
```bash!
#!/usr/bin/env bash
#SBATCH -N 2
#SBATCH -w dgx-00[2-3]
#SBATCH --gpus-per-node 8
#SBATCH --exclusive
#SBATCH --container-image /home/ps_app_team/nvidia+pytorch+23.07-py3.sqsh
#SBATCH --container-mount-home
#SBATCH -o /home/ps_app_team/nccl_test/2-node-sbatch.log
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH \
~/nccl_test/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```
```bash
BH ps_app_team@bcm-head-01:~$ sbatch ~/nccl_test/2-node.sbatch
```
```bash
Submitted batch job 1482
```
```bash
BH ps_app_team@bcm-head-01:~$ scontrol show job 1482
```
```bash
JobId=1482 JobName=2-node.sbatch
UserId=ps_app_team(1011) GroupId=ps_app_team(1014) MCS_label=N/A
Priority=4294901580 Nice=0 Account=safana-nlp QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-06-07T20:31:45 EligibleTime=2023-06-07T20:31:45
AccrueTime=2023-06-07T20:31:45
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-07T20:31:47 Scheduler=Main
Partition=defq AllocNode:Sid=bcm-head-01:587789
ReqNodeList=dgx-00[2-3] ExcNodeList=(null)
NodeList=(null)
NumNodes=2-2 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=2M,node=2,billing=2,gres/gpu=16
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/home/ps_app_team/nccl_test/2-node.sbatch
WorkDir=/home/ps_app_team
StdErr=/home/ps_app_team/nccl_test/2-node-sbatch.log
StdIn=/dev/null
StdOut=/home/ps_app_team/nccl_test/2-node-sbatch.log
Power=
TresPerNode=gres:gpu:8
```
```bash
BH ps_app_team@bcm-head-01:~$ less /home/ps_app_team/nccl_test/2-node-sbatch.log
```
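nccl-tests prints a summary line with the average bus bandwidth at the end of each run, so a quick way to compare results is to grep the logs (assuming the log names and locations from the steps above):
```bash
grep -i "avg bus bandwidth" 2-node.log 2-node-graph.log ~/nccl_test/2-node-sbatch.log
```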
## tmux — how to keep a terminal session even through disconnections
1. Attach to a tmux session named “main” or create a new one with this name:
`tmux a -d -t main || tmux new -A -s main`
1. Create a new window in the tmux terminal:
`Ctrl+b, then c`
1. Switch back to window number 0:
`Ctrl+b, then 0`
1. Close the current window:
`Ctrl+d`
1. Keep all the windows running, but detach from the session:
`Ctrl+b, then d`
1. Navigate tmux terminal history:
`Ctrl+b, then [` to switch to the navigation mode
`PgUp, PgDown` to navigate
`q` to exit back to the command entering mode
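Optionally, mouse support makes scrolling and window selection easier. A small sketch, assuming tmux ≥ 2.1:
```bash
# Enable mouse support for future sessions and reload the config in the running one
echo "set -g mouse on" >> ~/.tmux.conf
tmux source-file ~/.tmux.conf
```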
## sattach — if the connection to an srun job failed, but the job is still running
If your tmux session with srun was killed, the srun interactive job may still be running. To check whether it is, use
`squeue --me -s` to show your running jobs and their steps. For me, the output is shown below.
```bash!
$ squeue --me -as
STEPID NAME PARTITION USER TIME NODELIST
79638.0 bash batch u00u5ya1 3:40 dgx1-000
79638.extern extern batch u00u5ya1 3:41 dgx1-000
```
Most probably, you will need to attach to step 0. The command to do this is
`sattach --pty jobid.stepid`. In my case `sattach --pty 79638.0`.
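If you don't want to copy the step ID by hand, the lookup can be scripted. A convenience sketch (assumes your most recent job has exactly one non-`extern` step):
```bash
# Pick the first non-extern step of your jobs and attach to it
sattach --pty $(squeue --me -s -h -o %i | grep -v extern | head -n 1)
```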
## SLURM NGC login
To enable enroot access to the NGC private registry one must:
1. Get your NGC API key from https://ngc.nvidia.com/setup/api-key
1. Create a `~/.config/enroot/` directory ([details](https://github.com/NVIDIA/enroot/blob/master/doc/cmd/import.md#description)):
`mkdir -p ~/.config/enroot/`
1. Create a `~/.config/enroot/.credentials` file and open it for editing:
`nano ~/.config/enroot/.credentials`
1. Paste the following contents, replacing `<YOUR_NGC_TOKEN>` with your actual token:
```
machine nvcr.io login $oauthtoken password <YOUR_NGC_TOKEN>
machine authn.nvidia.com login $oauthtoken password <YOUR_NGC_TOKEN>
```
5. Save the file and exit the editor: `Ctrl+O`, then `Ctrl+X`
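Since the file contains your API key, it is a good idea to make it readable only by you:
```bash
chmod 600 ~/.config/enroot/.credentials
```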
## Jupyter Lab in SLURM
Run the interactive job on 2 GPUs:
```bash!
srun -N 1 --gpus-per-node 2 --mpi=pmi2 \
--container-image ~/nvidia+pytorch+23.07-py3.sqsh \
--container-mount-home --pty \
bash -c "hostname -I && hostname && jupyter lab --ip=0.0.0.0 --port=8000 --allow-root --no-browser --NotebookApp.token='my_simple_token' --NotebookApp.allow_origin='*' --notebook-dir=/"
```
As its first line, the command will print several IP addresses. Use the first one, which should be similar to the IP of the node you are connecting to (usually the one **not** starting with 10 or 127).
Open your browser and go to `http://<that IP address>:8000`; it will redirect you to Jupyter Lab. As the token, enter `my_simple_token` from the command above. Note the port number **8000**: it is the one you specified with `--port=8000`, not the one printed by the Jupyter Lab launch script.
Change to a directory where you have write access, create a Jupyter notebook, and run it.
Inside, you can check the output of the following cells:
```python
import torch
torch.cuda.is_available() # True
torch.cuda.device_count() # 2
torch.cuda.current_device() # 0
torch.cuda.get_device_name(0) # 'NVIDIA A100-SXM4-80GB'
```
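If the node IP is not reachable directly from your laptop (this depends on your network setup), you can tunnel the port through the head node instead; `<node IP>` below is the address printed by the command above:
```bash
# Run on your laptop, then open http://localhost:8000
ssh -L 8000:<node IP>:8000 ps_app_team@bcm-head-01
```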
## PyTorch Lightning distributed training
Create a file `simple-pl.py` with the contents listed at the end of this section.
Choose the number of nodes and the number of GPUs per node; in our case, 2 nodes with 8 GPUs each. Set `--ntasks-per-node` equal to `--gpus-per-node`.
Launch srun:
```bash!
srun -N 2 -w dgx-00[4-5] --gpus-per-node 8 --ntasks-per-node 8 --exclusive --mpi=pmi2 \
--container-image ~/nvidia+pytorch+23.07-py3.sqsh \
--container-mount-home \
python ~/jupyter/simple-pl.py
```
[Additional docs](https://pytorch-lightning.readthedocs.io/en/1.6.3/clouds/cluster.html#slurm-managed-cluster)
File contents:
```python
import lightning.pytorch as pl
import torch
from torch import nn
import socket
import argparse
from lightning.pytorch import loggers as pl_loggers
import os


class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        return batch_idx

    def on_validation_epoch_end(self):
        print("VALIDATING")


if __name__ == "__main__":
    m = Module()
    datasets = [torch.rand([5]) for __ in range(200)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=16)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=16)
    tb_logger = pl_loggers.TensorBoardLogger(save_dir="/home/ps_app_team/logs/")
    trainer = pl.Trainer(
        accelerator="gpu",
        strategy="ddp",
        devices=os.environ["SLURM_GPUS_ON_NODE"],
        num_nodes=os.environ["SLURM_JOB_NUM_NODES"],
        max_epochs=2,
        logger=tb_logger,
    )
    print(socket.gethostname(), "pre_node:", trainer.node_rank)
    print(socket.gethostname(), "pre_local:", trainer.local_rank)
    print(socket.gethostname(), "pre_global:", trainer.global_rank)
    trainer.fit(m, train_loader, val_loader)
    print(socket.gethostname(), "post_node:", trainer.node_rank)
    print(socket.gethostname(), "post_local:", trainer.local_rank)
    print(socket.gethostname(), "post_global:", trainer.global_rank)
```
## Array jobs — run multiple similar, but **independent** jobs
A typical example is a hyperparameter search (see the sketch at the end of this section).
Create a file named `array_jobs.sh` with the following contents:
```bash
#!/bin/bash
#SBATCH --job-name=my_array_job
#SBATCH --array=0-9
#SBATCH --cpus-per-task=4
#SBATCH --output=array_output/out_%A_%a.txt
echo "This is job ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
echo "Running on host: $(hostname)"
```
Make the script executable and create the output folder:
`chmod +x array_jobs.sh && mkdir array_output`
Then submit the job:
`sbatch array_jobs.sh`
Check the output files:
`cat array_output/out_*`
```bash!
BH ps_app_team@bcm-head-01:~/demo$ cat array_output/out_*
Running on host: dgx-001
This is job 1592_1
Running on host: dgx-001
This is job 1592_2
Running on host: dgx-001
This is job 1592_3
Running on host: dgx-001
This is job 1592_4
Running on host: dgx-001
This is job 1592_5
Running on host: dgx-001
This is job 1592_6
Running on host: dgx-001
This is job 1592_7
Running on host: dgx-001
This is job 1592_8
Running on host: dgx-001
This is job 1592_9
Running on host: dgx-001
```
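To turn this into an actual hyperparameter search, the array index can be used to pick a parameter value per task. A minimal sketch (the learning-rate list and `train.py` are hypothetical):
```bash
#!/bin/bash
#SBATCH --job-name=lr_search
#SBATCH --array=0-4
#SBATCH --output=array_output/lr_%A_%a.txt
# Index the hyperparameter list with the array task ID
LEARNING_RATES=(0.1 0.03 0.01 0.003 0.001)
LR=${LEARNING_RATES[$SLURM_ARRAY_TASK_ID]}
echo "Task ${SLURM_ARRAY_TASK_ID}: training with lr=${LR}"
# srun python train.py --lr "${LR}"   # hypothetical training script
```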
## Additional Reading
* [The latest version of this document is available online](https://hackmd.io/_gNV67kuRoGE9XCNXh0dmw?edit)
* [Good collection of useful SLURM commands](https://curc.readthedocs.io/en/latest/running-jobs/slurm-commands.html)
* Official SLURM docs
* [srun](https://slurm.schedmd.com/srun.html)
* [sbatch](https://slurm.schedmd.com/sbatch.html)
* [sinfo](https://slurm.schedmd.com/sinfo.html)
* [squeue](https://slurm.schedmd.com/squeue.html)
* [scontrol](https://slurm.schedmd.com/scontrol.html)
* [sattach](https://slurm.schedmd.com/sattach.html)
* [ENROOT Usage Guide](https://github.com/NVIDIA/enroot/blob/master/doc/usage.md)
* [Pyxis Usage Guide](https://github.com/NVIDIA/pyxis#usage)
## Still have questions?
Drop a message to Dmitry: [dmitrym@nvidia.com](mailto:dmitrym@nvidia.com)