[TOC]

## Useful SLURM commands

### View information about SLURM nodes and partitions: `sinfo`

```bash!
ps_app_team@bcm-head-01:~/demo$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite     15  down* dgx-[006-020]
defq*        up   infinite      1    mix dgx-001
defq*        up   infinite      4   idle dgx-[002-005]
```

### Detailed info about your SLURM jobs: `squeue --me`

```bash!
ps_app_team@bcm-head-01:~/demo$ squeue --me
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   1579      defq     bash ps_app_t  R      41:29      1 dgx-001
```

### Even more detailed info about your SLURM jobs: `scontrol show job`

Take JOBID 1579 from the `squeue` output above and pass it to `scontrol`:

```bash!
scontrol show job 1579
JobId=1579 JobName=bash
   UserId=ps_app_team(1011) GroupId=ps_app_team(1014) MCS_label=N/A
   Priority=4294901510 Nice=0 Account=safana-nlp QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:42:52 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-06-08T13:24:30 EligibleTime=2023-06-08T13:24:30
   AccrueTime=Unknown
   StartTime=2023-06-08T13:24:30 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-08T13:24:30 Scheduler=Main
   Partition=defq AllocNode:Sid=bcm-head-01:587789
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dgx-001
   BatchHost=dgx-001
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1,gres/gpu=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/home/ps_app_team
   Power=
   TresPerNode=gres:gpu:2
```

### Terminate a specific job: `scancel`

`scancel <YOUR_JOBID>`

`scancel 1579`

You can create an alias that returns your most recent job ID:

```
alias my_last_jobid="squeue --me -o %A | tail -n 1"
```

If you put the line above into `~/.bashrc`, the alias becomes available in all your terminal sessions. After that, you can cancel your last job with:

```
scancel $(my_last_jobid)
```

### Show recent usage: `sacct -a`

```bash!
ps_app_team@bcm-head-01:~/demo$ sacct -a
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1494               bash       defq safana-nlp          1  COMPLETED      0:0
1517               bash       defq safana-nlp          1     FAILED      0:2
1518               bash       defq safana-nlp          1     FAILED      6:0
1519               bash       defq safana-nlp          2  COMPLETED      0:0
1520               bash       defq safana-nlp          2  COMPLETED      0:0
```

## Submitting a job: Multi-GPU & Convert srun → sbatch

### sbatch

Create a file named `multi-gpu.sh` with this content:

```bash!
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:8
#SBATCH --job-name=nvidia_smi
#SBATCH --output=out_multi-gpu_%j.txt

srun nvidia-smi -L
```

Make the script executable and submit the job:

```bash
chmod +x multi-gpu.sh
sbatch multi-gpu.sh
```
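After submitting, `sbatch` prints the new job ID. While the job is queued or running, you can watch it and follow its output file as it is being written; a small sketch (the job ID `1601` is only an illustration, use the one `sbatch` printed for you):

```bash
# Wait until the job state (ST) changes to R (running)
squeue --me

# Follow the output file as it grows; %j in --output expands to the job ID
tail -f out_multi-gpu_1601.txt
```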
Check the output: `cat out_multi-gpu*`

```bash!
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-9479cbc2-c65a-9393-264f-e67a50300019)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-eb7d783b-2b63-acb4-663e-e91e1bb395ec)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-a81383d8-f3c8-17bf-a3e3-aac6d0bf6e10)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-62b5120e-e1c1-fd37-31a4-773a33bdca6a)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-2873df40-d01e-bb33-6862-6a9fc6f8cdac)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-111701b5-8b97-e177-9aba-bd1778f8b317)
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-be59c44b-f01c-682e-c8b5-5ed5da0c7a8c)
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-0e01616e-2671-8bd9-c0a3-c3da3f37ead9)
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-51d6615a-1789-bdc5-2dd3-c95d1acc8d44)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-394dab2c-5f8e-0273-06ed-049ac172f5b4)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-a57b3089-c861-9a45-39bf-77b77b2a10a5)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-9c1cf23f-c433-7514-db9d-eb261fcde228)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-097e1798-0dc7-9de5-8031-74b2998bb94a)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-73db5afe-1811-5795-bd21-9fcd00ecee53)
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-21de40f5-855f-91f8-254f-8ee9a3585ac3)
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-ae76a7ac-456b-e711-e1c2-560376a94feb)
```

### srun

The same resource request as a single interactive command:

`srun --nodes=2 --gres=gpu:8 --ntasks-per-node=1 --cpus-per-task=4 nvidia-smi -L`

## Pyxis — use containers as an environment for the job

```bash!
srun -N 1 --gres=gpu:8 --container-image="nvcr.io#nvidia/tensorflow:23.07-tf2-py3" \
    --container-mount-home --pty bash
```

## Pyxis + ENROOT — import the container to speed up the start

1. Load the SLURM module: `module load slurm`
2. Start an auxiliary interactive bash session: `srun -N 1 --pty bash`
3. Download a recent PyTorch image from NGC. This command creates the file `nvidia+pytorch+23.07-py3.sqsh` in the current directory: `cd ~; enroot import "docker://nvcr.io#nvidia/pytorch:23.07-py3"`
4. Exit the interactive session with `Ctrl+d`.
5. Now you can run an interactive session inside the PyTorch container with 1 GPU:

```bash!
srun -N 1 --gpus-per-node 1 --container-image ~/nvidia+pytorch+23.07-py3.sqsh --container-mount-home --pty bash
```

## ENROOT — how to add new packages to an image

To add new layers to an ENROOT image, run ENROOT manually, without Pyxis. The example below installs the `rolldice` package.

1. Start an interactive session on a machine with ENROOT:

```bash!
srun -N 1 --gpus-per-node 1 --pty bash
```

2. Inside the interactive job, create a container from the image:

```bash!
enroot create --name pytorch ~/nvidia+pytorch+23.07-py3.sqsh
```

3. Start an interactive shell inside the container:

```bash!
enroot start --root --rw pytorch bash
```

4. Install everything you need interactively:

```bash!
apt install rolldice
```

5. Test the installed package:

```bash!
/usr/games/rolldice 6
```

6. Exit the interactive tty from the container with `Ctrl+d`.
7. Export the container:

```bash!
enroot export --output ~/nvidia+pytorch+23.07-py3+dice.sqsh pytorch
```

8. Exit the interactive srun session with `Ctrl+d`.
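If you modify images often, the same steps can be scripted instead of typed interactively; a minimal sketch, assuming the same image and package names as above and run from inside an interactive `srun` session on a node with ENROOT:

```bash
# Create a writable container from the imported image
enroot create --name pytorch ~/nvidia+pytorch+23.07-py3.sqsh

# Run the installation non-interactively inside the container
enroot start --root --rw pytorch bash -c "apt update && apt install -y rolldice"

# Export the modified container back into a new squashfs image
enroot export --output ~/nvidia+pytorch+23.07-py3+dice.sqsh pytorch
```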
## NCCL tests — an example of multi-node container run, testing the interconnect

1. Complete all the steps from the [container import section above](#Pyxis--ENROOT-—-import-the-container-to-speed-up-the-start).
2. Run an interactive session inside the PyTorch container with 1 GPU: `srun -N 1 --gpus-per-node 1 --container-image ~/nvidia+pytorch+23.07-py3.sqsh --container-mount-home --pty bash`
3. Inside the container, go to your home directory: `cd ~`
4. Create a directory and clone the NCCL tests repo:

```bash
mkdir nccl_test
cd nccl_test
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
```

5. Compile the executables with MPI support:

```bash
make clean
make MPI=1 MPI_HOME=/usr/local/mpi -j 20
```

6. Try running one example. It is expected to fail with an error; avoid combining interactive jobs and MPI: `./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8`
7. Exit the interactive job: `Ctrl+d`
8. Run the 1-node all-reduce test:

```bash!
srun -N 1 --gpus-per-node 8 --exclusive --mpi=pmi2 \
    --container-image ~/nvidia+pytorch+23.07-py3.sqsh \
    --container-mount-home \
    ~/nccl_test/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

9. Run the 2-node all-reduce test:

```bash!
srun -N 2 -w dgx-00[2-3] --gpus-per-node 8 --exclusive --mpi=pmi2 \
    --container-image ~/nvidia+pytorch+23.07-py3.sqsh \
    --container-mount-home \
    ~/nccl_test/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 \
    | tee 2-node.log
```

10. Check the NCCL debug output:

```bash!
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH \
    srun -N 2 -w dgx-00[2-3] --gpus-per-node 8 --exclusive --mpi=pmi2 \
    --container-image ~/nvidia+pytorch+23.07-py3.sqsh \
    --container-mount-home \
    ~/nccl_test/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 \
    | tee 2-node-graph.log
```

11. The same can be achieved via `sbatch`; see the session below.

```bash
ps_app_team@bcm-head-01:~$ cat ~/nccl_test/2-node.sbatch
```

```bash!
#!/usr/bin/env bash
#SBATCH -N 2
#SBATCH -w dgx-00[2-3]
#SBATCH --gpus-per-node 8
#SBATCH --exclusive
#SBATCH --container-image /home/ps_app_team/nvidia+pytorch+23.07-py3.sqsh
#SBATCH --container-mount-home
#SBATCH -o /home/ps_app_team/nccl_test/2-node-sbatch.log

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH \
    ~/nccl_test/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

```bash
ps_app_team@bcm-head-01:~$ sbatch ~/nccl_test/2-node.sbatch
Submitted batch job 1482
```

```bash
ps_app_team@bcm-head-01:~$ scontrol show job 1482
JobId=1482 JobName=2-node.sbatch
   UserId=ps_app_team(1011) GroupId=ps_app_team(1014) MCS_label=N/A
   Priority=4294901580 Nice=0 Account=safana-nlp QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-06-07T20:31:45 EligibleTime=2023-06-07T20:31:45
   AccrueTime=2023-06-07T20:31:45
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-07T20:31:47 Scheduler=Main
   Partition=defq AllocNode:Sid=bcm-head-01:587789
   ReqNodeList=dgx-00[2-3] ExcNodeList=(null)
   NodeList=(null)
   NumNodes=2-2 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=2M,node=2,billing=2,gres/gpu=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/ps_app_team/nccl_test/2-node.sbatch
   WorkDir=/home/ps_app_team
   StdErr=/home/ps_app_team/nccl_test/2-node-sbatch.log
   StdIn=/dev/null
   StdOut=/home/ps_app_team/nccl_test/2-node-sbatch.log
   Power=
   TresPerNode=gres:gpu:8
```

```bash
ps_app_team@bcm-head-01:~$ less /home/ps_app_team/nccl_test/2-node-sbatch.log
```
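If the Pyxis options are not accepted in the `#SBATCH` header on your installation, a common alternative is to keep the header container-free and pass the container options to an `srun` job step inside the script. A sketch under that assumption, reusing the paths and nodes from the example above (the log file name is just an illustration):

```bash
#!/usr/bin/env bash
#SBATCH -N 2
#SBATCH -w dgx-00[2-3]
#SBATCH --gpus-per-node 8
#SBATCH --exclusive
#SBATCH -o /home/ps_app_team/nccl_test/2-node-srun-step.log

# Launch the NCCL test as a job step; the Pyxis container options go to srun here
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH \
    srun --mpi=pmi2 \
    --container-image /home/ps_app_team/nvidia+pytorch+23.07-py3.sqsh \
    --container-mount-home \
    ~/nccl_test/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```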
## tmux — how to keep a terminal session alive through disconnections

1. Attach to a tmux session named "main", or create a new one with this name: `tmux a -d -t main || tmux new -A -s main`
2. Create a new window in the tmux session: `Ctrl+b`, then `c`
3. Switch back to window 0: `Ctrl+b`, then `0`
4. Close the current window: `Ctrl+d`
5. Keep everything running, but detach from the session: `Ctrl+b`, then `d`
6. Navigate the tmux terminal history:
   * `Ctrl+b`, then `[` to switch to the navigation mode
   * `PgUp`, `PgDown` to scroll
   * `q` to exit back to the command entering mode

## sattach — if the connection to the srun job failed, but the job is still running

If the tmux session with your `srun` was killed, the interactive job may still be running. To check whether it is, use `squeue --me -s` to show your running jobs and their steps. For me, the output looks like this:

```bash!
$ squeue --me -as
       STEPID     NAME PARTITION     USER      TIME NODELIST
      79638.0     bash     batch u00u5ya1      3:40 dgx1-000
 79638.extern   extern     batch u00u5ya1      3:41 dgx1-000
```

Most probably, you will need to attach to step 0. The command for that is `sattach --pty <jobid>.<stepid>`; in this case, `sattach --pty 79638.0`.

## SLURM NGC login

To enable ENROOT access to the NGC private registry:

1. Get your NGC API key from https://ngc.nvidia.com/setup/api-key
2. Create a `~/.config/enroot/` directory ([details](https://github.com/NVIDIA/enroot/blob/master/doc/cmd/import.md#description)): `mkdir -p ~/.config/enroot/`
3. Create a `~/.config/enroot/.credentials` file and open it for editing: `nano ~/.config/enroot/.credentials`
4. Paste the following contents, replacing `<YOUR_NGC_TOKEN>` with your actual token:

```
machine nvcr.io login $oauthtoken password <YOUR_NGC_TOKEN>
machine authn.nvidia.com login $oauthtoken password <YOUR_NGC_TOKEN>
```

5. Write the file to disk and exit: `Ctrl+o`, then `Ctrl+x`

## Jupyter Lab in SLURM

Run an interactive job on 2 GPUs:

```bash!
srun -N 1 --gpus-per-node 2 --mpi=pmi2 \
    --container-image ~/nvidia+pytorch+23.07-py3.sqsh \
    --container-mount-home --pty \
    bash -c "hostname -I && hostname && jupyter lab --ip=0.0.0.0 --port=8000 --allow-root --no-browser --NotebookApp.token='my_simple_token' --NotebookApp.allow_origin='*' --notebook-dir=/"
```

As its first line, the job prints several IP addresses. Use the one that looks like the address of the node you are connecting to (usually, the one **not** starting with 10 or 127).

Open your browser and go to `http://<that IP address>:8000`. You will be redirected to Jupyter Lab; as the token, enter `my_simple_token` from the command above.

Note the port number **8000**: it is the one you specified with `--port=8000`, not the one printed by the Jupyter Lab launch script.

Change to a directory where you have write access, create a Jupyter notebook and run it. Inside, you can check the output of the following cells:

```python
import torch

torch.cuda.is_available()      # True
torch.cuda.device_count()      # 2
torch.cuda.current_device()    # 0
torch.cuda.get_device_name(0)  # 'NVIDIA A100-SXM4-80GB'
```
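If the compute node's IP address is not reachable from your workstation directly, you can usually tunnel the Jupyter port through the head node with SSH. A sketch, assuming the head node is `bcm-head-01`, the job landed on `dgx-001`, and port 8000 is free on your machine:

```bash
# Forward local port 8000 to port 8000 on the compute node, via the head node
ssh -L 8000:dgx-001:8000 ps_app_team@bcm-head-01

# Then open http://localhost:8000 in your local browser and use the same token
```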
## PyTorch Lightning distributed training

Create a file `simple-pl.py` with the contents listed at the end of this section. Choose the number of nodes and the number of GPUs per node; in our case, 2 nodes with 8 GPUs each. Set `--ntasks-per-node` equal to `--gpus-per-node`. Launch srun:

```bash!
srun -N 2 -w dgx-00[4-5] --gpus-per-node 8 --ntasks-per-node 8 --exclusive --mpi=pmi2 \
    --container-image ~/nvidia+pytorch+23.07-py3.sqsh \
    --container-mount-home \
    python ~/jupyter/simple-pl.py
```

[Additional docs](https://pytorch-lightning.readthedocs.io/en/1.6.3/clouds/cluster.html#slurm-managed-cluster)

File contents:

```python
import os
import socket

import torch
from torch import nn
import lightning.pytorch as pl
from lightning.pytorch import loggers as pl_loggers


class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        return batch_idx

    def on_validation_epoch_end(self):
        print("VALIDATING")


if __name__ == "__main__":
    m = Module()
    datasets = [torch.rand([5]) for __ in range(200)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=16)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=16)
    tb_logger = pl_loggers.TensorBoardLogger(save_dir="/home/ps_app_team/logs/")

    # Lightning reads the rest of the cluster topology from the SLURM environment
    trainer = pl.Trainer(
        accelerator="gpu",
        strategy="ddp",
        devices=int(os.environ["SLURM_GPUS_ON_NODE"]),
        num_nodes=int(os.environ["SLURM_JOB_NUM_NODES"]),
        max_epochs=2,
        logger=tb_logger,
    )

    print(socket.gethostname(), "pre_node:", trainer.node_rank)
    print(socket.gethostname(), "pre_local:", trainer.local_rank)
    print(socket.gethostname(), "pre_global:", trainer.global_rank)

    trainer.fit(m, train_loader, val_loader)

    print(socket.gethostname(), "post_node:", trainer.node_rank)
    print(socket.gethostname(), "post_local:", trainer.local_rank)
    print(socket.gethostname(), "post_global:", trainer.global_rank)
```

## Array jobs — run multiple similar, but **independent** jobs

A typical use case is a hyperparameter search.

Create a file named `array_jobs.sh` with this content:

```bash
#!/bin/bash
#SBATCH --job-name=my_array_job
#SBATCH --array=0-9
#SBATCH --cpus-per-task=4
#SBATCH --output=array_output/out_%A_%a.txt

echo "This is job ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
echo "Running on host: $(hostname)"
```

Make the script executable and create the output folder:

`chmod +x array_jobs.sh && mkdir array_output`

Submit the job:

`sbatch array_jobs.sh`
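If you want to limit how many of the array tasks run at the same time (for example, to leave nodes free for other users), `sbatch` accepts a `%` separator in the `--array` specification. A sketch of the same script, assumed to run at most 2 tasks concurrently:

```bash
#!/bin/bash
#SBATCH --job-name=my_array_job
#SBATCH --array=0-9%2              # 10 tasks in total, at most 2 running at once
#SBATCH --cpus-per-task=4
#SBATCH --output=array_output/out_%A_%a.txt

echo "This is job ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
echo "Running on host: $(hostname)"
```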
Check the output files: `cat array_output/out_*`

```bash!
ps_app_team@bcm-head-01:~/demo$ cat array_output/out_*
This is job 1592_0
Running on host: dgx-001
This is job 1592_1
Running on host: dgx-001
This is job 1592_2
Running on host: dgx-001
This is job 1592_3
Running on host: dgx-001
This is job 1592_4
Running on host: dgx-001
This is job 1592_5
Running on host: dgx-001
This is job 1592_6
Running on host: dgx-001
This is job 1592_7
Running on host: dgx-001
This is job 1592_8
Running on host: dgx-001
This is job 1592_9
Running on host: dgx-001
```

## Additional Reading

* [The latest version of this document is available online](https://hackmd.io/_gNV67kuRoGE9XCNXh0dmw?edit)
* [A good collection of useful SLURM commands](https://curc.readthedocs.io/en/latest/running-jobs/slurm-commands.html)
* Official SLURM docs
  * [srun](https://slurm.schedmd.com/srun.html)
  * [sbatch](https://slurm.schedmd.com/sbatch.html)
  * [sinfo](https://slurm.schedmd.com/sinfo.html)
  * [squeue](https://slurm.schedmd.com/squeue.html)
  * [scontrol](https://slurm.schedmd.com/scontrol.html)
  * [sattach](https://slurm.schedmd.com/sattach.html)
* [ENROOT Usage Guide](https://github.com/NVIDIA/enroot/blob/master/doc/usage.md)
* [Pyxis Usage Guide](https://github.com/NVIDIA/pyxis#usage)

## Still have questions?

Drop a message to Dmitry: [dmitrym@nvidia.com](mailto:dmitrym@nvidia.com)