# Summary of slurm commands
[toc]
##### Author - @Yize Sun
##### Created On - 10/04/2022
This is a working summary of some Slurm commands; the official references are the [sbatch command](https://slurm.schedmd.com/sbatch.html) and the [quickstart](https://slurm.schedmd.com/quickstart.html). For every training task, one Slurm shell script should be configured.
## 1. Run slurm
Run a Slurm shell script, e.g. `slurmtest.sh` in `/home/stud/user`, by using:
```console
sbatch slurmtest.sh
```
or, to run it directly in the foreground instead of submitting it to the queue:
```console
srun slurmtest.sh
```
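After submitting with `sbatch`, the job ID is printed. Two more commands that are useful here are `squeue` to check the state of your jobs and `scancel` to cancel one:
```console
squeue -u $USER
scancel [jobid]
```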
## 2. Configure your own Slurm shell script
Slurm is a popular framework for managing the resources of a compute cluster. It is recommended to use a shell script to configure the environment for a training task.
The most important part of the configuration script is:
```sh
#!/usr/bin/env bash
#
#SBATCH --job-name torch-test
#SBATCH --output=res.txt
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --gres=gpu:1
```
These options define:
**--job-name**
the name of the job
**--output**
the path of the output file
**--ntasks**
the maximum number of tasks to run; the default is one task per node
**--time**
the total run-time limit of the job; if the requested limit exceeds the partition's time limit, the job is left in a PENDING state
(accepted formats: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds")
**--gres**
generic consumable resources, e.g. "--gres=[resource]:[count]"
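For example, a request for two and a half days of run time and two GPUs could look like this (the values are only placeholders):
```sh
#SBATCH --time=2-12:00:00
#SBATCH --gres=gpu:2
```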
## 3. Configure Slurm for running a Python file in a virtual environment
```sh
#!/usr/bin/env bash
#
#SBATCH --job-name torch-test
#SBATCH --output=res.txt
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --gres=gpu:1

# change into the folder that contains the training script
RUNPATH=/home/stud/user/path/to/your/test/folder
cd $RUNPATH

# activate the virtual environment, then run the script
VENVPATH=/home/stud/user/path/to/your/virtual/environment
source $VENVPATH/bin/activate
python mytest.py
```
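Assuming the script above is saved as e.g. `run_test.sh` (the file name is arbitrary), it can be submitted and its output followed with:
```console
sbatch run_test.sh
tail -f res.txt
```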
## 4. Some useful linux commands
### 4.0 Connect to and exit the remote
To connect to the remote machine:
```console
ssh [user]@[domain.de]
```
To exit the remote:
```console
exit
```
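Optionally, a host alias in `~/.ssh/config` shortens this to `ssh cluster`; the alias and values below are placeholders:
```console
Host cluster
    HostName domain.de
    User bob
```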
### 4.1 Check the current path
```console
pwd
```
### 4.2 Transfer files from local to remote
1. Using **scp**
Transfer a file via OpenSSH:
```console
scp [path/of/local] [user]@[domain]:[/path/of/target]
```
e.g. (add `-r` to copy a whole folder):
```console
scp /local/test.py bob@domain.de:/home/stud/bob/tmp/
```
2. Using **git**
Sometimes the training task is not a single Python file but a set of files. You can push your local code to GitHub and then pull it on the remote; a minimal sketch is shown below.
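A possible version of this workflow, with a placeholder repository URL:
```console
# on the local machine: commit and push the training code
git add . && git commit -m "update training code" && git push
# on the remote: clone once, afterwards pull updates inside the repository
git clone https://github.com/[user]/[repo].git
git pull
```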
### 4.3 Install Anaconda
According to this [reference](https://linuxize.com/post/how-to-install-anaconda-on-ubuntu-20-04/):
Download the Anaconda installer for Ubuntu 20.04:
```console
wget -P /tmp https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
```
Check the sha256 checksum:
```console
sha256sum /tmp/Anaconda3-2021.11-Linux-x86_64.sh
```
The output should look like:
```console
fedf9e340039557f7b5e8a8a86affa9d299f5e9820144bd7b92ae9f7ee08ac60 /tmp/Anaconda3-2021.11-Linux-x86_64.sh
```
Run the installer:
```console
bash /tmp/Anaconda3-2021.11-Linux-x86_64.sh
```
You should see:
```console
Welcome to Anaconda3 2021.11
In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>
```
Press ENTER to continue (or interrupt with "Ctrl+C"):
```console
Do you approve the license terms? [yes|no]
```
Type yes:
```console
Anaconda3 will now be installed into this location:
/home/stud/user/anaconda3
- Press ENTER to confirm the location
- Press CTRL-C to abort the installation
- Or specify a different location below
```
Press ENTER to confirm the location:
```console
Installation finished.
Do you wish the installer to initialize Anaconda3
by running conda init? [yes|no]
```
Typing "yes" will add conda to your system path. To load the updated path into the current shell after each new connection, run:
```console
source ~/.bashrc
```
If you are going to use a virtual environment, it is recommended to disable the automatic activation of the base environment:
```console
conda config --set auto_activate_base false
```
To test the installation:
```console
conda --version
```
You should see:
```console
conda 4.10.3
```
## 5. Create and Activate a Virtual Environment
### 5.1 Create a virtual environment with conda
Load conda:
```console
source ~/.bashrc
```
To create a virtual environment, type:
```console
conda create -n [venvname] python=[x.x] anaconda
```
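For example, to create the `pnyl37` environment that is activated in the Slurm config of section 6.3 (the Python version here is only an assumption):
```console
conda create -n pnyl37 python=3.7 anaconda
```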
To activate your virtual environment:
```console
conda activate [venvname]
```
After activating the virtual environment, install Python packages with:
```console
pip install [package]
```
or
```console
conda install -n [venvname] [package]
```
To check the selected Python version:
```console
python --version
```
To deactivate virtual environment:
```console
conda deactivate
```
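Two related commands: list all environments, and remove one that is no longer needed:
```console
conda env list
conda remove -n [venvname] --all
```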
### 5.2 Create a virtual environment with python
First check your Python version:
```console
python3 --version
```
you should see:
```console
Python 3.8.10
```
If you want to use multiple Python versions, see this [tutorial](https://medium.com/analytics-vidhya/how-to-install-and-switch-between-different-python-versions-in-ubuntu-16-04-dc1726796b9b).
To create a virtual environment:
```console
python3 -m venv [path/to/your/virtual/venv]
```
To activate the virtual environment:
```console
source [path/to/your/virtual/venv/bin/activate]
```
To install packages:
```console
[path/to/your/virtual/venv/bin/pip] install [package]
```
or type:
```console
pip install [package]
```
To run Python in the virtual environment:
```console
[path/to/your/virtual/venv/bin/python] [mypyfile.py]
```
or type:
```console
python [mypyfile.py]
```
To deactivate:
```console
deactivate
```
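To reproduce a local environment on the remote, a common pattern is to export the installed packages into a requirements file locally and install from it inside the remote venv (the file name is only a convention):
```console
pip freeze > requirements.txt      # locally, inside the activated venv
pip install -r requirements.txt    # on the remote, inside the activated venv
```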
## 6. Multiple Files on One GPU
The question of sending multiple Python files (or one file multiple times with different arguments) to one GPU can be viewed as creating multiple processes and sending each process to the same GPU.
The important questions are:
1. tasks: how many tasks do I have?
2. nodes: how many tasks does one node run?
### 6.1 Understanding task, node, core, gpu
1. A task is an abstract unit. It can be one running Python file, or one thread among parallel threads; a task can correspond to a thread or to a process (which may include multiple threads). A task is described by a number of nodes, a number of cores, etc., and is defined through the SLURM config. For example, the following script creates one task that runs the Python file three times in a row (a concurrent variant is sketched after this list):
```sh
#!/usr/bin/env bash
#SBATCH --ntasks=1
# the three runs share the single task and execute one after another
for i in 0 1 2; do
    srun python pyfile.py "arg$i"
done
```
2. A core corresponds to a thread (1 core = 1 thread = 1 parallel worker of one running Python file). So if you have only one Python file to run, and this file does its work in parallel with 10 separate workers, then 10 cores are needed for this single Python file.
```sh
#SBATCH --cpus-per-task=2
```
3. A node is a unit that manages cores. One job can use one or multiple nodes.
```shell
#SBATCH --nodes=3
```
4. One GPU can run multiple tasks.
```shell
#SBATCH --gpus=1
#SBATCH --ntasks-per-gpu=8
### gives the same total number of tasks, but take care of the GPU memory
#SBATCH --gpus=8
#SBATCH --ntasks-per-gpu=1
```
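One way to actually run several processes at the same time on the single allocated GPU is to start them in the background inside one job and wait for all of them. A minimal sketch, where `pyfile.py` is a placeholder script taking one argument; alternatively each process can be wrapped in its own `srun` step as in the references of section 6.2, but the exact flags for that depend on the Slurm version:
```sh
#!/usr/bin/env bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --gres=gpu:1

# start three processes; all of them see the single GPU of this job
for i in 0 1 2; do
    python pyfile.py "arg$i" &
done
wait  # do not let the job end before all background processes have finished
```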
### 6.2 Defining SLURM for Multiple Files on One GPU
Before defining this configuration, the researcher should be sure about:
1. how many processes (running Python jobs in this case) are needed, and how many cores (number of parallel workers per job) each process needs
2. how many cores one node has, which can be checked with (see also the `sinfo` example below)
```console
scontrol show node
```
3. the GPU memory size, and the total GPU memory requested by each task
Some good references: [see0](https://stackoverflow.com/a/51141287), [see1](https://stackoverflow.com/questions/45883372/slurm-submit-multiple-tasks-per-node), [see2](https://docs.computecanada.ca/wiki/Using_GPUs_with_Slurm)
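In addition to `scontrol show node`, an overview of the partitions with their time limits and node states can be printed with:
```console
sinfo
```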
### 6.3 Current SLURM Config for Multi-Process on One GPU
```shell
#!/usr/bin/env bash
#
#SBATCH --job-name torch-test-paraell
#SBATCH --output=rlt_test_para.txt
### option 1: #SBATCH --ntasks + #SBATCH --ntasks-per-gpu; the following is option 2
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --ntasks=5
#SBATCH --time=72:30:00
RUNPATH=/home/stud/sunyi/tests/test_paraell
cd $RUNPATH
PNYL37PATH=/home/stud/sunyi/anaconda3/envs/pnyl37
source ~/.bashrc
conda activate pnyl37
python --version
python test_paraell.py
```
The result looks like:
```console
==> /mnt/gpustat/worker-5 <==
worker-99999 Week Month day x:x:x x xxx.xx.xx
[0] NVIDIA RTX | 56'C, 83 % | 44161 / 48685 MB |
[1] NVIDIA RTX | 50'C, 3 % | 33219 / 48685 MB |
[2] NVIDIA RTX | 33'C, 0 % | 7678 / 48685 MB | sunyi(1535M) sunyi(1535M) sunyi(1535M) sunyi(1535M) sunyi(1535M)
[3] NVIDIA RTX | 33'C, 12 % | 2052 / 48685 MB |
[4] NVIDIA RTX | 28'C, 0 % | 1 / 48685 MB |
[5] NVIDIA RTX | 47'C, 58 % | 43012 / 48685 MB |
[6] NVIDIA RTX | 28'C, 0 % | 1 / 48685 MB |
[7] NVIDIA RTX | 53'C, 77 % | 31222 / 48685 MB |
```
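The table above comes from the cluster's GPU monitor (gpustat). If such a monitor file is not available, the same information can be checked directly on the worker node, assuming `nvidia-smi` is installed there:
```console
watch -n 1 nvidia-smi
```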