![it-logo-100x100.png](https://hackmd.io/_uploads/HkjdnyFXp.png)
## HPC Workflows using Slurm
Machine Learning examples on Aristotle Cluster
---
## This presentation
will hopefully help you learn how to:
* Connect to Aristotle Web Interface
* Run a Jupyter Notebook on Aristotle cluster
* Submit a batch job to use additional computing resources
---
### Preparation for this session
Please go through a `Unix command cheat sheet` such as the following:
* https://hpc.it.auth.gr/cheat-sheet/

A few Unix commands will be useful for running the examples that follow.
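
As a quick reference, the kinds of commands used throughout the examples include (angle brackets mark placeholders):
```
$ pwd                 # print the current working directory
$ ls -l               # list files in the current directory
$ cd <directory>      # change into another directory
$ cp -r <src> <dst>   # copy a file or directory (recursively)
$ cat <file>          # print a file's contents to the terminal
```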
---
### Access Aristotle HPC cluster
from your browser: https://hpc.auth.gr/
or via `SSH`: https://hpc.it.auth.gr/intro/
---
### Examples for this presentation
To get the examples, use the [Web interface file manager](https://hpc.it.auth.gr/ondemand-desktop/#_2) or the copy command (`cp`) in the terminal:
```
$ cp -r /mnt/apps/share/HPC-AI-examples $HOME
```
---
### The example Jupyter Notebook
uses the Extreme Gradient Boosting ([`XGBoost`](https://xgboost.readthedocs.io/en/stable/index.html)) open-source library for a simple regression example.
`XGBoost` implements machine learning algorithms under the Gradient Boosting framework.
---
#### [Start Jupyter Server](https://hpc.it.auth.gr/applications/jupyter/#jupyter-server-cluster)
`Interactive Apps` -> `Jupyter Server`
![ONDEMAND-MENU.png](https://hackmd.io/_uploads/HJj_h1YQa.png)
... and launch!
---
### Start a new terminal
on the Jupyter Server
![Jupyter_Terminal_Menu.png](https://hackmd.io/_uploads/ryjunyF76.png)
---
Use the `cp` command to copy the example Jupyter notebook:
```
$ cp /mnt/apps/custom/jupyter/nb/xgboost_example.ipynb .
```
Source the prebuilt Python virtual environment to activate it:
```
$ source /mnt/apps/custom/python-envs/xgboost-env/bin/activate
```
Install the `IPython kernel` in this environment for your user account:
```
$ python -m ipykernel install --user --name xgboost-env \
    --display-name "xgboost environment"
```
---
### Start new notebook
Using the custom environment
![xgboost-environment.png](https://hackmd.io/_uploads/rJjO21FXp.png)
In the Jupyter menu select `File -> Open`
to load the `xgboost_example` notebook.
---
#### Export Python script
In the notebook menu select:
`Download as` -> `Python (.py)`
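
If you prefer the terminal, the same export can be done with `jupyter nbconvert` (a sketch, assuming the notebook filename from the earlier copy step):
```
$ jupyter nbconvert --to script xgboost_example.ipynb
# writes xgboost_example.py next to the notebook
```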
---
### Python virtual environment
(+ Jupyter IPython Kernel)
To create a new custom Python `venv` under your account,
the following process can be used:
```
module load gcc/9.4.0-eewq4j6 python/3.9.10-ve54vyn
python -m venv xgboost-env
source xgboost-env/bin/activate
pip install --upgrade pip
pip install jupyter xgboost matplotlib scikit-learn
python -m ipykernel install --user --name xgboost-env \
    --display-name "xgboost environment"
```
[custom virtual environment on Jupyter](https://hpc.it.auth.gr/applications/jupyter/#custom-python-virtual-environments)
---
### Using Slurm
to Access HPC Resources
---
### Slurm Workload Manager
* Allocates and manages exclusive user access to cluster resources
* Provides a framework for job tracking and parallel job execution
* [Quick Start User Guide](https://slurm.schedmd.com/quickstart.html)
* [Slurm Directives](https://hpc.it.auth.gr/jobs/slurm/)
---
### Slurm user commands (1)
* Submit a job to the cluster
```
$ sbatch <job_script>
```
* Show status of running and queued jobs
```
$ squeue
# Filter results for one user
$ squeue -u <username>
# Filter results for one partition
$ squeue -p <partition>
```
* Cancel a submitted job
```
$ scancel <jobid>
```
----
### Slurm user commands (2)
* Show status of available partitions
```
$ sinfo
$ sinfo -N --long # show node status
```
* Show resource usage and efficiency of a completed job
```
$ seff <jobid>
```
* Report job accounting information
```
$ sacct
```
---
## Batch job examples
---
Steps:
1. Create a submission script
2. Submit job to Slurm
3. Monitor job execution
4. Get job results (a minimal session is sketched below)

Related docs:
* https://hpc.it.auth.gr/jobs/serial-slurm/
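
Put together, a minimal session could look like this sketch (the script name `job.sh` and the job id are placeholders):
```
# 1. write a submission script (job.sh), e.g. like the examples that follow
$ sbatch job.sh            # 2. submit it to Slurm
Submitted batch job 123456
$ squeue -u $USER          # 3. monitor it while pending (PD) or running (R)
$ cat slurm-123456.out     # 4. read the job's output (Slurm's default file name)
$ seff 123456              # ...and check its resource usage after completion
```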
---
### Example 1: A test job
Submission script
```
#!/bin/bash
#SBATCH --time=10:00
#SBATCH --partition=testing
echo "Hello from $(hostname)"
sleep 30
echo Bye
```
---
### Example 2: More CPUs
```
#!/bin/bash
#SBATCH --partition=rome
#SBATCH --time=10:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
stress --cpu ${SLURM_NTASKS} --timeout 60
```
Check CPU efficiency after the job completes: `seff <jobid>`
---
### Example 3: More Memory
By default, the memory available per task is:
```
Memory per task = Total memory on node / #CPUs on node
```
To allocate more memory, use the `--mem` directive:
```
#!/bin/bash
#SBATCH --partition=rome
#SBATCH --job-name=memory
#SBATCH --time=4:00
#SBATCH --mem=11G
./allocate-10gb
```
---
### Example 4: GPU jobs
* Partitions:
  * **gpu**: 2 nodes with an [NVIDIA Tesla P100](https://www.nvidia.com/en-us/data-center/tesla-p100/)
  * **ampere**: 1 node with 8 [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs
```
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=20
#SBATCH --time=10:00
nvidia-smi
```
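
The same pattern applies to the A100 node; a minimal sketch targeting the `ampere` partition (CPU count omitted, check the site docs for per-GPU limits):
```
#!/bin/bash
#SBATCH --partition=ampere
#SBATCH --gres=gpu:1       # request one of the node's A100 GPUs
#SBATCH --time=10:00
nvidia-smi                 # confirms which GPU was allocated
```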
---
### Run the XGBoost example Python script
as a batch job on the cluster
```
#!/bin/bash
#SBATCH --job-name=xgboost-example
#SBATCH --partition=rome
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=1:00:00
source /mnt/apps/custom/python-envs/xgboost-env/bin/activate
python xgboost_example.py
```
---
### THANK YOU !!
![it-auth.png](https://hackmd.io/_uploads/HJjd2kFXp.png)
{"title":"Machine Learning examples on Aristotle Cluster","description":"it-logo-100x100.png","slideOptions":"{\"theme\":\"black\"}","contributors":"[{\"id\":\"3e4e321b-8d22-47f8-8091-afc9b313f3dd\",\"add\":6613,\"del\":1083}]"}