![it-logo-100x100.png](https://hackmd.io/_uploads/HkjdnyFXp.png)
## HPC Workflows using Slurm
Machine Learning examples on Aristotle Cluster
---
## This presentation
will hopefully help you learn how to:
* Connect to Aristotle Web Interface
* Run a Jupyter Notebook on Aristotle cluster
* Submit a batch job to use additional computing resources
---
### Preparation for this session
Please go through a `Unix command cheat sheet` such as the following:
* https://hpc.it.auth.gr/cheat-sheet/

A few Unix commands will be useful for running the examples that follow.
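
As a quick reference, the kinds of commands used throughout the examples include (angle brackets mark placeholders):
```
$ pwd                 # print the current working directory
$ ls -l               # list files in the current directory
$ cd <directory>      # change into another directory
$ cp -r <src> <dst>   # copy a file or directory (recursively)
$ cat <file>          # print a file's contents to the terminal
```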
---
### Access Aristotle HPC cluster
from your browser: https://hpc.auth.gr/
or via `SSH`: https://hpc.it.auth.gr/intro/
---
### Examples for this presentation
To get the examples, use the [Web interface file manager](https://hpc.it.auth.gr/ondemand-desktop/#_2) or the copy command (`cp`) in the terminal:
```
$ cp -r /mnt/apps/share/HPC-AI-examples $HOME
```
---
### The example Jupyter Notebook
uses the Extreme Gradient Boosting ([`XGBoost`](https://xgboost.readthedocs.io/en/stable/index.html)) open-source library for a simple regression example.
`XGBoost` implements machine learning algorithms under the Gradient Boosting framework.
---
#### [Start Jupyter Server](https://hpc.it.auth.gr/applications/jupyter/#jupyter-server-cluster)
`Interactive Apps` -> `Jupyter Server`
![ONDEMAND-MENU.png](https://hackmd.io/_uploads/HJj_h1YQa.png)
... and launch!
---
### Start a new terminal
on the Jupyter Server
![Jupyter_Terminal_Menu.png](https://hackmd.io/_uploads/ryjunyF76.png)
---
Use the `cp` command to copy the example Jupyter notebook:
```
$ cp /mnt/apps/custom/jupyter/nb/xgboost_example.ipynb .
```
Source the prebuilt Python virtual environment to activate it:
```
$ source /mnt/apps/custom/python-envs/xgboost-env/bin/activate
```
Install the `IPython kernel` in this environment for your user account:
```
$ python -m ipykernel install --user --name xgboost-env \
    --display-name "xgboost environment"
```
---
### Start new notebook
Using the custom environment
![xgboost-environment.png](https://hackmd.io/_uploads/rJjO21FXp.png)
In the Jupyter menu select `File -> Open`
to load the `xgboost_example` notebook.
---
#### Export Python script
In the notebook menu select:
`Download as` -> `Python (.py)`
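
If you prefer the terminal, the same export can be done with `jupyter nbconvert` (a sketch, assuming the notebook filename from the earlier copy step):
```
$ jupyter nbconvert --to script xgboost_example.ipynb
# writes xgboost_example.py next to the notebook
```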
---
### Python virtual environment
(+ Jupyter IPython Kernel)
To create a new custom Python `venv` under your account,
the following process can be used:
```
module load gcc/9.4.0-eewq4j6 python/3.9.10-ve54vyn
python -m venv xgboost-env
source xgboost-env/bin/activate
pip install --upgrade pip
pip install jupyter xgboost matplotlib scikit-learn
python -m ipykernel install --user --name xgboost-env \
    --display-name "xgboost environment"
```
[custom virtual environment on Jupyter](https://hpc.it.auth.gr/applications/jupyter/#custom-python-virtual-environments)
---
### Using Slurm
to Access HPC Resources
---
### Slurm Workload Manager
* Allocates and manages exclusive user access to cluster resources
* Provides a framework for job tracking and parallel job execution
* [Quick Start User Guide](https://slurm.schedmd.com/quickstart.html)
* [Slurm Directives](https://hpc.it.auth.gr/jobs/slurm/)
---
### Slurm user commands (1)
* Submit a job to the cluster
```
$ sbatch <job_script>
```
* Show status of running and queued jobs
```
$ squeue
# Filter results for one user
$ squeue -u <username>
# Filter results for one partition
$ squeue -p <partition>
```
* Cancel a submitted job
```
$ scancel <jobid>
```
----
### Slurm user commands (2)
* Show status of available partitions
```
$ sinfo
$ sinfo -N --long # show node status
```
* Show resource usage and efficiency of a completed job
```
$ seff <jobid>
```
* Report job accounting information
```
$ sacct
```
---
## Batch job examples
---
Steps:
1. Create a submission script
2. Submit job to Slurm
3. Monitor job execution
4. Get job results (a minimal session is sketched below)

Related docs:
* https://hpc.it.auth.gr/jobs/serial-slurm/
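
Put together, a minimal session could look like this sketch (the script name `job.sh` and the job id are placeholders):
```
# 1. write a submission script (job.sh), e.g. like the examples that follow
$ sbatch job.sh            # 2. submit it to Slurm
Submitted batch job 123456
$ squeue -u $USER          # 3. monitor it while pending (PD) or running (R)
$ cat slurm-123456.out     # 4. read the job's output (Slurm's default file name)
$ seff 123456              # ...and check its resource usage after completion
```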
---
### Example 1: A test job
Submission script
```
#!/bin/bash
#SBATCH --time=10:00
#SBATCH --partition=testing
echo "Hello from $(hostname)"
sleep 30
echo Bye
```
---
### Example 2: More CPUs
```
#!/bin/bash
#SBATCH --partition=rome
#SBATCH --time=10:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
stress --cpu ${SLURM_NTASKS} --timeout 60
```
Check CPU efficiency after the job completes: `seff <jobid>`
---
### Example 3: More Memory
By default, the memory available per task is:
```
Memory per task = Total memory on node / #CPUs on node
```
To allocate more memory, use the `--mem` directive:
```
#!/bin/bash
#SBATCH --partition=rome
#SBATCH --job-name=memory
#SBATCH --time=4:00
#SBATCH --mem=11G
./allocate-10gb
```
---
### Example 4: GPU jobs
* Partitions:
  * **gpu**: 2 nodes with an [NVIDIA Tesla P100](https://www.nvidia.com/en-us/data-center/tesla-p100/)
  * **ampere**: 1 node with 8 [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs
```
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=20
#SBATCH --time=10:00
nvidia-smi
```
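
The same pattern applies to the A100 node; a minimal sketch targeting the `ampere` partition (CPU count omitted, check the site docs for per-GPU limits):
```
#!/bin/bash
#SBATCH --partition=ampere
#SBATCH --gres=gpu:1       # request one of the node's A100 GPUs
#SBATCH --time=10:00
nvidia-smi                 # confirms which GPU was allocated
```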
---
### Run the XGBoost example Python script
as a batch job on the cluster
```
#!/bin/bash
#SBATCH --job-name=xgboost-example
#SBATCH --partition=rome
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=1:00:00
source /mnt/apps/custom/python-envs/xgboost-env/bin/activate
python xgboost_example.py
```
---
### THANK YOU !!
![it-auth.png](https://hackmd.io/_uploads/HJjd2kFXp.png)
{"title":"Machine Learning examples on Aristotle Cluster","description":"it-logo-100x100.png","slideOptions":"{\"theme\":\"black\"}","contributors":"[{\"id\":\"3e4e321b-8d22-47f8-8091-afc9b313f3dd\",\"add\":6613,\"del\":1083}]"}