---
title: 'High Performance Computing Tutorial'
tags: codata-rda, sords
---
Most of this content has been copied from the CHPC [wiki page](https://wiki.chpc.ac.za/), the Carpentries Incubator [lesson on HPC](https://carpentries-incubator.github.io/hpc-intro), and the EPCC [tutorial on HPC](https://epcced.github.io/hpc-intro).
# High Performance Computing
## Table of Contents
[TOC]
## A few words about High Performance Computing - HPC
Frequently, research problems that use computing can outgrow the capabilities of the desktop or laptop computer where they started:
* A statistics student wants to cross-validate a model. This involves running the model 1000 times – but each run takes an hour. Running the model on a laptop will take over a month! In this research problem, final results are calculated after all 1000 models have run, but typically only one model is run at a time (in serial) on the laptop. Since each of the 1000 runs is independent of all others, and given enough computers, it’s theoretically possible to run them all at once (in parallel).
* A genomics researcher has been using small datasets of sequence data, but soon will be receiving a new type of sequencing data that is 10 times as large. It’s already challenging to open the datasets on a computer – analyzing these larger datasets will probably crash it. In this research problem, the calculations required might be impossible to parallelize, but a computer with more memory would be required to analyze the much larger future data set.
* An engineer is using a fluid dynamics package that has an option to run in parallel. So far, this option was not used on a desktop. In going from 2D to 3D simulations, the simulation time has more than tripled. It might be useful to take advantage of that option or feature. In this research problem, the calculations in each region of the simulation are largely independent of calculations in other regions of the simulation. It’s possible to run each region’s calculations simultaneously (in parallel), communicate selected results to adjacent regions as needed, and repeat the calculations to converge on a final set of results. In moving from a 2D to a 3D model, both the amount of data and the amount of calculations increases greatly, and it’s theoretically possible to distribute the calculations across multiple computers communicating over a shared network.
In all these cases, access to more (and larger) computers is needed. Those computers should be usable at the same time, solving many researchers’ problems in parallel.
### What is HPC?
The words “cloud”, “cluster”, and “high-performance computing” are used a lot in different contexts and with varying degrees of correctness. So what do they mean exactly? And more importantly, how do we use them for our work?
The cloud is a generic term commonly used to refer to remote computing resources of any kind – that is, any computers that you use but are not right in front of you. Cloud can refer to machines serving websites, providing shared storage, providing webservices (such as e-mail or social media platforms), as well as more traditional “compute” resources. An HPC system, on the other hand, is a network of computers, usually called a cluster. The computers in a cluster typically share a common purpose and are used to accomplish tasks that might otherwise be too big for any one computer.
* The cluster can serve to **offload code execution from your laptop/workstation**
* code that runs too long or needs too much memory or disk space
* clusters are particularly useful for executing parallel code
* on one compute node
* on multiple compute nodes at once
>📝 **Note on speed of execution:**
> * the compute nodes have similar architecture to your desktop
> * they are not much faster
> * the main advantage of cluster computing lies in parallel code execution
## Accessing the remote system
The first step in using a cluster is to establish a connection from our laptop to the cluster. When we are sitting at a computer (or standing, or holding it in our hands or on our wrists), we have come to expect a visual display with icons, widgets, and perhaps some windows or applications: a graphical user interface, or GUI. Since computer clusters are remote resources that we connect to over slow or intermittent interfaces (WiFi and VPNs especially), it is more practical to use a command-line interface, or CLI, to send commands as plain-text. If a command returns output, it is printed as plain text as well. The commands we run today will not open a window to show graphical results.
If you have ever opened the Windows Command Prompt or macOS Terminal, you have seen a CLI. If you have already taken The Carpentries’ courses on the UNIX Shell or Version Control, you have used the CLI on your local machine extensively. The only leap to be made here is to open a CLI on a remote machine, while taking some precautions so that other folks on the network can’t see (or change) the commands you’re running or the results the remote machine sends back. We will use the Secure SHell protocol (or SSH) to open an encrypted network connection between two machines, allowing you to send & receive text and data without having to worry about prying eyes.
SSH clients are usually command-line tools, where you provide the remote machine address as the only required argument. If your username on the remote system differs from what you use locally, you must provide that as well. If your SSH client has a graphical front-end, such as PuTTY or MobaXterm, you will set these arguments before clicking “connect.” From the terminal, you’ll write something like `ssh userName@hostname`, where the argument is just like an email address: the “@” symbol is used to separate the personal ID from the address of the remote machine.
```bash=
$ ssh student66@lengau.chpc.ac.za
Last login: Tue Jul 5 05:48:04 2022 from 177.141.194.38
Welcome to LENGAU
################################################################################
# #
# In order to receive notifications via email from the CHPC all users should #
# be subscribed to the CHPC user distribution list. If you are not part of the #
# distribution list you can subscribe at the following link: #
# https://lists.chpc.ac.za/mailman/listinfo/chpc-users #
# #
################################################################################
[student66@login2 ~]$
```
### Where are we?
Many users are tempted to think of a high-performance computing installation as one giant, magical machine. Sometimes, people will assume that the computer they’ve logged onto is the entire computing cluster. So what’s really happening? Which computer have we logged on to? The name of the current computer can be checked with the `hostname` command. (You may also notice that the current hostname is part of our prompt!)
```bash=
[student66@login2 ~]$ hostname
login2
```
```sequence
User->System: Connect to the system using SSH
Note right of System: System checks available login methods\nand selects password \nas authentication method
System-->User: Request Password
Note left of User: User fills password at prompt
User->System: Sends Password
Note right of System: Checks if password is valid
System-->User: User successfully authenticated
Note left of User: Receives remote prompt
User->System: Issue commands
```
## Job Scheduler
An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the ***scheduler***. On an HPC system, the **scheduler manages which jobs run where and when**.
The CHPC cluster uses PBS Pro as its job scheduler. With the exception of interactive jobs, all jobs are submitted to a batch queuing system and only execute when the requested resources become available. All batch jobs are queued according to priority. A user's priority is not static: the CHPC uses the “Fairshare” facility of PBS Pro to modify priority based on activity. This is done to ensure the finite resources of the CHPC cluster are shared fairly amongst all users.
## Terminology
* **Job**: your program on the cluster
* **Submit job**: instruct the cluster to run your program
* **Node**: compute node = group of cores that can access the same memory (also known as a computer or a machine; the example after this list shows how to inspect a node's cores and memory)
* **Memory**: main memory or RAM (fast memory directly connected to the processor, when your program is running it is stored in RAM together with needed data)
* **Core**: the basic computation unit inside a processor that can run a single process
* **Serial code**: runs on one core
* **Parallel code**: program that runs on two or more cores
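As a quick illustration of these terms, the standard Linux commands below report the cores and memory of whatever machine you are currently logged into (run on the login node they describe the login node, not a compute node):
```bash=
$ hostname    # name of the node we are on
$ nproc       # number of CPU cores visible on this node
$ free -h     # total and available RAM on this node, in human-readable units
```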
## Creating our first batch job
The most basic use of the scheduler is to run a command non-interactively. Any command (or series of commands) that you want to run on the cluster is called a job, and the process of using a scheduler to run the job is called batch job submission.
In this case, the job we want to run is just a shell script. Let’s create a demo shell script, saved as `job1.sh`, to run as a test.
```bash=
#!/bin/bash
echo 'This script is running on:'
hostname
sleep 60
```
Now let us run our first job directly on the login node and see what happens:
```bash=
$ chmod +x job1.sh
$ ./job1.sh
```
Result:
```bash
> This script is running on:
> login2
```
### Submitting the job to the cluster
```bash
$ qsub -P WCHPC -l select=1,walltime=0:10:0 job1.sh
```
```bash
> 4286014.sched01
```
And that’s all we need to do to submit a job. Our work is done – now the scheduler takes over and tries to run the job for us. While the job is waiting to run, it goes into a list of jobs called the queue. To check on our job’s status, we check the queue using the command `qstat -u yourUsername`
```bash=
$ qstat -u student66
```
```bash
sched01:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
4286014.sched01 student6 serial job1.sh -- 1 1 -- 00:10 Q --
```
We can see all the details of our job, most importantly its state in the **S** column. Here the job is **“Q”** (queued), waiting for the requested resources to become available; once the scheduler starts it, the state changes to **“R”** (running), and **“E”** means the job is exiting after it has run.
An important piece of information shown by the `qstat` command is the *Queue*. The queue is the parameter used for job sizing: it defines the partition of hardware resources that will be available to your job. The CHPC facility has these queues predefined:
| Queue Name | Max. cores | Min. cores | Max. jobs in queue | Max. jobs running | Max. time (hours) | Notes |
|------------|------------|------------|--------------------|-------------------|-------------------|-------|
| serial | 23 | 1 | 24 | 10 | 48 | For single-node non-parallel jobs. |
| seriallong | 12 | 1 | 24 | 10 | 144 | For very long sub 1-node jobs. |
| smp | 24 | 24 | 20 | 10 | 96 | For single-node parallel jobs. |
| normal | 240 | 25 | 20 | 10 | 48 | The standard queue for parallel jobs |
| large | 2400 | 264 | 10 | 5 | 96 | For large parallel runs |
| xlarge | 6000 | 2424 | 2 | 1 | 96 | For extra-large parallel runs |
| express | 2400 | 25 | N/A | 100 total nodes | 96 | For paid commercial use only |
| bigmem | 280 | 28 | 4 | 1 | 48 | For the large memory (1TiB RAM) nodes. |
| vis | 12 | 1 | 1 | 1 | 3 | Visualisation node |
| test | 24 | 1 | 1 | 1 | 3 | Normal nodes, for testing only |
| gpu_1 | 10 | 1 | | 2 | 12 | Up to 10 cpus, 1 GPU |
| gpu_2 | 20 | 1 | | 2 | 12 | Up to 20 cpus, 2 GPUs |
| gpu_3 | 36 | 1 | | 2 | 12 | Up to 36 cpus, 3 GPUs |
| gpu_4 | 40 | 1 | | 2 | 12 | Up to 40 cpus, 4 GPUs |
| gpu_long | 20 | 1 | | 1 | 24 | Up to 20 cpus, 1 or 2 GPUs |
### PBS Pro commands
| Command | What it does |
| -------- | -------- |
| `qstat` | View queued jobs |
| `qsub` | Submit a job to the scheduler |
| `qdel` | Delete one of your jobs from the queue|
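For example, to inspect or cancel the test job we submitted earlier (substitute the job ID returned by your own `qsub`):
```bash=
$ qstat -u student66           # list all of your queued and running jobs
$ qstat -f 4286014.sched01     # show the full status of a single job
$ qdel 4286014.sched01         # remove the job from the queue (or stop it if it is running)
```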
### Job script parameters
Parameters for any job submission are specified as `#PBS` comments in the job script file or as options to the `qsub` command. The essential options for the CHPC cluster include:
```bash
-l select=10:ncpus=24:mpiprocs=24:mem=120gb
```
sets the size of the job in terms of nodes, cores, and memory:
|Option| Effect|
| ----| ----|
|select=N | number of nodes needed|
|ncpus=N |number of cores per node|
|mpiprocs=N|number of MPI ranks (processes) per node|
|mem=Ngb|amount of RAM per node|
```bash=
-l walltime=4:00:00
```
sets the total expected wall clock time in hours:minutes:seconds. Note the wall clock limits for each queue.
The job size and wall clock time must be within the limits imposed on the queue used:
```bash=
-q normal
```
to specify the queue.
Each job will draw from the allocation of cpu-hours granted to your Research Programme:
```bash=
-P PRJT1234
```
specifies the project identifier short name, which is needed to identify the Research Programme allocation you will draw from for this job. Ask your PI for the project short name and replace *PRJT1234* with it. For our workshop we will use the `WCHPC` project.
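Putting these options together, a job script for a 10-node MPI job might look like the sketch below. The module name, output paths, and `my_mpi_program` executable are placeholders for your own application; the `#PBS` directives are the ones discussed above:
```bash=
#!/usr/bin/bash
#PBS -P PRJT1234
#PBS -q normal
#PBS -l select=10:ncpus=24:mpiprocs=24:mem=120gb
#PBS -l walltime=4:00:00
#PBS -o /mnt/lustre/users/username/myjob.out
#PBS -e /mnt/lustre/users/username/myjob.err
# Load the module(s) your application needs (placeholder name):
module load <your_application_module>
# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
mpirun -np 240 ./my_mpi_program
```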
## Environment setup
### Modules
CHPC uses the environment *modules* utility, which manipulates your shell environment, to provide access to the supported software in `/apps/`.
A module is a self-contained description of a software package: it contains the settings required to run the software package and, usually, encodes its dependencies on other software packages.
Each of the major CHPC applications has a modulefile that sets, unsets, appends to, or prepends to environment variables such as `$PATH`, `$LD_LIBRARY_PATH`, `$INCLUDE` and `$MANPATH` for the specific application. Each modulefile also sets functions or aliases for use with the application. You need only invoke a single command to configure the application/programming environment properly. The general format of this command is:
```$ module load <module_name>```
where `<module_name>` is the name of the module to load. It also supports Tab-key completion of command parameters.
For a list of available modules:
```$ module avail```
The sub-command may be abbreviated and optionally given a search term, e.g.:
```$ module ava chpc/open```
To see a synopsis of a particular modulefile's operations:
```$ module help <module_name>```
To see currently loaded modules:
```$ module list```
To remove a module:
```$ module unload <module_name>```
To remove all modules:
```$ module purge```
To search for a module name or part of a name:
```$ module -t avail 2>&1 | grep -i partname```
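As a concrete example, here is a short session that searches for the Python modules on Lengau, loads one, and confirms it is loaded (the module name is one of those listed later in this tutorial; the exact names on your cluster may differ):
```bash=
$ module -t avail 2>&1 | grep -i python        # find Python-related modules
$ module load chpc/python/miniconda/3-4.10.3   # load a specific version
$ module list                                  # confirm it is loaded
$ module unload chpc/python/miniconda/3-4.10.3 # remove it again
```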
### Extending the environment by adding new software
Sometimes the software you need is not available on the cluster, or you need a newer version of it. To solve this kind of problem we will use the *conda* environment manager.
### The `conda` environment manager

For this hands-on we are going to use the [conda environment manager](https://docs.conda.io/en/latest/). It is a dependency and environment management system for any language: Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, Fortran, and more.
The *conda environment management* tool allows us to create virtual environments completely separated from the operating system.
The tool is available for downloading in two flavors:
* ***Miniconda***: minimal package containing only the basic software/packages. This is the version we are going to use during this hands-on;
* ***Anaconda***: Maximal package containing most of the libraries/software used for doing data analysis;
The tool is commonly available as an environment module on most clusters. To check if `conda` is available on the cluster, run:
```bash=
$ module -t avail 2>&1 | grep -i conda
```
```
chpc/astro/anaconda/2
chpc/astro/anaconda/3
chpc/astro/anaconda/3_dev
chpc/pycgtool/anaconda-3
chpc/python/anaconda/2
chpc/python/anaconda/3
chpc/python/anaconda/3-2019.10
chpc/python/anaconda/3-2020.02
chpc/python/anaconda/3-2020.11
chpc/python/anaconda/3-2021.05
chpc/python/anaconda/3-2021.11
chpc/python/miniconda/3-4.10.3
```
In order to load the module we need to run: `module load chpc/python/miniconda/3-4.10.3`
### Creating the virtual environment using the `conda` command
Now we need to create a virtual environment in which to install a newer version of R:
```bash=
$ conda create -n rlang
```
We can list the available environments with the `conda env list` command:
```bash=
$ conda env list
```
The output should look like this:
```
# conda environments:
#
base * /apps/chpc/chem/miniconda3-4.10.3
ame /apps/chpc/chem/miniconda3-4.10.3/envs/ame
rlang /home/student66/.conda/envs/rlang
```
Now we are going to *activate* the `rlang` environment:
```bash=
$ source activate rlang
```
The prompt should change to:
```bash=
(rlang) [student66@login2 ~]$
```
If your prompt changed, the environment was created correctly and is now active.
### Installing *R*, and extra libraries
Now that we are inside our `rlang` environment, we need to install the *r-base* package and a few dependencies.
We will use the `conda install` command to install the following packages:
1. `r-base`
2. `r-tidyverse`
3. `r-doparallel`
We can install all the needed software at once with the command:
```bash=
$ conda install r-base r-tidyverse r-doparallel
```
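If you need a particular version of R, `conda install` also accepts version constraints, and you can point it at an explicit channel such as `conda-forge`. The command below is optional and only a sketch; the version number is an example, not a requirement of this tutorial:
```bash=
$ conda install -c conda-forge 'r-base>=4.2' r-tidyverse r-doparallel
```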
We can check whether *R* was installed correctly by using the `conda list` command combined with the `grep` command:
```bash=
$ conda list | grep r-base
```
The result should be:
```
r-base 4.2.0 h1ae530e_0
```
To check the version of R we just installed, run:
```bash=
$ R --version
```
```
R version 4.2.0 (2022-04-22) -- "Vigorous Calisthenics"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-conda-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
```
### Cleaning up
If you want to remove all the software we've installed, we should first deactivate the `rlang` virtual environment before removing it:
```bash=
$ source deactivate
```
Now we can use the `conda env remove` command with the `-n` option to specify which environment should be deleted:
```bash=
$ conda env remove -n rlang
```
## Running our first R Job
Our first job will be a simple single-core job that performs 10,000 trials of a classification task using [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression). For that we will use the `glm` [function](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm). Save the following script as `iris_base.R`:
```r=
library(foreach)
library(iterators)
# Keep the two non-setosa species: sepal length (column 1) and species (column 5)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
system.time({
  # %do% executes the trials sequentially, on a single core
  r <- foreach(icount(trials), .combine=rbind) %do% {
    ind <- sample(100, 100, replace=TRUE)                      # bootstrap resample of the 100 rows
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))  # fit the logistic regression
    coefficients(result1)
  }
})
```
Now we will create our submission script:
```bash=
#!/usr/bin/bash
#PBS -l select=1:ncpus=1
#PBS -P WCHPC
#PBS -q serial
#PBS -l walltime=0:04:00
#PBS -o /mnt/lustre/users/student66/r_single.out
#PBS -e /mnt/lustre/users/student66/r_single.err
# Make conda available and activate the environment where R is installed
module load chpc/python/miniconda/3-4.10.3
source activate rlang
# --vanilla runs R without saving or restoring workspaces
Rscript --vanilla iris_base.R
```
## Running a parallel R Job
We will use the `doParallel` [package](https://cran.r-project.org/web/packages/doParallel/index.html) to run the `foreach` loop in parallel by means of the `%dopar%` operator. Save the following script as `iris_par.R`:
```r=
library(foreach)
library(doParallel)
library(iterators)
# Register a parallel backend with 2 worker processes
registerDoParallel(2)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
system.time({
  # %dopar% distributes the iterations across the registered workers
  r <- foreach(icount(trials), .combine=rbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})
```
and the PBS submission file:
```bash=
#!/usr/bin/bash
#PBS -l select=1:ncpus=2:mpiprocs=2
#PBS -P WCHPC
#PBS -q serial
#PBS -l walltime=0:04:00
#PBS -o /mnt/lustre/users/student66/r_parallel.out
#PBS -e /mnt/lustre/users/student66/r_parallel.err
# ncpus=2 matches the two workers registered with registerDoParallel(2)
module load chpc/python/miniconda/3-4.10.3
source activate rlang
Rscript --vanilla iris_par.R
```
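Assuming the submission scripts above were saved as `r_single.qsub` and `r_parallel.qsub` (the file names are our choice; any name works), both jobs are submitted with `qsub`, just like our first test job:
```bash=
$ qsub r_single.qsub
$ qsub r_parallel.qsub
```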
In both R scripts we used the `system.time()` function to measure the time taken to perform the 10,000 trials. First, let's inspect the `r_single.out` file:
```bash=
$ cat lustre/r_single.out
```
```
user system elapsed
23.020 0.058 23.078
```
And now we will compare with the parallel version:
```bash=
$ cat lustre/r_parallel.out
```
```
user system elapsed
22.028 0.113 11.663
```
From the outputs we should focus on the `elapsed` column, which represents the wall-clock time. We see that the parallel version ran in almost half the time of the serial version (11.7 s versus 23.1 s elapsed, a speedup of roughly 2× on the 2 cores we requested).
###### tags: `Parallel Computing` `PBS` `R`