
Most of this content has been copied from the CHPC wiki page, the Software Carpentries lesson on HPC, and the EPCC tutorial on HPC.

High Performance Computing

A few words about High Performance Computing - HPC

Frequently, research problems that use computing can outgrow the capabilities of the desktop or laptop computer where they started:

  • A statistics student wants to cross-validate a model. This involves running the model 1000 times – but each run takes an hour. Running the model on a laptop will take over a month! In this research problem, final results are calculated after all 1000 models have run, but typically only one model is run at a time (in serial) on the laptop. Since each of the 1000 runs is independent of all others, and given enough computers, it’s theoretically possible to run them all at once (in parallel).
  • A genomics researcher has been using small datasets of sequence data, but soon will be receiving a new type of sequencing data that is 10 times as large. It’s already challenging to open the datasets on a computer – analyzing these larger datasets will probably crash it. In this research problem, the calculations required might be impossible to parallelize, but a computer with more memory would be required to analyze the much larger future data set.
  • An engineer is using a fluid dynamics package that has an option to run in parallel. So far, this option was not used on a desktop. In going from 2D to 3D simulations, the simulation time has more than tripled. It might be useful to take advantage of that option or feature. In this research problem, the calculations in each region of the simulation are largely independent of calculations in other regions of the simulation. It’s possible to run each region’s calculations simultaneously (in parallel), communicate selected results to adjacent regions as needed, and repeat the calculations to converge on a final set of results. In moving from a 2D to a 3D model, both the amount of data and the amount of calculations increases greatly, and it’s theoretically possible to distribute the calculations across multiple computers communicating over a shared network.

In all these cases, access to more (and larger) computers is needed. Those computers should be usable at the same time, solving many researchers’ problems in parallel.

What is HPC?

The words “cloud”, “cluster”, and “high-performance computing” are used a lot in different contexts and with varying degrees of correctness. So what do they mean exactly? And more importantly, how do we use them for our work?

The cloud is a generic term commonly used to refer to remote computing resources of any kind – that is, any computers that you use but are not right in front of you. Cloud can refer to machines serving websites, providing shared storage, providing webservices (such as e-mail or social media platforms), as well as more traditional “compute” resources. An HPC system (or cluster), on the other hand, is a network of computers working together. The computers in a cluster typically share a common purpose, and are used to accomplish tasks that might otherwise be too big for any one computer.

  • The cluster can serve to offload code execution from your laptop/workstation
    • code that runs too long or needs too much memory or disk space
  • clusters are particularly useful for executing parallel code
    • on one compute node
    • on multiple compute nodes at once

📝 Note on speed of execution:

  • the compute nodes have similar architecture to your desktop
  • they are not much faster
  • the main advantage of cluster computing lies in parallel code execution

Accessing the remote system

The first step in using a cluster is to establish a connection from our laptop to the cluster. When we are sitting at a computer (or standing, or holding it in our hands or on our wrists), we have come to expect a visual display with icons, widgets, and perhaps some windows or applications: a graphical user interface, or GUI. Since computer clusters are remote resources that we connect to over slow or intermittent interfaces (WiFi and VPNs especially), it is more practical to use a command-line interface, or CLI, to send commands as plain-text. If a command returns output, it is printed as plain text as well. The commands we run today will not open a window to show graphical results.

If you have ever opened the Windows Command Prompt or macOS Terminal, you have seen a CLI. If you have already taken The Carpentries’ courses on the UNIX Shell or Version Control, you have used the CLI on your local machine extensively. The only leap to be made here is to open a CLI on a remote machine, while taking some precautions so that other folks on the network can’t see (or change) the commands you’re running or the results the remote machine sends back. We will use the Secure SHell protocol (or SSH) to open an encrypted network connection between two machines, allowing you to send & receive text and data without having to worry about prying eyes.

SSH clients are usually command-line tools, where you provide the remote machine address as the only required argument. If your username on the remote system differs from what you use locally, you must provide that as well. If your SSH client has a graphical front-end, such as PuTTY or MobaXterm, you will set these arguments before clicking “connect.” From the terminal, you’ll write something like ssh userName@hostname, where the argument is just like an email address: the “@” symbol is used to separate the personal ID from the address of the remote machine.

$ ssh student66@lengau.chpc.ac.za
Last login: Tue Jul 5 05:48:04 2022 from 177.141.194.38
Welcome to LENGAU
################################################################################
#                                                                              #
#  In order to receive notifications via email from the CHPC all users should #
#  be subscribed to the CHPC user distribution list. If you are not part of   #
#  the distribution list you can subscribe at the following link:             #
#  https://lists.chpc.ac.za/mailman/listinfo/chpc-users                       #
#                                                                              #
################################################################################
[student66@login2 ~]$

Where are we?

Many users are tempted to think of a high-performance computing installation as one giant, magical machine. Sometimes, people will assume that the computer they’ve logged onto is the entire computing cluster. So what’s really happening? What computer have we logged on to? The name of the current computer we are logged onto can be checked with the hostname command. (You may also notice that the current hostname is also part of our prompt!)

[student66@login2 ~]$ hostname
login2
(Sequence diagram: the User connects to the System using SSH; the System checks the available login methods and selects password authentication, then requests a password; the User enters the password at the prompt and it is sent to the System; the System checks that the password is valid, the User is successfully authenticated, receives the remote prompt, and can issue commands.)

Job Scheduler

An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.

The CHPC cluster uses PBSPro as its job scheduler. With the exception of interactive jobs, all jobs are submitted to a batch queuing system and only execute when the requested resources become available. All batch jobs are queued according to priority. A user's priority is not static: the CHPC uses the “Fairshare” facility of PBSPro to modify priority based on activity. This is done to ensure the finite resources of the CHPC cluster are shared fairly amongst all users.

Terminology

  • Job: your program on the cluster
  • Submit job: instruct the cluster to run your program
  • Node: compute node = group of cores that can access the same memory (also known as a computer or a machine)
  • Memory: main memory or RAM (fast memory directly connected to the processor, when your program is running it is stored in RAM together with needed data)
  • Core: the basic computation unit inside a processor that can run a single process
  • Serial code: runs on one core
  • Parallel code: program that runs on two or more cores

Creating our first batch job

The most basic use of the scheduler is to run a command non-interactively. Any command (or series of commands) that you want to run on the cluster is called a job, and the process of using a scheduler to run the job is called batch job submission.

In this case, the job we want to run is just a shell script. Let’s create a demo shell script to run as a test.

#!/bin/bash
echo 'This script is running on:'
hostname
sleep 60

Now let us run our first job and see what happens.

$ chmod +x job1.sh
$ ./job1.sh

Result:

> This script is running on:
> login2

Submitting the job to the cluster

$ qsub -P WCHPC -l select=1,walltime=0:10:0 job1.sh
> 4286014.sched01

And that’s all we need to do to submit a job. Our work is done – now the scheduler takes over and tries to run the job for us. While the job is waiting to run, it goes into a list of jobs called the queue. To check on our job’s status, we check the queue using the command qstat -u yourUsername (replacing yourUsername with your own username):

$ qstat -u student66
sched01:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
4286014.sched01 student6 serial   job1.sh       --    1   1    --  00:10 Q   --

We can see all the details of our job, most importantly its state, shown in the S column: here it is “Q” (queued, waiting for resources); once the scheduler starts it, the state changes to “R” (running), and “E” means the job is exiting.

An important piece of information shown by the qstat command is the Queue. The queue is the parameter used for job sizing: it defines the partition of the hardware resources that will be available to your job. The CHPC facility has these queues predefined:

| Queue name | Max. cores | Min. cores | Max. jobs in queue | Max. jobs running | Max. time (hrs) | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| serial | 23 | 1 | 24 | 10 | 48 | For single-node non-parallel jobs. |
| seriallong | 12 | 1 | 24 | 10 | 144 | For very long sub 1-node jobs. |
| smp | 24 | 24 | 20 | 10 | 96 | For single-node parallel jobs. |
| normal | 240 | 25 | 20 | 10 | 48 | The standard queue for parallel jobs. |
| large | 2400 | 264 | 10 | 5 | 96 | For large parallel runs. |
| xlarge | 6000 | 2424 | 2 | 1 | 96 | For extra-large parallel runs. |
| express | 2400 | 25 | N/A | 100 total nodes | 96 | For paid commercial use only. |
| bigmem | 280 | 28 | 4 | 1 | 48 | For the large memory (1 TiB RAM) nodes. |
| vis | 12 | 1 | 1 | 1 | 3 | Visualisation node. |
| test | 24 | 1 | 1 | 1 | 3 | Normal nodes, for testing only. |
| gpu_1 | 10 | 1 |  | 2 | 12 | Up to 10 cpus, 1 GPU. |
| gpu_2 | 20 | 1 |  | 2 | 12 | Up to 20 cpus, 2 GPUs. |
| gpu_3 | 36 | 1 |  | 2 | 12 | Up to 36 cpus, 3 GPUs. |
| gpu_4 | 40 | 1 |  | 2 | 12 | Up to 40 cpus, 4 GPUs. |
| gpu_long | 20 | 1 |  | 1 | 24 | Up to 20 cpus, 1 or 2 GPUs. |
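As an illustrative (not official) example of matching a request to these limits: a 48-core MPI job fits within the normal queue (25–240 cores, 48-hour wall clock limit), so, assuming a hypothetical job script mpi_job.sh, it could be submitted with:

$ qsub -P WCHPC -q normal -l select=2:ncpus=24:mpiprocs=24 -l walltime=12:00:00 mpi_job.sh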

PBS Pro commands

| Command | What it does |
| --- | --- |
| qstat | View queued jobs |
| qsub | Submit a job to the scheduler |
| qdel | Delete one of your jobs from the queue |
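For example, to remove the job we submitted earlier from the queue, qdel takes the job ID that qsub reported:

$ qdel 4286014.sched01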

Job script parameters

Parameters for any job submission are specified as #PBS comments in the job script file or as options to the qsub command. The essential options for the CHPC cluster include:

-l select=10:ncpus=24:mpiprocs=24:mem=120gb

sets the size of the job in terms of nodes and processors:

| Option | Effect |
| --- | --- |
| select=N | Number of nodes needed |
| ncpus=N | Number of cores per node |
| mpiprocs=N | Number of MPI ranks (processes) per node |
| mem=Ngb | Amount of RAM per node |

-l walltime=4:00:00

sets the total expected wall clock time in hours:minutes:seconds. Note the wall clock limits for each queue.

The job size and wall clock time must be within the limits imposed on the queue used (see the table of queues above). The queue is selected with the -q option:

-q normal

Each job will draw from the allocation of cpu-hours granted to your Research Programme:

-P PRJT1234

specifies the project identifier short name, which is needed to identify the Research Programme allocation you will draw from for this job. Ask your PI for the project short name and replace PRJT1234 with it. For our workshop we will use the WCHPC project.
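Putting these options together, a job script header might look like the sketch below; PRJT1234 and my_program are placeholders rather than real CHPC names, and the same options could equally be given on the qsub command line:

#!/usr/bin/bash
#PBS -l select=10:ncpus=24:mpiprocs=24:mem=120gb
#PBS -l walltime=4:00:00
#PBS -q normal
#PBS -P PRJT1234
# load any required modules and run the program, e.g.:
# module load <module_name>
# ./my_program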

Environment setup

Modules

CHPC uses the GNU modules utility, which manipulates your environment to provide access to the supported software in /apps/.

A module is a self-contained description of a software package - it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.

Each of the major CHPC applications has a modulefile that sets, unsets, appends to, or prepends to environment variables such as $PATH, $LD_LIBRARY_PATH, $INCLUDE, $MANPATH for the specific application. Each modulefile also sets functions or aliases for use with the application. You need only to invoke a single command to configure the application/programming environment properly. The general format of this command is:

$ module load <module_name>

where <module_name> is the name of the module to load. It also supports Tab-key completion of command parameters.
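To see exactly which environment variables a particular modulefile sets or modifies, most module implementations also provide a show (or display) sub-command; assuming it is available on the cluster:

$ module show <module_name>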

For a list of available modules:

$ module avail

The module command may be abbreviated and may optionally be given a search term, e.g.:

$ module ava chpc/open

To see a synopsis of a particular modulefile's operations:

$ module help <module_name>

To see currently loaded modules:

$ module list

To remove a module:

$ module unload <module_name>

To remove all modules:

$ module purge

To search for a module name or part of a name:

$ module -t avail 2>&1 | grep -i partname

Extending the environment by adding new software

Sometimes the software you need is not available on the cluster, or you need a newer version of that software. To solve this kind of problem we will use the conda environment manager tool.

The conda environment manager

For this hands-on we are going to use the conda environment manager. Conda provides dependency and environment management for any language: Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, FORTRAN, and more.

The conda environment management tool allows us to create virtual environments completely separated from the operating system.
The tool is available for download in two flavours:

  • Miniconda: a minimal installer containing only conda and its core dependencies. This is the version we are going to use during this hands-on.
  • Anaconda: a full installer bundling most of the libraries and software commonly used for data analysis.

The tool is commonly available as an environment module on most clusters. To check whether conda is available on the cluster, run:

$ module -t avail 2>&1 | grep -i conda
chpc/astro/anaconda/2
chpc/astro/anaconda/3
chpc/astro/anaconda/3_dev
chpc/pycgtool/anaconda-3
chpc/python/anaconda/2
chpc/python/anaconda/3
chpc/python/anaconda/3-2019.10
chpc/python/anaconda/3-2020.02
chpc/python/anaconda/3-2020.11
chpc/python/anaconda/3-2021.05
chpc/python/anaconda/3-2021.11
chpc/python/miniconda/3-4.10.3

In order to load the module we need to run: module load chpc/python/miniconda/3-4.10.3
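To confirm that conda is now available on your PATH, you can for example ask for its version (the exact number reported will depend on the module you loaded):

$ conda --version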

Creating the virtual environment using the conda command

Now we need to create our virtual environment to install the newer version of R:

$ conda create -n rlang

We can list the existing environments with the conda env list command:

$ conda env list

The output should look like this:

# conda environments:
#
base                  *  /apps/chpc/chem/miniconda3-4.10.3
ame                      /apps/chpc/chem/miniconda3-4.10.3/envs/ame
rlang                    /home/student66/.conda/envs/rlang

Now we are going to activate the rlang environment:

$ source activate rlang

The prompt should change to:

(rlang) [student66@login2 ~]$

If your prompt changed, the environment was created correctly and is now active.

Installing R, and extra libraries

Now that we are inside our rlang environment we need to install the r-base package and a few dependencies.

After activating the rlang environment we are going to use the conda install command to install the following packages:

  1. r-base
  2. r-tidyverse
  3. r-doparallel

We can install all the needed software at once by performing the command:

$ conda install r-base r-tidyverse r-doparallel
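conda will list the packages it is about to install and ask for confirmation; answer y to proceed. If you prefer a non-interactive install (for example, from inside a batch script), the -y flag accepts the changes automatically:

$ conda install -y r-base r-tidyverse r-doparallel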

We can check that r-base was correctly installed by using the conda list command combined with the grep command:

$ conda list | grep r-base

The result should be:

r-base                    4.2.0                h1ae530e_0

To check the version of R that we just installed, run:

$ R --version
R version 4.2.0 (2022-04-22) -- "Vigorous Calisthenics"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-conda-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.

Cleaning up

If you want to remove all the software we've installed, first deactivate the rlang virtual environment before removing it:

$ source deactivate

Now we can use the conda env remove command with the -n option to specify which environment should be deleted:

$ conda env remove -n rlang

Running our first R Job

Our first job will be a simple single-core job that performs 10000 trials of a classification task using a logistic regression approach. For that we will use the glm function.

library(foreach)
library(iterators)

# keep only the versicolor and virginica rows, and the Sepal.Length and Species columns
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000

# run the bootstrap logistic-regression fits sequentially (%do%) and time them
system.time({
  r <- foreach(icount(trials), .combine=rbind) %do% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})

Now we will create our submission script:

#!/usr/bin/bash
#PBS -l select=1:ncpus=1
#PBS -P WCHPC
#PBS -q serial
#PBS -l walltime=0:04:00
#PBS -o /mnt/lustre/users/student66/r_single.out
#PBS -e /mnt/lustre/users/student66/r_single.err

module load chpc/python/miniconda/3-4.10.3
source activate rlang

Rscript --vanilla iris_base.R
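The #PBS lines replace the options we previously passed to qsub on the command line, so the job can be submitted with just the script file. Assuming the script was saved as r_single.pbs (the filename is arbitrary):

$ qsub r_single.pbs

When the job finishes, its standard output and error will be written to the files named in the -o and -e directives.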

Running a parallel R Job

We will use the doParallel package for running for loops in parallel by means of the %dopar% operator.

library(foreach)
library(doParallel)
library(iterators)

# register a parallel backend with 2 worker processes
registerDoParallel(2)

x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000

# %dopar% distributes the loop iterations across the registered workers
system.time({
  r <- foreach(icount(trials), .combine=rbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})

and the PBS submission file:

#!/usr/bin/bash
#PBS -l select=1:ncpus=2:mpiprocs=2
#PBS -P WCHPC
#PBS -q serial
#PBS -l walltime=0:04:00
#PBS -o /mnt/lustre/users/student66/r_parallel.out
#PBS -e /mnt/lustre/users/student66/r_parallel.err

module load chpc/python/miniconda/3-4.10.3
source activate rlang

Rscript --vanilla iris_par.R

In both R scripts we used the system.time() function to measure the time taken to perform the 10000 trials. First, let's inspect the r_single.out file:

$ cat lustre/r_single.out
   user  system elapsed
 23.020   0.058  23.078

And now we will compare with the parallel version:

$ cat lustre/r_parallel.out
   user  system elapsed
 22.028   0.113  11.663

From the outputs we should focus on the elapsed column, which represents the wall clock time. We see that the parallel version ran in almost half the time of the serial version.
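As a quick sanity check, the speedup is the serial elapsed time divided by the parallel elapsed time: 23.078 s / 11.663 s ≈ 1.98, which is close to the ideal factor of 2 for the two workers registered with registerDoParallel(2).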

tags: Parallel Computing PBS R