# Introduction to High Performance Computing

### Dr. Anirban Pal, Asst. Professor of Mechanical Engineering

Time, Date: 4:00-5:00 CT, 04/14/2023
Location: ECS 142 and Zoom
Pre-requisites: Ability to read and type
Recommended: Familiarity with the Linux command line and Python3

**Zoom link:**
Join Zoom Meeting
https://wtamu.zoom.us/j/94667371336?pwd=clhVbmRjckMvSTB0M2VVbWh3clh0UT09
Meeting ID: 946 6737 1336
Passcode: 9^FhY?57

**Old Video recording**
https://ensemble.wtamu.edu/hapi/v1/contents/permalinks/WTAMU-HPC-Workshop01/view

## 0. What is High Performance Computing (HPC)? (4:00-4:15)

HPC, or High Performance Computing, is the use of numerous processing elements (CPUs/GPUs) to solve calculation-intensive and/or data-intensive tasks. Tasks include, but are not limited to, scientific/engineering problem solving, data analysis, and visualization. Its history can be traced to supercomputing; in fact, the terms supercomputing and HPC are often used synonymously. Computing on your laptop/desktop may be inadequate: "You are going to need a bigger boat!"

Slides here: [Introduction to HPC](https://wtamu0-my.sharepoint.com/:b:/g/personal/apal_wtamu_edu/Eaa7xN4MHAhEvLUgeuxYlP0B4b2LFc_TueWm5s_YfMrPcw?e=CssIgh)

**If you are using Windows, follow steps W1-W3 and then skip to Section 2: STEP 3. If you are using Linux, skip to Section 1: STEP 1.**

---

**NOTE: For Windows users only!**

STEP W1: Install PuTTY from the Microsoft Store or download **putty.exe** from the [PuTTY website](https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html). You can also find the executable [here](https://wtamu0-my.sharepoint.com/:u:/g/personal/apal_wtamu_edu/EZR0Fw2nLD9ErbGrzhkP4n8BYSSJZ_QoaOumlZFpaAwMTQ?e=zvMqE3).

STEP W2: Run the PuTTY application, enter the following details, and click **Open**. Make sure you are on the **CISCO Campus VPN**.

>Hostname: hpcjump.wtamu.edu
>Port: 22

![](https://i.imgur.com/nZOsJsw.png)

STEP W3: Enter your HPC username and password. The password may not be displayed, so type carefully. Move to **Section 2: STEP 3**.

---

## 1. Creating a working folder and setting up file transfer with the HPC

STEP 1: Open a command line terminal (bash) and create a working directory in your name.

```bash
mkdir ~/Downloads/johndoe && cd ~/Downloads/johndoe
```

STEP 2: Download and extract FileZilla. Set up a remote connection.

```
git clone https://github.com/anirban-pal/filezilla_u64.git
cd filezilla_u64/
tar -xvf FileZilla_3.63.2.1_x86_64-linux-gnu.tar.bz2
FileZilla3/bin/filezilla
```

This should open up a GUI.

![](https://i.imgur.com/RAXeyKi.png)

Enter the following credentials in the GUI.

Host: hpcjump.wtamu.edu
Username: temp** (your username)
Port: 22

Hit QuickConnect and enter your **password**. If you succeed, you should see a folder structure on the right (remote site).

![](https://i.imgur.com/9eEBoGq.png)

This interface will allow you to quickly transfer files between the HPC and the local machine.
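If you prefer the command line to a GUI, the same transfer capability is available through SFTP. This is a minimal sketch assuming an OpenSSH client is installed on your machine (temp** is your HPC username):

```
sftp temp**@hpcjump.wtamu.edu
# inside the sftp session:
#   put localfile.png     upload a file to the HPC
#   get remotefile.png    download a file from the HPC
#   exit                  end the session
```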
## 2. Logging into WTAMU HPC (4:30-4:35)

STEP 1: SSH into the HPC jump server (temp** is your HPC username).

```
ssh temp**@hpcjump.wtamu.edu
```

STEP 2: Enter the HPC password provided to you.

```
apal@ecsl-44:~$ ssh buff50@hpcjump.wtamu.edu
+------------------------------------------------------------------------------+
|                      INFORMATION RESOURCES ACCESS POLICY                     |
+------------------------------------------------------------------------------+
| This computer system is the property of West Texas A&M University. Only      |
| authorized users may login to this computer system. All unauthorized use is  |
| strictly prohibited and subject to local, state, and/or federal laws.        |
| Therefore, this system is subject to security testing and monitoring. Misuse |
| is subject to criminal prosecution.                                          |
|                                                                              |
| If you proceed to log into this system, you acknowledge compliance with all  |
| related TAMU System Security Standards and University Security Controls      |
| located at https://www.wtamu.edu/rules/ . There should be no expectation of  |
| privacy except as otherwise provided by applicable privacy laws.             |
+------------------------------------------------------------------------------+
buff50@hpcjump.wtamu.edu's password:
```

STEP 3: You should now be on the **hpcjump** jump server. Log into the login node. Enter **yes** if prompted.

```
[buff50@hpcjump ~]$ ssh L01
The authenticity of host 'l01 (10.10.55.21)' can't be established.
ECDSA key fingerprint is SHA256:8bmTckGZJdagIq64oIPNJMyaA3K8FisXMnaHpi8Y3sE.
ECDSA key fingerprint is MD5:19:f6:c5:3c:e7:5d:33:39:49:87:71:68:2b:e4:bb:e1.
Are you sure you want to continue connecting (yes/no)? yes
```

You may need to re-enter the HPC password.

```
Warning: Permanently added 'l01,10.10.55.21' (ECDSA) to the list of known hosts.
+------------------------------------------------------------------------------+
|                          WTAMU COMPUTER ACCESS POLICY                        |
+------------------------------------------------------------------------------+
| This computer system is the property of West Texas A&M University. Only      |
| authorized users may login to this computer system. All unauthorized use is  |
| strictly prohibited and subject to local, state, and/or federal laws.        |
| Therefore, this system is subject to security testing and monitoring.        |
| Misuse is subject to criminal prosecution.                                   |
|                                                                              |
| If you proceed to log into this system, you acknowledge compliance with      |
| University Rule 24.99.99.W1 -- Security of Electronic Information Resources  |
| and all related University security polices located at                       |
| http://www.wtamu.edu/informationtechnology/university-saps-and-rules.aspx    |
| and University Rule 33.04.99.W1 -- Rules for Responsible Information         |
| Technology Usage. There should be no expectation of privacy except as        |
| otherwise provided by applicable privacy laws.                               |
+------------------------------------------------------------------------------+
buff50@l01's password:
Creating ECDSA key for ssh
[buff50@L01 ~]$
```

Now you are on the login node and ready to submit jobs!
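Optional: on Linux/macOS you can collapse the two hops (jump server, then login node) into a single command. This is a sketch that assumes a reasonably recent OpenSSH client with ProxyJump support (OpenSSH 7.3+); replace temp** with your username:

```
# One-hop login through the jump server (optional convenience)
ssh -J temp**@hpcjump.wtamu.edu temp**@L01
```

You will still be asked for your password at each hop.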
## 3. Submitting a simple python job on the HPC (4:35-4:40)

An HPC job typically needs three things:

>(1) **input data** to be processed (optional)
>(2) a **program** to do something, e.g. process the data (necessary)
>(3) a **job script** that informs the HPC how the job will be run (necessary)

and produces:

>(1) **output data** containing some information/data.

### 3.1 Preparing and studying the files

Files for the following section can be obtained by:

```
[buff50@L01 ~]$ module load spack19/git/2.38.1
[buff50@L01 ~]$ git clone https://github.com/anirban-pal/hackhpcworkshop.git
Cloning into 'hackhpcworkshop'...
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 5 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (5/5), done.
[buff50@L01 ~]$ cd hackhpcworkshop/simple/
[buff50@L01 simple]$
```

Here we will run a very simple python program, *simple.py*. Note that this program has no input data and simply prints the message "Hello World".

```
[buff50@L01 simple]$ cat simple.py
print("Hello World")
```

To run this program on an HPC, we need to create a **slurm job script** that contains instructions on how to run the program. The slurm job script *sb.simple* contains the following.

```
[buff50@L01 simple]$ cat sb.simple
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --partition=compute-cpu
#SBATCH --output=log.slurm.out
#SBATCH --error=log.slurm.err
#SBATCH --time=10:00:00

module load slurm/20.11.9
module load spack19/python/3.10.8

srun python simple.py
```

Let us spend some time understanding what all this means.

>#!/bin/bash

This line indicates that this is a bash script. Bash is the language of the command line.

>#SBATCH --nodes=1
>#SBATCH --tasks-per-node=1
>#SBATCH --partition=compute-cpu
>#SBATCH --output=log.slurm.out
>#SBATCH --error=log.slurm.err
>#SBATCH --time=10:00:00

These lines provide all the information on how the program will be run: how many nodes are requested, how many processors per node are requested, which partition of the cluster to run the program on, the filenames where the output and error information will be stored, and the upper time limit within which the program should complete or be terminated.

>module load slurm/20.11.9
>module load spack19/python/3.10.8

These commands tell the HPC where the **python** and **srun** programs are located. If you do not provide them, the HPC will not know what the words **python** and **srun** (used in the next line) mean.

>srun python simple.py

Finally, this line indicates that the program **simple.py** will be executed using the **python** interpreter and the **slurm (srun)** launcher.

### 3.2 Running the job and viewing results

Once the files *simple.py* and *sb.simple* are available, the job can be submitted and checked using:

```
[buff50@L01 simple]$ sbatch sb.simple
Submitted batch job 27968
[buff50@L01 simple]$ squeue
   JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
   27968 compute-g sb.simpl   buff50 PD   0:00      1 (None)
[temp60@L01 simple]$ cat log.slurm.out
Hello World
[temp60@L01 simple]$
```

You can also execute the parallel version of the program, *simple_mpi.py*, using the batch script *sb.simple_mpi*.
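For reference, the parallel program only needs a few extra lines so that each process can find out which MPI rank it is. A minimal sketch of what *simple_mpi.py* looks like, assuming mpi4py (the file in the workshop repository may differ slightly):

```python
# simple_mpi.py (sketch) -- each MPI rank prints its own greeting
from mpi4py import MPI

comm = MPI.COMM_WORLD      # communicator containing every rank in the job
rank = comm.Get_rank()     # this process's rank: 0, 1, ..., size-1

print("Hello World from processor", rank)
```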
```
[temp60@L01 simple]$ sbatch sb.simple_mpi
Submitted batch job 81717
[temp60@L01 simple]$ cat log.slurm.out
Hello World from processor 0
Hello World from processor 1
Hello World from processor 2
Hello World from processor 3
Hello World from processor 4
Hello World from processor 5
Hello World from processor 6
Hello World from processor 7
Hello World from processor 8
Hello World from processor 9
Hello World from processor 10
Hello World from processor 11
Hello World from processor 12
Hello World from processor 13
Hello World from processor 14
Hello World from processor 15
```

Now try setting the number of processors in *sb.simple* and *sb.simple_mpi*, resubmit the jobs, and view the output again.

>#SBATCH --tasks-per-node=16

## 4. Submitting a less simple python job on the HPC (using a single CPU, multiple CPUs, and a GPU)

### 4.1 Single-CPU implementation (serial) (4:40-4:45)

![](https://i.imgur.com/AyZxysV.jpg)

Now let us look at a less simple program, **color2gray_serial.py**. This program will run on 1 CPU and convert a set of images from color (RGB) to grayscale.

```
[buff50@L01 simple]$ cd ../serial/
[buff50@L01 serial]$ cat color2gray_serial.py
import numpy as np
from matplotlib import pyplot as plt
from time import time
import os

try:
    os.mkdir("grayscale")
except OSError as error:
    pass

start_time = time()

num_imgs = 800
for i in range(1,num_imgs+1):
    # Read Images
    infile = "/cm/shared/data/DIV2K_train_HR/{:04d}.png".format(i)
    img = plt.imread(infile)

    # Convert to grayscale
    orig = np.asarray(img)
    gray = (0.2989 * orig[:,:,0] + 0.5870 * orig[:,:,1] + 0.1140 * orig[:,:,2])*255

    # Output Images
    outfile = "grayscale/{:04d}.png".format(i)
    plt.imsave(outfile, gray, cmap="gray")
    print(outfile)

finish_time = time()
elapsed_time = finish_time - start_time
print("Time taken (seconds):", elapsed_time)
```

If you are familiar with python, you might be able to figure out what is going on. If not, we can study this program more closely. The key portion of the program is the loop.

```python=
num_imgs = 800
for i in range(1,num_imgs+1):
    # Read Images
    infile = "/cm/shared/data/DIV2K_train_HR/{:04d}.png".format(i)
    img = plt.imread(infile)

    # Convert to grayscale
    orig = np.asarray(img)
    gray = (0.2989 * orig[:,:,0] + 0.5870 * orig[:,:,1] + 0.1140 * orig[:,:,2])*255

    # Output Images
    outfile = "grayscale/{:04d}.png".format(i)
    plt.imsave(outfile, gray, cmap="gray")
    print(outfile)
```

The first part of the loop reads an image (indexed by i, the loop variable) into a 3D array **img**, which is then converted to a 3D numpy array **orig** in the second part. The three color channels (red, green, blue) of the image are the 2D matrices **orig[:,:,0]**, **orig[:,:,1]**, and **orig[:,:,2]** respectively, which are combined in a weighted average to create the grayscale image matrix **gray**. The third part writes the image to a file.

![](https://i.imgur.com/W7jWGfB.jpg)

Once the job has completed converting all 800 images, see how long it took.

```
[buff50@L01 serial]$ cat log.slurm.out | grep "Time"
Time taken (seconds): 92.5342857837677
```

### 4.2 Multi-CPU implementation (mpi) (4:45-4:55)

Now let us run the above program in parallel. The easiest way to parallelize the above code is to divide the set of images equally among the various processors.

```
[buff50@L01 serial]$ cd ../mpi/
[buff50@L01 mpi]$ cat color2gray_mpi.py
import numpy as np
from matplotlib import pyplot as plt
from time import time
from mpi4py import MPI
import os

try:
    os.mkdir("grayscale")
except OSError as error:
    pass

start_time = time()

world_comm = MPI.COMM_WORLD
world_size = world_comm.Get_size()
my_rank = world_comm.Get_rank()

num_imgs = 800
for i in range(1,num_imgs+1):
    if (i % world_size == my_rank):
        # Read Images
        infile = "/cm/shared/data/DIV2K_train_HR/{:04d}.png".format(i)
        img = plt.imread(infile)

        # Convert image to grayscale
        orig = np.asarray(img)
        gray = (0.2989 * orig[:,:,0] + 0.5870 * orig[:,:,1] + 0.1140 * orig[:,:,2])*255

        # Output Images
        outfile = "grayscale/{:04d}.png".format(i)
        plt.imsave(outfile, gray, cmap="gray")
        print(outfile, "processor ", my_rank)

world_comm.Barrier()

if (my_rank == 0):
    finish_time = time()
    elapsed_time = finish_time - start_time
    print("Time taken (seconds):", elapsed_time, " with ", world_size, " processors.")
```

We have had to make a few changes. See if you can spot what has changed. What do the changes mean?

```python=
world_comm = MPI.COMM_WORLD            #
world_size = world_comm.Get_size()     #
my_rank = world_comm.Get_rank()        #

num_imgs = 800
for i in range(1,num_imgs+1):
    if (i % world_size == my_rank):    #
        # Read Images
        ...
        ...
        ...
```
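The first three new lines ask MPI how many processes were launched (**world_size**) and which of them this process is (**my_rank**). The new `if` condition then hands out the images round-robin: each rank only processes the images whose index i leaves its own rank as the remainder i % world_size. A standalone illustration (not part of the workshop files) of how 8 images would be split across 4 ranks:

```python
# Illustration of the round-robin split used in color2gray_mpi.py
# (standalone example, not a workshop file)
num_imgs = 8
world_size = 4   # pretend the job was launched with 4 MPI ranks

for my_rank in range(world_size):
    assigned = [i for i in range(1, num_imgs + 1) if i % world_size == my_rank]
    print("rank", my_rank, "handles images", assigned)

# rank 0 handles images [4, 8]
# rank 1 handles images [1, 5]
# rank 2 handles images [2, 6]
# rank 3 handles images [3, 7]
```

The call to world_comm.Barrier() makes every rank wait until all ranks are finished, so the time printed by rank 0 reflects the slowest rank rather than just its own work.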
Now submit the mpi job.

```
[buff50@L01 mpi]$ sbatch sb.mpi
Submitted batch job 28103
[buff50@L01 mpi]$ cat log.slurm.out | grep "Time"
Time taken (seconds): 92.5342857837677 with 8 processors.
```

Change the number of processors used (--tasks-per-node) to 16 and 64 and resubmit the job to see how the *Time taken* changes. You should see faster runs (shorter times) as the number of processors increases.

### 4.3 GPU implementation (cuda) (4:55-5:00)

Running code on a GPU can be fast when the workload involves many matrix operations relative to I/O operations. Here we shall look at a different program, **cupy_numpy.py**, to see the advantage of using GPUs. This program has three tasks.

(Task 1) Create a 1000x1000x1000 array of ones.

```python=
### Numpy and CPU
s = time.time()
x_cpu = np.ones((1000,1000,1000))
e = time.time()
print("(Task 1) Time with Numpy+CPU: ",e - s)

### CuPy and GPU
s = time.time()
x_gpu = cp.ones((1000,1000,1000))
cp.cuda.Stream.null.synchronize()
e = time.time()
print("(Task 1) Time with Cupy+GPU: ",e - s)
```

(Task 2) Multiply the entire array by 5.

```python=
### Numpy and CPU
s = time.time()
x_cpu *= 5
e = time.time()
print("(Task 2) Time with Numpy+CPU: ",e - s)

### CuPy and GPU
s = time.time()
x_gpu *= 5
cp.cuda.Stream.null.synchronize()
e = time.time()
print("(Task 2) Time with Cupy+GPU: ",e - s)
```

(Task 3) Multiply the array by 5, then multiply the array by itself, and then add the array to itself.

```python=
### Numpy and CPU
s = time.time()
x_cpu *= 5
x_cpu *= x_cpu
x_cpu += x_cpu
e = time.time()
print("(Task 3) Time with Numpy+CPU: ",e - s)

### CuPy and GPU
s = time.time()
x_gpu *= 5
x_gpu *= x_gpu
x_gpu += x_gpu
cp.cuda.Stream.null.synchronize()
e = time.time()
print("(Task 3) Time with Cupy+GPU: ",e - s)
```
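Each GPU timing above ends with cp.cuda.Stream.null.synchronize() because CuPy launches GPU work asynchronously; without it, the clock would stop before the GPU has actually finished. The snippets also rely on imports at the top of the full script. A minimal header for *cupy_numpy.py* would look something like this (a sketch; the actual workshop file may differ):

```python
# Imports that the Task 1-3 snippets rely on (sketch; the workshop's
# cupy_numpy.py may differ in details)
import time            # time.time() provides the s/e timestamps
import numpy as np     # CPU arrays
import cupy as cp      # GPU arrays with a NumPy-like interface
```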
Submitting the program will show us the advantage of using GPUs.

```
[buff50@L01 mpi]$ cd ../cuda/
[buff50@L01 cuda]$ sbatch sb.cuda
Submitted batch job 28114
[buff50@L01 cuda]$ cat log.slurm.out
(Task 1) Time with Numpy+CPU:  0.8523945808410645
(Task 1) Time with Cupy+GPU:  0.31257200241088867
(Task 2) Time with Numpy+CPU:  0.590524435043335
(Task 2) Time with Cupy+GPU:  0.04698443412780762
(Task 3) Time with Numpy+CPU:  1.8139195442199707
(Task 3) Time with Cupy+GPU:  0.07738733291625977
```

## 5. File transfer between HPC and your computer

Linux users can use [FileZilla](https://filezilla-project.org/) (set up in Section 1) with instructions similar to those below. Copy files from the HPC into your local folder.

---

**NOTE: For Windows users only!**

STEP W1: Download **WinSCP.exe** from the [WinSCP website](https://winscp.net/download/WinSCP-5.19.6-Portable.zip). You can also find the executable [here](https://wtamu0-my.sharepoint.com/:u:/g/personal/apal_wtamu_edu/EZR0Fw2nLD9ErbGrzhkP4n8BYSSJZ_QoaOumlZFpaAwMTQ?e=zvMqE3).

STEP W2: Run the WinSCP application, enter the following details, and click **Login**. Make sure you are on the **CISCO Campus VPN**.

>Hostname: hpcjump.wtamu.edu
>Port: 22

![](https://i.imgur.com/EIsuuQI.png)

STEP W3: Enter your HPC username and password. You should now be able to transfer files between the HPC and your computer.

---

## 6. Conclusion

HPC resources can be really useful when the task at hand is too big for a workstation to handle (e.g. data science, computational science and engineering). In that case, one must either develop a parallel program (with potentially parallel I/O) or use an existing parallelized code for the task. Developing a good parallel code requires extensive knowledge of computer architecture (processor speeds, cache sizes, memory, network latency and bandwidth, etc.). Maintaining and administering such HPC systems is also quite a challenge, considering reliability, cybersecurity, and package/library management. Both present significant career opportunities.

**Please complete the following survey (2-3 min):**
https://wtamuuw.az1.qualtrics.com/jfe/form/SV_es3ZVxmqXfMd0V0