# Introduction to High Performance Computing

### Dr. Anirban Pal, Asst. Professor of Mechanical Engineering

Time, Date: 4:00-5:00 CT, 04/14/2023
Location: ECS 142 and Zoom
Pre-requisites: Ability to read and type
Recommended: Familiarity with the Linux command line and Python3

**Zoom link:**
Join Zoom Meeting
https://wtamu.zoom.us/j/94667371336?pwd=clhVbmRjckMvSTB0M2VVbWh3clh0UT09
Meeting ID: 946 6737 1336
Passcode: 9^FhY?57

**Old Video recording**
https://ensemble.wtamu.edu/hapi/v1/contents/permalinks/WTAMU-HPC-Workshop01/view

## 0. What is High Performance Computing (HPC)? (4:00-4:15)

HPC, or High Performance Computing, is the use of numerous processing elements (CPUs/GPUs) to solve calculation-intensive and/or data-intensive tasks. Tasks include, but are not limited to, scientific/engineering problem solving, data analysis, and visualization. Its history can be traced to supercomputing; in fact, the terms supercomputing and HPC are often used synonymously. Computing on your laptop/desktop may be inadequate: "You are going to need a bigger boat!"

Slides here: [Introduction to HPC](https://wtamu0-my.sharepoint.com/:b:/g/personal/apal_wtamu_edu/Eaa7xN4MHAhEvLUgeuxYlP0B4b2LFc_TueWm5s_YfMrPcw?e=CssIgh)

**If you are using Windows, follow steps W1-W3 and then skip to Section 2: STEP 3. If you are using Linux, skip to Section 1: STEP 1.**

---

**NOTE: For Windows users only!**

STEP W1: Install PuTTY from the Microsoft Store or download **putty.exe** from the [PuTTY website](https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html). You can also find the executable [here](https://wtamu0-my.sharepoint.com/:u:/g/personal/apal_wtamu_edu/EZR0Fw2nLD9ErbGrzhkP4n8BYSSJZ_QoaOumlZFpaAwMTQ?e=zvMqE3).

STEP W2: Run the PuTTY application, enter the following details, and click **Open**. Make sure you are on the **CISCO Campus VPN**.

>Hostname: hpcjump.wtamu.edu
>Port: 22

![](https://i.imgur.com/nZOsJsw.png)

STEP W3: Enter your HPC username and password. The password may not be displayed, so type carefully. Move to **Section 2: STEP 3**.

---

## 1. Creating a working folder and setting up file transfer with the HPC

STEP 1: Open a command line terminal (bash) and create a working directory in your name.

```bash
mkdir ~/Downloads/johndoe && cd ~/Downloads/johndoe
```

STEP 2: Download and extract FileZilla. Set up a remote connection.

```
git clone https://github.com/anirban-pal/filezilla_u64.git
cd filezilla_u64/
tar -xvf FileZilla_3.63.2.1_x86_64-linux-gnu.tar.bz2
FileZilla3/bin/filezilla
```

This should open up a GUI.

![](https://i.imgur.com/RAXeyKi.png)

Enter the following credentials in the GUI.

Host: hpcjump.wtamu.edu
Username: temp** (your username)
Port: 22

Hit QuickConnect and enter your **password**. If you succeed, you should see a folder structure on the right (remote site).

![](https://i.imgur.com/9eEBoGq.png)

This interface will allow you to quickly transfer files between the HPC and the local machine.
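If you prefer the command line to a GUI, the same transfer capability is available through SFTP. This is a minimal sketch assuming an OpenSSH client is installed on your machine (temp** is your HPC username):

```
sftp temp**@hpcjump.wtamu.edu
# inside the sftp session:
#   put localfile.png     upload a file to the HPC
#   get remotefile.png    download a file from the HPC
#   exit                  end the session
```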
## 2. Logging into WTAMU HPC (4:30-4:35)

STEP 1: SSH into the HPC jump server (temp** is your HPC username).

```
ssh temp**@hpcjump.wtamu.edu
```

STEP 2: Enter the HPC password provided to you.

```
apal@ecsl-44:~$ ssh buff50@hpcjump.wtamu.edu
+------------------------------------------------------------------------------+
|                      INFORMATION RESOURCES ACCESS POLICY                     |
+------------------------------------------------------------------------------+
| This computer system is the property of West Texas A&M University. Only      |
| authorized users may login to this computer system. All unauthorized use is  |
| strictly prohibited and subject to local, state, and/or federal laws.        |
| Therefore, this system is subject to security testing and monitoring. Misuse |
| is subject to criminal prosecution.                                          |
|                                                                              |
| If you proceed to log into this system, you acknowledge compliance with all  |
| related TAMU System Security Standards and University Security Controls      |
| located at https://www.wtamu.edu/rules/ . There should be no expectation of  |
| privacy except as otherwise provided by applicable privacy laws.             |
+------------------------------------------------------------------------------+
buff50@hpcjump.wtamu.edu's password:
```

STEP 3: You should now be on the **hpcjump** jump server. Log into the login node. Enter **yes** if prompted.

```
[buff50@hpcjump ~]$ ssh L01
The authenticity of host 'l01 (10.10.55.21)' can't be established.
ECDSA key fingerprint is SHA256:8bmTckGZJdagIq64oIPNJMyaA3K8FisXMnaHpi8Y3sE.
ECDSA key fingerprint is MD5:19:f6:c5:3c:e7:5d:33:39:49:87:71:68:2b:e4:bb:e1.
Are you sure you want to continue connecting (yes/no)? yes
```

You may need to re-enter the HPC password.

```
Warning: Permanently added 'l01,10.10.55.21' (ECDSA) to the list of known hosts.
+------------------------------------------------------------------------------+
|                          WTAMU COMPUTER ACCESS POLICY                        |
+------------------------------------------------------------------------------+
| This computer system is the property of West Texas A&M University. Only      |
| authorized users may login to this computer system. All unauthorized use is  |
| strictly prohibited and subject to local, state, and/or federal laws.        |
| Therefore, this system is subject to security testing and monitoring.        |
| Misuse is subject to criminal prosecution.                                   |
|                                                                              |
| If you proceed to log into this system, you acknowledge compliance with      |
| University Rule 24.99.99.W1 -- Security of Electronic Information Resources  |
| and all related University security polices located at                       |
| http://www.wtamu.edu/informationtechnology/university-saps-and-rules.aspx    |
| and University Rule 33.04.99.W1 -- Rules for Responsible Information         |
| Technology Usage. There should be no expectation of privacy except as        |
| otherwise provided by applicable privacy laws.                               |
+------------------------------------------------------------------------------+
buff50@l01's password:
Creating ECDSA key for ssh
[buff50@L01 ~]$
```

Now you are on the login node and ready to submit jobs!
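Optional: on Linux/macOS you can collapse the two hops (jump server, then login node) into a single command. This is a sketch that assumes a reasonably recent OpenSSH client with ProxyJump support (OpenSSH 7.3+); replace temp** with your username:

```
# One-hop login through the jump server (optional convenience)
ssh -J temp**@hpcjump.wtamu.edu temp**@L01
```

You will still be asked for your password at each hop.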
## 3. Submitting a simple python job on the HPC (4:35-4:40)

An HPC job typically needs three things:

>(1) **input data** to be processed (optional)
>(2) a **program** to do something, e.g. process the data (necessary)
>(3) a **job script** that informs the HPC how the job will be run (necessary)

and produces:

>(1) **output data** containing some information/data.

### 3.1 Preparing and studying the files

Files for the following section can be obtained by:

```
[buff50@L01 ~]$ module load spack19/git/2.38.1
[buff50@L01 ~]$ git clone https://github.com/anirban-pal/hackhpcworkshop.git
Cloning into 'hackhpcworkshop'...
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 5 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (5/5), done.
[buff50@L01 ~]$ cd hackhpcworkshop/simple/
[buff50@L01 simple]$
```

Here we will run a very simple python program, *simple.py*. Note that this program has no input data and simply prints the message "Hello World".

```
[buff50@L01 simple]$ cat simple.py
print("Hello World")
```

To run this program on an HPC, we need to create a **slurm job script** that contains instructions on how to run the program. The slurm job script *sb.simple* contains the following.

```
[buff50@L01 simple]$ cat sb.simple
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --partition=compute-cpu
#SBATCH --output=log.slurm.out
#SBATCH --error=log.slurm.err
#SBATCH --time=10:00:00

module load slurm/20.11.9
module load spack19/python/3.10.8

srun python simple.py
```

Let us spend some time understanding what all this means.

>#!/bin/bash

This line indicates that this is a bash script. Bash is the language of the command line.

>#SBATCH --nodes=1
>#SBATCH --tasks-per-node=1
>#SBATCH --partition=compute-cpu
>#SBATCH --output=log.slurm.out
>#SBATCH --error=log.slurm.err
>#SBATCH --time=10:00:00

These lines provide all the information on how the program will be run: how many nodes are requested, how many processors per node are requested, which partition of the cluster to run the program on, the filenames where the output and error information will be stored, and the upper time limit within which the program should complete or be terminated.

>module load slurm/20.11.9
>module load spack19/python/3.10.8

These commands tell the HPC where the **python** and **srun** programs are located. If you do not provide them, the HPC will not know what the words **python** and **srun** (used in the next line) mean.

>srun python simple.py

Finally, this line indicates that the program **simple.py** will be executed using the **python** interpreter and the **slurm (srun)** launcher.

### 3.2 Running the job and viewing results

Once the files *simple.py* and *sb.simple* are available, the job can be submitted and checked using:

```
[buff50@L01 simple]$ sbatch sb.simple
Submitted batch job 27968
[buff50@L01 simple]$ squeue
   JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
   27968 compute-g sb.simpl   buff50 PD   0:00      1 (None)
[temp60@L01 simple]$ cat log.slurm.out
Hello World
[temp60@L01 simple]$
```

You can also execute the parallel version of the program, *simple_mpi.py*, using the batch script *sb.simple_mpi*.
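For reference, the parallel program only needs a few extra lines so that each process can find out which MPI rank it is. A minimal sketch of what *simple_mpi.py* looks like, assuming mpi4py (the file in the workshop repository may differ slightly):

```python
# simple_mpi.py (sketch) -- each MPI rank prints its own greeting
from mpi4py import MPI

comm = MPI.COMM_WORLD      # communicator containing every rank in the job
rank = comm.Get_rank()     # this process's rank: 0, 1, ..., size-1

print("Hello World from processor", rank)
```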
```
[temp60@L01 simple]$ sbatch sb.simple_mpi
Submitted batch job 81717
[temp60@L01 simple]$ cat log.slurm.out
Hello World from processor 0
Hello World from processor 1
Hello World from processor 2
Hello World from processor 3
Hello World from processor 4
Hello World from processor 5
Hello World from processor 6
Hello World from processor 7
Hello World from processor 8
Hello World from processor 9
Hello World from processor 10
Hello World from processor 11
Hello World from processor 12
Hello World from processor 13
Hello World from processor 14
Hello World from processor 15
```

Now try setting the number of processors in *sb.simple* and *sb.simple_mpi*, resubmit the jobs, and view the output again.

>#SBATCH --tasks-per-node=16

## 4. Submitting a less simple python job on the HPC (using a single CPU, multiple CPUs, and a GPU)

### 4.1 Single-CPU implementation (serial) (4:40-4:45)

![](https://i.imgur.com/AyZxysV.jpg)

Now let us look at a less simple program, **color2gray_serial.py**. This program will run on 1 CPU and convert a set of images from color (RGB) to grayscale.

```
[buff50@L01 simple]$ cd ../serial/
[buff50@L01 serial]$ cat color2gray_serial.py
import numpy as np
from matplotlib import pyplot as plt
from time import time
import os

try:
    os.mkdir("grayscale")
except OSError as error:
    pass

start_time = time()

num_imgs = 800
for i in range(1,num_imgs+1):
    # Read Images
    infile = "/cm/shared/data/DIV2K_train_HR/{:04d}.png".format(i)
    img = plt.imread(infile)

    # Convert to grayscale
    orig = np.asarray(img)
    gray = (0.2989 * orig[:,:,0] + 0.5870 * orig[:,:,1] + 0.1140 * orig[:,:,2])*255

    # Output Images
    outfile = "grayscale/{:04d}.png".format(i)
    plt.imsave(outfile, gray, cmap="gray")
    print(outfile)

finish_time = time()
elapsed_time = finish_time - start_time
print("Time taken (seconds):", elapsed_time)
```

If you are familiar with python, you might be able to figure out what is going on. If not, we can study this program more closely. The key portion of the program is the loop.

```python=
num_imgs = 800
for i in range(1,num_imgs+1):
    # Read Images
    infile = "/cm/shared/data/DIV2K_train_HR/{:04d}.png".format(i)
    img = plt.imread(infile)

    # Convert to grayscale
    orig = np.asarray(img)
    gray = (0.2989 * orig[:,:,0] + 0.5870 * orig[:,:,1] + 0.1140 * orig[:,:,2])*255

    # Output Images
    outfile = "grayscale/{:04d}.png".format(i)
    plt.imsave(outfile, gray, cmap="gray")
    print(outfile)
```

The first part of the loop reads an image (indexed by i, the loop variable) into a 3D array **img**, which is then converted to a 3D numpy array **orig** in the second part. The three color channels (red, green, blue) of the image are the 2D matrices **orig[:,:,0]**, **orig[:,:,1]**, and **orig[:,:,2]** respectively, which are combined in a weighted average to create the grayscale image matrix **gray**. The third part writes the image to a file.

![](https://i.imgur.com/W7jWGfB.jpg)

Once the job has completed converting all 800 images, see how long it took.

```
[buff50@L01 serial]$ cat log.slurm.out | grep "Time"
Time taken (seconds): 92.5342857837677
```

### 4.2 Multi-CPU implementation (mpi) (4:45-4:55)

Now let us run the above program in parallel. The easiest way to parallelize the above code is to divide the set of images equally among the various processors.

```
[buff50@L01 serial]$ cd ../mpi/
[buff50@L01 mpi]$ cat color2gray_mpi.py
import numpy as np
from matplotlib import pyplot as plt
from time import time
from mpi4py import MPI
import os

try:
    os.mkdir("grayscale")
except OSError as error:
    pass

start_time = time()

world_comm = MPI.COMM_WORLD
world_size = world_comm.Get_size()
my_rank = world_comm.Get_rank()

num_imgs = 800
for i in range(1,num_imgs+1):
    if (i % world_size == my_rank):
        # Read Images
        infile = "/cm/shared/data/DIV2K_train_HR/{:04d}.png".format(i)
        img = plt.imread(infile)

        # Convert image to grayscale
        orig = np.asarray(img)
        gray = (0.2989 * orig[:,:,0] + 0.5870 * orig[:,:,1] + 0.1140 * orig[:,:,2])*255

        # Output Images
        outfile = "grayscale/{:04d}.png".format(i)
        plt.imsave(outfile, gray, cmap="gray")
        print(outfile, "processor ", my_rank)

world_comm.Barrier()

if (my_rank == 0):
    finish_time = time()
    elapsed_time = finish_time - start_time
    print("Time taken (seconds):", elapsed_time, " with ", world_size, " processors.")
```

We have had to make a few changes. See if you can spot what has changed. What do the changes mean?

```python=
world_comm = MPI.COMM_WORLD            #
world_size = world_comm.Get_size()     #
my_rank = world_comm.Get_rank()        #

num_imgs = 800
for i in range(1,num_imgs+1):
    if (i % world_size == my_rank):    #
        # Read Images
        ...
        ...
        ...
```
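The first three new lines ask MPI how many processes were launched (**world_size**) and which of them this process is (**my_rank**). The new `if` condition then hands out the images round-robin: each rank only processes the images whose index i leaves its own rank as the remainder i % world_size. A standalone illustration (not part of the workshop files) of how 8 images would be split across 4 ranks:

```python
# Illustration of the round-robin split used in color2gray_mpi.py
# (standalone example, not a workshop file)
num_imgs = 8
world_size = 4   # pretend the job was launched with 4 MPI ranks

for my_rank in range(world_size):
    assigned = [i for i in range(1, num_imgs + 1) if i % world_size == my_rank]
    print("rank", my_rank, "handles images", assigned)

# rank 0 handles images [4, 8]
# rank 1 handles images [1, 5]
# rank 2 handles images [2, 6]
# rank 3 handles images [3, 7]
```

The call to world_comm.Barrier() makes every rank wait until all ranks are finished, so the time printed by rank 0 reflects the slowest rank rather than just its own work.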
Now submit the mpi job.

```
[buff50@L01 mpi]$ sbatch sb.mpi
Submitted batch job 28103
[buff50@L01 mpi]$ cat log.slurm.out | grep "Time"
Time taken (seconds): 92.5342857837677 with 8 processors.
```

Change the number of processors used (--tasks-per-node) to 16 and 64 and resubmit the job to see how the *Time taken* changes. You should see faster runs (shorter times) as the number of processors increases.

### 4.3 GPU implementation (cuda) (4:55-5:00)

Running code on a GPU can be fast when the workload involves many matrix operations relative to I/O operations. Here we shall look at a different program, **cupy_numpy.py**, to see the advantage of using GPUs. This program has three tasks.

(Task 1) Create a 1000x1000x1000 array of ones.

```python=
### Numpy and CPU
s = time.time()
x_cpu = np.ones((1000,1000,1000))
e = time.time()
print("(Task 1) Time with Numpy+CPU: ",e - s)

### CuPy and GPU
s = time.time()
x_gpu = cp.ones((1000,1000,1000))
cp.cuda.Stream.null.synchronize()
e = time.time()
print("(Task 1) Time with Cupy+GPU: ",e - s)
```

(Task 2) Multiply the entire array by 5.

```python=
### Numpy and CPU
s = time.time()
x_cpu *= 5
e = time.time()
print("(Task 2) Time with Numpy+CPU: ",e - s)

### CuPy and GPU
s = time.time()
x_gpu *= 5
cp.cuda.Stream.null.synchronize()
e = time.time()
print("(Task 2) Time with Cupy+GPU: ",e - s)
```

(Task 3) Multiply the array by 5, then multiply the array by itself, and then add the array to itself.

```python=
### Numpy and CPU
s = time.time()
x_cpu *= 5
x_cpu *= x_cpu
x_cpu += x_cpu
e = time.time()
print("(Task 3) Time with Numpy+CPU: ",e - s)

### CuPy and GPU
s = time.time()
x_gpu *= 5
x_gpu *= x_gpu
x_gpu += x_gpu
cp.cuda.Stream.null.synchronize()
e = time.time()
print("(Task 3) Time with Cupy+GPU: ",e - s)
```
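Each GPU timing above ends with cp.cuda.Stream.null.synchronize() because CuPy launches GPU work asynchronously; without it, the clock would stop before the GPU has actually finished. The snippets also rely on imports at the top of the full script. A minimal header for *cupy_numpy.py* would look something like this (a sketch; the actual workshop file may differ):

```python
# Imports that the Task 1-3 snippets rely on (sketch; the workshop's
# cupy_numpy.py may differ in details)
import time            # time.time() provides the s/e timestamps
import numpy as np     # CPU arrays
import cupy as cp      # GPU arrays with a NumPy-like interface
```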
Submitting the program will show us the advantage of using GPUs.

```
[buff50@L01 mpi]$ cd ../cuda/
[buff50@L01 cuda]$ sbatch sb.cuda
Submitted batch job 28114
[buff50@L01 cuda]$ cat log.slurm.out
(Task 1) Time with Numpy+CPU:  0.8523945808410645
(Task 1) Time with Cupy+GPU:  0.31257200241088867
(Task 2) Time with Numpy+CPU:  0.590524435043335
(Task 2) Time with Cupy+GPU:  0.04698443412780762
(Task 3) Time with Numpy+CPU:  1.8139195442199707
(Task 3) Time with Cupy+GPU:  0.07738733291625977
```

## 5. File transfer between HPC and your computer

Linux users can use [FileZilla](https://filezilla-project.org/) (set up in Section 1) with instructions similar to those below. Copy files from the HPC into your local folder.

---

**NOTE: For Windows users only!**

STEP W1: Download **WinSCP.exe** from the [WinSCP website](https://winscp.net/download/WinSCP-5.19.6-Portable.zip). You can also find the executable [here](https://wtamu0-my.sharepoint.com/:u:/g/personal/apal_wtamu_edu/EZR0Fw2nLD9ErbGrzhkP4n8BYSSJZ_QoaOumlZFpaAwMTQ?e=zvMqE3).

STEP W2: Run the WinSCP application, enter the following details, and click **Login**. Make sure you are on the **CISCO Campus VPN**.

>Hostname: hpcjump.wtamu.edu
>Port: 22

![](https://i.imgur.com/EIsuuQI.png)

STEP W3: Enter your HPC username and password. You should now be able to transfer files between the HPC and your computer.

---

## 6. Conclusion

HPC resources can be really useful when the task at hand is too big for a workstation to handle (e.g. data science, computational science and engineering). In that case, one must either develop a parallel program (with potentially parallel I/O) or use an existing parallelized code for the task. Developing a good parallel code requires extensive knowledge of computer architecture (processor speeds, cache sizes, memory, network latency and bandwidth, etc.). Maintaining and administering such HPC systems is also quite a challenge, considering reliability, cybersecurity, and package/library management. Both present significant career opportunities.

**Please complete the following survey (2-3 min):**
https://wtamuuw.az1.qualtrics.com/jfe/form/SV_es3ZVxmqXfMd0V0