Lab 1

From 2022 OSU CSE5449 Lab 1

Unlimited Attempts Allowed
The lab has the following tasks.

Setup and Login

Log in to the Owens (OSC) cluster with your account and learn how to allocate nodes. You can allocate CPU-only nodes as well as nodes with GPUs.

Login:

ssh USERNAME@owens.osc.edu

To allocate a node with a GPU, use:

salloc --nodes=1 --ntasks-per-node=28 --gpus-per-node=1 -A PAS2312 --time 1:0:0

To allocate a CPU-only node, use:

salloc --nodes=1 --ntasks-per-node=28 -A PAS2312 --time 1:0:0

Please note that you need to include '-A PAS2312' to be able to use the allocated resources.
Please use Owens for this lab.

Installing Miniconda and OMB

After allocating a node, ssh into the compute node to install Miniconda and OMB.

Node name: To get the names of the allocated nodes, use the following command:

squeue -u $(whoami)
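
If you only want the node names, squeue can print just the node list. This is an optional convenience sketch using standard Slurm format options:

squeue -u $(whoami) --noheader --format="%N"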

Then ssh into it (o0xx below is a placeholder for your node name):

ssh o0xx

Create directories (under your home directory):

mkdir owens
cd owens
mkdir lab1
cd lab1
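
Equivalently, as a one-line convenience from your home directory:

mkdir -p ~/owens/lab1 && cd ~/owens/lab1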

Load CUDA and GCC (Run this command after every new allocation)

module load gcc-compatibility/10.3.0 cuda/11.6.1

Load MVAPICH2 (Run this command after every new allocation)

source /fs/ess/PAS2312/owens/load_mv2

Install Miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $PWD/miniconda3
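
As an optional sanity check, you can confirm the install by printing the conda version from the new prefix:

$PWD/miniconda3/bin/conda --version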

Install OMB

wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-6.1.tar.gz
tar -xvf osu-micro-benchmarks-6.1.tar.gz
cd osu-micro-benchmarks-6.1

Configure, build, and install:

autoreconf -vif
./configure CC=/fs/ess/PAS2312/owens/mvapich2/bin/mpicc CXX=/fs/ess/PAS2312/owens/mvapich2/bin/mpicxx --enable-cuda --with-cuda-include=/usr/local/cuda/11.6.1/include --with-cuda-libpath=/usr/local/cuda/11.6.1/lib64 --prefix=$PWD/installed
make -j 20
make install

OMB will be installed in ~/owens/lab1/osu-micro-benchmarks-6.1/installed
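
As an optional check, the point-to-point binaries used below should appear under the install prefix:

ls ~/owens/lab1/osu-micro-benchmarks-6.1/installed/libexec/osu-micro-benchmarks/mpi/pt2pt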

Create Miniconda Environment and Set Up MPI4py
cd ~/owens/lab1
source miniconda3/bin/activate
conda create -n mpi4py python=3.9
conda activate mpi4py
export PYTHONNOUSERSITE=true

The following steps are only required once, to install MPI4py and the other Python packages:

export LD_LIBRARY_PATH=/lib64:$LD_LIBRARY_PATH
pip install numba cupy-cuda116 numpy
pip install --no-cache-dir mpi4py
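
As an optional sanity check (a sketch, assuming the modules and the mpi4py environment are loaded and srun can see your allocation), a two-process hello world should print two different ranks:

srun -n 2 --export=ALL,MV2_SUPPORT_DL=1 python -c "from mpi4py import MPI; comm = MPI.COMM_WORLD; print('rank', comm.Get_rank(), 'of', comm.Get_size())"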

Once the conda environment is created and the libraries are installed, you just need to load the environment in future experiments using the following commands:

source ~/owens/lab1/miniconda3/bin/activate
conda activate mpi4py
export PYTHONNOUSERSITE=true

Experiments

You need to run the following experiments for the lab.
To allocate two nodes with GPUs, use:

salloc --nodes=2 --ntasks-per-node=28 --gpus-per-node=1 -A PAS2312 --time 1:0:0

Run these commands after every new allocation:

module load gcc-compatibility/10.3.0 cuda/11.6.1
source /fs/ess/PAS2312/owens/load_mv2

source ~/owens/lab1/miniconda3/bin/activate
conda activate mpi4py
export PYTHONNOUSERSITE=true
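
To avoid retyping these five commands, you may find it convenient to collect them in a small script and source it after each new allocation. The sketch below assumes the file name setup_lab1.sh, which is just an example; the script must be sourced (not executed) so that module, conda, and export affect your current shell.

# setup_lab1.sh -- example helper; source this after every new allocation
module load gcc-compatibility/10.3.0 cuda/11.6.1
source /fs/ess/PAS2312/owens/load_mv2
source ~/owens/lab1/miniconda3/bin/activate
conda activate mpi4py
export PYTHONNOUSERSITE=true

Then, after every new allocation:

source setup_lab1.sh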

Part 1 (30 points)

Allocate 2 nodes and run OSU Micro-Benchmarks (OMB) on Owens.

  1. Run point-to-point communication benchmarks on CPU and GPU with max message size of 128 MB
    1. osu_latency
    2. osu_bw
    3. osu_bibw
  2. Run collective communication benchmarks on CPU and GPU with max message size of 128 MB
    1. osu_allreduce
    2. osu_bcast

OMB README

Example commands to run the above benchmarks:

Point-to-point:
GPU:

srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -m 128000000 D D

CPU:

srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -m 128000000 H H

Flags:
D: transfer data to/from the GPU (device buffer)
H: transfer data to/from the CPU (host buffer)
-m: max message size
help: see the benchmark's help output for more information

For running osu_bibw on GPU, use the following command:

srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1,MV2_CUDA_BLOCK_SIZE=8388608  installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m 140000000 D D

Collectives:

srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -d cuda -m 128000000

-d cuda: Place buffer on GPU
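
For the CPU runs of the collectives, omit -d cuda. For example, osu_allreduce with host buffers (a sketch following the same pattern as the command above):

srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -m 128000000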


Part 2 (30 points)

Allocate 2 nodes and run OSU Micro-Benchmarks for Python (OMB-Py) on Owens.

  1. Run point-to-point communication benchmarks on numpy, numba, and cupy buffers with max message size of 128 MB
    1. osu_latency
    2. osu_bw
    3. osu_bibw
  2. Run collective communication benchmarks on numpy, numba, and cupy buffers with max message size of 128 MB
    1. osu_allreduce
    2. osu_bcast

The numba and cupy buffers are allocated on the GPU, while the numpy buffer is created on the CPU.

OMB-Py README

Example command:

srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 python python/run.py --benchmark latency --buffer numba --max 128000000
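
The same flags apply to the other benchmarks and buffer types. For example, bandwidth with cupy buffers (a sketch; see the OMB-Py README for the exact accepted --benchmark and --buffer values):

srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 python python/run.py --benchmark bw --buffer cupy --max 128000000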

For running the bibw benchmark on GPU (numba and cupy buffers), use the following command:

srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1,MV2_CUDA_BLOCK_SIZE=8388608 python python/run.py --benchmark bibw --buffer numba --max 140000000

Part 3 (40 points)

Based on your experiments, answer the following:

  1. (10 points) Compare the performance of the OMB point-to-point and collective benchmarks on CPU with GPU. Explain why CPU communication is faster than GPU communication.
  2. (10 points) Compare the performance of the OMB-Py point-to-point and collective benchmarks for numpy, numba, and cupy. Describe any interesting trends you find.
  3. (10 points) Compare the performance of OMB and OMB-Py. Why is OMB faster than OMB-Py?
  4. (5 points) What is MPI4py and why do we need it?
  5. (5 points) Compare Pitzer and Owens in terms of processor, GPU, main memory, and interconnect.

Submission

Submit a report in .docx or .pdf format through Carmen.

Report Requirements

(Here, Question 1 corresponds to Part 1, Question 2 to Part 2, and Questions 3-7 to Part 3 items 1-5.)

Questions 1 and 2: please paste the output of each experiment/benchmark along with the run command (srun …..)
Questions 3 and 4: compare the performance numbers for CPU and GPU in experiments 1/2 for different message sizes and give reasons why CPU communication may be faster
Question 5: a performance comparison and a list of reasons
Question 6: a 2-3 line description of MPI4py and a list of reasons why we need it
Question 7: a table

Rubrics:

Question 1: 3 points for each experiment (latency, bw, bibw, bcast, and allreduce)

  1. CPU: 3*5 = 15 points
  2. GPU: 3*5 = 15 points

Question 2: 2 points for each experiment (latency, bw, bibw, bcast, and allreduce)

  1. numpy: 2*5 = 10 points
  2. numba: 2*5 = 10 points
  3. cupy: 2*5 = 10 points

Others

  1. Late submissions will lose points. The due date will not be extended.
  2. 10 points will be deducted for every 24 hours of delay for a maximum allowable delay of 48 hours. Beyond 48 hours of delay, the student will receive no points for the assignment.

Academic Integrity

Collaborating or completing the assignment with help from others or as a group is NOT permitted.

Copying or reusing previous work done by others is not permitted. You may reuse the work you did as part of this class.

Resources:

OMB: https://mvapich.cse.ohio-state.edu/benchmarks/

Miniconda: https://docs.conda.io/en/latest/miniconda.html

MPI4py: https://mpi4py.readthedocs.io/en/stable/
https://ieeexplore.ieee.org/document/9439927

MVAPICH2: https://mvapich.cse.ohio-state.edu/

Owens: https://www.osc.edu/resources/technical_support/supercomputers/owens

Pitzer: https://www.osc.edu/resources/technical_support/supercomputers/pitzer