---
tags: CSE5449
---
# Lab 1
*From 2022 OSU CSE5449 Lab 1*
Unlimited Attempts Allowed
The lab has the following tasks.
## Setup and Login
Log in to the Owens (OSC) cluster with your account and learn how to allocate nodes. You can allocate CPU-only nodes as well as nodes with GPUs.
Log in:
```
ssh USERNAME@owens.osc.edu
```
To allocate a node with GPU, use:
```
salloc --nodes=1 --ntasks-per-node=28 --gpus-per-node=1 -A PAS2312 --time 1:0:0
```
To allocate a CPU-only node, use:
```
salloc --nodes=1 --ntasks-per-node=28 -A PAS2312 --time 1:0:0
```
Please note that you need to add `-A PAS2312` to the command to be able to use the allocated resources.
==Please use Owens for this lab==
### Installing Miniconda and OMB
After allocating a node, ssh into the compute node to install Miniconda and OMB.
To get the names of your allocated nodes, use the following command:
```
squeue -u $(whoami)
```
Then ssh into it (replace `o0xx` with the name of your node):
```
ssh o0xx
```
Create directories
```
mkdir owens
cd owens
mkdir lab1
cd lab1
```
Load CUDA and GCC (==Run this command after every new allocation==)
```
module load gcc-compatibility/10.3.0 cuda/11.6.1
```
Load MVAPICH2 (==Run this command after every new allocation==)
```
source /fs/ess/PAS2312/owens/load_mv2
```
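To confirm that the MVAPICH2 wrappers and CUDA toolkit are set up correctly, an optional sanity check (the exact paths may differ on your account):
```
which mpicc      # should point under /fs/ess/PAS2312/owens/mvapich2
mpicc -show      # prints the underlying compiler and link flags
nvcc --version   # confirms the CUDA 11.6 toolkit is loaded
```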
#### Install Miniconda
```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $PWD/miniconda3
```
#### Install OMB
```
wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-6.1.tar.gz
tar -xvf osu-micro-benchmarks-6.1.tar.gz
cd osu-micro-benchmarks-6.1
```
##### Configuration
```
autoreconf -vif
./configure CC=/fs/ess/PAS2312/owens/mvapich2/bin/mpicc CXX=/fs/ess/PAS2312/owens/mvapich2/bin/mpicxx --enable-cuda --with-cuda-include=/usr/local/cuda/11.6.1/include --with-cuda-libpath=/usr/local/cuda/11.6.1/lib64 --prefix=$PWD/installed
make -j 20
make install
```
:::info
OMB will be installed in ~/owens/lab1/osu-micro-benchmarks-6.1/installed
:::
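To verify that the build and install succeeded, you can list the installed binaries (an optional check; the paths assume the `--prefix` used above and that you are still in the `osu-micro-benchmarks-6.1` directory):
```
ls installed/libexec/osu-micro-benchmarks/mpi/pt2pt/
ls installed/libexec/osu-micro-benchmarks/mpi/collective/
```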
##### Create Miniconda Environment and Setup MPI4py
```
cd ~/owens/lab1
source miniconda3/bin/activate
conda create -n mpi4py python=3.9
conda activate mpi4py
export PYTHONNOUSERSITE=true
```
The following export is only required when installing MPI4py:
```
export LD_LIBRARY_PATH=/lib64:$LD_LIBRARY_PATH
```
```
pip install numba cupy-cuda116 numpy
pip install --no-cache-dir mpi4py
```
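To check that mpi4py was built against MVAPICH2 rather than a conda-provided MPI, a quick optional check from the same environment (it should report an MVAPICH2 version string):
```
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
```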
:::warning
Once the conda environment is created and the libraries are installed, you only need to load the environment in future experiments using the following commands:
```
source ~/owens/lab1/miniconda3/bin/activate
conda activate mpi4py
export PYTHONNOUSERSITE=true
```
:::
## Experiments
You need to run the following experiments for this lab.
To allocate ==two== nodes with GPUs, use:
```
salloc --nodes=2 --ntasks-per-node=28 --gpus-per-node=1 -A PAS2312 --time 1:0:0
```
:::danger
Run these commands after every new allocation:
```
module load gcc-compatibility/10.3.0 cuda/11.6.1
source /fs/ess/PAS2312/owens/load_mv2
source ~/owens/lab1/miniconda3/bin/activate
conda activate mpi4py
export PYTHONNOUSERSITE=true
```
:::
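Before running the benchmarks, it can help to confirm that both nodes and their GPUs are visible from the allocation (an optional check):
```
srun -n 2 --ntasks-per-node=1 hostname       # should print two different node names
srun -n 2 --ntasks-per-node=1 nvidia-smi -L  # should list one GPU per node
```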
### Part 1 (30 points)
==Allocate 2 nodes and run OSU Micro-Benchmarks (OMB) on Owens.==
1. Run point-to-point communication benchmarks on CPU and GPU with max message size of 128 MB
1. osu_latency
2. osu_bw
3. osu_bibw
2. Run collective communication benchmarks on CPU and GPU with max message size of 128 MB
1. osu_allreduce
2. osu_bcast
[OMB README](https://mvapich.cse.ohio-state.edu/static/media/mvapich/README-OMB.txt)
#### Example commands to run the above benchmarks
:::info Pt2Pt
Point-to-point:
GPU:
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -m 128000000 D D
```
CPU:
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -m 128000000 H H
```
Flags:
- `D`: transfer data to/from GPU memory
- `H`: transfer data to/from CPU (host) memory
- `-m`: max message size
- `--help`: for more information
:::
:::warning
For running osu_bibw on GPU, use the following command:
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1,MV2_CUDA_BLOCK_SIZE=8388608 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m 140000000 D D
```
:::
---
:::info Collectives
Collectives:
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -d cuda -m 128000000
```
-d cuda: Place buffer on GPU
:::
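To avoid retyping the srun line for every run, one possible sweep over the Part 1 benchmarks is sketched below. This is an illustration only: it assumes you are in `~/owens/lab1/osu-micro-benchmarks-6.1` inside a two-node GPU allocation with the modules and environment loaded, and the GPU osu_bibw run follows the block-size tip from the warning above (adjust `-m` if you prefer the 140000000 value used there):
```
OMB=installed/libexec/osu-micro-benchmarks
EXPORTS=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1

# point-to-point: H = host (CPU) buffers, D = device (GPU) buffers
for bench in osu_latency osu_bw osu_bibw; do
  for buf in H D; do
    EXTRA=""
    # per the warning above, GPU osu_bibw needs a larger CUDA block size
    [ "$bench" = osu_bibw ] && [ "$buf" = D ] && EXTRA=",MV2_CUDA_BLOCK_SIZE=8388608"
    srun -n 2 --export=${EXPORTS}${EXTRA} \
      $OMB/get_local_rank $OMB/mpi/pt2pt/$bench -m 128000000 $buf $buf \
      | tee part1_${bench}_${buf}.log
  done
done

# collectives with GPU buffers (-d cuda); drop "-d cuda" for the CPU runs
for bench in osu_allreduce osu_bcast; do
  srun -n 2 --export=$EXPORTS \
    $OMB/get_local_rank $OMB/mpi/collective/$bench -d cuda -m 128000000 \
    | tee part1_${bench}_gpu.log
done
```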
---
### Part 2 (30 points)
==Allocate 2 nodes and run OSU Micro-Benchmarks for Python (OMB-Py) on Owens.==
1. Run point-to-point communication benchmarks on numpy, numba, and cupy buffers with max message size of 128 MB
1. osu_latency
2. osu_bw
3. osu_bibw
2. Run collective communication benchmarks on numpy, numba, and cupy buffers with max message size of 128 MB
1. osu_allreduce
2. osu_bcast
:::warning
Numba and CuPy buffers are allocated on the GPU, and NumPy buffers are allocated on the CPU.
:::
[OMB-Py README](https://mvapich.cse.ohio-state.edu/static/media/mvapich/README-OMB-PY.txt)
#### Example command:
:::info
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 python python/run.py --benchmark latency --buffer numba --max 128000000
```
:::
:::warning
For running the bibw benchmark on GPU buffers (Numba & CuPy), use the following command:
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1,MV2_CUDA_BLOCK_SIZE=8388608 python python/run.py --benchmark bibw --buffer numba --max 140000000
```
:::
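As with Part 1, a simple sweep over the Part 2 runs might look like the sketch below. This is illustrative only: it assumes the same allocation and environment as above and that you are in the OMB source directory; the collective benchmark names (`allreduce`, `bcast`) and the `--buffer` values are assumed from the task list and the OMB-Py README, so adjust if they differ:
```
EXPORTS=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1
for bench in latency bw bibw allreduce bcast; do
  for buf in numpy numba cupy; do
    EXTRA=""
    # per the warning above, bibw on GPU buffers needs a larger CUDA block size
    [ "$bench" = bibw ] && [ "$buf" != numpy ] && EXTRA=",MV2_CUDA_BLOCK_SIZE=8388608"
    srun -n 2 --export=${EXPORTS}${EXTRA} \
      python python/run.py --benchmark $bench --buffer $buf --max 128000000 \
      | tee part2_${bench}_${buf}.log
  done
done
```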
---
### Part 3
==Based on your experiments, answer the following:==
1. (10 points) Compare the performance of the OMB point-to-point and collective benchmarks on CPU versus GPU. Explain why CPU communication is faster than GPU communication.
2. (10 points) Compare the performance of the OMB-Py point-to-point and collective benchmarks for NumPy, Numba, and CuPy buffers. Describe any interesting trends you find.
3. (10 points) Compare the performance of OMB and OMB-Py. Why is OMB faster than OMB-Py?
4. (5 points) What is MPI4py and why do we need it?
5. (5 points) Compare Pitzer and Owens in terms of processor, GPU, main memory, and interconnect.
## Submission
==Submit a report in `.docx` or `.pdf` format through carmen.==
### Report Requirements
`Questions 1 and 2`: please paste the output of each experiment/benchmark and the run command (srun …)
`Questions 3 and 4`: compare the performance numbers for CPU and GPU in experiments 1/2 for different message sizes and give reasons why CPU communication may be faster
`Question 5`: performance comparison and a list of reasons
`Question 6`: a 2-3 line description of MPI4py and a list of reasons why we need it
`Question 7`: a table
### Rubrics:
`Question 1`: 3 points for each experiment (latency, bw, bibw, bcast, and allreduce)
1. CPU: 3*5 = 15 points
2. GPU: 3*5 = 15 points
`Question 2`: 2 points for each experiment (latency, bw, bibw, bcast, and allreduce)
1. numpy: 2*5 = 10 points
2. numba: 2*5 = 10 points
3. cupy: 2*5 = 10 points
### Others
1. ==Late submissions will lose points.== The due date will not be extended.
2. 10 points will be deducted for every 24 hours of delay for a maximum allowable delay of 48 hours. Beyond 48 hours of delay, the student will receive no points for the assignment.
### Academic Integrity
1. Collaborating or completing the assignment with help from others or as a group is NOT permitted.
2. Copying or reusing previous work done by others is not permitted. You may reuse the work you did as part of this class.
## Resources:
OMB: https://mvapich.cse.ohio-state.edu/benchmarks/
Miniconda: https://docs.conda.io/en/latest/miniconda.html
MPI4py: https://mpi4py.readthedocs.io/en/stable/
https://ieeexplore.ieee.org/document/9439927
MVAPICH2: https://mvapich.cse.ohio-state.edu/
Owens: https://www.osc.edu/resources/technical_support/supercomputers/owens
Pitzer: https://www.osc.edu/resources/technical_support/supercomputers/pitzer