---
tags: CSE5449
---

# Lab 1

*From 2022 OSU CSE5449 Lab 1*

Unlimited Attempts Allowed

The lab has the following tasks.

## Setup and Login

Log in to the Owens (OSC) cluster with your account and learn how to allocate nodes. You can allocate CPU-only nodes as well as nodes with GPUs.

Login:
```
ssh USERNAME@owens.osc.edu
```

To allocate a node with a GPU, use:
```
salloc --nodes=1 --ntasks-per-node=28 --gpus-per-node=1 -A PAS2312 --time 1:0:0
```

To allocate a CPU-only node, use:
```
salloc --nodes=1 --ntasks-per-node=28 -A PAS2312 --time 1:0:0
```

Please note that you need to add `-A PAS2312` to be able to use the allocated resources.

==Please use Owens for this lab==

### Installing Miniconda and OMB

After allocating a node, ssh into the compute node to install Miniconda and OMB.

Node name: to get the names of the allocated nodes, use the following command:
```
squeue -u $(whoami)
```

Then ssh into it:
```
ssh o0xx
```

Create the working directories:
```
mkdir owens
cd owens
mkdir lab1
cd lab1
```

Load CUDA and GCC (==Run this command after every new allocation==):
```
module load gcc-compatibility/10.3.0 cuda/11.6.1
```

Load MVAPICH2 (==Run this command after every new allocation==):
```
source /fs/ess/PAS2312/owens/load_mv2
```

#### Install Miniconda
```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $PWD/miniconda3
```

#### Install OMB
```
wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-6.1.tar.gz
tar -xvf osu-micro-benchmarks-6.1.tar.gz
cd osu-micro-benchmarks-6.1
```

##### Configuration
```
autoreconf -vif
./configure CC=/fs/ess/PAS2312/owens/mvapich2/bin/mpicc CXX=/fs/ess/PAS2312/owens/mvapich2/bin/mpicxx --enable-cuda --with-cuda-include=/usr/local/cuda/11.6.1/include --with-cuda-libpath=/usr/local/cuda/11.6.1/lib64 --prefix=$PWD/installed
make -j 20
make install
```
:::info
OMB will be installed in ~/owens/lab1/osu-micro-benchmarks-6.1/installed
:::

##### Create Miniconda Environment and Set Up MPI4py
```
cd ~/owens/lab1
source miniconda3/bin/activate
conda create -n mpi4py python=3.9
conda activate mpi4py
export PYTHONNOUSERSITE=true
```

Only required for installing MPI4py:
```
export LD_LIBRARY_PATH=/lib64:$LD_LIBRARY_PATH
```
```
pip install numba cupy-cuda116 numpy
pip install --no-cache-dir mpi4py
```
:::warning
Once the conda env is created and the libraries are installed, in future experiments we only need to load them with the following commands:
```
source ~/owens/lab1/miniconda3/bin/activate
conda activate mpi4py
export PYTHONNOUSERSITE=true
```
:::
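To confirm that `mpi4py` picked up the MVAPICH2 library loaded above, a quick sanity check along the lines of the sketch below can help. The script name `hello_mpi4py.py` is an illustrative assumption and is not part of the lab materials.

```
# hello_mpi4py.py -- minimal sketch for checking the mpi4py installation
# (the file name is hypothetical; any location works)
from mpi4py import MPI

comm = MPI.COMM_WORLD                 # communicator spanning all launched ranks
rank = comm.Get_rank()                # this process's rank id
size = comm.Get_size()                # total number of ranks started by srun

print(f"rank {rank} of {size} running on {MPI.Get_processor_name()}")

if rank == 0:
    # The version string should mention MVAPICH2 if the intended MPI was linked.
    print(MPI.Get_library_version())
```

Running it with something like `srun -n 2 python hello_mpi4py.py` (after the module and activation commands above) should print one line per rank plus an MVAPICH2 version string on rank 0.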
## Experiments

You need to run the following experiments for the lab.

Allocate ==two== nodes with GPUs:
```
salloc --nodes=2 --ntasks-per-node=28 --gpus-per-node=1 -A PAS2312 --time 1:0:0
```

:::danger
Run these commands after every new allocation:
```
module load gcc-compatibility/10.3.0 cuda/11.6.1
source /fs/ess/PAS2312/owens/load_mv2
source ~/owens/lab1/miniconda3/bin/activate
conda activate mpi4py
export PYTHONNOUSERSITE=true
```
:::

### Part 1 ( 30 points )

==Allocate 2 nodes and run OSU Micro-Benchmarks (OMB) on Owens.==

1. Run point-to-point communication benchmarks on CPU and GPU with a max message size of 128 MB
   1. osu_latency
   2. osu_bw
   3. osu_bibw
2. Run collective communication benchmarks on CPU and GPU with a max message size of 128 MB
   1. osu_allreduce
   2. osu_bcast

[OMB README](https://mvapich.cse.ohio-state.edu/static/media/mvapich/README-OMB.txt)

#### Example commands to run the above benchmarks

:::info
Point-to-point (Pt2Pt):

GPU:
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -m 128000000 D D
```
CPU:
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -m 128000000 H H
```
Flags:
- D: transfer data to/from GPU
- H: transfer data to/from CPU
- -m: max message size
- --help: for more information
:::

:::warning
For running osu_bibw on GPU, use the following command:
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1,MV2_CUDA_BLOCK_SIZE=8388608 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bibw -m 140000000 D D
```
:::

---

:::info
Collectives:
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 installed/libexec/osu-micro-benchmarks/get_local_rank installed/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -d cuda -m 128000000
```
- -d cuda: place the buffer on the GPU
:::

---

### Part 2 ( 30 points )

==Allocate 2 nodes and run OSU Micro-Benchmarks for Python (OMB-Py) on Owens.==

1. Run point-to-point communication benchmarks on numpy, numba, and cupy buffers with a max message size of 128 MB
   1. osu_latency
   2. osu_bw
   3. osu_bibw
2. Run collective communication benchmarks on numpy, numba, and cupy buffers with a max message size of 128 MB
   1. osu_allreduce
   2. osu_bcast

:::warning
numba and cupy buffers are allocated on the GPU, while the NumPy buffer is created on the CPU (see the sketch below).
:::
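For reference only, the sketch below shows roughly what the three buffer types look like when created by hand; the array names and the small size are illustrative assumptions, and OMB-Py allocates and times its own buffers internally.

```
# Illustrative sketch of the three OMB-Py buffer types (names/sizes are made up).
import numpy as np          # NumPy: host (CPU) memory
import cupy as cp           # CuPy: GPU memory
from numba import cuda      # Numba: GPU memory
from mpi4py import MPI

comm = MPI.COMM_WORLD
n = 1024                    # small illustrative size, not the lab's 128 MB

numpy_buf = np.zeros(n, dtype=np.uint8)            # lives in main memory
cupy_buf  = cp.zeros(n, dtype=cp.uint8)            # lives in GPU memory
numba_buf = cuda.device_array(n, dtype=np.uint8)   # also lives in GPU memory

# Recent mpi4py accepts all three in its buffer-based calls; the GPU-resident
# ones rely on a CUDA-aware MPI (e.g. the MVAPICH2 here with MV2_USE_CUDA=1).
comm.Bcast(numpy_buf, root=0)
comm.Bcast(cupy_buf, root=0)
```

This only makes the distinction in the warning above concrete; for the actual measurements, use the run.py commands below.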
[OMB-Py README](https://mvapich.cse.ohio-state.edu/static/media/mvapich/README-OMB-PY.txt)

#### Example command:

:::info
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1 python/run.py --benchmark latency --buffer numba --max 128000000
```
:::

:::warning
For running the bibw benchmark on GPU (numba & cupy), use the following command:
```
srun -n 2 --export=ALL,MV2_USE_CUDA=1,MV2_SUPPORT_DL=1,MV2_CUDA_BLOCK_SIZE=8388608 python run.py --benchmark bibw --buffer numba --max 140000000
```
:::

---

### Part 3

==Based on your experiments, answer the following:==

1. (10 points) Compare the performance of OMB point-to-point and collective benchmarks on CPU with GPU. Explain why CPU communication is faster than GPU.
2. (10 points) Compare the performance of OMB-Py point-to-point and collective benchmarks for numpy, numba, and cupy. Describe whether you find any interesting trends.
3. (10 points) Compare the performance of OMB and OMB-Py. Why is OMB faster than OMB-Py?
4. (5 points) What is MPI4py and why do we need it?
5. (5 points) Compare Pitzer and Owens in terms of processor, GPU, main memory, and interconnect.

## Submission

==Submit a report in `.docx` or `.pdf` format through Carmen.==

### Report Requirements

(Questions 1 and 2 below refer to Parts 1 and 2; Questions 3-7 refer to Part 3, items 1-5.)

- `Questions 1 and 2`: please paste the output of each experiment/benchmark and the run command (srun …..)
- `Questions 3 and 4`: compare the performance numbers for CPU and GPU in experiments 1/2 for different message sizes and give reasons why CPU communication may be faster
- `Question 5`: performance comparison and a list of reasons
- `Question 6`: a 2-3 line description of MPI4py and a list of reasons why we need it
- `Question 7`: a table

### Rubrics:

`Question 1`: 3 points for each experiment (latency, bw, bibw, bcast, and allreduce)
1. CPU: 3*5 = 15 points
2. GPU: 3*5 = 15 points

`Question 2`: 2 points for each experiment (latency, bw, bibw, bcast, and allreduce)
1. numpy: 2*5 = 10 points
2. numba: 2*5 = 10 points
3. cupy: 2*5 = 10 points

### Others

1. ==Late submissions will lose points.== The due date will not be extended.
2. 10 points will be deducted for every 24 hours of delay, up to a maximum allowable delay of 48 hours. Beyond 48 hours of delay, the student will receive no points for the assignment.

### Academic Integrity

- Collaborating on or completing the assignment with help from others or as a group is NOT permitted.
- Copying or reusing previous work done by others is not permitted. You may reuse the work you did as part of this class.

## Resources:

- OMB: https://mvapich.cse.ohio-state.edu/benchmarks/
- Miniconda: https://docs.conda.io/en/latest/miniconda.html
- MPI4py: https://mpi4py.readthedocs.io/en/stable/ and https://ieeexplore.ieee.org/document/9439927
- MVAPICH2: https://mvapich.cse.ohio-state.edu/
- Owens: https://www.osc.edu/resources/technical_support/supercomputers/owens
- Pitzer: https://www.osc.edu/resources/technical_support/supercomputers/pitzer