---
tags: 111上
---
# Parallel Programming HW4
> 310552027 Chia-Liang Kuo
### Q1. MPI Hello World
1. **How do you control the number of MPI processes on each node?**
- I use `mpirun -np [num] --hostfile hosts mpi_hello` to control the number of MPI processes by specifying `[num]`
2. **Which functions do you use for retrieving the rank of an MPI process and the total number of processes?**
- retrieving the rank: `MPI_Comm_rank(MPI_COMM_WORLD, &rank_size)`
- retrieving the total number of processes: `MPI_Comm_size(MPI_COMM_WORLD, &world_size)`
### Q2. `pi_block_linear.cc`
1. Why MPI_Send and MPI_Recv are called “blocking” communication?
- "blocking" means the sender will wait until the receiver get the messages from sender, and the receiver will wait for the messages from sender.
2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it
![](https://i.imgur.com/sbnMAky.png)
### Q3. `pi_block_tree`
1. Measure the performance (execution time) of the code for 2, 4, 8, 16 MPI processes and plot it. (1 points)
![](https://i.imgur.com/4NUEKFU.png)
2. How does the performance of binary tree reduction compare to the performance of linear reduction?
- I think the performance of binary tree reduction is averagely better than linear reduction, especially when the number of MPI processes increased
3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs
- **in the linear reduction implementation**: the master MPI process has a time complexity of O(N) to receive the messages from the other MPI processes.
- **int the binary tree reduction implementation**: some of the MPI process will share the burden of receiving messages, which let the master MPI process only has a time complexity of O(log(N)) to receive messages from other processes.
- Besides, the master MPI process need to wait all the other MPI processes to finishing the computation in linear reduction method, which may be more probably that the execution time bounded by the slowest process.
### Q4. `pi_nonblock_linear`
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
![](https://i.imgur.com/ffVAoms.png)
2. What are the MPI functions for non-blocking communication?
- Non-blocking communication process return immediately even if the communication is not finished yet. We must call `MPI_Wait()`, `MPI_Waitall()` or `MPI_Test()` to see whether the communication has finished.
![](https://i.imgur.com/yYMy2fo.png)
3. How the performance of non-blocking communication compares to the performance of blocking communication?
- In my case, the performance of non-blocking linear reduction drastically outperforms the linear reduction one when using 8, 12 processers, but when using larger number of processers (i.e., 16), the performance will be slightly worse than the linear one. I think the main reason is the overhead increases when calling `MPI_Wait` function with larger number of processer.
### Q5. `MPI_Gather`
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 points)
![](https://i.imgur.com/w2Wg6wA.png)
### Q6. `MPI_Reduce`
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it
![](https://i.imgur.com/DQfDHgX.png)
### Q7. Describe what approach(es) were used in your MPI matrix multiplication for each data set.
When computing matrix multiplication, we need to input a matrix A with size ${n \times m}$ and another matrix B with size ${m \times l}$ then output the matrix C wirh size ${n \times l}$ the each element ${C[i][j]}$ is the dot product of ${A[i]}$ and ${B^T[j]}$
- Basic method: Assume there are `k` MPI processers.
- First, split the matrix `A` into `k` sub-matrices with size ${\frac{N}{k} \times M}$.
![](https://i.imgur.com/hSUjdbR.png)
- Then, send `k-1` such matrices and the whole matrix B to the other MPI processers from the master processer
- Each MPI process computes the result of the multiplication of the sub-matrix of A and the whole matrix B, then the results will be sent to the master processer.
![](https://i.imgur.com/wYiMk57.png)
- The master MPI processer output the result of the matrix multiplication results C.
![](https://i.imgur.com/8mr5rLd.png)
This basic method has a performance issue: when the array size B is quite large, we need to spend many time in sending data to other MPI processes. As a results, I tried another method that could drastically reduce the message size.
- Modified method: Assume there are `k` MPI processers.
- First, the modified method will define a constant called `MAX_BUFFER` that limiting the max size of the sending buffer, then deciding an variable called `iterations`, which make sure that ${\frac{N \times M}{iterations} < \text{MAX_BUFFER} }$ and ${\frac{M \times L}{iterations} < \text{MAX_BUFFER} }$
- Then, split the matrix `A` and `B` into `k` sub-matrices with size ${\frac{N}{k} \times M}$ and ${\frac{M}{k} \times L}$.
- Then, send `k-1` such matrices from A and B to the other MPI processers from the master processer.
- Each MPI process computes the result of the multiplication of the sub-matrix of A and the sub-matrix B, then the results will be sent and sum up to the master processer. After such process we can get the partial answer of matrix C. See the schematic diagram below: this process will compute the partial answer of multiplication results of C
![](https://i.imgur.com/kB8W5rj.png)
- Sadly, there were some bugs in my code and I couldn't figure out how to solve the problems. Finally, I only implemented the basic version that was quite slow :((

310552027 郭家良 Q1 (5 points): Explain your implementation. How do you optimize the performance of convolution? In the following experiment, I use 7x7 filter and images with size 400x600. run $./conv 1. First, I just implement a basic code and kernel.cl hostFE.c #include <stdio.h>

1/5/2023310552027 郭家良 Run code using shell script :::spoiler runnning shell script # bin/sh ITERATIONS=(1000 10000 100000)

12/22/2022ID:310552027 Name:Kuo Chia-Liang part2 Modify the starter code to parallelize the Mandelbrot generation using two processors. Specifically, compute the top half of the image in thread 0, and the bottom half of the image in thread 1. This type of problem decomposition is referred to as spatial decomposition since different spatial regions of the image are computed by different processors. Extend your code to use 2, 3, 4 threads, partitioning the image generation work accordingly (threads should get blocks of the image). Execution results 2,3,4 threads for view1

11/10/2022310552027 網工所 郭家良 Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why? use the shell script to automatically run the experiment # bin/sh VECTOR_WIDTH=(2 4 8 16)

10/25/2022
Published on ** HackMD**

or

By clicking below, you agree to our terms of service.

Sign in via Facebook
Sign in via Twitter
Sign in via GitHub
Sign in via Dropbox
Sign in with Wallet

Wallet
(
)

Connect another wallet
New to HackMD? Sign up