# Parallel Programming Homework 4

## Q1

### 1

The number of processes per node can be configured by specifying the _slots_ option in the hostfile. See the example below.

```shell=
pp1 slots=1
pp2 slots=1
pp3 slots=1
pp4 slots=1
pp5 slots=1
```

### 2

The functions to retrieve the rank of an MPI process and the total number of processes are _MPI_Comm_rank(MPI_COMM_WORLD, &world_rank)_ and _MPI_Comm_size(MPI_COMM_WORLD, &world_size)_ respectively.

### 3

There are various implementations of the MPI standard; common ones include OpenMPI, MPICH, and Intel MPI. The differences between them cater to different needs. For example, MPICH is considered the most complete implementation of the MPI standard, whereas OpenMPI targets the common case.

## Q2

### 1

_MPI_Send_ and _MPI_Recv_ are called blocking because they block the execution of the calling process until the operation completes. For instance, _MPI_Recv_ returns only after the message has arrived in the receive buffer, and _MPI_Send_ returns only once the send buffer can safely be reused, which may require the matching receive to have been posted (a minimal sketch of this behavior is given after Q8).

### 2

![](https://i.imgur.com/RTPQ6cV.png)

## Q3

### 1

![](https://i.imgur.com/PmatpX1.png)

### 2

The measured performance difference is not significant, with the tree method slightly faster with 16 cores. This is likely because the reduction exchanges only one small message per process at the end of the run, so the reduction step accounts for a tiny fraction of the total time regardless of the algorithm used.

### 3

The binary tree reduction algorithm should perform better with more cores. Since the communication uses blocking calls, the linear reduction requires the root to receive the partial results sequentially in p - 1 steps, which is unnecessary and time-consuming; the tree reduction needs only ceil(log2 p) rounds (a sketch is given after Q8).

## Q4

### 1

![](https://i.imgur.com/G1CGjqm.png)

### 2

The functions for MPI non-blocking communication are MPI_Isend and MPI_Irecv; a minimal sketch is given after Q8.

### 3

The non-blocking version performed slightly worse at the time of measurement. However, the load on the workstation might have differed between runs, so the results might not be accurate.

## Q5

### 1

![](https://i.imgur.com/3NMLstB.png)

## Q6

### 1

![](https://i.imgur.com/UEJpmS3.png)

## Q7

### 1

![](https://i.imgur.com/iA44HeV.png)

### 2

Based on the time measurements, the linear blocking case has the best performance with 16 cores. The differences among the algorithms are not significant, and since the execution environment is shared, the results might not be accurate.

> [color=#b59734] Provide a table to compare the data. The analysis is not quite right.
> [name=Liu An Chi]

### 3

For MPICH, if the input data size is small, the default algorithm is a binomial tree based on the ranks. When the data size is large, MPICH uses a reduce-scatter followed by a gather to the root.

## Q8

### 1

![](https://i.imgur.com/vJOuWJj.png)

> [color=#b59734] intra seems wrong
> [name=Liu An Chi]

### 2

The bandwidths are 0.11 GB/s and 6.65 GB/s for inter- and intra-node communication, and the latencies are 48.5 µs and 1.23 µs respectively (a ping-pong loop of the kind used for these measurements is sketched below). Because the intra-node timings are very small, the measurement error is significant; as a result, a linear regression on the obtained data yields a negative latency. Instead, the latency was obtained by averaging a few data points with small message sizes. A possible solution is to estimate the measurement error by repeating the measurement at each message size and computing the standard error, and then to use Bayesian linear regression to extrapolate the latency.

### 3

The obtained latency might be affected by various factors, including the current CPU workload, memory occupancy, and network conditions.
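The minimal sketches below illustrate the communication patterns referenced in the questions above; the payloads, tags, and rank choices are illustrative assumptions, not the homework code. First, the blocking pair from Q2, which also shows the rank and size queries from Q1:

```c=
/* Blocking point-to-point communication (Q2): rank 0 sends one int to
 * rank 1. Run with at least two processes, e.g. mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* total number of processes (Q1) */
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* rank of this process (Q1) */

    int value = 42;
    if (world_rank == 0) {
        /* Returns only once the send buffer is safe to reuse. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (world_rank == 1) {
        /* Returns only after the message has arrived. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```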
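Next, a sketch of the binary-tree reduction described in Q3, reducing one long per rank to rank 0 with blocking calls; each round halves the number of active senders, giving ceil(log2 p) rounds instead of the p - 1 sequential receives of the linear version. The per-rank value is an illustrative stand-in for the homework's partial results:

```c=
/* Binary-tree reduction to rank 0 (Q3). In round `step`, ranks that are
 * odd multiples of `step` send their partial sum to the rank `step`
 * below them and drop out; the rest accumulate and continue. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long local = rank + 1, incoming;           /* stand-in partial result */
    for (int step = 1; step < size; step <<= 1) {
        if (rank % (2 * step) == 0) {          /* receiver in this round */
            if (rank + step < size) {
                MPI_Recv(&incoming, 1, MPI_LONG, rank + step, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                local += incoming;
            }
        } else {                               /* sender: pass result up the tree */
            MPI_Send(&local, 1, MPI_LONG, rank - step, 0, MPI_COMM_WORLD);
            break;                             /* this rank is done */
        }
    }
    if (rank == 0) printf("tree-reduced sum = %ld\n", local);
    MPI_Finalize();
    return 0;
}
```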
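The non-blocking counterparts named in Q4, MPI_Isend and MPI_Irecv, return immediately; the transfer is completed by MPI_Wait, so computation can overlap with communication. A minimal sketch:

```c=
/* Non-blocking point-to-point communication (Q4). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data = 7;
    MPI_Request req;
    if (rank == 0) {
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... unrelated computation could overlap the transfer here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* send buffer reusable after this */
    } else if (rank == 1) {
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* data is valid only after this */
        printf("rank 1 got %d\n", data);
    }
    MPI_Finalize();
    return 0;
}
```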
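Finally, a ping-pong loop of the kind used for the Q8 measurements, under the usual model time ≈ latency + size / bandwidth; whether it measures inter- or intra-node communication depends on where ranks 0 and 1 are placed by the hostfile. The message sizes and repetition count are illustrative assumptions:

```c=
/* Ping-pong timing sketch (Q8): halve the round-trip time to estimate
 * the one-way time per message size, then fit latency and bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    for (int bytes = 1; bytes <= (1 << 20); bytes <<= 4) {
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);          /* start both ranks together */
        double start = MPI_Wtime();
        for (int i = 0; i < reps; ++i) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;
        if (rank == 0)                        /* one-way time = round trip / 2 */
            printf("%d bytes: %.3e s one-way\n", bytes,
                   elapsed / (2.0 * reps));
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```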
## Q9

### 1

The first matrix is split across processes by rows, which preserves the contiguity of its elements in memory. The second matrix is stored in transposed form so that the elements accessed during each dot product are also contiguous, which allows the inner loop to be vectorized with AVX2. Initially, the input matrices are sent to all processes with _MPI_Bcast_. The results are written back to the root process via one-sided communication.
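A minimal sketch of this scheme, assuming square N×N matrices with N divisible by the number of processes; the inner loop is left scalar here where the homework code would use AVX2 intrinsics, and the one-sided write-back uses a fence epoch:

```c=
/* Row-split matrix multiply (Q9): broadcast A and B^T, compute a block
 * of rows per rank, write results back to root with one-sided MPI_Put. */
#include <mpi.h>
#include <stdlib.h>

#define N 512   /* illustrative size, divisible by the process count */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *a  = malloc(N * N * sizeof(double));   /* A, row-major */
    double *bt = malloc(N * N * sizeof(double));   /* bt[j*N+k] = B[k][j] */
    int rows = N / size, first = rank * rows;      /* this rank's row block */
    double *part = malloc(rows * N * sizeof(double));

    if (rank == 0)                                 /* stand-in input data */
        for (int i = 0; i < N * N; ++i) { a[i] = 1.0; bt[i] = 1.0; }
    MPI_Bcast(a,  N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(bt, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Both a[.. + k] and bt[.. + k] walk contiguous memory, which is
     * what makes the AVX2 dot product possible in the real code. */
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < N; ++j) {
            double dot = 0.0;
            for (int k = 0; k < N; ++k)
                dot += a[(first + i) * N + k] * bt[j * N + k];
            part[i * N + j] = dot;
        }

    /* Root exposes the result matrix; every rank puts its row block. */
    double *c = (rank == 0) ? malloc(N * N * sizeof(double)) : NULL;
    MPI_Win win;
    MPI_Win_create(c, rank == 0 ? N * N * sizeof(double) : 0,
                   sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    MPI_Put(part, rows * N, MPI_DOUBLE, 0, (MPI_Aint)first * N,
            rows * N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                         /* puts complete here */

    MPI_Win_free(&win);
    free(a); free(bt); free(part);
    if (rank == 0) free(c);
    MPI_Finalize();
    return 0;
}
```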