---
# System prepended metadata

title: Multi-thread Programming
tags: [' pi', para., ' thread', ' pthread', ' SIMD']

---

---
tags: para., pthread, thread, pi, SIMD
---
https://nctu-sslab.github.io/PP-f20/HW2/
# Multi-thread Programming

## Q1: 


### <font color="green"> **In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW.**  </font>


![](https://i.imgur.com/mFayLAr.png =350x)![](https://i.imgur.com/siNX3OO.png =350x)

**1. Is speedup linear in the number of threads used?**
* 加速是非線性的，除了程式會包含不可平行的部分外，Create Thread 也需要時間成本，因此無法靠thread 無限線性加速。

**2. In your writeup hypothesize why this is (or is not) the case? (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer. Hint: take a careful look at the three-thread data-point.)**

* 從上圖可以看出，Thread 數目增加到3或4左右，速度就沒有顯著提升，甚至可能會有些微比更少的thread還慢，我推測也許程式能平行的部分所花的時間，已經因為平行加速到某一程度**相較於不可平行的部分相當微小**，因此再如何增加thread 都沒有顯著差異。

## Q2:
### <font color="green"> **How do your measurements explain the speedup graph you previously created?**</font>
* 要達到thread 的最佳平行，就要讓各個thread 工作量均衡，這樣較不會產生較快的thread 需要等待較慢的thread 產生的bottle neck。為了克服這點我發現，Index Ouput Array 的中段需要花費較高的成本，因為搜尋是連續的，從陣列的始端Index 或從末端Index 都會較快。
* 因此thread 交叉執行鄰近的列會是較平均的做法，Ex. thread 1 做0 2 4 6 8 ....列，thread 2做 1 3 5 7 9 .... 列。
![](https://i.imgur.com/W6YcIaY.png)
從結果來看效果是不錯的。

接著我們觀察各個thread 在 `workerThreadStart()`運行所花的時間
* thread number 1 => total: 461.212 ms
![](https://i.imgur.com/Ly3kCsy.png)
* thread number 2 => total: 236.325 ms 
![](https://i.imgur.com/yZFCiLf.png)
* thread number 3 => total: 158.896 ms
![](https://i.imgur.com/o9s3zux.png)
* thread number 4 => total: 121.580 ms
![](https://i.imgur.com/ojlfOxW.png)
* thread number 5 => total: 145.258 ms
![](https://i.imgur.com/PZ9A7Nu.png)
* thread number 6 => total: 124.721 ms
![](https://i.imgur.com/K2Oz809.png)


從結果發現，在thread 較小的情況下，`workerThreadStart()`所運行的時間都是相差不多的，但是當thread 一變大，也代表`thread 0` 與`thread N` 所Index 的空間差距越大，所以`thread N`花的時間也會比`thread 0`還更久，所以產生bottle neck，所以程式執行時間主要受Index 中段array的影響，繼續平行的效果便不顯著。

## Q3:
### <font color="green"> Q3: In your write-up, describe your approach to parallelization and report the final 4-thread speedup obtained. </font>

由於程式中段的處理較花時間所以我程式執行方式`thread 0`執行`0, 0+thread_num*1, 0+thread_num*2, 0+thread_num*3......`以此類推。
```
// 此處 step 設置為1
for(int start=0; start < height;)
{
      if(start+(threadId*step) < height)
      {
          mandelbrotSerial(x0, y0, x1, y1, width, height, start+(threadId*step), 
                           step, maxIterations, args->output);
      }
      start += numThreads; 
}
```
```
[mandelbrot serial]:            [460.861] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [121.495] ms
Wrote image file mandelbrot-thread.ppm
                                (3.79x speedup from 4 threads)
```
經過實驗`step`參數設置越大速度會越慢，因為會造成`thread0`和`threadN`距離越遠，等待情形越嚴重。


## Q4:
### <font color="green">Now run your improved code with eight threads. Is performance noticeably greater than when running with four threads? Why or why not? (Notice that the workstation server provides 4 cores 4 threads.)</font>


|          | View 1  | View 2 |
| -------- |:-------:|:------:|
| Thread 4 | 121.529 | 71.529 |
| Thread 8 | 122.546 | 74.458 |

執行表現沒有變得更好，4 cores 4 thread 對於8個thread 並行的需求是足夠的，但是每個thread 所運行的空間越遠，所執行差的時間就越多，即使8個thread 相較於4個thread，每個thread 的平均執行時間較短許多，但主要影響整體執行時間的也還是執行最久的那個thread，這也許和快取記憶體大小有關。

:::info
應該說 4 cores 4 thread 的最大極限就是同時跑 4 thread，所以就算用 8 thread 讓單一執行緒的運算量減少了，因為同時最多 4 個在跑，所以還是會有 4 個需要等其他 4 個去執行，並且也會造成額外的 context switch 成本
> [name=趙益揚]
:::