# F74096297 林懋 Parallel Processing Assignment 2
contributed by < `limaoxd` >
## Part 1
* The **Scatterv()** / **Gatherv()** calls required by the assignment cannot transfer array elements of a self-defined **structure** directly, so **MPI_Type_contiguous()** is used here to chain 3 **MPI_UNSIGNED_CHAR** into one derived type, letting these functions transfer RGB pixels (a small sanity check follows the snippet below).
``` cpp
MPI_Datatype MPI_RGB;
MPI_Type_contiguous(3, MPI_UNSIGNED_CHAR, &MPI_RGB);
MPI_Type_commit(&MPI_RGB);
```
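As a quick sanity check (my own addition, not required by the assignment), **MPI_Type_size()** can confirm that the committed type spans exactly the 3 bytes of one pixel:
``` cpp
//Hedged sketch: verify the derived type inside the same MPI program (needs <cassert>)
int rgb_bytes = 0;
MPI_Type_size(MPI_RGB, &rgb_bytes);   //size of MPI_RGB in bytes
assert(rgb_bytes == 3);               //3 × MPI_UNSIGNED_CHAR = 3 bytes, one RGB pixel
```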
* **Process 0** reads the file and **broadcasts** the image's **width** and **height** to the other processes
``` cpp
if(!id) {
    //process 0 reads the input file
    if ( readBMP( infileName) )
        cout << "Read file successfully!!" << endl;
    else
        cout << "Read file fails!!" << endl;
    w_h[1] = bmpInfo.biHeight, w_h[0] = bmpInfo.biWidth;
}
//broadcast width/height so every process can allocate its buffers
MPI_Bcast(&w_h, 2, MPI_INT, 0, MPI_COMM_WORLD);
if(id) BMPSaveData = alloc_memory( w_h[1], w_h[0]);
```
- Dynamically allocate memory for each process's slice of the image, reserving 2 extra **rows** for the smoothing below, and at the same time compute each **process**'s pixel displacement and its row count for the case where the height is not evenly divisible.
``` cpp
int divide_size[size];
int divide_height[size];
int disp[size];
int remainder = w_h[1] % size;
//when there is a remainder, the per-process row count (height / size) is rounded up
int divide_h = w_h[1] / size + (remainder ? 1 : 0);
//allocate the local image slice, with 2 extra rows to hold the exchanged halo rows
scatter_BMP = alloc_memory(divide_h+2, w_h[0]);
//precompute the pixel counts to send (int* divide_size) and each receiver's displacement (int* disp)
for(int i = 0 ; i < size ; ++i) {
    if(remainder && i == size - 1) divide_height[i] = w_h[1] - (divide_h * (size-1));
    else divide_height[i] = divide_h;
    divide_size[i] = w_h[0] * divide_height[i];
    disp[i] = (i ? disp[i-1] + divide_size[i-1] : 0);
}
//process 0 scatters the pixel data to the other processes
MPI_Scatterv(BMPSaveData[0], divide_size, disp, MPI_RGB, scatter_BMP[0], divide_size[id], MPI_RGB, 0, MPI_COMM_WORLD);
```
- The figure below illustrates, when the image is split into **n** parts, how each block sends its own boundary rows and receives the boundary rows of its neighbours.

> In my version each block first sends twice and then receives twice, hoping to avoid a process getting stuck waiting for the others' data; to be safe I also added an MPI_Barrier() so no process runs ahead. (A deadlock-free alternative using MPI_Sendrecv is sketched after the smoothing code below.)
``` cpp
for(int count = 0; count < NSmooth ; count ++){
    //exchange halo rows with the two neighbouring processes (ring topology)
    MPI_Send(scatter_BMP[divide_height[id]], w_h[0], MPI_RGB, (id + 1)%size, 0, MPI_COMM_WORLD);
    MPI_Send(scatter_BMP[1], w_h[0], MPI_RGB, (id - 1 + size)%size, 0, MPI_COMM_WORLD);
    MPI_Recv(scatter_BMP[0], w_h[0], MPI_RGB, (id - 1 + size)%size, 0, MPI_COMM_WORLD, &status);
    MPI_Recv(scatter_BMP[divide_height[id]+1], w_h[0], MPI_RGB, (id + 1)%size, 0, MPI_COMM_WORLD, &status);
    MPI_Barrier(MPI_COMM_WORLD);
    swap(scatter_BMP, s_BMP);
    //perform the smoothing pass
    for(int i = 1; i <= divide_height[id]; i++)
        for(int j = 0; j < w_h[0] ; j++){
            /*********************************************************/
            /* locate the pixels above, below, left and right        */
            /*********************************************************/
            int Top = i-1;
            int Down = i+1;
            int Left = j>0 ? j-1 : w_h[0]-1;
            int Right = j<w_h[0]-1 ? j+1 : 0;
            /*********************************************************/
            /* average with the 8 surrounding pixels and round       */
            /*********************************************************/
            scatter_BMP[i][j].rgbBlue  = (double) (s_BMP[i][j].rgbBlue+s_BMP[Top][j].rgbBlue+s_BMP[Top][Left].rgbBlue+s_BMP[Top][Right].rgbBlue+s_BMP[Down][j].rgbBlue+s_BMP[Down][Left].rgbBlue+s_BMP[Down][Right].rgbBlue+s_BMP[i][Left].rgbBlue+s_BMP[i][Right].rgbBlue)/9+0.5;
            scatter_BMP[i][j].rgbGreen = (double) (s_BMP[i][j].rgbGreen+s_BMP[Top][j].rgbGreen+s_BMP[Top][Left].rgbGreen+s_BMP[Top][Right].rgbGreen+s_BMP[Down][j].rgbGreen+s_BMP[Down][Left].rgbGreen+s_BMP[Down][Right].rgbGreen+s_BMP[i][Left].rgbGreen+s_BMP[i][Right].rgbGreen)/9+0.5;
            scatter_BMP[i][j].rgbRed   = (double) (s_BMP[i][j].rgbRed+s_BMP[Top][j].rgbRed+s_BMP[Top][Left].rgbRed+s_BMP[Top][Right].rgbRed+s_BMP[Down][j].rgbRed+s_BMP[Down][Left].rgbRed+s_BMP[Down][Right].rgbRed+s_BMP[i][Left].rgbRed+s_BMP[i][Right].rgbRed)/9+0.5;
        }
}
```
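One caveat with the exchange above: **MPI_Send()** only has to return once the send buffer is reusable, so for very wide rows both blocking sends could stall and every process might end up waiting inside its first send. A minimal alternative sketch (not the version I actually used) pairs each send with its matching receive through **MPI_Sendrecv()**, which cannot deadlock regardless of message size:
``` cpp
//Hedged sketch: halo exchange with MPI_Sendrecv instead of two MPI_Send + two MPI_Recv
int up   = (id - 1 + size) % size;   //neighbour that owns the rows above mine
int down = (id + 1) % size;          //neighbour that owns the rows below mine
//send my last real row downwards, receive the halo row above me from the upper neighbour
MPI_Sendrecv(scatter_BMP[divide_height[id]], w_h[0], MPI_RGB, down, 0,
             scatter_BMP[0],                 w_h[0], MPI_RGB, up,   0,
             MPI_COMM_WORLD, &status);
//send my first real row upwards, receive the halo row below me from the lower neighbour
MPI_Sendrecv(scatter_BMP[1],                   w_h[0], MPI_RGB, up,   0,
             scatter_BMP[divide_height[id]+1], w_h[0], MPI_RGB, down, 0,
             MPI_COMM_WORLD, &status);
```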
> I also shift the rows up and down so that the loop above can be applied directly; the shifting itself is not complicated, so I won't paste it here (a rough sketch of the idea follows).
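Roughly, the idea is as follows (a sketch only, assuming the pixel type is the `RGBTRIPLE` struct from the assignment's BMP code and that `MPI_Scatterv` placed the local rows starting at index 0): copy every real row down by one before smoothing so the data occupies rows 1..divide_height[id], then copy them back up before the gather.
``` cpp
//Hedged sketch of the downward shift (needs <cstring> for memcpy)
for (int i = divide_height[id]; i >= 1; --i)   //walk bottom-up so no row is overwritten
    memcpy(scatter_BMP[i], scatter_BMP[i-1], w_h[0] * sizeof(RGBTRIPLE));
//the shift back up before MPI_Gatherv walks top-down in the same way
```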
- Finally, **MPI_Gatherv()** reassembles the blocks on **process 0**.
``` cpp
MPI_Gatherv(scatter_BMP[0], divide_size[id], MPI_RGB, BMPSaveData[0], divide_size, disp, MPI_RGB, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
if(!id){
    //write the output file
    if ( saveBMP( outfileName ) )
        cout << "Save file successfully!!" << endl;
    else
        cout << "Save file fails!!" << endl;
}
```
### Benchmark results
0. 1 process (no parallelization)

1. 2 processes

2. 4 processes

3. 8 processes

4. 16 processes

> Observation: the more cores are used, the smaller the additional speedup becomes.
> I also ran diff against the blurred output produced by the original serial program, and no differences were reported.
## Part 2

My understanding is this: taking one **pair** at a time, starting alternately from the odd (Odd) and the even (Even) positions, compare the two elements and **swap()** them if they are out of order; once the passes starting from both the **Odd** and the **Even** positions leave the array unchanged, the sort is complete.
> The complexity is roughly ***O(n^2)***.
The non-parallel baseline version is shown below:
``` cpp=1
#include <bits/stdc++.h>
using namespace std;
int main (){
    int n , offset = 0 , odd = 0 , even = 0;
    cin >> n;
    srand(time(NULL));
    vector<int> arr(n);
    for(auto &it : arr)
        it = rand();
    for(;;){
        //mark the current phase as "clean" until a swap happens
        if(offset) even = 1;
        else odd = 1;
        for(int i = offset; i < n; i += 2) {
            if(i + 1 < n && arr[i] > arr[i + 1]) {
                swap(arr[i], arr[i + 1]);
                if(offset) even = 0;
                else odd = 0;
            }
        }
        //sorted once the latest pass of each parity made no swap
        if(odd && even) break;
        offset = !offset;
    }
    for(auto &it : arr)
        cout << it << " ";
    cout << endl;
}
```
However, the assignment's parallel version adds a constraint: each process must allocate its own int array of roughly n/size elements, and the full data set cannot simply be broadcast to everyone.
My parallelization approach is as follows:
* Each process exchanges its boundary values with its neighbours (e.g. if process 0's rightmost element is 5 and process 1's leftmost element is 3, they are swapped) and then runs an odd-even sort on its own array; this step repeats until no process can sort any further.
* Process 0 reads the total element count from stdin and broadcasts it:
``` cpp
if(!id) cin >> n;
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
```
* The element count per process can differ; my distribution scheme is shown below (a worked example follows the code).
```cpp
int remainder = n % size;
for(int i = 0 ; i < size ; ++i) {
    if (i >= n && size > n){ //special case: n < number of processes, each extra rank gets 1 element
        sub_n[i] = 1;
        disp[i] = disp[i-1] + 1;
    }
    else {
        disp[i] = (i ? disp[i-1] + sub_n[i-1] : 0);
        sub_n[i] = n / size + (i < remainder ? 1 : 0); //the first `remainder` ranks take one extra element
    }
}
```
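To make the split concrete, here is a small standalone illustration of the same formula (my own example, not part of the submission) for n = 10 and size = 4; it produces counts {3, 3, 2, 2} and displacements {0, 3, 6, 8}:
``` cpp
#include <iostream>

int main() {
    //standalone illustration: n = 10 elements distributed over size = 4 processes
    int n = 10, size = 4;
    int sub_n[4], disp[4];
    int remainder = n % size;                          //10 % 4 = 2
    for (int i = 0; i < size; ++i) {
        disp[i]  = (i ? disp[i-1] + sub_n[i-1] : 0);   //prefix sum of the previous counts
        sub_n[i] = n / size + (i < remainder ? 1 : 0); //the first `remainder` ranks take one extra element
        std::cout << "rank " << i << ": count = " << sub_n[i]
                  << ", disp = " << disp[i] << '\n';   //prints 3/0, 3/3, 2/6, 2/8
    }
}
```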
* Main program loop
```cpp
while(n--){
    int l, r, tr, tl;
    if(id){ //exchange with the left neighbour; rank 0 skips this
        tl = arr[0];
        MPI_Send(&tl, 1, MPI_INT, id - 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&l, 1, MPI_INT, id - 1, 0, MPI_COMM_WORLD, &status);
        if(l > arr[0]){
            undone = 1; //a boundary swap happened, so another local sort is needed
            arr[0] = l;
        }
    }
    if(id < size - 1){ //exchange with the right neighbour; rank size-1 skips this
        tr = arr[sub_n[id]-1];
        MPI_Recv(&r, 1, MPI_INT, id + 1, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&tr, 1, MPI_INT, id + 1, 0, MPI_COMM_WORLD);
        if(r < arr[sub_n[id] - 1]){
            undone = 1; //a boundary swap happened, so another local sort is needed
            arr[sub_n[id] - 1] = r;
        }
    }
    if(undone) odd_sort(&arr, sub_n[id]);
    undone = 0;
    MPI_Barrier(MPI_COMM_WORLD); //just to be safe
}
```
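The loop above always runs n passes. A possible refinement (a sketch only, not what I submitted, and assuming `undone` is also set whenever `odd_sort` swaps anything locally) is to let all processes agree on early termination with **MPI_Allreduce()**:
``` cpp
//Hedged sketch: global early-exit check, to be placed just before `undone = 0;` inside the loop
int local_undone  = undone;   //1 if this process changed anything during this pass
int global_undone = 0;
//logical OR across every rank: the result is 0 only when no process made a change
MPI_Allreduce(&local_undone, &global_undone, 1, MPI_INT, MPI_LOR, MPI_COMM_WORLD);
if (!global_undone) break;    //every process is done, so the whole array is sorted
```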
* Benchmark results


> Because of the randomness, even with n fixed at 100 the runtime does not improve and sometimes gets worse; next I verify with the same process count and the same n.
``` bash=0
F74096297@sivslab-pn1:~/HW2$ mpiexec -n 16 ./a.out
100
Process 1 execution time = 1.98447
Process 2 execution time = 1.98445
Process Process 4 execution time = 1.98446
Process 5 execution time = 1.98448
Process 7 execution time = 1.98446
Process 9 execution time = 1.98447
Process 10 execution time = 1.98445
Process 11 execution time = 1.98447
Process 12 execution time = 1.98445
Process 13 execution time = 1.98447
Process 14 execution time = 1.98444
Process 15 execution time = 1.98447
Process 6 execution time = 1.98444
3 execution time = 1.98446
2 3 5 5 6 6 7 9 10 11 11 12 12 15 15 15 17 18 18 19 19 20 20 22 24 24 26 28 29 29 32 33 34 35 36 38 41 42 42 42 43 44 44 45 46 46 47 47 47 47 48 49 50 51 53 55 55 58 61 61 61 62 62 62 63 67 67 69 69 69 74 74 74 76 76 77 77 78 78 78 80 81 83 84 84 86 86 86 89 89 90 91 91 92 93 93 95 96 96 97
Process 0 execution time = 0.415945
Process 8 execution time = 2.01542
F74096297@sivslab-pn1:~/HW2$ mpiexec -n 16 ./a.out
100
Process 1 execution time = 3.69547
Process 2 execution time = 3.69548
Process 3 execution time = 3.7234
Process 4 execution time = 3.70348
Process 5 execution time = 3.69948
Process 6 execution time = 3.71945
Process 8 execution time = 3.70746
Process 9 execution time = 3.69547
Process 10 execution time = 3.67947
Process 11 execution time = 3.7234
Process 12 execution time = 3.71545
Process 13 execution time = 3.70745
Process 14 execution time = 3.71944
Process Process 7 execution time = 3.7234
1 2 3 3 6 7 7 8 8 8 11 12 12 12 12 13 14 15 15 16 16 16 17 17 18 19 21 22 22 23 24 26 27 28 29 30 31 32 32 33 35 35 36 38 40 40 41 43 43 45 45 45 46 46 46 46 49 50 50 51 51 52 56 56 57 58 59 60 61 62 62 64 66 68 70 71 72 73 74 76 77 77 79 81 81 83 84 86 86 87 88 88 90 92 93 94 96 96 98 99
Process 0 execution time = 0.337826
15 execution time = 3.7234
F74096297@sivslab-pn1:~/HW2$ mpiexec -n 16 ./a.out
100
Process 1 execution time = 5.79329
Process 2 execution time = 5.79329
Process 3 execution time = 5.79329
Process 4 execution time = 5.79328
Process Process 6 execution time = 5.79328
Process 8 execution time = 5.79329
Process 9 execution time = 5.79329
Process 10 execution time = 5.79329
Process 11 execution time = 5.79322
Process 12 execution time = 5.79328
Process 13 execution time = 5.79329
Process 14 execution time = 5.79328
0 0 1 2 3 4 4 7 7 8 8 9 10 10 11 14 15 16 18 18 18 19 20 24 24 27 27 28 30 31 31 32 32 35 35 37 41 46 47 47 48 49 49 50 50 51 51 52 53 54 55 55 63 63 63 65 65 66 68 69 69 69 69 69 70 72 72 75 75 75 77 78 81 82 83 84 84 86 86 86 87 88 89 89 90 90 91 91 91 92 93 93 95 96 96 96 96 96 98 98
Process 0 execution time = 1.38966
5 execution time = 5.79329
Process 15 execution time = 5.79329
Process 7 execution time = 5.79329
F74096297@sivslab-pn1:~/HW2$ mpiexec -n 16 ./a.out
100
Process 2 execution time = 3.16269
Process 3 execution time = 3.16271
Process 4 execution time = 3.1626
Process 5 execution time = 3.16269
Process 6 execution time = 3.16269
Process 8 execution time = 3.1626
Process 9 execution time = 3.16269
Process Process 11 execution time = 3.16271
Process 12 execution time = 3.1626
Process 13 execution time = 3.16269
Process 15 execution time = 3.16271
Process 1 execution time = 3.16269
Process 14 execution time = 3.1627
10 execution time = 3.1627
1 4 6 7 8 8 10 10 11Process 7 execution time = 3.16427 11 11 13 14 15 15 15 18 19 19 20 21 21 22 23 23 24 25 27 27 31 31 34 34 34
35 35 35 37 38 38 39 39 40 40 41 41 41 41 42 45 47 50 50 54 55 55 56 56 57 57 58 59 60 60 60 61 62 62 65 69 69 69 70 71 71 72 72 75 77 78 82 82 83 84 84 86 86 86 89 90 91 92 93 94 95 96 96 96 99 99
Process 0 execution time = 1.95233
```
> The timings fluctuate, so it is hard to judge whether more processes are always better; next I increase n.
```shell=0
F74096297@sivslab-pn1:~/HW2$ mpiexec -n 2 ./a.out
10000
10 21 24 26 ...
Process 0 execution time = 23.852867
Process 1 execution time = 24.931279
F74096297@sivslab-pn1:~/HW2$ mpiexec -n 4 ./a.out
10000
10 10 30 34 ...
Process 0 execution time = 7.474880
Process 1 execution time = 18.015723
Process 2 execution time = 18.015730
Process 3 execution time = 18.015719
F74096297@sivslab-pn1:~/HW2$ mpiexec -n 8 ./a.out
10000
14 16 25 30 ...
Process 0 execution time = 2.384799
Process 1 execution time = 4.019529
Process 3 execution time = 4.019531
Process 4 execution time = 4.019538
Process 5 execution time = 4.019536
Process 6 execution time = 4.019539
Process 2 execution time = 4.019539
Process 7 execution time = 4.019528
F74096297@sivslab-pn1:~/HW2$ mpiexec -n 16 ./a.out
10000
27 37 41 47 ...
Process 0 execution time = 1.282921
Process 1 execution time = 6.544068
Process 3 execution time = 6.544043
Process 4 execution time = 6.544054
Process 5 execution time = 6.544060
Process 7 execution time = 6.544020
Process 2 execution time = 6.544044
Process 6 execution time = 6.544042
Process 8 execution time = 6.544045
Process 10 execution time = 6.544045
Process 11 execution time = 6.544045
Process 12 execution time = 6.544046
Process 13 execution time = 6.544062
Process 14 execution time = 6.544044
Process 15 execution time = 6.544024
Process 9 execution time = 6.544070
```
* Conclusions
    * On reflection, the more processes there are, the more communication is needed, so the speedup grows more and more slowly and never reaches the ideal factor of (larger process count / smaller process count).
    * I originally wanted to handle the case where n is smaller than the number of processes, in which the broadcast would never get a response from the extra processes; I tried building a group so that the extra processes are removed from MPI_COMM_WORLD and the broadcast would not block (a sketch of the usual group-creation pattern is at the end of this section).
> Fatal error in PMPI_Group_range_excl: Invalid group, error stack:
> PMPI_Group_range_excl(224): MPI_Group_range_excl(group=0x44000000, n=4, ranges=0x7ffe4d04cc6c, new_group=0x7ffe4d04cbc4) failed
> PMPI_Group_range_excl(173): Invalid group
It compiles, but I could not find a way to resolve this QwQ, and online resources on it are scarce, so in the end I did not go with this approach.
    * Instead, each extra process now creates an array of size 1 holding a value that is always larger than any of the random numbers.
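For reference, the usual pattern for this kind of rank exclusion (a sketch under my assumptions, not code from my submission) first extracts the group of `MPI_COMM_WORLD` with `MPI_Comm_group()`; the "Invalid group" error above is typical of passing the communicator itself where an `MPI_Group` handle is expected:
``` cpp
//Hedged sketch: build a communicator that keeps only ranks 0 .. n-1 (assumes n < size)
MPI_Group world_group, active_group;
MPI_Comm  active_comm;
MPI_Comm_group(MPI_COMM_WORLD, &world_group);   //a group handle must be obtained first
int ranges[1][3] = { { n, size - 1, 1 } };      //exclude ranks n .. size-1 (stride 1), the extra processes
MPI_Group_range_excl(world_group, 1, ranges, &active_group);
MPI_Comm_create(MPI_COMM_WORLD, active_group, &active_comm);   //collective over MPI_COMM_WORLD
//excluded ranks receive MPI_COMM_NULL and can skip later collectives,
//while the remaining ranks use active_comm instead of MPI_COMM_WORLD
```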