# Parallel Programming HW3
Environment: the apollo server provided for the assignment.
## 3-1
### Implementation
#### Which algorithm do you choose in hw3-1?
I use the Floyd-Warshall algorithm, which computes the shortest paths between all pairs of vertices in a weighted graph. Its main strength is that it handles graphs with negative edge weights, as long as the graph contains no negative cycle. The basic steps are:
1. **Initialization**: build a distance matrix. Initially set the distance between every pair of vertices to infinity (if there is no direct edge between them) and the distance from each vertex to itself to 0. If a direct edge exists, set the distance to that edge's weight.
2. **Dynamic programming**: for every intermediate vertex $k$, update the distance matrix: for every pair $(i, j)$, if the path through $k$ is shorter than the current path from $i$ to $j$, update the corresponding entry. The update rule is:
$\text{dist}[i][j] = \min(\text{dist}[i][j], \text{dist}[i][k] + \text{dist}[k][j])$
3. **Repeated updates**: repeat the step above with every vertex as the intermediate vertex, until the shortest distances between all vertex pairs have been computed.
The final distance matrix then contains the shortest-path distance between every pair of vertices. The time complexity of Floyd-Warshall is $O(n^3)$, where $n$ is the number of vertices; a minimal sequential sketch follows the pseudocode below.
Pseudocode:
```c
FLOYD-WARSHALL(W):
    n = W.rows
    D(0) = W
    for k = 1 to n
        let D(k) = [ d[i][j](k) ] be a new n x n matrix
        for i = 1 to n
            for j = 1 to n
                d[i][j](k) = min(d[i][j](k-1), d[i][k](k-1) + d[k][j](k-1))
    return D(n)
```
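For reference, here is a minimal sequential sketch of the same recurrence in plain C++ (an illustration only, not the submitted code; the flattened `dist` layout is an assumption of this sketch, and `INF` matches the sentinel used in the CUDA code later):
```cpp
const int INF = (1 << 30) - 1;   // entries with no direct edge are initialized to INF

// Plain O(n^3) Floyd-Warshall on a flattened n x n distance matrix.
void floyd_warshall(int* dist, int n) {
    for (int k = 0; k < n; ++k)           // intermediate vertex
        for (int i = 0; i < n; ++i)       // source vertex
            for (int j = 0; j < n; ++j)   // destination vertex
                if (dist[i * n + k] + dist[k * n + j] < dist[i * n + j])
                    dist[i * n + j] = dist[i * n + k] + dist[k * n + j];
}
```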
#### Implementation details
* Main function:
    * Read the input file and initialize the distance matrix.
    * Query the number of CPU threads and initialize the barrier variable.
    * Create multiple threads, each running the block_FW function.
    * Wait for all threads to finish the computation.
    * Write the result to the output file.
```clike=
int main(int argc, char* argv[]) {
    if (pthread_barrier_init(&barrier, NULL, NUM_THREADS) != 0) {
        printf("Barrier initialization failed\n");
        return 1;
    }
    //.....
    // create worker threads, each runs block_FW
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], nullptr, block_FW, (void*)&ids[i]);
    }
    // gather threads
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], nullptr);
    }
    output(argv[2]);
    return 0;
}

void* block_FW(void* args) {
    int pid = *(int*)args;
    //for (int k = pid; k < n; k += NUM_THREADS) { // cannot distribute rounds: an early-exiting thread leaves the others stuck at the barrier
    for (int k = 0; k < n; k++) {
        //for (int i = 0; i < n; i++) {
        for (int i = pid; i < n; i += NUM_THREADS) {   // split the rows of this round across threads
            for (int j = 0; j < n; j++) {
                if (Dist[i][k] + Dist[k][j] < Dist[i][j]) {
                    Dist[i][j] = Dist[i][k] + Dist[k][j];
                }
            }
        }
        // the next round depends on the results of this round, so synchronize here
        pthread_barrier_wait(&barrier);
    }
    pthread_exit(nullptr);
    return NULL;
}
```
#### Notes
Because each round depends on the results of the previous round, a barrier is needed at the end of every round. The outermost loop (k) counts the rounds. Note that the rounds must not be distributed across the pthreads: if one thread leaves the for loop early, the remaining threads can get stuck in `pthread_barrier_wait`. A minimal sketch of this failure mode follows.
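To make the failure mode concrete, below is a minimal, self-contained sketch (not the submitted code; `NUM_THREADS`, `N_ROUNDS`, and the empty loop body are made up for illustration) showing why splitting the k loop across threads deadlocks: threads that own fewer k values return early, while the remaining threads keep waiting at `pthread_barrier_wait` for participants that will never arrive.
```cpp
#include <pthread.h>
#include <cstdio>

// Hypothetical sizes for illustration only.
constexpr int NUM_THREADS = 4;
constexpr int N_ROUNDS    = 6;   // deliberately not a multiple of NUM_THREADS

pthread_barrier_t barrier;

void* broken_worker(void* arg) {
    int pid = *(int*)arg;
    // Thread pid only handles rounds pid, pid+NUM_THREADS, ...
    // With N_ROUNDS = 6, threads 0 and 1 run 2 iterations but threads 2 and 3 run only 1,
    // so after their single iteration threads 2 and 3 exit while 0 and 1 still wait -> deadlock.
    for (int k = pid; k < N_ROUNDS; k += NUM_THREADS) {
        // ... relaxation work for round k ...
        pthread_barrier_wait(&barrier);
    }
    return nullptr;
}

int main() {
    pthread_barrier_init(&barrier, nullptr, NUM_THREADS);
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], nullptr, broken_worker, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], nullptr);  // never returns: deadlock
    printf("done\n");
    return 0;
}
```
With the submitted layout (every thread iterates over all k and only the i loop is split), every thread reaches the barrier exactly n times, so this cannot happen.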
----
## 3-2
This part covers experiments with two implementations. The first follows the assignment description: in every round, phases 1 to 3 are launched, and the number of rounds is $\frac{n}{B}$, where B is the blocking factor (the side length of each block). The second unrolls the whole matrix and launches the kernel $n$ times, but each kernel is much simpler. This report implements and compares the performance of these two extremes: (1) fewer kernel launches, but each kernel is complex; (2) more kernel launches, but each kernel is simple.
### Approach 1: Blocked FLOYD-WARSHALL in CUDA
This is the approach with fewer kernel launches but more complex kernels.
Principle: each iteration launches the CUDA kernels for the three phases, and each phase kernel processes its own $B\times B$ sub-matrices.

```cpp=
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <iostream>
//======================
const int INF = ((1 << 30) - 1);
//const int V = 50010;
int* Dist;
int n, m;
int ceil(int a, int b) { return (a + b - 1) / b; }
__device__ void calKernel(int* dist, int B, int Round, int block_start_x, int block_start_y, int block_width, int block_height , int n) {
int b_i = blockIdx.x * blockDim.x + threadIdx.x + block_start_x;
int b_j = blockIdx.y * blockDim.y + threadIdx.y + block_start_y;
if (b_i < block_start_x + block_height && b_j < block_start_y + block_width) {
for (int k = Round * B; k < (Round + 1) * B && k < n; ++k) {
int block_internal_start_x = b_i * B;
int block_internal_end_x = (b_i + 1) * B;
int block_internal_start_y = b_j * B;
int block_internal_end_y = (b_j + 1) * B;
if (block_internal_end_x > n) block_internal_end_x = n;
if (block_internal_end_y > n) block_internal_end_y = n;
for (int i = block_internal_start_x; i < block_internal_end_x; ++i) {
for (int j = block_internal_start_y; j < block_internal_end_y; ++j) {
if (dist[i * n + k] + dist[k * n + j] < dist[i * n + j]) {
dist[i * n + j] = dist[i * n + k] + dist[k * n + j];
}
}
}
}
}
}
__global__ void blockedFloydWarshallPhase1(int* dist, int round ,int r , int B , int n){
calKernel(dist,B,r,r,r,1,1,n);
}
__global__ void blockedFloydWarshallPhase2(int* dist, int round ,int r , int B , int n){
calKernel(dist,B, r, r, 0, r, 1,n);
calKernel(dist,B, r, r, r + 1, round - r - 1, 1,n);
calKernel(dist,B, r, 0, r, 1, r,n);
calKernel(dist,B, r, r + 1, r, 1, round - r - 1,n);
}
__global__ void blockedFloydWarshallPhase3(int* dist, int round ,int r , int B , int n){
calKernel(dist,B, r, 0, 0, r, r,n);
calKernel(dist,B, r, 0, r + 1, round - r - 1, r,n);
calKernel(dist,B, r, r + 1, 0, r, round - r - 1,n);
calKernel(dist,B, r, r + 1, r + 1, round - r - 1, round - r - 1,n);
}
#define B 512
int main(int argc, char* argv[]) {
input(argv[1]);
//=============================
// cuda memory
//=============================
int* device_D = NULL;
cudaHostRegister(&Dist[0], sizeof(int) * n*n, cudaHostRegisterDefault);
cudaMalloc((void **)&device_D, sizeof(int) * n*n);
cudaMemcpy(device_D , &Dist[0], sizeof(int) * n*n , cudaMemcpyHostToDevice);
//=============================
// Cuda launch
//=============================
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((n + threadsPerBlock.x - 1) / threadsPerBlock.x,
(n + threadsPerBlock.y - 1) / threadsPerBlock.y);
int round = ceil(n, B);
for (int r = 0; r < round; ++r) {
// Phase 1
blockedFloydWarshallPhase1<<<numBlocks, threadsPerBlock>>>(device_D , round,r,B,n );
// Phase 2
blockedFloydWarshallPhase2<<<numBlocks, threadsPerBlock>>>(device_D, round,r,B,n );
// Phase 3
blockedFloydWarshallPhase3<<<numBlocks, threadsPerBlock>>>(device_D, round,r,B,n );
}
cudaMemcpy(Dist, device_D, sizeof(int)*n*n, cudaMemcpyDeviceToHost);
cudaHostUnregister(&Dist[0]);
cudaFree(device_D);
output(argv[2]);
delete[] Dist;
return 0;
}
```
#### Potential drawbacks
1. Uncoalesced memory accesses
2. No use of shared memory

The figure below illustrates the main access pattern inside one block when k = 0:
`dist[i * n + k] + dist[k * n + j] < dist[i * n + j]`
* `i * n + k`: the black-framed region; this access walks down a column (stride n), which is unfavorable for memory access.
* `i * n + j`: the blue-framed region.
* `k * n + j`: the red-framed region; although it is a row access, every k iteration performs i non-contiguous memory accesses.
With such a scattered combination of access patterns, and without first staging the relevant parts of dist into shared memory, the kernel is heavily slowed down by global-memory access time. A possible remedy is sketched after the figure below.

#### Effect of unroll on performance
The `#pragma unroll` directive is added to the main kernel as follows:
```cpp=
__device__ void calKernel(int* dist, int B, int Round, int block_start_x, int block_start_y, int block_width, int block_height , int n) {
int b_i = blockIdx.x * blockDim.x + threadIdx.x + block_start_x;
int b_j = blockIdx.y * blockDim.y + threadIdx.y + block_start_y;
if (b_i < block_start_x + block_height && b_j < block_start_y + block_width) {
#pragma unroll 32
for (int k = Round * B; k < (Round + 1) * B && k < n; ++k) {
int block_internal_start_x = b_i * B;
int block_internal_end_x = (b_i + 1) * B;
int block_internal_start_y = b_j * B;
int block_internal_end_y = (b_j + 1) * B;
if (block_internal_end_x > n) block_internal_end_x = n;
if (block_internal_end_y > n) block_internal_end_y = n;
#pragma unroll 32
for (int i = block_internal_start_x; i < block_internal_end_x; ++i) {
#pragma unroll 32
for (int j = block_internal_start_y; j < block_internal_end_y; ++j) {
if (dist[i * n + k] + dist[k * n + j] < dist[i * n + j]) {
dist[i * n + j] = dist[i * n + k] + dist[k * n + j];
}
}
}
}
}
}
```
| Method | c10.1 (s) | c20.1 (s) |
| -------- | -------- | -------- |
| no unroll | 15.86 | timeout |
| with unroll | 15.56 | timeout |
Possible reasons for the lack of improvement:
* Loop trip count: if the trip count is not a multiple of 32, the compiler may be unable to fully unroll the loop, producing inefficient code.
* **Memory access pattern**: if the unrolled loop leads to uncoalesced memory accesses, performance can drop; keeping accesses coalesced is essential for performance.
* Instruction cache: unrolling increases code size, which can cause instruction-cache misses and hurt performance.
* Compiler optimization: the compiler may already be optimizing the loop effectively, and manual unrolling can interfere with those optimizations.

This shows that the **uncoalesced memory access** problem has a large impact on any further optimization.
---
#### Effect of the blocking factor B on performance
B controls the size of each sub-matrix, which is $B\times B$.
| B | c10.1 (s) | c20.1 (s) |
| -------- | -------- | -------- |
| 512 | 15.86 | timeout |
| 256 | 15.51 | timeout |
| 64 | 2.22 | timeout |
| 32 | 0.82 | 23.88 |
| 16 | 0.42 | **timeout** |
| 8 | 0.32 | 19.97 |
| 4 | 0.27 | 11.90 |
| **2** | **0.22** | **8.64** |
| 1 | 0.22 | 14.05 |
The experiments show that **B=2** performs best, possibly because a smaller B makes the memory access pattern tighter (smaller strides between consecutive accesses).
---
### Approach 2: Plain FLOYD-WARSHALL in CUDA
This is the approach with more kernel launches but simpler kernels.
Principle: go back to the original Floyd-Warshall algorithm instead of the multiple phases, which improves readability and simplifies the problem.
```cpp=
__global__ void floydWarshallKernel(int* dist, int k, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i < n && j < n) {
int ik = i * n + k;
int kj = k * n + j;
int ij = i * n + j;
if (dist[ik] + dist[kj] < dist[ij]) {
dist[ij] = dist[ik] + dist[kj];
}
}
}

// host-side launch loop (excerpt)
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((n + threadsPerBlock.x - 1) / threadsPerBlock.x,
(n + threadsPerBlock.y - 1) / threadsPerBlock.y);
for (int k = 0; k < n; ++k) {
floydWarshallKernel<<<numBlocks, threadsPerBlock>>>(device_D, k, n);
cudaDeviceSynchronize();
}
```
With no optimizations on either side, this approach achieves performance close to Blocked FLOYD-WARSHALL without having to tune the parameter B.
| Method | c10.1 (s) | c20.1 (s) | p11k1 (s) |
| -------- | -------- | -------- |-------- |
| Approach 1: Blocked FLOYD-WARSHALL with B=32 | 0.82 | 23.88 | Timeout |
| Approach 1: Blocked FLOYD-WARSHALL with B=2 | 0.22 | 8.64 | Timeout |
| Approach 2: FLOYD-WARSHALL | 0.22 | 10.45 | Timeout |

| Method | Total judge time (s) |
| -------- | -------- |
| Blocked FLOYD-WARSHALL with B=2 | 25.13 |
| FLOYD-WARSHALL | 29.35 |
#### Attempt with shared memory
```cpp=
__device__ void calKernel(int* dist, int Round, int block_start_x, int block_start_y, int block_width, int block_height, int n) {
extern __shared__ int sharedMem[];
int b_i = blockIdx.x * blockDim.x + threadIdx.x + block_start_x;
int b_j = blockIdx.y * blockDim.y + threadIdx.y + block_start_y;
int tid = threadIdx.y * blockDim.x + threadIdx.x;
if (b_i < block_start_x + block_height && b_j < block_start_y + block_width) {
for (int k = Round * B; k < (Round + 1) * B && k < n; ++k) {
int block_internal_start_x = b_i * B;
int block_internal_end_x = (b_i + 1) * B;
int block_internal_start_y = b_j * B;
int block_internal_end_y = (b_j + 1) * B;
if (block_internal_end_x > n) block_internal_end_x = n;
if (block_internal_end_y > n) block_internal_end_y = n;
// Load data into shared memory
if (threadIdx.x < B && threadIdx.y < B) {
sharedMem[threadIdx.y * B + threadIdx.x] = dist[block_internal_start_x * n + block_internal_start_y + threadIdx.y * n + threadIdx.x];
}
__syncthreads();
for (int i = block_internal_start_x; i < block_internal_end_x; ++i) {
for (int j = block_internal_start_y; j < block_internal_end_y; ++j) {
int shared_i = i - block_internal_start_x;
int shared_j = j - block_internal_start_y;
if (sharedMem[shared_i * B + k] + sharedMem[k * B + shared_j] < sharedMem[shared_i * B + shared_j]) {
sharedMem[shared_i * B + shared_j] = sharedMem[shared_i * B + k] + sharedMem[k * B + shared_j];
}
}
}
__syncthreads();
// Write data back to global memory
if (threadIdx.x < B && threadIdx.y < B) {
dist[block_internal_start_x * n + block_internal_start_y + threadIdx.y * n + threadIdx.x] = sharedMem[threadIdx.y * B + threadIdx.x];
}
}
}
}
```
| Method | c10.1 (s) | c20.1 (s) |
| -------- | -------- | -------- |
| With shared memory | timeout | timeout |
| No optimization | 0.22 | 8.64 |
----
## Profiling
Both the Blocked FLOYD-WARSHALL and the plain FLOYD-WARSHALL implementations above were profiled, using the test cases c10.1 and c20.1.
### Nsight Data
* Algorithm : BLOCK
* Data: C10.1

---
* Algorithm : BLOCK
* Data: C20.1

### Performance analysis
The performance analysis is split into a CPU-side and a GPU-side view.
#### CPU
On the CPU side, nearly 90% of the time is spent in the kernel-launch for-loop (the `cudaLaunchKernel` calls).
| Time (%) | Total Time | Num Calls | Avg | Med | Min | Max | StdDev | Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 85.90% | 7.581 s | 7500 | 1.011 ms | 390.149 μs | 2.769 μs | 4.204 ms | 1.172 ms | cudaLaunchKernel |
| 12.70% | 1.123 s | 2 | 561.465 ms | 561.465 ms | 8.229 ms | 1.115 s | 782.393 ms | cudaMemcpy |
| 1.30% | 114.300 ms | 1 | 114.300 ms | 114.300 ms | 114.300 ms | 114.300 ms | 0 ns | cudaHostRegister |
| 0.00% | 2.881 ms | 1 | 2.881 ms | 2.881 ms | 2.881 ms | 2.881 ms | 0 ns | cudaHostUnregister |
| 0.00% | 197.125 μs | 1 | 197.125 μs | 197.125 μs | 197.125 μs | 197.125 μs | 0 ns | cudaFree |
| 0.00% | 161.312 μs | 1 | 161.312 μs | 161.312 μs | 161.312 μs | 161.312 μs | 0 ns | cudaMalloc |
| 0.00% | 878 ns | 1 | 878 ns | 878 ns | 878 ns | 878 ns | 0 ns | cuModuleGetLoadingMode |
---
#### GPU
The GPU spends about 80% of its time in Phase 3, while Phase 2 and Phase 1 each account for roughly 10%. This is because Phase 3 processes many more blocks than Phase 1 and Phase 2; the per-phase block counts after the table make this explicit.
| Time (%) | Total Time | Instances | Avg | Med | Min | Max | StdDev | Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 80.60% | 7.001 s | 2500 | 2.800 ms | 2.773 ms | 2.516 ms | 4.234 ms | 207.766 μs | blockedFloydWarshallPhase3(int *, int, int, int) |
| 11.30% | 979.608 ms | 2500 | 391.843 μs | 391.270 μs | 387.718 μs | 455.750 μs | 4.960 μs | blockedFloydWarshallPhase2(int *, int, int, int) |
| 8.10% | 707.639 ms | 2500 | 283.055 μs | 282.628 μs | 279.843 μs | 327.524 μs | 3.423 μs | blockedFloydWarshallPhase1(int *, int, int, int) |
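A quick count of how many $B\times B$ blocks each phase updates per round makes this split plausible. With $R=\lceil n/B\rceil$ blocks per side, one round updates

$$
\underbrace{1}_{\text{Phase 1}} \qquad \underbrace{2(R-1)}_{\text{Phase 2}} \qquad \underbrace{(R-1)^2}_{\text{Phase 3}}
$$

blocks respectively, so for any reasonably large $R$ Phase 3 covers almost the entire matrix.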
Although one's first instinct might be to optimize the Phase 3 workload, looking at the overall breakdown shows that the single largest cost is the kernel-launch overhead. Even though Blocked FLOYD-WARSHALL already reduces the number of kernel launches considerably, `cudaLaunchKernel` is still the largest entry:
| Time (%) | Total Time | Instances | Avg | Med | Min | Max | StdDev | Category | Operation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 43.30% | 7.581 s | 7500 | 1.011 ms | 390.149 μs | 2.769 μs | 4.204 ms | 1.172 ms | CUDA_API | cudaLaunchKernel |
| 39.90% | 7.001 s | 2500 | 2.800 ms | 2.773 ms | 2.516 ms | 4.234 ms | 207.766 μs | CUDA_KERNEL | blockedFloydWarshallPhase3(int *, int, int, int) |
| 6.40% | 1.123 s | 2 | 561.465 ms | 561.465 ms | 8.229 ms | 1.115 s | 782.393 ms | CUDA_API | cudaMemcpy |
| 5.60% | 979.608 ms | 2500 | 391.843 μs | 391.270 μs | 387.718 μs | 455.750 μs | 4.960 μs | CUDA_KERNEL | blockedFloydWarshallPhase2(int *, int, int, int) |
| 4.00% | 707.639 ms | 2500 | 283.055 μs | 282.628 μs | 279.843 μs | 327.524 μs | 3.423 μs | CUDA_KERNEL | blockedFloydWarshallPhase1(int *, int, int, int) |
| 0.70% | 114.300 ms | 1 | 114.300 ms | 114.300 ms | 114.300 ms | 114.300 ms | 0 ns | CUDA_API | cudaHostRegister |
| 0.00% | 8.186 ms | 1 | 8.186 ms | 8.186 ms | 8.186 ms | 8.186 ms | 0 ns | MEMORY_OPER | [CUDA memcpy Host-to-Device] |
| 0.00% | 7.635 ms | 1 | 7.635 ms | 7.635 ms | 7.635 ms | 7.635 ms | 0 ns | MEMORY_OPER | [CUDA memcpy Device-to-Host] |
| 0.00% | 2.881 ms | 1 | 2.881 ms | 2.881 ms | 2.881 ms | 2.881 ms | 0 ns | CUDA_API | cudaHostUnregister |
| 0.00% | 197.125 μs | 1 | 197.125 μs | 197.125 μs | 197.125 μs | 197.125 μs | 0 ns | CUDA_API | cudaFree |
| 0.00% | 161.312 μs | 1 | 161.312 μs | 161.312 μs | 161.312 μs | 161.312 μs | 0 ns | CUDA_API | cudaMalloc |
| 0.00% | 878 ns | 1 | 878 ns | 878 ns | 878 ns | 878 ns | 0 ns | CUDA_API | cuModuleGetLoadingMode |
Several factors explain why kernel-launch overhead takes such a large share (a small measurement sketch follows this list):
1. **CPU-GPU communication latency**: every kernel launch requires communication between the CPU and the GPU, which introduces latency.
2. **Context switching**: every kernel launch involves context switching, which adds extra latency.
3. **Synchronization overhead**: in some cases a kernel launch requires synchronization, which further increases the overhead.
4. **Small kernels**: for kernels with short execution times, the launch overhead is especially visible because the kernel itself runs for a comparatively short time.
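To get a feel for the per-launch cost on the CPU side, the following minimal sketch (an illustration only, not part of the submitted code; the launch count of 10000 is an assumption) times many launches of an empty kernel, so the measured time is almost pure launch overhead:
```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}   // does no work, so the measured time is dominated by launch overhead

int main() {
    const int launches = 10000;    // assumed launch count for the measurement
    emptyKernel<<<1, 1>>>();       // warm-up launch (context creation, module load)
    cudaDeviceSynchronize();

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < launches; ++i) {
        emptyKernel<<<1, 1>>>();
    }
    cudaDeviceSynchronize();       // wait until all queued launches have executed
    auto end = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(end - start).count();
    printf("average time per launch: %.2f us\n", us / launches);
    return 0;
}
```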
#### GPU memory usage
For test case c20.1, Blocked FLOYD-WARSHALL and FLOYD-WARSHALL both use about 95.37 MiB of device memory for the data (a quick sanity check follows the table).
| Total | Count | Avg | Med | Min | Max | StdDev | Operation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 95.37 MiB | 1 | 95.37 MiB | 95.37 MiB | 95.37 MiB | 95.37 MiB | 0 B | [CUDA memcpy Device-to-Host] |
| 95.37 MiB | 1 | 95.37 MiB | 95.37 MiB | 95.37 MiB | 95.37 MiB | 0 B | [CUDA memcpy Host-to-Device] |
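As a sanity check (assuming c20.1 has roughly $n \approx 5000$ vertices, which is inferred from the transfer size rather than stated in the assignment), a single $n\times n$ matrix of 4-byte ints accounts for essentially all of this:

$$
5000 \times 5000 \times 4\ \text{bytes} = 100{,}000{,}000\ \text{bytes} \approx 95.37\ \text{MiB}
$$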
---
## HW3-3 Multi-GPU
I only managed to complete the data-partitioning part.
### Data partitioning
Example: splitting the original 5*5 matrix into an upper half and a lower half.

The following code correctly distributes the data across the 2 GPUs.
```cpp!
int main(int argc, char* argv[]) {
input(argv[1]);
int* device_D[2];
int n_per_device = n / 2 ;
for (int i = 0; i < 2; ++i) {
int start_index = i * n * n_per_device;
if(i%2==1 && n%2==1){
n_per_device++;
}
cudaSetDevice(i);
cudaMalloc((void**)&device_D[i], sizeof(int) * n * n_per_device);
cudaMemcpy(device_D[i], &Dist[start_index], sizeof(int) * n * n_per_device, cudaMemcpyHostToDevice);
}
dim3 threadsPerBlock(32, 32);
dim3 numBlocks( (n_per_device + threadsPerBlock.x - 1) / threadsPerBlock.x,
(n + threadsPerBlock.y - 1) / threadsPerBlock.y);
#pragma omp parallel num_threads(2)
{
int gpu_id = omp_get_thread_num();
cudaSetDevice(gpu_id);
runFloydWarshall(device_D[gpu_id], n, numBlocks, threadsPerBlock);
}
n_per_device = n / 2 ;
for (int i = 0; i < 2; ++i) {
int start_index = i * n * n_per_device;
if(i%2==1 && n%2==1){
n_per_device++;
}
cudaSetDevice(i);
cudaMemcpy(&Dist[start_index], device_D[i], sizeof(int) * n * n_per_device, cudaMemcpyDeviceToHost);
cudaFree(device_D[i]);
}
output(argv[2]);
delete[] Dist;
return 0;
}
```
I did not manage to get the computation itself working, but the related ideas are described below.
* Idea 1: distribute the kernel launches across the GPU devices.
```cpp
for (int i = 0; i < numDevices; ++i) {
cudaSetDevice(i);
for (int k = i ; k < n; k+=numDevices) {
floydWarshallKernel<<<numBlocks, threadsPerBlock>>>(device_D[i], k, n);
cudaDeviceSynchronize();
}
cudaDeviceSynchronize();
}
```
However, with this split each GPU misses the comparisons against the other half of the k values, so an additional kernel is needed afterwards to take the element-wise minimum of the two copies.
```cpp
__global__ void cmp(int* dist1 , int* dist2 , int n){
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
printf("cmp i %d j %d " , i , j);
if (i < n && j < n) {
int ij = i * n + j;
printf("dist1[ij] %d dist2[ij] %d \n" , dist1[ij], dist2[ij]);
if(dist1[ij] > dist2[ij]){
dist1[ij] = dist2[ij];
}
}
}
....
cudaSetDevice(0);
cmp<<<numBlocks , threadsPerBlock>>>(device_D[0] , device_D[1] , n);
cudaDeviceSynchronize();
```
---
* Idea 2: the Nsight Systems data above shows that Phase 3 takes the largest share, so here Phase 3 is split off to the other GPU.
```cpp=
for (int r = 0; r < round; ++r) {
{
cudaSetDevice(0);
blockedFloydWarshallPhase1<<<numBlocks, threadsPerBlock>>>(device_D[0], round, r, n);
blockedFloydWarshallPhase2<<<numBlocks, threadsPerBlock>>>(device_D[0], round, r, n);
cudaDeviceSynchronize();
cudaSetDevice(1);
blockedFloydWarshallPhase3<<<numBlocks, threadsPerBlock>>>(device_D[0], round, r, n);
cudaDeviceSynchronize();
}
}
```
### Weak scalability
Judge execution times (s):
| Method | c01.1 | c04.1 |
| -------- | -------- | -------- |
| Single GPU | 0.27 | 8.79 |
| Idea 1 | 0.37 | 5.33 |
| Idea 2 | 0.32 | 15.01 |
---
## Experiment on AMD GPU
Convert hw3-2.cu into a .hip file:
```
hipify-clang hw3-2.cu
```
Test case: c20.1
### Measurement method 1
Add timing code at the beginning and end of the program:
```
#include <chrono>
int main(){
auto start = std::chrono::high_resolution_clock::now();
....
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> duration = end - start;
printf("Your code runs for : %f secondes \n" , duration.count());
}
```
### Measurement method 2
Use Nsight Systems.
Because the program runs on an AMD GPU, CUDA events cannot be captured, so the Nsight Systems report only contains the CPU-side execution time.

| Platform | c20.1 runtime (s) | c01.1 runtime (s) |
| -------- | -------- | -------- |
| Nvidia | 8.64 |0.27|
| AMD | 1.31 |0.02|
## Remarks
Although the program converted to the AMD version appears to run much faster, its output is incorrect (the Nvidia version produces the correct output). I think that, at this stage, directly converting Nvidia code to AMD code still carries some risk.
---
## Experience & conclusion
This report covers the following main points:
- **Floyd-Warshall algorithm**: a detailed walk-through of the algorithm and its implementation, including initializing the distance matrix, the dynamic-programming update, and repeating the update until the shortest paths between all vertex pairs are computed.
- **CUDA implementation**: how Blocked Floyd-Warshall is implemented in CUDA, the effect of different values of B on performance, and the results of the shared-memory and unroll optimizations.
- **Multi-GPU ideas**: proposals for distributing kernel launches across multiple GPU devices, with the corresponding experiments and performance analysis.
- **AMD GPU experiment**: converting the CUDA program to HIP, running it on an AMD GPU, and comparing the performance of the Nvidia and AMD GPUs.