2016q3 Homework2 (raytracing)

contributed by <LanKuDot>

環境

OS: Ubuntu 15.04 64bit
CPU: Genuine Intel® CPU U7300 @ 1.30GHz × 2
GPU: Mobile Intel® GM45 Express Chipset
RAM: 4 GB

程式碼

Gprof

make PROFILE=1：多加 -pg flag，讓 compiler 安插額外程式碼輸出執行時期的資料讓 gprof 分析，執行完會產生 gmon.out，以追蹤程式的執行行為 (動態分析)

[Makefile]
ifeq ($(strip $(PROFILE)),1) "strip 幫助去除 leading 或 trailing space
PROF_FLAGS = -pg 
CFLAGS += $(PROF_FLAGS)
LDFLAGS += $(PROF_FLAGS) 
endif

gprof ./raytracing：讓 gprof 分析資料，gprof 會取用產生的資料檔。會列出呼叫次數、執行時間比、call graph 等資訊。配合 less 服用，用分頁顯示大量資料。

static inline

static：對 definition 有效，限制 function 或 global variable 只會在所在檔案中被 access
lnline：告訴編譯器將 function 直接展開成 definition 的程式，可透過加上 __attribute__((always_line)) 強制 gcc 在最佳化沒有開啟時 inline。

核心

Function `raytracing`

傳入場景必要的物件：長方面 (牆)、球體、光源、背景色、攝影機位置及視角、攝影機畫面 (輸出圖片) 的長寬
先計算以攝影機平面為基準的 vector space 的 basis vector，將結果存在 vector u, v, w。
- function calculateBasisVector：
  - w：攝影機面的 normal (vpn, view-plane normal)
  - u：w x 攝影機面的 up vector (vup)
  - v：w x u 找到最後一個 vector

SAMPLES：為了反鋸齒 (MSAA，多重採樣反鋸齒法)，設定成取樣四個 pixel 作為該 pixel 的顏色

取法如下

以 SAMPLES = 4 為例，對於 pixel (0,0) 會採
(0*2+0/2, 0*2+0%2) = (0,0)
(0*2+1/2, 0*2+1%2) = (0,1)
(0*2+2/2, 0*2+2%2) = (1,0)
(0*2+3/2, 0*2+3%2) = (1,1) 四個點做平均
而 pixel (1,0) 則為 (2,0), (2,1), (3,0), (3,1)四個點
每次採樣的點都不一樣，可以在這裡做平行化

從攝影機畫面的每個 pixel 射出光源
- function rayConstruction：建立光源出發點，會存在 d。
  - 設定 view plane 的大小為 [-0.0175:0.0175][-0.0175:0.0175] (座標)；消失點的位置在 z = 0.05 的地方 (view plane space)
  - 計算該 pixel 在 view plane space 上的座標
    - w (z)：座標為消失點 z 座標
    - u (x)：為 xMin + xUnit * i
    - v (y)：為 yMin + yUnit * j
  - 將得到的座標轉換到 world space 上：
```
basis of space B(represented by space A) * coordainate on space B + origin on space B(represented by space A)= coordinate on space A
```
  - 後與 view plane 的原點相減(?)
計算光源在旅行的過程中會變成什麼顏色
- function rayColor
判定如果發出去的光沒有碰到任何東西，則加上背景顏色，如果有則加上旅行之後的顏色。(這裡的顏色 scale 為 [0.0:1.0])
將 4 個取樣點的顏色值做平均，map 到 [0:255]，然後存到 pixel 中

輸出

ppm：P6 Width Height MaxColorValue ColorValue
- ColorValue：以 (row, column) 填值

優化

Baseline

執行時間：11.650552 sec
時間大多卡在 dot_product 跟 substract_vector

前 10 名的程式熱區：
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 27.96      2.61     2.61 69646433     0.00     0.00  dot_product
 18.32      4.32     1.71 56956357     0.00     0.00  subtract_vector
  9.64      5.22     0.90 13861875     0.00     0.00  rayRectangularIntersection
  6.75      5.85     0.63 31410180     0.00     0.00  multiply_vector
  5.68      6.38     0.53 17836094     0.00     0.00  add_vector
  5.68      6.91     0.53 10598450     0.00     0.00  normalize
  5.46      7.42     0.51  4620625     0.00     0.00  ray_hit_object
  4.18      7.81     0.39 13861875     0.00     0.00  raySphereIntersection
  4.18      8.20     0.39 17821809     0.00     0.00  cross_product
  3.05      8.49     0.29  1048576     0.00     0.00  ray_color

Loop Unrolling

由於 for loop 有 branch 的判斷，會減慢執行速度。而在 math_toolkit.h 中的 vector 運算最多為三維，可以輕鬆展開
執行時間：6.881327 sec (效能提昇 41%)
可以看到做 loop unrolling 之後的函式總執行時間下降很多

dot_product      2.61 -> 0.73
substact_vector  1.71 -> 0.70
multiply_vector  0.63 -> 0.43
add_vector       0.53 -> 0.21 (seconds)

使用 pthread 做平行化處理

閱讀材料
- POSIX Threads Programming
- Introduction to Parallel progamming
我認為 pthread 可以用在
- 重複執行的動作，像是 for loop
- 可以分開的工作
如果沒有 critical section 會比較好處理，否則要注意共用資源的使用問題
觀察 raytracing 可以做平行化的地方
- 整張圖的運算切割：row 分, column 分或 block 分。缺點是老師在影片中講到的，每一個 pixel 的呼叫熱度不一樣，所以如果切出來熱度過於集中在某個 block，平行化的效果不高。
- 模糊化處理：每個 sample 做完運算後，再取平均。
- RGB：RGB 的計算結果是不互相干擾的，可以個別平行加。缺點 ray_color 計算完顏色一次輸出 RGB，如果每個 thread 只取其中一個值會太浪費。

整張圖運算切割

基本想法：以 row 為單位切割運算區間，將 raytracing 的核心迴圈移到 raytracing_thread 函式，在 raytracing 只負責 create thread。
因為 pthread_create 只允許 pass 一個參數，所以將所需資訊打包成一個資料結構

typedef struct __RAY_TRACING_THREAD_INFO {
    int height;
    int width;
    int fromHeight;     // Inclusive, divide the blocks by rows
    int toHeight;       // Exclusive
    uint8_t *pixels;    // The block of data for each thread is not overlapping.
    color background_color;
    rectangular_node rectangulars;
    sphere_node spheres;
    light_node lights;
    const viewpoint *view;
} thread_info;

問題：運行時間雖然減少，但是輸出的圖總是少一塊

原因是因為 passing 的參數是 shared memory，而在我的資料結構中 fromHeight,toHeight 會因為迴圈而改變，所以在 thread 取用之前，就可能被更改了
驗證






























#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

void *printMessage(void *msg)
{
    int *value;
    sleep(1);	// Wait for a while
    value = (int *)msg;
    printf("%d\n", *value);
                        
    pthread_exit((void *)0);
}

#define NUM_OF_THREADS 5
int main(void)
{
    pthread_t threads[NUM_OF_THREADS];
    void *status;
    int x;

    for (int i = 0; i < NUM_OF_THREADS; ++i) {
        x = i;
        pthread_create(&threads[i], NULL, printMessage, (void *)&x);
    }
    for (int i = 0; i < NUM_OF_THREADS; ++i)
        pthread_join(threads[i], &status);
                   
    return 0;
}

Output:
4
4
4
4
4

解法：如果是 thread 間有差異的資料，用 array 存並 pass 對應的 element 到 thread 中。稍微修改一些地方：






int x[NUM_OF_THREADS];

for (int i = 0; i < NUM_OF_THREADS; ++i) {
    x[i] = i;
    pthread_create(&threads[i], NULL, printMessage, (void *)&x[i]);
}

Output:
1
0
2
3
4

所以我必須要為每一個 thread 準備一份他所需要的資料，不過對於同樣的資料可以用 pointer 指向同一個地方。修改過後的資料結構

typedef struct __RAY_TRACING_THREAD_INFO {
    int *pHeight;
    int *pWidth;
    int fromHeight;     // Inclusive, divide the blocks by rows
    int toHeight;       // Exclusive
    uint8_t *pixels;    // The block of data for each thread is not overlapping.
    color background_color;
    rectangular_node *pRectangulars;
    sphere_node *pSpheres;
    light_node *pLights;
    const viewpoint *view;
} thread_info;

執行時間：
- 3.664291 sec (4 threads)
- 3.602248 sec (8 threads)
- 3.608806 sec (16 threads)
- 推測就是因為計算較多的地方都集中在某個區塊，所以執行時間無法再有更好的改善
透過 $ cat /proc/sys/kernel/threads-max 可以知道每一個 process 的最多 thread 數，我的是 30669

2016q3 Homework2 (raytracing)

環境

程式碼

Gprof

static inline

核心

Function raytracing

輸出

優化

Baseline

Loop Unrolling

使用 pthread 做平行化處理

整張圖運算切割

tags: sysprog2016

Function `raytracing`

tags: `sysprog2016`