2016q3 Homework2 (raytracing)

# 2016q3 Homework2 (raytracing) contributed by <`LanKuDot`> ## 環境 * OS: Ubuntu 15.04 64bit * CPU: Genuine Intel® CPU U7300 @ 1.30GHz × 2 * GPU: Mobile Intel® GM45 Express Chipset * RAM: 4 GB ## 程式碼 ### Gprof * `make PROFILE=1`：多加 `-pg` flag，讓 compiler 安插額外程式碼輸出執行時期的資料讓 `gprof` 分析，執行完會產生 `gmon.out`，以追蹤程式的執行行為 (動態分析) ``` [Makefile] ifeq ($(strip $(PROFILE)),1) "strip 幫助去除 leading 或 trailing space PROF_FLAGS = -pg CFLAGS += $(PROF_FLAGS) LDFLAGS += $(PROF_FLAGS) endif ``` * `gprof ./raytracing`：讓 gprof 分析資料，gprof 會取用產生的資料檔。會列出呼叫次數、執行時間比、call graph 等資訊。配合 `less` 服用，用分頁顯示大量資料。 ### static inline * `static`：對 definition 有效，限制 function 或 global variable 只會在所在檔案中被 access * `lnline`：告訴編譯器將 function 直接展開成 definition 的程式，可透過加上 `__attribute__((always_line))` 強制 gcc 在最佳化沒有開啟時 inline。 ### 核心 #### Function `raytracing` * 傳入場景必要的物件：長方面 (牆)、球體、光源、背景色、攝影機位置及視角、攝影機畫面 (輸出圖片) 的長寬 * 先計算以攝影機平面為基準的 vector space 的 basis vector，將結果存在 vector `u`, `v`, `w`。 * **function `calculateBasisVector`**： * `w`：攝影機面的 normal (vpn, view-plane normal) * `u`：w x 攝影機面的 up vector (vup) * `v`：w x u 找到最後一個 vector * `SAMPLES`：為了反鋸齒 (MSAA，多重採樣反鋸齒法)，設定成取樣四個 pixel 作為該 pixel 的顏色 * 取法如下 ``` 以 SAMPLES = 4 為例，對於 pixel (0,0) 會採 (0*2+0/2, 0*2+0%2) = (0,0) (0*2+1/2, 0*2+1%2) = (0,1) (0*2+2/2, 0*2+2%2) = (1,0) (0*2+3/2, 0*2+3%2) = (1,1) 四個點做平均而 pixel (1,0) 則為 (2,0), (2,1), (3,0), (3,1)四個點每次採樣的點都不一樣，可以在這裡做平行化 ``` * 從攝影機畫面的每個 pixel 射出光源 * **function `rayConstruction`**：建立光源出發點，會存在 `d`。 * 設定 view plane 的大小為 [-0.0175:0.0175][-0.0175:0.0175] (座標)；消失點的位置在 z = 0.05 的地方 (view plane space) * 計算該 pixel 在 view plane space 上的座標 * `w` (z)：座標為消失點 z 座標 * `u` (x)：為 xMin + xUnit * i * `v` (y)：為 yMin + yUnit * j * 將得到的座標轉換到 world space 上： ``` basis of space B(represented by space A) * coordainate on space B + origin on space B(represented by space A)= coordinate on space A ``` * 後與 view plane 的原點相減(?) * 計算光源在旅行的過程中會變成什麼顏色 * **function `rayColor`** * 判定如果發出去的光沒有碰到任何東西，則加上背景顏色，如果有則加上旅行之後的顏色。(這裡的顏色 scale 為 [0.0:1.0]) * 將 4 個取樣點的顏色值做平均，map 到 [0:255]，然後存到 `pixel` 中 ### 輸出 * [ppm](http://netpbm.sourceforge.net/doc/ppm.html)：`P6 Width Height MaxColorValue ColorValue` * `ColorValue`：以 (row, column) 填值 ## 優化 ### Baseline * 執行時間：11.650552 sec * 時間大多卡在 `dot_product` 跟 `substract_vector` ``` 前 10 名的程式熱區： % cumulative self self total time seconds seconds calls s/call s/call name 27.96 2.61 2.61 69646433 0.00 0.00 dot_product 18.32 4.32 1.71 56956357 0.00 0.00 subtract_vector 9.64 5.22 0.90 13861875 0.00 0.00 rayRectangularIntersection 6.75 5.85 0.63 31410180 0.00 0.00 multiply_vector 5.68 6.38 0.53 17836094 0.00 0.00 add_vector 5.68 6.91 0.53 10598450 0.00 0.00 normalize 5.46 7.42 0.51 4620625 0.00 0.00 ray_hit_object 4.18 7.81 0.39 13861875 0.00 0.00 raySphereIntersection 4.18 8.20 0.39 17821809 0.00 0.00 cross_product 3.05 8.49 0.29 1048576 0.00 0.00 ray_color ``` ### Loop Unrolling * 由於 for loop 有 branch 的判斷，會減慢執行速度。而在 `math_toolkit.h` 中的 vector 運算最多為三維，可以輕鬆展開 * 執行時間：6.881327 sec (效能提昇 41%) * 可以看到做 loop unrolling 之後的函式總執行時間下降很多 ``` dot_product 2.61 -> 0.73 substact_vector 1.71 -> 0.70 multiply_vector 0.63 -> 0.43 add_vector 0.53 -> 0.21 (seconds) ``` ### 使用 pthread 做平行化處理 * 閱讀材料 - [ ] [POSIX Threads Programming](https://computing.llnl.gov/tutorials/pthreads/) - [ ] [Introduction to Parallel progamming](https://computing.llnl.gov/tutorials/parallel_comp/) * 我認為 pthread 可以用在 * 重複執行的動作，像是 for loop * 可以分開的工作 * 如果沒有 critical section 會比較好處理，否則要注意共用資源的使用問題 * 觀察 raytracing 可以做平行化的地方 * 整張圖的運算切割：row 分, column 分或 block 分。缺點是老師在影片中講到的，每一個 pixel 的呼叫熱度不一樣，所以如果切出來熱度過於集中在某個 block，平行化的效果不高。 * 模糊化處理：每個 sample 做完運算後，再取平均。 * RGB：RGB 的計算結果是不互相干擾的，可以個別平行加。缺點 `ray_color` 計算完顏色一次輸出 RGB，如果每個 thread 只取其中一個值會太浪費。 #### 整張圖運算切割 * 基本想法：以 row 為單位切割運算區間，將 `raytracing` 的核心迴圈移到 `raytracing_thread` 函式，在 `raytracing` 只負責 create thread。 * 因為 `pthread_create` 只允許 pass 一個參數，所以將所需資訊打包成一個資料結構 ```c typedef struct __RAY_TRACING_THREAD_INFO { int height; int width; int fromHeight; // Inclusive, divide the blocks by rows int toHeight; // Exclusive uint8_t *pixels; // The block of data for each thread is not overlapping. color background_color; rectangular_node rectangulars; sphere_node spheres; light_node lights; const viewpoint *view; } thread_info; ``` * 問題：運行時間雖然減少，但是輸出的圖總是少一塊 ![](https://i.imgur.com/WdyZh9v.png =250x250) * 原因是因為 passing 的參數是 shared memory，而在我的資料結構中 `fromHeight`,`toHeight` 會因為迴圈而改變，所以在 thread 取用之前，就可能被更改了 * 驗證 ```c= #include <stdio.h> #include <pthread.h> #include <unistd.h> void *printMessage(void *msg) { int *value; sleep(1); // Wait for a while value = (int *)msg; printf("%d\n", *value); pthread_exit((void *)0); } #define NUM_OF_THREADS 5 int main(void) { pthread_t threads[NUM_OF_THREADS]; void *status; int x; for (int i = 0; i < NUM_OF_THREADS; ++i) { x = i; pthread_create(&threads[i], NULL, printMessage, (void *)&x); } for (int i = 0; i < NUM_OF_THREADS; ++i) pthread_join(threads[i], &status); return 0; } ``` ``` Output: 4 4 4 4 4 ``` * 解法：如果是 thread 間有差異的資料，用 array 存並 pass 對應的 element 到 thread 中。稍微修改一些地方： ```c= int x[NUM_OF_THREADS]; for (int i = 0; i < NUM_OF_THREADS; ++i) { x[i] = i; pthread_create(&threads[i], NULL, printMessage, (void *)&x[i]); } ``` ``` Output: 1 0 2 3 4 ``` * 所以我必須要為每一個 thread 準備一份他所需要的資料，不過對於同樣的資料可以用 pointer 指向同一個地方。修改過後的資料結構 ```c typedef struct __RAY_TRACING_THREAD_INFO { int *pHeight; int *pWidth; int fromHeight; // Inclusive, divide the blocks by rows int toHeight; // Exclusive uint8_t *pixels; // The block of data for each thread is not overlapping. color background_color; rectangular_node *pRectangulars; sphere_node *pSpheres; light_node *pLights; const viewpoint *view; } thread_info; ``` * 執行時間： * 3.664291 sec (4 threads) * 3.602248 sec (8 threads) * 3.608806 sec (16 threads) * 推測就是因為計算較多的地方都集中在某個區塊，所以執行時間無法再有更好的改善 * 透過 `$ cat /proc/sys/kernel/threads-max` 可以知道每一個 process 的最多 thread 數，我的是 30669 ###### tags: `sysprog2016`