2016q3 Homework1 (raytracing)

# 2016q3 Homework1 (raytracing) contributed by <`carolc0708`> ## 作業說明 * [A02: raytracing](/s/B1W5AWza) * 可善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作 ## gprof 分析 raytracing程式參考陳昱齊同學的[筆記](https://hackpad.com/2016q1-HW-2A-vEkRHYJYmvz)，將gprof的操作作淺顯易懂的解說: * 一般操作 * step1: `$ gcc -pg 檔名.c` 執行檔檔名（ex. file）此時會產生一個名叫做file的執行檔 * step2: `$ ./file` 執行file，接著會產生一個gmon.out文件，用來供gprof分析用的 * step3: `$ gprof -b file gmon.out | less` 此時就可以看到gprof分析程式，每個函數所花的時間，可以針對時間花費最多的進行改善 --- 先在source code底下執行`$ make PROFILE=1`產生raytracing執行檔，再執行 `$ ./raytracing`，接著會產生gmon.out文件。最後執行 `$ gprof -b raytracing gmon.out | head -n 20`，列出gprof分析程式中花費時間最高的20筆，結果如下: ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 23.80 0.49 0.49 69646433 0.00 0.00 dot_product 21.37 0.93 0.44 56956357 0.00 0.00 subtract_vector 10.93 1.16 0.23 31410180 0.00 0.00 multiply_vector 8.26 1.33 0.17 17836094 0.00 0.00 add_vector 7.29 1.48 0.15 10598450 0.00 0.00 normalize 5.83 1.60 0.12 13861875 0.00 0.00 raySphereIntersection 5.34 1.71 0.11 17821809 0.00 0.00 cross_product 4.37 1.80 0.09 13861875 0.00 0.00 rayRectangularIntersection 3.89 1.88 0.08 4620625 0.00 0.00 ray_hit_object 2.43 1.93 0.05 1048576 0.00 0.00 ray_color 1.46 1.96 0.03 3838091 0.00 0.00 length 0.97 1.98 0.02 2520791 0.00 0.00 idx_stack_top 0.73 1.99 0.02 4221152 0.00 0.00 multiply_vectors 0.49 2.00 0.01 2558386 0.00 0.00 idx_stack_empty 0.49 2.01 0.01 2110576 0.00 0.00 compute_specular_diffuse ``` 由結果可知，我們必須對前幾個執行次數高、執行時間長的函式，如`dot_product()`、`subtract_vector()`、`multiply_vector()`等進行效能的改善。關於執行時間，在沒有使用gprof來追蹤程式(直接`$ make`來產生raytracting執行檔)，程式執行時間為 `2.433547 sec`，而使用gprof來追蹤程式後，程式執行時間為 `5.019975 sec`。 >[gprof效能分析工具](https://sourceware.org/binutils/docs/gprof/)[color=blue] ## Loop Unrolling * loop unrolling背景知識:將for迴圈中的表示式進行線性展開，減去迴圈檢查、跳躍指令所需用到的時間。 >[loop unrolling](https://www.cs.umd.edu/class/fall2001/cmsc411/proj01/proja/loop.html) >[loop unrolling wiki](https://en.wikipedia.org/wiki/Loop_unrolling)[color=blue] --- 先在花費總時間最高的`dot_product()`做loop unrolling: ```clike= static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; for (int i = 0; i < 3; i++) dp += v1[i] * v2[i]; return dp; } ``` 將for迴圈線性展開: ```clike= static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; dp += v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2]; return dp; } ``` 改完後，執行時間由原本的 `5.019975 sec` 降為 `4.675378 sec` ，減少了 `0.34 sec`左右。 --- 接著對 `<math-toolkit.h> `中其他也有相同狀況的函式`add_vector()`、`subtract_vector()`、`multiply_vectors()`、`multiply_vector()`進行改良: ```clike= static inline void add_vector(const double *a, const double *b, double *out) { out[0] = a[0] + b[0]; out[1] = a[1] + b[1]; out[2] = a[2] + b[2]; } ``` ```clike= static inline void subtract_vector(const double *a, const double *b, double *out) { out[0] = a[0] - b[0]; out[1] = a[1] - b[1]; out[2] = a[2] - b[2]; } ``` ```clike= static inline void multiply_vectors(const double *a, const double *b, double *out) { out[0] = a[0] * b[0]; out[1] = a[1] * b[1]; out[2] = a[2] * b[2]; } ``` ```clike= static inline void multiply_vector(const double *a, double b, double *out) { out[0] = a[0] * b; out[1] = a[1] * b; out[2] = a[2] * b; } ``` 改完之後，程式執行時間降為 `4.191883 sec`，比完全未改善前降低了約 `0.83 sec` ## Force Inline * 參考[raytracing共筆](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp)中提到 Inline function的展開 * Inline function背景知識: * 在程式當中，直接將函式內容展開，省去執行時需額外跳至該函式記憶體位置再跳回來的時間。 ![](https://i.imgur.com/0QUmWwP.png) * 即便加入inline關鍵字，在程式執行時也不一定會實作。編譯器會去選擇這樣的實作方式是否符合效益、有節省到時間。 * 一般來說，函數中的==程式碼或處理較短，並且常呼叫使用的話==，可以給他加上個inline（因為編譯器可能會應你的要求使用） * `static inline`&`inline` &`extern inline` >等讀完多份文件再回來補資料><[name=Carol Chen] >[Inline function觀念](http://blog.yam.com/swwuyam/article/11745212)[color=blue] >[[C++]內嵌函數（inline function）筆記](https://dotblogs.com.tw/v6610688/2013/11/27/introduction_inline_function) >[Inline v.s Macro](http://ascii-iicsa.blogspot.tw/2010/08/inline-functionmarco.html) >[Inline function in C](http://www.greenend.org.uk/rjk/tech/inline.html) >[Static Inline v.s External Inline](http://blog.csdn.net/fengbingchun/article/details/51234209) --- * 因爲在Makefile中擁有 -O0 的指令(關閉最佳化)，所以inline不會被執行。要讓inline執行必須在==每一個==function後面加上 `__attribute__((always_inline))`，強行讓CPU去執行inline這個“建議” >這邊是否每一個都要強制展開不是很確定?![name=Carol Chen] 執行結果，出現了warning@@ ``` In file included from raytracing.c:5:0: primitives.h:4:1: warning: ?鳬lways_inline??attribute ignored [-Wattributes] typedef double point3[3]; ^ ``` 但執行時間明顯下降，只花了`2.128363 sec`，比只做loop unrolling下降了`2.07 sec`。 ## OpenMP * OpenMP 背景知識: * OpenMP（Open Multi-Processing）是一套支援跨平台共享記憶體方式的多執行緒並行的編程API，包括一套編譯器指令庫和一些能夠影響執行行為的環境變數。 * 可分為三類:`Directive`、`Clause`、`Function`: * function 的部份是獨立呼叫的，在一般的情況下不大會用到。 * directive 和 clause 的用法，大致上是： `#pragma omp <directive> [clause]` * 要拿來平行化，是使用 `parallel`、`sections`、`for` 這三項。`ordered`是用來設定平行化的執行順序。 * 使用 parallel pragma 來開始一個 parallel block，程式會 runtime 決定要開多少 thread，平行化的範圍是 parallel pragma 後的 statement 或 block，結束後 thread 就會收回。 * `for`: 把程式中的for迴圈分給scope上`#pragma omp parallel`所生出的threads去執行 * `section`: 把程式中==沒有相關性==的各個程式利用 `#pragma omp section` 來做區塊切割，然後由 `#pragma omp parallel sections` 分出的threads去做平行的處理。定義方式和`for`稍不同，詳細參考[這裡](https://kheresy.wordpress.com/2006/09/15/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E4%B8%89%EF%BC%89%E7%AF%84%E4%BE%8B-parallel%E3%80%81section/)。 >[OpenMP note](http://wdv4758h-notes.readthedocs.io/zh_TW/latest/openmp.html) >[OpenMP wiki](https://zh.wikipedia.org/wiki/OpenMP) >[OpenMP簡易的程式平行化](https://kheresy.wordpress.com/2006/06/09/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%E6%96%B9%E6%B3%95%EF%BC%8Dopenmp%EF%BC%88%E4%B8%80%EF%BC%89%E7%B0%A1%E4%BB%8B/):（三）範例 parallel、section[color=blue] --- * 程式需 `#include <omp.h>` * Makefile中的 `CFLAGS` 加上 `-fopenmp`，`LDFLAGS` 加上`-lgomp` * 在有for迴圈的地方都做平行化處理，有兩種方式，效果相同: * `#pragma omp parallel for`加在for迴圈上方 * `#pragma omp parallel`加在要平行處理的scope`{}`上方，`#pragma omp for` 加在裏頭的for迴圈上方。 * 在 `math-toolkit.h`中修改有for迴圈的地方， ```clike= static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; #pragma omp parallel for for (int i = 0; i < 3; i++) dp += v1[i] * v2[i]; return dp; } ``` 執行結果為`4.974464 sec`，比優化前版本快 `0.1 sec`左右而已。判斷應該是for迴圈執行次數太少且只有一層，無法達到很大的優化效果。 * 參考Dada chan的[筆記](https://embedded2016.hackpad.com/2016q1-Homework-2-A-jugsB8br8Bt)，在`raytracing.c`中修改`raytracing()`函式: ```clike= #pragma omp parallel for num_threads(64) private(d),private(object_color),private(stk) for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { double r = 0, g = 0, b = 0; /* MSAA */ for (int s = 0; s < SAMPLES; s++) { ``` >為什要這樣寫?[name=Carol Chen] >* private(): 讓每個執行緒中，都有一份變數的複本，以免互相干擾 >* [raytracing共筆](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp)中提到，使用private( )對stk、d和object_color進行處理的原因:在迴圈中的變數有非常多，但是可以從程式碼中看到以下兩行，他們的值事先就已經好，他們的值在進入迴圈之前就已進行過一次運算，並且在迴圈內並沒有任何一行指令會去改變他們的值，所以可想而知它們的值是不會改變的。所以只剩下stk、d和object_color這三個變數了。直接執行程式對`raytracing()`改善，結果為`5.004982 sec`，無太大差別。但加上前面實作的loop unrolling和force inline，時間可降為`2.098089 sec`。 --- * 一直覺得自己的OpenMP沒改善太多，很不甘心於是參考了很多位同學的筆記看看有沒有不同的寫法。在王紹華同學的[筆記](https://embedded2016.hackpad.com/2016q1-Homework-2forA-ZIYzFrLSRDA)看到切`section`的方式實作，修改`dot_product()` : ```clike= static inline double dot_product(const double *v1, const double *v2) { double dp0, dp1, dp2; #pragma omp parallel sections { #pragma omp section { dp0 = v1[0] * v2[0]; } #pragma omp section { dp1 = v1[1] * v2[1]; } #pragma omp section { dp2 = v1[2] * v2[2]; } } return dp0 + dp1 + dp2; } ``` 實驗結果為執行時間`4.675069 sec`，比優化前降低 `0.4 sec`，比只做`for`的平行好一點點。 ## SIMD * 背景知識:(參考謝永勁的[筆記](https://paper.dropbox.com/doc/Homework-2-part2-NeoEqPFZXy4j9ifNc5Smk)和wiki) * SIMD:單指令流多資料流（英語：Single Instruction Multiple Data，縮寫：SIMD）是一種採用一個控制器來控制多個處理器，同時對一組資料（又稱「==資料向量(Vector Type)==」）中的每一個分別執行相同的操作從而實作空間上的並列性的技術。 ![](https://i.imgur.com/8yT7HLi.png) * 同一個instruction可以同時處理多個data，可以同時處理多筆資料省去資料搬移的時間 * 在微處理器中，單指令流多資料流技術則是一個控制器控制多個平行的處理微元，例如Intel的MMX或SSE，以及AMD的3D Now!指令集。 * 運算方式: 一個float[4]，他的位置是連續的，那麼SIMD就可以將float按照位置分別做運算，就如同下面的圖片，他能一次運算4個float，以達到加速的效果，但是需要注意的是SIMD只支援同一種運算的方式，也就是說如果你想要用不同的方式對A0,A1,A2,A3進行運算是不行的，可能就要找找看MIMD ![](https://i.imgur.com/MBSKETt.png) * 資料儲存型態: C的資料型態被稱作`scalar types`（ex: int ,char ,float …），然而在SIMD裡面相對應的scalar是使用`vector type`來做儲存的 ![](https://i.imgur.com/k92hlP4.png) ![](https://i.imgur.com/zuM3KiX.png) --- * 在intel上使用avx指令集，要在`CFLAG`加`-mavx2`，並加入 `#include<immintrin.h>` >[Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2932,3559,3580,3632,3900,4574,4578,5474,5472,5471,5616,5661,5670,5692,5693,585,687,1680,2928,2930&techs=AVX)[color=blue] * 參考多份筆記發現效能不會變好，反而變差 >想參考以下看是否能夠改善[name=Carol Chen] >[SIMD Ray Tracing Tips](http://graphics.ucsd.edu/courses/cse168_s10/ucsd/SIMD_Ray_Tracing_Tips.pdf) ## POSIX Thread * 背景知識: * POSIX執行緒（英語：POSIX Threads，常被縮寫為Pthreads）是POSIX的執行緒標準，定義了創建和操縱執行緒的一套API。 * 實現POSIX 執行緒標準的庫常被稱作Pthreads，一般用於Unix-like POSIX 系統，如Linux、 Solaris。 >[PThread入門](http://dragonspring.pixnet.net/blog/post/32963482-posix%E7%B7%9A%E7%A8%8B%28pthread%29%E5%85%A5%E9%96%80%E6%96%87%E7%AB%A0%E5%88%86%E4%BA%AB)[color=blue] * PThread會使用到的函式: * pthread_create `int pthread_create(pthread_t *tid , const pthread_attr_t *attr , void *(*function)(void *) , void *argument)` 這個Function的作用是用來產生一個Thread並執行附帶的Function，附帶有4個參數參數1. pthread_t *tid為pthread的指標，在使用Thread之前必須要先宣告一個pthread_t的變數。參數2. const pthread_attr_t *attr為該Thread的屬性，預設是NULL，如果沒有其他特殊的需求直接填入NULL即可。參數3. void *(*function)(void *)為Function pointer，這邊要放入你要執行的Function。參數4. void *argument為Function pointer所要帶的參數。回傳值: 如果執行成功則回傳0，如果執行失敗則回傳錯誤代碼。 example: `pthread_create( &thread1, NULL , showmessage , message);` * pthread_join `int pthread_join (pthread_t thread, void **value_ptr)` 這個Function的作用會暫停目前執行pthread_join的Thread，等到目標Thread執行完畢之後目前執行pthread_join的Thread才會繼續執行，附帶有2個參數。參數1: pthread_t thread為要等待的目標Thread。參數2: void **value_ptr用來取得目標Thread的回傳值。回傳值: 如果執行成功則回傳0，如果執行失敗則回傳錯誤代碼。 >[PThread 程式撰寫](http://blog.xuite.net/nikoung/wretch/154071878-Pthread+%E7%A8%8B%E5%BC%8F%E6%92%B0%E5%AF%AB)[color=blue] --- * 參考陳品睿的[筆記](https://embedded2016.hackpad.com/2016q1-Homework-2A-bLGQtRraTES)，在`raytracing()`中實作PThread: 先看原始的`raytracing()`: ```C= void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height) { point3 u, v, w, d; color object_color = { 0.0, 0.0, 0.0 }; /* calculate u, v, w */ calculateBasisVectors(u, v, w, view); idx_stack stk; int factor = sqrt(SAMPLES); for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { double r = 0, g = 0, b = 0; /* MSAA */ for (int s = 0; s < SAMPLES; s++) { idx_stack_init(&stk); rayConstruction(d, u, v, w, i * factor + s / factor, j * factor + s % factor, view, width * factor, height * factor); if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres, lights, object_color, MAX_REFLECTION_BOUNCES)) { r += object_color[0]; g += object_color[1]; b += object_color[2]; } else { r += background_color[0]; g += background_color[1]; b += background_color[2]; } pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES; } } } } ``` 可以將height或width拆成給不同Thread去執行，加快運算速度。 * 切height(column)的作法: >筆記中有提到均勻分column方式的效能比較，想多了解再來嘗試看看差異[name=Carol Chen] 在`raytracing.h`中，重新設計資料結構，將Thread當中要分別下去跑的變數一起包成一個struct `Thread_Content`: ```C= typedef struct THREAD_CONTENT{ uint8_t *pixels; double *background_color;// rectangular_node rectangulars; sphere_node spheres; light_node lights; const viewpoint *view; int width; int height; double *u;// double *v;// double *w;// double *d;// int factor; }Thread_Content; ``` `Thread_Arg`定義Thread的參數，其中包含`Thread_Content`指標: ```C= typedef struct THREAD_ARG{ int start_; int end_; Thread_Content* tc_; }Thread_Arg; ``` 接下來在`raytracing.c`由`raytracing()`另外分出兩個函式，分別處理Thread中的運算與Thread平行運算兩部分。處理Thread中的運算: ```C= void *thread_function(void *arg_) { const Thread_Arg *arg = (const Thread_Arg *)arg_; color object_color = { 0.0, 0.0, 0.0 }; idx_stack stk; for(int j = arg->start_ ; j < arg->end_ ; j += THREAD_NUM) { for(int i = 0 ; i < arg->tc_->width ; i ++) { double r = 0, g = 0, b = 0; for (int s = 0; s < SAMPLES; s++) { idx_stack_init(&stk); rayConstruction(arg->tc_->d, arg->tc_->u, arg->tc_->v, arg->tc_->w, i * arg->tc_->factor + s / arg->tc_->factor, j * arg->tc_->factor + s % arg->tc_->factor, arg->tc_->view, arg->tc_->width * arg->tc_->factor, arg->tc_->height * arg->tc_->factor); if (ray_color(arg->tc_->view->vrp, 0.0, arg->tc_->d, &stk, arg->tc_->rectangulars, arg->tc_->spheres, arg->tc_->lights, object_color, MAX_REFLECTION_BOUNCES)) { r += object_color[0]; g += object_color[1]; b += object_color[2]; } else { r += arg->tc_->background_color[0]; g += arg->tc_->background_color[1]; b += arg->tc_->background_color[2]; } arg->tc_->pixels[((i + (j * arg->tc_->width)) * 3) + 0] = r * 255 / SAMPLES; arg->tc_->pixels[((i + (j * arg->tc_->width)) * 3) + 1] = g * 255 / SAMPLES; arg->tc_->pixels[((i + (j * arg->tc_->width)) * 3) + 2] = b * 255 / SAMPLES; } } } } ``` 處理Thread平行運算: ```C= void parallel(uint8_t *pixels, double *background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height, double* u, double* v, double* w, double* d, int factor) { pthread_t pt_id[THREAD_NUM]; Thread_Content *tc= (Thread_Content*) malloc(sizeof(Thread_Content)); Thread_Arg *arg[THREAD_NUM]; void *ret; // tc->pixels = pixels; tc->background_color = background_color; tc->rectangulars = rectangulars; tc->spheres = spheres; tc->lights = lights; tc->view = view; tc->width = width; tc->height = height; tc->u = u; tc->v = v; tc->w = w; tc->d = d; tc->factor = factor; // for(int i = 0 ; i < THREAD_NUM ; i++) { arg[i] = (Thread_Arg *) malloc(sizeof(Thread_Arg)); arg[i]->tc_ = tc; arg[i]->start_ = i; arg[i]->end_ = height; } for(int i=0; i<THREAD_NUM; i++) { if(pthread_create(pt_id+i, NULL, thread_function, arg[i]) != 0) { printf("Thread created fail.\n"); exit(1); } } for(int i=0; i<THREAD_NUM; i++) pthread_join(pt_id[i], &ret); } ``` 不同的`THREAD_NUM`執行結果如下: ``` 32: 4.974186 sec 64: 4.902074 sec 128: 4.256612 sec 256: 3.925100 sec 512: 3.711159 sec//!!! 1024: 5.183581 sec ``` 可以發現我的電腦開512 Threads時，效能最好。 * 切width(row)的作法，執行結果如下: ``` 32: 5.243384 sec 64: 4.821204 sec 128: 5.005269 sec 256: 5.249753 sec ``` >筆記中還有很多切Thread的方式，有空時可以研究一下[name=Carol Chen]