# 2017q1 Homework1 (raytracing) contributed by < `illusion030` > ### Review by <`hugikun999`> * 比較不同的 Pthread 切割法,而非只有直的或橫的。 * 分析執行序數目和執行時間之間的關係。 * 使用openMP平行化。 * 分析每個 loop unrolling 是否都可以使時間消耗變短。 * 嘗試使用 SIMD 改寫程式。 --- ### 開發環境 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ lscpu Architecture: x86_64 CPU 作業模式: 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數:2 每通訊端核心數:2 Socket(s): 1 NUMA 節點: 1 供應商識別號: GenuineIntel CPU 家族: 6 型號: 69 Model name: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz 製程: 1 CPU MHz: 1600.061 CPU max MHz: 2600.0000 CPU min MHz: 800.0000 BogoMIPS: 4588.95 虛擬: VT-x L1d 快取: 32K L1i 快取: 32K L2 快取: 256K L3 快取: 3072K NUMA node0 CPU(s): 0-3 ``` --- ### 事前準備 * 安裝相關開發工具 ``` $ sudo apt-get update $ sudo apt-get install graphviz $ sudo apt-get install imagemagick ``` 觀看[作業影片](https://www.youtube.com/watch?v=m1RmfOfSwno) --- ### 原始版本 * fork [raytracing](https://github.com/sysprog21/raytracing) 至自己的 Github 並執行看看 ``` $ git clone https://github.com/illusion030/raytracing $ cd raytracing $ make $ ./raytracing ``` ``` # Rendering scene Done! Execution time of raytracing() : 3.362796 sec ``` 共花了約 3.36 秒 * 打開圖看看 ``` $ eog out.ppm ``` 為了上傳圖片將圖片轉為 png 檔 ``` $ convert out.ppm out.png ``` ![](https://i.imgur.com/zxNvqYl.png) make check 看看 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ make check # Rendering scene Done! Execution time of raytracing() : 3.244101 sec Verified OK ``` 得到 Verified OK * 嘗試看看 gprof ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ make PROFILE=1 cc -std=gnu99 -Wall -O0 -g -pg -c -o objects.o objects.c cc -std=gnu99 -Wall -O0 -g -pg -c -o raytracing.o raytracing.c cc -std=gnu99 -Wall -O0 -g -pg -c -o main.o main.c cc -o raytracing objects.o raytracing.o main.o -lm -pg ``` 多了 -pg 這個指令 執行看看raytracing ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ ./raytracing # Rendering scene Done! Execution time of raytracing() : 6.812788 sec ``` 花的時間增加為約 6.81 秒 使用看看 gprof ``` $ gprof ./raytracing | less ``` ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 21.69 0.57 0.57 56956357 0.00 0.00 subtract_vector 21.31 1.13 0.56 69646433 0.00 0.00 dot_product 8.37 1.35 0.22 31410180 0.00 0.00 multiply_vector 7.99 1.56 0.21 13861875 0.00 0.00 rayRectangularIntersection 7.61 1.76 0.20 17836094 0.00 0.00 add_vector 6.85 1.94 0.18 10598450 0.00 0.00 normalize 5.71 2.09 0.15 4620625 0.00 0.00 ray_hit_object 4.95 2.22 0.13 1048576 0.00 0.00 ray_color 4.57 2.34 0.12 13861875 0.00 0.00 raySphereIntersection 3.42 2.43 0.09 17821809 0.00 0.00 cross_product 2.28 2.49 0.06 1048576 0.00 0.00 rayConstruction 1.52 2.53 0.04 4221152 0.00 0.00 multiply_vectors 1.14 2.56 0.03 2110576 0.00 0.00 compute_specular_diffuse 0.76 2.58 0.02 3838091 0.00 0.00 length 0.76 2.60 0.02 1241598 0.00 0.00 refraction 0.76 2.62 0.02 1 0.02 2.63 raytracing 0.38 2.63 0.01 2520791 0.00 0.00 idx_stack_top 0.00 2.63 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 2.63 0.00 2110576 0.00 0.00 localColor 0.00 2.63 0.00 1241598 0.00 0.00 protect_color_overflow 0.00 2.63 0.00 1241598 0.00 0.00 reflection 0.00 2.63 0.00 1204003 0.00 0.00 idx_stack_push 0.00 2.63 0.00 1048576 0.00 0.00 idx_stack_init 0.00 2.63 0.00 113297 0.00 0.00 fresnel 0.00 2.63 0.00 37595 0.00 0.00 idx_stack_pop 0.00 2.63 0.00 3 0.00 0.00 append_rectangular 0.00 2.63 0.00 3 0.00 0.00 append_sphere 0.00 2.63 0.00 2 0.00 0.00 append_light 0.00 2.63 0.00 1 0.00 0.00 calculateBasisVectors 0.00 2.63 0.00 1 0.00 0.00 delete_light_list 0.00 2.63 0.00 1 0.00 0.00 delete_rectangular_list 0.00 2.63 0.00 1 0.00 0.00 delete_sphere_list 0.00 2.63 0.00 1 0.00 0.00 diff_in_second 0.00 2.63 0.00 1 0.00 0.00 write_to_ppm ``` 發現花在 subtract_vector 跟 dot_product 的次數跟時間最多,dot_product 甚至 call 了快 7000 萬次 * 使用 perf 分析 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ perf stat -e cache-misses,cache-references,instructions,cycles ./raytracing # Rendering scene Done! Execution time of raytracing() : 6.820573 sec Performance counter stats for './raytracing': 308,413 cache-misses # 31.311 % of all cache refs 985,001 cache-references 32,985,545,345 instructions # 1.90 insn per cycle 17,404,280,889 cycles 6.822371998 seconds time elapsed ``` cache-misses 約為 31.31% --- ## Graphviz * 運用 grif2dot 將 gprof 所產生的資料變成圖表顯示 * 參考 [Graphviz - 用指令來畫關係圖吧!](https://www.openfoundry.org/tw/foss-programs/8820-graphviz-) [hugikun999的共筆](https://hackmd.io/s/HyHhgcv6#) * 安裝 [grif2dot](https://github.com/jrfonseca/gprof2dot) 因為要用 pip 安裝,所以要先安裝 python ``` $sudo apt install python python-pip $sudo pip install --upgrade pip $sudo pip install gprof2dot ``` 畫出圖 ``` $ gprof ./raytracing | gprof2dot | dot -Tpng -o output.png ``` ![](https://i.imgur.com/DTFpYDl.png) --- ### 效能改善 ## loop unrolling * 參考[作業影片](https://www.youtube.com/watch?v=m1RmfOfSwno)中 loop unrolling 的方式 * 將 dot_product 裡的 for 迴圈展開,去除了比較 i 的時間 ```=vim static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; dp = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2]; return dp; } ``` * 跑一次 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ ./raytracing # Rendering scene Done! Execution time of raytracing() : 2.860502 sec ``` Excution time 降到約 2.86 秒 * check 看看對不對 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ make check # Rendering scene Done! Execution time of raytracing() : 2.812206 sec Verified OK ``` 得到 Verified OK!! * 用 gprof ``` $ make PROFILE=1 $ ./raytracing # Rendering scene Done! Execution time of raytracing() : 6.305774 sec $ gprof ./raytracing | less ``` 時間增加至約 6.31 秒 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 25.72 0.55 0.55 56956357 0.00 0.00 subtract_vector 12.16 0.81 0.26 69646433 0.00 0.00 dot_product 8.42 0.99 0.18 17821809 0.00 0.00 cross_product 8.42 1.17 0.18 17836094 0.00 0.00 add_vector 7.01 1.32 0.15 31410180 0.00 0.00 multiply_vector 7.01 1.47 0.15 13861875 0.00 0.00 rayRectangularIntersection 6.55 1.61 0.14 4620625 0.00 0.00 ray_hit_object 6.08 1.74 0.13 10598450 0.00 0.00 normalize 3.74 1.82 0.08 13861875 0.00 0.00 raySphereIntersection 3.74 1.90 0.08 1048576 0.00 0.00 ray_color 1.87 1.94 0.04 4221152 0.00 0.00 multiply_vectors 1.40 1.97 0.03 2110576 0.00 0.00 compute_specular_diffuse 0.94 1.99 0.02 1241598 0.00 0.00 protect_color_overflow 0.94 2.01 0.02 1241598 0.00 0.00 refraction 0.94 2.03 0.02 1048576 0.00 0.00 rayConstruction 0.94 2.05 0.02 1 0.02 0.02 delete_sphere_list 0.94 2.07 0.02 1 0.02 2.12 raytracing 0.70 2.09 0.02 2558386 0.00 0.00 idx_stack_empty 0.70 2.10 0.02 1204003 0.00 0.00 idx_stack_push 0.47 2.11 0.01 2110576 0.00 0.00 localColor 0.47 2.12 0.01 1241598 0.00 0.00 reflection 0.47 2.13 0.01 113297 0.00 0.00 fresnel 0.23 2.14 0.01 2520791 0.00 0.00 idx_stack_top 0.23 2.14 0.01 37595 0.00 0.00 idx_stack_pop 0.00 2.14 0.00 3838091 0.00 0.00 length 0.00 2.14 0.00 1048576 0.00 0.00 idx_stack_init 0.00 2.14 0.00 3 0.00 0.00 append_rectangular 0.00 2.14 0.00 3 0.00 0.00 append_sphere 0.00 2.14 0.00 2 0.00 0.00 append_light 0.00 2.14 0.00 1 0.00 0.00 calculateBasisVectors 0.00 2.14 0.00 1 0.00 0.00 delete_light_list 0.00 2.14 0.00 1 0.00 0.00 delete_rectangular_list 0.00 2.14 0.00 1 0.00 0.00 diff_in_second 0.00 2.14 0.00 1 0.00 0.00 write_to_ppm ``` dot_product 的 self seconds 降低為 0.26 秒,佔的比例也明顯的下降了 * 畫出圖看看 ``` $ gprof ./raytracing | gprof2dot | dot -Tpng -o output.png $ eog output.png ``` ![](https://i.imgur.com/jPKcxGQ.png) * 將所有的 for 迴圈也展開看看 ``` $ make PROFILE=1 illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ ./raytracing # Rendering scene Done! Execution time of raytracing() : 5.784699 sec ``` 感覺時間有一點點提昇 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ make check # Rendering scene Done! Execution time of raytracing() : 5.674221 sec Verified OK ``` check 成功!! * 用 gprof 看看 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 16.93 0.23 0.23 69646433 0.00 0.00 dot_product 12.79 0.40 0.17 56956357 0.00 0.00 subtract_vector 10.16 0.53 0.14 4620625 0.00 0.00 ray_hit_object 9.03 0.65 0.12 17821809 0.00 0.00 cross_product 9.03 0.77 0.12 10598450 0.00 0.00 normalize 8.65 0.89 0.12 13861875 0.00 0.00 rayRectangularIntersection 7.15 0.98 0.10 13861875 0.00 0.00 raySphereIntersection 6.39 1.07 0.09 31410180 0.00 0.00 multiply_vector 5.27 1.14 0.07 17836094 0.00 0.00 add_vector 3.76 1.19 0.05 1048576 0.00 0.00 ray_color 1.50 1.21 0.02 4221152 0.00 0.00 multiply_vectors 1.50 1.23 0.02 3838091 0.00 0.00 length 1.50 1.25 0.02 2110576 0.00 0.00 compute_specular_diffuse 1.50 1.27 0.02 1241598 0.00 0.00 protect_color_overflow 1.50 1.29 0.02 1 0.02 1.32 raytracing 1.13 1.30 0.02 1048576 0.00 0.00 rayConstruction 0.75 1.31 0.01 1241598 0.00 0.00 refraction 0.75 1.32 0.01 1 0.01 0.01 delete_sphere_list 0.38 1.33 0.01 2520791 0.00 0.00 idx_stack_top 0.38 1.33 0.01 37595 0.00 0.00 idx_stack_pop 0.00 1.33 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 1.33 0.00 2110576 0.00 0.00 localColor 0.00 1.33 0.00 1241598 0.00 0.00 reflection 0.00 1.33 0.00 1204003 0.00 0.00 idx_stack_push 0.00 1.33 0.00 1048576 0.00 0.00 idx_stack_init 0.00 1.33 0.00 113297 0.00 0.00 fresnel 0.00 1.33 0.00 3 0.00 0.00 append_rectangular 0.00 1.33 0.00 3 0.00 0.00 append_sphere 0.00 1.33 0.00 2 0.00 0.00 append_light 0.00 1.33 0.00 1 0.00 0.00 calculateBasisVectors 0.00 1.33 0.00 1 0.00 0.00 delete_light_list 0.00 1.33 0.00 1 0.00 0.00 delete_rectangular_list 0.00 1.33 0.00 1 0.00 0.00 diff_in_second 0.00 1.33 0.00 1 0.00 0.00 write_to_ppm ``` subtract_vector、add_vector、multiply_vector 的時間都下降了 * 畫出圖來 ![](https://i.imgur.com/0oZOfLZ.png) 發現花費時間集中在 rayRectangularIntersection * 各跑 100 次平均後畫出比較圖 ![](https://i.imgur.com/SvbDrpF.png) 時間下降了 31.66% --- ## pThread * 嘗試用 thread 增進效能 * 先找尋適合使用 thread 的部份,可以發現 raytracing.c(raytracing) 裡有兩個 for 迴圈 ```=vim for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { ... ``` * 試著將 width/2 執行看看 ```=vim for (int j = 0; j < height; j++) { for (int i = 0; i < width / 2; i++) { ... ``` * 看看輸出的圖 ![](https://i.imgur.com/nA5QZ9O.png) * 認真理解一下程式碼,發現他是一個個 pixel 填顏色,所以當 width 剩一半的時候,圖就只會輸出一半。 * 試著用 thread 將圖分成四等分跑看看 先寫了一個小程式試跑一下 pthread ,但 compile 時一直出現錯誤 ``` george.c:(.text+0x71): 未定義參考到「pthread_create george.c:(.text+0x8e): 未定義參考到「pthread_create」 ``` 上網查了一下資料發現最根本的問題,compile 之後還需要 linking,而我是錯在 linking 的步驟,所以 compile 時需要加上 -pthread(參考[[問題] Semaphore 和pthread的問題](https://www.ptt.cc/bbs/C_and_CPP/M.1414763854.A.A18.html)) ``` $ cc -pthread george.c ``` * 把 raytracing.c(raytracing) 改寫成 raytracing.c(raytracing_thread),將圖片以十字切成四塊,分別用 threads 執行,因為 pthread 傳值只能傳一個變數,所以宣告一個 struct 存需要的變數。 ```=vim typedef struct __RAYTRACING_DETAILS{ uint8_t *pixels; color background_color; rectangular_node rectangulars; sphere_node spheres; light_node lights; const viewpoint *view; int width_start; int height_start; int width_end; int height_end; int width; int height; }details; ``` ```=vim insert_detail(pixels, background, rectangulars, spheres, lights, &view, ROWS, COLS, 0, ROWS / 2, 0, COLS / 2, detail[0]); insert_detail(pixels, background, rectangulars, spheres, lights, &view, ROWS, COLS, ROWS / 2, ROWS, 0, COLS / 2, detail[1]); insert_detail(pixels, background, rectangulars, spheres, lights, &view, ROWS, COLS, 0, ROWS / 2, COLS / 2, COLS, detail[2]); insert_detail(pixels, background, rectangulars, spheres, lights, &view, ROWS, COLS, ROWS / 2, ROWS, COLS / 2, COLS, detail[3]); for(i = 0;i < 4;i++) pthread_create(&thread[i], NULL, &raytracing_thread, (void *)detail[i]); ``` * 實際跑跑看 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ make run ./raytracing # Rendering scene Done! Execution time of raytracing() : 1.106192 sec ``` 時間降到約 1.11 秒 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ make check # Rendering scene Done! Execution time of raytracing() : 1.081845 sec Verified OK ``` check 成功!! * 各跑 100 次平均後畫出比較圖 ![](https://i.imgur.com/Lfx4ptA.png) * 試著用 8 個 threads 跑跑看,把 COLS 切成 8 等分 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ make check # Rendering scene Done! Execution time of raytracing() : 1.092779 sec Verified OK ``` 花的時間看起來差不多 * 切 ROWS 看看 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ ./raytracing # Rendering scene Done! Execution time of raytracing() : 1.066999 sec ``` 發現花費時間差不多 * 將每個 thread 花的時間印出來看看 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ ./raytracing # Rendering scene Thread 1, exec_time = 0.202546 Thread 8, exec_time = 0.226773 Thread 7, exec_time = 0.696050 Thread 2, exec_time = 0.752991 Thread 5, exec_time = 0.969776 Thread 6, exec_time = 0.976841 Thread 3, exec_time = 1.059676 Thread 4, exec_time = 1.089355 Done! Execution time of raytracing() : 1.101046 sec ``` 發現前後兩個 threads 花的時間很少,試著把前後兩個計算的範圍擴大,計算範圍比從 1:1:1:1:1:1:1:1 變成 3:2:2:2:2:2:2:3 再跑一次 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ ./raytracing # Rendering scene Thread 5, exec_time = 0.902981 Thread 6, exec_time = 0.929170 Thread 2, exec_time = 0.940407 Thread 7, exec_time = 0.993130 Thread 1, exec_time = 1.036798 Thread 8, exec_time = 1.035376 Thread 3, exec_time = 1.047738 Thread 4, exec_time = 1.051778 Done! Execution time of raytracing() : 1.072351 sec ``` 時間變得比較平均一點了 * 跑 100 次平均畫出比較圖看看 ![](https://i.imgur.com/Bdsz6nl.png) 花費時間差不多 * 切 512 的 threads 看看,用 cols 來切 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ make check # Rendering scene Done! Execution time of raytracing() : 1.044123 sec Verified OK ``` 感覺有再更低一點 * 跑 100 次平均畫圖 ![](https://i.imgur.com/KgGWJKe.png) 跟最佳化還差 0.2 秒多,再加個 force inline 看看 * pthreads 參考資料 * [PTHREADS](http://man7.org/linux/man-pages/man7/pthreads.7.html) * [PTHREAD_CREATE](http://man7.org/linux/man-pages/man3/pthread_create.3.html) * [PTHREAD_JOIN](http://man7.org/linux/man-pages/man3/pthread_join.3.html) ## force inline * 將 math-toolkit.h 裡的 function 設定成編譯時 always_inline ```=vim __attribute__((always_inline)); ``` * 有另一個方式是 function-like-macros,把 function difine 成 macro,可是 macro 不能傳指標進去,所以先實踐 always_inline * 跑一次 ``` illusion030@illusion030-X550LD:~/Desktop/2017sysprog/raytracing$ make check # Rendering scene Done! Execution time of raytracing() : 0.878199 sec Verified OK ``` 剩 0.878199 秒了 * 跑 100 次平均畫圖 ![](https://i.imgur.com/CDorbWO.png) 越來越接近最佳化的時間了!! * inline 參考資料 * [[C++]內嵌函數(inline function)筆記](https://dotblogs.com.tw/v6610688/archive/2013/11/27/introduction_inline_function.aspx) * [Declaring Attributes of Functions](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Function-Attributes.html) * [Function-like Macros](https://gcc.gnu.org/onlinedocs/cpp/Function-like-Macros.html)