2017q1 Homework1(raytracing)

# 2017q1 Homework1(raytracing) contributed by < `baimao8437` > --- **Reviewed by `csielee`** * 在使用 openMP 的地方可以考慮使用 openMP 的 schedule 功能，因為預設的 schedule 都是 auto，可能會遇到熱區不平衡的問題，可參考[ryanwang522共筆](https://hackmd.io/s/BJsz9RZqx#) --- ### 花費時間 ~2/25~ ~-~ ~2/28~ * 讀書：6 hr * 文件：6 hr * coding: 4 hr * ~~看著發呆：3 hr~~ ## 目標設定 * 習慣 linux 系統基本操作 * 熟練 git 基本版本控管操作 * 效能分析 gprof & 繪圖工具 gprof2dot 、 graphviz 練習 ``` make PROFILE=1 && ./raytracing gprof ./raytracing | ~/gprof2dot/gprof2dot.py| dot -Tpng -o callgraph.png ``` * HackMD 學習報告寫作練習 * 學習 GDB debug * 減少發呆時間... ## 前置作業 1. 開發工具安裝 ``` $ sudo apt-get update $ sudo apt-get install graphviz $ sudo apt-get install imagemagick ``` ## 開發環境 ``` baimao@baimao-Aspire-V5-573G:~$ lscpu Architecture: x86_64 CPU 作業模式： 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數：2 每通訊端核心數：2 Socket(s): 1 NUMA 節點： 1 供應商識別號： GenuineIntel CPU 家族： 6 型號： 69 Model name: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz 製程： 1 CPU MHz： 1711.242 CPU max MHz: 2600.0000 CPU min MHz: 800.0000 BogoMIPS: 4589.38 虛擬： VT-x L1d 快取： 32K L1i 快取： 32K L2 快取： 256K L3 快取： 3072K NUMA node0 CPU(s): 0-3 ``` ## 效能分析測試未優化效能：直接執行程式結果為3.25 sec ``` # Rendering scene Done! Execution time of raytracing() : 3.252078 sec ``` ```make clean``` 後使用 ```make PROFILE=1``` 編譯執行結果為 6.88 sec ``` # Rendering scene Done! Execution time of raytracing() : 6.886804 sec ``` 現在輸入 ```$ gprof ./raytracing | less``` 嘗試了解執行次數 ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls Ts/call Ts/call name % the percentage of the total running time of the time program used by this function. cumulative a running sum of the number of seconds accounted seconds for by this function and those listed above it. self the number of seconds accounted for by this seconds function alone. This is the major sort for this listing. calls the number of times this function was invoked, if this function is profiled, else blank. self the average number of milliseconds spent in this ms/call function per call, if this function is profiled, else blank. total the average number of milliseconds spent in this ms/call function and its descendents per call, if this function is profiled, else blank. name the name of the function. This is the minor sort for this listing. The index shows the location of the function in the gprof listing. If the index is in parenthesis it shows where it would appear in the gprof listing if it were to be printed. ^L Copyright (C) 2012-2016 Free Software Foundation, Inc. Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. ``` 怎麼跟我想像的完全不一樣一開始就碰壁 :::success 已解決 gcc 版本 6.2.0 在 makefile 加入新的 flag ```-no-pie``` 可順利執行 ::: > 同學你好，我也遇到相同問題。請問你這問題有解決了嗎@@ > [name=yangyang95] > 靠你解決的感謝你~~ > [name=baimao8437] 順利執行回頭補先前版本的 gprof 使用 ```$ git checkout <commit_id>``` 回復舊版 * gprof ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 23.03 0.61 0.61 69646433 0.00 0.00 dot_product 20.77 1.16 0.55 56956357 0.00 0.00 subtract_vector 11.89 1.48 0.32 31410180 0.00 0.00 multiply_vector 9.06 1.72 0.24 13861875 0.00 0.00 rayRectangularIntersection 7.55 1.92 0.20 10598450 0.00 0.00 normalize 6.04 2.08 0.16 13861875 0.00 0.00 raySphereIntersection 4.53 2.20 0.12 17836094 0.00 0.00 add_vector 3.78 2.30 0.10 4620625 0.00 0.00 ray_hit_object 3.40 2.39 0.09 17821809 0.00 0.00 cross_product 3.40 2.48 0.09 1048576 0.00 0.00 ray_color 0.94 2.50 0.03 4221152 0.00 0.00 multiply_vectors 0.76 2.52 0.02 3838091 0.00 0.00 length 0.76 2.54 0.02 2520791 0.00 0.00 idx_stack_top 0.76 2.56 0.02 2110576 0.00 0.00 compute_specular_diffuse 0.76 2.58 0.02 2110576 0.00 0.00 localColor 0.76 2.60 0.02 1048576 0.00 0.00 rayConstruction 0.38 2.61 0.01 1241598 0.00 0.00 protect_color_overflow 0.38 2.62 0.01 1241598 0.00 0.00 refraction 0.38 2.63 0.01 1204003 0.00 0.00 idx_stack_push 0.38 2.64 0.01 37595 0.00 0.00 idx_stack_pop 0.38 2.65 0.01 1 0.01 2.65 raytracing 0.00 2.65 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 2.65 0.00 1241598 0.00 0.00 reflection 0.00 2.65 0.00 1048576 0.00 0.00 idx_stack_init 0.00 2.65 0.00 113297 0.00 0.00 fresnel 0.00 2.65 0.00 3 0.00 0.00 append_rectangular 0.00 2.65 0.00 3 0.00 0.00 append_sphere 0.00 2.65 0.00 2 0.00 0.00 append_light 0.00 2.65 0.00 1 0.00 0.00 calculateBasisVectors 0.00 2.65 0.00 1 0.00 0.00 delete_light_list 0.00 2.65 0.00 1 0.00 0.00 delete_rectangular_list 0.00 2.65 0.00 1 0.00 0.00 delete_sphere_list 0.00 2.65 0.00 1 0.00 0.00 diff_in_second 0.00 2.65 0.00 1 0.00 0.00 write_to_ppm ``` >Git ：由於我只是用 checkout 回到過去版本修改了檔案那我在 rebase 前就必須先用 ```$ git stash save``` 暫存我剛剛的修改再用 ```$ git rebase -i <commit_parent>``` 真正的回到過去修改成功後用 ```$ git stash pop``` 將剛剛暫存的修改還原最好是直接 rebase 再改不過也學到 stash 怎麼用了進入 rebase 後會讓你選擇 command 因為我是要修改檔案所以選 edit ``` $ git rebase -i 3165ec8 Stopped at 694f6fd... Force all inline to always inline You can amend the commit now, with git commit --amend Once you are satisfied with your changes, run git rebase --continue ``` >到這裡就可以開始修改程式碼之後就 add 跟 commit --amend 了好了之後再```$ git rebase --continue``` 哇原來改了一版後面就自動改好了以後只要 edit 第一個版本就好囉蒸蚌 ## 效能提升 ### 強制開啟 inline 看完影片後決定從最簡單的部份下手在 ```math-toolkit.h``` 中在每個 ```static inline``` 後面都加上 ```__attribute__((always_inline))```強制開啟 inline #### 結果： ``` # Rendering scene Done! Execution time of raytracing() : 4.206481 sec ``` 從 6.88 sec 降至 4.20 sec 又發現在 ```idx_stack.h``` 中也有 inline 全部強制 #### 結果： ``` # Rendering scene Done! Execution time of raytracing() : 3.712347 sec ``` 又順利下降了一些變成 3.71 sec ### Loop unrolling 將 ```math-toolkit.h``` 中用到的 for loop 都解開能夠避免多次的算術運算有效降低時間 #### 結果： ``` # Rendering scene Done! Execution time of raytracing() : 2.701332 sec ``` 由於避免了 for 所產生的 branch 效能提升從 3.71 sec 降至 2.70 sec ### Macro 做完後覺得其實這些運算也沒有複雜到一定要獨立成一個 function 嘛如果能以直接計算取代 call function 必定可以節省時間所以把 ```math-toolkit.h``` 的 ```dot_product、subtract_vector、multiply_vector...``` 全部換成以 macro 寫法直接進行運算順便把兩個完全沒有用到的 function(scalar_triple_product & scalar_triple) 刪掉 #### 結果： ``` # Rendering scene Done! Execution time of raytracing() : 2.086252 sec ``` 不出所料的降低時間從 2.70 sec 變成 2.08 sec 但是... ![](https://i.imgur.com/SDG79t6.png) 圖形出錯了使用 ```$ git checkout <commit_id>``` 回復版本檢查後來找出兇手是把 ```multiiply_vector``` 換成 macro 造成的真詭異我應該沒有換錯才對這邊參考了前輩們的筆記將 ``` #define multiply_vector(a, b, out) (out[0] = a[0] * b, out[1] = a[1] * b, out[2] = a[2] * b) ``` 換成 ``` #define multiply_vector(a, b, out) (out[0] = a[0] * (b), out[1] = a[1] * (b), out[2] = a[2] * (b)) ``` b有沒有括號原來有差@@ 終於回復原本圖形 > Macro 在compile時期會把字代換,所以沒加括號如果 * 優先緒高就會出錯. > ex 本來 a*(b+c) 少括號,a*b+c > [name=purpon] > 跪了跪了 > [name=baimao8437] ![](https://i.imgur.com/2zlZUdp.png) * gprof ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 31.11 0.37 0.37 13861875 0.00 0.00 rayRectangularIntersection 22.70 0.64 0.27 2110576 0.00 0.00 compute_specular_diffuse 10.93 0.77 0.13 13861875 0.00 0.00 raySphereIntersection 10.51 0.90 0.13 1048576 0.00 0.00 ray_color 9.25 1.01 0.11 4620625 0.00 0.00 ray_hit_object 5.04 1.07 0.06 1048576 0.00 0.00 rayConstruction 3.36 1.11 0.04 2110576 0.00 0.00 localColor 3.36 1.15 0.04 1241598 0.00 0.00 reflection 1.68 1.17 0.02 1 0.02 1.19 raytracing 1.26 1.18 0.02 1241598 0.00 0.00 protect_color_overflow 0.84 1.19 0.01 1241598 0.00 0.00 refraction 0.00 1.19 0.00 113297 0.00 0.00 fresnel 0.00 1.19 0.00 3 0.00 0.00 append_rectangular 0.00 1.19 0.00 3 0.00 0.00 append_sphere 0.00 1.19 0.00 2 0.00 0.00 append_light 0.00 1.19 0.00 1 0.00 0.00 calculateBasisVectors 0.00 1.19 0.00 1 0.00 0.00 delete_light_list 0.00 1.19 0.00 1 0.00 0.00 delete_rectangular_list 0.00 1.19 0.00 1 0.00 0.00 delete_sphere_list 0.00 1.19 0.00 1 0.00 0.00 diff_in_second 0.00 1.19 0.00 1 0.00 0.00 write_to_ppm ``` ### OpenMP 屬於比較簡易的程式平行化只要 for 裡的內容獨立可以對 for 做平行化在 ```raytracing.c``` 中找到巢狀迴圈在上一行加入 ```#pragma omp parallel for``` +```num_threads(64)``` -> 產生64個 thread 來平行處理最後修改 ```Makefile``` 中的CFLAGS加上```-fopenmp``` LDFLAGS加上```-lgomp``` #### 結果 ``` # Rendering scene Done! Execution time of raytracing() : 1.428162 sec ``` 平行處理就是猛阿簡單幾行就從 2.08 sec 又降了許多變 1.42 sec but 圖形卻... ![](https://i.imgur.com/D4TVsco.png) 這邊參考前輩的筆記發現我少做了一件事因為在做平行化時兩個執行續可能會同時修改到相同的變數所以應該要把迴圈內會被修改到的變數全部加上 private 所以把會改變到的變數 : stk 、 d 、 object_color 這3個變數以 private 包起來加進平行化的參數 ```#pragma omp parallel for num_threads(64) private(stk), private(d),private(object_color)``` 然後圖又正常了 ![](https://i.imgur.com/4xhY3XU.png) * gprof ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 41.19 0.07 0.07 309853 0.00 0.00 rayRectangularIntersection 17.65 0.10 0.03 21578 0.00 0.01 ray_color 11.77 0.12 0.02 318752 0.00 0.00 raySphereIntersection 11.77 0.14 0.02 51328 0.00 0.00 localColor 11.77 0.16 0.02 48388 0.00 0.00 compute_specular_diffuse 5.88 0.17 0.01 27916 0.00 0.00 reflection 0.00 0.17 0.00 109628 0.00 0.00 ray_hit_object 0.00 0.17 0.00 28481 0.00 0.00 protect_color_overflow 0.00 0.17 0.00 26259 0.00 0.00 refraction 0.00 0.17 0.00 20410 0.00 0.00 rayConstruction 0.00 0.17 0.00 2935 0.00 0.00 fresnel 0.00 0.17 0.00 3 0.00 0.00 append_rectangular 0.00 0.17 0.00 3 0.00 0.00 append_sphere 0.00 0.17 0.00 2 0.00 0.00 append_light 0.00 0.17 0.00 1 0.00 0.00 calculateBasisVectors 0.00 0.17 0.00 1 0.00 0.00 delete_light_list 0.00 0.17 0.00 1 0.00 0.00 delete_rectangular_list 0.00 0.17 0.00 1 0.00 0.00 delete_sphere_list 0.00 0.17 0.00 1 0.00 0.00 diff_in_second 0.00 0.17 0.00 1 0.00 170.07 raytracing 0.00 0.17 0.00 1 0.00 0.00 write_to_ppm ``` 可以看到平行化使前面幾名的執行時間明顯的降低了 ### AVX 在後面的作業比較熟悉 AVX 後回來寫就覺得很簡單了針對 add_vector 先測試原本的效能 ``` register __m256d ymm0, ymm1; ymm0 = _mm256_set_pd(a[0],a[1],a[2],0.0); ymm1 = _mm256_set_pd(b[0],b[1],b[2],0.0); ymm1 = _mm256_add_pd(ymm0, ymm1); double tmp[4] __attribute__((aligned(32))); _mm256_store_pd(tmp, ymm1); out[0] = tmp[3]; out[1] = tmp[2]; out[2] = tmp[1]; ``` 比較需要注意的是 out[0] = tmp[3] ; out[1] = tmp[2] ; out[2] = tmp[1]; #### 結果還沒優化前效能 * 執行時間 ``` # Rendering scene Done! Execution time of raytracing() : 2.965848 sec ``` * gprof ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 3.39 1.67 0.06 17836094 0.00 0.00 add_vector ``` 再來看以 AVX 運算的結果 * 執行時間 ``` # Rendering scene Done! Execution time of raytracing() : 4.582634 sec ``` 哪尼!居然變久了 * gprof ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 5.68 2.41 0.16 17836094 0.00 0.00 add_vector ``` 真的沒有比直接加快來改看看比較複雜一點的 dot product 還沒優化前效能 * 執行時間 ``` # Rendering scene Done! Execution time of raytracing() : 3.587229 sec ``` * gprof ``` % cumulative self self total time seconds seconds calls s/call s/call name 20.41 0.76 0.36 69646433 0.00 0.00 dot_product ``` AVX 優化後結果 * 執行時間 ``` # Rendering scene Done! Execution time of raytracing() : 6.080835 sec ``` 恩... * gprof ``` % cumulative self self total time seconds seconds calls s/call s/call name 26.09 0.91 0.91 69646433 0.00 0.00 dot_product ``` 看來 AVX 加速還是比較適用於一次處理 4 倍資料用在這裡沒有比較快 ## 參考資料： * [JiJi大的筆記](https://hackmd.io/s/r1vckLB6) * [gprof bug fix](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=854704) * [Git rebase 教學](https://zlargon.gitbooks.io/git-tutorial/content/patch/rebase_interactive.html) * [OpenMP private](https://kheresy.wordpress.com/2006/09/22/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E4%BA%94%EF%BC%89-%E8%AE%8A%E6%95%B8%E7%9A%84%E5%B9%B3%E8%A1%8C%E5%8C%96/)