2017q1 Homework1 (raytracing)

# 2017q1 Homework1 (raytracing) contributed by <`vtim9907`> ### Reviewed by `petermouse` * 用 gdb 來驗證 `register` 關鍵字仍是個很好的嘗試！ * 可以考慮使用 Pthread 或 OpenMP 將程式平行處理 * 可以對 SIMD 比原版慢的結果進一步提出解釋並驗證 * 建議使用圖表來呈現改善的成果 # 開發環境 - OS : Ubuntu 16.04.2 LTS - Kernel Version : 4.8.0-36-generic - Memory : 32 GB - CPU : Intel® Core™ i5-6600 Processor - cache : - L1i cache : 4*32 KB - L1d cache : 4*32 KB - L2 cache : 4*256 KB - L3 cache : 6144 KB - L1 cache imformation ( sudo dmidecode -t cache ) : ``` Handle 0x001E, DMI type 7, 19 bytes Cache Information Socket Designation: L1 Cache Configuration: Enabled, Not Socketed, Level 1 Operational Mode: Write Back Location: Internal Installed Size: 256 kB Maximum Size: 256 kB Supported SRAM Types: Synchronous Installed SRAM Type: Synchronous Speed: Unknown Error Correction Type: Parity System Type: Unified Associativity: 8-way Set-associative ``` # 優化前分析 ## gprof 首先執行看看，得到結果： ``` Execution time of raytracing() : 2.106784 sec ``` 在開始優化之前，先做分析，找出該優化的地方，首先用 gprof 分析看看： ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 22.18 0.43 0.43 56956357 0.00 0.00 subtract_vector 21.15 0.84 0.41 69646433 0.00 0.00 dot_product 10.32 1.04 0.20 13861875 0.00 0.00 rayRectangularIntersection 8.51 1.21 0.17 31410180 0.00 0.00 multiply_vector 7.74 1.36 0.15 13861875 0.00 0.00 raySphereIntersection 7.22 1.50 0.14 10598450 0.00 0.00 normalize 5.67 1.61 0.11 17836094 0.00 0.00 add_vector 3.61 1.68 0.07 17821809 0.00 0.00 cross_product 3.09 1.74 0.06 4620625 0.00 0.00 ray_hit_object 2.32 1.78 0.05 4221152 0.00 0.00 multiply_vectors 1.55 1.81 0.03 2110576 0.00 0.00 localColor 1.55 1.84 0.03 1048576 0.00 0.00 ray_color 1.03 1.86 0.02 3838091 0.00 0.00 length 1.03 1.88 0.02 2110576 0.00 0.00 compute_specular_diffuse 1.03 1.90 0.02 1 0.02 1.94 raytracing 0.52 1.91 0.01 2520791 0.00 0.00 idx_stack_top 0.52 1.92 0.01 1204003 0.00 0.00 idx_stack_push 0.52 1.93 0.01 1048576 0.00 0.00 rayConstruction 0.52 1.94 0.01 1 0.01 0.01 calculateBasisVectors 0.00 1.94 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 1.94 0.00 1241598 0.00 0.00 protect_color_overflow 0.00 1.94 0.00 1241598 0.00 0.00 reflection 0.00 1.94 0.00 1241598 0.00 0.00 refraction 0.00 1.94 0.00 1048576 0.00 0.00 idx_stack_init 0.00 1.94 0.00 113297 0.00 0.00 fresnel 0.00 1.94 0.00 37595 0.00 0.00 idx_stack_pop ``` 省略了一些次數不多、耗時也不多的 function。可以看到前幾名被呼叫到的次數都是幾千萬次，該針對哪裡著手做優化，已經很明顯有了個大概的方向。 ## gprof2dot 雖然用 gprof 可以得到以文字所呈現的 call graph，但是我只看了一下就覺得很累＠＠大概是我對這個程式還不是非常熟悉，每個 function 的名稱有時看過就忘了，所以我還是覺得要把 call graph 視覺化出來。上網 google 了 "gprof call graph visualization"，出現的前好幾筆資料都是建議使用 gprof2dot 這個工具，所以我也決定用用看效果如何。安裝完相關套件後，執行： ``` gprof ./raytracing | gprof2dot | dot -Tpng -o output.png ``` 產生了一個 .png 的圖檔！ ![](https://i.imgur.com/YaNAHuM.png) 產生的圖檔其實滿大一張的，只是丟上來後變的有點小＠＠憑借這張圖，就能比較輕鬆知道 function 之間的關係以及程式流向，對分析也有幫助！ # 優化 ## Method 1 : loop unrolling 由於 loop 或是 if-else 相關類型的指令，都包含了 compare 以及 jump 的動作，雖然在寫高階語言時用這些方法比較方便，甚至因為省下了一些 code ，會有執行更快速的錯覺，但從彙編語言的角度來看，這些 branch 指令，每一步都是花費。以現在這個例子來看，`math-toolkit.h` 裡，一共有 5 個 function 使用了 for loop ，而且其中就有 4 個是跑了千萬次以上的 function，合計共 180070216 次，再乘上 4 ，就有高達七億兩千萬個 branch 指令被執行過，如果我們輕鬆的將這些少少的 loop 展開，就可以省下大量 cycle！單純針對 `math-toolkit.h` 裡的 function 做完 loop unrolling 的動作之後，實際測試結果： ``` Execution time of raytracing() : 1.468591 sec ``` 其中前幾大耗時的 function 包含了被修改的五個 function 的資訊： ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 18.64 0.19 0.19 56956357 0.00 0.00 subtract_vector 13.73 0.33 0.14 69646433 0.00 0.00 dot_product 11.77 0.45 0.12 13861875 0.00 0.00 rayRectangularIntersection 7.85 0.53 0.08 17821809 0.00 0.00 cross_product 6.87 0.60 0.07 31410180 0.00 0.00 multiply_vector 6.87 0.67 0.07 13861875 0.00 0.00 raySphereIntersection 5.89 0.73 0.06 17836094 0.00 0.00 add_vector 5.89 0.79 0.06 4620625 0.00 0.00 ray_hit_object 3.92 0.83 0.04 10598450 0.00 0.00 normalize 3.92 0.87 0.04 1048576 0.00 0.00 ray_color 3.92 0.91 0.04 1 0.04 1.02 raytracing 3.43 0.95 0.04 4221152 0.00 0.00 multiply_vectors ``` 總執行的時間減少了約 0.64 秒，從上面的資訊也很容易看得出來，因為主要運算的 function 加快了，所以某些 function 也多少加快了一些；雖然只是做了簡單的優話，卻產生了極大的成效！這也多虧了 gprof 的分析，才得以輕易的看出哪些 function 是需要被修改的重點。 ### loop unrolling + always_inline 之後在看 [C语言inline详细讲解](http://www.cnblogs.com/cnmaizi/archive/2011/01/19/1939686.html) 時，赫然發現原來 `inline` 是個 "建議" 型關鍵字，compiler 並不一定會因為有加 inline 就保證會對該 function 做內聯擴展，也就是說我之前寫的 code 也許根本沒起到什麼作用＠＠不過還好文章後面馬上提到加上 `__attribute__ ((always_inline))` 就可以強制 complier 將該 function 視為 inline function ，因此重新修改過後，再次編譯執行： ``` Performance counter stats for './raytracing' (10 runs): 118,2873 cache-misses # 63.538 % of all cache refs ( +- 27.88% ) 186,1665 cache-references ( +- 23.03% ) 240,9906 L1-dcache-load-misses ( +- 3.27% ) <not supported> L1-dcache-store-misses <not supported> L1-dcache-prefetch-misses 99,5265 L1-icache-load-misses ( +- 18.47% ) 1.386430395 seconds time elapsed ( +- 2.20% ) ``` 結果又縮短了 0.11 秒左右！效能比優化前提昇了 36% ! ## Method 2 : SIMD 在開始之前，不確定我的 cpu 有支援哪些東西可以讓我用，於是去看了 [spec](http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5-6600.html) ，發現 feature 裡有提到支援 AVX/AVX2，於是想從指令集裡找適合的指令為 function 做優化，以前面 loop unrolling 為基底，希望能夠幫助效能更加提昇。一開始我用了 AVX/AVX2 intrinsics 把 `math-toolkit.h` 裡幾乎全部的 function 全部都改寫過，不過與預期的不同，當我全部改寫完重新執行後： - 未強制 inlinie 版 ``` Execution time of raytracing() : 3.650931 sec ``` - 強制 inline 版 ``` Execution time of raytracing() : 3.272730 sec ``` 就算經過強制 inline 後，也比未優化版本還慢了 0.75 秒，相較於 loop unrolling 的版本，差勁了更多。 ### AVX + always_inline + register 我在觀賞學長姐得共筆時，看到有人在 code 裡加上了 "register" 的關鍵字，這是個好像有聽過，可是我卻從來沒用過得東西，特別去看了一下 register 有什麼作用，馬上看到 register 跟 inline 一樣，只是個建議型關鍵字..., 而被標示 register 的變數 "有機會" 會被放到 cpu 的寄存器裡面，大幅提昇 cpu <>訪問</s> 存取的速度！聽起來似乎很適合加在跑了千萬次的 function 裡的變數！於是我就加上了之後編譯執行看看： :::danger access 的繁體中文翻譯是「存取」，不是「訪問」，請留意 --jserv ::: ``` Execution time of raytracing() : 2.842533 sec ``` 效能提昇了不少，比未加上 register 的版本又快了 0.4 秒，但可惜的是，還是慢了 loop unrolling 的版本許多。然而我還是想知道，善用了 cpu register 之後，cache 會發生什麼樣的變化，於是我用 perf 跑了加 register 之前與之後的版本，來觀察 cache-miss 發生的情況： - 沒使用 register 關鍵字 ``` Execution time of raytracing() : 3.376952 sec Performance counter stats for './raytracing' (10 runs): 428,9820 cache-misses # 71.062 % of all cache refs ( +- 26.18% ) 603,6756 cache-references ( +- 21.22% ) 814,9161 L1-dcache-load-misses ( +- 3.59% ) <not supported> L1-dcache-store-misses <not supported> L1-dcache-prefetch-misses 2,9194,2937 L1-icache-load-misses ( +- 0.58% ) 3.575088791 seconds time elapsed ( +- 2.74% ) ``` - 有用 register 關鍵字 ``` Execution time of raytracing() : 2.816457 sec Performance counter stats for './raytracing' (10 runs): 47,6384 cache-misses # 31.484 % of all cache refs ( +- 4.11% ) 151,3082 cache-references ( +- 10.23% ) 435,9741 L1-dcache-load-misses ( +- 1.71% ) <not supported> L1-dcache-store-misses <not supported> L1-dcache-prefetch-misses 2,0195,5565 L1-icache-load-misses ( +- 0.06% ) 2.813734510 seconds time elapsed ( +- 0.11% ) ``` 加了 register 關鍵字後，cache-miss 大幅降低！從四百多萬變成四十幾萬！並且 cache-miss 發生的機率也大幅下降了 40%； L1-icache-load-misses 也少了一億次左右！ cache reference 也減少了許多，我想應該有許多地方由 cpu register 代勞的關係。 ### loop unrolling + always_inline + register 同樣的，我也想為前面 loop unrolling 的版本加上 register 試試： ``` Performance counter stats for './raytracing' (10 runs): 24,6774 cache-misses # 42.308 % of all cache refs ( +- 3.21% ) 58,3278 cache-references ( +- 10.85% ) 209,1464 L1-dcache-load-misses ( +- 1.38% ) <not supported> L1-dcache-store-misses <not supported> L1-dcache-prefetch-misses 42,2837 L1-icache-load-misses ( +- 5.41% ) 1.264693515 seconds time elapsed ( +- 0.22% ) ``` 果然相較原來 loop unrolling ＋強制 inline 的版本，快了 0.12 秒， cache-miss 也變得更低。 ### register 驗證測試不過這時我突然想起，既然 register 也是一個建議型的 keyword 那我怎麼能確定我加進 code 裡的 register 是否真的有起到作用呢？於是我 google 大略看了一下 register 有沒有類似 always inline 的用法，不過似乎沒有，~~而且大多數人似乎都覺得 register 是一個沒有用的關鍵字了，大家都只用 complier 做優化 QQ~~ > 更精確的來說，在 C/C++ 規格中已經標注 "[Remove Deprecated Use of the register Keyword](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4340.html)"，內容說明 register 關鍵字的功能已經從 Deprecated ，轉為移除該關鍵字的功能，但仍然保留了這個關鍵字，以做日後使用。 >> 感謝老師提醒，這邊不應該看別人的評論妄下結論。日後若對程式碼有疑惑，我會先看 spec 的！不過我突然想到既然我都已經強制 inline 了，那我指名要放進 register 的變數應該也真的都有照我想的做了吧！？為了確定這樣的想法是否正確，我想找個辦法觀察 cpu register 的運作，一方面確定強制 inline 是否也順便強制 register，一方面了解效能提昇的原因；我選擇用 gdb 來做接下來的測試。 ### gdb #### 測試一我想先做個簡單的測試來確認 register 是否真的有運作？強制 inline 真的也能強制 register 嘛？首先我把 dot_product() 改成： ```clike= static inline __attribute__ ((always_inline)) double dot_product(const double *v1, const double *v2) { __m256d p1 = _mm256_set_pd(v1[0], v1[1], v1[2], 0.0); __m256d p2 = _mm256_set_pd(v2[0], v2[1], v2[2], 0.0); register __m256d mul = _mm256_mul_pd(p1, p2); register __m256d hadd = _mm256_hadd_pd(mul, mul); register __m128d result1 = _mm_set1_pd(hadd[3]); register __m128d result2 = _mm_set1_pd(hadd[1]); return _mm_add_pd(result1, result2)[1]; } ``` p1, p2 是沒有加 register 的變數，而其他如 mul, hadd 都有，主要是希望透過 gdb 來看這些變數是否有被經過優化，有加 register 的地方照理來說應該是被優化過的，其他地方因為把 complier 的優化關掉了，所以都沒有優化過。 ``` vtim@vtim-ubuntu:~/raytracing$ make cc -std=gnu99 -Wall -O0 -g -mavx2 -g3 -c -o objects.o objects.c cc -std=gnu99 -Wall -O0 -g -mavx2 -g3 -c -o raytracing.o raytracing.c cc -std=gnu99 -Wall -O0 -g -mavx2 -g3 -c -o main.o main.c cc -o raytracing objects.o raytracing.o main.o -lm vtim@vtim-ubuntu:~/raytracing$ gdb -q raytracing // 用 gdb 跑 raytracing Reading symbols from raytracing...done. (gdb) l dot_product 99 } 100 101 static inline __attribute__ ((always_inline)) 102 double dot_product(const double *v1, const double *v2) 103 { 104 __m256d p1 = _mm256_set_pd(v1[0], v1[1], v1[2], 0.0); 105 __m256d p2 = _mm256_set_pd(v2[0], v2[1], v2[2], 0.0); 106 register __m256d mul = _mm256_mul_pd(p1, p2); 107 register __m256d hadd = _mm256_hadd_pd(mul, mul); 108 register __m128d result1 = _mm_set1_pd(hadd[3]); (gdb) b 107 // 設斷點在上面 107 行的位置 Breakpoint 1 at 0x40163d: /home/vtim/raytracing/math-toolkit.h:107. (21 locations) (gdb) r Starting program: /home/vtim/raytracing/raytracing # Rendering scene Breakpoint 1, dot_product (v2=0x7fffffffcba0, v1=0x7fffffffcb60) at math-toolkit.h:107 107 register __m256d hadd = _mm256_hadd_pd(mul, mul); (gdb) p p1 $1 = {0, 20, 0, 0} // 變數 p1 正常顯示 (gdb) p p2 $2 = {0, 17.345909496181164, -5.2263415432330067, 0} // 變數 p2 正常顯示 (gdb) p mul $3 = <optimized out> // 灌上 register keyword 的變數 mul 顯示 <optimized out> ``` 從上面最後幾行就看得到，有加 register 的部份的確有被 complier 採納，我做了好幾個測試結果都吻合前面的想法，所以我想強制 inline 確實也可強制 complier 採納 programmer 加入的 register keyword。 #### 測試二 >待續 ### 將部份 function 改回 loop unrolling 由於這個優化版本似乎沒有什麼進展，所以我決定改變策略，只選擇部份運算量較大、被呼叫次數較多的 function，用 AVX 去優化，看會不會突破 loop unrolling 版本的效率。可惜的是，當我逐個 function 去拔掉 AVX intrinsic instructions 時，效能就慢慢回升，直到全部拔掉，效能達到最高點，但是這就跟只用 loop unrolling 一模一樣阿！！！這真是令人太難過了＠＠我以為能夠搭配前面優話的版本進而更上一層樓，不過失敗了，我想應該有哪裡可以改進，我再思考看看。 ## Method 3 : OpenCL OpenCL 是一種針對異質性計算裝置進行平行化運算處理的 API 及 language，目前最常做的應用似乎是 GPGPU ，也就是可以運用顯示晶片做一些 3D 繪圖處理以外的運算，感覺是一種把資源的利用盡量提升到最大化的技術，決定試試看用 OpenCL 對 raytracing 做優化。由於我的筆電沒有獨顯，所以只有 Intel cpu 內建的顯示晶片，在第一關架設環境時就遇到了麻煩...，由於這方面可以查到的資訊比較少，我只能照著 Intel 官方提供的一些說明照著安裝，但是一直處在不確定有沒有安裝完全，而到了準備 build 一個 OpenCL kernel 時，它卻是用一個 OpenCL™ Code Builder 的工具，讓我目前還有點混亂...待研究中。 # Reference - Github [jrfonseca/gprof2dot](https://github.com/jrfonseca/gprof2dot) - [Crunching Numbers with AVX and AVX2](https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX) - [Overview: Intrinsics for Intel® Advanced Vector Extensions Instructions](http://technion.ac.il/doc/intel/compiler_c/main_cls/index.htm#intref_cls/common/intref_avx_extractf128_pd.htm) - [C语言inline详细讲解](http://www.cnblogs.com/cnmaizi/archive/2011/01/19/1939686.html)