2016q3 Homework 1 (raytracing)

# 2016q3 Homework 1 (raytracing) contributed by <`ChenYi`> ### Reviewed by `nekoneko` - 可以明確指出在dot_product使用AVX指令，時間沒有減少的原因 - 由compute-pi的程式碼和Intel Optimized文件，分析dot_prodect是否值得使用AVX加速，若不適合，能指出程式有沒有其他部份可用AVX加速，並實做。 ## 開發環境 Destribution: Ubuntu 14.04.5 LTS CPU : Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz * L1d 快取：32K * L1i 快取：32K * L2 快取：256K * L3 快取：8192K >> 避免用圖片表示程式輸出的文字訊息，請愛用 command line [name=jserv] >> 好的謝謝提醒等等修正 [name=ChenYi] ## 準備工作弄好環境,clone下來 ``` sudo apt-get update sudo apt-get install graphviz sudo apt-get install imagemagick sudo apt-get install gprof git clone https://github.com/sysprog21/raytracing ``` ## gprof * gprof是個可以拿來看程式執行時，跑過哪些function和執行比率的一個實用工具。 * 從這些資訊中，我們可以觀察到哪些function特別花時間，和被呼叫特別多次，藉由這些東西，可以針對特定function去做改善，因而得到更好的表現。 * 以raytracing-origin為例，make時得先立個flag ``` make PROFILE=1 ``` * 其對應到的指令是-pg (在Makefile裡) ``` ifeq ($(strip $(PROFILE)),1) PROF_FLAGS = -pg <======== this CFLAGS += $(PROF_FLAGS) LDFLAGS += $(PROF_FLAGS) endif ``` * 目的是程式執行時，產生一個gmon.out，以供gprof使用 ## gprof2dot * 在[Ping-Ling's Hackpad](https://embedded2016.hackpad.com/2016q1-Homwork2a--QixiAsqbVaf)裡看到的，可以方便觀看程式的執行脈絡 * 因為是輸出成圖片，比單純使用gprof 好觀看很多 ## Origin 執行程式 & gprof * 因為直接gprof會很醜,所以排版一下(less) ``` ./raytracing gprof ./raytracing | less ``` Raytracing結果圖 ![](https://i.imgur.com/2xNrnRd.png) 用上個作業學到的perf來看看cache misses如何 ``` Execution time of raytracing() : 3.315326 sec Performance counter stats for './raytracing': 22,843 cache-misses # 18.615 % of all cache refs 122,711 cache-references 3,238,239 L1-dcache-load-misses <not supported> L1-dcache-store-misses <not supported> L1-dcache-prefetch-misses 151,876 L1-icache-load-misses 3.317456277 seconds time elapsed ``` 感覺問題不是出在cahce-misses , 用gprof看一下時間花在哪 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 25.49 0.67 0.67 69646433 0.00 0.00 dot_product 19.02 1.17 0.50 56956357 0.00 0.00 subtract_vector 10.27 1.44 0.27 13861875 0.00 0.00 rayRectangularIntersection 9.51 1.69 0.25 10598450 0.00 0.00 normalize 6.85 1.87 0.18 31410180 0.00 0.00 multiply_vector ``` 從這邊可以看到佔用次數前幾名分別為dot_product(),subtract_vector(),rayRectangularIntersection() * 因為raytracing必須經常計算向量入射與反射角及其投影，所以可以看到在這邊dot的使用率佔用了全部次數中的25.49% * subtract_vector()同上 * 有趣的是rayRectangularIntersection()為什麼會這麼高呢？ * 我推測是因為圖中3個矩型佔用的面積最大 ==> 吃到最多的射線 ### 這邊利用gprof2dot圖像化一下方便觀察 ![](https://i.imgur.com/IdrOtrI.png) * 愈篇藍色就愈不重要(這邊是指即使改進也不會增加太大的效能) * 所以我們要關注的便是顏色比較不一樣的那幾個 ## 刪除不必要的分支來做點最基礎的最佳化吧！已知向量相加減，以vec3為例是 ``` v1 + v2 ===> (v1.x+v2.x) , (v1.y+v2.y) , (v1.z+v2.z) ``` 那麼程式碼中的for其實是不需要用到的(loop unrolling) ``` for (int i = 0; i < 3; i++) out[i] = a[i] + b[i]; //change to out[0] = a[0] + b[0]; out[1] = a[1] + b[1]; out[2] = a[2] + b[2]; ``` 嘗試去節省sub/add跑for還要比對的時間結果 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 29.38 0.64 0.64 69646433 0.00 0.00 dot_product 12.85 0.92 0.28 13861875 0.00 0.00 rayRectangularIntersection 8.03 1.10 0.18 10598450 0.00 0.00 normalize 7.80 1.27 0.17 13861875 0.00 0.00 raySphereIntersection 7.34 1.43 0.16 31410180 0.00 0.00 multiply_vector 7.34 1.59 0.16 4620625 0.00 0.00 ray_hit_object 6.43 1.73 0.14 17821809 0.00 0.00 cross_product 4.59 1.83 0.10 56956357 0.00 0.00 subtract_vector 4.13 1.92 0.09 1048576 0.00 0.00 ray_color ``` sub跑到後面了，代表著佔用時間大幅下降　==> 可行阿！！改掉dot_product() , multi_vectors(...) , multi_vector(...) ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 17.20 0.33 0.33 69646433 0.00 0.00 dot_product 13.55 0.59 0.26 10598450 0.00 0.00 normalize 11.73 0.82 0.23 13861875 0.00 0.00 raySphereIntersection 11.20 1.03 0.22 56956357 0.00 0.00 subtract_vector 10.94 1.24 0.21 13861875 0.00 0.00 rayRectangularIntersection ``` 分散變平均了，cumulative time也下降了不少 Total execution time 用Gnuplot畫一下 ![](https://i.imgur.com/utrpa2z.png) ``` Origin: Execution time of raytracing() : 4.368682 sec Remove branch: Execution time of raytracing() : 3.399409 sec ``` >> 用「表格」來比較，醒目 [name=jserv] >> 感謝老師提醒順便請問一下表格該怎麼做比較好[name=chenyi] ## OpenMP 嘗試一下使用OpenMP來做平行化修改substract_vector(...) ``` #pragma omp parallel sections { #pragma omp section { out[0] = a[0] - b[0]; } #pragma omp section { out[1] = a[1] - b[1]; } #pragma omp section { out[2] = a[2] - b[2]; } } ``` 結果 ``` Rendering scene Done! Execution time of raytracing() : 67.389720 sec ``` 恩... 我思考了很久也看過春季班同學的hackpad[4]了，實在想不出什麼原因不過就結果而言，我的結果確實是失敗了後來在其他人的hackpad上看到對raytacing(...)裡的三層for做平行化的話，其結果是可行參考[TempleJiJi's hackpad](https://hackmd.io/IwDhE4AZmB2BaATAYwKzHgFlQI1fHAZgDNF5JFhxZJCRNJkBTIA=#) 怎麼做的就不放上來了，因為我這邊只有試跑而已在我的實機上的測試結果如下 (16 threads) ``` # Rendering scene Done! Execution time of raytracing() : 0.391119 sec ``` loop unrolling版的為1.589252 sec :::warning 這邊的測試都有關掉-pg，我的實機測試如果沒關的話，使用OpenMP反而是會比較慢的 ::: 大致上想一下為什麼在三層for裡會成功，而在math_toolkit.h裡用section做會失敗 * CPU資源的配置不是免費的！ * 進section時切成多個threads，出sections後回歸main thread * 又dot_product()執行次數及section內容不值得這麼做 * 在配置資源的耗損上，應該是遠大於平行化所帶來的效益，所以才會出現60秒這種恐怖的數字以圖示大概是如此 ![](https://i.imgur.com/voGeiDU.png) 圖片來源(https://computing.llnl.gov/tutorials/openMP/) ## SIMD instructions 查一下自己CPU是否支援　（[我的規格](http://ark.intel.com/zh-tw/products/80910/Intel-Xeon-Processor-E3-1231-v3-8M-Cache-3_40-GHz)） SSE支援到4.2 AVX支援到2.0 看了一下SSE是128bits AVX則是256bits 這次作業code裏面是用double , 所以這邊選擇使用AVX 把dot_product(...) 裡的換掉 ``` static inline __attribute__((always_inline)) double dot_product(const double *v1, const double *v2) { double tmp[4]; register __m256d reg1 , reg2 , out_reg; reg1 = _mm256_set_pd(0.0 ,v1[2] , v1[1] , v1[0]); reg2 = _mm256_set_pd(0.0 ,v2[2] , v2[1] , v2[0]); out_reg = _mm256_shuffle_pd(reg1 , reg2 , _MM_SHUFFLE(0,1,2,3)); _mm256_store_pd(tmp , out_reg); return tmp[0]+tmp[1]+tmp[2]; } ``` :::info 在編譯上要記得加個 -mavx ::: 執行花了2.612381 秒居然比原版的還要慢！想了一下，慢的點應該出在__mmstore_ps()，這個用途是在於把資料搬回記憶體上但原版的方式為不斷地累加，正常安排暫存器的話大概會是全放在同一個暫存器上所以應該會比不斷store的store還來的快一點，詳細情形有時間去找老師問問看好了在Intel Intrinsics Guide裡的Operation ``` MEM[mem_addr+255:mem_addr] := a[255:0] ``` ## 總結各個最佳化方式的比較圖 ![](https://i.imgur.com/y7AauHj.png) 從圖中可以看到最快的是對for loop平行化的 OpenMP+loop unrolling，比起原本的程式碼快了有5倍之多。令人訝異的是沒想到AVX+loop unrolling會比原版還慢這點。事實上我還看了一下春季班同學們的筆記，同樣做平行化，pthread的方式又會比OpenMP快了一些，畢竟OpenMP使用上還有不少限制，有時間的話在回頭來做作看pthread的實驗。 ## 題外話 * 其實向量的運算的話，有不少library可用，像glm , bullet原生的等等。 ## References * [作業網站](https://hackmd.io/s/B1W5AWza) * [Ping-Ling's Hackpad](https://embedded2016.hackpad.com/2016q1-Homwork2a--QixiAsqbVaf) * [OpenMP簡介](https://kheresy.wordpress.com/2006/06/09/%e7%b0%a1%e6%98%93%e7%9a%84%e7%a8%8b%e5%bc%8f%e5%b9%b3%e8%a1%8c%e5%8c%96%e6%96%b9%e6%b3%95%ef%bc%8dopenmp%ef%bc%88%e4%b8%80%ef%bc%89%e7%b0%a1%e4%bb%8b/) * [王紹華's hackpad](https://embedded2016.hackpad.com/2016q1-Homework-2forA-ZIYzFrLSRDA) * [ストリーミング SIMD 拡張命令の算術演算](http://www2.kobe-u.ac.jp/~lerl2/l_cc_p_10.1.008/doc/main_cls/mergedProjects/intref_cls/common/intref_sse_arithmetic.htm) * [Crunching Numbers with AVX and AVX2](http://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX) * [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX&expand=100) * [Real-Time Collision Detection](https://www.amazon.com/exec/obidos/tg/detail/-/1558607323?tag=realtimecolli-20) * [GLM gihub](https://github.com/g-truc/glm) * [bulletphysics](https://github.com/bulletphysics/bullet3) ###### tags: `fzzzz` 'sysprog2016'

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.