# 2016q3 Homework1 (raytracing) ###### tags: `heathcliffYang` contributed by <`heathcliffYang`> # Implementation note ==I had better to read the Makefile first.== `make check` will verify the picture. `convert` can convert the ppm to png or else. `make PROFILE=1` -> compile the program needed by **gprof** in Makefile. `make clean` will remove `use-models.h`,`out.ppm` and `gmon.out`. `gprof -b ./raytracing | less` -> 從上方顯示較少資訊,並提供翻頁,`-b`是不需要解釋名詞 `gprof ./raytracing | gprof2dot | dot -T png -o callgraph_omp1.png` # Try & Problems ## 9/26 8 pm ### 1. orig execution time of raytracing() 6.232054 sec ### 2. Rerun for gprof to collect data - gprof ``` heathcliff@heathcliff-MacBookPro:~/raytracing$ gprof -b ./raytracing Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 20.34 0.50 0.50 69646433 0.00 0.00 dot_product 16.68 0.91 0.41 56956357 0.00 0.00 subtract_vector 13.42 1.24 0.33 10598450 0.00 0.00 normalize 11.80 1.53 0.29 31410180 0.00 0.00 multiply_vector 9.36 1.76 0.23 13861875 0.00 0.00 rayRectangularIntersection 6.91 1.93 0.17 13861875 0.00 0.00 raySphereIntersection 4.47 2.04 0.11 4620625 0.00 0.00 ray_hit_object 3.46 2.13 0.09 17836094 0.00 0.00 add_vector 2.44 2.19 0.06 17821809 0.00 0.00 cross_product 2.03 2.24 0.05 1048576 0.00 0.00 ray_color 1.63 2.28 0.04 4221152 0.00 0.00 multiply_vectors 1.63 2.32 0.04 2110576 0.00 0.00 compute_specular_diffuse 1.22 2.35 0.03 2520791 0.00 0.00 idx_stack_top 1.22 2.38 0.03 2110576 0.00 0.00 localColor 0.81 2.40 0.02 1241598 0.00 0.00 refraction ``` - perf ``` Performance counter stats for './raytracing' (10 runs): 78,085 cache-misses # 19.826 % of all cache refs ( +- 11.39% ) 393,846 cache-references ( +- 3.41% ) 34,500,667,719 instructions # 1.75 insns per cycle ( +- 0.00% ) 19,734,977,254 cycles ( +- 0.83% ) 6.453942844 seconds time elapsed ( +- 0.82% ) ``` ## Loop unrolling - 9/27 7pm loop unrolling 所有的 ### 1. ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 16.57 0.27 0.27 13861875 0.00 0.00 rayRectangularIntersection 14.70 0.50 0.24 56956357 0.00 0.00 subtract_vector *** 13.13 0.71 0.21 69646433 0.00 0.00 dot_product *** 11.88 0.90 0.19 10598450 0.00 0.00 normalize *** 8.44 1.04 0.14 4620625 0.00 0.00 ray_hit_object 7.50 1.16 0.12 13861875 0.00 0.00 raySphereIntersection 6.88 1.27 0.11 17836094 0.00 0.00 add_vector *** 5.32 1.35 0.09 17821809 0.00 0.00 cross_product *** 3.13 1.40 0.05 3838091 0.00 0.00 length *** 2.19 1.44 0.04 31410180 0.00 0.00 multiply_vector *** 1.88 1.47 0.03 2110576 0.00 0.00 compute_specular_diffuse 1.88 1.50 0.03 1048576 0.00 0.00 ray_color 1.56 1.52 0.03 4221152 0.00 0.00 multiply_vectors*** 0.94 1.54 0.02 113297 0.00 0.00 fresnel 0.94 1.55 0.02 2520791 0.00 0.00 idx_stack_top 0.63 1.56 0.01 2110576 0.00 0.00 localColor 0.63 1.57 0.01 1241598 0.00 0.00 protect_color_overflow 0.63 1.58 0.01 1241598 0.00 0.00 reflection 0.63 1.59 0.01 1 0.01 0.01 delete_sphere_list 0.63 1.60 0.01 1 0.01 1.59 raytracing 0.00 1.60 0.00 2558386 0.00 0.00 idx_stack_empty 0.00 1.60 0.00 1241598 0.00 0.00 refraction 0.00 1.60 0.00 1204003 0.00 0.00 idx_stack_push 0.00 1.60 0.00 1048576 0.00 0.00 idx_stack_init 0.00 1.60 0.00 1048576 0.00 0.00 rayConstruction 0.00 1.60 0.00 37595 0.00 0.00 idx_stack_pop 0.00 1.60 0.00 3 0.00 0.00 append_rectangular 0.00 1.60 0.00 3 0.00 0.00 append_sphere 0.00 1.60 0.00 2 0.00 0.00 append_light 0.00 1.60 0.00 1 0.00 0.00 calculateBasisVectors 0.00 1.60 0.00 1 0.00 0.00 delete_light_list 0.00 1.60 0.00 1 0.00 0.00 delete_rectangular_list 0.00 1.60 0.00 1 0.00 0.00 diff_in_second 0.00 1.60 0.00 1 0.00 0.00 write_to_ppm ``` ### 2. perf ``` Performance counter stats for './raytracing' (10 runs): 59,245 cache-misses # 19.567 % of all cache refs ( +- 1.89% ) 302,776 cache-references ( +- 7.82% ) 25,746,948,831 instructions # 1.61 insns per cycle ( +- 0.00% ) 15,999,672,351 cycles ( +- 0.43% ) 5.210404034 seconds time elapsed ( +- 0.56% ) ``` ## OpenMP - 針對迴圈平行化 9/27 9pm ### 1. Code ```c void normalize(double *v) { double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]); assert(d != 0.0 && "Error calculating normal"); #pragma omp parallel for for (a=0; a < 3; a++) { v[a]/=d; } } ``` ### 2. 遇到的問題 嘗試把所有math-toolkit.h裡的函式可以平行化的迴圈都平行化了,但是反而更慢更久; 之後換成只改其中一個函式,也跑得很久,這兩個都沒等到跑完。 ## 強迫改變函式屬性為always_inline - 9/27 11 pm all ### 1. Code ```c static __attribute__((always_inline)) void normalize(double *v) { double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]); assert(d != 0.0 && "Error calculating normal"); v[0] /= d; v[1] /= d; v[2] /= d; } ``` ### 2. gprof ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 30.20 0.47 0.47 13861875 0.00 0.00 rayRectangularIntersection 15.91 0.71 0.25 13861875 0.00 0.00 raySphereIntersection 14.29 0.93 0.22 2110576 0.00 0.00 compute_specular_diffuse 9.74 1.08 0.15 1048576 0.00 0.00 ray_color 9.09 1.22 0.14 2110576 0.00 0.00 localColor 6.82 1.33 0.11 4620625 0.00 0.00 ray_hit_object 5.85 1.42 0.09 1048576 0.00 0.00 rayConstruction 2.60 1.46 0.04 1 0.04 1.54 raytracing 1.30 1.48 0.02 1241598 0.00 0.00 reflection 1.30 1.50 0.02 1241598 0.00 0.00 refraction 0.97 1.51 0.02 113297 0.00 0.00 fresnel 0.65 1.52 0.01 2558386 0.00 0.00 idx_stack_empty 0.65 1.53 0.01 2520791 0.00 0.00 idx_stack_top 0.65 1.54 0.01 1241598 0.00 0.00 protect_color_overflow 0.00 1.54 0.00 1204003 0.00 0.00 idx_stack_push 0.00 1.54 0.00 1048576 0.00 0.00 idx_stack_init 0.00 1.54 0.00 37595 0.00 0.00 idx_stack_pop 0.00 1.54 0.00 3 0.00 0.00 append_rectangular 0.00 1.54 0.00 3 0.00 0.00 append_sphere 0.00 1.54 0.00 2 0.00 0.00 append_light 0.00 1.54 0.00 1 0.00 0.00 calculateBasisVectors 0.00 1.54 0.00 1 0.00 0.00 delete_light_list 0.00 1.54 0.00 1 0.00 0.00 delete_rectangular_list 0.00 1.54 0.00 1 0.00 0.00 delete_sphere_list 0.00 1.54 0.00 1 0.00 0.00 diff_in_second 0.00 1.54 0.00 1 0.00 0.00 write_to_ppm ``` ### 3. perf stat ``` Performance counter stats for './raytracing' (10 runs): 53,970 cache-misses # 19.476 % of all cache refs ( +- 2.84% ) 277,108 cache-references ( +- 6.25% ) 13,144,659,166 instructions # 1.70 insns per cycle ( +- 0.00% ) 7,750,237,141 cycles ( +- 0.74% ) 2.540274253 seconds time elapsed ( +- 1.00% ) ``` >感謝hugikun999的提醒 ## 第二次嘗試 OpenMP + inline - 9/28 1am ### 1. 發現問題 加了inline之後,又試著在math-toolkit.h裡用OpenMP,但時間依然很長(兩百多秒);後來想到,每個函式被叫用那麼多次,但是每個都是迴圈短、運算比較簡單的時候,就不需要每個簡單的運算都用一個thread去執行。 ### 2. [call graph](http://www.openfoundry.org/tw/tech-column/8352-callgraphviz-cscopegraphviz-xdot-call-graph-visualizer) - Loop unrolling + inline (尚未加上omp) ![](https://i.imgur.com/sXaGSdl.png) - `ray_color()` & `rayConstruction()` ### 3. 嘗試修改 raytracing() ```c /* @param background_color this is not ambient light */ void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height) { point3 u, v, w, d; color object_color = { 0.0, 0.0, 0.0 }; /* calculate u, v, w */ calculateBasisVectors(u, v, w, view); idx_stack stk; int factor = sqrt(SAMPLES); for (int j = 0; j < height; j++) { for (int i = 0; i < width; i++) { double r = 0, g = 0, b = 0; /* MSAA */ #pragma omp parallel for firstprivate(r,g,b) for (int s = 0; s < SAMPLES; s++) { idx_stack_init(&stk); rayConstruction(d, u, v, w, i * factor + s / factor, j * factor + s % factor, view, width * factor, height * factor); if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres, lights, object_color, MAX_REFLECTION_BOUNCES)) { r += object_color[0]; g += object_color[1]; b += object_color[2]; } else { r += background_color[0]; g += background_color[1]; b += background_color[2]; } pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES; pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES; } } } } ``` **speed up but check fail** ``` Performance counter stats for './raytracing' (10 runs): 67,856 cache-misses # 0.188 % of all cache refs ( +- 2.44% ) 36,086,782 cache-references ( +- 2.19% ) 16,662,901,112 instructions # 0.91 insns per cycle ( +- 5.07% ) 18,386,293,743 cycles ( +- 4.35% ) 1.601089390 seconds time elapsed ( +- 4.40% ) ``` -> data dependent. ### 4. loop unrolling + inline + omp ![](https://i.imgur.com/bG6EV90.png) **** ## Loop unrolling raytracing() - 9/28 3pm ### 1. Loop unrolling 第三層迴圈 - 在嘗試解決加入OpenMP前後相依造成的錯誤當中,觀察到SAMPLE只有4,又是同時存取r,g,b三個值,故將他的迴圈解開。之後,分別兩個嘗試 - 1)把OpenMP留著 -> 速度快但check fail ``` Performance counter stats for './raytracing' (10 runs): 207,759 cache-misses # 0.276 % of all cache refs ( +- 2.99% ) 75,389,334 cache-references ( +- 3.65% ) 12,420,744,790 instructions # 0.72 insns per cycle ( +- 0.15% ) 17,284,430,642 cycles ( +- 0.58% ) 1.931539834 seconds time elapsed ( +- 1.13% ) ``` - 2)移除OpenMP -> 資料正確 ``` Performance counter stats for './raytracing' (10 runs): 248,194 cache-misses # 35.564 % of all cache refs ( +- 3.97% ) 697,873 cache-references ( +- 9.18% ) 13,320,768,291 instructions # 1.69 insns per cycle ( +- 0.00% ) 7,859,385,254 cycles ( +- 0.47% ) 2.606176699 seconds time elapsed ( +- 0.38% ) ``` ### 2. call graph 檢查變數是否可能被多個thread存取 ![](https://i.imgur.com/uBARYan.png) ## Omp success - 9/28 7pm ### 1. 成功把raytracing用omp平行,check成功 在Unroll 第三層迴圈之下,找出dependent的參數d, stk, object_color ### 2. perf stat ``` Performance counter stats for './raytracing' (10 runs): 74,722 cache-misses # 0.097 % of all cache refs ( +- 3.33% ) 77,274,246 cache-references ( +- 5.31% ) 12,810,788,231 instructions # 0.76 insns per cycle ( +- 0.11% ) 16,853,158,191 cycles ( +- 0.42% ) 2.026637683 seconds time elapsed ( +- 0.80% ) ``` ### 3. call gpraph ![](https://i.imgur.com/R7o5Sby.png) ### 4. 與沒有inline dot_product比較 perf stat ``` Performance counter stats for './raytracing' (10 runs): 102,166 cache-misses # 0.059 % of all cache refs ( +- 8.70% ) 173,186,250 cache-references ( +- 5.89% ) 63,646,408 branch-misses ( +- 1.08% ) 15,806,263,440 instructions # 0.50 insns per cycle ( +- 0.26% ) 31,412,416,665 cycles ( +- 1.00% ) 3.722181320 seconds time elapsed ( +- 1.97% ) ``` ## num_threads()的嘗試 9/28 - 11pm ### 1. 測試分配不同數量的threads對效能的影響 - 500 ``` Performance counter stats for './raytracing' (10 runs): 586,040 cache-misses # 1.565 % of all cache refs ( +- 1.72% ) 37,444,312 cache-references ( +- 0.24% ) 12,194,796,576 instructions # 0.85 insns per cycle ( +- 0.01% ) 14,430,824,154 cycles ( +- 0.12% ) 1.266002949 seconds time elapsed ( +- 0.29% ) ``` - 450 ``` Performance counter stats for './raytracing' (10 runs): 571,745 cache-misses # 1.523 % of all cache refs ( +- 1.76% ) 37,550,691 cache-references ( +- 0.42% ) 12,194,710,719 instructions # 0.84 insns per cycle ( +- 0.01% ) 14,506,164,926 cycles ( +- 0.38% ) 1.275681361 seconds time elapsed ( +- 0.54% ) ``` - 256 ``` Performance counter stats for './raytracing' (10 runs): 400,710 cache-misses # 1.071 % of all cache refs ( +- 5.59% ) 37,425,937 cache-references ( +- 0.36% ) 12,182,627,168 instructions # 0.84 insns per cycle ( +- 0.01% ) 14,428,624,839 cycles ( +- 0.21% ) 1.277496560 seconds time elapsed ( +- 0.50% ) ``` - 12 ``` Performance counter stats for './raytracing' (10 runs): 93,895 cache-misses # 0.167 % of all cache refs ( +- 11.93% ) 56,274,093 cache-references ( +- 1.03% ) 12,322,845,231 instructions # 0.74 insns per cycle ( +- 0.02% ) 16,560,790,058 cycles ( +- 0.33% ) 1.497703779 seconds time elapsed ( +- 0.28% ) ``` ## POSIX Thread ## software pipelining ## # 分析與效能優化策略 ## 觀察 ### 1. math-toolkit.h - 根據gprof的結果,發現主要耗時的是math-toolkit.h的4個函式,因此將會針對這4個函式進行優化 1. Loop unrolling `dot_product()` & `subtract_vector()` : 減少不必要的跳出迴圈判斷 - 發現`multiply_vector()`的執行時間大於`multiply_vectors()` - 運算式多為vector的運算,單個函式呼叫中資料彼此之前沒有關係 -> ~~用 OpenMP 把迴圈的運算平行化~~ - `inline`是建議compiler函式的屬性,內嵌,與編譯的方法相關 ### ~~2. main.c~~ ### 3. raytracing.c - 把所有raytracing()呼叫到的函式檢查一遍是否有共用存取 1. idx_stack.h -> **把idx_stack_init inline** **(not yet)** 2. ray_color(const point3 e, double t, const point3 d, ==idx_stack *stk==, const rectangular_node rectangulars, const sphere_node spheres, const light_node lights, color object_color, int bounces_left) add_vector(object_color, refraction_part, object_color); 3. rayConstruction(d, u, v, w, i * factor + 0 / factor, j * factor + 0 % factor, view, width * factor, height * factor); subtract_vector(s, view->vrp, d); normalize(d); 4. ray_hit_object(const point3 e, const point3 d, double t0, double t1, const rectangular_node rectangulars, rectangular_node *hit_rectangular, const sphere_node spheres, sphere_node *hit_sphere) *hit_sphere = sphere; *hit_rectangular = NULL; 5. reflection(r, d, op.normal); 6. refraction(rr, d, ip.normal, idx, idx_pass); # 函式的屬性 - Attributes of Functions `__attribute__()`可以在宣告一個函式時,標明函式的屬性,用法是: `void f () __attribute__ (attributeName)` ==-->這個可能參考資料寫錯了,因為compile時有出現錯誤訊息:== `attributes should be specified before the declarator in a function definition static` ## 種類 - `always_inline` : 標明函式是要用內嵌,也就是不使用呼叫、跳去記憶體位置執行函式,而是在編譯時將函式在呼叫點展開。 (`inline`:建議,compiler不一定會照做,擺法也不太一樣,`inline`位在宣告時的函式名稱最前面) [參考資料1](http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Function-Attributes.html) [參考資料2](https://dotblogs.com.tw/v6610688/2013/11/27/introduction_inline_function) # [Gprof](https://sourceware.org/binutils/docs/gprof/) print the execution time of each function in the process. 1. `-pg` --> this option represents for profiling when compiling. Besides, it must be a compilation option as well as a link option. 2. `-Q` --> ==switch to suppress the printing of the call graph data.== 3. Run the program to generate the data gprof needs. `gprof options [exe] [profile-data-file] > outfile` **If you omit the executable file name, the file a.out is used. If you give no profile data file name, the file gmon.out is used.** 4. **(Not yet)**:To link the program for profiling, if you use a compiler such as cc to do the linking, simply specify `-pg' in addition to your usual options. # gprof2dot #### SIMD (Single instruction, multiple data) # AVX (Advanced Vector Extensions) 將大部分的整數運算指令擴展到 256 bits ## Register - 16 個 YMM registers. - 8 個 32-bit 的單浮點數或 4 個 64-bit 的雙浮點數 - XMM0 – XMM7 to YMM0 – YMM7 (in x86-64 mode, YMM0 – YMM15) [參考資料](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) # SSE (Streaming SIMD Extensions) ## C programming `sqrt()` calculates the square root. > hugikun999 : `apt install python-pip` -> `pip install gprof2dot`