# 2016q3 Homework (raytracing)
contributed by <`kaizsv`>

## prof

```
# Rendering scene
Done!
Execution time of raytracing() : 6.473611 sec
```

```
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  s/call   s/call  name
 24.22      0.53     0.53  56956357    0.00     0.00  subtract_vector
 23.30      1.04     0.51  69646433    0.00     0.00  dot_product
 13.94      1.35     0.31  31410180    0.00     0.00  multiply_vector
  6.85      1.50     0.15  17836094    0.00     0.00  add_vector
  5.48      1.62     0.12  10598450    0.00     0.00  normalize
  4.57      1.72     0.10  17821809    0.00     0.00  cross_product
  4.57      1.82     0.10  13861875    0.00     0.00  rayRectangularIntersection
  4.57      1.92     0.10  13861875    0.00     0.00  raySphereIntersection
```

The output above is my gprof result. The most frequently called functions are `subtract_vector`, `dot_product`, `multiply_vector`, and `add_vector`, so I start by applying loop unrolling to them (a sketch of an unrolled helper is given at the end of this note).

#### loop unrolling

```
# Rendering scene
Done!
Execution time of raytracing() : 5.899568 sec
```

Roughly 0.6 seconds faster than before (6.47 s → 5.90 s).

#### OpenMP

```
# Rendering scene
Done!
Execution time of raytracing() : 0.983607 sec
Verified OK
```

OpenMP runs parallel code on shared-memory machines; enable it by adding `-fopenmp` when compiling.

```C=
#include <omp.h>

#pragma omp parallel for num_threads(16) \
            schedule(guided, 4) \
            private(d) \
            private(stk) \
            firstprivate(object_color)
for (int j = 0; j < height; j++) {
    for (int i = 0; i < width; i++) {
        double r = 0, g = 0, b = 0;
        /* MSAA */
        for (int s = 0; s < SAMPLES; s++) {
            idx_stack_init(&stk);
            rayConstruction(d, u, v, w,
                            i * factor + s / factor,
                            j * factor + s % factor,
                            view, width * factor, height * factor);
            if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
                          lights, object_color,
                          MAX_REFLECTION_BOUNCES)) {
                r += object_color[0];
                g += object_color[1];
                b += object_color[2];
            } else {
                r += background_color[0];
                g += background_color[1];
                b += background_color[2];
            }
            pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
        }
    }
}
```

`#pragma omp parallel for` : an OpenMP compiler directive telling the compiler to parallelize the `for` loop that follows.

`num_threads(16)` : how many threads to run with; `num_threads(omp_get_max_threads())` can be used to run with the maximum number of threads.

`schedule(guided, 4)` : the `schedule` clause tells the compiler how to distribute loop iterations among the threads. (A small runnable demonstration of the schedule kinds appears at the end of this note.)

`static` : the iteration space is split into chunks that are handed to the threads in order. `schedule(static)` and `schedule(static, 4)` behave as in the following example:

```
#pragma omp parallel for num_threads(4) schedule(static)
for (int i = 0; i < 1000; i++) {}

thread 1: i = 0 ~ 249
thread 2: i = 250 ~ 499
thread 3: i = 500 ~ 749
thread 4: i = 750 ~ 999

#pragma omp parallel for num_threads(4) schedule(static, 4)
for (int i = 0; i < 1000; i++) {}

thread 1: i = 0, 1, 2, 3, 16, 17...
thread 2: i = 4, 5, 6, 7, 20, 21...
thread 3: i = 8, 9, 10, 11, 24, 25...
thread 4: i = 12, 13, 14, 15, 28, 29...
```

`dynamic` : a thread is assigned the next chunk only after it finishes its current one.

`guided` : similar to `dynamic`, but the chunk size decreases exponentially.

`auto` : the compiler/runtime decides.

`runtime` : the user decides via the `OMP_SCHEDULE` environment variable.

`private` and `shared` : a `private` variable gives each thread its own copy, while a `shared` variable is shared by all threads.

`firstprivate` : also a private variable, but if the variable had an initial value before entering the loop, `firstprivate` preserves it; with plain `private` the initial value is indeterminate. Similarly, `lastprivate` controls whether the value from the last iteration is copied back to the original variable after the loop. (See the `private`/`firstprivate` example at the end of this note.)

Inside the `raytracing` loop, `stk` is reset at the start of every iteration and `d` is written (normalized) by `rayConstruction` before it is used to compute the color, so both are declared `private`; `object_color` already has an initial value before entering the loop, so it is declared `firstprivate`.

[多核心高效能程式開發](http://weblis.lib.ncku.edu.tw/search~S1*cht?/X{u591A}{u6838}{u5FC3}&searchscope=1&SORT=D/X{u591A}{u6838}{u5FC3}&searchscope=1&SORT=D&SUBKEY=%E5%A4%9A%E6%A0%B8%E5%BF%83/1%2C159%2C159%2CB/frameset&FF=X{u591A}{u6838}{u5FC3}&searchscope=1&SORT=D&11%2C11%2C)

###### tags: `assigment_2` `raytracing`
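Supplementary sketch for the loop unrolling step above. The note does not show the modified math-toolkit helpers, so the following is only a minimal sketch of what the change could look like; the function names come from the assignment's `math-toolkit.h`, but the exact original bodies and signatures are an assumption:

```c
/* A sketch (not necessarily the author's exact code): the vector helpers
 * originally loop over the 3 components; loop unrolling replaces the loop
 * with explicit statements, removing the loop counter and branch. */
static inline void subtract_vector(const double *a, const double *b,
                                   double *out)
{
    /* before: for (int i = 0; i < 3; i++) out[i] = a[i] - b[i]; */
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}

static inline double dot_product(const double *v1, const double *v2)
{
    /* before: accumulate v1[i] * v2[i] in a 3-iteration loop */
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}
```

`multiply_vector` and `add_vector` can be unrolled the same way.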
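A small runnable check of the `schedule` behaviour described earlier (the iteration count and chunk size here are arbitrary, chosen only for illustration). Compiling with `-fopenmp` and running it prints which thread received which iteration, so the round-robin chunks of `schedule(static, 4)` and the shrinking chunks of `schedule(guided, 4)` can be observed directly:

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* static, chunk 4: chunks of 4 iterations are dealt to threads
     * round-robin, fixed at loop entry. */
    #pragma omp parallel for num_threads(4) schedule(static, 4)
    for (int i = 0; i < 32; i++)
        printf("static,4 : thread %d got i = %d\n", omp_get_thread_num(), i);

    /* guided, minimum chunk 4: threads grab chunks on demand and the
     * chunk size shrinks as the remaining work decreases. */
    #pragma omp parallel for num_threads(4) schedule(guided, 4)
    for (int i = 0; i < 32; i++)
        printf("guided,4 : thread %d got i = %d\n", omp_get_thread_num(), i);

    return 0;
}
```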
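To make the `private` / `firstprivate` distinction concrete, here is a small standalone example (the variable `x` is made up for illustration): with `private`, each thread gets an uninitialized copy that must be written before it is read; with `firstprivate`, each copy starts from the value `x` had before the parallel region:

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int x = 42;

    /* private: every thread has its own x, but its initial value inside
     * the region is indeterminate, so write it before reading it. */
    #pragma omp parallel for num_threads(4) private(x)
    for (int i = 0; i < 4; i++) {
        x = i;
        printf("private     : thread %d, x = %d\n", omp_get_thread_num(), x);
    }

    /* firstprivate: every thread's copy starts from 42. */
    #pragma omp parallel for num_threads(4) firstprivate(x)
    for (int i = 0; i < 4; i++)
        printf("firstprivate: thread %d, x = %d\n", omp_get_thread_num(), x);

    /* Without lastprivate, the per-thread copies are discarded, so the
     * original x is still 42 here. */
    printf("after loops : x = %d\n", x);
    return 0;
}
```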