contributed by <Sean1127>
Daichou
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 69
Model name: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Stepping: 1
CPU MHz: 1166.748
CPU max MHz: 2700.0000
CPU min MHz: 800.0000
BogoMIPS: 4789.06
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
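Graphviz is not run directly as $ graphviz; instead you use the utilities it provides. $ man graphviz lists all the available tools.
Part of the tooling ships in linux-tools-generic (mind the version).
You need to learn the dot language before you can draw block diagrams.
$ sudo apt-get install graphviz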
$ sudo apt-get install python-pip
$ sudo pip install gprof2dot
$ sudo apt-get install imagemagick
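Reference: Graphviz - 用指令來畫關係圖吧 (drawing relationship graphs from the command line)
HackMD can also draw diagrams from dot syntax; I will look into that when I have time, since it is not today's focus.
All we need to know is: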
$ gprof ./raytracing | gprof2dot | dot -Tpng -o output.png
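- Recompile the code with make PROFILE=1 and learn how to use gprof
- Use gprof to pinpoint the performance bottlenecks, then start rewriting the function implementations, including those in math-toolkit.h, and record the performance differences thoroughly in the shared notes
- Make good use of techniques such as POSIX Threads, OpenMP, software pipelining, and loop unrolling to speed up the program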
$ gprof ./raytracing | less
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
22.69 0.66 0.66 69646433 0.00 0.00 dot_product
20.63 1.26 0.60 56956357 0.00 0.00 subtract_vector
9.80 1.55 0.29 13861875 0.00 0.00 rayRectangularIntersection
8.60 1.80 0.25 10598450 0.00 0.00 normalize
7.56 2.02 0.22 31410180 0.00 0.00 multiply_vector
6.53 2.21 0.19 17821809 0.00 0.00 cross_product
% the percentage of the total running time of the
time program used by this function.
cumulative a running sum of the number of seconds accounted
seconds for by this function and those listed above it.
self the number of seconds accounted for by this
seconds function alone. This is the major sort for this
listing.
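self is the time spent in the function itself.
% time is that function's share of the total run time.
cumulative is the running sum of the self time of this function (ranked n-th by % time) and of every function ranked above it.
dot_product and subtract_vector each account for roughly 20 %.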
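The relationship between the columns can be sketched in a few lines of C. This is only an illustration of the arithmetic, not gprof's actual implementation; the function names `cumulative_seconds` and `percent_time` are made up for this note.

```c
#include <stddef.h>

/* Fill cumulative[i] with the sum of self_seconds[0..i], mirroring how the
 * "cumulative seconds" column is a running sum of "self seconds".
 * (Hypothetical helper, not part of gprof.) */
void cumulative_seconds(const double *self_seconds, double *cumulative,
                        size_t n)
{
    double running = 0.0;
    for (size_t i = 0; i < n; i++) {
        running += self_seconds[i];
        cumulative[i] = running;
    }
}

/* Share of the total run time spent in one function, in percent. */
double percent_time(double self_seconds, double total_seconds)
{
    if (total_seconds <= 0.0)
        return 0.0;
    return 100.0 * self_seconds / total_seconds;
}
```

Feeding in the first three self times from the flat profile above (0.66, 0.60, 0.29) reproduces its cumulative column: 0.66, 1.26, 1.55.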
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
for (int i = 0; i < 3; i++)
out[i] = a[i] - b[i];
}
static inline
double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
for (int i = 0; i < 3; i++)
dp += v1[i] * v2[i];
return dp;
}
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
out[0] = a[0] - b[0];
out[1] = a[1] - b[1];
out[2] = a[2] - b[2];
out[3] = a[3] - b[3];
}
*** stack smashing detected ***: ./raytracing terminated
Aborted (core dumped)
OK, something must have gone wrong.
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
out[0] = a[0] - b[0];
out[1] = a[1] - b[1];
out[2] = a[2] - b[2];
}
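A careless slip on my part: the vectors have only three elements, so the extra out[3] line writes past the end of the array.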
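A small check along these lines could have caught the stray out[3] write before it smashed the stack. It is a sketch only: the names `subtract_vector_loop`, `subtract_vector_unrolled`, and `unrolled_matches_loop` are invented here, and the sentinel slot is a testing trick, not something the original code uses.

```c
/* Loop form, as in the original math-toolkit.h. */
void subtract_vector_loop(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] - b[i];
}

/* Unrolled form: exactly three statements, for indices 0, 1 and 2. */
void subtract_vector_unrolled(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}

/* Returns 1 when both forms agree and the slot just past the vector is
 * left untouched, 0 otherwise. */
int unrolled_matches_loop(const double *a, const double *b)
{
    double expect[3];
    double got[4] = {0.0, 0.0, 0.0, 12345.0}; /* got[3] is a sentinel */

    subtract_vector_loop(a, b, expect);
    subtract_vector_unrolled(a, b, got);

    for (int i = 0; i < 3; i++)
        if (expect[i] != got[i])
            return 0;
    return got[3] == 12345.0;
}
```

The buffer handed to the unrolled version is deliberately one element longer than the vector, so an off-by-one write lands on the sentinel instead of on the stack.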
Execution time of raytracing() : 2.203631 sec
% cumulative self self total
time seconds seconds calls s/call s/call name
23.73 0.42 0.42 69646433 0.00 0.00 dot_product
14.87 0.68 0.26 56956357 0.00 0.00 subtract_vector
11.15 0.87 0.20 31410180 0.00 0.00 multiply_vector
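dot_product: dropped from 0.66 to 0.42
subtract_vector: dropped from 0.60 to 0.26
inline copies the body of a function into its caller (a bit like when you first learn to program and, not yet knowing how to split files or write functions, do everything inside main).
-O0 makes the compiler ignore inline, so __attribute__((always_inline)) has to be added in front of the function to force inlining.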
static inline __attribute__((always_inline))
double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
dp += v1[0] * v2[0];
dp += v1[1] * v2[1];
dp += v1[2] * v2[2];
return dp;
}
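If the attribute has to be repeated on many functions, one option is to hide it behind a macro. The sketch below assumes GCC or Clang; `FORCE_INLINE`, `dot3`, `sub3`, and `squared_distance` are names invented for this example and do not appear in the original code.

```c
/* Hypothetical convenience macro: force inlining on GCC/Clang, fall back
 * to a plain inline hint on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

FORCE_INLINE double dot3(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}

FORCE_INLINE void sub3(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}

/* Ordinary (non-inline) caller: squared distance between two points.
 * Both helpers are expanded in place here, even at -O0 on GCC/Clang. */
double squared_distance(const double *p, const double *q)
{
    double d[3];
    sub3(p, q, d);
    return dot3(d, d);
}
```

With this arrangement the helpers no longer show up as separate rows in the gprof flat profile; their time is attributed to the caller instead.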
Execution time of raytracing() : 1.997267 sec
% cumulative self self total
time seconds seconds calls s/call s/call name
30.99 0.57 0.57 13861875 0.00 0.00 rayRectangularIntersection
29.36 1.11 0.54 13861875 0.00 0.00 raySphereIntersection
10.33 1.30 0.19 2110576 0.00 0.00 compute_specular_diffuse
Every function in raytracing.c except ray_hit_object, ray_color, and raytracing inlined:
Execution time of raytracing() : 1.950107 sec
Every function in raytracing.c and idx_stack.h inlined:
raytracing.c: In function ‘ray_color’:
raytracing.c:356:14: error: inlining failed in call to always_inline ‘ray_color’: recursive inlining
unsigned int ray_color(const point3 e, double t,
^
raytracing.c:439:13: error: called from here
if (ray_color(ip.point, MIN_DISTANCE, r, stk, rectangulars, sp
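The error arises because ray_color calls itself, and a function cannot be fully expanded into its own body. A minimal sketch of the usual way out is to leave the recursive function ordinary and force-inline only its non-recursive helpers. The functions `attenuate` and `bounce_energy` below are invented stand-ins, not code from raytracing.c.

```c
/* Non-recursive helper: safe to force-inline. */
static inline __attribute__((always_inline))
double attenuate(double energy)
{
    return energy * 0.5;
}

/* Recursive function: kept as an ordinary function. Marking it
 * always_inline would trigger the "recursive inlining" error on GCC. */
double bounce_energy(double energy, int depth)
{
    if (depth <= 0)
        return energy;
    return bounce_energy(attenuate(energy), depth - 1);
}
```

For example, `bounce_energy(8.0, 3)` halves the value three times and returns 1.0.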
Result
Execution time of raytracing() : 1.868269 sec
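Analysis
As expected, all the functions were inlined, but I was honestly surprised that the time still improved by 0.11 seconds!
According to the explanation in "When to use inline function and when not to use it?" of when inline should be used:
Apart from raytracing, which is called once, every other function is called millions of times, so they are all candidates for inlining.
Judging by the results, the functions in this assignment are still not "big enough", so inlining has not produced any side effects.
Comparing executable sizes:
2.0 59552
2.1 59664
2.2 failed
2.3 71248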
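For a more in-depth explanation, see To Inline or Not To Inline (still studying it).
Because variables were shared when passing arguments, the arguments were wrong, which in turn caused the data to be partitioned incorrectly. Something to watch out for in future!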
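The faulty code is not shown in these notes. One common way to hit this kind of bug is to hand every thread a pointer to the same argument struct, which is then overwritten before the thread reads it. A sketch of the safer pattern is to give each worker its own slot in an array; `struct worker_arg` and `partition_rows` are hypothetical names, and the even row split is an assumption about how the image is divided.

```c
/* One argument record per worker; never shared between workers. */
struct worker_arg {
    int row_begin; /* first row handled by this worker (inclusive) */
    int row_end;   /* one past the last row handled by this worker */
};

/* Split `height` rows into `nworkers` contiguous ranges, writing one
 * record per worker into args[0..nworkers-1]. Leftover rows go to the
 * first few workers so the ranges differ by at most one row. */
void partition_rows(int height, int nworkers, struct worker_arg *args)
{
    int base = height / nworkers;
    int extra = height % nworkers;
    int row = 0;

    for (int t = 0; t < nworkers; t++) {
        int rows = base + (t < extra ? 1 : 0);
        args[t].row_begin = row;
        args[t].row_end = row + rows;
        row += rows;
    }
}
```

Each thread would then receive `&args[t]` rather than the address of a single variable reused across iterations of the thread-creation loop.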
Result after the fix
Execution time of raytracing() : 1.030958 sec
This is the result of parallelizing the version with math-toolkit.h inlined, so the improvement is 1.997267 -> 1.030958, about 0.97 seconds.
Best-performing version so far
...
int factor = sqrt(SAMPLES);
# pragma omp parallel num_threads(thread_count) \
firstprivate(u, v, w, d, object_color, stk, factor)
{
# pragma omp for schedule(static,1)
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; i++) {
double r = 0, g = 0, b = 0;
/* MSAA */
...
}
}
Mind the variable scope {}
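As for the schedule(static,1) clause used above: with a static schedule, chunks of iterations are dealt out to the threads round-robin in thread order, so a chunk size of 1 interleaves the image rows across threads. The helper below only models that mapping for illustration; `static_schedule_owner` is a made-up name, not an OpenMP API.

```c
/* Which thread runs logical iteration `iter` under
 * schedule(static, chunk) with `nthreads` threads: iterations are grouped
 * into chunks, and chunks are assigned round-robin by thread number. */
int static_schedule_owner(int iter, int chunk, int nthreads)
{
    return (iter / chunk) % nthreads;
}
```

With 4 threads and chunk size 1, rows 0, 4, 8, ... go to thread 0 and rows 1, 5, 9, ... go to thread 1. Presumably this helps here because neighbouring rows cost about the same to render, so interleaving spreads the expensive regions of the image more evenly than giving each thread one contiguous band.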
Result
Execution time of raytracing() : 0.855305 sec
Analysis
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
93.06 1.87 1.87 891880 0.00 0.00 ray_color
6.97 2.01 0.14 1 0.14 2.01 raytracing
0.00 2.01 0.00 3 0.00 0.00 append_rectangular
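Only ray_color and raytracing show any time. raytracing has already been parallelized, so the next target is naturally to parallelize ray_color.
ray_color
This one felt like a real hassle: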
for (light_node light = lights; light; light = light->next) {
ray_hit_object: 68 %
rayRectangularIntersection: 40 %
raySphereIntersection: 20 %
compute_specular_diffuse: 12 %
pthread: uses 16 threads, the fastest setting
OpenMP: gave up on parallelizing ray_color
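Part of what makes the light loop in ray_color awkward is that `omp for` wants a counted loop, while the lights are walked through a linked list. One possible workaround, which these notes did not try, is to copy the node pointers into an array first. Everything below is a hypothetical sketch: `struct light_item`, `collect_lights`, `total_intensity`, and the cap of 64 nodes are invented for illustration and are not the types used in raytracing.c.

```c
#include <stddef.h>

/* Hypothetical stand-in for a linked list of lights. */
struct light_item {
    double intensity;
    struct light_item *next;
};

/* Copy up to `cap` node pointers into `out`; returns how many were stored. */
size_t collect_lights(struct light_item *head, struct light_item **out,
                      size_t cap)
{
    size_t n = 0;
    for (struct light_item *p = head; p && n < cap; p = p->next)
        out[n++] = p;
    return n;
}

/* Sum a per-light quantity with an index-based loop that OpenMP can split.
 * Without -fopenmp the pragma is ignored and the loop runs serially. */
double total_intensity(struct light_item *head)
{
    struct light_item *nodes[64]; /* fixed cap: an assumption of this sketch */
    size_t n = collect_lights(head, nodes, 64);
    double sum = 0.0;

#pragma omp parallel for reduction(+ : sum)
    for (long k = 0; k < (long)n; k++)
        sum += nodes[k]->intensity;

    return sum;
}
```

Whether this pays off depends on how many lights there are; with only a handful, the cost of starting a parallel region may well exceed the work saved.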
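The department now offers an undergraduate course called Parallel Programming. Luckily I took it; otherwise I would have had to look everything up from scratch for this assignment, and it would have gone very badly.
The charts alone show how powerful parallelization is: only a parallelized program brings out the real performance of a multi-core machine. This area will probably keep developing faster and faster, so it is worth getting familiar with it early.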
sysprog