owned this note
owned this note
Published
Linked with GitHub
# 2016q3 Homework1 (raytracing)
###### tags: `heathcliffYang`
contributed by <`heathcliffYang`>
# Implementation note
==I had better to read the Makefile first.==
`make check` will verify the picture.
`convert` can convert the ppm to png or else.
`make PROFILE=1` -> compile the program needed by **gprof** in Makefile.
`make clean` will remove `use-models.h`,`out.ppm` and `gmon.out`.
`gprof -b ./raytracing | less` -> 從上方顯示較少資訊,並提供翻頁,`-b`是不需要解釋名詞
`gprof ./raytracing | gprof2dot | dot -T png -o callgraph_omp1.png`
# Try & Problems
## 9/26 8 pm
### 1. orig execution time of raytracing()
6.232054 sec
### 2. Rerun for gprof to collect data
- gprof
```
heathcliff@heathcliff-MacBookPro:~/raytracing$ gprof -b ./raytracing
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
20.34 0.50 0.50 69646433 0.00 0.00 dot_product
16.68 0.91 0.41 56956357 0.00 0.00 subtract_vector
13.42 1.24 0.33 10598450 0.00 0.00 normalize
11.80 1.53 0.29 31410180 0.00 0.00 multiply_vector
9.36 1.76 0.23 13861875 0.00 0.00 rayRectangularIntersection
6.91 1.93 0.17 13861875 0.00 0.00 raySphereIntersection
4.47 2.04 0.11 4620625 0.00 0.00 ray_hit_object
3.46 2.13 0.09 17836094 0.00 0.00 add_vector
2.44 2.19 0.06 17821809 0.00 0.00 cross_product
2.03 2.24 0.05 1048576 0.00 0.00 ray_color
1.63 2.28 0.04 4221152 0.00 0.00 multiply_vectors
1.63 2.32 0.04 2110576 0.00 0.00 compute_specular_diffuse
1.22 2.35 0.03 2520791 0.00 0.00 idx_stack_top
1.22 2.38 0.03 2110576 0.00 0.00 localColor
0.81 2.40 0.02 1241598 0.00 0.00 refraction
```
- perf
```
Performance counter stats for './raytracing' (10 runs):
78,085 cache-misses # 19.826 % of all cache refs ( +- 11.39% )
393,846 cache-references ( +- 3.41% )
34,500,667,719 instructions # 1.75 insns per cycle ( +- 0.00% )
19,734,977,254 cycles ( +- 0.83% )
6.453942844 seconds time elapsed ( +- 0.82% )
```
## Loop unrolling - 9/27 7pm
loop unrolling 所有的
### 1.
```
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
16.57 0.27 0.27 13861875 0.00 0.00 rayRectangularIntersection
14.70 0.50 0.24 56956357 0.00 0.00 subtract_vector ***
13.13 0.71 0.21 69646433 0.00 0.00 dot_product ***
11.88 0.90 0.19 10598450 0.00 0.00 normalize ***
8.44 1.04 0.14 4620625 0.00 0.00 ray_hit_object
7.50 1.16 0.12 13861875 0.00 0.00 raySphereIntersection
6.88 1.27 0.11 17836094 0.00 0.00 add_vector ***
5.32 1.35 0.09 17821809 0.00 0.00 cross_product ***
3.13 1.40 0.05 3838091 0.00 0.00 length ***
2.19 1.44 0.04 31410180 0.00 0.00 multiply_vector ***
1.88 1.47 0.03 2110576 0.00 0.00 compute_specular_diffuse
1.88 1.50 0.03 1048576 0.00 0.00 ray_color
1.56 1.52 0.03 4221152 0.00 0.00 multiply_vectors***
0.94 1.54 0.02 113297 0.00 0.00 fresnel
0.94 1.55 0.02 2520791 0.00 0.00 idx_stack_top
0.63 1.56 0.01 2110576 0.00 0.00 localColor
0.63 1.57 0.01 1241598 0.00 0.00 protect_color_overflow
0.63 1.58 0.01 1241598 0.00 0.00 reflection
0.63 1.59 0.01 1 0.01 0.01 delete_sphere_list
0.63 1.60 0.01 1 0.01 1.59 raytracing
0.00 1.60 0.00 2558386 0.00 0.00 idx_stack_empty
0.00 1.60 0.00 1241598 0.00 0.00 refraction
0.00 1.60 0.00 1204003 0.00 0.00 idx_stack_push
0.00 1.60 0.00 1048576 0.00 0.00 idx_stack_init
0.00 1.60 0.00 1048576 0.00 0.00 rayConstruction
0.00 1.60 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.60 0.00 3 0.00 0.00 append_rectangular
0.00 1.60 0.00 3 0.00 0.00 append_sphere
0.00 1.60 0.00 2 0.00 0.00 append_light
0.00 1.60 0.00 1 0.00 0.00 calculateBasisVectors
0.00 1.60 0.00 1 0.00 0.00 delete_light_list
0.00 1.60 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.60 0.00 1 0.00 0.00 diff_in_second
0.00 1.60 0.00 1 0.00 0.00 write_to_ppm
```
### 2. perf
```
Performance counter stats for './raytracing' (10 runs):
59,245 cache-misses # 19.567 % of all cache refs ( +- 1.89% )
302,776 cache-references ( +- 7.82% )
25,746,948,831 instructions # 1.61 insns per cycle ( +- 0.00% )
15,999,672,351 cycles ( +- 0.43% )
5.210404034 seconds time elapsed ( +- 0.56% )
```
## OpenMP - 針對迴圈平行化 9/27 9pm
### 1. Code
```c
void normalize(double *v)
{
double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
assert(d != 0.0 && "Error calculating normal");
#pragma omp parallel for
for (a=0; a < 3; a++) {
v[a]/=d;
}
}
```
### 2. 遇到的問題
嘗試把所有math-toolkit.h裡的函式可以平行化的迴圈都平行化了,但是反而更慢更久;
之後換成只改其中一個函式,也跑得很久,這兩個都沒等到跑完。
## 強迫改變函式屬性為always_inline - 9/27 11 pm
all
### 1. Code
```c
static __attribute__((always_inline))
void normalize(double *v)
{
double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
assert(d != 0.0 && "Error calculating normal");
v[0] /= d;
v[1] /= d;
v[2] /= d;
}
```
### 2. gprof
```
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
30.20 0.47 0.47 13861875 0.00 0.00 rayRectangularIntersection
15.91 0.71 0.25 13861875 0.00 0.00 raySphereIntersection
14.29 0.93 0.22 2110576 0.00 0.00 compute_specular_diffuse
9.74 1.08 0.15 1048576 0.00 0.00 ray_color
9.09 1.22 0.14 2110576 0.00 0.00 localColor
6.82 1.33 0.11 4620625 0.00 0.00 ray_hit_object
5.85 1.42 0.09 1048576 0.00 0.00 rayConstruction
2.60 1.46 0.04 1 0.04 1.54 raytracing
1.30 1.48 0.02 1241598 0.00 0.00 reflection
1.30 1.50 0.02 1241598 0.00 0.00 refraction
0.97 1.51 0.02 113297 0.00 0.00 fresnel
0.65 1.52 0.01 2558386 0.00 0.00 idx_stack_empty
0.65 1.53 0.01 2520791 0.00 0.00 idx_stack_top
0.65 1.54 0.01 1241598 0.00 0.00 protect_color_overflow
0.00 1.54 0.00 1204003 0.00 0.00 idx_stack_push
0.00 1.54 0.00 1048576 0.00 0.00 idx_stack_init
0.00 1.54 0.00 37595 0.00 0.00 idx_stack_pop
0.00 1.54 0.00 3 0.00 0.00 append_rectangular
0.00 1.54 0.00 3 0.00 0.00 append_sphere
0.00 1.54 0.00 2 0.00 0.00 append_light
0.00 1.54 0.00 1 0.00 0.00 calculateBasisVectors
0.00 1.54 0.00 1 0.00 0.00 delete_light_list
0.00 1.54 0.00 1 0.00 0.00 delete_rectangular_list
0.00 1.54 0.00 1 0.00 0.00 delete_sphere_list
0.00 1.54 0.00 1 0.00 0.00 diff_in_second
0.00 1.54 0.00 1 0.00 0.00 write_to_ppm
```
### 3. perf stat
```
Performance counter stats for './raytracing' (10 runs):
53,970 cache-misses # 19.476 % of all cache refs ( +- 2.84% )
277,108 cache-references ( +- 6.25% )
13,144,659,166 instructions # 1.70 insns per cycle ( +- 0.00% )
7,750,237,141 cycles ( +- 0.74% )
2.540274253 seconds time elapsed ( +- 1.00% )
```
>感謝hugikun999的提醒
## 第二次嘗試 OpenMP + inline - 9/28 1am
### 1. 發現問題
加了inline之後,又試著在math-toolkit.h裡用OpenMP,但時間依然很長(兩百多秒);後來想到,每個函式被叫用那麼多次,但是每個都是迴圈短、運算比較簡單的時候,就不需要每個簡單的運算都用一個thread去執行。
### 2. [call graph](http://www.openfoundry.org/tw/tech-column/8352-callgraphviz-cscopegraphviz-xdot-call-graph-visualizer)
- Loop unrolling + inline (尚未加上omp)
![](https://i.imgur.com/sXaGSdl.png)
- `ray_color()` & `rayConstruction()`
### 3. 嘗試修改 raytracing()
```c
/* @param background_color this is not ambient light */
void raytracing(uint8_t *pixels, color background_color,
rectangular_node rectangulars, sphere_node spheres,
light_node lights, const viewpoint *view,
int width, int height)
{
point3 u, v, w, d;
color object_color = { 0.0, 0.0, 0.0 };
/* calculate u, v, w */
calculateBasisVectors(u, v, w, view);
idx_stack stk;
int factor = sqrt(SAMPLES);
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; i++) {
double r = 0, g = 0, b = 0;
/* MSAA */
#pragma omp parallel for firstprivate(r,g,b)
for (int s = 0; s < SAMPLES; s++) {
idx_stack_init(&stk);
rayConstruction(d, u, v, w,
i * factor + s / factor,
j * factor + s % factor,
view,
width * factor, height * factor);
if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
lights, object_color,
MAX_REFLECTION_BOUNCES)) {
r += object_color[0];
g += object_color[1];
b += object_color[2];
} else {
r += background_color[0];
g += background_color[1];
b += background_color[2];
}
pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
}
}
}
}
```
**speed up but check fail**
```
Performance counter stats for './raytracing' (10 runs):
67,856 cache-misses # 0.188 % of all cache refs ( +- 2.44% )
36,086,782 cache-references ( +- 2.19% )
16,662,901,112 instructions # 0.91 insns per cycle ( +- 5.07% )
18,386,293,743 cycles ( +- 4.35% )
1.601089390 seconds time elapsed ( +- 4.40% )
```
-> data dependent.
### 4. loop unrolling + inline + omp
![](https://i.imgur.com/bG6EV90.png)
****
## Loop unrolling raytracing() - 9/28 3pm
### 1. Loop unrolling 第三層迴圈
- 在嘗試解決加入OpenMP前後相依造成的錯誤當中,觀察到SAMPLE只有4,又是同時存取r,g,b三個值,故將他的迴圈解開。之後,分別兩個嘗試
- 1)把OpenMP留著 -> 速度快但check fail
```
Performance counter stats for './raytracing' (10 runs):
207,759 cache-misses # 0.276 % of all cache refs ( +- 2.99% )
75,389,334 cache-references ( +- 3.65% )
12,420,744,790 instructions # 0.72 insns per cycle ( +- 0.15% )
17,284,430,642 cycles ( +- 0.58% )
1.931539834 seconds time elapsed ( +- 1.13% )
```
- 2)移除OpenMP -> 資料正確
```
Performance counter stats for './raytracing' (10 runs):
248,194 cache-misses # 35.564 % of all cache refs ( +- 3.97% )
697,873 cache-references ( +- 9.18% )
13,320,768,291 instructions # 1.69 insns per cycle ( +- 0.00% )
7,859,385,254 cycles ( +- 0.47% )
2.606176699 seconds time elapsed ( +- 0.38% )
```
### 2. call graph 檢查變數是否可能被多個thread存取
![](https://i.imgur.com/uBARYan.png)
## Omp success - 9/28 7pm
### 1. 成功把raytracing用omp平行,check成功
在Unroll 第三層迴圈之下,找出dependent的參數d, stk, object_color
### 2. perf stat
```
Performance counter stats for './raytracing' (10 runs):
74,722 cache-misses # 0.097 % of all cache refs ( +- 3.33% )
77,274,246 cache-references ( +- 5.31% )
12,810,788,231 instructions # 0.76 insns per cycle ( +- 0.11% )
16,853,158,191 cycles ( +- 0.42% )
2.026637683 seconds time elapsed ( +- 0.80% )
```
### 3. call gpraph
![](https://i.imgur.com/R7o5Sby.png)
### 4. 與沒有inline dot_product比較 perf stat
```
Performance counter stats for './raytracing' (10 runs):
102,166 cache-misses # 0.059 % of all cache refs ( +- 8.70% )
173,186,250 cache-references ( +- 5.89% )
63,646,408 branch-misses ( +- 1.08% )
15,806,263,440 instructions # 0.50 insns per cycle ( +- 0.26% )
31,412,416,665 cycles ( +- 1.00% )
3.722181320 seconds time elapsed ( +- 1.97% )
```
## num_threads()的嘗試 9/28 - 11pm
### 1. 測試分配不同數量的threads對效能的影響
- 500
```
Performance counter stats for './raytracing' (10 runs):
586,040 cache-misses # 1.565 % of all cache refs ( +- 1.72% )
37,444,312 cache-references ( +- 0.24% )
12,194,796,576 instructions # 0.85 insns per cycle ( +- 0.01% )
14,430,824,154 cycles ( +- 0.12% )
1.266002949 seconds time elapsed ( +- 0.29% )
```
- 450
```
Performance counter stats for './raytracing' (10 runs):
571,745 cache-misses # 1.523 % of all cache refs ( +- 1.76% )
37,550,691 cache-references ( +- 0.42% )
12,194,710,719 instructions # 0.84 insns per cycle ( +- 0.01% )
14,506,164,926 cycles ( +- 0.38% )
1.275681361 seconds time elapsed ( +- 0.54% )
```
- 256
```
Performance counter stats for './raytracing' (10 runs):
400,710 cache-misses # 1.071 % of all cache refs ( +- 5.59% )
37,425,937 cache-references ( +- 0.36% )
12,182,627,168 instructions # 0.84 insns per cycle ( +- 0.01% )
14,428,624,839 cycles ( +- 0.21% )
1.277496560 seconds time elapsed ( +- 0.50% )
```
- 12
```
Performance counter stats for './raytracing' (10 runs):
93,895 cache-misses # 0.167 % of all cache refs ( +- 11.93% )
56,274,093 cache-references ( +- 1.03% )
12,322,845,231 instructions # 0.74 insns per cycle ( +- 0.02% )
16,560,790,058 cycles ( +- 0.33% )
1.497703779 seconds time elapsed ( +- 0.28% )
```
## POSIX Thread
## software pipelining
##
# 分析與效能優化策略
## 觀察
### 1. math-toolkit.h
- 根據gprof的結果,發現主要耗時的是math-toolkit.h的4個函式,因此將會針對這4個函式進行優化
1. Loop unrolling `dot_product()` & `subtract_vector()` : 減少不必要的跳出迴圈判斷
- 發現`multiply_vector()`的執行時間大於`multiply_vectors()`
- 運算式多為vector的運算,單個函式呼叫中資料彼此之前沒有關係 -> ~~用 OpenMP 把迴圈的運算平行化~~
- `inline`是建議compiler函式的屬性,內嵌,與編譯的方法相關
### ~~2. main.c~~
### 3. raytracing.c
- 把所有raytracing()呼叫到的函式檢查一遍是否有共用存取
1. idx_stack.h ->
**把idx_stack_init inline** **(not yet)**
2. ray_color(const point3 e, double t,
const point3 d,
==idx_stack *stk==,
const rectangular_node rectangulars,
const sphere_node spheres,
const light_node lights,
color object_color, int bounces_left)
add_vector(object_color, refraction_part, object_color);
3. rayConstruction(d, u, v, w, i * factor + 0 / factor, j * factor + 0 % factor, view, width * factor, height * factor);
subtract_vector(s, view->vrp, d);
normalize(d);
4. ray_hit_object(const point3 e, const point3 d, double t0, double t1, const rectangular_node rectangulars, rectangular_node *hit_rectangular, const sphere_node spheres, sphere_node *hit_sphere)
*hit_sphere = sphere;
*hit_rectangular = NULL;
5. reflection(r, d, op.normal);
6. refraction(rr, d, ip.normal, idx, idx_pass);
# 函式的屬性 - Attributes of Functions
`__attribute__()`可以在宣告一個函式時,標明函式的屬性,用法是:
`void f () __attribute__ (attributeName)` ==-->這個可能參考資料寫錯了,因為compile時有出現錯誤訊息:==
`attributes should be specified before the declarator in a function definition
static`
## 種類
- `always_inline` : 標明函式是要用內嵌,也就是不使用呼叫、跳去記憶體位置執行函式,而是在編譯時將函式在呼叫點展開。 (`inline`:建議,compiler不一定會照做,擺法也不太一樣,`inline`位在宣告時的函式名稱最前面)
[參考資料1](http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Function-Attributes.html)
[參考資料2](https://dotblogs.com.tw/v6610688/2013/11/27/introduction_inline_function)
# [Gprof](https://sourceware.org/binutils/docs/gprof/)
print the execution time of each function in the process.
1. `-pg` --> this option represents for profiling when compiling. Besides, it must be a compilation option as well as a link option.
2. `-Q` --> ==switch to suppress the printing of the call graph data.==
3. Run the program to generate the data gprof needs.
`gprof options [exe] [profile-data-file] > outfile`
**If you omit the executable file name, the file a.out is used. If you give no profile data file name, the file gmon.out is used.**
4.
**(Not yet)**:To link the program for profiling, if you use a compiler such as cc to do the linking, simply specify `-pg' in addition to your usual options.
# gprof2dot
#### SIMD (Single instruction, multiple data)
# AVX (Advanced Vector Extensions)
將大部分的整數運算指令擴展到 256 bits
## Register
- 16 個 YMM registers.
- 8 個 32-bit 的單浮點數或 4 個 64-bit 的雙浮點數
- XMM0 – XMM7 to YMM0 – YMM7 (in x86-64 mode, YMM0 – YMM15)
[參考資料](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
# SSE (Streaming SIMD Extensions)
## C programming
`sqrt()` calculates the square root.
> hugikun999 : `apt install python-pip` -> `pip install gprof2dot`