# 2016q3 Homework1 (raytracing)
contributed by <`beieve7028`>
---
# 預期目標
* 在 GitHub 上 fork [raytracing](https://github.com/sysprog21/raytracing),並思考如何改善程式效能
* 以 `make PROFILE=1` 重新編譯程式碼,並且學習 [gprof](https://sourceware.org/binutils/docs/gprof/)
* 以 [gprof](https://sourceware.org/binutils/docs/gprof/) 指出效能瓶頸,並且著手改寫檔案 `math-toolkit.h` 在內的函式實做
* 可善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作
---
# 實驗環境
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Stepping: 3
CPU MHz: 3491.640
BogoMIPS: 6815.99
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
```
---
# gprof分析(baseline)
執行時間:
```
$ git checkout master && make clean && make && ./raytracing
Execution time of raytracing() : 2.168869 sec
```
gprof分析:
```
$ git checkout master && make clean && make PROFILE=1 && ./raytracing && gprof ./raytracing | less
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
30.02 0.72 0.72 69646433 0.00 0.00 dot_product
15.01 1.08 0.36 56956357 0.00 0.00 subtract_vector
10.84 1.34 0.26 13861875 0.00 0.00 rayRectangularIntersection
8.13 1.54 0.20 31410180 0.00 0.00 multiply_vector
7.09 1.71 0.17 13861875 0.00 0.00 raySphereIntersection
4.59 1.82 0.11 17836094 0.00 0.00 add_vector
4.17 1.92 0.10 4620625 0.00 0.00 ray_hit_object
3.96 2.01 0.10 17821809 0.00 0.00 cross_product
3.75 2.10 0.09 10598450 0.00 0.00 normalize
2.08 2.15 0.05 2110576 0.00 0.00 compute_specular_diffuse
1.67 2.19 0.04 4221152 0.00 0.00 multiply_vectors
1.67 2.23 0.04 2110576 0.00 0.00 localColor
1.25 2.26 0.03 1048576 0.00 0.00 ray_color
1.25 2.29 0.03 1 0.03 2.39 raytracing
0.83 2.31 0.02 2558386 0.00 0.00 idx_stack_empty
0.83 2.33 0.02 1241598 0.00 0.00 refraction
0.83 2.35 0.02 1204003 0.00 0.00 idx_stack_push
0.42 2.36 0.01 3838091 0.00 0.00 length
```
可以見到前幾名佔了很大的比例,因此去查看函式內容
math-toolkit.h:
```clike=
static inline
void add_vector(const double *a, const double *b, double *out)
{
for (int i = 0; i < 3; i++)
out[i] = a[i] + b[i];
}
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
for (int i = 0; i < 3; i++)
out[i] = a[i] - b[i];
}
static inline
void multiply_vectors(const double *a, const double *b, double *out)
{
for (int i = 0; i < 3; i++)
out[i] = a[i] * b[i];
}
static inline
void multiply_vector(const double *a, double b, double *out)
{
for (int i = 0; i < 3; i++)
out[i] = a[i] * b;
}
static inline
double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
for (int i = 0; i < 3; i++)
dp += v1[i] * v2[i];
return dp;
}
```
可以見到都是迴圈的計算,於是採用[解說](https://www.youtube.com/watch?v=V_rLyzecuaE)所提到的[loop unrolling](https://zh.wikipedia.org/wiki/%E5%BE%AA%E7%8E%AF%E5%B1%95%E5%BC%80)技巧進行優化:
math-toolkit.h
```clike=
static inline
void add_vector(const double *a, const double *b, double *out)
{
out[0] = a[0] + b[0];
out[1] = a[1] + b[1];
out[2] = a[2] + b[2];
}
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
out[0] = a[0] - b[0];
out[1] = a[1] - b[1];
out[2] = a[2] - b[2];
}
static inline
void multiply_vectors(const double *a, const double *b, double *out)
{
out[0] = a[0] * b[0];
out[1] = a[1] * b[1];
out[2] = a[2] * b[2];
}
static inline
void multiply_vector(const double *a, double b, double *out)
{
out[0] = a[0] * b;
out[1] = a[1] * b;
out[2] = a[2] * b;
}
static inline
double dot_product(const double *v1, const double *v2)
{
return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}
```
經過優化後的執行時間有明顯改善,節省了30%的計算時間。
```
$ git checkout loop-unrooling && make clean && make && ./raytracing
Execution time of raytracing() : 1.516588 sec
```
這時再以gprof分析:
```
$ git checkout loop-unrooling && make clean && make PROFILE=1 && ./raytracing && gprof ./raytracing | less
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
20.01 0.33 0.33 69646433 0.00 0.00 dot_product
15.16 0.58 0.25 56956357 0.00 0.00 subtract_vector
10.31 0.75 0.17 13861875 0.00 0.00 raySphereIntersection
10.01 0.92 0.17 13861875 0.00 0.00 rayRectangularIntersection
9.10 1.07 0.15 10598450 0.00 0.00 normalize
6.67 1.18 0.11 17821809 0.00 0.00 cross_product
4.85 1.26 0.08 31410180 0.00 0.00 multiply_vector
4.25 1.33 0.07 4620625 0.00 0.00 ray_hit_object
3.34 1.38 0.06 2110576 0.00 0.00 localColor
3.03 1.43 0.05 17836094 0.00 0.00 add_vector
3.03 1.48 0.05 3838091 0.00 0.00 length
3.03 1.53 0.05 1241598 0.00 0.00 refraction
2.43 1.57 0.04 4221152 0.00 0.00 multiply_vectors
1.82 1.60 0.03 2110576 0.00 0.00 compute_specular_diffuse
1.21 1.62 0.02 2520791 0.00 0.00 idx_stack_top
0.61 1.63 0.01 2558386 0.00 0.00 idx_stack_empty
0.61 1.64 0.01 1048576 0.00 0.00 ray_color
0.61 1.65 0.01 1 0.01 1.65 raytracing
```
也會發現計算所耗費開銷的排名也有變動。
---
# Inline
在[影片](https://www.youtube.com/watch?v=V_rLyzecuaE)中有提到編譯器的inline技巧,而在[gcc](https://gcc.gnu.org/onlinedocs/gcc/Inline.html)當中有提到
`GCC does not inline any functions when not optimizing unless you specify the ‘always_inline’ attribute for the function.`
由於在Makefile裡面我們在編譯指定最佳化的選項為:
```
CFLAGS = \
-std=gnu99 -Wall -O0 -g
```
表示不啟用編譯器的最佳化操作,因此在程式碼當中inline建議會被忽略:
```
static inline
```
這時就要使用
```
__attribute__((always_inline))
```
來表示要啟用inline,可以見到使用了inline技巧後執行的時間也有改善了10%:
```
$ git checkout inline && make clean && make && ./raytracing
Execution time of raytracing() : 1.967007 sec
```
---
# OpenMP
上網有找到[這篇](http://www.syscom.com.tw/ePaper_Content_EPArticledetail.aspx?id=29&EPID=151&j=6&HeaderName=%E7%A0%94%E7%99%BC%E6%96%B0%E8%A6%96%E7%95%8C)把openMP的用法整理蠻清楚的,當中也有提到OpenMP用法某些情況會導致的錯誤,例如多个執行序寫同一個變數時,[這篇](https://hackmd.io/AwFgjArAHAbDYFoBMwzASAhgZhAqAnGAQtpmDAKYQzZgDsMwQA==?view)有提到使用`private(var)`的指令讓變數互有獨立副本以免發生不預期的錯誤(race conditions),而在[這篇](https://kheresy.wordpress.com/2006/09/22/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E4%BA%94%EF%BC%89-%E8%AE%8A%E6%95%B8%E7%9A%84%E5%B9%B3%E8%A1%8C%E5%8C%96/)當中有提到也可使用`#pragma omp atomic`的指示來避免變個被多個執行同時寫入,或是指示`#pragma omp for reduction( +:sum)`這種各自運算最後才整合的方式。
----
基於上面的說明,我決定在`math-toolkit.h`這個標頭檔裡面使用`#pragma omp for reduction( +:sum)`為基礎的平行指示,不過在該標頭檔當中只有一個方法會有race conditions的問題:
```clike=
static inline
double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
for (int i = 0; i < 3; i++)
dp += v1[i] * v2[i];
return dp;
}
```
```clike=
static inline
double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
#pragma omp parallel for num_threads(3) reduction( +:dp)
for (int i = 0; i < 3; i++)
dp += v1[i] * v2[i];
return dp;
}
```
出了一點問題:
```
math-toolkit.h: In function ‘dot_product’:
math-toolkit.h:70:0: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
#pragma omp parallel for num_threads(3) reduction( +:dp)
```
於是在Makefile裡修改為:
```shell=
CFLAGS = \
-std=gnu99 -Wall -O0 -g -fopenmp
LDFLAGS = \
-lm -fopenmp
```
就可以順利執行了,不過在執行後會發現執行的時間反而增加,推測是以下的cod所導致,為了使用平行計算而增加的成本不足以抵銷平行計算所帶來的效益,如以下的code而言,我指示openmp分為3個threads去計算,在執行時間則從baseline的`2.168869 sec`增加到`104.624172 sec`。
num_threads=3:`Execution time of raytracing() : 104.624172 sec`
```clike=
static inline
void add_vector(const double *a, const double *b, double *out)
{
#pragma omp parallel for num_threads(3)
for (int i = 0; i < 3; i++)
out[i] = a[i] + b[i];
}
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
#pragma omp parallel for num_threads(3)
for (int i = 0; i < 3; i++)
out[i] = a[i] - b[i];
}
static inline
void multiply_vectors(const double *a, const double *b, double *out)
{
#pragma omp parallel for num_threads(3)
for (int i = 0; i < 3; i++)
out[i] = a[i] * b[i];
}
static inline
void multiply_vector(const double *a, double b, double *out)
{
#pragma omp parallel for num_threads(3)
for (int i = 0; i < 3; i++)
out[i] = a[i] * b;
}
static inline
double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
#pragma omp parallel for num_threads(3) reduction(+:dp)
for (int i = 0; i < 3; i++)
dp += v1[i] * v2[i];
return dp;
}
```
num_threads=1:`Execution time of raytracing() : 30.393997 sec`
```clike=
static inline
void add_vector(const double *a, const double *b, double *out)
{
#pragma omp parallel for num_threads(1)
for (int i = 0; i < 3; i++)
out[i] = a[i] + b[i];
}
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
#pragma omp parallel for num_threads(1)
for (int i = 0; i < 3; i++)
out[i] = a[i] - b[i];
}
static inline
void multiply_vectors(const double *a, const double *b, double *out)
{
#pragma omp parallel for num_threads(1)
for (int i = 0; i < 3; i++)
out[i] = a[i] * b[i];
}
static inline
void multiply_vector(const double *a, double b, double *out)
{
#pragma omp parallel for num_threads(1)
for (int i = 0; i < 3; i++)
out[i] = a[i] * b;
}
static inline
double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
#pragma omp parallel for num_threads(1) reduction(+:dp)
for (int i = 0; i < 3; i++)
dp += v1[i] * v2[i];
return dp;
}
```
因此現在要平行計算的目標必須放置計算較大的單元區塊,於是從`dot_product`搜尋呼叫函式,在raytracing.c檔案內的raytracing函式發現一個可以嘗試優化的地方,於是將該處程式增加openMP的平行計算指令,並且以private對平行區塊外的變數進行副本拷貝,以免發生race conditions的錯誤,經過計算後時間縮至`0.625264 sec`,不過相較於採用`-Ofast`進行編譯所得的`0.407022 sec`還有一段差距。
```clike=
#pragma omp parallel for private(stk), private(d) private(object_color)
for (int j = 0; j < height; j++) {
for (int i = 0; i < width; i++) {
double r = 0, g = 0, b = 0;
/* MSAA */
for (int s = 0; s < SAMPLES; s++) {
idx_stack_init(&stk);
rayConstruction(d, u, v, w,
i * factor + s / factor,
j * factor + s % factor,
view,
width * factor, height * factor);
if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
lights, object_color,
MAX_REFLECTION_BOUNCES)) {
r += object_color[0];
g += object_color[1];
b += object_color[2];
} else {
r += background_color[0];
g += background_color[1];
b += background_color[2];
}
pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
}
}
}
```
若是以openMP加上`__attribute__((always_inline))`
指示,則能達到`0.413672 sec`,與編譯最佳化的`-Ofast`計算所得的`0.407022 sec`相當接近。
---
# 參考資料
* [A02: raytracing - HackMD](https://hackmd.io/s/B1W5AWza)