# 2016q3 Homework1 (raytracing)
contributed by <`kevinbird61`>
## 規劃
- [ ] 初步跑過程式,檢視情況
- [ ] 先逐步測試已有的加速方案
- [ ] 檢討該方案的優劣
- [ ] 使用自己的加速方案
## 初步嘗試
- 沒有用gprof的版本
```XML
kevin@kevin-X450JF:[~/workspace/raytracing]$ make PROFILE=0
cc -std=gnu99 -Wall -O0 -g -c -o objects.o objects.c
cc -std=gnu99 -Wall -O0 -g -c -o raytracing.o raytracing.c
cc -std=gnu99 -Wall -O0 -g -c -o main.o main.c
cc -o raytracing objects.o raytracing.o main.o -lm
kevin@kevin-X450JF:[~/workspace/raytracing]$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 2.625426 sec
```
- 使用gprof的版本
```XML
kevin@kevin-X450JF:[~/workspace/raytracing]$ make PROFILE=1
cc -std=gnu99 -Wall -O0 -g -pg -c -o objects.o objects.c
cc -std=gnu99 -Wall -O0 -g -pg -c -o raytracing.o raytracing.c
cc -std=gnu99 -Wall -O0 -g -pg -c -o main.o main.c
cc -o raytracing objects.o raytracing.o main.o -lm -pg
kevin@kevin-X450JF:[~/workspace/raytracing]$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 5.403366 sec
```
- 執行`gprof ./raytracing | less `
```XML
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
22.61 0.47 0.47 69646433 0.00 0.00 dot_product
17.80 0.84 0.37 56956357 0.00 0.00 subtract_vector
9.62 1.04 0.20 31410180 0.00 0.00 multiply_vector
8.66 1.22 0.18 13861875 0.00 0.00 raySphereIntersection
8.18 1.39 0.17 13861875 0.00 0.00 rayRectangularIntersection
7.70 1.55 0.16 10598450 0.00 0.00 normalize
7.46 1.71 0.16 17836094 0.00 0.00 add_vector
6.73 1.85 0.14 4620625 0.00 0.00 ray_hit_object
...
```
> 可以看到呼叫dot_product次數很多,從這邊改起
> [name= kevinbird61 ] [time=Sun, Jun 28, 2015 9:59 PM] [color=#907bf7]
### Loop unrolling 解決
```XML
static inline
double dot_product(const double *v1, const double *v2)
{
double dp = 0.0;
dp = dp + (v1[0]*v2[0] + v1[1]*v2[1] + v1[2]*v2[2]);
return dp;
}
```
- 把原本的for loop打開,再次執行(with gprof)
```XML
# Rendering scene
Done!
Execution time of raytracing() : 4.941005 sec
```
- 執行時間從原本`5.403366 sec`降到`4.941005 sec`
- 再來看看`gprof ./raytracing | less`:
```XML
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
20.27 0.31 0.31 56956357 0.00 0.00 subtract_vector
13.08 0.51 0.20 31410180 0.00 0.00 multiply_vector
11.77 0.69 0.18 69646433 0.00 0.00 dot_product
10.46 0.85 0.16 10598450 0.00 0.00 normalize
8.50 0.98 0.13 13861875 0.00 0.00 raySphereIntersection
7.19 1.09 0.11 17836094 0.00 0.00 add_vector
...
```
- 可以看到,所佔用的時間比例從原本22%降到12%左右
- Loop unrolling,可以減少branch數量
- 減少branch的使用
- 利用disassemble來看組合語言
- 沒有Loop unrolling版本:
```XML=997
0000000000401375 <dot_product>:
static inline
double dot_product(const double *v1, const double *v2)
{
401375: 55 push %rbp
401376: 48 89 e5 mov %rsp,%rbp
401379: 48 89 7d e8 mov %rdi,-0x18(%rbp)
40137d: 48 89 75 e0 mov %rsi,-0x20(%rbp)
double dp = 0.0;
401381: 66 0f ef c0 pxor %xmm0,%xmm0
401385: f2 0f 11 45 f8 movsd %xmm0,-0x8(%rbp)
/*dp = dp + (v1[0]*v2[0] + v1[1]*v2[1] + v1[2]*v2[2]);*/
for (int i = 0; i < 3; i++)
40138a: c7 45 f4 00 00 00 00 movl $0x0,-0xc(%rbp)
401391: eb 46 jmp 4013d9 <dot_product+0x64>
dp += v1[i] * v2[i];
401393: 8b 45 f4 mov -0xc(%rbp),%eax
...
```
- 使用Loop unrolling後:
```XML=997
0000000000401375 <dot_product>:
static inline
double dot_product(const double *v1, const double *v2)
{
401375: 55 push %rbp
401376: 48 89 e5 mov %rsp,%rbp
401379: 48 89 7d e8 mov %rdi,-0x18(%rbp)
40137d: 48 89 75 e0 mov %rsi,-0x20(%rbp)
double dp = 0.0;
401381: 66 0f ef c0 pxor %xmm0,%xmm0
401385: f2 0f 11 45 f8 movsd %xmm0,-0x8(%rbp)
dp = dp + (v1[0]*v2[0] + v1[1]*v2[1] + v1[2]*v2[2]);
40138a: 48 8b 45 e8 mov -0x18(%rbp),%rax
40138e: f2 0f 10 08 movsd (%rax),%xmm1
401392: 48 8b 45 e0 mov -0x20(%rbp),%rax
401396: f2 0f 10 00 movsd (%rax),%xmm0
40139a: f2 0f 59 c8 mulsd %xmm0,%xmm1
40139e: 48 8b 45 e8 mov -0x18(%rbp),%rax
4013a2: 48 83 c0 08 add $0x8,%rax
4013a6: f2 0f 10 10 movsd (%rax),%xmm2
4013aa: 48 8b 45 e0 mov -0x20(%rbp),%rax
4013ae: 48 83 c0 08 add $0x8,%rax
4013b2: f2 0f 10 00 movsd (%rax),%xmm0
4013b6: f2 0f 59 c2 mulsd %xmm2,%xmm0
4013ba: f2 0f 58 c8 addsd %xmm0,%xmm1
4013be: 48 8b 45 e8 mov -0x18(%rbp),%rax
4013c2: 48 83 c0 10 add $0x10,%rax
4013c6: f2 0f 10 10 movsd (%rax),%xmm2
...
```
=> 就運算上,加減次數並不會減少,省下的每次呼叫dot_product時產生的jmp
=> 觀察產生的assembly大小:
```XML
kevin@kevin-QX-350-Series:[~/workspace/raytracing]$ ls -l
總計 1276
-rw-rw-r-- 1 kevin kevin 156 9月 27 11:38 AUTHORS
-rw-rw-r-- 1 kevin kevin 786447 9月 27 11:38 baseline.ppm
-rw-rw-r-- 1 kevin kevin 167972 9月 27 12:37 disassembly.dump
-rw-rw-r-- 1 kevin kevin 167992 9月 27 12:34 disassembly_opt.dump
...
```
- 更正,仍為原本loop的版本比較小
### 使用force inline版本
(新電腦環境下的執行情形,已加上先前loop unrolling):
```XML
kevin@kevin-QX-350-Series:[~/workspace/raytracing]$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 2.467720 sec
```
(加上`__attribute__((always_inline))`後):
```XML
kevin@kevin-QX-350-Series:[~/workspace/raytracing]$ ./raytracing
# Rendering scene
Done!
Execution time of raytracing() : 2.397503 sec
```
減少0.070217秒的執行時間
- 察看強制Inline後的結果
```XML
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
40.25 0.87 0.87 13861875 0.00 0.00 rayRectangularIntersection
16.05 1.21 0.35 13861875 0.00 0.00 raySphereIntersection
15.35 1.54 0.33 2110576 0.00 0.00 compute_specular_diffuse
6.98 1.69 0.15 2110576 0.00 0.00 localColor
6.51 1.83 0.14 1048576 0.00 0.00 ray_color
6.51 1.97 0.14 4620625 0.00 0.00 ray_hit_object
2.79 2.03 0.06 1048576 0.00 0.00 rayConstruction
1.86 2.07 0.04 1 0.04 2.15 raytracing
1.40 2.10 0.03 1241598 0.00 0.00 reflection
...
```
- 果真都被展開,沒辦法被追蹤了
###
## Reference
- [Enhance raytracing program](https://embedded2016.hackpad.com/Enhance-raytracing-program-f5CCUGMQ4Kp)