# 2016q3 Homework 2 ( raytracing )
contributed by <`ierosodin`>
reviewed by <`janetwei`>
- 可以使用gnuplot之類的工具繪製出效能比較圖表,顯示優化前後差別
- 可以再嘗試其他優化方法,例如 SIMD
## 開發環境
作業系統 : CentOS 7
`$ lscpu`
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Genuine Intel(R) CPU @ 3.30GHz
Stepping: 5
CPU MHz: 1277.976
BogoMIPS: 6600.19
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11
## 軟體安裝
```
$ yum install graphviz
$ yum install ImageMagick
$ git clone https://github.com/sysprog21/raytracing
$ cd raytracing
$ make
$ ./raytracing
```
初次執行raytracing
`Execution time of raytracing() : 3.097945 sec`
## 使用gprof分析
```
$ make PROFILE=1
$ ./raytracing
```
使用gprof時, 執行時間會變長
(gprof使gcc 在每個函数中都加入了一個mcount, 也就是說每個函數都會調用mcount, 增加執行時間)
`Execution time of raytracing() : 5.379273 sec`
## Using gprof2dot in Centos
gprof2dot能將gprof或perf的分析結果轉成 dot 格式, 裡面會描述各個節點間的關係
```
$ git clone https://github.com/jrfonseca/gprof2dot
$ gprof raytracing| ../gprof2dot/gprof2dot.py | dot -Tpng -o dot.png
```
![](https://i.imgur.com/8il9B1s.png)
### 開始分析
`$ gprof -b raytracing gmon.out | less`
```
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
22.87 0.72 0.72 69646433 0.00 0.00 dot_product
19.85 1.35 0.63 56956357 0.00 0.00 subtract_vector
9.21 1.64 0.29 13861875 0.00 0.00 raySphereIntersection
8.26 1.90 0.26 10598450 0.00 0.00 normalize
7.94 2.15 0.25 13861875 0.00 0.00 rayRectangularIntersection
7.15 2.37 0.23 31410180 0.00 0.00 multiply_vector
6.04 2.56 0.19 17821809 0.00 0.00 cross_product
5.56 2.74 0.18 17836094 0.00 0.00 add_vector
3.18 2.84 0.10 4620625 0.00 0.00 ray_hit_object
```
從gprof調用表可以發現, dot_product與subtract_vector被呼叫次數高, 佔用了許多時間, 嘗試針對這兩項進行優化
## 嘗試一( OpenMP )
針對dot_product()的for迴圈進行openmp
```
#include <omp.h>
#pragma omp for
for (i = 0; i < 3; i++)
dp += v1[i] * v2[i];
```
結果發現執行時間變長了!
```
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
36.36 1.33 1.33 69646433 0.00 0.00 dot_product
16.13 1.92 0.59 56956357 0.00 0.00 subtract_vector
11.21 2.33 0.41 13861875 0.00 0.00 rayRectangularIntersection
6.29 2.56 0.23 31410180 0.00 0.00 multiply_vector
5.19 2.75 0.19 10598450 0.00 0.00 normalize
4.92 2.93 0.18 17836094 0.00 0.00 add_vector
```
問題 : 應該是要減少呼叫dot_product的次數, 且dot_product中的for為小迴圈, 可嘗試將for迴圈展開
## 嘗試二( loop unrolling )
將dot_product, multiply_vector, add_vector, subtract_vector中的for迴圈展開
`Execution time of raytracing() : 2.253459 sec`
```
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
20.45 0.47 0.47 69646433 0.00 0.00 dot_product
11.96 0.75 0.28 13861875 0.00 0.00 rayRectangularIntersection
11.75 1.02 0.27 13861875 0.00 0.00 raySphereIntersection
10.44 1.26 0.24 10598450 0.00 0.00 normalize
7.40 1.43 0.17 31410180 0.00 0.00 multiply_vector
7.18 1.59 0.17 56956357 0.00 0.00 subtract_vector
6.96 1.75 0.16 17821809 0.00 0.00 cross_product
5.44 1.88 0.13 17836094 0.00 0.00 add_vector
```
從表中可以分析出, loop unrolling對效能產生了影響
## 嘗試三( OpenMP )
發現rayRectangularIntersection與raySphereIntersection對程式效能有影響, 嘗試對raytracing.c進行優化
( 解決raytracing()中大量的for迴圈 -> 平行化 )
結果對效能有很明顯的提升
`Execution time of raytracing() : 0.369776 sec`
```
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
12.80 0.39 0.39 4570655 0.00 0.00 dot_product
11.14 0.72 0.34 3595377 0.00 0.00 subtract_vector
8.98 0.99 0.27 1661550 0.00 0.00 multiply_vector
8.64 1.25 0.26 911025 0.00 0.00 raySphereIntersection
8.31 1.50 0.25 587853 0.00 0.00 normalize
7.98 1.74 0.24 1159622 0.00 0.00 cross_product
7.98 1.98 0.24 231807 0.00 0.00 ray_hit_object
6.65 2.18 0.20 825606 0.00 0.00 rayRectangularIntersection
5.49 2.35 0.17 59468 0.00 0.00 ray_color
4.32 2.48 0.13 1 0.13 2.99 raytracing
3.49 2.58 0.11 1028861 0.00 0.00 add_vector
```
但是發現加入openmp後的結果不大理想, output出來的圖不正確
![](https://i.imgur.com/1VmrQ6y.png)
## 其他嘗試( pthreads )
有看到學長使用pthreads, 不過因為自己對這方面不熟, 還沒有進行嘗試
>> 不用提「這方面不熟」,沒有具體成果前,當然都是「不熟」 [name=jserv]
嘗試用pthreads改寫raytracing.c
>>參考[吳彥寬的共筆](https://embedded2016.hackpad.com/ep/pad/static/wOu40KzMaIP)
### 使用到的pthreads函式
#### main.c
```clike
pthread_t *id = ( pthread_t* ) malloc( THREADNUM* sizeof( pthread_t));
```
宣告thread id, 並分配記憶體
```clike
rays** ptr = (rays**) malloc( THREADNUM* sizeof( rays* ));
```
宣告一個指標型態的陣列, 使每一個id都有一個指標型態的參數(ptr[i]), 用來傳入function
```clike
pthread_create( &id[i], NULL, (void*) &raytracing, (void*) ptr[i]);
```
建立thread, 參數分別是:
thread id; 屬性(一般填NULL); 要multi-threads的function; 傳入的參數
```clike
pthread_join( id[i], NULL);
```
用來等待該id的thread結束, 第二個參數用來儲存回傳值(如果有)
#### raytracing.c
```clike
pthread_exit(0);
```
放在raytracing()最後, 表示該function結束後, thread自己關閉
### 問題解決
Q:pthreads一個function似乎只能傳入一個參數
A:改變raytracing.c中raytracing()的參數結構 -> 變成一個structure
```
rays *new_rays(uint8_t *pixels, double *background_color,
rectangular_node rectangulars, sphere_node spheres,
light_node lights,const viewpoint *view,
int width, int height, int id, int threadnum)
{
rays * r = (rays *) malloc ( sizeof( rays));
r->pixels = pixels;
r->background_color = background_color;
r->rectangulars = rectangulars;
r->spheres = spheres;
r->lights = lights;
r->view = view;
r->width = width;
r->height = height;
r->id = id;
r->threadnum = threadnum;
return r;
}
```
並將原本的
`void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height)`
改成
`void raytracing( void * ray)`
函式中的參數也都改用potiner
(raytracing.h中也要做相對應的修改)
Q:編譯時出現
```
cc -o raytracing objects.o raytracing.o main.o -lm -lgomp
/usr/bin/ld: main.o: undefined reference to symbol 'pthread_create@@GLIBC_2.2.5'
/usr/bin/ld: note: 'pthread_create@@GLIBC_2.2.5' is defined in DSO /lib64/libpthread.so.0 so try adding it to the linker command line
/lib64/libpthread.so.0: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make: *** [raytracing] Error 1
```
A:需要在Makefile中的`LDFLAGS`增加`-lpthread`
### 結果
`Execution time of raytracing() : 0.320049 sec`(THREADNUM = 128)
pthread能大幅提昇效能, 且得到的out.ppm也是正確的圖形
## 嘗試四( force inline )
呼叫函數時, 電腦會紀錄目前的記憶體位址, 然後跳至函數的記憶體位置, 等到處理完後, 再回到原先的位址, 但這樣會降低效能
使用inline可以直接展開function, 但原本的inline只能'建議'compiler, 嘗試強迫inline
在`CFLAGS`中增加`-D__forceinline="__attribute__((always_inline))"`
### 結果
`Execution time of raytracing() : 0.313396 sec`
沒有很明顯的影響
## 參考資料
[gprof2dot github](https://github.com/jrfonseca/gprof2dot)
[pThreads for Raytracing](https://embedded2016.hackpad.com/ep/pad/static/wOu40KzMaIP)
[Enhance raytracing program](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp)
[GNU gprof](https://sourceware.org/binutils/docs/gprof/)