2016q3 Homework 2 ( raytracing )

contributed by <ierosodin>
reviewed by <janetwei>

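Consider using a tool such as gnuplot to plot a performance comparison chart that shows the difference before and after optimization.
Other optimization methods are also worth trying, for example SIMD.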

Development environment

Operating system: CentOS 7

$ lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Genuine Intel® CPU @ 3.30GHz
Stepping: 5
CPU MHz: 1277.976
BogoMIPS: 6600.19
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11

Software installation

$ yum install graphviz
$ yum install ImageMagick
$ git clone https://github.com/sysprog21/raytracing 
$ cd raytracing
$ make
$ ./raytracing

First run of raytracing
Execution time of raytracing() : 3.097945 sec

Profiling with gprof

$ make PROFILE=1
$ ./raytracing

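The execution time gets longer when profiling with gprof
(the gprof instrumentation makes gcc insert a call to mcount into every function, so every function call pays for an extra mcount call)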
Execution time of raytracing() : 5.379273 sec

Using gprof2dot on CentOS

gprof2dot converts the profiling output of gprof or perf into dot format, which describes the relationships between the nodes of the call graph

$ git clone https://github.com/jrfonseca/gprof2dot
$ gprof raytracing| ../gprof2dot/gprof2dot.py | dot -Tpng -o dot.png

Analysis

$ gprof -b raytracing gmon.out | less

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.87      0.72     0.72 69646433     0.00     0.00  dot_product
 19.85      1.35     0.63 56956357     0.00     0.00  subtract_vector
  9.21      1.64     0.29 13861875     0.00     0.00  raySphereIntersection
  8.26      1.90     0.26 10598450     0.00     0.00  normalize
  7.94      2.15     0.25 13861875     0.00     0.00  rayRectangularIntersection
  7.15      2.37     0.23 31410180     0.00     0.00  multiply_vector
  6.04      2.56     0.19 17821809     0.00     0.00  cross_product
  5.56      2.74     0.18 17836094     0.00     0.00  add_vector
  3.18      2.84     0.10  4620625     0.00     0.00  ray_hit_object

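The gprof flat profile shows that dot_product and subtract_vector are called very often and account for a large share of the run time, so these two functions are the first optimization targets

Attempt 1 (OpenMP)

Apply OpenMP to the for loop in dot_product()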

#include <omp.h>
#pragma omp for
for (i = 0; i < 3; i++)
    dp += v1[i] * v2[i];

The execution time actually got longer!

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 36.36      1.33     1.33 69646433     0.00     0.00  dot_product
 16.13      1.92     0.59 56956357     0.00     0.00  subtract_vector
 11.21      2.33     0.41 13861875     0.00     0.00  rayRectangularIntersection
  6.29      2.56     0.23 31410180     0.00     0.00  multiply_vector
  5.19      2.75     0.19 10598450     0.00     0.00  normalize
  4.92      2.93     0.18 17836094     0.00     0.00  add_vector

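Problem: the real goal should be to reduce the number of calls to dot_product. The for loop inside dot_product is also tiny (three iterations), so unrolling it is worth trying

Attempt 2 (loop unrolling)

Unroll the for loops in dot_product, multiply_vector, add_vector, and subtract_vector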
Execution time of raytracing() : 2.253459 sec

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 20.45      0.47     0.47 69646433     0.00     0.00  dot_product
 11.96      0.75     0.28 13861875     0.00     0.00  rayRectangularIntersection
 11.75      1.02     0.27 13861875     0.00     0.00  raySphereIntersection
 10.44      1.26     0.24 10598450     0.00     0.00  normalize
  7.40      1.43     0.17 31410180     0.00     0.00  multiply_vector
  7.18      1.59     0.17 56956357     0.00     0.00  subtract_vector
  6.96      1.75     0.16 17821809     0.00     0.00  cross_product
  5.44      1.88     0.13 17836094     0.00     0.00  add_vector

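The profile shows that loop unrolling helped: compared with the first flat profile, the self time of dot_product dropped from 0.72 s to 0.47 s and that of subtract_vector from 0.63 s to 0.17 s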
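
The notes do not include the unrolled code, so here is a minimal sketch of what the change in Attempt 2 might look like. The function names with the `_loop` and `_unrolled` suffixes are hypothetical stand-ins for the project's vector helpers, and the vectors are assumed to be arrays of three doubles.

```c
/* Hypothetical stand-ins for the project's vector helpers.
 * Assumption: a vector is an array of three doubles. */

/* Loop version, as in the snippet from Attempt 1. */
static inline double dot_product_loop(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

/* Unrolled version: three fixed terms, no loop counter and no branch. */
static inline double dot_product_unrolled(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}

/* The same idea applied to subtraction: out = a - b. */
static inline void subtract_vector_unrolled(const double *a, const double *b,
                                            double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}
```

With only three elements, the loop bookkeeping is a noticeable part of the work per call, which is probably why removing it shows up in the profile.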

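Attempt 3 (OpenMP)

rayRectangularIntersection and raySphereIntersection turned out to affect performance as well, so the next step is to optimize raytracing.c
(deal with the heavy for loops in raytracing() by parallelizing them)

The result is a very clear speedup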
Execution time of raytracing() : 0.369776 sec

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 12.80      0.39     0.39  4570655     0.00     0.00  dot_product
 11.14      0.72     0.34  3595377     0.00     0.00  subtract_vector
  8.98      0.99     0.27  1661550     0.00     0.00  multiply_vector
  8.64      1.25     0.26   911025     0.00     0.00  raySphereIntersection
  8.31      1.50     0.25   587853     0.00     0.00  normalize
  7.98      1.74     0.24  1159622     0.00     0.00  cross_product
  7.98      1.98     0.24   231807     0.00     0.00  ray_hit_object
  6.65      2.18     0.20   825606     0.00     0.00  rayRectangularIntersection
  5.49      2.35     0.17    59468     0.00     0.00  ray_color
  4.32      2.48     0.13        1     0.13     2.99  raytracing
  3.49      2.58     0.11  1028861     0.00     0.00  add_vector

However, the result after adding OpenMP is not satisfactory: the output image is incorrect
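
The notes do not show the OpenMP code that was added to raytracing(), so the following is only a sketch of how an outer pixel loop might be parallelized. `shade` and `render` are hypothetical names, not functions from the project. A frequent cause of a wrong image after this kind of change is a temporary that is declared outside the loop and therefore shared by all threads; that is an assumption here, not a confirmed diagnosis of the problem above.

```c
#include <stdint.h>

/* Hypothetical per-pixel function standing in for the real ray casting. */
static uint8_t shade(int i, int j)
{
    return (uint8_t)((i * 31 + j * 17) & 0xff);
}

/* Fill a width x height grayscale buffer; every pixel is independent. */
void render(uint8_t *pixels, int width, int height)
{
    /* Needs -fopenmp; without it the pragma is ignored and the loop
     * simply runs serially with the same result. */
    #pragma omp parallel for schedule(dynamic)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            /* Declared inside the loop, so each thread has its own copy. */
            uint8_t value = shade(i, j);
            pixels[j * width + i] = value;
        }
    }
}
```

Variables that must stay outside the loop can instead be listed in a `private(...)` clause on the pragma.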

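Other attempts (pthreads)

I saw that a senior classmate used pthreads, but since I am not familiar with this area, I have not tried it yet

No need to say "not familiar with this area"; before there are concrete results, everything is of course "unfamiliar" jserv

Rewriting raytracing.c with pthreads

Based on 吳彥寬's shared notes

pthreads functions used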

main.c

pthread_t *id = ( pthread_t* ) malloc( THREADNUM* sizeof( pthread_t));

Declare the thread IDs and allocate memory for them

rays** ptr = (rays**) malloc( THREADNUM* sizeof( rays* ));

Declare an array of pointers so that each thread ID has its own pointer argument (ptr[i]) to pass into the function

pthread_create( &id[i], NULL, (void*) &raytracing, (void*) ptr[i]);

Create the thread. The arguments are:
the thread ID; the attributes (usually NULL); the function to run in the new thread; the argument passed to that function

pthread_join( id[i], NULL);

Wait for the thread with that ID to finish; the second argument receives the return value (if any)

raytracing.c

pthread_exit(0);

Placed at the end of raytracing(), so that the thread terminates itself once the function has finished
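
To tie the calls above together, here is a minimal sketch of the create/join lifecycle. `task`, `worker`, and `run_workers` are hypothetical names, and THREADNUM is set to an illustrative value rather than the one used in these notes. The sketch declares the start routine with the `void *(*)(void *)` signature that pthread_create expects, so no cast is needed.

```c
#include <pthread.h>
#include <stddef.h>

#define THREADNUM 4   /* illustrative value only */

/* Hypothetical per-thread argument; the real code passes a rays struct. */
typedef struct {
    int id;
    int result;
} task;

/* Start routine: takes one void * argument and returns void *. */
static void *worker(void *arg)
{
    task *t = arg;
    t->result = t->id * t->id;
    return NULL;   /* returning here ends the thread, like pthread_exit */
}

/* Spawn THREADNUM threads, wait for all of them, and return the sum of
 * their results, or -1 if a thread could not be created. */
int run_workers(void)
{
    pthread_t id[THREADNUM];
    task tasks[THREADNUM];
    int created = 0;
    int sum = 0;

    for (int i = 0; i < THREADNUM; i++) {
        tasks[i].id = i;
        tasks[i].result = 0;
        if (pthread_create(&id[i], NULL, worker, &tasks[i]) != 0)
            break;
        created++;
    }
    /* Join only the threads that were actually started. */
    for (int i = 0; i < created; i++) {
        pthread_join(id[i], NULL);
        sum += tasks[i].result;
    }
    return created == THREADNUM ? sum : -1;
}
```

Each thread writes only to its own `task`, so no locking is needed; the join is what makes the results safe to read afterwards.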

Problems and solutions

Q: A function started by pthreads seems to accept only one argument

A: Change the parameters of raytracing() in raytracing.c into a single structure

rays *new_rays(uint8_t *pixels, double *background_color,
                rectangular_node rectangulars, sphere_node spheres,
                light_node lights,const viewpoint *view,
                int width, int height, int id, int threadnum)
{
    rays * r =  (rays *) malloc ( sizeof( rays));
    r->pixels = pixels;
    r->background_color =  background_color;
    r->rectangulars = rectangulars;
    r->spheres = spheres;
    r->lights = lights;
    r->view = view;
    r->width = width;
    r->height = height;
    r->id = id;
    r->threadnum = threadnum;
    return r;
}

The original signature
void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height)
was changed to
void raytracing( void * ray)
Inside the function, the parameters are now accessed through the pointer
(raytracing.h needs the corresponding changes as well)
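
The notes do not say how the rows are divided among threads, so the following sketch assumes an interleaved split based on the `id` and `threadnum` fields. `rays_sketch` and `raytracing_sketch` are hypothetical, simplified versions of the real struct and function, and the sketch returns `void *` so that it can be passed to pthread_create without a cast.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical version of the rays struct:
 * only the fields needed to show the work split. */
typedef struct {
    uint8_t *pixels;
    int width;
    int height;
    int id;         /* index of this thread, 0 .. threadnum - 1 */
    int threadnum;  /* total number of threads */
} rays_sketch;

/* Unpacks the single void * argument and processes this thread's rows. */
void *raytracing_sketch(void *arg)
{
    const rays_sketch *r = arg;

    /* Assumed split: thread id handles rows id, id + threadnum, ... */
    for (int j = r->id; j < r->height; j += r->threadnum)
        for (int i = 0; i < r->width; i++)
            r->pixels[j * r->width + i] += 1;  /* placeholder for ray casting */
    return NULL;
}
```

Because every row belongs to exactly one thread, the threads never write to the same pixel, so the shared `pixels` buffer needs no locking under this assumption.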

Q: The build fails at link time with

cc -o raytracing objects.o raytracing.o main.o -lm -lgomp
/usr/bin/ld: main.o: undefined reference to symbol 'pthread_create@@GLIBC_2.2.5'
/usr/bin/ld: note: 'pthread_create@@GLIBC_2.2.5' is defined in DSO /lib64/libpthread.so.0 so try adding it to the linker command line
/lib64/libpthread.so.0: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make: *** [raytracing] Error 1

A: Add -lpthread to LDFLAGS in the Makefile

Result

Execution time of raytracing() : 0.320049 sec (THREADNUM = 128)
pthreads greatly improves performance, and the resulting out.ppm is also the correct image

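Attempt 4 (force inline)

When a function is called, the CPU saves the current address, jumps to the address of the function, and returns to the original address once the function is done; this adds overhead to every call
inline lets the compiler expand the function body at the call site, but plain inline is only a "suggestion" to the compiler, so the next step is to force inlining

Add -D__forceinline="__attribute__((always_inline))" to CFLAGS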
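
As a rough illustration of what that flag does, the sketch below assumes the function definitions carry a `__forceinline` marker, which the notes do not show. The fallback `#define` only stands in for the -D option so the snippet builds on its own with GCC or Clang; `length_squared` is a hypothetical caller.

```c
/* Normally supplied by the build through
 * -D__forceinline="__attribute__((always_inline))". */
#ifndef __forceinline
#define __forceinline __attribute__((always_inline))
#endif

/* always_inline asks GCC/Clang to inline this function at every call
 * site, even when optimization is turned off. */
static inline __forceinline
double dot_product(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}

/* Hypothetical caller: the body of dot_product is expanded here
 * instead of being reached through a call instruction. */
double length_squared(const double *v)
{
    return dot_product(v, v);
}
```

Compilers already inline small static functions at higher optimization levels, so the attribute matters mostly for unoptimized or profiling builds.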

Result

Execution time of raytracing() : 0.313396 sec
No noticeable effect compared with the 0.320049 sec of the pthreads result

References

gprof2dot github
pThreads for Raytracing
Enhance raytracing program
GNU gprof