2017q1Homewok1(raytracing)

contributed by <janetwei>

Reviewed by `ierosodin`

OpenMP 與 pthread 的效能差異不大
比較不同 thread number 對效能的差異
利用 threadpool 來解決不同 thread 之間工作量不平均的問題
OpenMP 中不同 schedule 對平行化的影響
用 SIMD 來加速整個資料的處理

開發環境

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 61
Model name: Intel® Core™ i5-5250U CPU @ 1.60GHz
Stepping: 4
CPU MHz: 2220.750
CPU max MHz: 2700.0000
CPU min MHz: 500.0000
BogoMIPS: 3199.77
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3

優化前原始程式的執行時間:

Execution time of raytracing() : 3.001049 sec

base.png

gprof

gprof: 印出程式執行中每個 function 所花的時間，可以幫助從眾多 function 中找出耗時最多的。其中包括呼叫次數，因此可以幫忙分析程式運行流程

$ make PROFILE=1
$ ./raytracing

Execution time of raytracing() : 6.391525 sec
在編譯和鏈接程序時（加入-pg），gcc 在你的每个函数中都加入了mcount, 依赖於 compiler 或操作系统的函數，因此 gprof 所執行的時間較慢

gprof2dot
能將 gprof 的結果轉成一個結點格式，分析各個點（也就是 function )之間的關係
$ git clone https://github.com/jrfonseca/gprof2dot
$ gprof raytracing| gprof2dot/gprof2dot.py | dot -Tpng -o result.png

開始分析

$ gprof -b raytracing gmon.out | less
這裡用less是因為 gprof 輸出訊息較多，less可以讓我們透過上下方向鍵查看 gprof 產生的輸出

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 24.47      0.56     0.56 69646433     0.00     0.00  dot_product
 23.60      1.10     0.54 56956357     0.00     0.00  subtract_vector
  9.61      1.32     0.22 31410180     0.00     0.00  multiply_vector
  8.74      1.52     0.20  4620625     0.00     0.00  ray_hit_object
  7.65      1.70     0.18 13861875     0.00     0.00  rayRectangularIntersection
  5.46      1.82     0.13 13861875     0.00     0.00  raySphereIntersection
  5.02      1.94     0.12 17836094     0.00     0.00  add_vector
  4.81      2.05     0.11 10598450     0.00     0.00  normalize
  3.93      2.14     0.09  1048576     0.00     0.00  ray_color
  1.75      2.18     0.04  4221152     0.00     0.00  multiply_vectors
  1.31      2.21     0.03 17821809     0.00     0.00  cross_product
  0.66      2.22     0.02  3838091     0.00     0.00  length
  0.44      2.23     0.01  2110576     0.00     0.00  compute_specular_diffuse
  0.44      2.24     0.01  2110576     0.00     0.00  localColor
  0.44      2.25     0.01  1241598     0.00     0.00  protect_color_overflow
  0.44      2.26     0.01  1241598     0.00     0.00  reflection
  0.44      2.27     0.01  1241598     0.00     0.00  refraction
  0.44      2.28     0.01  1048576     0.00     0.00  rayConstruction
  0.44      2.29     0.01        1     0.01     2.29  raytracing
  0.00      2.29     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      2.29     0.00  2520791     0.00     0.00  idx_stack_top
  0.00      2.29     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      2.29     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      2.29     0.00   113297     0.00     0.00  fresnel
  0.00      2.29     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      2.29     0.00        3     0.00     0.00  append_rectangular
  0.00      2.29     0.00        3     0.00     0.00  append_sphere
  0.00      2.29     0.00        2     0.00     0.00  append_light
  0.00      2.29     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      2.29     0.00        1     0.00     0.00  delete_light_list
  0.00      2.29     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      2.29     0.00        1     0.00     0.00  delete_sphere_list
  0.00      2.29     0.00        1     0.00     0.00  diff_in_second
  0.00      2.29     0.00        1     0.00     0.00  write_to_ppm

由 gprof 結果顯示 dot_product & subtract_vector 這兩個 function 的呼叫次數最多，執行時間分別佔用總時間的24.47% 和 23.6% ，因此嘗試優化這兩個 function

compiler優化(-Ofast)

Execution time of raytracing() : 0.594889 sec

優化

loop unrolling

math-toolkit.h的 dot_product（） & subtract_vector（） 兩個 functions 的是由簡單的 for 迴圈組成，因此嘗試將迴圈拆開






void subtract_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}






double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    dp = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
    return dp;
}

執行時間：
Execution time of raytracing() : 1.996303 sec

再由 gprof 分析

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 17.75      0.28     0.28 69646433     0.00     0.00  dot_product
 11.62      0.46     0.18 56956357     0.00     0.00  subtract_vector
 11.30      0.63     0.18 17836094     0.00     0.00  add_vector
  9.68      0.78     0.15 13861875     0.00     0.00  rayRectangularIntersection
  9.36      0.93     0.15 17821809     0.00     0.00  cross_product
  8.07      1.05     0.13 13861875     0.00     0.00  raySphereIntersection
  7.75      1.17     0.12 31410180     0.00     0.00  multiply_vector
  6.46      1.27     0.10 10598450     0.00     0.00  normalize
  5.81      1.36     0.09  4620625     0.00     0.00  ray_hit_object
  3.23      1.41     0.05  1048576     0.00     0.00  ray_color
  1.94      1.44     0.03  1241598     0.00     0.00  reflection
  1.29      1.46     0.02  2110576     0.00     0.00  localColor
  1.29      1.48     0.02  1241598     0.00     0.00  protect_color_overflow
  1.29      1.50     0.02  1241598     0.00      0.00  refraction

發現 dot_product（） & subtract_vector（） 仍為最高的呼叫次數，但是已經不到 20% ，subtract_vector（）甚至已經減少到 11.62%

因此將其他也是由迴圈所構成的函數 loop unrolling
執行時間：
Execution time of raytracing() : 2.079975 sec

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 15.80      0.21     0.21 69646433     0.00     0.00  dot_product
 12.41      0.38     0.17 13861875     0.00     0.00  raySphereIntersection
 12.41      0.54     0.17 17821809     0.00     0.00  cross_product
 10.91      0.69     0.15 56956357     0.00     0.00  subtract_vector
  9.40      0.81     0.13 13861875     0.00     0.00  rayRectangularIntersection
  7.52      0.91     0.10  4620625     0.00     0.00  ray_hit_object
  6.39      1.00     0.09 31410180     0.00     0.00  multiply_vector
  6.02      1.08     0.08 17836094     0.00     0.00  add_vector
  3.39      1.12     0.05  4221152     0.00     0.00  multiply_vectors
  3.01      1.16     0.04 10598450     0.00     0.00  normalize
  2.26      1.19     0.03  1048576     0.00     0.00  ray_color
  2.26      1.22     0.03        1     0.03     1.33  raytracing
  1.50      1.24     0.02  3838091     0.00     0.00  length
  1.50      1.26     0.02  2520791     0.00     0.00  idx_stack_top

gprof2dot

force inline

在呼叫函式時會需要分配記憶空間因而需要額外的資源負擔，可以「建議」編譯器將之設定為「行內函式」（Inline function），如果建議被採納，則該函式會自動在呼叫點展現為程式碼，行內函式建議可以直接定義於表頭檔案中

將math-toolkit.h中的 inline 替換成__attribute__((always_inline))










static __attribute__((always_inline))
void normalize(double *v)
{
    double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    assert(d != 0.0 && "Error calculating normal");

    v[0] /= d;
    v[1] /= d;
    v[2] /= d;
}

執行時間：
Execution time of raytracing() : 1.737352 sec

$ make PROFILE=1後會出現always_inline function might not be inlinable的 warning =>指的是compiler可能不會做inline這個指令

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 34.01      0.60     0.60 13861875     0.00     0.00  rayRectangularIntersection
 24.29      1.02     0.43 13861875     0.00     0.00  raySphereIntersection
 12.29      1.24     0.22  2110576     0.00     0.00  compute_specular_diffuse
  7.72      1.37     0.14  4620625     0.00     0.00  ray_hit_object
  7.43      1.50     0.13  1048576     0.00     0.00  ray_color
  6.86      1.62     0.12  2110576     0.00     0.00  localColor
  2.57      1.67     0.05  1048576     0.00     0.00  rayConstruction
  1.71      1.70     0.03        1     0.03     1.75  raytracing
  1.14      1.72     0.02  1241598     0.00     0.00  reflection
  0.57      1.73     0.01  2558386     0.00     0.00  idx_stack_empty

發現rayRectangularIntersection 和 raySphereIntersection 的呼叫次數最多

OpenMP

OpenMP 是一套支援跨平台共享記憶體方式的多執行緒並行的編程API，使用C,C++和Fortran語言，包括一套編譯器指令、庫和一些能夠影響執行行為的環境變數。
參考資料

在 Makefile 中加入參數 -fopenmp





CC ?= gcc
CFLAGS = \
        -std=gnu99 -Wall -O0 -g -fopenmp
LDFLAGS = \
        -lm -fopenmp

由 gprof2dot 顯示出rayRectangularIntersection 和 raySphereIntersection都是由 ray_hit_object 中呼叫的，因此嘗試在此 function 平行化 for





















#pragma omp parallel for
    for (rectangular_node rec = rectangulars; rec; rec = rec->next) {
        if (rayRectangularIntersection(biased_e, d, &(rec->element),
                                       &tmpresult, &t1) && (t1 < nearest)) {
            /* hit is closest so far */
            *hit_rectangular = rec;
            nearest = t1;
            result = tmpresult;
        }
    }

    /* check the spheres */
    for (sphere_node sphere = spheres; sphere; sphere = sphere->next) {
        if (raySphereIntersection(biased_e, d, &(sphere->element),
                                  &tmpresult, &t1) && (t1 < nearest)) {
            *hit_sphere = sphere;
            *hit_rectangular = NULL;
            nearest = t1;
            result = tmpresult;
        }
    }

失敗！！！！
出現 error，因為 OpenMP 的 for 有一定的格式
OpenMP Application Program Interface
for (init-expr; test-expr; incr-expr)

init-expr:
var = lb
integer-type var = lb
random-access-iterator-type var = lb
pointer-type var = lb
test-expr:
var relational-op b
b relational-op var
incr-expr:
++ var
var ++
–var
var–
var += incr
var -= incr
var = var + incr
var = incr + var
var = var - incr

之後嘗試在 raytracing 用 OpenMP

#pragma omp parallel private(object_color,stk,d)
#pragma omp for schedule(static,omp_get_max_threads())

OpenMP 語法說明
omp_get_max_threads:
Returns an integer that is equal to or greater than the number of threads that would be available if a parallel region without num_threads were defined at that point in the code.
private：
讓每個執行緒中，都有一份變數的複本，以免互相干擾。
schedule：
設定 for 迴圈的平行化方法；有 dynamic、guided、runtime、static 四種方法。

執行時間：
Execution time of raytracing() : 0.823155 sec

用 OpenMP 使效能有明顯的提升

pthread

pthread: 可將一個 function 分成多個 thread 執行，增加執行效率

main.c

加入標頭檔 #include <pthread.h>
宣告 thread id ，並配置記憶體空間給他
並且宣告一個指標陣列，使得每個 id 都有值傳入（ tmp[i] )




















pthread_t *id = (pthread_t *) malloc(THREAD_NUM * sizeof(pthread_t));
rays_arg **tmp = (rays_arg **) malloc(THREAD_NUM * sizeof(rays_arg));

for(int i=0; i<THREAD_NUM; i++) {
        tmp[i]=(rays_arg *)malloc(sizeof(rays_arg));
        tmp[i]->pixels = pixels;
        tmp[i]->background_color =  background;
        tmp[i]->rectangulars = rectangulars;
        tmp[i]->spheres = spheres;
        tmp[i]->lights = lights;
        tmp[i]->view = &view;
        tmp[i]->width = ROWS;
        tmp[i]->height = COLS;
        tmp[i]->t_id = i;
        tmp[i]->t_num = THREAD_NUM;
        pthread_create(&id[i], NULL, (void *) &raytracing, (void *) tmp[i]);
    }
    for(int i=0; i<THREAD_NUM; i++) {
        pthread_join(id[i],NULL);
    }

pthread_create（）：
產生一個 thread 執行 function
int pthread_create(pthread_t *tid , const pthread_attr_t *attr , void *(*function)(void *) , void *argument)
const pthread_attr_t *attr為該Thread的屬性，預設是NULL，
void *argument為Function pointer所要帶的參數。
回傳值: 如果執行成功則回傳0，如果執行失敗則回傳錯誤。
pthread_join():
等待指定的 thread 結束
int pthread_join (pthread_t thread, void **value_ptr)

raytracing.c

pthread_exit():
用來關閉一個 thread
void pthread_exit (void *value_ptr)

raytracing.h

由於 main.c中的 pthread_create() 只能輸入一個變數，因此創一個 struct 使得那個變數指向 raytracing()所需的變數












typedef struct ray {
    uint8_t *pixels;
    double * background_color;
    rectangular_node rectangulars;
    sphere_node spheres;
    light_node lights;
    const viewpoint *view;
    int width;
    int height;
    int t_id;
    int t_num;
} rays_arg;

Execution time of raytracing() : 0.798060 sec

可是優化的結果只有和OpenMP相差一點點的時間
因此還需要再修改一下

參考資料
pthread for raytrcing