2016q3 Homework1 (raytracing)

contributed by <carolc0708>

作業說明

A02: raytracing
可善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作

gprof 分析 raytracing程式

參考陳昱齊同學的筆記，將gprof的操作作淺顯易懂的解說:

一般操作
- step1: $ gcc -pg 檔名.c
  執行檔檔名（ex. file）
  此時會產生一個名叫做file的執行檔
- step2: $ ./file
  執行file，接著會產生一個gmon.out文件，用來供gprof分析用的
- step3: $ gprof -b file gmon.out | less
  此時就可以看到gprof分析程式，每個函數所花的時間，可以針對時間花費最多的進行改善

先在source code底下執行$ make PROFILE=1產生raytracing執行檔，

再執行 $ ./raytracing，接著會產生gmon.out文件。

最後執行 $ gprof -b raytracing gmon.out | head -n 20，列出gprof分析程式中花費時間最高的20筆，結果如下:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 23.80      0.49     0.49 69646433     0.00     0.00  dot_product
 21.37      0.93     0.44 56956357     0.00     0.00  subtract_vector
 10.93      1.16     0.23 31410180     0.00     0.00  multiply_vector
  8.26      1.33     0.17 17836094     0.00     0.00  add_vector
  7.29      1.48     0.15 10598450     0.00     0.00  normalize
  5.83      1.60     0.12 13861875     0.00     0.00  raySphereIntersection
  5.34      1.71     0.11 17821809     0.00     0.00  cross_product
  4.37      1.80     0.09 13861875     0.00     0.00  rayRectangularIntersection
  3.89      1.88     0.08  4620625     0.00     0.00  ray_hit_object
  2.43      1.93     0.05  1048576     0.00     0.00  ray_color
  1.46      1.96     0.03  3838091     0.00     0.00  length
  0.97      1.98     0.02  2520791     0.00     0.00  idx_stack_top
  0.73      1.99     0.02  4221152     0.00     0.00  multiply_vectors
  0.49      2.00     0.01  2558386     0.00     0.00  idx_stack_empty
  0.49      2.01     0.01  2110576     0.00     0.00  compute_specular_diffuse

由結果可知，我們必須對前幾個執行次數高、執行時間長的函式，如dot_product()、subtract_vector()、multiply_vector()等進行效能的改善。

關於執行時間，在沒有使用gprof來追蹤程式(直接$ make來產生raytracting執行檔)，程式執行時間為 2.433547 sec，而使用gprof來追蹤程式後，程式執行時間為 5.019975 sec。

gprof效能分析工具

Loop Unrolling

loop unrolling背景知識:將for迴圈中的表示式進行線性展開，減去迴圈檢查、跳躍指令所需用到的時間。

loop unrolling
loop unrolling wiki

先在花費總時間最高的dot_product()做loop unrolling:








static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

將for迴圈線性展開:







static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    dp += v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
    return dp;
}

改完後，執行時間由原本的 5.019975 sec 降為 4.675378 sec ，減少了 0.34 sec左右。

接著對 <math-toolkit.h> 中其他也有相同狀況的函式add_vector()、subtract_vector()、multiply_vectors()、multiply_vector()進行改良:








static inline
void add_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
	
}







static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}







static inline
void multiply_vectors(const double *a, const double *b, double *out)
{
    out[0] = a[0] * b[0];
    out[1] = a[1] * b[1];
    out[2] = a[2] * b[2];
}







static inline
void multiply_vector(const double *a, double b, double *out)
{
    out[0] = a[0] * b;
    out[1] = a[1] * b;
    out[2] = a[2] * b;
}

改完之後，程式執行時間降為 4.191883 sec，比完全未改善前降低了約 0.83 sec

Force Inline

參考raytracing共筆中提到 Inline function的展開
Inline function背景知識:
- 在程式當中，直接將函式內容展開，省去執行時需額外跳至該函式記憶體位置再跳回來的時間。
- 即便加入inline關鍵字，在程式執行時也不一定會實作。編譯器會去選擇這樣的實作方式是否符合效益、有節省到時間。
- 一般來說，函數中的程式碼或處理較短，並且常呼叫使用的話，可以給他加上個inline（因為編譯器可能會應你的要求使用）
static inline&inline &extern inline

等讀完多份文件再回來補資料><Carol Chen

Inline function觀念
[C++]內嵌函數（inline function）筆記
 Inline v.s Macro
Inline function in C
Static Inline v.s External Inline

因爲在Makefile中擁有 -O0 的指令(關閉最佳化)，所以inline不會被執行。要讓inline執行必須在每一個function後面加上 __attribute__((always_inline))，強行讓CPU去執行inline這個“建議”

這邊是否每一個都要強制展開不是很確定?!Carol Chen

執行結果，出現了warning@@

In file included from raytracing.c:5:0:
primitives.h:4:1: warning: ?鳬lways_inline??attribute ignored [-Wattributes]
 typedef double point3[3];
 ^

但執行時間明顯下降，只花了2.128363 sec，比只做loop unrolling下降了2.07 sec。

OpenMP

OpenMP 背景知識:
- OpenMP（Open Multi-Processing）是一套支援跨平台共享記憶體方式的多執行緒並行的編程API，包括一套編譯器指令庫和一些能夠影響執行行為的環境變數。
- 可分為三類:Directive、Clause、Function:
  - function 的部份是獨立呼叫的，在一般的情況下不大會用到。
  - directive 和 clause 的用法，大致上是：
    #pragma omp <directive> [clause]
- 要拿來平行化，是使用 parallel、sections、for 這三項。ordered是用來設定平行化的執行順序。
  - 使用 parallel pragma 來開始一個 parallel block，程式會 runtime 決定要開多少 thread，平行化的範圍是 parallel pragma 後的 statement 或 block，結束後 thread 就會收回。
  - for: 把程式中的for迴圈分給scope上#pragma omp parallel所生出的threads去執行
  - section: 把程式中沒有相關性的各個程式利用 #pragma omp section 來做區塊切割，然後由 #pragma omp parallel sections 分出的threads去做平行的處理。定義方式和for稍不同，詳細參考這裡。

OpenMP note
OpenMP wiki
OpenMP簡易的程式平行化:（三）範例 parallel、section

程式需 #include <omp.h>
Makefile中的 CFLAGS 加上 -fopenmp，LDFLAGS 加上-lgomp
在有for迴圈的地方都做平行化處理，有兩種方式，效果相同:
- #pragma omp parallel for加在for迴圈上方
- #pragma omp parallel加在要平行處理的scope{}上方，#pragma omp for 加在裏頭的for迴圈上方。
在 math-toolkit.h中修改有for迴圈的地方，









static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

執行結果為4.974464 sec，比優化前版本快 0.1 sec左右而已。判斷應該是for迴圈執行次數太少且只有一層，無法達到很大的優化效果。

參考Dada chan的筆記，在raytracing.c中修改raytracing()函式:







 #pragma omp parallel for num_threads(64) private(d),private(object_color),private(stk)
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            double r = 0, g = 0, b = 0;
            /* MSAA */
            for (int s = 0; s < SAMPLES; s++) {

為什要這樣寫?Carol Chen

private(): 讓每個執行緒中，都有一份變數的複本，以免互相干擾

raytracing共筆中提到，使用private( )對stk、d和object_color進行處理的原因:在迴圈中的變數有非常多，但是可以從程式碼中看到以下兩行，他們的值事先就已經好，他們的值在進入迴圈之前就已進行過一次運算，並且在迴圈內並沒有任何一行指令會去改變他們的值，所以可想而知它們的值是不會改變的。所以只剩下stk、d和object_color這三個變數了。

直接執行程式對raytracing()改善，結果為5.004982 sec，無太大差別。但加上前面實作的loop unrolling和force inline，時間可降為2.098089 sec。

一直覺得自己的OpenMP沒改善太多，很不甘心於是參考了很多位同學的筆記看看有沒有不同的寫法。在王紹華同學的筆記看到切section的方式實作，修改dot_product() :






















static inline
double dot_product(const double *v1, const double *v2)
{
    double dp0, dp1, dp2;

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            dp0 = v1[0] * v2[0];
        }
        #pragma omp section
        {
            dp1 = v1[1] * v2[1];
        }
        #pragma omp section
        {
            dp2 = v1[2] * v2[2];
        }
    }
    return dp0 + dp1 + dp2;
}

實驗結果為執行時間4.675069 sec，比優化前降低 0.4 sec，比只做for的平行好一點點。

SIMD

背景知識:(參考謝永勁的筆記和wiki)
- SIMD:單指令流多資料流（英語：Single Instruction Multiple Data，縮寫：SIMD）是一種採用一個控制器來控制多個處理器，同時對一組資料（又稱「資料向量(Vector Type)」）中的每一個分別執行相同的操作從而實作空間上的並列性的技術。
- 同一個instruction可以同時處理多個data，可以同時處理多筆資料省去資料搬移的時間
- 在微處理器中，單指令流多資料流技術則是一個控制器控制多個平行的處理微元，例如Intel的MMX或SSE，以及AMD的3D Now!指令集。
運算方式:
一個float[4]，他的位置是連續的，那麼SIMD就可以將float按照位置分別做運算，就如同下面的圖片，他能一次運算4個float，以達到加速的效果，但是需要注意的是SIMD只支援同一種運算的方式，也就是說如果你想要用不同的方式對A0,A1,A2,A3進行運算是不行的，可能就要找找看MIMD
資料儲存型態:
C的資料型態被稱作scalar types（ex: int ,char ,float …），然而在SIMD裡面相對應的scalar是使用vector type來做儲存的

在intel上使用avx指令集，要在CFLAG加-mavx2，並加入 #include<immintrin.h>

Intrinsics Guide

參考多份筆記發現效能不會變好，反而變差

想參考以下看是否能夠改善Carol Chen
SIMD Ray Tracing Tips

POSIX Thread

背景知識:
- POSIX執行緒（英語：POSIX Threads，常被縮寫為Pthreads）是POSIX的執行緒標準，定義了創建和操縱執行緒的一套API。
- 實現POSIX 執行緒標準的庫常被稱作Pthreads，一般用於Unix-like POSIX 系統，如Linux、 Solaris。

PThread入門

PThread會使用到的函式:
- pthread_create
  int pthread_create(pthread_t *tid , const pthread_attr_t *attr , void *(*function)(void *) , void *argument)
  這個Function的作用是用來產生一個Thread並執行附帶的Function，附帶有4個參數
  參數1. pthread_t *tid為pthread的指標，在使用Thread之前必須要先宣告一個pthread_t的變數。
  參數2. const pthread_attr_t *attr為該Thread的屬性，預設是NULL，如果沒有其他特殊的需求直接填入NULL即可。
  參數3. void *(*function)(void *)為Function pointer，這邊要放入你要執行的Function。
  參數4. void *argument為Function pointer所要帶的參數。
  回傳值: 如果執行成功則回傳0，如果執行失敗則回傳錯誤代碼。
  example: pthread_create( &thread1, NULL , showmessage , message);
- pthread_join
  int pthread_join (pthread_t thread, void **value_ptr)
  這個Function的作用會暫停目前執行pthread_join的Thread，等到目標Thread執行完畢之後目前執行pthread_join的Thread才會繼續執行，附帶有2個參數。
  參數1: pthread_t thread為要等待的目標Thread。
  參數2: void **value_ptr用來取得目標Thread的回傳值。
  回傳值: 如果執行成功則回傳0，如果執行失敗則回傳錯誤代碼。

PThread 程式撰寫

參考陳品睿的筆記，在raytracing()中實作PThread:

先看原始的raytracing():











































void raytracing(uint8_t *pixels, color background_color,
                rectangular_node rectangulars, sphere_node spheres,
                light_node lights, const viewpoint *view,
                int width, int height)
{
    point3 u, v, w, d;
    color object_color = { 0.0, 0.0, 0.0 };

    /* calculate u, v, w */
    calculateBasisVectors(u, v, w, view);

    idx_stack stk;

    int factor = sqrt(SAMPLES);
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            double r = 0, g = 0, b = 0;
            /* MSAA */
            for (int s = 0; s < SAMPLES; s++) {
                idx_stack_init(&stk);
                rayConstruction(d, u, v, w,
                                i * factor + s / factor,
                                j * factor + s % factor,
                                view,
                                width * factor, height * factor);
                if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
                              lights, object_color,
                              MAX_REFLECTION_BOUNCES)) {
                    r += object_color[0];
                    g += object_color[1];
                    b += object_color[2];
                } else {
                    r += background_color[0];
                    g += background_color[1];
                    b += background_color[2];
                }
                pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
                pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
                pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
            }
        }
    }
}

可以將height或width拆成給不同Thread去執行，加快運算速度。

切height(column)的作法:

筆記中有提到均勻分column方式的效能比較，想多了解再來嘗試看看差異Carol Chen

在raytracing.h中，重新設計資料結構，將Thread當中要分別下去跑的變數一起包成一個struct Thread_Content:



















typedef struct THREAD_CONTENT{

        uint8_t *pixels;
        double *background_color;//
        rectangular_node rectangulars;
        sphere_node spheres;
        light_node lights;
        const viewpoint *view;

        int width;
        int height;

        double *u;//
        double *v;//
        double *w;//
        double *d;//

        int factor;
}Thread_Content;

Thread_Arg定義Thread的參數，其中包含Thread_Content指標:






typedef struct THREAD_ARG{

        int start_;
        int end_;
        Thread_Content* tc_;
}Thread_Arg;

接下來在raytracing.c由raytracing()另外分出兩個函式，分別處理Thread中的運算與Thread平行運算兩部分。

處理Thread中的運算:






































void *thread_function(void *arg_)
{
    const Thread_Arg *arg = (const Thread_Arg *)arg_;

    color object_color = { 0.0, 0.0, 0.0 };
    idx_stack stk;

    for(int j = arg->start_ ; j < arg->end_ ; j += THREAD_NUM)
    {
        for(int i = 0 ; i < arg->tc_->width ; i ++)
        {
                double r = 0, g = 0, b = 0;
                for (int s = 0; s < SAMPLES; s++) {
                idx_stack_init(&stk);
                rayConstruction(arg->tc_->d, arg->tc_->u, arg->tc_->v, arg->tc_->w,
                                i * arg->tc_->factor + s / arg->tc_->factor,
                                j * arg->tc_->factor + s % arg->tc_->factor,
                                arg->tc_->view,
                                arg->tc_->width * arg->tc_->factor, arg->tc_->height * arg->tc_->factor);
                if (ray_color(arg->tc_->view->vrp, 0.0, arg->tc_->d, &stk, arg->tc_->rectangulars, arg->tc_->spheres, arg->tc_->lights, object_color, MAX_REFLECTION_BOUNCES))                  {
                    r += object_color[0];
                    g += object_color[1];
                    b += object_color[2];
                }

                else
                {
                    r += arg->tc_->background_color[0];
                    g += arg->tc_->background_color[1];
                    b += arg->tc_->background_color[2];
                }
                arg->tc_->pixels[((i + (j * arg->tc_->width)) * 3) + 0] = r * 255 / SAMPLES;
                arg->tc_->pixels[((i + (j * arg->tc_->width)) * 3) + 1] = g * 255 / SAMPLES;
                arg->tc_->pixels[((i + (j * arg->tc_->width)) * 3) + 2] = b * 255 / SAMPLES;
        }
      }
   }
}

處理Thread平行運算:














































void parallel(uint8_t *pixels, double *background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height, double* u, double* v, double* w, double* d, int factor)
{
        pthread_t pt_id[THREAD_NUM];
        Thread_Content *tc= (Thread_Content*) malloc(sizeof(Thread_Content));
        Thread_Arg *arg[THREAD_NUM];
        void *ret;

        //
        tc->pixels = pixels;
        tc->background_color = background_color;
        tc->rectangulars = rectangulars;
        tc->spheres = spheres;
        tc->lights = lights;
        tc->view = view;

        tc->width = width;
        tc->height = height;

        tc->u = u;
        tc->v = v;
        tc->w = w;
        tc->d = d;

        tc->factor = factor;
        //
        for(int i = 0 ; i < THREAD_NUM ; i++)
        {
                arg[i] = (Thread_Arg *) malloc(sizeof(Thread_Arg));
                arg[i]->tc_ = tc;
                arg[i]->start_ = i;
                arg[i]->end_ = height;
        }

        for(int i=0; i<THREAD_NUM; i++)
        {
            if(pthread_create(pt_id+i, NULL, thread_function, arg[i]) != 0)
            {
                printf("Thread created fail.\n");
                exit(1);
             }
        }

    for(int i=0; i<THREAD_NUM; i++)
        pthread_join(pt_id[i], &ret);

}

不同的THREAD_NUM執行結果如下:

32:   4.974186 sec
64:   4.902074 sec
128:  4.256612 sec
256:  3.925100 sec
512:  3.711159 sec//!!!
1024: 5.183581 sec

可以發現我的電腦開512 Threads時，效能最好。

切width(row)的作法，執行結果如下:

32:	5.243384 sec
64: 	4.821204 sec
128: 	5.005269 sec
256: 	5.249753 sec

筆記中還有很多切Thread的方式，有空時可以研究一下Carol Chen