2016q3 Homework2 (raytracing)

contributed by <aweimeow>

tags: `sysprog21` `aweimeow`

Development environment

OS: Ubuntu 14.04.4 LTS
CPU: Intel® Core™ i5-4210M CPU @ 2.60GHz
Memory: 8G
Cache:
- L1d cache: 32KB
- L1i cache: 32KB
- L2 cache: 256KB
- L3 cache: 3072KB

Prerequisites

$ sudo apt-get update
$ sudo apt-get install graphviz
$ sudo apt-get install imagemagick

Unmodified version

# Rendering scene
Done!
Execution time of raytracing() : 2.909354 sec

The output image:
output.ppm

Finding places to speed up

First, build with PROFILE=1:

# Rendering scene
Done!
Execution time of raytracing() : 5.193626 sec

Only the top entries of the gprof output are shown:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 28.59      0.90     0.90 69646433     0.00     0.00  dot_product
 14.29      1.35     0.45 56956357     0.00     0.00  subtract_vector
  9.05      1.64     0.29 31410180     0.00     0.00  multiply_vector
  8.89      1.92     0.28 13861875     0.00     0.00  rayRectangularIntersection
  6.67      2.13     0.21 13861875     0.00     0.00  raySphereIntersection
  6.67      2.34     0.21 10598450     0.00     0.00  normalize
  6.35      2.54     0.20 17836094     0.00     0.00  add_vector
  4.76      2.69     0.15  4620625     0.00     0.00  ray_hit_object
  3.34      2.79     0.11 17821809     0.00     0.00  cross_product
  2.86      2.88     0.09  1048576     0.00     0.00  ray_color
  2.54      2.96     0.08  2110576     0.00     0.00  compute_specular_diffuse
  1.27      3.00     0.04  4221152     0.00     0.00  multiply_vectors
  1.27      3.04     0.04  1048576     0.00     0.00  rayConstruction
  1.27      3.08     0.04        1     0.04     3.15  raytracing
  0.64      3.10     0.02  2110576     0.00     0.00  localColor
  0.32      3.11     0.01  3838091     0.00     0.00  length

I also observed the program with perf, which I learned about in the first homework:

 Performance counter stats for './raytracing' (5 runs):

           954,195      cache-misses              #   49.638 % of all cache refs    
         2,062,377      cache-references                                            
    33,500,782,366      instructions              #    2.04  insns per cycle        
    16,585,243,176      cycles                                                      

       5.220092593 seconds time elapsed                                          ( +-  0.23% )

Modifying the code

dot_product

Loop unrolling: expand the loop in the code by hand to speed up execution.

static inline
double dot_product(const double *v1, const double *v2)
{
    double dp = 0.0;
    dp += v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
    return dp;
}

Result:

# Rendering scene
Done!
Execution time of raytracing() : 4.807088 sec

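Compared with the previous run, 5.193626 - 4.807088 = 0.386538 seconds, so judging by the result alone the program clearly got faster.
What about gprof? Putting the two dot_product rows side by side (before, then after), the self time dropped by about 0.39 seconds (0.90 to 0.51), which confirms that this change really does speed things up.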

 time   seconds   seconds    calls   s/call   s/call  name   
 28.59      0.90     0.90 69646433     0.00     0.00  dot_product
 17.72      1.09     0.51 69646433     0.00     0.00  dot_product
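
To make the transformation concrete, here is a minimal sketch that places a loop-based dot product next to the unrolled form. The loop version and both function names are assumptions for illustration; the pre-modification code is not shown in this note.

```c
/* Assumed pre-modification form (hypothetical name): a 3-iteration loop. */
static inline double dot_product_loop(const double *v1, const double *v2)
{
    double dp = 0.0;
    for (int i = 0; i < 3; i++)
        dp += v1[i] * v2[i];
    return dp;
}

/* Unrolled form (hypothetical name): the same three products summed
   directly, with no loop counter, comparison, or branch. */
static inline double dot_product_unrolled(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}
```

Both functions compute the same sum. The Makefile in this note builds with -O0, where the compiler does not unroll loops on its own, which is presumably why the manual rewrite shows up in the profile.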

subtract_vector

This one is also modified with loop unrolling:

# Rendering scene
Done!
Execution time of raytracing() : 4.579059 sec

gprof also shows the self time dropping from 0.45 to 0.33 seconds:

 time   seconds   seconds    calls   s/call   s/call  name   
 14.29      1.35     0.45 56956357     0.00     0.00  subtract_vector
 12.41      1.10     0.33 56956357     0.00     0.00  subtract_vector
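
The modified subtract_vector is not shown above. A minimal sketch of what the unrolled version might look like, assuming the function takes two input vectors and writes the difference into a third (the signature is an assumption):

```c
/* Assumed signature: out = a - b for 3-component vectors.
   Each component is written out explicitly instead of looping over i. */
static inline void subtract_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}
```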

Skipping the rest of the loop unrolling steps, here is the summary after unrolling everything:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 21.70      0.49     0.49 69646433     0.00     0.00  dot_product
 13.73      0.80     0.31 13861875     0.00     0.00  rayRectangularIntersection
 11.51      1.06     0.26 13861875     0.00     0.00  raySphereIntersection
  9.74      1.28     0.22 10598450     0.00     0.00  normalize
  8.41      1.47     0.19 56956357     0.00     0.00  subtract_vector
  7.08      1.63     0.16 31410180     0.00     0.00  multiply_vector
  6.64      1.78     0.15 17821809     0.00     0.00  cross_product
  4.43      1.88     0.10 17836094     0.00     0.00  add_vector

OpenMP

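Here I referred to other classmates' approaches and to this article's explanation, which gave me a rough idea of how to write it.

First, think about which variables must be independent in each thread:

idx_stack stk: looks like a stack used for storing things, so each thread should have its own
d: this parameter is passed to rayConstruction and ray_color, so it is presumably not a fixed value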
object_color:

First, include the header:

#include <omp.h>

Then add this directive right before the for loop:

#pragma omp parallel for num_threads(THREAD_NUM) private(stk, d, object_color)
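
The effect of the private clause can be sketched with a small stand-alone loop. Everything here except the pragma itself is hypothetical: `render`, `shade_pixel`, `WIDTH`, `HEIGHT`, and the value of `THREAD_NUM` are stand-ins chosen for illustration, not the homework's actual code.

```c
#define WIDTH 16        /* hypothetical image size for the sketch */
#define HEIGHT 16
#define THREAD_NUM 4    /* assumed value; the note does not show its definition */

/* Hypothetical stand-in for the per-pixel work (rayConstruction + ray_color). */
static void shade_pixel(int i, int j, double d[3], double object_color[3])
{
    d[0] = i;                       /* per-pixel scratch data */
    d[1] = j;
    d[2] = 1.0;
    object_color[0] = d[0] / WIDTH;
    object_color[1] = d[1] / HEIGHT;
    object_color[2] = 0.5;
}

/* Fills a WIDTH * HEIGHT * 3 byte buffer; rows are divided among threads. */
void render(unsigned char *pixels)
{
    /* Declared outside the loop, hence the need for a private clause. */
    double d[3], object_color[3];

    /* private(...) gives every thread its own uninitialized copy of these
       arrays, so threads do not overwrite each other's scratch data. */
    #pragma omp parallel for num_threads(THREAD_NUM) private(d, object_color)
    for (int j = 0; j < HEIGHT; j++) {
        for (int i = 0; i < WIDTH; i++) {
            shade_pixel(i, j, d, object_color);
            for (int c = 0; c < 3; c++)
                pixels[(j * WIDTH + i) * 3 + c] =
                    (unsigned char)(object_color[c] * 255.0);
        }
    }
}
```

Each row writes to a disjoint part of `pixels`, so the output buffer itself needs no protection. The sketch calls no omp_* runtime functions, so it compiles with or without -fopenmp; without the flag the pragma is ignored and the loop runs serially.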

And modify the Makefile:

CFLAGS = \
    -std=gnu99 -Wall -O0 -g -fopenmp
LDFLAGS = \
    -lm -fopenmp

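Remember to add the -fopenmp flag. The classmate I referred to wrote:

Finally, remember to #include<omp.h> and add -fopenp to the compile options

There is a typo here :P (it should be -fopenmp)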

Now for the results.

Without OpenMP:

# Rendering scene
Done!
Execution time of raytracing() : 4.038518 sec

With OpenMP (4 threads):

# Rendering scene
Done!
Execution time of raytracing() : 8.254940 sec

Huh, why did the time go up? Fortunately we have gprof to look at.

It shows the execution time dropping from 1.95 to 1.92 seconds.

That does not look very good. I am not sure whether it is because there are too few threads,
so next I try 16 threads:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 16.31      0.15     0.15   988837     0.00     0.00  raySphereIntersection
 11.96      0.26     0.11  4444684     0.00     0.00  dot_product
  9.79      0.35     0.09   372860     0.00     0.00  ray_hit_object
  9.24      0.44     0.09  3957736     0.00     0.00  subtract_vector
  8.70      0.52     0.08  1062330     0.00     0.00  cross_product
  8.70      0.60     0.08   912758     0.00     0.00  rayRectangularIntersection
  7.61      0.67     0.07   673090     0.00     0.00  normalize
  5.44      0.72     0.05    61496     0.00     0.01  ray_color
  4.35      0.76     0.04  2129702     0.00     0.00  multiply_vector
  4.35      0.80     0.04  1099567     0.00     0.00  add_vector
  3.26      0.83     0.03   282499     0.00     0.00  length
  2.18      0.85     0.02    69869     0.00     0.00  rayConstruction
  2.18      0.87     0.02        1    20.01   920.55  raytracing
  1.63      0.88     0.02   226005     0.00     0.00  multiply_vectors
  1.09      0.89     0.01   159710     0.00     0.00  compute_specular_diffuse
  1.09      0.90     0.01   155301     0.00     0.00  localColor
  1.09      0.91     0.01    91039     0.00     0.00  refraction
  1.09      0.92     0.01    89899     0.00     0.00  protect_color_overflow
  0.00      0.92     0.00   188053     0.00     0.00  idx_stack_empty
  0.00      0.92     0.00   153272     0.00     0.00  idx_stack_top
  0.00      0.92     0.00    82018     0.00     0.00  reflection
  0.00      0.92     0.00    80465     0.00     0.00  idx_stack_push
  0.00      0.92     0.00    69209     0.00     0.00  idx_stack_init
  0.00      0.92     0.00     9361     0.00     0.00  fresnel
  0.00      0.92     0.00     3378     0.00     0.00  idx_stack_pop
  0.00      0.92     0.00        3     0.00     0.00  append_rectangular
  0.00      0.92     0.00        3     0.00     0.00  append_sphere
  0.00      0.92     0.00        2     0.00     0.00  append_light
  0.00      0.92     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      0.92     0.00        1     0.00     0.00  delete_light_list
  0.00      0.92     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      0.92     0.00        1     0.00     0.00  delete_sphere_list
  0.00      0.92     0.00        1     0.00     0.00  diff_in_second
  0.00      0.92     0.00        1     0.00     0.00  write_to_ppm

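The final cumulative time is 0.92 seconds, so the time really did drop. Next, compare how the number of threads affects the execution time (thread counts are powers of 4):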

Thread Number	Execute Time (sec)
No OpenMP	2.234710
4	1.203826
16	0.907454
64	0.960009
256	0.938343
1024	1.029779
4096	1.053434
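
To read the table as speedups relative to the run without OpenMP, each time can be divided into the baseline. A small sketch of that arithmetic follows; the helper names `speedup` and `fastest_index` are hypothetical, and the timings are copied from the table.

```c
/* Speedup of a run relative to the baseline (both in seconds). */
static double speedup(double baseline_sec, double run_sec)
{
    return baseline_sec / run_sec;
}

/* Timings copied from the table; index 0 is the run without OpenMP
   (its thread count is recorded as 0 here as a placeholder). */
static const int    thread_counts[] = {0, 4, 16, 64, 256, 1024, 4096};
static const double exec_sec[]      = {2.234710, 1.203826, 0.907454, 0.960009,
                                       0.938343, 1.029779, 1.053434};

/* Returns the index of the fastest configuration in exec_sec. */
static int fastest_index(void)
{
    int best = 0;
    for (int k = 1; k < 7; k++)
        if (exec_sec[k] < exec_sec[best])
            best = k;
    return best;
}
```

With these numbers, `speedup(exec_sec[0], exec_sec[1])` is roughly 1.86 for 4 threads and `speedup(exec_sec[0], exec_sec[2])` is roughly 2.46 for 16 threads, which is the fastest row; the larger thread counts do not improve on it.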