2016q3 Homework1(raytracing)

tags: `tundergod` `hw1` `2016q3`

contributed by <tundergod>

作業要求

在 GitHub 上 fork raytracing，並思考如何改善程式效能
以 make PROFILE=1 重新編譯程式碼，並且學習 gprof
- 參考使用 GNU gprof 進行 Linux 平台的程式分析
以 gprof 指出效能瓶頸，並且著手改寫檔案 math-toolkit.h 在內的函式實做，充分紀錄效能差異在共筆
- 注意: 請勿更動編譯器最佳化選項 -O0 (關閉最佳化)
- 檔案 math-toolkit.h 定義的若干數學操作函式很重要，可參考 Investigating SSE cross product performance、MathFu 原始程式碼，以及 2015q3 Homework #1 Ext
可善用 POSIX Thread, OpenMP, software pipelining, 以及 loop unrolling 一類的技巧來加速程式運作
將你的觀察、分析，以及各式效能改善過程，並善用 gnuplot 製圖，紀錄於「作業區」

開發環境

Linux 版本: Ubuntu 16.04 LTS
硬體資訊：

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 69
Model name:            Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
Stepping:              1
CPU MHz:               1661.660
CPU max MHz:           2600.0000
CPU min MHz:           800.0000
BogoMIPS:              4589.54
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3

初始資料

執行時間:(make PROFILE=1)

# Rendering scene
Done!
Execution time of raytracing() : 9.463558 sec

gprof 觀察:
- 發現math-toolkit.h中的dot_product , subtract_vector , multiply_vector 這三個function是最耗費時間的，佔了總執行時間將近50%,而單單是math-toolkit.h中函數的被呼叫次數高達約2億次

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.98      0.99     0.99 69646433     0.00     0.00  dot_product
 17.64      1.75     0.76 56956357     0.00     0.00  subtract_vector
  9.87      2.18     0.43 13861875     0.00     0.00  rayRectangularIntersection
  9.29      2.58     0.40 31410180     0.00     0.00  multiply_vector
  7.43      2.90     0.32 10598450     0.00     0.00  normalize
  5.92      3.15     0.26 13861875     0.00     0.00  raySphereIntersection
  5.69      3.40     0.25 17836094     0.00     0.00  add_vector
  5.57      3.64     0.24  4620625     0.00     0.00  ray_hit_object
  4.41      3.83     0.19 17821809     0.00     0.00  cross_product
  3.71      3.99     0.16  1048576     0.00     0.00  ray_color
  2.79      4.11     0.12  4221152     0.00     0.00  multiply_vectors
  1.86      4.19     0.08        1     0.08     4.31  raytracing
  0.70      4.22     0.03  2520791     0.00     0.00  idx_stack_top
  0.70      4.25     0.03  1048576     0.00     0.00  rayConstruction
  0.58      4.27     0.03  3838091     0.00     0.00  length
  0.46      4.29     0.02  2110576     0.00     0.00  compute_specular_diffuse
  0.46      4.31     0.02  1241598     0.00     0.00  refraction
  0.00      4.31     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      4.31     0.00  2110576     0.00     0.00  localColor
  0.00      4.31     0.00  1241598     0.00     0.00  protect_color_overflow
  0.00      4.31     0.00  1241598     0.00     0.00  reflection
  0.00      4.31     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      4.31     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      4.31     0.00   113297     0.00     0.00  fresnel
  0.00      4.31     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      4.31     0.00        3     0.00     0.00  append_rectangular
  0.00      4.31     0.00        3     0.00     0.00  append_sphere
  0.00      4.31     0.00        2     0.00     0.00  append_light
  0.00      4.31     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      4.31     0.00        1     0.00     0.00  delete_light_list
  0.00      4.31     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      4.31     0.00        1     0.00     0.00  delete_sphere_list
  0.00      4.31     0.00        1     0.00     0.00  diff_in_second
  0.00      4.31     0.00        1     0.00     0.00  write_to_ppm

優化

Loop Unrooling

在math-toolkit.h中的各個function多是以for迴圈實作，可以利用循環展開的方式加快程式的執行速度(並行執行)。
循環展開的缺點則是犧牲了程式的排版與閱讀性，代碼行數倍增。
以dot_product爲例:

static inline
double dot_product(const double *v1, const double *v2)
{
	return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}

Loop Unrooling的展開原理(利用輸出組合語言觀察)可以查看之前的公筆

Force Inline

inline是向編譯器提出“建議”，把inline的函數在函數位置直接展開（減輕系統負擔(overhead)），編譯器會對比兩者執行時間後選擇執行inline與否。
在本次實作中使用inline能夠減少2億次(math-toolkit中的函式)的function call
因爲在Makefile中擁有 -O0 的指令（關閉最佳化），所以inline不會被執行，要讓inline執行必須在每一個function後面加上__attribute__((always_inline))強行讓CPU去執行inline這個“建議”

static inline __attribute__((always_inline))
double length(const double *v)
{
    return sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
}

Macro

Macro巨集跟inline有點像,都是對該段代碼進行直接替換
用法:在math-toolkit.h中直接使用

#define DOT(a,b) ((a[0]*b[0])+(a[1]*b[1])+(a[2]*b[2]))

Macro 和 Inline的差別?

Macro是在Compiler進行編譯之前就被替換,而Inline是會被Compiler看到的,且在展開時系統會自動做安全檢查或Type的轉換

去掉Assert()

這是gprof中normalize的數據:
- 它佔用了7.43%的總時間,且被呼叫了1千萬次,也就是說normalize中的一個防錯函數assert()也被執行了1千萬次,而他在正確輸出的環境下是不會用到的,可以在確定程式正確或完成作業之後把它的作用去掉
方法:
- 直接刪除
- 修改 Makefile，新增 CFLAGS=-DNDEBUG 以消除 assert 對效能的影響

7.43      2.90     0.32 10598450     0.00     0.00  normalize

OpenMP

OpenMP（Open Multi-Processing）是一套支持跨平台、共享記憶體方式的多執行緒平行運算的API

	#pragma omp parallel for num_threads(N)   \
		private(stk), private(d), private(object_color)
	for (int j = 0 ; j < 512; j++) {
		for (int i = 0; i < 512; i++) {
		.....
		}
	}

POSIX Thread

實作如下:
- 做法爲把512個column分別交給N個thread去做

typedef struct __ARG {
    uint8_t *pixels;
    color background;
    rectangular_node rectangulars;
    sphere_node spheres;
    light_node lights;
    const viewpoint *View;
    int start_j;
    point3 u, v, w, d;
} arg;

for (int j = (*data).start_j ; j < 512; j+=N) {
        for (int i = 0; i < 512; i++) {
	...
	}
}

用數量不同的thread去測試哪個效能最高:
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

最終結果

用最快的64thread去與ofast比較

# Rendering scene
Done!
Execution time of raytracing() : 0.671045 sec

圖表:

比ofast的0.674040 sec 快了一點點!

2016q3 Homework1(raytracing)

tags: tundergod hw1 2016q3

作業要求

開發環境

初始資料

優化

Loop Unrooling

Force Inline

Macro

Macro 和 Inline的差別?

去掉Assert()

OpenMP

POSIX Thread

最終結果

tags: `tundergod` `hw1` `2016q3`