2016q3 Homework1 (raytracing)

contributed by <KobeYu>
github:https://github.com/kobe0308/raytracing
作業說明: https://hackmd.io/s/B1W5AWza
reviewed by <janetwei>

可以再嘗試其他優化方法，例如 SIMD、pthread、force inline、AVX、Software Pipline
gprof是個不錯的工具,裡面的callgraph可以讓你了解各個Funcation 之間的關係
使用 gnuplot繪製出效能比較圖表

tags: `kobeyu`

目標

熟悉gprog效能分析工具
分析效能瓶頸
提出並驗證最佳化策略

執行環境 Ubuntu 16.04 @MacBook Air

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz
Stepping:              9
CPU MHz:               2796.207
CPU max MHz:           2800.0000
CPU min MHz:           800.0000
BogoMIPS:              4589.39
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

效能分析 gprof

$ make PROFILE=1
$ ./raytracing
$ gprof raytracting

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 25.19      0.71     0.71 69646433     0.00     0.00  dot_product
 17.74      1.21     0.50 56956357     0.00     0.00  subtract_vector
 12.60      1.57     0.36 13861875     0.00     0.00  rayRectangularIntersection
  9.05      1.82     0.26 31410180     0.00     0.00  multiply_vector
  6.03      1.99     0.17 17836094     0.00     0.00  add_vector
  6.03      2.16     0.17 10598450     0.00     0.00  normalize
  ...

使用perf的結果也是dot_product消耗了許多的工作週期,所以將先從dot_product開始

執行時間分析

在使用gprof的時候會嚴重影響程式執行的時間,在原始程式碼中若設定PROFILE=1執行時間是6秒鐘左右,而不設定PROFILE是三秒鐘左右,所以日後再分析時,須將此因素考量進去.

解析dot_product

這段程式是用inline宣告,在編譯時會將程式碼直接取代dot_product,好處是在呼叫時不用另外再開stack減少記憶體的使用增加效率,另外陣列中的資料似乎沒有相依性也沒有順序性,只要最後加總之後返回總和即可,所以在最佳化策略上將採用並行處理的方式,看能增進多少效能.

static inline
double dot_product(const double *v1, const double *v2)
{
	double dp = 0.0;
	for (int i = 0; i < 3; i++)
		dp += v1[i] * v2[i];
	return dp;
}

根據perf可以看到組合語言的程式片段:

最佳化策略

loop unrolling wiki

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 17.87      0.45     0.45 56956357     0.00     0.00  subtract_vector
 12.31      0.76     0.31 69646433     0.00     0.00  dot_product
 11.51      1.05     0.29 13861875     0.00     0.00  rayRectangularIntersection
  9.93      1.30     0.25 10598450     0.00     0.00  normalize
  9.93      1.55     0.25 17836094     0.00     0.00  add_vector
  8.14      1.76     0.21 31410180     0.00     0.00  multiply_vector
  7.54      1.95     0.19  4620625     0.00     0.00  ray_hit_object
  5.16      2.08     0.13 17821809     0.00     0.00  cross_product
  4.37      2.19     0.11 13861875     0.00     0.00  raySphereIntersection
  3.57      2.28     0.09  1048576     0.00     0.00  ray_color

可以看到loop unrolling可以將效能從25.19降到12.31,

並行處理 SIMD

SIMD是Single Instruction Multiple Data的縮寫,表示一行指令可以執行多個資料,在Intel CPU中MMX,SSE與AVX都是屬於SIMD的指令集,SSE兼容了MMX的指令集並持續的在演進,目前來到了SSE4.2版,而AVX則是更新的指令集架構全名是Advanced Vector Extensions.

Intel SSE指令集

Intel AVX指令集

Rererence

Intel SSE
Intel SIMD演進史
 中文介紹
 一份介紹Intel SIMD的投影片
 2016Q1同學的作業筆記
 一篇介紹使用SIMD完成點積的文章
 header file of simd intel