2016q3 Homework 2 (raytracing)

contributed by <ChienChing>

Settings

CPU : Intel® Core™ i5-5200U
L1d cache： 32K
L1i cache： 32K
L2 cache： 256K
L3 cache： 3072K
Operating System : Ubuntu 16.04 LTS

$ sudo apt-get update
$ sudo apt-get install graphviz
$ sudo apt-get install imagemagick

$ git clone https://github.com/Leonardcclin/raytracing 
$ cd raytracing
$ make
$ ./raytracing

可正常執行

$ ./raytracing 

Execution time of raytracing() : 2.928651 sec

$ make check

# Rendering scene
Done!
Execution time of raytracing() : 2.908627 sec
Verified OK

不知道從哪開始，那就從Makefile下手好了

Makefile

ifeq ($(strip $(PROFILE)),1)
PROF_FLAGS = -pg
CFLAGS += $(PROF_FLAGS)
LDFLAGS += $(PROF_FLAGS) 
endif

以這段來看應該是有沒有PROFILE這變數的話會有不同結果，再來看看main.c

取timestamp的start與end中間夾著 raytracing(…)這函數，那這應該就是我們要優化的目標了!

使用PROFILE=1在編譯一次

make clear
make PROFILE=1
./raytracing

結果

# Rendering scene
Done!
Execution time of raytracing() : 6.474614 sec

時間久了不少

make check 

# Rendering scene
Done!
Execution time of raytracing() : 6.390476 sec
Verified OK

作業2 說明影片

看了影片後才知道原來make時加入PROFILE=1 是要使用gprof去記錄追蹤functions的執行時間

gprof屬於GNU toolchain之一，可以了解程式中functions之間的相關性以及可追蹤functions的消耗時間

make clear
make PROFILE=1
./raytracing
gprof ./raytracing | less

可閱讀raytracing中的function執行時間、次數…等等
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 26.92      0.60     0.60 69646433     0.00     0.00  dot_product
 21.09      1.07     0.47 56956357     0.00     0.00  subtract_vector
 11.89      1.34     0.27 31410180     0.00     0.00  multiply_vector
  7.18      1.50     0.16 13861875     0.00     0.00  rayRectangularIntersection
  6.28      1.64     0.14 17836094     0.00     0.00  add_vector
  5.38      1.76     0.12 17821809     0.00     0.00  cross_product
  5.38      1.88     0.12 13861875     0.00     0.00  raySphereIntersection
  5.38      2.00     0.12  4620625     0.00     0.00  ray_hit_object
  3.14      2.07     0.07 10598450     0.00     0.00  normalize
  1.35      2.10     0.03  1048576     0.00     0.00  rayConstruction
  1.35      2.13     0.03  1048576     0.00     0.00  ray_color
  1.12      2.15     0.03  4221152     0.00     0.00  multiply_vectors
  1.12      2.18     0.03  2520791     0.00     0.00  idx_stack_top
  0.90      2.20     0.02  2110576     0.00     0.00  compute_specular_diffuse
  0.45      2.21     0.01  3838091     0.00     0.00  length
  0.45      2.22     0.01   113297     0.00     0.00  fresnel
  0.45      2.23     0.01        1     0.01     2.23  raytracing
  0.22      2.23     0.01    37595     0.00     0.00  idx_stack_pop
  0.00      2.23     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      2.23     0.00  2110576     0.00     0.00  localColor
  0.00      2.23     0.00  1241598     0.00     0.00  protect_color_overflow
  0.00      2.23     0.00  1241598     0.00     0.00  reflection
  0.00      2.23     0.00  1241598     0.00     0.00  refraction
  0.00      2.23     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      2.23     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      2.23     0.00        3     0.00     0.00  append_rectangular
  0.00      2.23     0.00        3     0.00     0.00  append_sphere
  0.00      2.23     0.00        2     0.00     0.00  append_light
  0.00      2.23     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      2.23     0.00        1     0.00     0.00  delete_light_list
  0.00      2.23     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      2.23     0.00        1     0.00     0.00  delete_sphere_list

可以看得出來 "dot_product" 做了非常的多次，可以先試試看

loop unrolling

math-toolkit.h

double dot_product(const double *v1, const double *v2)
{
	 double dp = 0.0;
	 for (int i = 0; i < 3; i++)
		 dp += v1[i] * v2[i];
	 return dp;
}

unrolling的dot_product(..)

double dot_product(const double *v1, const double *v2)
{
	 double dp = 0.0;
	 dp = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
	 return dp;
}

$ ./raytracing 結果如預期的減少了，大約3秒左右

# Rendering scene
Done!
Execution time of raytracing() : 6.138969 sec

$ make check 也是減少了execution time

# Rendering scene
Done!
Execution time of raytracing() : 6.144696 sec
Verified OK

使用gprof來觀看結果

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.51      0.45     0.45 56956357     0.00     0.00  subtract_vector
 12.51      0.70     0.25 69646433     0.00     0.00  dot_product
 11.01      0.92     0.22 17836094     0.00     0.00  add_vector
  9.26      1.11     0.19 13861875     0.00     0.00  rayRectangularIntersection
  8.00      1.27     0.16 17821809     0.00     0.00  cross_product

dot_product降至第二名了，所以現在可以來看看第一名的subtract_vector

 void subtract_vector(const double *a, const double *b, double *out)
 {
	  for (int i = 0; i < 3; i++)                                             
		  out[i] = a[i] - b[i];
 }

unrolling again

void subtract_vector(const double *a, const double *b, double *out)
{
	  //unrolling version
	  out[0] = a[0] - b[0];
	  out[1] = a[1] - b[1];
	  out[2] = a[2] - b[2];                                                 
}

重新跑一跑

./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 5.797442 sec


make check 
# Rendering scene
Done!
Execution time of raytracing() : 5.766265 sec
Verified OK

又降低了一些時間，於是把可以簡單解決的都弄一弄

./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 5.539355 sec

make check 
# Rendering scene
Done!
Execution time of raytracing() : 5.596634 sec
Verified OK

到這邊不免感覺到雖然只是在做一些很廢的事情，避免了很多的branch發生，但時間的確降低了。

Pthread

AVX SIMD

SIMD (Single Instruction Multiply data)

OpenMP

Wiki

OpenMP（Open Multi-Processing）是一套支援跨平台共享記憶體方式的多執行緒並行的編程API，使用C,C++和Fortran語言，可以在大多數的處理器體系和作業系統中執行，包括Solaris, AIX, HP-UX, GNU/Linux, Mac OS X, 和Microsoft Windows。包括一套編譯器指令、庫和一些能夠影響執行行為的環境變數。

OpenMP採用可移植的、可延伸的模型，為程式設計師提供了一個簡單而靈活的開發平台，從標準桌面電腦到超級電腦的並列應用程式介面。

簡易的程式平行化方法－OpenMP 簡介
實際上要去控制 thread 是滿麻煩的。在程式的編寫上，也會複雜不少；而如果我們只是想要把一些簡單的迴圈平行化處理，用 thread library 來控制，實在有點殺雞用牛刀的感覺。這時候，用 Open MP 就簡單多了！OpenMP 是一種能透過高階指令，很簡單地將程式平行化、多執行緒化的 API；在最簡單的情形，甚至可以只加一行指令，就可以將迴圈內的程式平行化處理了

Example 1

先寫個Hello world小程式來跑跑看

#include<stdio.h>
#include<omp.h>

int main(){
	#pragma omp parallel
	{
		int tid = omp_get_thread_num();
		printf("Hello world from thread :%d\n",tid);
	}
	return 0;
}

$ gcc -fopenmp test_openmp.c test_openmp
$./test_openmp

輸出

Hello world from thread :1
Hello world from thread :3
Hello world from thread :2
Hello world from thread :0

使用4個threads去跑這個小程式!

嘗試著去使用openmp，將以下這段貼到raytracing的某個迴圈上面

#pragma omp parallel num_threads(8)

結果如下，爆炸的久…

./raytracing 
# Rendering scene
Done!
Execution time of raytracing() : 56.723064 sec

將math-toolkit.h中的迴圈都改成openmp的寫法結果超爛…

[請益]有沒有用了threads後反而效果更差的八卦?

使用openm後

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 26.14      0.70     0.70 69646433     0.00     0.00  dot_product
 19.04      1.21     0.51 56956357     0.00     0.00  subtract_vector
 13.44      1.57     0.36 31410180     0.00     0.00  multiply_vector
  9.33      1.82     0.25 13861875     0.00     0.00  rayRectangularIntersection

–

Execution time of raytracing() : 6.574827 sec

 Performance counter stats for './raytracing':

		   233,699      cache-misses              #   28.042 % of all cache refs    
		   833,403      cache-references                                            
	31,686,595,460      instructions              #    1.79  insns per cycle        
	17,672,617,401      cycles                                                      

	   6.576436610 seconds time elapsed

使用openm前

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 25.91      0.65     0.65 69646433     0.00     0.00  dot_product
 17.14      1.08     0.43 56956357     0.00     0.00  subtract_vector
 14.95      1.46     0.38 31410180     0.00     0.00  multiply_vector
  9.17      1.69     0.23 13861875     0.00     0.00  rayRectangularIntersection

–

Execution time of raytracing() : 6.696912 sec

 Performance counter stats for './raytracing':

		   676,755      cache-misses              #   41.376 % of all cache refs    
		 1,635,633      cache-references                                            
	34,241,796,991      instructions              #    1.91  insns per cycle        
	17,919,766,513      cycles                                                      

	   6.699495044 seconds time elapsed

Force inline

inline只是提示compiler可以將他最佳化唷，但他可能不會理你

所以就可以改成以下形式叫compiler一定要做

將math-toolkit.c內經常使用到的functions前面改成這樣

inline __attribute__((always_inline))
double func(){}

會告訴compiler將此函數直接嵌入於程式碼，也就是可以減少call function的成本，結果如下

# Rendering scene
Done!
Execution time of raytracing() : 3.896639 sec

大大的減少了，這才是有功用的啊!