2016q3 Homework 1 (raytracing)

contributed by <ChenYi>

Reviewed by `nekoneko`

可以明確指出在dot_product使用AVX指令，時間沒有減少的原因
由compute-pi的程式碼和Intel Optimized文件，分析dot_prodect是否值得使用AVX加速，若不適合，能指出程式有沒有其他部份可用AVX加速，並實做。

開發環境

Destribution: Ubuntu 14.04.5 LTS

CPU : Intel® Xeon® CPU E3-1231 v3 @ 3.40GHz

L1d 快取：32K
L1i 快取：32K
L2 快取：256K
L3 快取：8192K

避免用圖片表示程式輸出的文字訊息，請愛用 command line jserv
好的謝謝提醒等等修正 ChenYi

準備工作

弄好環境,clone下來

sudo apt-get update
sudo apt-get install graphviz
sudo apt-get install imagemagick
sudo apt-get install gprof
git clone https://github.com/sysprog21/raytracing

gprof

gprof是個可以拿來看程式執行時，跑過哪些function和執行比率的一個實用工具。
從這些資訊中，我們可以觀察到哪些function特別花時間，和被呼叫特別多次，藉由這些東西，可以針對特定function去做改善，因而得到更好的表現。
以raytracing-origin為例，make時得先立個flag

make PROFILE=1

其對應到的指令是-pg (在Makefile裡)

ifeq ($(strip $(PROFILE)),1)
PROF_FLAGS = -pg   <======== this
CFLAGS += $(PROF_FLAGS)
LDFLAGS += $(PROF_FLAGS)
endif

目的是程式執行時，產生一個gmon.out，以供gprof使用

gprof2dot

在Ping-Ling's Hackpad裡看到的，可以方便觀看程式的執行脈絡
因為是輸出成圖片，比單純使用gprof 好觀看很多

Origin

執行程式 & gprof

因為直接gprof會很醜,所以排版一下(less)

./raytracing
gprof  ./raytracing | less

Raytracing結果圖

用上個作業學到的perf來看看cache misses如何

Execution time of raytracing() : 3.315326 sec

 Performance counter stats for './raytracing':

            22,843      cache-misses              #   18.615 % of all cache refs    
           122,711      cache-references                                            
         3,238,239      L1-dcache-load-misses                                       
   <not supported>      L1-dcache-store-misses   
   <not supported>      L1-dcache-prefetch-misses
           151,876      L1-icache-load-misses                                       

       3.317456277 seconds time elapsed

感覺問題不是出在cahce-misses , 用gprof看一下時間花在哪

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 25.49      0.67     0.67 69646433     0.00     0.00  dot_product
 19.02      1.17     0.50 56956357     0.00     0.00  subtract_vector
 10.27      1.44     0.27 13861875     0.00     0.00  rayRectangularIntersection
  9.51      1.69     0.25 10598450     0.00     0.00  normalize
  6.85      1.87     0.18 31410180     0.00     0.00  multiply_vector

從這邊可以看到佔用次數前幾名分別為dot_product(),subtract_vector(),rayRectangularIntersection()

因為raytracing必須經常計算向量入射與反射角及其投影，所以可以看到在這邊dot的使用率佔用了全部次數中的25.49%
subtract_vector()同上
有趣的是rayRectangularIntersection()為什麼會這麼高呢？
- 我推測是因為圖中3個矩型佔用的面積最大 ==> 吃到最多的射線

這邊利用gprof2dot圖像化一下方便觀察

愈篇藍色就愈不重要(這邊是指即使改進也不會增加太大的效能)
所以我們要關注的便是顏色比較不一樣的那幾個

刪除不必要的分支

來做點最基礎的最佳化吧！
已知向量相加減，以vec3為例是

v1 + v2 ===> (v1.x+v2.x) , (v1.y+v2.y) , (v1.z+v2.z)

那麼程式碼中的for其實是不需要用到的(loop unrolling)

for (int i = 0; i < 3; i++)
    out[i] = a[i] + b[i];	
//change to
out[0] = a[0] + b[0];
out[1] = a[1] + b[1];
out[2] = a[2] + b[2];

嘗試去節省sub/add跑for還要比對的時間結果

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 29.38      0.64     0.64 69646433     0.00     0.00  dot_product
 12.85      0.92     0.28 13861875     0.00     0.00  rayRectangularIntersection
  8.03      1.10     0.18 10598450     0.00     0.00  normalize
  7.80      1.27     0.17 13861875     0.00     0.00  raySphereIntersection
  7.34      1.43     0.16 31410180     0.00     0.00  multiply_vector
  7.34      1.59     0.16  4620625     0.00     0.00  ray_hit_object
  6.43      1.73     0.14 17821809     0.00     0.00  cross_product
  4.59      1.83     0.10 56956357     0.00     0.00  subtract_vector
  4.13      1.92     0.09  1048576     0.00     0.00  ray_color

sub跑到後面了，代表著佔用時間大幅下降　==> 可行阿！！
改掉dot_product() , multi_vectors(…) , multi_vector(…)

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 17.20      0.33     0.33 69646433     0.00     0.00  dot_product
 13.55      0.59     0.26 10598450     0.00     0.00  normalize
 11.73      0.82     0.23 13861875     0.00     0.00  raySphereIntersection
 11.20      1.03     0.22 56956357     0.00     0.00  subtract_vector
 10.94      1.24     0.21 13861875     0.00     0.00  rayRectangularIntersection

分散變平均了，cumulative time也下降了不少

Total execution time

用Gnuplot畫一下

Origin:        	Execution time of raytracing() : 4.368682 sec
Remove branch:	Execution time of raytracing() : 3.399409 sec

用「表格」來比較，醒目 jserv
感謝老師提醒順便請問一下表格該怎麼做比較好chenyi

OpenMP

嘗試一下使用OpenMP來做平行化
修改substract_vector(…)

#pragma omp parallel sections
{
    #pragma omp section
    {
        out[0] = a[0] - b[0];
    }

    #pragma omp section
    {
        out[1] = a[1] - b[1];
    }
    #pragma omp section
    {
        out[2] = a[2] - b[2];
    }
}

結果

Rendering scene
Done!
Execution time of raytracing() : 67.389720 sec

恩… 我思考了很久也看過春季班同學的hackpad[4]了，實在想不出什麼原因
不過就結果而言，我的結果確實是失敗了
後來在其他人的hackpad上看到
對raytacing(…)裡的三層for做平行化的話，其結果是可行
參考TempleJiJi's hackpad
怎麼做的就不放上來了，因為我這邊只有試跑而已
在我的實機上的測試結果如下 (16 threads)

# Rendering scene
Done!
Execution time of raytracing() : 0.391119 sec

loop unrolling版的為1.589252 sec

這邊的測試都有關掉-pg，我的實機測試如果沒關的話，使用OpenMP反而是會比較慢的

大致上想一下為什麼在三層for裡會成功，而在math_toolkit.h裡用section做會失敗

CPU資源的配置不是免費的！
進section時切成多個threads，出sections後回歸main thread
又dot_product()執行次數及section內容不值得這麼做
在配置資源的耗損上，應該是遠大於平行化所帶來的效益，所以才會出現60秒這種恐怖的數字

以圖示大概是如此

圖片來源(https://computing.llnl.gov/tutorials/openMP/)

SIMD instructions

查一下自己CPU是否支援　（我的規格）
SSE支援到4.2 AVX支援到2.0
看了一下SSE是128bits AVX則是256bits
這次作業code裏面是用double , 所以這邊選擇使用AVX
把dot_product(…) 裡的換掉

static inline __attribute__((always_inline)) 
double dot_product(const double *v1, const double *v2)
{
    double tmp[4];
    register __m256d reg1 , reg2 , out_reg;
    reg1 = _mm256_set_pd(0.0 ,v1[2] , v1[1] , v1[0]);
    reg2 = _mm256_set_pd(0.0 ,v2[2] , v2[1] , v2[0]);
	
    out_reg = _mm256_shuffle_pd(reg1 , reg2 , _MM_SHUFFLE(0,1,2,3));

    _mm256_store_pd(tmp , out_reg);

    return tmp[0]+tmp[1]+tmp[2];
}

在編譯上要記得加個 -mavx

執行花了2.612381 秒
居然比原版的還要慢！
想了一下，慢的點應該出在__mmstore_ps()，這個用途是在於把資料搬回記憶體上
但原版的方式為不斷地累加，正常安排暫存器的話大概會是全放在同一個暫存器上
所以應該會比不斷store的store還來的快一點，詳細情形有時間去找老師問問看好了
在Intel Intrinsics Guide裡的Operation

MEM[mem_addr+255:mem_addr] := a[255:0]

總結

各個最佳化方式的比較圖

從圖中可以看到最快的是對for loop平行化的 OpenMP+loop unrolling，比起原本的程式碼快了有5倍之多。令人訝異的是沒想到AVX+loop unrolling會比原版還慢這點。事實上我還看了一下春季班同學們的筆記，同樣做平行化，pthread的方式又會比OpenMP快了一些，畢竟OpenMP使用上還有不少限制，有時間的話在回頭來做作看pthread的實驗。

題外話

其實向量的運算的話，有不少library可用，像glm , bullet原生的等等。

2016q3 Homework 1 (raytracing)

Reviewed by `nekoneko`

開發環境

準備工作

gprof

gprof2dot

Origin

這邊利用gprof2dot圖像化一下方便觀察

刪除不必要的分支

OpenMP

SIMD instructions

總結

題外話

References

tags: `fzzzz` 'sysprog2016'

2016q3 Homework 1 (raytracing)

Reviewed by nekoneko

開發環境

準備工作

gprof

gprof2dot

Origin

這邊利用gprof2dot圖像化一下方便觀察

刪除不必要的分支

OpenMP

SIMD instructions

總結

題外話

References

tags: fzzzz 'sysprog2016'

Reviewed by `nekoneko`

tags: `fzzzz` 'sysprog2016'