2016q3 Homework1 (compute-pi)

# 2016q3 Homework1 (compute-pi) contributed by <`FZzzz`> ## Reviewed by ```hugikun999``` * 關於AVX_unroll的部份，其實可以將不足16的地方額外加上去，如此便N可以不用受限於一定要用16的倍數。 * N跑的範圍或許可以小一點，從未加入信賴區間到加入後雖然有看出變平滑，但是由於區間過大(~2500000)所以改善的幅度無法很清處看到，另外可以加入比較不同loop次數對結果的差異。 * 在Leibninz的openmp4可能是因為使用的地方錯誤，將無法平行化的地方做平行，如此只會增加花費的時間。 * 可以更深入研究為何openmp在128和256時的抖動比起64之前的更加明顯 * 可以更深入研究AVX256為何不能達到4x的原因，找出瓶頸。 ## 開發環境 ``` Architecture: x86_64 CPU 作業模式： 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 每核心執行緒數：2 每通訊端核心數：4 Socket(s): 1 NUMA 節點： 1 供應商識別號： GenuineIntel CPU 家族： 6 型號： 60 製程： 3 CPU MHz： 3599.882 BogoMIPS: 6784.39 虛擬： VT-x L1d 快取： 32K L1i 快取： 32K L2 快取： 256K L3 快取： 8192K NUMA node0 CPU(s): 0-7 ``` ### 觀察程式執行並分析 make後會產生5個不同版本的程式打開makefile可以看到，在default裡根據給定不同的參數會輸出不同的 time_test_XXX 可以用這些分別的程式可以分別去測他的執行時間在看一下gencsv ```shell for i in `seq 100 200 3000 ... ... ``` 第一個值100為起始，以兩百為間距，一直跑到3000 (這邊的數字是我改過的) 這次用大一點100000到400000 用Gnuplot畫一下 ![](https://i.imgur.com/OvJv5oW.png) 跑小跑多一點的好了，想看一下抖動的情形 * 這邊的結果是25倍的時間 * 原版的並沒有使用信賴區間，所以正常應該要有一定程度的晃動，不然就是我太幸運了 * 因為 clock_gettime() , 會因為context switch而有晃動 * 用信賴區間做的話，應該是每次跑都記一次，然後算標準差，再捨去不在區間內的有了有了，超級抖動XD ![](https://i.imgur.com/J9Px8m4.png) 從信賴區間裡撈資料前，在原本的程式中我發現了一件事，clock_gettime是有算進for loop的 ```c clock_gettime(CLOCK_ID, &start); for (i = 0; i < loop; i++) { compute_pi_openmp(N, 2); } clock_gettime(CLOCK_ID, &end); ``` 這會是個問題，因為我們不想要看到for loop的時間，這會造成誤差所以改成 ```c for (i = 0; i < loop; i++) { clock_gettime(CLOCK_ID, &start); compute_pi_baseline(N); clock_gettime(CLOCK_ID, &end); datas[i] = (double) (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)/ONE_SEC; } ``` 針對每一次執行去做時間的計算，把時間存起來，以便等等運算由於我們只有取２５次，其實滿少量的，所以直接運算母體就行信賴區間公式如下(取自[王紹華的筆記](https://hackmd.io/KYQwjARgZgTArAZgLQgJwAYpICyoMYAmSAHJiEggGzzqQEIDsqYQA===#%E7%9B%B4%E6%8E%A5%E9%81%8B%E7%AE%97%E6%AF%8D%E9%AB%94)) ### 直接對母體做運算 * Lower Endpoint = avg(X) − 1.96 * σ * Upper Endpoint = avg(X) + 1.96 * σ * -σ 為母體標準差 * n 為樣本資料數套用95%信賴區間的結果圖 ![](https://i.imgur.com/McxiCj4.png) 抖動情形好多了！ Error rate　（精度到小數點下８位） ![](https://i.imgur.com/IGeTAQs.png) AVX+Unrolling的結果有點怪，找一下原因發現在for迴圈的部份,是16的倍數 ```C for (int i = 0; i <= N - 16; i += 16) ``` 而我用的級距為 250000 % 16 ！= 0，但500000是16的倍數所以圖上每兩個點會是正常的值改個級距->16000 ![](https://i.imgur.com/hQzfAIM.png) ## Leibninz formula proof ![](https://i.imgur.com/YNufMlS.png) >> 試著用 LaTeX 改寫圖片中的數學式，詳見 [MathJax basic tutorial and quick reference](http://meta.math.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference) [name=jserv] ## SIMD 介紹其實在上個作業就已經開始用了，但由於上個作業需要用到simd instruction的地方太小了（要用的話要大改struct）倒置在store的操作上會拖慢許多，不過這次就不同了，這次可以將Ｎ切成4個一組 ex: 1/1 - 1/3 + 1/5 - 1/7 (N=0 1 2 3) 這樣就可以將利用ymm來儲存資料另外有個FMA中有個指令_mm256_fmadd_pd(...)可以用參照Intel intrinsics ``` FOR j := 0 to 3 i := j*64 dst[i+63:i] := (a[i+63:i] * b[i+63:i]) + c[i+63:i] ENDFOR dst[MAX:256] := 0 ``` 可以實作出2n+1的加法講了一堆，還沒開始介紹SIMD instructionsXD 簡單來講就是這張圖 ![](https://i.imgur.com/mF03dd0.png) 平常我們再用的int double float.... 不是32bits就是64bits 而SIMD的指令允許我們把好幾個(一般是4個4*32 or 4*64)的資料放在一起做運算 ![](https://i.imgur.com/p9bbMnt.png) 這樣的方式在運算向量或一組資料上，效能上會有更顯著的提升但為什麼不會到４倍呢？理由是在放進專用與一般的空間中做切換消耗掉的時間所以正如上面所提，假設你的運算沒有很長的話，而只是單純的一個加減那麼應該是不適合使用SIMD instruction作為加速方式 :::info 下面參考連結的Intel Intrinsics Guide真的很好用！ ::: ### Leibniz方法 Leibniz的公式為 ![](https://i.imgur.com/eJ92NeP.png) 實作的話很簡單 ```c double compute_pi_leibniz(size_t n) { double sum = 0.0; for (size_t i = 0; i < n; i++) { int sign = (i % 2 == 0) ? 1 : -1; //正負號 sum += (sign / (2.0 * (double)i + 1.0)); } return sum * 4.0; } ``` Spent Time ![](https://i.imgur.com/GxbDOOm.png) Error ![](https://i.imgur.com/jLXrydm.png) 就結果來看Leibniz的表現不錯，error rate和原本的可以說沒有差太多(應該說幾乎一樣才對XD) 不過OpenMP 4 threads的又不知道為什麼高於baseline了ˊ~ˋ >> 這個觀察很重要，你可以試著變更測試標的，像是其他數學式，然後再對照 OpenMP 在多個 thread 的行為 [name=jserv] >> 好的我會試試看 [name=ChenYi] ### OpenMP行為分析 (這部份我試著做看看，以經過死限了ＱＱ) 先附上我Leibniz方法的OpenMP的程式碼 ```C double compute_pi_leibniz_openmp(size_t n ,int threads) { double sum = 0.0; int sign = 0; #pragma omp parallel num_threads(threads) { #pragma omp parallel for private(sign) , reduction(+:sum) for (size_t i = 0; i < n; i++) { sign = (i % 2 == 0) ? 1 : -1; sum += (sign / (2.0 * (double)i + 1.0)); } } return sum * 4.0; } ``` 在附上不同thread數量的結果圖 ![](https://i.imgur.com/gyylQdM.png) 全都高於Leibniz方法的baseline 恩~好問題有種跟raytracing功課遇到的問題一樣的感覺 [32 OpenMP Traps For C++ Developers](http://www.viva64.com/en/a/0054/) 先留著做紀錄，我猜應該是第五個，結果真的是ＸＤＤ誤植parallel會導致多執行threads*(N-1)次運算 from上面連結修正後 ![](https://i.imgur.com/JBYIw8a.png) 倒是64上去後就差不多了，不過抖動情形令人很在意，晚點在查查修正後程式碼 ```C double compute_pi_leibniz_openmp(size_t n ,int threads) { double sum = 0.0; int sign = 0; #pragma omp parallel num_threads(threads) { #pragma omp parallel for private(sign) , reduction(+:sum) for (size_t i = 0; i < n; i++) { sign = (i % 2 == 0) ? 1 : -1; sum += (sign / (2.0 * (double)i + 1.0)); } } return sum * 4.0; } ``` 測試上述網站的正確性 ```C= #include <stdio.h> #include <stdlib.h> #include <omp.h> void myFunc(void) { printf("HI\n"); return; } int main(void) { #pragma omp parallel num_threads(2) { #pragma omp parallel for for(int i=0;i<5;i++) { myFunc(); } } printf("none parallel\n"); #pragma omp for for(int i=0;i<5;i++) { myFunc(); } return 0; } ``` ``` HI HI HI HI HI HI HI HI HI HI //10 none parallel HI HI HI HI HI ``` 上面的執行確認該網站的敘述為正確 #### Note of OpenMP Tutorial Notes: * When a thread reaches a PARALLEL directive, it creates a team of threads and becomes the master of the team. The master is a member of that team and has thread number 0 within that team. * Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code. * There is an implied barrier at the end of a parallel region. Only the master thread continues execution past this point. * If any thread terminates within a parallel region, all threads in the team will terminate, and the work done up until that point is undefined. 這次犯的錯誤可看第2點，在進入parallel region的時候，code會被複製，然後才交由各執行序去執行 ### Reference * [作業連結](https://hackmd.io/s/rJARexQT) * [廖健富學長的筆記](https://hackpad.com/Hw1-Extcompute_pi-geXUeYjdv1I) * [Gnuplot Demo Script](http://gnuplot.sourceforge.net/demo/simple.html) * [Why clcok_gettime() so erratic?](http://stackoverflow.com/questions/6814792/why-is-clock-gettime-so-erratic) * [Leibniz formula for π](https://en.wikipedia.org/wiki/Leibniz_formula_for_%CF%80) * [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2,FMA&expand=100,695,2369) * [OpenMP Tutorial - Parallel Region](https://computing.llnl.gov/tutorials/openMP/#ParallelRegion) ###### tags: `Fzzzz` `sysprog`