# 2016q3 Homework1 (compute-pi)
## 開發環境
* OS:Lubuntu 16.04 LTS
* Memory: 8 GB
* CPU:
* Name:
* Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz
* Cache:
* L1d Cache: 32K
* L1i Cache: 32K
* L2 Cache: 256K
* L3 Cache: 3072K
## Leibniz formula for π
[Leibniz formula for π的詳細證明](https://proofwiki.org/wiki/Leibniz%27s_Formula_for_Pi)
其他取得π值的方法:
(1)[黎曼ζ函數](https://zh.wikipedia.org/wiki/%E9%BB%8E%E6%9B%BC%CE%B6%E5%87%BD%E6%95%B8) ζ(2)=π^2^/6;
(2)[沃利斯乘積](https://zh.wikipedia.org/wiki/%E6%B2%83%E5%88%A9%E6%96%AF%E4%B9%98%E7%A7%AF)
(3)[高斯積分](https://zh.wikipedia.org/wiki/%E9%AB%98%E6%96%AF%E7%A7%AF%E5%88%86)
(4)...
## Makefile
一開始完全不知道程式要怎麼產生想要的資料,所以觀察了一下Makefile,發現比起phonebook,又多一些看不懂的東西。
```=
gencsv: default
for i in `seq 100 5000 25000`; do \
printf "%d," $$i;\
./benchmark_clock_gettime $$i; \
done > result_clock_gettime.csv
```
原來Makefile也可以用迴圈,而`seq 100 5000 20000`代表起始值100,等差5000,20000代表結束,其將作benchmark_clock_gettime的argv[1],printf與benchmark_clock_gettime將匯入result_clock_gettime.csv。
## 觀察
在`benchmark_clock_gettime.c`中,每種方式都執行了25次,並且將其紀錄於表格中,如果將表格以`.txt`開啟,或是觀察`printf`輸出格式,可以發現到格式其實是以逗號分隔:
```
100,0.000026,0.000072,0.000061,0.000021,0.000011
5100,0.001253,0.000599,0.000637,0.000517,0.000460
10100,0.002174,0.001128,0.001122,0.001019,0.000891
15100,0.003443,0.001682,0.003035,0.001511,0.001016
20100,0.004341,0.002275,0.002194,0.002971,0.001194
```
因此在做圖時,script需要`set datafile separator ","`,gnuplot才能區分每一行。
花了很久時間才寫好script
```
reset
set datafile separator ","
set title "compute-pi(25 rounds)"
set xlabel 'N'
set ylabel 'time(sec)'
set terminal png size 1000,800 enhanced font 'Verdana,14'
set output 'plot.png'
plot "result_clock_gettime.csv" u 1:2 w lp title "Baseline",\
"result_clock_gettime.csv" u 1:3 w lp title "OpenMP with 2 threads",\
"result_clock_gettime.csv" u 1:4 w lp title "OpenMP with 4 threads",\
"result_clock_gettime.csv" u 1:5 w lp title "AVX SIMD" ,\
"result_clock_gettime.csv" u 1:5 w lp title "AVX SIMD + Loop unrolling" ,\
```
圖片:
![](https://i.imgur.com/dhrqC0o.png)
我發現每次執行都有落差(尤其是OpenMP(4)),可能時間短得不明顯又很少次,所以我把25次改為10000次
![](https://i.imgur.com/O8Awfth.png)
這次有了比較一致的答案:
效率上AVX+LR > AVX >>> OMP(2) > OMP(4) >>> Baseline
接著,Loop數不變,將N放大
![](https://i.imgur.com/D967MJ5.png)
![](https://i.imgur.com/njGoCBN.png)
在N比較小的情況下,OpenMP(4)有機會比Baseline效率還低,OpenMP(4)整體來說似乎不太穩定。
OpenMP(2)在9e5~1e6間也出現了明顯的升高,但在另一張圖沒有出現,對此沒有結論。
....
## 了解改進方法
### OpenMP
平行化需要注意到的問題,就是變數的存取,不同執行緒對共享變數做修改/讀取,可能會產生不可預期的結果,OpenMP提供了很多方法來讓我們決定變數要怎麼運作。
參考[OpenMP變數的平行化](https://kheresy.wordpress.com/2006/09/22/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E4%BA%94%EF%BC%89-%E8%AE%8A%E6%95%B8%E7%9A%84%E5%B9%B3%E8%A1%8C%E5%8C%96/)歸納以下幾個常用的:
private:每個執行緒有各自的變數(但不會初始化值!)。
firstprivate:同private,但變數值為執行緒前的變數值。
lastprivate:同private,但原變數值會改為執行緒結束時的值。
atomic:防止多個執行緒修改。
reduction:每個執行緒有各自的變數(值為單位元素),最後在合併。
### AVX SIMD
在raytracing時有使用過,但不是很理想。
在這裡充分的運用,ymm0~2為係數,ymm3作為運算,ymm4儲存運算結果
### AVX SIMD + loop unrolling
利用到更多的avx暫存器,可以減少迴圈的數量,一個迴圈處理16個double,每個都多了原本的ymm2、ymm3、ymm4,等於一個迴圈中用了5+3+3+3=14個暫存器。
## 關於time
### xtime vs jiffies
### cpu time
[作業說明](https://hackmd.io/s/rJARexQT)中有提到CPU time是每條執行緒的時間加總,為了觀察到這個結果,我把每一種方法個別執行
```
petermouse@pm-X550JK:~/ws/compute-pi$ time ./time_test_baseline
N = 400000000 , pi = 3.141593
real 0m3.360s
user 0m3.356s
sys 0m0.000s
petermouse@pm-X550JK:~/ws/compute-pi$ time ./time_test_openmp_2
N = 400000000 , pi = 3.141593
real 0m2.287s
user 0m4.556s
sys 0m0.000s
petermouse@pm-X550JK:~/ws/compute-pi$ time ./time_test_openmp_4SIMD register load/store
N = 400000000 , pi = 3.141593
real 0m1.772s
user 0m7.032s
sys 0m0.000s
```
可以發現到user time與real time的關係,與執行緒的數量密切相關
###
## 參考資料
[Makefile 語法簡介](http://tetralet.luna.com.tw/?op=ViewArticle&articleId=185)
[使用gnuplot科學作圖](http://www.phy.fju.edu.tw/~ph116j/gnuplot_tutorial.pdf)
[Visual C++ OpenMP doc](https://msdn.microsoft.com/en-us/library/0ca2w8dk.aspx)
[OpenMP變數的平行化](https://kheresy.wordpress.com/2006/09/22/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E4%BA%94%EF%BC%89-%E8%AE%8A%E6%95%B8%E7%9A%84%E5%B9%B3%E8%A1%8C%E5%8C%96/)
[clock_gettime](http://man7.org/linux/man-pages/man2/clock_gettime.2.html)
[knowing the current time](http://www.makelinux.net/ldd3/chp-7-sect-2)