2016q3 Homework1(compute-pi)
contributed by <heathcliffYang>
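Goals
- Reproduce the pi-computation algorithm experiments
- Get more familiar with plotting in gnuplot
- Improve the accuracy of the function execution-time measurements
- Find other algorithms for computing pi, then implement and test them
Understanding the code
Makefile
computepi.h & computepi.c
The functions for each method of computing pi. There are three kinds of optimized version: omp (the number of threads can be changed), avx, and avx with unrolling
- AVX unrolling -> uses more registers and unrolls the loop so that each iteration handles 16 values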
time_test.c
benchmark_clock_gettime.c
Measures function execution time and prints the time each function takes over 25 runs; N is the number of iterations
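Q: What does $$i in the Makefile actually change?
A: main(int argc, char const *argv[]) receives the value passed on the command line as argv[1], and that value is $$i; argv[0] is already taken, since it holds the program name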
result_clock_gettime.csv
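The data recorded while the benchmark runs. The first field is N (set in the Makefile), followed in order by the time printed by each function; the newline is written in the AVX + Loop unrolling part. Note that each printf includes a "," to separate the fields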
100,0.000023,0.000110,0.000073,0.000017,0.000029
5100,0.001286,0.000740,0.000735,0.000833,0.000686
10100,0.002502,0.001358,0.001357,0.001655,0.001138
15100,0.003698,0.002009,0.001968,0.004866,0.001491
20100,0.004897,0.002636,0.004268,0.002939,0.001954
25100,0.006063,0.003244,0.004876,0.003433,0.002463
30100,0.007322,0.003864,0.005547,0.003963,0.002943
35100,0.008506,0.004489,0.006013,0.004506,0.003462
40100,0.009765,0.005120,0.005047,0.005009,0.003901
45100,0.010985,0.005715,0.009892,0.005559,0.004443
50100,0.012150,0.006356,0.007536,0.006108,0.004874
55100,0.013136,0.006981,0.006930,0.006708,0.005382
60100,0.014151,0.007750,0.011707,0.007129,0.005883
65100,0.016313,0.008239,0.009882,0.007724,0.006323
70100,0.016663,0.009122,0.025125,0.008197,0.006775
75100,0.018109,0.009475,0.011066,0.008912,0.007337
80100,0.019331,0.010445,0.016794,0.009227,0.007756
85100,0.020502,0.010715,0.012317,0.009814,0.008381
90100,0.021964,0.011379,0.012910,0.010297,0.008558
95100,0.022849,0.012086,0.013538,0.010722,0.008921
100100,0.024405,0.012588,0.019913,0.012311,0.009734
105100,0.041532,0.017654,0.015964,0.011699,0.009896
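Thanks to hugikun999's shared notes
Run & Problems
Reading & open questions - 9/28 10pm
- Why can some instructions take four register operands, which makes the code smaller, removes some unnecessary instructions, and improves speed?
Problem plotting with gnuplot - 9/29 8pm
- make plot fails: No rule to make target 'plot'. Stop.
I had finished writing runtime.gp, which gnuplot uses, and put it under /compute-pi. I also checked the Makefile formatting and that result_clock_gettime.csv already existed when running make plot, but the error still appeared and I had not yet found the cause
- Solved: while moving the compute-pi folder, one terminal did not pick up the change, so the Makefile I had edited and the terminal I was running from were in different folders
Plots and analysis of the original files
Analyze how the time taken by the baseline and the other optimized functions changes as N increases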
seq 100 5000 1000000

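- Analysis: the timings of baseline & openmp_4 both fluctuate a lot. The openmp_2 and openmp_4 timings almost overlap (on closer inspection omp2 is slightly lower!! why?!), only openmp_4 fluctuates less, and the AVX-optimized versions take the least time
- My view on the cause of the fluctuation: baseline, openmp_2, and openmp_4 run for longer, and the system is also running many background programs, so many processes compete for resources, which delays and distorts the measured times.
- Q: But I cannot understand why openmp_4 fluctuates more violently than openmp_2?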
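Removing extreme timings
Use a 95% confidence interval to discard extreme timings (those likely affected by other factors)
- Rewrite benchmark_clock_gettime.c
The original benchmark measures the total time for 25 calls of a function with the same N. It now times each call separately for the same N, runs __ times in total to compute the confidence interval, and finally prints the time obtained after that calculation
However, because a problem with the math.h functions was not yet solved, I decided to use the mean first (extreme values still present)
Thanks to hugikun999 for helping find the cause of the math.h problem ><: -lm has to go at the very end!!!
A guess: could the nsec digits be lost because the recorded times are too short?
From the shared notes of shelly4132 & 王紹華; confidence interval calculation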
Timing each pi function - mean version
- seq 100 5000 1000000 & SAMPLE_SIZE=25
- seq 100 5000 1000000 & SAMPLE_SIZE=100
- seq 1000 1110 112000 & SAMPLE_SIZE=1000
Still working out why the times blow up at the very end
- seq 100 111 11200 & SAMPLE_SIZE=10000

Timing each pi function - 95% confidence interval
seq 100 111 11200 & SAMPLE_SIZE=25
Checking the accuracy of the pi computed by each function
- Adjusted runtime_error_rate.gp & the Makefile
- seq 100 111 11200

N is not a multiple of 16, which is why the error for AVX and AVX + unroll is so large
- seq 80 800 11200

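Later, considering that the two AVX versions compute 4 and 16 values of N at a time respectively, I chose N values that both can divide evenly so that no terms are skipped; the plot also shows that every function then has the same error rate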
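Effect of the number of OpenMP threads on performance
perf stat analysis
- Events measured: cache-misses, cache-references, cycles, instructions, to analyze why the functions' execution times differ.
- Analysis plot
Although AVX has a higher cache-miss count, its cycle count is 1/3 to 1/5 of the other three.
Q: Why are cache misses higher when using AVX?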
AVX
Hardware
- 16 256-bit YMM registers (YMM0-YMM15)
- A 32-bit control register, MXCSR
(not yet) Bits 0-5 of MXCSR are the floating-point exception flags; once set, these bits can only be cleared through LDMXCSR or FXRSTOR. Bits 7-12 are the individual exception masks, which are initialized to the set state on power-up or reset. The exceptions in bits 0-5 are, in order, invalid operation, denormal, divide by zero, overflow, underflow, precision.
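Instructions come in vector and scalar versions: the vector version treats the data as parallel SIMD elements, while the scalar version processes only one entry
Background knowledge
Factors that can delay a program:
- I/O needed to bring text or data into the program (e.g. Network I/O, Disk I/O...)
- I/O needed to obtain real memory for the program to use
- CPU time used by other programs
- CPU time used by the operating system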
Reference 1 - Hw1-Ext
SMP
Attributes of Variables
aligned (alignment)
Ex: int x __attribute__ ((aligned (16))) = 0
causes the compiler to allocate the global variable x on a 16-byte boundary
References
Supplementary notes on confidence intervals
Things I don't understand
- Why is line 14 computepi.o?
C programming
- Review
main(int argc, char const *argv[])
atoi(argv[1])
size_t
pow();
#define M_PI acos(-1.0)