owned this note
owned this note
Published
Linked with GitHub
# 2016q3 Homework1(compute-pi)
###### tags: `heathcliffYang`
contributed by <`heathcliffYang`>
# 目標
- 重現計算pi的演算法實驗
- 更熟悉用gnuplot作圖
- 增加計算函式執行時間的準確度
- 尋找不同的計算pi的演算法並且實踐測試
# Code 理解
## Makefile
```
CFLAGS = -O0 -std=gnu99 -Wall -fopenmp -mavx #給定的程式碼已指定 -fopenmp -mavx
default: #compile computepi.o & time_test.c
#並用 -DBASELINE等等 指定特定的函式來做為輸出執行檔的source
#compile computepi.o & benchmark_clock_gettime.c
check: #time 顯示關於./time_test_baseline等等程式執行時使用的系統資源
gencsv: default
for i in `seq 100 5000 1000000`; do \ #重複做 [起始點][step][終點]
printf "%d," $$i;\ #顯示i的值
./benchmark_clock_gettime $$i; \ #將i的值輸入至./benchmark_clock_gettime
done > result_clock_gettime.csv #將原本會print出來的資訊寫入result_clock_gettime.csv
plot: result_clock_gettime.csv #用result_clock_gettime.csv作圖
gnuplot runtime.gp #gnuplot執行runtime.gp
clean:
rm -f $(EXECUTABLE) *.o *.s result_clock_gettime.csv
```
## computepi.h & computepi.c
各個計算pi的方法的函式,其中優化版本有3種,分別是omp(thread數目可以改變), avx, avx with unrolling
- AVX -> 一次做4組
```c
for (int i = 0; i <= N - 4; i += 4) {
ymm3 = _mm256_set1_pd(i * dt); // i*dt, i*dt, i*dt, i*dt
ymm3 = _mm256_add_pd(ymm3, ymm2); // x = i*dt+3*dt, i*dt+2*dt, i*dt+dt, i*dt+0.0
ymm3 = _mm256_mul_pd(ymm3, ymm3); // x^2 = (i*dt+3*dt)^2, (i*dt+2*dt)^2, ...
ymm3 = _mm256_add_pd(ymm0, ymm3); // 1+x^2 = 1+(i*dt+3*dt)^2, 1+(i*dt+2*dt)^2, ...
ymm3 = _mm256_div_pd(ymm1, ymm3); // dt/(1+x^2)
ymm4 = _mm256_add_pd(ymm4, ymm3); // pi += dt/(1+x^2)
}
double tmp[4] __attribute__((aligned(32)));
_mm256_store_pd(tmp, ymm4); // move packed float64 values to 256-bit aligned memory location
pi += tmp[0] + tmp[1] + tmp[2] + tmp[3];
return pi * 4.0;
```
- AVX unrolling -> 增加register數量,將迴圈展開,變成一次做16組
## time_test.c
```c
#include <stdio.h>
#include "computepi.h"
int main(int argc, char const *argv[])
{
__attribute__((unused)) int N = 400000000; //不讓未用到的函數出現警告
double pi = 0.0;
#if defined(BASELINE) // 若指令參數下 -DBASELINE則會執行此部分
pi = compute_pi_baseline(N);
#endif
#if defined(OPENMP_2)
pi = compute_pi_openmp(N, 2); //omp配2 threads
#endif
#if defined(OPENMP_4)
pi = compute_pi_openmp(N, 4); //omp配4 threads
#endif
#if defined(AVX)
pi = compute_pi_avx(N);
#endif
#if defined(AVXUNROLL)
pi = compute_pi_avx_unroll(N);
#endif
printf("N = %d , pi = %lf\n", N, pi);
return 0;
}
```
## benchmark_clock_gettime.c
計算函式執行時間,並印出函式執行25次的時間,而N是迭代次數
```c
// Baseline
clock_gettime(CLOCK_ID, &start);
for(i = 0; i < loop; i++) {
compute_pi_baseline(N);
}
clock_gettime(CLOCK_ID, &end);
printf("%lf,", (double) (end.tv_sec - start.tv_sec) +
(end.tv_nsec - start.tv_nsec)/ONE_SEC);
```
**Q:** 不懂Makefile的$$i可以改變什麼??
A:因為main(int argc, char const *argv[]) 表示pass初始值argv[1]進去,也就是$$i;而argv[0]已被系統占用
## result_clock_gettime.csv
benchmark 跑的時候被記錄的數據,第一個是N(Makefile裡),之後依序是各個函式printf的時間,換行寫在AVX + Loop unrolling的部分;且printf時注意有","來將data分開
```
100,0.000023,0.000110,0.000073,0.000017,0.000029
5100,0.001286,0.000740,0.000735,0.000833,0.000686
10100,0.002502,0.001358,0.001357,0.001655,0.001138
15100,0.003698,0.002009,0.001968,0.004866,0.001491
20100,0.004897,0.002636,0.004268,0.002939,0.001954
25100,0.006063,0.003244,0.004876,0.003433,0.002463
30100,0.007322,0.003864,0.005547,0.003963,0.002943
35100,0.008506,0.004489,0.006013,0.004506,0.003462
40100,0.009765,0.005120,0.005047,0.005009,0.003901
45100,0.010985,0.005715,0.009892,0.005559,0.004443
50100,0.012150,0.006356,0.007536,0.006108,0.004874
55100,0.013136,0.006981,0.006930,0.006708,0.005382
60100,0.014151,0.007750,0.011707,0.007129,0.005883
65100,0.016313,0.008239,0.009882,0.007724,0.006323
70100,0.016663,0.009122,0.025125,0.008197,0.006775
75100,0.018109,0.009475,0.011066,0.008912,0.007337
80100,0.019331,0.010445,0.016794,0.009227,0.007756
85100,0.020502,0.010715,0.012317,0.009814,0.008381
90100,0.021964,0.011379,0.012910,0.010297,0.008558
95100,0.022849,0.012086,0.013538,0.010722,0.008921
100100,0.024405,0.012588,0.019913,0.012311,0.009734
105100,0.041532,0.017654,0.015964,0.011699,0.009896
```
## runtime.gp***
```
set xlabel 'N'
set ylabel 'Time(sec)'
set autoscale fix
set style fill solid
set datafile separator "," #特別注意,不然會讀不到值
set title 'Wall-clock time -> using clock_gettime()'
set term png enhanced font 'Verdana,10'
set output 'time.png'
plot 'result_clock_gettime.csv' using 1:2 title 'baseline' with linespoints, \
'result_clock_gettime.csv' using 1:3 title 'openmp_2' with linespoints, \
'result_clock_gettime.csv' using 1:4 title 'openmp_4' with linespoints, \
'result_clock_gettime.csv' using 1:5 title 'AVX' with linespoints, \
'result_clock_gettime.csv' using 1:6 title 'AVX+unroll' with linespoints, \
```
> 感謝hugikun999的共筆
# Run & Problems
## 看資料&思考問題 - 9/28 10pm
- 為什麼**有一些指令可以支援四個暫存器的運算元,使得 code 更小,並且減少一些不必要的指令,提升速度**
## gnuplot作圖遇到問題 9/29 8pm
- make plot 失敗: No rule to make target 'plot'. Stop.
```
gencsv: default
for i in `seq 100 5000 1000000`; do \
printf "%d," $$i;\
./benchmark_clock_gettime $$i; \
done > result_clock_gettime.csv
plot: result_clock_gettime.csv
gnuplot runtime.gp
```
已把gnuplot會用到的runtime.gp寫完並放在/compute-pi底下,也檢查了一下Makefile的格式與執行make plot時是否已有result_clock_gettime.csv的存在,但最後出現錯誤,目前還沒查明原因
- 解決:在搬動compute-pi的資料夾的過程中,其中一個terminal沒更新到訊息,所以改過Makefile,與用來run的terminal所在的檔案夾不同
### 原檔作圖分析
分析baseline與其他優化方法的函式,隨著N增加,他們所花費時間的變化
`seq 100 5000 1000000`
![](https://i.imgur.com/1WlI3N6.png)
- **分析:** baseline & openmp_4 的時間數據震盪都偏大,其中**openmp_2 跟 openmp_4的時間落點幾乎重疊**(細看發現omp2略低!!why?!),只是openmp_4的震盪較小,而使用AVX優化所花的時間是最少的
- **對於震盪的原因看法:** 因為baseline, openmp_2,openmp_4的執行時間比較久,而系統其實也在執行著很多背景程式,所以有許多的process會搶資源,而導致取的時間有延遲而失真。
- ==Q:但是無法理解為什麼 openmp_4比openmp_2還要激烈震盪?==
### 去除極端的時間
用95%的信賴區間來把極端的時間(也就是有可能受其他因素影響的時間)去除
- **改寫benchmark_clock_gettime.c**
因為原本的benchmark是計算執行輸入同一個N,函式25次的總共花費時間,而現在改成:同一個N,每執行一次計算一次時間,共執行__次來算信賴區間,最後印出經過計算後的時間
~~==但是因為math.h的函式出問題尚未解決,決定先用mean(極端值仍然存在)==~~
> 感謝hugikun999幫忙找math.h出問題的原因><: **-lm要放在最後面!!!**
```c
```
- **Makefile修改**
```
CC = gcc
CFLAGS = -O0 -std=gnu99 -Wall -fopenmp -mavx
EXECUTABLE = \
time_test_baseline time_test_openmp_2 time_test_openmp_4 \
time_test_avx time_test_avxunroll \
benchmark_clock_gettime
default: computepi.o
$(CC) $(CFLAGS) computepi.o time_test.c -DBASELINE -o time_test_baseline
$(CC) $(CFLAGS) computepi.o time_test.c -DOPENMP_2 -o time_test_openmp_2
$(CC) $(CFLAGS) computepi.o time_test.c -DOPENMP_4 -o time_test_openmp_4
$(CC) $(CFLAGS) computepi.o time_test.c -DAVX -o time_test_avx
$(CC) $(CFLAGS) computepi.o time_test.c -DAVXUNROLL -o time_test_avxunroll
$(CC) $(CFLAGS) computepi.o benchmark_clock_gettime.c -o benchmark_clock_gettime -lm
```
==猜想一個問題:會不會因為紀錄的時間太短,nsec的數字會丟失?==
> From [shelly4132](https://hackmd.io/s/HJrLU0dp) & [王紹華](https://hackmd.io/s/BJdYsOPa) 的共筆 ; [信賴區間計算](https://github.com/rampant1018/compute_pi/blob/master/compute_pi.c)
### 測試各個函式所計算pi的時間 - mean版本
1. seq 100 5000 1000000 & SAMPLE_SIZE=25
![](https://i.imgur.com/360apLK.png)
2. seq 100 5000 1000000 & SAMPLE_SIZE=100
![](https://i.imgur.com/k6q4RQd.png)
3. seq 1000 1110 112000 & SAMPLE_SIZE=1000
![](https://i.imgur.com/AAo80K8.png)
==不過還在思考為甚麼會在最後的時候時間就爆炸了==
4. seq 100 111 11200 & SAMPLE_SIZE=10000
![](https://i.imgur.com/jDYNs4X.png)
### 測試各個函式所計算pi的時間 - 95%信賴區間
seq 100 111 11200 & SMAPLE = 25
### 測試各個函式所計算的pi的正確性
- benchmark_error_rate.c
```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include <stdint.h>
#include "computepi.h"
#define ONE_SEC 1000000000.0
//#define M_PI acos(-1.0)
int main(int argc, char const *argv[])
{
if (argc < 2) return -1;
int N = atoi(argv[1]);
double pi, diff, error;
// Baseline
pi = compute_pi_baseline(N);
diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi;
error = diff / M_PI;
printf("%lf,", error );
// OpenMP with 2 threads
pi = compute_pi_openmp(N, 2);
diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi;
error = diff / M_PI;
printf("%lf,", error );
// OpenMP with 4 threads
pi = compute_pi_openmp(N, 4);
diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi;
error = diff / M_PI;
printf("%lf,", error );
// AVX SIMD
pi = compute_pi_avx(N);
diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi;
error = diff / M_PI;
printf("%lf,", error );
// AVX SIMD + Loop unrolling
pi = compute_pi_avx_unroll(N);
diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi;
error = diff / M_PI;
printf("%lf\n", error );
return 0;
}
```
- runtime_error_rate.gp & Makefile 調整一下
1. seq 100 111 11200
![](https://i.imgur.com/n5hpzTp.png)
N值非16的倍數,所以AVX跟AVX+unroll的誤差才會那麼大
2. seq 80 800 11200
![](https://i.imgur.com/GSpUw6X.png)
後來考慮到因為AVX的兩個版本是4組N值一起算、16組N值一起算,故設計讓這兩個能整除的數目,才不會少算,而圖也顯示大家的error rate都一樣
### 探討openmp thread 數目對效能影響
### perf stat 分析
1. 測試的event: cache-misses, cache-references, cycles, instructions 來分析各個函式執行時間長短不同的原因。
```
Performance counter stats for './time_test_baseline' (25 runs):
18,980 cache-misses # 23.003 % of all cache refs ( +- 10.91% )
82,513 cache-references ( +- 8.18% )
11,416,803,377 cycles ( +- 0.01% )
10,813,679,993 instructions # 0.95 insns per cycle ( +- 0.00% )
3.773416151 seconds time elapsed ( +- 0.22% )
```
```
Performance counter stats for './time_test_openmp_2' (25 runs):
144,102 cache-misses # 19.233 % of all cache refs ( +- 14.61% )
749,245 cache-references ( +- 14.60% )
13,919,361,574 cycles ( +- 4.67% )
11,226,535,864 instructions # 0.81 insns per cycle ( +- 0.01% )
3.316340440 seconds time elapsed ( +- 7.29% )
```
```
Performance counter stats for './time_test_openmp_4' (25 runs):
142,875 cache-misses # 16.832 % of all cache refs ( +- 14.40% )
848,851 cache-references ( +- 15.04% )
20,728,355,391 cycles ( +- 1.51% )
11,245,729,658 instructions # 0.54 insns per cycle ( +- 0.01% )
2.681960048 seconds time elapsed ( +- 4.23% )
```
```
Performance counter stats for './time_test_avx' (25 runs):
31,124 cache-misses # 30.657 % of all cache refs ( +- 12.08% )
101,522 cache-references ( +- 10.52% )
4,997,650,102 cycles ( +- 0.18% )
4,106,976,991 instructions # 0.82 insns per cycle ( +- 0.00% )
1.738483690 seconds time elapsed ( +- 0.96% )
```
```
Performance counter stats for './time_test_avxunroll' (25 runs):
53,932 cache-misses # 31.102 % of all cache refs ( +- 13.40% )
173,402 cache-references ( +- 11.94% )
4,622,056,802 cycles ( +- 0.46% )
3,306,564,222 instructions # 0.72 insns per cycle ( +- 0.00% )
1.600523838 seconds time elapsed ( +- 0.74% )
```
- 分析圖
AVX雖然cache-miss偏高,但cycle數是其他三種1/3~1/5。
==Q:為何使用AVX的話cache-miss會偏高?==
## 證明 Leibniz formula for pi & 其他計算圓周率的方法
# AVX
### 硬體
- 16 個 256-bit 的 YMM(YMM0-YMM15) 暫存器
- 32-bit 的控制暫存器 MXCSR
(not yet)**MXCSR 上的 0-5 位元是浮點數的 exception,這幾個 bit 在 set 之後只能透過 `LDMXCSR` 或是 `FXRSTOR` clear,而 7-12 位元是獨立的 exception mask,在 power-up 或是 reset 後是初始化成 set 狀態,0-5 位元的 exception 分別是 invalid operation、denormal、divide by zero、overflow、underflow、precision。**
**指令可以分為 vector 跟 scalar 版本,在 vector 版本中資料會被視為 parallel SIMD 處理,而 scalar 則是只處理一個 entry**
# 背景知識
## 可能延遲程式的因子:
1. 將文字或資料輸程式所需的 I/O ( 如 Network I/O, Disk I/O…)
2. 取得實際記憶體供程式使用所需的 I/O
3. 其他程式所使用的 CPU 的時間
4. 作業系統所使用的 CPU 的時間
[參考資料1-Hw1-Ext](https://hackpad.com/Hw1-Extcompute_pi-geXUeYjdv1I)
## SMP
## Attributes of Variables
1. `aligned (alignment)`
Ex: `int x __attribute__ ((aligned (16))) = 0` **compiler to allocate the global variable x on a 16-byte boundary**
[參考資料](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Variable-Attributes.html#Variable-Attributes)
## 信賴區間補充
## 不懂的部分
1. why line 14 是 computepi.o?
2.
```
%.o: %.c
$(CC) -c $(CFLAGS) $< -o $@
```
## C programming
1. 複習`main(int argc, char const *argv[])`
2. `atoi(argv[1])`
3. `size_t`
4. `pow();`
5. `#define M_PI acos(-1.0)`