# 2016q3 Homework1(compute-pi) ###### tags: `heathcliffYang` contributed by <`heathcliffYang`> # 目標 - 重現計算pi的演算法實驗 - 更熟悉用gnuplot作圖 - 增加計算函式執行時間的準確度 - 尋找不同的計算pi的演算法並且實踐測試 # Code 理解 ## Makefile ``` CFLAGS = -O0 -std=gnu99 -Wall -fopenmp -mavx #給定的程式碼已指定 -fopenmp -mavx default: #compile computepi.o & time_test.c #並用 -DBASELINE等等 指定特定的函式來做為輸出執行檔的source #compile computepi.o & benchmark_clock_gettime.c check: #time 顯示關於./time_test_baseline等等程式執行時使用的系統資源 gencsv: default for i in `seq 100 5000 1000000`; do \ #重複做 [起始點][step][終點] printf "%d," $$i;\ #顯示i的值 ./benchmark_clock_gettime $$i; \ #將i的值輸入至./benchmark_clock_gettime done > result_clock_gettime.csv #將原本會print出來的資訊寫入result_clock_gettime.csv plot: result_clock_gettime.csv #用result_clock_gettime.csv作圖 gnuplot runtime.gp #gnuplot執行runtime.gp clean: rm -f $(EXECUTABLE) *.o *.s result_clock_gettime.csv ``` ## computepi.h & computepi.c 各個計算pi的方法的函式,其中優化版本有3種,分別是omp(thread數目可以改變), avx, avx with unrolling - AVX -> 一次做4組 ```c for (int i = 0; i <= N - 4; i += 4) { ymm3 = _mm256_set1_pd(i * dt); // i*dt, i*dt, i*dt, i*dt ymm3 = _mm256_add_pd(ymm3, ymm2); // x = i*dt+3*dt, i*dt+2*dt, i*dt+dt, i*dt+0.0 ymm3 = _mm256_mul_pd(ymm3, ymm3); // x^2 = (i*dt+3*dt)^2, (i*dt+2*dt)^2, ... ymm3 = _mm256_add_pd(ymm0, ymm3); // 1+x^2 = 1+(i*dt+3*dt)^2, 1+(i*dt+2*dt)^2, ... ymm3 = _mm256_div_pd(ymm1, ymm3); // dt/(1+x^2) ymm4 = _mm256_add_pd(ymm4, ymm3); // pi += dt/(1+x^2) } double tmp[4] __attribute__((aligned(32))); _mm256_store_pd(tmp, ymm4); // move packed float64 values to 256-bit aligned memory location pi += tmp[0] + tmp[1] + tmp[2] + tmp[3]; return pi * 4.0; ``` - AVX unrolling -> 增加register數量,將迴圈展開,變成一次做16組 ## time_test.c ```c #include <stdio.h> #include "computepi.h" int main(int argc, char const *argv[]) { __attribute__((unused)) int N = 400000000; //不讓未用到的函數出現警告 double pi = 0.0; #if defined(BASELINE) // 若指令參數下 -DBASELINE則會執行此部分 pi = compute_pi_baseline(N); #endif #if defined(OPENMP_2) pi = compute_pi_openmp(N, 2); //omp配2 threads #endif #if defined(OPENMP_4) pi = compute_pi_openmp(N, 4); //omp配4 threads #endif #if defined(AVX) pi = compute_pi_avx(N); #endif #if defined(AVXUNROLL) pi = compute_pi_avx_unroll(N); #endif printf("N = %d , pi = %lf\n", N, pi); return 0; } ``` ## benchmark_clock_gettime.c 計算函式執行時間,並印出函式執行25次的時間,而N是迭代次數 ```c // Baseline clock_gettime(CLOCK_ID, &start); for(i = 0; i < loop; i++) { compute_pi_baseline(N); } clock_gettime(CLOCK_ID, &end); printf("%lf,", (double) (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)/ONE_SEC); ``` **Q:** 不懂Makefile的$$i可以改變什麼?? A:因為main(int argc, char const *argv[]) 表示pass初始值argv[1]進去,也就是$$i;而argv[0]已被系統占用 ## result_clock_gettime.csv benchmark 跑的時候被記錄的數據,第一個是N(Makefile裡),之後依序是各個函式printf的時間,換行寫在AVX + Loop unrolling的部分;且printf時注意有","來將data分開 ``` 100,0.000023,0.000110,0.000073,0.000017,0.000029 5100,0.001286,0.000740,0.000735,0.000833,0.000686 10100,0.002502,0.001358,0.001357,0.001655,0.001138 15100,0.003698,0.002009,0.001968,0.004866,0.001491 20100,0.004897,0.002636,0.004268,0.002939,0.001954 25100,0.006063,0.003244,0.004876,0.003433,0.002463 30100,0.007322,0.003864,0.005547,0.003963,0.002943 35100,0.008506,0.004489,0.006013,0.004506,0.003462 40100,0.009765,0.005120,0.005047,0.005009,0.003901 45100,0.010985,0.005715,0.009892,0.005559,0.004443 50100,0.012150,0.006356,0.007536,0.006108,0.004874 55100,0.013136,0.006981,0.006930,0.006708,0.005382 60100,0.014151,0.007750,0.011707,0.007129,0.005883 65100,0.016313,0.008239,0.009882,0.007724,0.006323 70100,0.016663,0.009122,0.025125,0.008197,0.006775 75100,0.018109,0.009475,0.011066,0.008912,0.007337 80100,0.019331,0.010445,0.016794,0.009227,0.007756 85100,0.020502,0.010715,0.012317,0.009814,0.008381 90100,0.021964,0.011379,0.012910,0.010297,0.008558 95100,0.022849,0.012086,0.013538,0.010722,0.008921 100100,0.024405,0.012588,0.019913,0.012311,0.009734 105100,0.041532,0.017654,0.015964,0.011699,0.009896 ``` ## runtime.gp*** ``` set xlabel 'N' set ylabel 'Time(sec)' set autoscale fix set style fill solid set datafile separator "," #特別注意,不然會讀不到值 set title 'Wall-clock time -> using clock_gettime()' set term png enhanced font 'Verdana,10' set output 'time.png' plot 'result_clock_gettime.csv' using 1:2 title 'baseline' with linespoints, \ 'result_clock_gettime.csv' using 1:3 title 'openmp_2' with linespoints, \ 'result_clock_gettime.csv' using 1:4 title 'openmp_4' with linespoints, \ 'result_clock_gettime.csv' using 1:5 title 'AVX' with linespoints, \ 'result_clock_gettime.csv' using 1:6 title 'AVX+unroll' with linespoints, \ ``` > 感謝hugikun999的共筆 # Run & Problems ## 看資料&思考問題 - 9/28 10pm - 為什麼**有一些指令可以支援四個暫存器的運算元,使得 code 更小,並且減少一些不必要的指令,提升速度** ## gnuplot作圖遇到問題 9/29 8pm - make plot 失敗: No rule to make target 'plot'. Stop. ``` gencsv: default for i in `seq 100 5000 1000000`; do \ printf "%d," $$i;\ ./benchmark_clock_gettime $$i; \ done > result_clock_gettime.csv plot: result_clock_gettime.csv gnuplot runtime.gp ``` 已把gnuplot會用到的runtime.gp寫完並放在/compute-pi底下,也檢查了一下Makefile的格式與執行make plot時是否已有result_clock_gettime.csv的存在,但最後出現錯誤,目前還沒查明原因 - 解決:在搬動compute-pi的資料夾的過程中,其中一個terminal沒更新到訊息,所以改過Makefile,與用來run的terminal所在的檔案夾不同 ### 原檔作圖分析 分析baseline與其他優化方法的函式,隨著N增加,他們所花費時間的變化 `seq 100 5000 1000000` ![](https://i.imgur.com/1WlI3N6.png) - **分析:** baseline & openmp_4 的時間數據震盪都偏大,其中**openmp_2 跟 openmp_4的時間落點幾乎重疊**(細看發現omp2略低!!why?!),只是openmp_4的震盪較小,而使用AVX優化所花的時間是最少的 - **對於震盪的原因看法:** 因為baseline, openmp_2,openmp_4的執行時間比較久,而系統其實也在執行著很多背景程式,所以有許多的process會搶資源,而導致取的時間有延遲而失真。 - ==Q:但是無法理解為什麼 openmp_4比openmp_2還要激烈震盪?== ### 去除極端的時間 用95%的信賴區間來把極端的時間(也就是有可能受其他因素影響的時間)去除 - **改寫benchmark_clock_gettime.c** 因為原本的benchmark是計算執行輸入同一個N,函式25次的總共花費時間,而現在改成:同一個N,每執行一次計算一次時間,共執行__次來算信賴區間,最後印出經過計算後的時間 ~~==但是因為math.h的函式出問題尚未解決,決定先用mean(極端值仍然存在)==~~ > 感謝hugikun999幫忙找math.h出問題的原因><: **-lm要放在最後面!!!** ```c ``` - **Makefile修改** ``` CC = gcc CFLAGS = -O0 -std=gnu99 -Wall -fopenmp -mavx EXECUTABLE = \ time_test_baseline time_test_openmp_2 time_test_openmp_4 \ time_test_avx time_test_avxunroll \ benchmark_clock_gettime default: computepi.o $(CC) $(CFLAGS) computepi.o time_test.c -DBASELINE -o time_test_baseline $(CC) $(CFLAGS) computepi.o time_test.c -DOPENMP_2 -o time_test_openmp_2 $(CC) $(CFLAGS) computepi.o time_test.c -DOPENMP_4 -o time_test_openmp_4 $(CC) $(CFLAGS) computepi.o time_test.c -DAVX -o time_test_avx $(CC) $(CFLAGS) computepi.o time_test.c -DAVXUNROLL -o time_test_avxunroll $(CC) $(CFLAGS) computepi.o benchmark_clock_gettime.c -o benchmark_clock_gettime -lm ``` ==猜想一個問題:會不會因為紀錄的時間太短,nsec的數字會丟失?== > From [shelly4132](https://hackmd.io/s/HJrLU0dp) & [王紹華](https://hackmd.io/s/BJdYsOPa) 的共筆 ; [信賴區間計算](https://github.com/rampant1018/compute_pi/blob/master/compute_pi.c) ### 測試各個函式所計算pi的時間 - mean版本 1. seq 100 5000 1000000 & SAMPLE_SIZE=25 ![](https://i.imgur.com/360apLK.png) 2. seq 100 5000 1000000 & SAMPLE_SIZE=100 ![](https://i.imgur.com/k6q4RQd.png) 3. seq 1000 1110 112000 & SAMPLE_SIZE=1000 ![](https://i.imgur.com/AAo80K8.png) ==不過還在思考為甚麼會在最後的時候時間就爆炸了== 4. seq 100 111 11200 & SAMPLE_SIZE=10000 ![](https://i.imgur.com/jDYNs4X.png) ### 測試各個函式所計算pi的時間 - 95%信賴區間 seq 100 111 11200 & SMAPLE = 25 ### 測試各個函式所計算的pi的正確性 - benchmark_error_rate.c ```c #include <math.h> #include <stdio.h> #include <stdlib.h> #include <immintrin.h> #include <stdint.h> #include "computepi.h" #define ONE_SEC 1000000000.0 //#define M_PI acos(-1.0) int main(int argc, char const *argv[]) { if (argc < 2) return -1; int N = atoi(argv[1]); double pi, diff, error; // Baseline pi = compute_pi_baseline(N); diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi; error = diff / M_PI; printf("%lf,", error ); // OpenMP with 2 threads pi = compute_pi_openmp(N, 2); diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi; error = diff / M_PI; printf("%lf,", error ); // OpenMP with 4 threads pi = compute_pi_openmp(N, 4); diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi; error = diff / M_PI; printf("%lf,", error ); // AVX SIMD pi = compute_pi_avx(N); diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi; error = diff / M_PI; printf("%lf,", error ); // AVX SIMD + Loop unrolling pi = compute_pi_avx_unroll(N); diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi; error = diff / M_PI; printf("%lf\n", error ); return 0; } ``` - runtime_error_rate.gp & Makefile 調整一下 1. seq 100 111 11200 ![](https://i.imgur.com/n5hpzTp.png) N值非16的倍數,所以AVX跟AVX+unroll的誤差才會那麼大 2. seq 80 800 11200 ![](https://i.imgur.com/GSpUw6X.png) 後來考慮到因為AVX的兩個版本是4組N值一起算、16組N值一起算,故設計讓這兩個能整除的數目,才不會少算,而圖也顯示大家的error rate都一樣 ### 探討openmp thread 數目對效能影響 ### perf stat 分析 1. 測試的event: cache-misses, cache-references, cycles, instructions 來分析各個函式執行時間長短不同的原因。 ``` Performance counter stats for './time_test_baseline' (25 runs): 18,980 cache-misses # 23.003 % of all cache refs ( +- 10.91% ) 82,513 cache-references ( +- 8.18% ) 11,416,803,377 cycles ( +- 0.01% ) 10,813,679,993 instructions # 0.95 insns per cycle ( +- 0.00% ) 3.773416151 seconds time elapsed ( +- 0.22% ) ``` ``` Performance counter stats for './time_test_openmp_2' (25 runs): 144,102 cache-misses # 19.233 % of all cache refs ( +- 14.61% ) 749,245 cache-references ( +- 14.60% ) 13,919,361,574 cycles ( +- 4.67% ) 11,226,535,864 instructions # 0.81 insns per cycle ( +- 0.01% ) 3.316340440 seconds time elapsed ( +- 7.29% ) ``` ``` Performance counter stats for './time_test_openmp_4' (25 runs): 142,875 cache-misses # 16.832 % of all cache refs ( +- 14.40% ) 848,851 cache-references ( +- 15.04% ) 20,728,355,391 cycles ( +- 1.51% ) 11,245,729,658 instructions # 0.54 insns per cycle ( +- 0.01% ) 2.681960048 seconds time elapsed ( +- 4.23% ) ``` ``` Performance counter stats for './time_test_avx' (25 runs): 31,124 cache-misses # 30.657 % of all cache refs ( +- 12.08% ) 101,522 cache-references ( +- 10.52% ) 4,997,650,102 cycles ( +- 0.18% ) 4,106,976,991 instructions # 0.82 insns per cycle ( +- 0.00% ) 1.738483690 seconds time elapsed ( +- 0.96% ) ``` ``` Performance counter stats for './time_test_avxunroll' (25 runs): 53,932 cache-misses # 31.102 % of all cache refs ( +- 13.40% ) 173,402 cache-references ( +- 11.94% ) 4,622,056,802 cycles ( +- 0.46% ) 3,306,564,222 instructions # 0.72 insns per cycle ( +- 0.00% ) 1.600523838 seconds time elapsed ( +- 0.74% ) ``` - 分析圖 AVX雖然cache-miss偏高,但cycle數是其他三種1/3~1/5。 ==Q:為何使用AVX的話cache-miss會偏高?== ## 證明 Leibniz formula for pi & 其他計算圓周率的方法 # AVX ### 硬體 - 16 個 256-bit 的 YMM(YMM0-YMM15) 暫存器 - 32-bit 的控制暫存器 MXCSR (not yet)**MXCSR 上的 0-5 位元是浮點數的 exception,這幾個 bit 在 set 之後只能透過 `LDMXCSR` 或是 `FXRSTOR` clear,而 7-12 位元是獨立的 exception mask,在 power-up 或是 reset 後是初始化成 set 狀態,0-5 位元的 exception 分別是 invalid operation、denormal、divide by zero、overflow、underflow、precision。** **指令可以分為 vector 跟 scalar 版本,在 vector 版本中資料會被視為 parallel SIMD 處理,而 scalar 則是只處理一個 entry** # 背景知識 ## 可能延遲程式的因子: 1. 將文字或資料輸程式所需的 I/O ( 如 Network I/O, Disk I/O…) 2. 取得實際記憶體供程式使用所需的 I/O 3. 其他程式所使用的 CPU 的時間 4. 作業系統所使用的 CPU 的時間 [參考資料1-Hw1-Ext](https://hackpad.com/Hw1-Extcompute_pi-geXUeYjdv1I) ## SMP ## Attributes of Variables 1. `aligned (alignment)` Ex: `int x __attribute__ ((aligned (16))) = 0` **compiler to allocate the global variable x on a 16-byte boundary** [參考資料](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Variable-Attributes.html#Variable-Attributes) ## 信賴區間補充 ## 不懂的部分 1. why line 14 是 computepi.o? 2. ``` %.o: %.c $(CC) -c $(CFLAGS) $< -o $@ ``` ## C programming 1. 複習`main(int argc, char const *argv[])` 2. `atoi(argv[1])` 3. `size_t` 4. `pow();` 5. `#define M_PI acos(-1.0)`