2016q3 Homework1(compute-pi)

tags: `heathcliffYang`

contributed by <heathcliffYang>

目標

重現計算pi的演算法實驗
更熟悉用gnuplot作圖
增加計算函式執行時間的準確度
尋找不同的計算pi的演算法並且實踐測試

Code 理解

Makefile

CFLAGS = -O0 -std=gnu99 -Wall -fopenmp -mavx   #給定的程式碼已指定 -fopenmp -mavx
default:	#compile computepi.o & time_test.c
		#並用 -DBASELINE等等 指定特定的函式來做為輸出執行檔的source
		#compile computepi.o & benchmark_clock_gettime.c
check:		#time 顯示關於./time_test_baseline等等程式執行時使用的系統資源
gencsv: default
        for i in `seq 100 5000 1000000`; do \		#重複做 [起始點][step][終點]
                printf "%d," $$i;\			#顯示i的值
                ./benchmark_clock_gettime $$i; \	#將i的值輸入至./benchmark_clock_gettime
        done > result_clock_gettime.csv		#將原本會print出來的資訊寫入result_clock_gettime.csv

plot: result_clock_gettime.csv				#用result_clock_gettime.csv作圖
        gnuplot runtime.gp				#gnuplot執行runtime.gp

clean:
        rm -f $(EXECUTABLE) *.o *.s result_clock_gettime.csv

computepi.h & computepi.c

各個計算pi的方法的函式，其中優化版本有3種，分別是omp(thread數目可以改變), avx, avx with unrolling

AVX -> 一次做4組

    for (int i = 0; i <= N - 4; i += 4) {   
        ymm3 = _mm256_set1_pd(i * dt);      // i*dt, i*dt, i*dt, i*dt
        ymm3 = _mm256_add_pd(ymm3, ymm2);   // x = i*dt+3*dt, i*dt+2*dt, i*dt+dt, i*dt+0.0
        ymm3 = _mm256_mul_pd(ymm3, ymm3);   // x^2 = (i*dt+3*dt)^2, (i*dt+2*dt)^2, ...
        ymm3 = _mm256_add_pd(ymm0, ymm3);   // 1+x^2 = 1+(i*dt+3*dt)^2, 1+(i*dt+2*dt)^2, ...
        ymm3 = _mm256_div_pd(ymm1, ymm3);   // dt/(1+x^2)
        ymm4 = _mm256_add_pd(ymm4, ymm3);   // pi += dt/(1+x^2)
	}
	double tmp[4] __attribute__((aligned(32)));
    _mm256_store_pd(tmp, ymm4);             // move packed float64 values to  256-bit aligned memory location
    pi += tmp[0] + tmp[1] + tmp[2] + tmp[3];
    return pi * 4.0;

AVX unrolling -> 增加register數量，將迴圈展開，變成一次做16組

time_test.c

#include <stdio.h>
#include "computepi.h"

int main(int argc, char const *argv[])
{
    __attribute__((unused)) int N = 400000000;	//不讓未用到的函數出現警告
    double pi = 0.0;

#if defined(BASELINE)					// 若指令參數下 -DBASELINE則會執行此部分
    pi = compute_pi_baseline(N);
#endif

#if defined(OPENMP_2)
    pi = compute_pi_openmp(N, 2);			//omp配2 threads
#endif

#if defined(OPENMP_4)
    pi = compute_pi_openmp(N, 4);			//omp配4 threads
#endif

#if defined(AVX)
    pi = compute_pi_avx(N);
#endif

#if defined(AVXUNROLL)
    pi = compute_pi_avx_unroll(N);
#endif
    printf("N = %d , pi = %lf\n", N, pi);

    return 0;
}

benchmark_clock_gettime.c

計算函式執行時間，並印出函式執行25次的時間，而N是迭代次數

    // Baseline
    clock_gettime(CLOCK_ID, &start);
    for(i = 0; i < loop; i++) {
        compute_pi_baseline(N);
    }
    clock_gettime(CLOCK_ID, &end);
    printf("%lf,", (double) (end.tv_sec - start.tv_sec) +
           (end.tv_nsec - start.tv_nsec)/ONE_SEC);

Q: 不懂Makefile的$$i可以改變什麼??

A:因為main(int argc, char const *argv[]) 表示pass初始值argv[1]進去，也就是$$i；而argv[0]已被系統占用

result_clock_gettime.csv

benchmark 跑的時候被記錄的數據，第一個是N(Makefile裡)，之後依序是各個函式printf的時間，換行寫在AVX + Loop unrolling的部分；且printf時注意有","來將data分開

100,0.000023,0.000110,0.000073,0.000017,0.000029
5100,0.001286,0.000740,0.000735,0.000833,0.000686
10100,0.002502,0.001358,0.001357,0.001655,0.001138
15100,0.003698,0.002009,0.001968,0.004866,0.001491
20100,0.004897,0.002636,0.004268,0.002939,0.001954
25100,0.006063,0.003244,0.004876,0.003433,0.002463
30100,0.007322,0.003864,0.005547,0.003963,0.002943
35100,0.008506,0.004489,0.006013,0.004506,0.003462
40100,0.009765,0.005120,0.005047,0.005009,0.003901
45100,0.010985,0.005715,0.009892,0.005559,0.004443
50100,0.012150,0.006356,0.007536,0.006108,0.004874
55100,0.013136,0.006981,0.006930,0.006708,0.005382
60100,0.014151,0.007750,0.011707,0.007129,0.005883
65100,0.016313,0.008239,0.009882,0.007724,0.006323
70100,0.016663,0.009122,0.025125,0.008197,0.006775
75100,0.018109,0.009475,0.011066,0.008912,0.007337
80100,0.019331,0.010445,0.016794,0.009227,0.007756
85100,0.020502,0.010715,0.012317,0.009814,0.008381
90100,0.021964,0.011379,0.012910,0.010297,0.008558
95100,0.022849,0.012086,0.013538,0.010722,0.008921
100100,0.024405,0.012588,0.019913,0.012311,0.009734
105100,0.041532,0.017654,0.015964,0.011699,0.009896

runtime.gp***

set xlabel 'N'
set ylabel 'Time(sec)'
set autoscale fix
set style fill solid
set datafile separator ","		#特別注意，不然會讀不到值
set title 'Wall-clock time -> using clock_gettime()'
set term png enhanced font 'Verdana,10'
set output 'time.png'

plot 'result_clock_gettime.csv' using 1:2 title 'baseline' with linespoints, \
     'result_clock_gettime.csv' using 1:3 title 'openmp_2' with linespoints, \
     'result_clock_gettime.csv' using 1:4 title 'openmp_4' with linespoints, \
     'result_clock_gettime.csv' using 1:5 title 'AVX' with linespoints, \
     'result_clock_gettime.csv' using 1:6 title 'AVX+unroll' with linespoints, \

感謝hugikun999的共筆

Run & Problems

看資料&思考問題 - 9/28 10pm

為什麼有一些指令可以支援四個暫存器的運算元，使得 code 更小，並且減少一些不必要的指令，提升速度

gnuplot作圖遇到問題 9/29 8pm

make plot 失敗: No rule to make target 'plot'. Stop.

gencsv: default
        for i in `seq 100 5000 1000000`; do \
                printf "%d," $$i;\
                ./benchmark_clock_gettime $$i; \
        done > result_clock_gettime.csv 

plot: result_clock_gettime.csv
        gnuplot runtime.gp

已把gnuplot會用到的runtime.gp寫完並放在/compute-pi底下，也檢查了一下Makefile的格式與執行make plot時是否已有result_clock_gettime.csv的存在，但最後出現錯誤，目前還沒查明原因

解決:在搬動compute-pi的資料夾的過程中，其中一個terminal沒更新到訊息，所以改過Makefile，與用來run的terminal所在的檔案夾不同

原檔作圖分析

分析baseline與其他優化方法的函式，隨著N增加，他們所花費時間的變化
seq 100 5000 1000000

分析: baseline & openmp_4 的時間數據震盪都偏大，其中openmp_2 跟 openmp_4的時間落點幾乎重疊(細看發現omp2略低!!why?!)，只是openmp_4的震盪較小，而使用AVX優化所花的時間是最少的
對於震盪的原因看法: 因為baseline, openmp_2,openmp_4的執行時間比較久，而系統其實也在執行著很多背景程式，所以有許多的process會搶資源，而導致取的時間有延遲而失真。
Q:但是無法理解為什麼 openmp_4比openmp_2還要激烈震盪?

去除極端的時間

用95%的信賴區間來把極端的時間(也就是有可能受其他因素影響的時間)去除

改寫benchmark_clock_gettime.c
因為原本的benchmark是計算執行輸入同一個N，函式25次的總共花費時間，而現在改成:同一個N，每執行一次計算一次時間，共執行__次來算信賴區間，最後印出經過計算後的時間
~~但是因為math.h的函式出問題尚未解決，決定先用mean(極端值仍然存在)~~

感謝hugikun999幫忙找math.h出問題的原因><: -lm要放在最後面!!!

Makefile修改

CC = gcc
CFLAGS = -O0 -std=gnu99 -Wall -fopenmp -mavx
EXECUTABLE = \
        time_test_baseline time_test_openmp_2 time_test_openmp_4 \
        time_test_avx time_test_avxunroll \
        benchmark_clock_gettime

default: computepi.o
        $(CC) $(CFLAGS) computepi.o time_test.c -DBASELINE -o time_test_baseline
        $(CC) $(CFLAGS) computepi.o time_test.c -DOPENMP_2 -o time_test_openmp_2
        $(CC) $(CFLAGS) computepi.o time_test.c -DOPENMP_4 -o time_test_openmp_4
        $(CC) $(CFLAGS) computepi.o time_test.c -DAVX -o time_test_avx
        $(CC) $(CFLAGS) computepi.o time_test.c -DAVXUNROLL -o time_test_avxunroll
        $(CC) $(CFLAGS) computepi.o benchmark_clock_gettime.c -o benchmark_clock_gettime -lm

猜想一個問題:會不會因為紀錄的時間太短，nsec的數字會丟失?

From shelly4132 & 王紹華的共筆；信賴區間計算

測試各個函式所計算pi的時間 - mean版本

seq 100 5000 1000000 & SAMPLE_SIZE=25
seq 100 5000 1000000 & SAMPLE_SIZE=100
seq 1000 1110 112000 & SAMPLE_SIZE=1000

不過還在思考為甚麼會在最後的時候時間就爆炸了
4. seq 100 111 11200 & SAMPLE_SIZE=10000

測試各個函式所計算pi的時間 - 95%信賴區間

seq 100 111 11200 & SMAPLE = 25

測試各個函式所計算的pi的正確性

benchmark_error_rate.c

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include <stdint.h>
#include "computepi.h"

#define ONE_SEC 1000000000.0
//#define M_PI acos(-1.0)


int main(int argc, char const *argv[])
{

    if (argc < 2) return -1;

    int N = atoi(argv[1]);
    double pi, diff, error;


    // Baseline
    pi = compute_pi_baseline(N);
    diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi;
    error = diff / M_PI;
    printf("%lf,", error );


    // OpenMP with 2 threads
    pi = compute_pi_openmp(N, 2);
    diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi;
    error = diff / M_PI;
    printf("%lf,", error );

    // OpenMP with 4 threads
    pi = compute_pi_openmp(N, 4);
    diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi;
    error = diff / M_PI;
    printf("%lf,", error );


    // AVX SIMD
    pi = compute_pi_avx(N);
    diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi;
    error = diff / M_PI;
    printf("%lf,", error );
	
    // AVX SIMD + Loop unrolling
    pi = compute_pi_avx_unroll(N);
    diff = pi - M_PI > 0 ? pi - M_PI : M_PI - pi;
    error = diff / M_PI;
    printf("%lf\n", error );

    return 0;
}

runtime_error_rate.gp & Makefile 調整一下

seq 100 111 11200

N值非16的倍數，所以AVX跟AVX+unroll的誤差才會那麼大
seq 80 800 11200

後來考慮到因為AVX的兩個版本是4組N值一起算、16組N值一起算，故設計讓這兩個能整除的數目，才不會少算，而圖也顯示大家的error rate都一樣

探討openmp thread 數目對效能影響

perf stat 分析

測試的event: cache-misses, cache-references, cycles, instructions 來分析各個函式執行時間長短不同的原因。

Performance counter stats for './time_test_baseline' (25 runs):

            18,980      cache-misses              #   23.003 % of all cache refs      ( +- 10.91% )
            82,513      cache-references                                              ( +-  8.18% )
    11,416,803,377      cycles                                                        ( +-  0.01% )
    10,813,679,993      instructions              #    0.95  insns per cycle          ( +-  0.00% )

       3.773416151 seconds time elapsed                                          ( +-  0.22% )

 Performance counter stats for './time_test_openmp_2' (25 runs):

           144,102      cache-misses              #   19.233 % of all cache refs      ( +- 14.61% )
           749,245      cache-references                                              ( +- 14.60% )
    13,919,361,574      cycles                                                        ( +-  4.67% )
    11,226,535,864      instructions              #    0.81  insns per cycle          ( +-  0.01% )

       3.316340440 seconds time elapsed                                          ( +-  7.29% )

 Performance counter stats for './time_test_openmp_4' (25 runs):

           142,875      cache-misses              #   16.832 % of all cache refs      ( +- 14.40% )
           848,851      cache-references                                              ( +- 15.04% )
    20,728,355,391      cycles                                                        ( +-  1.51% )
    11,245,729,658      instructions              #    0.54  insns per cycle          ( +-  0.01% )

       2.681960048 seconds time elapsed                                          ( +-  4.23% )

 Performance counter stats for './time_test_avx' (25 runs):

            31,124      cache-misses              #   30.657 % of all cache refs      ( +- 12.08% )
           101,522      cache-references                                              ( +- 10.52% )
     4,997,650,102      cycles                                                        ( +-  0.18% )
     4,106,976,991      instructions              #    0.82  insns per cycle          ( +-  0.00% )

       1.738483690 seconds time elapsed                                          ( +-  0.96% )

 Performance counter stats for './time_test_avxunroll' (25 runs):

            53,932      cache-misses              #   31.102 % of all cache refs      ( +- 13.40% )
           173,402      cache-references                                              ( +- 11.94% )
     4,622,056,802      cycles                                                        ( +-  0.46% )
     3,306,564,222      instructions              #    0.72  insns per cycle          ( +-  0.00% )

       1.600523838 seconds time elapsed                                          ( +-  0.74% )

分析圖
AVX雖然cache-miss偏高，但cycle數是其他三種1/3~1/5。
Q:為何使用AVX的話cache-miss會偏高?

證明 Leibniz formula for pi & 其他計算圓周率的方法

AVX

硬體

16 個 256-bit 的 YMM(YMM0-YMM15) 暫存器
32-bit 的控制暫存器 MXCSR
(not yet)MXCSR 上的 0-5 位元是浮點數的 exception，這幾個 bit 在 set 之後只能透過 LDMXCSR 或是 FXRSTOR clear，而 7-12 位元是獨立的 exception mask，在 power-up 或是 reset 後是初始化成 set 狀態，0-5 位元的 exception 分別是 invalid operation、denormal、divide by zero、overflow、underflow、precision。
指令可以分為 vector 跟 scalar 版本，在 vector 版本中資料會被視為 parallel SIMD 處理，而 scalar 則是只處理一個 entry

背景知識

可能延遲程式的因子:

將文字或資料輸程式所需的 I/O ( 如 Network I/O, Disk I/O…)
取得實際記憶體供程式使用所需的 I/O
其他程式所使用的 CPU 的時間
作業系統所使用的 CPU 的時間

參考資料1-Hw1-Ext

SMP

Attributes of Variables

aligned (alignment)
Ex: int x __attribute__ ((aligned (16))) = 0 compiler to allocate the global variable x on a 16-byte boundary
參考資料

信賴區間補充

不懂的部分

why line 14 是 computepi.o?

%.o: %.c
        $(CC) -c $(CFLAGS) $< -o $@

C programming

複習main(int argc, char const *argv[])
atoi(argv[1])
size_t
pow();
#define M_PI acos(-1.0)