2017q1 Homework1 (compute-pi)

contributed by <ierosodin>

Reviewed by `0xff07`

Pros

Uses a 95% confidence interval to filter out skewed data
An OpenCL implementation
getopt()

Cons

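The OpenCL build steps could be documented more clearly, or added to the Makefile.
In thread_benchmark.c, the code that tests different thread counts is somewhat verbose; consider a loop, or a shell script that tests the different thread counts automatically.
The same applies to error.c.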

Development environment

Operating system: Red Hat 7

$ lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Genuine Intel® CPU @ 3.30GHz
Stepping: 5
CPU MHz: 1277.976
BogoMIPS: 6600.19
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11

Software installation

$ git clone https://github.com/sysprog21/compute-pi
$ cd compute-pi
$ make check
$ make gencsv

Reproducing the experiment

$ time ./time_test_baseline

N = 400000000 , pi = 3.141593

real	0m6.979s
user	0m6.978s
sys	0m0.001s

$ time ./time_test_openmp_2

N = 400000000 , pi = 3.141593

real	0m3.568s
user	0m7.131s
sys	0m0.002s

$ time ./time_test_openmp_4

N = 400000000 , pi = 3.141593

real	0m1.886s
user	0m7.536s
sys	0m0.001s

$ time ./time_test_avx

N = 400000000 , pi = 3.141593

real	0m3.340s
user	0m3.331s
sys	0m0.003s

$ time ./time_test_avxunroll

N = 400000000 , pi = 3.141593

real	0m2.030s
user	0m2.028s
sys	0m0.002s

Line charts

Plotting the data in result_clock_gettime.csv as a line chart shows that execution time increases with N (plotted for loop = 1).

Rewrite the gencsv target in the Makefile to increase the number of data points:

for i in `seq 100 100 25000`; do \
                printf "%d " $$i;\
                ./benchmark_clock_gettime $$i; \
        done > result_clock_gettime.csv

For details on seq: $ man seq

The relationship is roughly proportional, but many data points drift far from the trend.

Averaging

Averaging repeated runs reduces the influence of bad data on the overall result. Results averaged over 25 runs per data point:

Averaged over 100 runs:

After averaging, the influence of bad data is clearly smaller.
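
The averaging step can be sketched roughly as follows. This is a minimal illustration, not the repository's benchmark code: `compute_pi_baseline` is a stand-in written for this example, and `mean_runtime` is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Left-Riemann-sum estimate of pi; stands in for the function under test. */
double compute_pi_baseline(size_t n)
{
    double pi = 0.0;
    double dt = 1.0 / (double)n;
    for (size_t i = 0; i < n; i++) {
        double x = (double)i / (double)n;
        pi += dt / (1.0 + x * x);
    }
    return pi * 4.0;
}

/* Average wall-clock time, in seconds, of `runs` calls with the same n. */
double mean_runtime(size_t n, int runs)
{
    volatile double sink; /* keeps the compiler from discarding the call */
    double total = 0.0;

    for (int r = 0; r < runs; r++) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        sink = compute_pi_baseline(n);
        clock_gettime(CLOCK_MONOTONIC, &end);
        total += (double)(end.tv_sec - start.tv_sec) +
                 (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    }
    (void)sink;
    return total / runs;
}
```

Calling `mean_runtime(25000, 100)` would give one averaged data point for N = 25000. A single slow run still shifts the mean, which is what motivates filtering outliers next.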

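95% confidence interval

A confidence interval is an interval estimate of a population parameter computed from a sample: a range, tied to the measurements, that is expected to contain the true value at a stated confidence level. For a 95% confidence interval, about 95% of intervals constructed this way contain the true value.

Here benchmark_clock_gettime.c is modified to take 100 samples, determine the 95% confidence interval, discard the samples that fall outside it, and average the rest.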

Tightening the criterion to 90%, that is, the mean plus or minus 1.645 standard deviations, brings the plot close to a straight line.
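
A minimal sketch of this filtering, assuming the interval is taken as the mean plus or minus z standard deviations of the samples (z = 1.96 for 95%, z = 1.645 for 90%). The function name `filtered_mean` is hypothetical and the actual change to benchmark_clock_gettime.c may differ.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Mean of the samples that lie within mean +/- z * stddev of the full set.
 * Squared deviations are compared against z^2 * variance, so no square
 * root (and no libm) is needed.
 */
double filtered_mean(const double *samples, size_t n, double z)
{
    assert(n > 0);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    double mean = sum / (double)n;

    double sq = 0.0;
    for (size_t i = 0; i < n; i++)
        sq += (samples[i] - mean) * (samples[i] - mean);
    double variance = sq / (double)n; /* population variance */

    double kept_sum = 0.0;
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        double dev = samples[i] - mean;
        if (dev * dev <= z * z * variance) {
            kept_sum += samples[i];
            kept++;
        }
    }
    /* If nothing survives the filter, fall back to the plain mean. */
    return kept > 0 ? kept_sum / (double)kept : mean;
}
```

For example, with nine samples of 1.0 and one outlier of 100.0, `filtered_mean(samples, 10, 1.96)` drops the outlier and returns 1.0, whereas the plain mean would be 10.9.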

Effect of thread count on performance

Performance is best when the thread count equals Thread(s) per core × Core(s) per socket, which is 2 × 6 = 12 on this machine (N = 25000).
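
Rather than hard-coding that number, a program could ask the system for the logical CPU count at run time. The sketch below assumes a Linux or BSD system: `_SC_NPROCESSORS_ONLN` is a widely supported extension rather than something POSIX requires, and `online_cpu_count` is a hypothetical helper name.

```c
#include <unistd.h>

/* Number of logical CPUs currently online, or 1 if the query fails. */
long online_cpu_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}
```

On the machine described in the lscpu output above, this would be expected to match the "CPU(s)" field.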

Threads 12 vs 16 (N = 100~25000)

ERROR

To check that the computed results are correct, the computed values are compared against the reference value.

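Four of the curves overlap; only avxunroll keeps oscillating, with a period of N = 16. Each iteration of the avxunroll for loop processes 16 values, but the N used in the experiment is not a multiple of 16, so the remainder is never computed and the result is numerically wrong.

Checks were added to avx and avxunroll for N not being a multiple of 4 and of 16 respectively, together with a correction for the remainder, which produced the corrected result.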
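
One common way to make such a correction is a scalar tail loop after the blocked main loop. The sketch below is plain C with no AVX intrinsics: the inner `lane` loop only stands in for one vector operation, and `compute_pi_blocked` is a hypothetical name, so this shows the idea rather than the author's actual patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Left-Riemann estimate of pi that walks the range in blocks of `width`
 * terms, the way a vectorized loop would, then finishes the remaining
 * n % width terms one at a time so that no term is dropped.
 */
double compute_pi_blocked(size_t n, size_t width)
{
    assert(n > 0 && width > 0);

    double dt = 1.0 / (double)n;
    double sum = 0.0;
    size_t i = 0;

    /* Main loop: full blocks only. */
    for (; i + width <= n; i += width) {
        for (size_t lane = 0; lane < width; lane++) {
            double x = (double)(i + lane) * dt;
            sum += dt / (1.0 + x * x);
        }
    }

    /* Tail loop: the terms the main loop could not cover. */
    for (; i < n; i++) {
        double x = (double)i * dt;
        sum += dt / (1.0 + x * x);
    }
    return sum * 4.0;
}
```

With the tail loop in place, `compute_pi_blocked(1003, 16)` and `compute_pi_blocked(1003, 1)` cover the same 1003 terms, so the period-16 oscillation disappears.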

OpenCL

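This is my first time using a GPU to speed up a program, and it is really impressive! (ierosodin)

OpenCL is a standard API designed for parallel computation on heterogeneous computing devices; the same syntax can direct either a CPU or a GPU to do the work. A GPU contains thousands of smaller, more efficient cores optimized for handling many tasks at once. OpenCL offers two parallelization models, task parallel and data parallel. Data parallel is the simplest: it fits whenever the same operation must be applied to a large amount of data.

The general OpenCL flow: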

Initialize the OpenCL device.
- clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
- clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &ret_num_devices);
- clGetDeviceInfo(device_id, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &workGroupSize, NULL);
Create a context and a command queue for the device.
- clCreateContext(0, 1, &device_id, NULL, NULL, &ret);
- clCreateCommandQueue(context, device_id, 0, &ret);
Compile the OpenCL program to run (called the kernel).
- clCreateProgramWithSource(context, 1, (const char **)&source_str, NULL, NULL);
- clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
Create the kernel object, allocate a buffer on the device, and bind it as a kernel argument.
- clCreateKernel(program, "Pi", &ret);
- clCreateBuffer(context, CL_MEM_READ_WRITE, workGroups * sizeof(float), NULL, &ret);
- clSetKernelArg(kernel, 0, sizeof(cl_mem), &mem_obj);
Run the compiled kernel.
- clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL);
Copy the results from the OpenCL device back to host memory.
- clEnqueueReadBuffer(command_queue, mem_obj, CL_TRUE, 0, sizeof(float)*workGroups, pi_partical, 0, NULL, NULL);

Inside the kernel:

Get the work-item's index within its work-group
- get_local_id(0);
Get the work-item's global index
- get_global_id(0);
Synchronize all work-items in the work-group
- barrier(CLK_LOCAL_MEM_FENCE);

// Arguments: one result slot per work-group (NumWorkGroups floats),
// one scratch slot per work-item in a group (workGroupSize floats),
// total iterations, chunk size
__kernel void Pi(__global float *workGroupBuffer, __local float *scratch, const uint niter, const uint chunk)
{
	const uint lid = get_local_id(0);
	const uint gid = get_global_id(0);

	const float h = 1.0f / (float)niter;
	float partial_sum = 0.0f;

	// Each work-item computes chunk iterations
	for (uint i = gid * chunk; i < (gid * chunk) + chunk; i++) {
		// Midpoint of interval i; i is zero-based, so the offset is +0.5
		// (with -0.5 every sample point is shifted one interval to the left)
		float x = h * ((float)i + 0.5f);
		partial_sum += 4.0f / (1.0f + x * x);
	}

	// Each work-item stores its partial sum in the work-group array
	scratch[lid] = partial_sum;

	// Synchronize all work-items within the work-group
	barrier(CLK_LOCAL_MEM_FENCE);

	float local_pi = 0.0f;

	// Only work-item 0 of each work-group performs the reduction
	// of that work-group
	if (lid == 0) {
		const uint length = lid + get_local_size(0);
		for (uint i = lid; i < length; i++) {
			local_pi += scratch[i];
		}
		// Store the work-group sum.
		// The final reduction across work-groups is done on the CPU.
		workGroupBuffer[get_group_id(0)] = local_pi;
	}
}
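
To make the decomposition easier to follow, here is a CPU-only model of the same scheme in plain C. It is a sketch under stated assumptions: each term is evaluated at the interval midpoint h·(i + 0.5), the host is assumed to scale the final sum by h, and the function name `pi_by_workgroups` is hypothetical. It does not call OpenCL at all.

```c
#include <stddef.h>

/*
 * `groups` work-groups of `group_size` work-items, each work-item summing
 * `chunk` consecutive terms. Total terms: groups * group_size * chunk.
 */
float pi_by_workgroups(size_t groups, size_t group_size, size_t chunk)
{
    size_t niter = groups * group_size * chunk;
    float h = 1.0f / (float)niter;
    float pi = 0.0f;

    for (size_t g = 0; g < groups; g++) {
        float group_sum = 0.0f;                 /* plays workGroupBuffer[g] */
        for (size_t lid = 0; lid < group_size; lid++) {
            size_t gid = g * group_size + lid;  /* global work-item index */
            float partial = 0.0f;               /* plays scratch[lid] */
            for (size_t i = gid * chunk; i < (gid + 1) * chunk; i++) {
                float x = h * ((float)i + 0.5f);
                partial += 4.0f / (1.0f + x * x);
            }
            group_sum += partial;               /* reduction inside the group */
        }
        pi += group_sum;                        /* final reduction on the CPU */
    }
    return pi * h;                              /* scale by the step width */
}
```

For instance, `pi_by_workgroups(4, 64, 16)` models 4 work-groups of 64 work-items with a chunk of 16, for 4096 iterations in total.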

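When N is small, OpenCL is slower because using the GPU costs time to move data. Once N exceeds 40000, OpenCL is substantially faster, and the plot shows that its execution time is almost constant: however finely the interval is divided, the time barely changes. (So N can be raised to improve precision!)

However, because the GPU computation uses single-precision floating point, the error is also considerably larger when N is small.

Effect of chunk size on performance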
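
To see how much of the error comes from precision alone, the same sum can be accumulated in float and in double on the CPU. This is an illustrative sketch using the midpoint rule; the function names are hypothetical and nothing here runs on the GPU.

```c
#include <stddef.h>

/* Midpoint-rule estimate of pi accumulated in single precision. */
float pi_midpoint_float(size_t n)
{
    float h = 1.0f / (float)n;
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float x = h * ((float)i + 0.5f);
        sum += 4.0f / (1.0f + x * x);
    }
    return sum * h;
}

/* The same estimate accumulated in double precision. */
double pi_midpoint_double(size_t n)
{
    double h = 1.0 / (double)n;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = h * ((double)i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * h;
}
```

Comparing both results against a reference value of π for the same n separates the rounding error of the float version from the discretization error that both versions share.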

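I tried setting chunks to $2^{x}$, where $x$ is the largest integer less than $\log_{2} N$, to get a smaller workGroups value and so reduce the number of data exchanges between the GPU and the CPU. The result: whatever $N$ is, the time is always 0.000005 seconds.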

My guess is that the computation takes less than the shortest interval of a single CPU interrupt, so every measured time comes out the same. (ierosodin)
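
The chunk choice described above, the largest power of two strictly below N, can be computed as in the following sketch. The helper name `chunk_below` is hypothetical, and how workGroups is then derived from the chunk is not shown because the source does not state it.

```c
#include <assert.h>
#include <stddef.h>

/* Largest power of two strictly below n (n must be at least 2). */
size_t chunk_below(size_t n)
{
    assert(n >= 2);

    size_t chunk = 1;
    /* Same test as chunk * 2 < n, written so it cannot overflow. */
    while (chunk < n - chunk)
        chunk *= 2;
    return chunk;
}
```

For N = 25000 this gives 16384, and for N = 1024 it gives 512, since the exponent must be strictly less than $\log_{2} N$.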

getopt()

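Previously, whenever I wanted to pass parameters to a program at run time, I used argc and argv directly. The drawback is that the parameters must be given in exactly the right order, and the program needs a long series of validation checks. getopt() offers another way to pass parameters, built from options and option arguments. As long as each option is paired with its argument, the order on the command line does not matter, and the parsing code becomes much more readable.

Usage example: $ ./execute_file -n 1024 -c 32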

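getopt() is declared in unistd.h, which must be included to use it. It communicates through a few global variables, including:

optarg points to the argument of the option just found
optind is the index of the next argv element getopt() will scan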

Example code:

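int opt, a;

optind = 0; /* glibc: 0 forces getopt() to reinitialize; the POSIX starting value is 1 */

while ((opt = getopt(argc, argv, "n:t:")) != -1) {
    switch (opt) {
    case 'n':
        FUNC(optarg); /* FUNC stands for whatever consumes the -n argument */
        break;
    case 't':
        a = atoi(optarg);
        break;
    default: /* '?' */
        fprintf(stderr, "Usage: \n");
        exit(EXIT_FAILURE);
    }
}

getopt() scans the options on the command line. n:t: lists the options to look for; a colon after a letter means that option takes an argument. Each time an option is found, getopt() returns the option character, stores a pointer to its argument in optarg, and advances optind. When nothing is left to scan, it returns -1.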

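Good practice: optind records how far getopt() has scanned, so after one full pass it sits at the end of the argument list. Scanning the arguments with getopt() a second time then finds no options. Resetting optind before each scan avoids the problem: 0 works with glibc, and 1 is the portable POSIX starting value. (ierosodin)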
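
Putting these points together, a small parser for the `-n` and `-c` options from the usage example might look like the sketch below. The function name `parse_args` and its return convention are hypothetical. It resets optind to 1, the POSIX starting value, so that it can be called more than once; with glibc, 0 would also work.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>

/*
 * Parse "-n <iterations> -c <chunk>" in any order.
 * Returns 0 on success, -1 on an unknown option or a missing argument.
 */
int parse_args(int argc, char *argv[], long *n, long *chunk)
{
    int opt;

    optind = 1; /* restart scanning so repeated calls see the options again */

    while ((opt = getopt(argc, argv, "n:c:")) != -1) {
        switch (opt) {
        case 'n':
            *n = strtol(optarg, NULL, 10);
            break;
        case 'c':
            *chunk = strtol(optarg, NULL, 10);
            break;
        default: /* '?' */
            return -1;
        }
    }
    return 0;
}
```

Both `./execute_file -n 1024 -c 32` and `./execute_file -c 32 -n 1024` would yield the same values, which is the order independence described above.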

References

Wang Shao-Hua's shared notes
Interpreting confidence intervals and confidence levels
OpenCL benchmark. PI calculation
man getopt
Linux getopt twice
