課程｜平行程式設計

# 平行程式設計 > 心有重核若計算若邏輯繽紛如早春繁花 > 行實多序若同步若分離迅急似盛夏快雨 > [name=劉邦鋒] > [Tutorial](https://www.youtube.com/playlist?list=PLOvZ8aEg7xDn1H78MT2oApPWrLvcxb3fc) > [課程網頁](https://sites.google.com/view/pp2019-ntu/assignment) #### PP Introduction - Motivation - 我們希望變得 faster 和 better - 現代運算的資料量很大 - 我們希望快速地解決問題 - 有些計算是 Time Critical - Weather forecast - Stock market - Radar signal processing - Moore’s Law：電晶體密度約每兩年便會增加一倍。 - 不但數量要變多，成本也要變低，才有效益。 - 這幾年的走向越來越不穩定，業界認為接下來不會持續太多年。 - 可能可以提高效能，但成本一定會增加。 - Scientific Process - 科學方法 - Observation - Theory - Experiment → Slow - 有些科學的實驗仰賴電腦的 simulation，需要很大量的計算。 - expensive、dangerous、unfeasible、time consuming - Definition - 我們認為 Multicore 和 Multiprocessor 差異不大，都是在 shared memory 上面做操作。 - Multicore CPU - shared memory, has its own cache → memory arbiter - EX：Intel Ivy Bridge Intel Xeon - 15 cores, 37.5 MB L3 cache. - EX：Intel Xeon Phi - 三個計畫所構成的電腦 - 60 cores, pentium based. - Multiprocessor - shared memory - EX：TianHe - 16000 computers(nodes) - 2 Intel Ivy Bridge Intel Xeon - 3 Intel Xeon Phi - Multicomputer - Communication Cetwork(phone call) - Cluster Computing： - loosely or tightly connected computers - Computer system that are connected by a network, but not a shared memory, can be considered as a cluster. - Massively parallel computing - Grid computing - Cloud computing - Parallism - Instruction-level parallelism - Data parallelism - Task parallelism - Dependency Graph - node → task - edge → dependency - may add start node, end node - Wavefront：不存在彼此有 path 的一個點集合。 - Parallelism：最大的 wavefront 大小。 - Inequality about parallelism - $N\ge T\ge max(L,\frac{N}{min(P,W)})$ - P = The number of processors - N = The number of tasks - L = The longest path in the dependency graph - W = The maximum number of tasks in a wavefront - T = The unit of execution time - Distributed Computing - 目前還沒有 Automatic Parallel Programming 的機制。 - Parallel Programming v.s. Distributed Computing - Parallel Programming： - 目標在減少時間(Performance)。 - Race-car - Distributed Computing： - 目標在增加可靠度(reliability)。 - Tank #### PP Architecture - Multiprocessor - Shared global memory, UMA - ![](https://i.imgur.com/FupR6En.png) - Distributed shared memory, NUMA - ![](https://i.imgur.com/cHhs43O.png) - **Logically, we do not care about the memory management unit** - ![](https://i.imgur.com/j8ZnzhU.png) - However, in some programming models we need to take care of race condition - 每個 processor 有自己的計算資源，並可以透過 shared memory 來同步 - Barrier Syncrhonization：等到所有處理器都完成某一工作再一起往後做。 - Race Condition： - Software Solution：Critical section, lock, synchronization - 被用來製造 hardware random number generators。 - Cache： - ![](https://i.imgur.com/JJztEtu.png) - Hardware Solution：guarantee the consistenct between memory and cache. - Write-through：每次寫都寫回memory。 - Write-back：別人access memory時才會寫回。 - EX：Intel Gulftown CPU - 6 cores, Core i7-9xx, 12MB L3 cache - EX：Nvidia Tesla - General Purpose GPU - 240 processors, 4096 MB of GDDR3 memory - ![](https://i.imgur.com/n2FGWbA.png) - 對應平行程式：OpenCL - Multicomputer - ![](https://i.imgur.com/dxXIFYx.png) - Slow or very fast Network。 - 只有 local memory 沒有 shared memory。 - 需要使用一個 master 蒐集大家資訊。 - Topology of Network - Ring - Fat tree - ![](https://i.imgur.com/iSyPOX5.png) - Mesh and Torus - ![](https://i.imgur.com/NHGHSFz.png) - Hypercube - FFT (butterfly) - ![](https://i.imgur.com/pAafC9r.png) - 目前的趨勢是 Fat tree 或是高維的 Torus。 - LINPACK 作為平行處理的 benchmark - Flynn’s Taxonomy - SISD - Standard von Neumann architecture - sequential programming - SIMD - parallel programming on multiple processors - data parallelism - GPU supports - MIMD - functional parallelism - different processors executes different instructions on different data - MISD - Fault Tolerance - The same computation is repeated multiple times - Google MapRedcue computation #### PP Evaluation - Algorithm - 描述解決問題的過程 - 1.detailed - 2.well-defined - 3.temporal dependency - Upper bound：$\exists A\forall I~~T(A,I)\le f(n)$ - Lower bound：$\forall A\exists I~~T(A,I)\ge g(n)$ - Evaluation - Speed Up： $k=\frac{T_s}{T_p}$ - 注意要注重比較的基準。 - 不可以讓 sequential program 跑很慢來增加 speed up。 - $1\le k\le p$ - 用一個 processor 來 simulate sequential program。 - 用一個 processor 來 simulate parallel program 的所有 step 數。 - Super-linear Speedup - 仍然可能發生 $k>p$ - 可能是因為 sub-problems 的 working set 比較少，使 cache 的效果更好 - Efficiency： $e=\frac{k}{p}$ - 單位 processor 的 speed up - Work： $W_p=e\times T_p$ - Thoery - Amdahl’s Law：speed up 隨著 p 增加有上限。 - 不可平行化的部分，相較於 input size 呈現比例 x - $T_p\ge xT_s$ - $k\le \frac{1}{x}$ - Gustafson’s Law：speed up 隨著 p 增加沒有上限。 - 不可平行化的部分，相較於 input size 為 constant x - 任意的 arbitrarily large data 都可以平行化處理 - $T_s=x+p(1-x)$ - $T_p=1$ - $k=x+p(1-x)\propto p$ - If we have a good speedup even if problem size is small, then we really have a good parallel implementation. - Otherwise it could be a fabricated phenomenon with the use (or abuse) of Gustafson’s Law. #### PP Principle - Even partition - Even workload distribution - 希望 makespan 越小越好。 - 常常我們不知道正確的 work load 估計，這時候有一些 scheduling 的 issue。 - Proper granularity(basic unit) - 當 granularity 小時，越容易 balance workload ，但會有 overhead - mapping table - scheduling - synchronization - 同時也要看 system 可以 spawn 出的 thread 數量決定。 - Communication reduction - Synchronization - 種類 - Barrier Synchronization → shared memory - Before/After Synchronization → IPC - Access Synchronization → semaphore - Trade off with granularity - Data Locality - Temporal Locality - Disk cache - Translation Lookaside Buffer - Space Locality - Matrix Multiplication 如果要求 required data 都要在 local memory，則所有 data 都必須要在同一 processor - Communication-to-Computation Ratio - 越低越好，類似 Surface to Volume Ratio - 方形的切割比片狀的切割要來的好 - Efficient Implementation - Global synchronization - Batch mode message passing - Overlap communication with computation - Explore memory hierarchy #### PP Algorithm - Embarrassingly Parallel - 不需要溝通 - EX：矩陣乘法、Monte Carlo、Parameter Search。 - Divide and Conquer - 需要大量溝通、同步。 - Summaion - 方法一： - 時間：$O(n/p+p)$ - 極值：$p=\sqrt n\rightarrow O(\sqrt n)$ - 方法二： - 時間：$O(n/p+logp)$ - 極值：$p=\frac{n}{logn}\rightarrow O(logn)$ - Speed Up：$\frac{n}{n/p+logp}$ - Efficiency：$\frac{n}{n+plogp}$ - Prefix - Sequential 的 Prefix 問題：時間只要 $O(n)$ - 常常被應用在 compact array 之中。 - [4 0 9 0 0 8 5 7] → [4 9 8 5 7] - [1 1 2 2 2 3 4 5] - 只要 operator 有結合律，都可以平行計算加速。 - 方法一：處理器夠多時 - **for $i = 0$ to $logn$** - 每個 processor 加上自己往左數第 $i$ 的資料 - 時間：$O(logn)$ - 方法二： - 1.切割資料成 $p$ 份，$p$ 個處理器分別去算 prefix sum - 2.把每段的最後一個元素取出，額外算出一個 prefix sum - 3.$p$ 個處理器分別去組合兩個 prefix sum 的結果 - 時間：$O(n/p+logp)$ - Sorting - $log(n!)=\theta(nlogn)$ 因為 $\frac{n}{2}^\frac{n}{2}\le n!\le n^n$ - Sequential Bucket Sort - ![](https://i.imgur.com/h5xadZ8.png) - 總共時間複雜度為：$O(n+b)$ - Parallel Bucket Sort - Read to Bucket - 1.$O(\frac{n}{p})$ 把 key 拆成 $p$ 份，$p$ 個 processor 分別把 $n/p$ 個 key，記錄到自己獨立的 bucket 裡。 - Read from Bucket - 2.$O(b)$ 每個 process 整理到一個共用的 bucket。 - 3.$O(b/p+logp)$ 做 $b$ 個元素的 prefix sum。 - 4.$O(b)$ 每個 process 把答案放到 output array。 - 總共時間複雜度為：$O(\frac{n}{p}+b+logp)$ - **需要大量的記憶體。** - Sequential Quicker Sort(遞迴呼叫) - 假設找到 pivot 使 input array 平均分到 $g$ 個 group 上。 - 總共時間複雜度為：$T(n)=gT(\frac{n}{g})+O(n)=O(nlogn)$ - Parallel Quicker Sort(不可遞迴呼叫→直接平行解決) - 假設找到 pivot 使 input array 平均分到 $g$ 個 group 上。 - 1.$O(\frac{n}{p})$ 把 key 拆成 $p$ 份，每個 processor 分別把 $n/p$ 個 key，拆分到共同的 $g$ 個 group 裡。(假設可以同步) - 2.$O(\frac{n}{p}log\frac{n}{p})$ 把每一個 bucket 裡面的元素用傳統方法 sort。(假設平均分配) - 3.$O(logp)$ 對每段元素個數去算一個 prefix sum。 - 4.$O(\frac{n}{p})$ 每個 process 把答案放到 output array。 - 總共時間複雜度為：$O(\frac{n}{p}log\frac{n}{p}+logp)$ - Parallel Exchange Sort - 應用在 Hypercube 上面常見的方法。 - 演算法： - 將資料平均分到 $p=2^k$ 個處理器上，每個處理器用 bit string 表示。 - **for $i=0$ to $k$** - 每個處理器找到第 $i$ 個 bit 不同的另一處理器，兩兩配成一對。 - 兩個處理器自行挑選一個 pivot - 比 pivot 小的數都交給第 $i$ 個 bit $0$ 的那方 - 比 pivot 大的數都交給第 $i$ 個 bit $1$ 的那方 - 最後每個處理器在 sort 自己內部的資料。 - 這個演算法假設 - 每個回合資料都蠻平均的，每個處理器手上交換前後都是 $n/p$ 個。 - 每個回合大概都把手上的資料一半交給對方，再收到另一半的資料。 - 總共時間複雜度為： - $O(\frac{n}{p})O(logp)+O(\frac{n}{p}log\frac{n}{p})=O(\frac{n}{p}logn)$ - Matrix Multiplication - 時間：$O(\frac{n^3}{p})$ - Speed Up ： $p$ #### PP Thread - **Process and Thread** - Process 是執行中的 Program，Thread 是 Program 的一個執行緒。 - UNIX Process - 1.Process ID, Process Group ID, User ID, Group ID - 2.Environment Variable - 3.Working directory - 4.Program instructions - 5.Registers - 6.File descriptors - 7.Signal actions - UNIX Process 的資料擺放 - Stack：函數內宣告 - Heap：動態宣告 - Data：全域宣告 - Thread 和 Process 在平行運算上的差別 - Process 的溝通成本比較大 - Process 的創立成本比較大 - Threads 之間可以共享一些 Process 資源，也有自己的 private 資源 - **P Thread** - [hello10](https://ideone.com/dFPDkx) - [eight_single](https://ideone.com/DYPwQC) - [eight_join](https://ideone.com/89xWUv) - [eight_mutex](https://ideone.com/f2u3qW) - Include 以及 Compile - #include <pthread.h> - gcc xxx.c -o xxx -lpthread - 相關的資料型態和 API： - pthread_t - pthread_create(&thread,&attr,func,(void*)&t) - pthread_join(thread,(void**)&num) - pthread_exit(void* value) - pthread_attr_t - pthread_attr_init(&attr) - pthread_attr_setdetachstate(&attr,PTHREAD_CREATE_JOINABLE) - pthread_attr_destroy(&attr) - pthread_mutex_t - pthread_mutex_init(&lock,mutexattr) - pthread_mutex_lock(&lock) - pthread_mutex_unlock(&lock) - pthread_mutex_destroy(&lock) #### PP OpenMP - **Basic** - Include 以及 Compile - #include <omp.h> - gcc xxx.c -o xxx -fopenmp - 先寫 sequential 的程式，再用 openmp 改成平行版本。 - 相關的使用方法： - 基本的平行： - #pragma omp parallel - omp_get_num_procs() - omp_get_num_threads() - omp_set_num_threads() - omp_get_thread_num() - 迴圈的平行： - #pragma omp parallel for - 初始值：var = exp1 - 條件：var cond exp2 - 增加： - var++, var--, ++var, --var - var = var+exp3, var = var-exp3 - var+=exp3, var-=exp3 - 程式會自動等到所有 loop 跑完，做一次 barrier synchronization - #pragma omp parallel for private(v) - ![](https://i.imgur.com/UMr80XI.png) - 每一個 thread 會各有一個相同名稱的 private 變數，並且沒有初始化 - #pragma omp parallel for reduction(+:numSolution) - Global 先宣告一個變數 - 每一個 thread 會各有一個 private 變數，並且初始化成 global 值 - 平行結束之後，每一個 thread 會把結果加到 global 變數中 - #pragma omp parallel for firstprivate(v) - ![](https://i.imgur.com/ljlx0sG.png) - 每一個 thread 的 private 變數，被初始化成 global 值 - #pragma omp parallel for lastprivate(v) - ![](https://i.imgur.com/HQKTBIB.png) - loop 的最後一個 iteration 執行完後，把 private 變數的值存回 global 變數 - #pragma omp critical - 創造一個 critical section - **Intermediate** - Schedule Clause - #pragma omp parallel for schedule(policy , chunk) - static： - round robin，每 chunk 個工作一組，分配到不同 thread 上。 - dynamic： - 每 chunk 個工作一組，每個 thread 先要一份工作。 - 剩的工作等每個 thread 有空閒時才會再接著分配。 - guided： - 一開始給比較大的 chunk，後來給比較少的 chunk。 - chunk size 為工作量的下限。 - runtime： - omp_set_schedule(policy, chunk) - omp_sched_static - omp_sched_dynamic - omp_sched_guided - omp_sched_auto - auto： - Measure Time - double omp_get_wtime(); - Parallel Sections - Parallel Directives - 程式在 #pragma omp parallel 中，會生出 multi thread 了。 ```clike= #pragma omp parallel { #pragma omp sections { #pragma omp section #pragma omp section } #pragma omp for for(...) #pragma omp for for(...) } ``` - 避免把每個 thread 用 branch 分開： ```clike= int main(void){ omp_set_num_threads(8); #pragma omp parallel { int id = omp_get_thread_num(); for(int i=0;i<3;i++){ //printf("%d %d\n",i,id); if(id<4){ #pragma omp for for(int j=0;j<4;j++){ printf("1 --- %d %d %d\n",i,j,id); } }else{//some iteration doesn't done #pragma omp for for(int j=4;j<20;j++){ printf("2 --- %d %d %d\n",i,j,id); } } } } return 0; } ``` ```clike= int main(void){ omp_set_num_threads(8); #pragma omp parallel { int id = omp_get_thread_num(); for(int i=0;i<3;i++){ if(id<4){ //block forever #pragma omp for for(int j=0;j<16;j++){ printf("%d %d %d\n",i,j,id); } } } sleep(id); printf("%d %d\n",id,omp_get_num_threads()); } return 0; } ``` - **Advanced** - **Private Variable** - 不同 thread 去存取 global 變數的時間，遠大於去存取 private 變數的時間。 - **Thread Creation/Destroying** - Thread 的生成和銷毀很耗資源。 - **nowait(上課講義為甚麼會對??)** - 讓 parallel for 的結束，不特別做 barrier synchronization ```clike= int main(void){ #pragma omp parallel { int id = omp_get_thread_num(); for(int i=0;i<3;i++){ #pragma omp for // nowait for(int j=0;j<8;j++){ sleep(id); printf("%d %d %d\n",i,j,id); } } } return 0; } ``` ```clike= int main(void){ #pragma omp parallel { int id = omp_get_thread_num(); #pragma omp for // nowait for(int i=0;i<7;i++){ sleep(id); printf("%d %d\n",i,id); } #pragma omp sections { #pragma omp section printf("section 1 %d\n",id); #pragma omp section printf("section 2 %d\n",id); #pragma omp section printf("section 3 %d\n",id); } } return 0; } ``` - Critical Region - 包含越少變數越好，否則會互相等來等去 - 如果需要大量 access 同一個變數： - 1.設成 private 最後再整理在一起。 - 2.開額外的空間，最後再整理在一起。 - 3.用 reduction 是最好的方法。 - Double Buffer(Game Of Life) #### PP OpenCL - **OpenCL：** - 在 **heterogeneous platforms** 上面執行的 **framework** - 底層可以是任何處理器：CPU、GPU、DSP、FPGA 等 - ![](https://i.imgur.com/o4YDxtx.png) - Hardware - Host：C/C++，單執行緒在CPU上執行，控制流程 - Device：OpenCL平行程式，多執行緒，計算 - compute devices - compute units - processing elements(PE) - Memory - 每個 PE 有 private memory - 同一個 compute unit 的每個 PE 共享 local memory - 不同個 compute unit 的每個 PE 共享 global memory - **資訊查詢：** - [Platform ID](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetPlatformIDs.html)： - [Platform Information](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetPlatformInfo.html)： - [Device ID](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetDeviceIDs.html)： - [Device Information](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html)： ```clike= cl_int clGetPlatformIDs(); cl_int clGetPlatformInfo(); cl_int clGetDeviceIDs (); cl_int clGetDeviceInfo(); ``` - **編譯與執行：** - [Create Context](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateContext.html)： - [Create Command Queue](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateCommandQueue.html)： - [Create Program](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateProgramWithSource.html)： - [Build Program](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clBuildProgram.html)： - [Create Kernel](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateKernel.html)： - [Create Buffer](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateBuffer.html)： - [Set Kernel Argument](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clSetKernelArg.html)： - [Set Shape And Run](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html)： - [Read Buffer](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueReadBuffer.html)： ```clike= cl_context clCreateContext(); cl_command_queue clCreateCommandQueue(); cl_program clCreateProgramWithSource(); cl_int clBuildProgram(); cl_kernel clCreateKernel(); cl_mem clCreateBuffer(); cl_int clSetKernelArg(); cl_int clEnqueueNDRangeKernel(); cl_int clEnqueueReadBuffer(); ``` - **清空記憶體：** - clReleaseContext - clReleaseCommandQueue - clReleaseProgram - clReleaseKernel - clReleaseMemObject - ![](https://i.imgur.com/BQcgKkz.png) - **Kernel Function 操作性質**： - **可以印出記憶體位置，來區分該變數到底是 (__global, __const)、__local、__private，這三種的記憶體位置會存放在很不一樣的 virtual memory，而每個 thread 會根據自己的 page table 來對應到正確的 physical memory** - 參數： - 指標：可以用 __global 或是 __local - 變數：可以不特別指定或是 __private - 變數： - __local - __private - 不可以用 __global - 同步： - 只能在同一個 work group 內同步 - barrier(CLK_LOCAL_MEM_FENCE); - barrier(CLK_GLOBAL_MEM_FENCE); - **Buffer 的性質：clCreateBuffer 時可以設定的參數** - ![](https://i.imgur.com/C4NCc2K.png) - 若是 **ALLOC** 或是 **COPY** ：Host 需要使用 clEnqueueWriteBuffer 或 clEnqueueReadBuffer 來操作 Device 的記憶體 - 若是 **USE** ：Device 會自動存回 Host 所指定的記憶體，因此速度比較慢。 - **注意：如果裝置沒有支援 Unified memory for Host and Device，則此設定無效，仍然需要 clEnqueueReadBuffer 把資料讀回來。** - [**clFinish**](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clFinish.html)： ```clike= cl_int clFinish(cl_command_queue command_queue); ``` - clEnqueueNDRangeKernel 是一個 nonblocking 的操作 - 使用 clFinish 會等待所有的 command queue 都執行完。 - **Index Domain**： ```clike= size_t get_global_id(uint dimindx); size_t get_local_id(uint dimindx); size_t get_group_id(uint dimindx); ``` ```clike= size_t get_global_size(uint dimindx); size_t get_local_size(uint dimindx); size_t get_num_groups(uint dimindx); ``` - 每一個 work item 就是平行的執行 kernel function - 可以透過以上的 API ，來 specify 不同 work item 要做不同事情 - **矩陣乘法：** ```clike= __kernel void mul(__global int matrixA[N][N], __global int matrixB[N][N], __global int matrixC[N][N]){ int row = get_global_id(0); int col = get_global_id(1); int sum = 0; for(int i = 0; i < N; i++) sum += matrixA[row][i] * matrixB[i][col]; matrixC[row][col] = sum; } ``` ```clike= #define N 1024 #define Blk 64 #define BSIDE (N/Blk) __kernel void mul(__global int A[N][N], __global int B[N][N], __global int C[N][N]){ int globalRow = get_global_id(0); int globalCol = get_global_id(1); int localRow = get_local_id(0); int localCol = get_local_id(1); __local int ALocal[BSIDE][BSIDE]; __local int BLocal[BSIDE][BSIDE]; int sum = 0; for(int i = 0; i < Blk; i++){ ALocal[localRow][localCol] = A[globalRow][i*BSIDE + localCol]; BLocal[localRow][localCol] = B[i*BSIDE + localRow][globalCol]; barrier(CLK_LOCAL_MEM_FENCE); for(int k = 0; k < BSIDE; k++) sum += ALocal[localRow][k] * BLocal[k][localCol]; barrier(CLK_LOCAL_MEM_FENCE); } C[globalRow][globalCol] = sum; } ``` - **Time Measurement** - [分析 Event](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html)： - CL_PROFILING_COMMAND_QUEUED - CL_PROFILING_COMMAND_SUBMIT - CL_PROFILING_COMMAND_START - CL_PROFILING_COMMAND_END ```clike= clCreateCommandQueue(..., CL_QUEUE_PROFILING_ENABLE, ...); clEnqueueNDRangeKernel(..., &event); clWaitForEvents(1 , &event); cl_int clGetEventProfilingInfo(event, param_name,...); ``` - **Multiple Devices**： - 如果有兩張 GPU 卡，若是 task 是 independent 的，其實可以同時操作這些 device 來加速 - EX：[20019](https://judgegirl.csie.org/problem/0/20019) / [Main](https://ideone.com/5JMZuu) / [Kernel](https://ideone.com/ywmXXe) - **Dependency**： - 方法一：用 clFinish 確保該 command queue 完全執行結束。 ```clike= cl_int clFinish(cl_command_queue command_queue); ``` - 方法二：後來的 kernel 在執行時，設定說要確保等到前面的(兩個) event 執行結束才開始執行。 ```clike= cl_int clEnqueueNDRangeKernel(..., 2, events ,...); ``` - **Group Size**： - **GPU Streaming multiprocessors 架構：** - 每一張 GPU 卡片約有 20 個 Streaming multiprocessors (SM)。 - 每一個 SM 有 32 個 processing cores。 - 每一個 core 共用記憶體，執行 SIMD 指令。 - 每次會抓 32 個 thread 形成一個 warp，丟到同一個 SM 上面執行 - 因此最多 640 條 thread 同時運行。 - $GPU$ 同時可以服務的 groups 數量有個上限 - 最多就是 20 個 SM 同時執行 - $GPU$ 的 group size 有個上限 - 規定最多只能有 1024 個 item - 根據 processing cores，最多只有 32 個 thread 同時執行 - 假設現在工作的數量固定為 $N$，在 group size 不超過 1024 得前提下： - group size 小，group 數多，平行度不好 - group size 大，group 數少，平行度好 - **Makefile 編譯與執行：** ```clike= LIBS = /usr/lib/x86_64-linux-gnu/libOpenCL.so -lrt LINKOPT = -Wl,-rpath,/usr/lib/x86_64-linux-gnu all: gcc main.c $(LIBS) $(LINKOPT) -o main-cl -fopenmp ``` #### PP CUDA - EX：[20029](https://judgegirl.csie.org/problem/0/20029) / [Main](https://ideone.com/AzNqjt) - **設置記憶體空間：** ```clike= cudaSetDevice(0); cudaMalloc(Cu[0], 100 * sizeof(UINT)); cudaSetDevice(1); cudaMalloc(Cu[1], 200 * sizeof(UINT)); ``` - **Host 與 Device 記憶體搬動：** - [cudaMemcpy](http://horacio9573.no-ip.org/cuda/group__CUDART__MEMORY_g48efa06b81cc031b2aa6fdc2e9930741.html) - cudaMemcpyHostToHost - cudaMemcpyHostToDevice - cudaMemcpyDeviceToHost - cudaMemcpyDeviceToDevice ```clike= cudaMemcpy(To, From, Size, Direction); ``` - **呼叫 Kernel Function：** - 一維： ```clike= int N = 640; // 共有 640 件工作 int blocks = 20; // 分成 20 個 block int threads = 32; // 每個 block 包含 32 件工作 fun<<<blocks,threads>>>(N, ...); ``` - 最多三維： ```clike= int N = 1024*1024; // 共有 1024*1024 件工作 dim3 blocks(16,16); // 分成 16*16 個 block dim3 threads(64,64); // 每個 block 包含 64*64 件工作 fun<<<blocks,threads>>>(N, ...); ``` - **Kernel Function 寫法：** ```clike= __global void fun(int n, int A[][MAXN], int B[][MAXN]) { // 同一個 block 可以 shared 記憶體 __shared__ int Alocal[BSIDE][BSIDE]; __shared__ int Blocal[BSIDE][BSIDE]; // 查詢自己的 block id 以及 thread id int localRow = threadIdx.x; int localCol = threadIdx.y; int globalRow = blockIdx.x * BSIDE + threadIdx.x; int globalCol = blockIdx.y * BSIDE + threadIdx.y; for(int iter=0;iter<100;iter++){ //do something __syncthreads(); //do something __syncthreads(); } } ``` - **Makefile 編譯與執行：**[-D_FORCE_INLINES](https://github.com/tsiv/ccminer-cryptonight/issues/13) ```clike= nvcc -Xcompiler "-O2 -fopenmp" -D_FORCE_INLINES main.cu -o main ``` #### PP MapReduce / Hadoop - [Google File System](https://zh.wikipedia.org/wiki/Google%E6%AA%94%E6%A1%88%E7%B3%BB%E7%B5%B1) - Google 自行研發的一種分散式系統 - 假設： - 系統是由大量便宜、不穩定的硬體組成 - 檔案數量不多，但每個檔案都很大 - 每個 File 被 Divide 成 64 MB blocks - 每個 File 有三份 Replica - Read 動作大多都是 sequential 的 - Write 動作大多都是 sequential 的接在檔案最後 - 重要的是高的頻寬，而非開始的延遲 - MapReduce - 概念： - Programmer 只需要撰寫簡單的 Map 以及 Reduce 的函數，專注在計算本身而非平行化 - 系統會自動在 large cluster 上平行的執行 - ![](https://i.imgur.com/7lvSPBb.png) - Map Function - 把 input 變成 (key, value) - 運行 Map function 的機器稱為 Mapper - 運行完會把結果寫入 local disk - Reduce Function - 把 (key, value) 變成 output - 運行 Reduce function 的機器稱為 Reducer - 運行前要使用 remote procedure call 去向 Mapper 附近的 Disk 拿資料 - Shuffle Stage - Runtime System 自動對中間的 (key, value) 做整理 - 把 (key, value) 從 map function 透過網路傳輸到 reduce function - 此外在送給 reduce function 前，會先 sorting 以及 grouping - Partition - 預設是使用 Hash Function，決定每個 (key, value) 要送給哪個 Reducer - 使用者也可以自行設定對應關係 - Combiner - 在 Mapper 寫入 local disk 時，預先跑 combiner 的整理程式 - 目的是降低網路傳輸時間，也減輕 reducer 的工作負擔 - Fault Tolerance - Master 會一段時間固定去 ping worker - Mapper 發生 Fail，他原已完成的工作要全部重做 - Reducer 發生 Fail，他原已工作已寫入 GFS，不用重做 - Master 發生 Fail，系統需要重新啟動 - [Hadoop](https://hadoop.apache.org/) - 基於 GFS 以及 MapReduce，由 Apache 開發，使用 java 所撰寫的開源軟體架構