# Data Level Parallelism (DLP)
## 1. Introduction
Data-level parallelism is commonly exploited with SIMD (single instruction, multiple data)
- common applications: machine learning algorithms, image processing, etc.
- SIMD architectures and applications
1. Vector architecture
* a single vector instruction operates on a whole vector; the elements are independent of one another
* early supercomputers were vector machines
over time, I/O came to take longer than the computation inside the device
2. Multimedia SIMD Instruction Set Extensions
* instruction-set extensions for general-purpose processors that add dedicated instructions for processing multiple data elements in parallel alongside ordinary instructions
* each element is independent, and roughly the same operation is applied to all of them
* ex: Intel
a. MMX (Multimedia Extensions, 1996) was the earliest SIMD extension, designed to accelerate computation in multimedia and communication applications.
b. SSE (Streaming SIMD Extensions) succeeded MMX with a more powerful instruction set that can handle more data types.
c. AVX (Advanced Vector Extensions) extends SSE further, adding more registers and wider vectors for higher throughput.
3. **Graphics Processing Units (GPUs)**
Compared with a traditional vector architecture, a GPU has many more processing units and can operate on far more data at once, which makes it better suited to large-scale parallel workloads such as deep learning and scientific computing.
## 2. Vector Machine
### 2.1 vector memory-memory (VMM)
- the first vector machines were memory-memory machines
- they require greater memory bandwidth:
every arithmetic operand must be read from and written back to memory
### 2.2 vector register machine (VRM)
#### 2.2.1 introduction
- arithmetic operands come from registers; **data must be moved between memory and the registers with explicit load/store instructions**
- Cray-1: first vector register machine
- example

- VMM vs. VRM

#### 2.2.2 VRM Architecture
- register files act as compiler-controlled buffers
a register file is a set of fast storage cells in the processor that holds data during computation; in a VRM, the vector registers serve as buffers for operands
- can hide memory latency
- loads/stores take a long time → vector load/store units are deeply pipelined to hide that latency
- vector functional units are deeply pipelined as well
- ex: RV64V




- a deeply pipelined architecture would ordinarily need lots of pipeline stalls
- but a vector stalls once for the first element; subsequent elements flow smoothly down the pipeline
→ a pipeline stall occurs once per vector instruction rather than once per vector element
→ the start-up time (filling the deep pipeline) is long
#### 2.2.3 multiple lanes (datapaths) for high throughput

- since there is no communication between lanes, all lanes run in parallel
→ vectorization is a massive compile-time reordering of operation sequencing
#### 2.2.4 vector execution time
- factors
1. length of the operand vectors
2. structural hazards
3. data dependences
- estimate: **convoy and chime**
- **convoy: set of vector instructions that could execute together**
- must not contain structural hazards
- sequences with RAW hazards should be placed in different convoys, unless they are chained
- vector chaining: the vector version of register bypassing; a dependent instruction may start as soon as each element of its source operand becomes available


- **chime: unit of time to execute one convoy**
- assume m convoys execute in m chimes; for a vector length of n, the sequence takes about m×n clock cycles
- the chime model ignores start-up time → a better approximation for long vectors
- ex:
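A worked sketch of the convoy/chime accounting, assuming the classic DAXPY kernel (Y = a·X + Y) as the example; the breakdown in the comments follows the usual textbook treatment:

```cuda
// DAXPY (Y = a*X + Y): the classic kernel for convoy/chime accounting.
void daxpy_scalar(int n, double a, const double *X, double *Y) {
    for (int i = 0; i < n; i++)
        Y[i] = a * X[i] + Y[i];
}
// A vector compiler would emit roughly: vld(X); vmul; vld(Y); vadd; vst(Y).
//   convoy 1: vld + vmul  (chained: vmul starts as vld's elements arrive)
//   convoy 2: vld + vadd  (second vld: structural hazard on the load/store unit)
//   convoy 3: vst         (structural hazard with the vld of convoy 2)
// m = 3 convoys => 3 chimes => about 3*n cycles for vector length n,
// ignoring start-up time, as the chime model does.
```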

#### 2.2.5 vector length register
- maximum vector length (mvl): the largest number of elements a vector machine can process in one vector operation;
the physical limit of the vector processor
- a vector of arbitrary length (possibly longer than mvl) is handled by processing it in mvl-sized pieces (strip mining); see the sketch below
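A minimal strip-mining sketch, assuming the standard textbook scheme; `MVL` and the function name are illustrative:

```cuda
#define MVL 64  // assumed hardware maximum vector length

// Process an arbitrary-length vector in pieces of at most MVL elements.
void strip_mined_daxpy(int n, double a, const double *X, double *Y) {
    int low = 0;
    int VL = n % MVL;                        // odd-sized first piece (may be empty)
    for (int j = 0; j <= n / MVL; j++) {
        for (int i = low; i < low + VL; i++) // inner loop vectorized at length VL
            Y[i] = a * X[i] + Y[i];
        low += VL;                           // start of the next piece
        VL = MVL;                            // all remaining pieces use full length
    }
}
```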

- harder to vectorize:
1. conditionals (IF statements) inside the loop

2. sparse matrices: most elements are zero and the nonzeros are spread irregularly, so the data is not contiguous; vectorization relies on contiguity, because a vector instruction operates on many adjacent elements at once
→ **solution: vector-mask control** (for conditionals; sparse matrices are handled with gather-scatter, section 2.2.9)
#### 2.2.6 vector-mask control
- allows the use of a mask while executing vector instructions to **control which elements participate in the computation**. Elements with a zero mask do not participate in the computation, while elements with a one mask do.
- vector mask register (VMR)
- a boolean vector that controls the execution of a vector instruction
- two implementations:

- example of compress/expand operations

compress: scans the mask register (M) and keeps only the elements whose mask bit is 1, packing them together
expand: performs the inverse operation
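A sketch of the kind of loop vector-mask control targets (the function name is illustrative); the comparison sets the mask, and the subtraction executes only where the mask bit is 1:

```cuda
// Conditional update: only elements with X[i] != 0 participate.
void masked_sub(double *X, const double *Y) {
    for (int i = 0; i < 64; i++)
        if (X[i] != 0)             // conceptually: set the VMR bit for lane i
            X[i] = X[i] - Y[i];    // executed only where the mask bit is 1
}
```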
#### 2.2.7 memory bank
- the memory system must be designed to support high bandwidth for loads and stores
→ adding many memory ports would take too much area
→ solution: memory banks
support multiple independent, non-sequential loads or stores per clock; the addresses sent to the banks can be controlled independently
- example

each processor makes 6 memory references per cycle, so 32 × 6 = 192 references per clock across all processors
one SRAM access takes 15 / 2.167 ≈ 6.92 ≈ 7 processor cycles
sustaining full bandwidth therefore needs 192 × 7 = 1344 memory banks; in practice, the hardware configuration may not fully meet this ideal
#### 2.2.8 Handling multidimensional arrays in a vector architecture
- load/store moves groups of data between vector registers and memory
- stride: the distance (in elements) separating the items to be gathered into a single vector register
three types
1. unit stride: a contiguous block of information in memory
fastest
<img src="https://hackmd.io/_uploads/SJLdCM3nC.png" width="60%">
2. non-unit stride: the data sits in memory at a fixed interval greater than one, so accesses hop through memory
harder to optimize
using a prime number of memory banks makes it easier to support different strides at full bandwidth
<img src="https://hackmd.io/_uploads/rkZMJ73nA.png" width="60%">
3. indexed
the positions of the data are determined by an index list; values are gathered from scattered locations in memory
good for sparse arrays of data
<img src="https://hackmd.io/_uploads/HyZB1XhnA.png" width="60%">
- Interleaved Memory Layout
memory addresses are assigned to the banks in round-robin order, so each bank can be servicing a different address at any given time, which raises bandwidth

with bank number = 8 and a non-unit stride:
stride = 4: only the banks whose number mod 8 is 0 or 4 ever get loaded/stored
stride = 3: essentially every memory bank gets loaded/stored
→ strides relatively prime to 8 behave better
- challenge
1. memory bank conflict

solution
a. SW: loop interchange, or declaring the array size to not be a power of 2 (array padding)
b. HW: a prime number of banks, though the address arithmetic gets harder (dividing by a prime)

* bank busy time: the minimum period of time between successive accesses to the same bank
* bank conflict: the same bank is hit again before its busy time has elapsed
2. in multidimensional arrays

B: row-major order, accessed with a stride of 1 double word
D: column-major order, accessed with a stride of 100 double words
solution: use VLDS/VSTS (vector load/store with stride), instructions that read or write data at a fixed stride over non-contiguous locations; see the sketch below
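The B/D discussion above matches a row-major matrix multiply; a hedged reconstruction is below. Note the bank interaction: with 8 interleaved banks, 100 mod 8 = 4, so a stride-100 column walk would keep cycling through only two banks.

```cuda
// A = B * D for 100x100 matrices stored row-major (C order).
void matmul100(double A[100][100], double B[100][100], double D[100][100]) {
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 100; j++) {
            A[i][j] = 0.0;
            for (int k = 0; k < 100; k++)
                A[i][j] += B[i][k] * D[k][j];
            // B[i][k]: k moves by 1 double word per step    -> unit stride
            // D[k][j]: k moves by 100 double words per step -> non-unit stride,
            //          the case VLDS/VSTS is meant for
        }
}
```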
#### 2.2.9 sparse matrices
<img src="https://hackmd.io/_uploads/rkFAX73hC.png" width="60%">
- K and M are index vectors that designate the nonzero elements of A and C
- gather-scatter operations are used for handling sparse matrices
operations: `vldi` (vector load indexed), `vsti` (vector store indexed)
the base address plus the offsets in the index vector determines the memory locations to read

* Vsetdcfg 4*FP64: configure four 64-bit floating-point vector registers.
* vld v0, x7: load data from the memory address in register `x7` into vector register `v0`; `v0` then holds the index array `K`.
* vldi v1, x5, v0: using base address `x5` and index vector `v0`, gather data from memory into vector register `v1`; this loads `A[K[i]]` into `v1`.
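The source-level loop this gather-scatter sequence implements (the standard sparse update; `K` and `M` are the index vectors from above, and the function wrapper is illustrative):

```cuda
// Gather A[K[i]] and C[M[i]], add them, scatter the sum back to A[K[i]].
void sparse_update(int n, double *A, const double *C,
                   const int *K, const int *M) {
    for (int i = 0; i < n; i++)
        A[K[i]] = A[K[i]] + C[M[i]];
    // vldi performs the two gathers; vsti performs the scatter of the sums.
}
```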
## 3. Multimedia SIMD Instruction Set Extensions
- data types: multimedia data (such as audio and images) is usually made of small elements, e.g., 8 bits
- flexible: a 64-bit register can be split into 2×32-bit, 4×16-bit, or 8×8-bit elements
- SIMD extensions vs. vector machine
1. no vector length register
vector length registers allow operation lengths to be set and adjusted dynamically; SIMD extensions usually have fixed operation widths
2. no strided or scatter/gather data transfer instructions
SIMD extensions typically support only contiguous memory operations
3. no mask register
a mask register allows conditional operation on selected elements of a vector; SIMD extensions lack one
> Note: newer extensions, such as AVX-512, do include mask registers (per ChatGPT)
- pros
easy to implement: SIMD extensions build on the existing arithmetic units, so no complex changes are required
easy to add new instructions for new media operations
- example: RVP (RISC-V 的 SIMD extension)
<img src="https://hackmd.io/_uploads/B1HHq86nC.png" width="50%">
<img src="https://hackmd.io/_uploads/ByCr5La30.png" width="50%">
the P in RVP stands for packed instructions
4D: the instruction operates on 4 double-precision floating-point operands (4 at a time)

splat.4D: copies the scalar a four times into the 256-bit register f0, expanding the scalar `a` into a vector so it can be used in SIMD operations
- pros:
1. faster
2. fewer instructions — about 4× fewer in this example
- cons (see the sketch after this list):
1. vector lengths that are not a multiple of the SIMD width (4 here) need extra logic to make sure every element is handled correctly
2. only contiguous data can be accessed
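A sketch of what the packed approach means at the source level, with the clean-up logic from con 1 made explicit; this is a hand-unrolled stand-in for the 4-wide packed instructions, and the names are illustrative:

```cuda
// 4-wide "packed" DAXPY with a scalar clean-up loop for the n % 4 leftovers.
void daxpy_packed4(int n, double a, const double *X, double *Y) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {   // one iteration ~ one 4D packed operation
        Y[i]     = a * X[i]     + Y[i];
        Y[i + 1] = a * X[i + 1] + Y[i + 1];
        Y[i + 2] = a * X[i + 2] + Y[i + 2];
        Y[i + 3] = a * X[i + 3] + Y[i + 3];
    }
    for (; i < n; i++)             // remainder: the extra logic con 1 refers to
        Y[i] = a * X[i] + Y[i];
}
```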
- Roofline performance model
- purpose: describes the balance between computation and data movement in an application or computational task
- arithmetic intensity: floating-point operations per byte of memory read
<img src="https://hackmd.io/_uploads/HJdfoIT2A.png" width="80%">
- x-axis: arithmetic intensity, increasing from left to right; its typical scaling with problem size N is
O(1), e.g., sparse matrix computations (SpMV)
O(log(N)), e.g., spectral methods (FFTs)
O(N), e.g., dense matrix computations (BLAS3)
y-axis: attainable floating-point performance
- typical workloads
Sparse matrix (SpMV): low arithmetic intensity; relatively few floating-point operations per large amount of memory traffic.
Spectral methods (FFTs): higher arithmetic intensity, with substantial floating-point work.
Dense matrix (BLAS3): high arithmetic intensity, dominated by floating-point operations.
- ridge point

the intersection of the slanted line (peak memory bandwidth) and the horizontal line (peak compute performance)
it marks the turning point of system performance: to its left, performance is limited by memory bandwidth; to its right, by compute capability
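The model reduces to a single formula: attainable performance is capped by whichever roof is lower at a given arithmetic intensity.

$$
\text{Attainable GFLOPs/s} = \min\left(\text{Peak floating-point performance},\ \text{Peak memory bandwidth} \times \text{Arithmetic intensity}\right)
$$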
## 4. GPU (graphics processing units)
- properties (taking Nvidia as the example)
1. almost every type of parallelism: multithreading, MIMD, SIMD, instruction-level
2. heterogeneous computing model
heterogeneous generally means different kinds of processors working together:
the CPU and GPU cooperate; the CPU handles control and management while the GPU handles the massively parallel computation
3. unify all forms of GPU parallelism as CUDA threads
in CUDA, all parallel work is decomposed into threads; every thread executes the same program code but can operate on different data
ex: image processing — each pixel can be given its own CUDA thread and processed in parallel
ex: machine learning — in deep-learning training, each neuron's computation or each data sample can be handled by a CUDA thread
4. SIMT (Single Instruction, Multiple Thread): one instruction drives many threads at once, each thread working on different data; it differs from SIMD in that it is applied in a multithreaded setting
SIMD vs. SIMT
- SIMD: emphasizes data parallelism; a single instruction operates on many data points at once, typically for vector and matrix workloads that apply the same operation to a large data set; all datapaths execute the same operation
- SIMT: emphasizes thread parallelism; threads under the same instruction can take different branches, making it more flexible than SIMD
- CUDA (Compute Unified Device Architecture)
- a parallel computing platform and programming model developed by NVIDIA
- C-like language
- abstractions for expressing parallelism
- include threads, blocks, and grids (a minimal kernel sketch follows this list)
- a thread is associated with each data element
- threads are organized into blocks, called thread blocks
- thread blocks are organized into a grid
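A minimal kernel sketch in the style of the usual CUDA DAXPY example (the 256-thread block size is a common but illustrative choice):

```cuda
// Each CUDA thread handles exactly one data element.
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                 // guard: n may not be a multiple of the block size
        y[i] = a * x[i] + y[i];
}

// Host side: launch a grid of ceil(n/256) thread blocks, 256 threads each.
void daxpy_launch(int n, double a, double *x, double *y) {
    daxpy<<<(n + 255) / 256, 256>>>(n, a, x, y);
}
```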
- Nvidia GPU vs. vector machine
similarities
1. scatter-gather: moving data between non-contiguous locations in global memory and the local stores of many processing units; in parallel computing this is how a large data set gets distributed to different units
2. mask registers
3. large register files
differences
1. the GPU uses multithreading to hide memory latency, while a vector machine relies on its deeply pipelined memory accesses
2. a vector machine has a few deeply pipelined functional units, whereas a GPU has many simpler ones running in parallel
## Appendix: Loop-level parallelism
- a compiler technology
- a loop is parallel if it can be written without a cycle in its dependences
- **loop-carried dependence**
- some iterations of a loop depend on results computed in earlier iterations
- loop-level parallelism requires that there be no loop-carried dependence
- example
1. s1 and s2 each use a value from the previous iteration
<img src="https://hackmd.io/_uploads/rkwl0UpnA.png" width="60%">
2. s1 uses the value s2 produced in the previous iteration
<img src="https://hackmd.io/_uploads/Bk0EAUahC.png" width="60%">
solution
<img src="https://hackmd.io/_uploads/rJQjCUph0.png" width="60%">
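Hedged reconstructions of the classic loops these figures presumably show:

```cuda
// Pattern 1: each statement uses its own result from the previous iteration,
// so the loop-carried dependences form a cycle and the loop is not parallel.
void pattern1(double A[101], double B[101], const double C[100]) {
    for (int i = 0; i < 100; i++) {
        A[i + 1] = A[i] + C[i];      /* s1 */
        B[i + 1] = B[i] + A[i + 1];  /* s2 */
    }
}

// Pattern 2: s1 uses the B[i] written by s2 in the previous iteration,
// but the dependence is not circular, so the loop can be transformed.
void pattern2(double A[100], double B[101],
              const double C[100], const double D[100]) {
    for (int i = 0; i < 100; i++) {
        A[i] = A[i] + B[i];          /* s1 */
        B[i + 1] = C[i] + D[i];      /* s2 */
    }
}

// Transformed pattern 2: no iteration now depends on an earlier one.
void pattern2_parallel(double A[100], double B[101],
                       const double C[100], const double D[100]) {
    A[0] = A[0] + B[0];
    for (int i = 0; i < 99; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[100] = C[99] + D[99];
}
```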
- recurrence
a special form of loop-carried dependence in which a statement depends on its own result from an earlier iteration
<img src="https://hackmd.io/_uploads/Bkgg1va3C.png" width="40%">
some vector computers have special support for executing recurrences
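A typical recurrence of this kind (likely what the figure shows); each iteration consumes the value the loop just produced:

```cuda
// Recurrence: Y[i] needs Y[i-1], written by this same loop one iteration ago.
void prefix_sum(double Y[100]) {
    for (int i = 1; i < 100; i++)
        Y[i] = Y[i - 1] + Y[i];   // turns Y into its running prefix sums
}
```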
- loop-carried dependence and intra-loop dependence are different
intra-loop dependence: a dependence between different statements within the same iteration
- two types
1. circular: s1 depends on s2, s2 depends on s1
2. not circular: although s1 depends on s2, s2 doesn’t depend on s1
- compiler technology for finding dependence
- affine: a linear transform (a linear map plus a translation); an array index is affine if it can be written as a×i+b, where a and b are constants and i is the loop index
ex:
<img src="https://hackmd.io/_uploads/H14_yP630.png" width="60%">
<img src="https://hackmd.io/_uploads/BkK_yPp2A.png" width="60%">
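A sketch of the distinction the examples likely draw (illustrative names):

```cuda
// Affine vs. non-affine array indices (affine = a*i + b with constant a, b).
void affine_demo(int n, double *A, const double *B,
                 double *C, double *D, const int *K) {
    for (int i = 0; i < n; i++) {
        A[2 * i + 3] = B[4 * i];  // affine: 2*i+3 and 4*i+0 -> analyzable
        C[i * i] = 0.0;           // i*i is not linear in i  -> not affine
        D[K[i]] = 0.0;            // indirect index K[i]     -> not affine
    }
}
```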
- how affine indexing relates to loop-level parallelism
affine indexing makes it much easier for the compiler to analyze loop dependences and thus to decide whether a loop can be parallelized
- example
- suppose the loop stores to index a×i+b and loads from index c×i+d, with i running from m to n
- a dependence exists if both of the following conditions hold:
1. **range condition**: there exist indices j and k such that m≤j≤n and m≤k≤n
2. **equality condition**: a×j+b = c×k+d
- quick test: GCD (greatest common divisor) test
- if a loop-carried dependence exists, then GCD(c,a) divides (d−b)
i.e., the greatest common divisor of c and a must divide the absolute value of d−b
- example
<img src="https://hackmd.io/_uploads/HkmbxwpnA.png" width="50%">
a = 2, b = 3, c = 2, and d = 0
GCD(a,c) = 2, |d−b| = 3
2 does not divide 3 → no dependence
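These coefficients correspond to a loop of the following shape (the standard textbook instance; the wrapper is illustrative):

```cuda
// Store index 2*i + 3 (a = 2, b = 3); load index 2*i + 0 (c = 2, d = 0).
void gcd_example(double *X) {
    for (int i = 0; i < 100; i++)
        X[2 * i + 3] = X[2 * i] * 5.0;
    // GCD(c, a) = 2 does not divide |d - b| = 3 -> no loop-carried dependence.
}
```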
- the GCD test is sufficient but not necessary: whenever a dependence exists the GCD test will detect it, but there are cases where the GCD test "succeeds" even though no dependence actually exists
- exact dependence analysis is in general a hard problem