# Data Level Parallelism (DLP)
## 1. Introduction
Data-level parallelism is commonly exploited with SIMD (single instruction, multiple data)
- common applications: machine learning algorithms, image processing, etc.
- SIMD architectures and applications
1. Vector architecture
* a single vector instruction operates on a whole vector; the elements are independent of one another
* early supercomputers were vector machines
over time, I/O came to take longer than the computation inside the device
2. Multimedia SIMD Instruction Set Extensions
* instruction-set extensions for general-purpose processors that add dedicated instructions for processing multiple data elements in parallel alongside ordinary instructions
* each element is independent, and roughly the same operation is applied to all of them
* ex: Intel
a. MMX (Multimedia Extensions, 1996) was the earliest SIMD extension, designed to accelerate computation in multimedia and communication applications.
b. SSE (Streaming SIMD Extensions) succeeded MMX with a more powerful instruction set that can handle more data types.
c. AVX (Advanced Vector Extensions) extends SSE further, adding more registers and wider vectors for higher throughput.
3. **Graphics Processing Units (GPUs)**
Compared with a traditional vector architecture, a GPU has many more processing units and can operate on far more data at once, which makes it better suited to large-scale parallel workloads such as deep learning and scientific computing.
## 2. Vector Machine
### 2.1 vector memory-memory (VMM)
- the first vector machines were memory-memory machines
- they require greater memory bandwidth:
every arithmetic operand must be read from and written back to memory
### 2.2 vector register machine (VRM)
#### 2.2.1 introduction
- arithmetic operands come from registers; **data must be moved between memory and the registers with explicit load/store instructions**
- Cray-1: first vector register machine
- example

- VMM vs. VRM

#### 2.2.2 VRM Architecture
- register files act as compiler-controlled buffers
a register file is a set of fast storage cells in the processor that holds data during computation; in a VRM, the vector registers serve as buffers for operands
- can hide memory latency
- loads/stores take a long time → vector load/store units are deeply pipelined to hide that latency
- vector functional units are deeply pipelined as well
- ex: RV64V




- a deeply pipelined architecture would ordinarily need lots of pipeline stalls
- but a vector stalls once for the first element; subsequent elements flow smoothly down the pipeline
→ a pipeline stall occurs once per vector instruction rather than once per vector element
→ the start-up time (filling the deep pipeline) is long
#### 2.2.3 multiple lanes (datapaths) for high throughput

- since there is no communication between lanes, all lanes run in parallel
→ vectorization is a massive compile-time reordering of operation sequencing
#### 2.2.4 vector execution time
- factors
1. length of the operand vectors
2. structural hazards
3. data dependences
- estimate: **convoy and chime**
- **convoy: set of vector instructions that could execute together**
- must not contain structural hazards
- sequences with RAW hazards should be placed in different convoys, unless they are chained
- vector chaining: the vector version of register bypassing; a dependent instruction may start as soon as each element of its source operand becomes available


- **chime: unit of time to execute one convoy**
- assume m convoys execute in m chimes; for a vector length of n, the sequence takes about m×n clock cycles
- the chime model ignores start-up time → a better approximation for long vectors
- ex:
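A worked sketch of the convoy/chime accounting, assuming the classic DAXPY kernel (Y = a·X + Y) as the example; the breakdown in the comments follows the usual textbook treatment:

```cuda
// DAXPY (Y = a*X + Y): the classic kernel for convoy/chime accounting.
void daxpy_scalar(int n, double a, const double *X, double *Y) {
    for (int i = 0; i < n; i++)
        Y[i] = a * X[i] + Y[i];
}
// A vector compiler would emit roughly: vld(X); vmul; vld(Y); vadd; vst(Y).
//   convoy 1: vld + vmul  (chained: vmul starts as vld's elements arrive)
//   convoy 2: vld + vadd  (second vld: structural hazard on the load/store unit)
//   convoy 3: vst         (structural hazard with the vld of convoy 2)
// m = 3 convoys => 3 chimes => about 3*n cycles for vector length n,
// ignoring start-up time, as the chime model does.
```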

#### 2.2.5 vector length register
- maximum vector length (mvl): the largest number of elements a vector machine can process in one vector operation;
the physical limit of the vector processor
- a vector of arbitrary length (possibly longer than mvl) is handled by processing it in mvl-sized pieces (strip mining); see the sketch below
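A minimal strip-mining sketch, assuming the standard textbook scheme; `MVL` and the function name are illustrative:

```cuda
#define MVL 64  // assumed hardware maximum vector length

// Process an arbitrary-length vector in pieces of at most MVL elements.
void strip_mined_daxpy(int n, double a, const double *X, double *Y) {
    int low = 0;
    int VL = n % MVL;                        // odd-sized first piece (may be empty)
    for (int j = 0; j <= n / MVL; j++) {
        for (int i = low; i < low + VL; i++) // inner loop vectorized at length VL
            Y[i] = a * X[i] + Y[i];
        low += VL;                           // start of the next piece
        VL = MVL;                            // all remaining pieces use full length
    }
}
```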

- harder to vectorize:
1. conditionals (IF statements) inside the loop

2. sparse matrices: most elements are zero and the nonzeros are spread irregularly, so the data is not contiguous; vectorization relies on contiguity, because a vector instruction operates on many adjacent elements at once
→ **solution: vector-mask control** (for conditionals; sparse matrices are handled with gather-scatter, section 2.2.9)
#### 2.2.6 vector-mask control
- allows the use of a mask while executing vector instructions to **control which elements participate in the computation**. Elements with a zero mask do not participate in the computation, while elements with a one mask do.
- vector mask register (VMR)
- a boolean vector that controls the execution of a vector instruction
- two implementations:

- example of compress/expand operations

compress: scans the mask register (M) and keeps only the elements whose mask bit is 1, packing them together
expand: performs the inverse operation
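A sketch of the kind of loop vector-mask control targets (the function name is illustrative); the comparison sets the mask, and the subtraction executes only where the mask bit is 1:

```cuda
// Conditional update: only elements with X[i] != 0 participate.
void masked_sub(double *X, const double *Y) {
    for (int i = 0; i < 64; i++)
        if (X[i] != 0)             // conceptually: set the VMR bit for lane i
            X[i] = X[i] - Y[i];    // executed only where the mask bit is 1
}
```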
#### 2.2.7 memory bank
- the memory system must be designed to support high bandwidth for loads and stores
→ adding many memory ports would take too much area
→ solution: memory banks
support multiple independent, non-sequential loads or stores per clock; the addresses sent to the banks can be controlled independently
- example

each processor makes 6 memory references per cycle, so 32 × 6 = 192 references per clock across all processors
one SRAM access takes 15 / 2.167 ≈ 6.92 ≈ 7 processor cycles
sustaining full bandwidth therefore needs 192 × 7 = 1344 memory banks; in practice, the hardware configuration may not fully meet this ideal
#### 2.2.8 Handling multidimensional arrays in a vector architecture
- load/store moves groups of data between vector registers and memory
- stride: the distance (in elements) separating the items to be gathered into a single vector register
three types
1. unit stride: a contiguous block of information in memory
fastest
<img src="https://hackmd.io/_uploads/SJLdCM3nC.png" width="60%">
2. non-unit stride: the data sits in memory at a fixed interval greater than one, so accesses hop through memory
harder to optimize
using a prime number of memory banks makes it easier to support different strides at full bandwidth
<img src="https://hackmd.io/_uploads/rkZMJ73nA.png" width="60%">
3. indexed
the positions of the data are determined by an index list; values are gathered from scattered locations in memory
good for sparse arrays of data
<img src="https://hackmd.io/_uploads/HyZB1XhnA.png" width="60%">
- Interleaved Memory Layout
memory addresses are assigned to the banks in round-robin order, so each bank can be servicing a different address at any given time, which raises bandwidth

with bank number = 8 and a non-unit stride:
stride = 4: only the banks whose number mod 8 is 0 or 4 ever get loaded/stored
stride = 3: essentially every memory bank gets loaded/stored
→ strides relatively prime to 8 behave better
- challenge
1. memory bank conflict

solution
a. SW: loop interchange, or declaring the array size to not be a power of 2 (array padding)
b. HW: a prime number of banks, though the address arithmetic gets harder (dividing by a prime)

* bank busy time: the minimum period of time between successive accesses to the same bank
* bank conflict: the same bank is hit again before its busy time has elapsed
2. in multidimensional arrays

B: row-major order, accessed with a stride of 1 double word
D: column-major order, accessed with a stride of 100 double words
solution: use VLDS/VSTS (vector load/store with stride), instructions that read or write data at a fixed stride over non-contiguous locations; see the sketch below
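The B/D discussion above matches a row-major matrix multiply; a hedged reconstruction is below. Note the bank interaction: with 8 interleaved banks, 100 mod 8 = 4, so a stride-100 column walk would keep cycling through only two banks.

```cuda
// A = B * D for 100x100 matrices stored row-major (C order).
void matmul100(double A[100][100], double B[100][100], double D[100][100]) {
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 100; j++) {
            A[i][j] = 0.0;
            for (int k = 0; k < 100; k++)
                A[i][j] += B[i][k] * D[k][j];
            // B[i][k]: k moves by 1 double word per step    -> unit stride
            // D[k][j]: k moves by 100 double words per step -> non-unit stride,
            //          the case VLDS/VSTS is meant for
        }
}
```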
#### 2.2.9 sparse matrices
<img src="https://hackmd.io/_uploads/rkFAX73hC.png" width="60%">
- K and M are index vectors that designate the nonzero elements of A and C
- gather-scatter operations are used for handling sparse matrices
operations: `vldi` (vector load indexed), `vsti` (vector store indexed)
the base address plus the offsets in the index vector determines the memory locations to read

* Vsetdcfg 4*FP64: configure four 64-bit floating-point vector registers.
* vld v0, x7: load data from the memory address in register `x7` into vector register `v0`; `v0` then holds the index array `K`.
* vldi v1, x5, v0: using base address `x5` and index vector `v0`, gather data from memory into vector register `v1`; this loads `A[K[i]]` into `v1`.
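The source-level loop this gather-scatter sequence implements (the standard sparse update; `K` and `M` are the index vectors from above, and the function wrapper is illustrative):

```cuda
// Gather A[K[i]] and C[M[i]], add them, scatter the sum back to A[K[i]].
void sparse_update(int n, double *A, const double *C,
                   const int *K, const int *M) {
    for (int i = 0; i < n; i++)
        A[K[i]] = A[K[i]] + C[M[i]];
    // vldi performs the two gathers; vsti performs the scatter of the sums.
}
```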
## 3. Multimedia SIMD Instruction Set Extensions
- data types: multimedia data (such as audio and images) is usually made of small elements, e.g., 8 bits
- flexible: a 64-bit register can be split into 2×32-bit, 4×16-bit, or 8×8-bit elements
- SIMD extensions vs. vector machine
1. no vector length register
vector length registers allow operation lengths to be set and adjusted dynamically; SIMD extensions usually have fixed operation widths
2. no strided or scatter/gather data transfer instructions
SIMD extensions typically support only contiguous memory operations
3. no mask register
a mask register allows conditional operation on selected elements of a vector; SIMD extensions lack one
> Note: newer extensions, such as AVX-512, do include mask registers (per ChatGPT)
- pros
easy to implement: SIMD extensions build on the existing arithmetic units, so no complex changes are required
easy to add new instructions for new media operations
- example: RVP (RISC-V 的 SIMD extension)
<img src="https://hackmd.io/_uploads/B1HHq86nC.png" width="50%">
<img src="https://hackmd.io/_uploads/ByCr5La30.png" width="50%">
the P in RVP stands for packed instructions
4D: the instruction operates on 4 double-precision floating-point operands (4 at a time)

splat.4D: copies the scalar a four times into the 256-bit register f0, expanding the scalar `a` into a vector so it can be used in SIMD operations
- pros:
1. faster
2. fewer instructions — about 4× fewer in this example
- cons (see the sketch after this list):
1. vector lengths that are not a multiple of the SIMD width (4 here) need extra logic to make sure every element is handled correctly
2. only contiguous data can be accessed
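A sketch of what the packed approach means at the source level, with the clean-up logic from con 1 made explicit; this is a hand-unrolled stand-in for the 4-wide packed instructions, and the names are illustrative:

```cuda
// 4-wide "packed" DAXPY with a scalar clean-up loop for the n % 4 leftovers.
void daxpy_packed4(int n, double a, const double *X, double *Y) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {   // one iteration ~ one 4D packed operation
        Y[i]     = a * X[i]     + Y[i];
        Y[i + 1] = a * X[i + 1] + Y[i + 1];
        Y[i + 2] = a * X[i + 2] + Y[i + 2];
        Y[i + 3] = a * X[i + 3] + Y[i + 3];
    }
    for (; i < n; i++)             // remainder: the extra logic con 1 refers to
        Y[i] = a * X[i] + Y[i];
}
```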
- Roofline performance model
- purpose: describes the balance between computation and data movement in an application or computational task
- arithmetic intensity: floating-point operations per byte of memory read
<img src="https://hackmd.io/_uploads/HJdfoIT2A.png" width="80%">
- x-axis: arithmetic intensity, increasing from left to right; its typical scaling with problem size N is
O(1), e.g., sparse matrix computations (SpMV)
O(log(N)), e.g., spectral methods (FFTs)
O(N), e.g., dense matrix computations (BLAS3)
y-axis: attainable floating-point performance
- typical workloads
Sparse matrix (SpMV): low arithmetic intensity; relatively few floating-point operations per large amount of memory traffic.
Spectral methods (FFTs): higher arithmetic intensity, with substantial floating-point work.
Dense matrix (BLAS3): high arithmetic intensity, dominated by floating-point operations.
- ridge point

the intersection of the slanted line (peak memory bandwidth) and the horizontal line (peak compute performance)
it marks the turning point of system performance: to its left, performance is limited by memory bandwidth; to its right, by compute capability
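The model reduces to a single formula: attainable performance is capped by whichever roof is lower at a given arithmetic intensity.

$$
\text{Attainable GFLOPs/s} = \min\left(\text{Peak floating-point performance},\ \text{Peak memory bandwidth} \times \text{Arithmetic intensity}\right)
$$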
## 4. GPU (graphics processing units)
- properties (taking Nvidia as the example)
1. almost every type of parallelism: multithreading, MIMD, SIMD, instruction-level
2. heterogeneous computing model
heterogeneous generally means different kinds of processors working together:
the CPU and GPU cooperate; the CPU handles control and management while the GPU handles the massively parallel computation
3. unify all forms of GPU parallelism as CUDA threads
in CUDA, all parallel work is decomposed into threads; every thread executes the same program code but can operate on different data
ex: image processing — each pixel can be given its own CUDA thread and processed in parallel
ex: machine learning — in deep-learning training, each neuron's computation or each data sample can be handled by a CUDA thread
4. SIMT (Single Instruction, Multiple Thread): one instruction drives many threads at once, each thread working on different data; it differs from SIMD in that it is applied in a multithreaded setting
SIMD vs. SIMT
- SIMD: emphasizes data parallelism; a single instruction operates on many data points at once, typically for vector and matrix workloads that apply the same operation to a large data set; all datapaths execute the same operation
- SIMT: emphasizes thread parallelism; threads under the same instruction can take different branches, making it more flexible than SIMD
- CUDA (Compute Unified Device Architecture)
- a parallel computing platform and programming model developed by NVIDIA
- C-like language
- abstractions for expressing parallelism
- include threads, blocks, and grids (a minimal kernel sketch follows this list)
- a thread is associated with each data element
- threads are organized into blocks, called thread blocks
- thread blocks are organized into a grid
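A minimal kernel sketch in the style of the usual CUDA DAXPY example (the 256-thread block size is a common but illustrative choice):

```cuda
// Each CUDA thread handles exactly one data element.
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                 // guard: n may not be a multiple of the block size
        y[i] = a * x[i] + y[i];
}

// Host side: launch a grid of ceil(n/256) thread blocks, 256 threads each.
void daxpy_launch(int n, double a, double *x, double *y) {
    daxpy<<<(n + 255) / 256, 256>>>(n, a, x, y);
}
```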
- Nvidia GPU vs. vector machine
similarities
1. scatter-gather: moving data between non-contiguous locations in global memory and the local stores of many processing units; in parallel computing this is how a large data set gets distributed to different units
2. mask registers
3. large register files
differences
1. the GPU uses multithreading to hide memory latency, while a vector machine relies on its deeply pipelined memory accesses
2. a vector machine has a few deeply pipelined functional units, whereas a GPU has many simpler ones running in parallel
## Appendix: Loop-level parallelism
- a compiler technology
- a loop is parallel if it can be written without a cycle in its dependences
- **loop-carried dependence**
- some iterations of a loop depend on results computed in earlier iterations
- loop-level parallelism requires that there be no loop-carried dependence
- example
1. s1 and s2 each use a value from the previous iteration
<img src="https://hackmd.io/_uploads/rkwl0UpnA.png" width="60%">
2. s1 uses the value s2 produced in the previous iteration
<img src="https://hackmd.io/_uploads/Bk0EAUahC.png" width="60%">
solution
<img src="https://hackmd.io/_uploads/rJQjCUph0.png" width="60%">
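Hedged reconstructions of the classic loops these figures presumably show:

```cuda
// Pattern 1: each statement uses its own result from the previous iteration,
// so the loop-carried dependences form a cycle and the loop is not parallel.
void pattern1(double A[101], double B[101], const double C[100]) {
    for (int i = 0; i < 100; i++) {
        A[i + 1] = A[i] + C[i];      /* s1 */
        B[i + 1] = B[i] + A[i + 1];  /* s2 */
    }
}

// Pattern 2: s1 uses the B[i] written by s2 in the previous iteration,
// but the dependence is not circular, so the loop can be transformed.
void pattern2(double A[100], double B[101],
              const double C[100], const double D[100]) {
    for (int i = 0; i < 100; i++) {
        A[i] = A[i] + B[i];          /* s1 */
        B[i + 1] = C[i] + D[i];      /* s2 */
    }
}

// Transformed pattern 2: no iteration now depends on an earlier one.
void pattern2_parallel(double A[100], double B[101],
                       const double C[100], const double D[100]) {
    A[0] = A[0] + B[0];
    for (int i = 0; i < 99; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[100] = C[99] + D[99];
}
```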
- recurrence
a special form of loop-carried dependence in which a statement depends on its own result from an earlier iteration
<img src="https://hackmd.io/_uploads/Bkgg1va3C.png" width="40%">
some vector computers have special support for executing recurrences
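A typical recurrence of this kind (likely what the figure shows); each iteration consumes the value the loop just produced:

```cuda
// Recurrence: Y[i] needs Y[i-1], written by this same loop one iteration ago.
void prefix_sum(double Y[100]) {
    for (int i = 1; i < 100; i++)
        Y[i] = Y[i - 1] + Y[i];   // turns Y into its running prefix sums
}
```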
- loop-carried dependence and intra-loop dependence are different
intra-loop dependence: a dependence between different statements within the same iteration
- two types
1. circular: s1 depends on s2, s2 depends on s1
2. not circular: although s1 depends on s2, s2 doesn’t depend on s1
- compiler technology for finding dependence
- affine: a linear transform (a linear map plus a translation); an array index is affine if it can be written as a×i+b, where a and b are constants and i is the loop index
ex:
<img src="https://hackmd.io/_uploads/H14_yP630.png" width="60%">
<img src="https://hackmd.io/_uploads/BkK_yPp2A.png" width="60%">
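A sketch of the distinction the examples likely draw (illustrative names):

```cuda
// Affine vs. non-affine array indices (affine = a*i + b with constant a, b).
void affine_demo(int n, double *A, const double *B,
                 double *C, double *D, const int *K) {
    for (int i = 0; i < n; i++) {
        A[2 * i + 3] = B[4 * i];  // affine: 2*i+3 and 4*i+0 -> analyzable
        C[i * i] = 0.0;           // i*i is not linear in i  -> not affine
        D[K[i]] = 0.0;            // indirect index K[i]     -> not affine
    }
}
```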
- how affine indexing relates to loop-level parallelism
affine indexing makes it much easier for the compiler to analyze loop dependences and thus to decide whether a loop can be parallelized
- example
- suppose the loop stores to index a×i+b and loads from index c×i+d, with i running from m to n
- a dependence exists if both of the following conditions hold:
1. **range condition**: there exist indices j and k such that m≤j≤n and m≤k≤n
2. **equality condition**: a×j+b = c×k+d
- quick test: GCD (greatest common divisor) test
- if a loop-carried dependence exists, then GCD(c,a) divides (d−b)
i.e., the greatest common divisor of c and a must divide the absolute value of d−b
- example
<img src="https://hackmd.io/_uploads/HkmbxwpnA.png" width="50%">
a = 2, b = 3, c = 2, and d = 0
GCD(a,c) = 2, |d−b| = 3
2 does not divide 3 → no dependence
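These coefficients correspond to a loop of the following shape (the standard textbook instance; the wrapper is illustrative):

```cuda
// Store index 2*i + 3 (a = 2, b = 3); load index 2*i + 0 (c = 2, d = 0).
void gcd_example(double *X) {
    for (int i = 0; i < 100; i++)
        X[2 * i + 3] = X[2 * i] * 5.0;
    // GCD(c, a) = 2 does not divide |d - b| = 3 -> no loop-carried dependence.
}
```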
- the GCD test is sufficient but not necessary: whenever a dependence exists the GCD test will detect it, but there are cases where the GCD test "succeeds" even though no dependence actually exists
- exact dependence analysis is in general a hard problem