計算機組織與結構 L5 利用管線增加效能

# 計算機組織與結構 L5 利用管線增加效能 ## 管線特性 - **Pipelineing :** - 讓多個**指令平行執行**，可**增加硬體使用率**及**增加單位時間完成指令數**，是讓 **Processor 效能更快**的主要關鍵 - 將 **Job** 分成**多個步驟**，並將執行工作的**硬體**切成幾個對應的 **Stage**，讓多個工作使用不同部分的資源以提升硬體使用率 - **MIPS** 將步驟分為 - **Instruction Fetch (IF)** - **Instruction Decode (ID)** - **Execution (EXE)** - **Memory (MEM)** - **Write Back (WB)** - **MIPS** 是專為 Pipeline 設計的原因 - **MIPS 指令長度相同 :** 讓 IF、ID 較為簡單 - **MIPS 只有幾種 Instruction Format，且 rs 都在同樣位置 :** 在 ID 的同時還能讀取 Register File 的資料，就不用將 ID 再細分成兩個 Stage - **只有 Load、Store 才會存儲記憶體 :** 在 EXE 後就可以直接存儲 Memory，而非在 EXE 後還要增加計算 Memory 的零件，才能存取 Memory - **運算元在 Memory 中有 Alignment :** 若沒有 Alignment，則需要花兩次時間存儲 Memory - Pipelineing **不會對單一工作的 Latency 有幫助** - Pipelineing **會對整體 Throughput 有增進** - Pipelineing 的 **Potential Speedup 約等於 Number of Stages** - **Pipeline Rate 受限於最慢 Pipeline Stage 的執行時間** - **Fill 和 Drain 所花的時間會降低 Pipeline 的加速** - 假設理想 Pipeline Machine 的 $Clock\ Cycle\ Time\ 為\ T$，有 $S\ Stages$，$N\ 個\ Instructions$ 要執行 : - $Execution\ Time = [(S - 1) + N]*T$ - $CPI = \cfrac{[(S-1)+N]}{N},\ if\ N\ \rightarrow\ \infty,\ CPI = 1$ - $Speedup = \cfrac{S*N*T}{[(S-1)+N]*T},\ if\ N\ \rightarrow\ \infty,\ Speedup = S$ ## Pipeline Datapath - 加了 **Pipeline Register** 來儲存上一層的執行結果，供下一層使用 - ![image](https://hackmd.io/_uploads/BkHGs9_qa.png =70%x) - 為了防 WB 寫回 Register 時和 ID 讀 Register 時碰撞，都是**先寫再讀** - 接了 **Clock** 的有 - **PC** - **IF/ID** - **Register** - **ID/EXE** - **EXE/MEM** - **MEM/WB** ![image](https://hackmd.io/_uploads/B1On2qO5T.png =70%x) - **Pipeline 圖形化表示** - **Multiple-Clock-Cycle Pipeline Machine** - 標準 ![image](https://hackmd.io/_uploads/r1jF6c_56.png =60%x) - 簡易 ![image](https://hackmd.io/_uploads/SJgTTc_5a.png =60%x) - **Single-Clock-Cycle Pipeline Machine** - 簡易 (Multiple-Clock-Cycle Pipeline Machine 的某時間的截圖) ![image](https://hackmd.io/_uploads/BJRNR9uqp.png =60%x) ## Pipeline Datapath 的例外處理 - **Exception** 與 **Interrupt** - **Exception** - **Processor 內部**發生非預期事件干擾程式執行 - CPU 執行的 Software 出錯，大部分都是 Exception - 由 **OS** 處理，**處理方式** : - 將引發例外的指令位址存到 **EPC (Exception Program Counter)** 中 - 將造成例外的原因存到 **Cause Register** 中 - 將控制權轉到 **OS** - 對**程式**的**處理步驟**: - **Flush** 掉 **IF、ID、EXE** 的指令 (因為是在 EXE 階段發生例外的) - 讓造成 Exception 指令的前面指令盡快完成 - 將 **Offending Instruction (= Excepting Instruction = Faulting Instruction = 造成 Exeception 的指令)** 的 **Memory Address** 存到 **EPC** - **Exception Handling Routine** 要先將剛剛存的 Memory Address 減 4 (PC + 4)，再呼叫 **OS** 處理 - 返回程式碼 - 知道造成例外的原因的**方法** : - **Status Register (= Cause Register) :** 利用 **Status Register** 存例外原因，當例外發生時，單一的 **Entry Point ($80000180_{16}$)** 將控制權丟給 **OS**， OS 再對 Status Register **解碼**得知例外發生原因 ![image](https://hackmd.io/_uploads/S1UPaZYqT.png =60%x) - **Vectored Interrupt :** 控制權轉換是由例外造成，到 **Interrupt Vector Table** 查詢被觸發的位址得知例外發生原因 ![image](https://hackmd.io/_uploads/HynFTbF9T.png =60%x) - ![image](https://hackmd.io/_uploads/HJZzTWY5a.png =90%x) - **Interrupt** - **Processor 外部**發生非預期事件干擾程式執行 - 例如 I/O 來的信號 - 分為**軟體中斷**和硬體中斷，其中，軟體中斷又分為 : - **Precise Interrupt (= Precise Exception) :** - 出現在 **Stage 少**的機器中，例如 **ARM**、**MIPS** - 解決方式為 **Continue** 或 **Kill Instruction** - **Imprecise Interrput (= Imprecise Exception) :** - 要確切紀錄到 Exception 的指令位址不是一件簡單的事情，像是 x86 有四十幾個 Stage 的Pipeline，發生 Exception 時，**不知道哪個 Stage 的指令發生 Exception**，因此就**不會記錄**到發生 Exception 的指令位址 - 出現在 **Stage 多**的機器中，例如 **x86** - 解決方式為 **Kill Instruction** ## Pipeline Control Unit - ![image](https://hackmd.io/_uploads/B1NukiuqT.png =70%x) ![image](https://hackmd.io/_uploads/r1rskod5a.png =70%x) - ![image](https://hackmd.io/_uploads/BJLJeo_ca.png =70%x) - 加入 **Control Unit** ![image](https://hackmd.io/_uploads/Hyn-xsO5T.png =50%x) - ![image](https://hackmd.io/_uploads/SkScxid5T.png =80%x) ## Pipeline Hazard - **Hazard :** 在管線中不能順利進行下一個指令 - **Structural Hazards :** 硬體資源不夠多，導致一時間內要執行多個指令而無法執行 (假議題) - **Data Hazards :** 因為 **Data Dependency** 且 Pipeline 中**後面指令需要用到前面指令的 data** 而造成 (Data Dependency 不一定會 Data Hazards) - **Data Dependency** - **Instruction Distance <= 2** - **Control Hazards (Branch Hazards) :** Branch 還沒決定要不要跳，後面的指令就先進入 Pipeline 中了 ## Data Dependency - **Data Dependency** - **RAW :** - 才會造成 Data Hazards ```MIPS= // $s0 add $s0, $s1, $s2 sub $s3, $s0, $s4 ``` - 又叫 **True Data Dependency** - **WAR :** - 不會造成 Data Hazards ```MIPS= // $s1 add $s0, $s1, $s2 sub $s1, $s3, $s4 ``` - 又叫 **Antidependence (= Name Dependence) :** 是因為 Register 號碼被重複使用，而非 Data Dependency，可利用 **Register Renaming**，讓**編譯器安排額外的暫存器**解決此問題 - **WAW :** - 不會造成 Data Hazards ```MIPS= // $s0 add $s0, $s1, $s2 sub $s0, $s3, $s4 ``` - 又叫 **Output Dependency** - **Superscalar Pipeline 三種都可能造成 Data Hazards** ## 解決 Hazards 的大綱 ![image](https://hackmd.io/_uploads/HJtkVs_c6.png =80%x) ## Solutions To Structural Hazards - 解決 - 加入足夠硬體 - 暫停管線，讓優先權高的先執行 ## Solutions To Data Hazards - **軟體**解決 : 利用 **Compiler** 插入足夠 NOP 或重排指令執行順序 - **插入 NOP** - 優點 : 簡單 - 缺點 : 效率差，會佔去時脈週期 - **重排指令順序** - 未必所有程式都可以在不影響程式的正確性下重排指令順序 - 可能需要搭配插入 NOP - **硬體**解決 - **Forwarding (Bypassing)** - 加入特殊硬體，提早從內部資源獲取所缺少的項目 - 步驟為 1. **Detection :** 偵測**目前指令是否寫入暫存器**、**目前指令的目的暫存器是否和其後的來源暫存器相同** **EX Hazards** ![image](https://hackmd.io/_uploads/Bkt872_96.png =30%x) ``` cpp= if (EX/MEM.RegWrite and (EX/MEM.Register.Rd ≠ 0) and (EX/MEM.RegisterRd = ID/EXE.RegisterRs)) ForwardA = 10; ``` ![image](https://hackmd.io/_uploads/BkLOX2d5T.png =30%x) ``` cpp= if (EX/MEM.RegWrite and (EX/MEM.Register.Rd ≠ 0) and (EX/MEM.RegisterRd = ID/EXE.RegisterRt)) ForwardB = 10; ``` **MEM Hazards** ![image](https://hackmd.io/_uploads/rye7Ihd9p.png =30%x) ``` cpp= if (MEM/WB.RegWrite and (MEM/WB.Register.Rd ≠ 0) and (MEM/WB.RegisterRd = ID/EXE.RegisterRs)) ForwardA = 01; ``` ![image](https://hackmd.io/_uploads/r1DDU3u5a.png =30%x) ``` cpp= if (MEM/WB.RegWrite and (MEM/WB.Register.Rd ≠ 0) and (MEM/WB.RegisterRd = ID/EXE.RegisterRt)) ForwardB = 01; ``` 但因為可能出現 **EXE/MEM** 和 **ME/WB** 也要 Forwarding 的衝突，就要更正為 ```cpp= if (MEM/WB.RegWrite and (MEM/WB.Register.Rd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.Register.Rd ≠ 0) and (EX/MEM.RegisterRd = ID/EXE.RegisterRs)) and (MEM/WB.RegisterRd = ID/EXE.RegisterRs)) ForwardA = 01; ``` 和 ``` cpp= if (MEM/WB.RegWrite and (MEM/WB.Register.Rd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.Register.Rd ≠ 0) and (EX/MEM.RegisterRd = ID/EXE.RegisterRt)) and (MEM/WB.RegisterRd = ID/EXE.RegisterRt)) ForwardB = 01; ``` ![image](https://hackmd.io/_uploads/BJHku3Oca.png =70%x) 2. **Forwarding** ![image](https://hackmd.io/_uploads/BkAKvhOqT.png =70%x) ![image](https://hackmd.io/_uploads/SkA6YhOc6.png =70%x) ![image](https://hackmd.io/_uploads/BkRPw6dqp.png) 3. **Stall :** 當遇到 **Load** 指令後面跟著一個指令，且這個指令的**來源暫存器**剛好是 **Load 的目的暫存器**，則需要 Stall (此情形稱為 **Load-Use Data Hazard**) - **在 ID 和 EXE 階段偵測的 Hazard Detect Unit** 將 **Load-Use Data Hazard 中後來的指令 Stall** ```cpp= if (ID/EXE.MEMRead and [(ID/EXE.RegisterRt = IF/ID.Register.Rs) or (ID/EXE.RegisterRt = IF/ID.Register.Rt)]) stall ``` ![image](https://hackmd.io/_uploads/H1EGphuqa.png =90%x) - **Branch 的 Data Hazards (= Static Branch Prediction 永遠猜跳)** (此時 Branch 已在 ID 決定是否跳) 1. Branch 的暫存器與**前面第二或第三個 ALU 指令的目的暫存器**有 Data Dependency : ```MIPS # $1 add $1, $2, $3 sub $4, $4, $2 beq $1, $5, L ``` **Solution :** 用 **EXE/MEM 到 ID 的 Forwarding** 或 **利用先寫在讀的特性** 2. Branch 的暫存器與**前面一個 ALU 指令或前面第二個 Load 指令的目的暫存器**有 Data Dependency : ```MIPS # $1 add $1, $2, $3 beq $1, $5, L sub $4, $4, $2 ``` **Solution :** **Stall 一個 Clock** 3. Branch 的暫存器與前面一個 Load 指令的目的暫存器有 Data Dependency : ```MIPS sw $1, 12($3) beq $1, $5, L sub $4, $4, $2 ``` **Solution :** **Stall 兩個 Clock** ## Solutions To Control Hazards - 解決 - **減少分支延遲 :** 將**原本要在 MEM** 才能決定要不要跳**前移到 ID** 決定，這樣最多只有一個指令會被清除，而非兩個 ![image](https://hackmd.io/_uploads/BylmZpOqT.png =80%x) 若真的要清除指令，利用 **IF.Flush** 控制線，將 IF/ID 的指令換成 **sll $0, $0, $0 (NOP)** - **動態分支預測 (Dynamic Branch Prediction 利用 Runtime Information 猜跳或不跳) :** - 利用 - **BHT ( Branch History Table = Branch Prediction Buffer) :** 將 Branch 指令位址較低的地方當作索引的小型記憶體，記憶體中用一位元**紀錄最近 Branch 有沒有發生** - **BTB (Branch Target Buffer) :** **存放目的地 PC** 或**目的地指令**的 cache - 方法 - **1-bit Prediction Scheme :** - 屬於 **Counter Based** 方法 - 採**飽和運算** - 存 **BHT** 中 ![image](https://hackmd.io/_uploads/rkSMoa_9T.png =60%x) - **2-bit Prediction Scheme :** - 屬於 **Counter Based** 方法 - 採**飽和運算** - 存 **BHT** 中 ![image](https://hackmd.io/_uploads/HJ0rsa_qT.png =60%x) - **Correlating Predictor :** - 同時使用 **Local** 和 **Global** 分別看自己有沒有跳過、別人有沒有跳過預測自記這次要不要跳 (總共使用 2 bits) - **Tournament Predictor :** - 每個 Branch 使用多個預測器，看哪個比較好，再決定用哪個 - **延遲分支 :** **Compiler** 和 Assembler 將**安全指令** (不論分支條件是否成立都會執行的指令)放到 **Branch Delay Slot** 中，如此一來，不需要丟棄指令 - **From Before** - 在 **Branch 前面**找 - **先試 From Before，不可行再換另外兩種試試看** ![image](https://hackmd.io/_uploads/r1vHJ0O9p.png =20%x) - **From Target** - 在 **Branch 裡面**找 - 假設**跳的機率很高** - **不可能同時和 From Fall Through 同時存在** ![image](https://hackmd.io/_uploads/HJZC1Cdcp.png =20%x) - **From Fall Through** - 在 **Branch 後面**找 - 假設**跳的機率很低** - **不可能同時和 From Target 同時存在** ![image](https://hackmd.io/_uploads/B1VfeAuqa.png =20%x) ## 進階的管線 - Pipelineing 充分利用 **ILP (Instruction Level Parallelism)** - **提高 ILP** 的方法有兩種 : - **增加管線深度 (Stages 數) :** 例如 **Superpipeline** - **Multiple Issue :** 複製電腦內部單元，讓每個管線階段都能執行多個指令，可**讓 CPI 小於 1** (IPC 為 CPI 倒數，意思是 1 個 Clock 可執行的 Instruction 數) - 需要解決兩個**問題** - **要把 Instruction 分包並安裝進 Issue Slot** - **處理 Data Hazards 和 Control Hazards** - 有兩種**實作**方法 - **Static Multiple Issue :** 在**程式編譯**時做決策 - **MIPS-64** (2 issues, 5 stages) : 指令包分為 **整數型ALU運算或分支指令** 及 **載入或儲存指令** 兩個 Issue ![image](https://hackmd.io/_uploads/rJFdL1Fqp.png =60%x) ![image](https://hackmd.io/_uploads/ByNs8yY9p.png =70%x) ![image](https://hackmd.io/_uploads/ByxJwyF56.png =90%x) - **VLIW (Very Long Instruction Word) :** 完全使用 Compiler 包裹 Instruction，將 Issue Packet 當成一個單獨指令，根據定義好的欄位決定要做的動作 ![image](https://hackmd.io/_uploads/Hy6EKJF96.png =40%x) **優點 :** **△ Simpler Hardware** **△ More Scalable :** 一個 VLIW bundle 有更多 Instructions **缺點 :** **△ Complicated Software :** 編譯所需時間長 **△ All Issues Stall When Hazards** **△ Object Code Is Incompatible** **△ Needs Lots Of Memory Bandwidth** - **Intel IA-64 (= EPIC, Explicitly Parallel Instruction Computer)** 歸納為 **RISC**，暫存器比 MIPS 多，利用 **Instrution Group** 和 **Bundle** 達到 VLIW 優點，又比 VLIW 更有彈性 : **△ Instruction Group :** **Compiler** 包的，被丟到一般硬體執行，是一連串**沒有 Data Dependency** 的指令，只要有足夠硬體資源，就可以平行執行 **△ Bundle :** **Hardware** 包的，被放到特殊硬體中執行。每個 Bundle 長度為 **128 bits**，其中包含**一個** 5 bits 的 **Template Field** 和**三個** 41 bits 的 **Instructions**。(**Template Field** 會標示出 Bundle 中每個指令要用到的 5 個執行單元的哪個) IA-64 還用到 **Prediction** 處理分支 ```MIPS= # MIPS bne (P), ELSE statement 1 j Exit Exit: Statement 2 Exit: ``` ```MIPS= #IA-64 (P) Statement 1 (~ P) Statement 2 ``` > 下列指的東西相同，只是名稱不同 > > 在 **MIPS-64** 稱為 **Instruction Package(指令包)** > 在 **VLIW** 稱為 **Very Long Instruction (超長指令)** > 在 **IA-64** 稱為 **Instruction Group(指令群)** - **Dynamic Multiple Issue (= Superscalar):** 在**程式執行**時做決策還是需要仰賴 **Compiler** 做指令排程。 **Superscalar 技術**包括 : - **Dynamic Pipeline Scheduling (In-Order Completion) :** Processor 會在不改變程式資料流的前提下，**指令執行不會被打亂**，較為保守 將 Pipeline 分成 **Instruction Fetch And Issue**、**Multiple Functional Units**、**Commit Unit** 三種單元 ![image](https://hackmd.io/_uploads/HJM5Wxt56.png =80%x) **Reservation Buffer :** Multiple Finctional Units 的緩衝器，儲存運算元和運算子用的，運算完後，結果會被丟到 **Reservation Units** 和 **Commit Unit** **Reorder Buffer :** Commit Unit 內的緩衝器，會先等待安全後，再將結果**寫回 Register 或 Memory** - **Out-of-Order Execution :** Processor 不會考慮程式資料流的順序，**指令執行會被打亂**，較為不保守 **RAW、WAR、WAW 都會造成 Data Hazards** **△ RAW** 無法改變 **△ WAR、WAW** 合稱 **Storage Conflicts**，是由於**硬體資源衝突**發生，可用 **Register Renaming** 改變 - **Speculation** 是為了找尋和開發更大的平行度 - **軟體**而言 : **Compiler**會加入其他指令檢測正確性，並提供**修補程式**在猜錯時使用 - **硬體**而言 : **Processor**將猜測結果暫存，若猜對則將上述結果**寫回暫存器或記憶體**；猜錯就將暫存結果 **Flush** 掉，重新執行正確序列 ## 平行度 - **ILP、TLP、DLP、JLP** - **ILP (= Instruction Level Parallelism) :** - A microprocessor can execute (retire) more than one instruction in parallel. - **TLP (Thread Level Parallelism) :** - It is **a single program is decomposed in multiple parallel bits of code (threads)**. These threads can be executed in parallel. - **DLP (Data Level Parallelism) :** - (= **Distributed Systems**) - **run the same task on different components of data** - **JLP (Job Level Parallelism) :** - (= **Function Parallelism** = **Control Parallelism** = **Task Parallelism**) - It is a form of parallelization of computer code across multiple processors in parallel computing environments - In contrast to data parallelism which involves running the same task on different components of data, task parallelism is distinguished by **running many different tasks at the same time on the same data**.