Chap. 04 - The Processor (2): Pipelining

# Chap. 04 - The Processor (2): Pipelining > 課程內容 : 清華大學開放式課程黃婷婷教授 > 參考書目 : Computer Organization and Design: The Hardware/Software Interface (5th), David A. Patterson, John L. Hennessy > > 其他科目內容請參見 [[Here]](https://hackmd.io/@ChenZE/By4SOO6Jyl) ## Contents * [Sec. 4.7 An overview of pipelining](#Sec-47-An-overview-of-pipelining) * [Sec. 4.8 A pipelined datapath](#Sec-48-A-pipelined-datapath) * [1. Design a pipeline processor](#1-Design-a-pipeline-processor) * [2. Pipeline registers](#2-Pipeline-registers) * [2.1 Example: load](#21-Example-load) * [3. Pipelining R-format and load](#3-Pipelining-R-format-and-load) * [3.1 Example](#31-Example) * [Sec. 4.9 Pipeline control](#Sec-49-Pipeline-control) * [1. Group signals according to stages](#1-Group-signals-according-to-stages) * [2. Example](#2-Example) * [Sec. 4.10 Hazards: types of hazard](#Sec-410-Hazards-types-of-hazard) * [1. Structural hazard: single memory](#1-Structural-hazard-single-memory) * [2. Feedback path: data hazard](#2-Feedback-path-data-hazard) * [2.1 Types of data hazard](#21-Types-of-data-hazard) * [3. Feedback path: control hazard](#3-Feedback-path-control-hazard) * [Sec. 4.11 Handling data hazards](#Sec-411-Handling-data-hazards) * [1. Inserting NOP (software)](#1-Inserting-NOP-software) * [2. Forwarding, R-type-use (hardware)](#2-Forwarding-R-type-use-hardware) * [2.1 Datapath with forwarding](#2-1Datapath-with-forwarding) * [2.2 Detecting data hazard](#22-Detecting-data-hazard) * [2.3 Forwarding logic](#23-Forwarding-logic) * [2.4 Example](#24-Example) * [3. Stalls, load-use (hardware)](#3-Stalls-load-use-hardware) * [3.1 Datapath with stalling unit](#31-Datapath-with-stalling-unit) * [3.2 Control: handling stalls](#32-Control-handling-stalls) * [3.3 Example](#33-Example) * [Sec. 4.12 Handling branch hazards](#Sec-412-Handling-branch-hazards) * [1. Pipeline datapath with control signals](#1-Pipeline-datapath-with-control-signals) * [2. Handling branch hazard and pipeline](#2-Handling-branch-hazard-and-pipeline) * [2.1 Example](#21-Example) * [3. Dynamic branch](#3-Dynamic-branch) * [4. Delayed branch](#4-Delayed-branch) * [Sec. 4.13 Exceptions](#Sec-413-Exceptions) * [1. Exception and interrupts](#1-Exception-and-interrupts) * [2. Handling exceptions](#2-Handling-exceptions) * [2.1 Before jump to OS](#21-Before-jump-to-OS) * [2.2 Handler actions](#22-Handler-actions) * [3. Exception in pipeline](#3-Exception-in-pipeline) * [3.1 Exception example](#31-Exception-example) * [4. Multiple exceptions](#4-Multiple-exceptions) * [5. Imprecise exceptions](#5-Imprecise-exceptions) * [Sec. 4.14 Superscalar and dynamic pipelining](#Sec-414-Superscalar-and-dynamic-pipelining) * [1. Instruction-level parallelism](#1-Instruction-level-parallelism) * [2. Multiple issue](#2-Multiple-issue) * [3. Compiler/Hardware speculation](#3-CompilerHardware-speculation) * [3.1 Static multiple issue (compiler approach)](#31-Static-multiple-issue-compiler-approach) * [3.2 Dynamic multiple issue (hardware approach)](#32-Dynamic-multiple-issue-hardware-approach) ## Sec. 4.7 An overview of pipelining > 情境舉例: > 假設有 4 個人 Ann、Brain、Cathy 與 Dave，每個人都必須進行洗衣服、烘衣服與摺衣服 3 種動作，且洗衣服需耗時 30 分鐘，烘衣服需耗時 40 分鐘，摺衣服需耗時 20 分鐘，則 4 人依序完成這些動作所需時間為多少？上述問題如果以每個人依序執行 (sequential) 的方式完成，示意圖如下，總共需耗時 6 個小時 ![未命名](https://hackmd.io/_uploads/Syte5dOqA.png) 但考慮到前一個人在烘衣服時，洗衣機是空的可以給下一個人使用，以此方式將資源使用時間重疊可發展如下，僅需要 3.5 hr ![未命名](https://hackmd.io/_uploads/Bk8Ocuuq0.png) 上述概念即為 pipeline，整理如下: * Doesn't help latency of single task but throughput of entire: 不能縮短等待時間，但可以提升單位時間的工作量 * Pipeline rate limited by slowest stage: pipeline 的速度會被最慢的 stage 所限制 * Multiple tasks working at same time using different resources: 在同一時間點使用不同資源來完成多個任務 * Potential speedup = Number pipe stages: 速度提升與切分的 stage 數量有關 * Unbalanced stage length; time to fill and drain the pipeline reduce speedup: 每個 stage 的時間是不相等的 * Stall for dependences: 在 dependence 的地方會發生等待比較過去的 single cycle 與 pipeline 的執行方式，為了將一個 cycle 的工作內容切分成不同的階段 (stage)，對應到 clock cycle 也必須切分成多個 small clock cycle，示意圖如下，每一個 small clock cycle 對完成一個 stage 的工作 ![image](https://hackmd.io/_uploads/BJL0nOuqA.png) 對比兩者的效率 (performance) 而言，如下圖所示，single-cycle 需要每 800 ps 才會有結果，但 pipeline 每 200 ps 就會有結果產生 ![image](https://hackmd.io/_uploads/HkGwpdOcC.png) 簡而言之，所謂的 pipeline 就是在同一時間中，不同的 instruction 會共享不同的 resource ![image](https://hackmd.io/_uploads/HJiIWtu9A.png) ## Sec. 4.8 A pipelined datapath ### 1. Design a pipeline processor 如同前一小節所述，使用 pipeline 就必須將任務切分，並讓資源重新作分配，需要注意的細節如下: 首先將 datapath 切分成 5 個 stage: * IF: instruction fetch * ID: instruction decode and register file read * EX: execution or address calculation * MEM: data memory access * WB: write back 接著協調同一個 stage 中的 resources，並確保 data flow 在執行的過程中不會發生 conflict，最後確保 control signal 在適合的 stage 中發生 MIPS 的相關指令 (R-format、lw、sw、beq 等) 與上述切分的 5 個 stage 對應工作內容如下。從下表可以發現到說　lw 的指令比其他的指令多了一個 WB 的 stage ![image](https://hackmd.io/_uploads/HkKaetdcR.png) ### 2. Pipeline registers 將 single-datapath 照上述方式切分後如下圖， ![未命名](https://hackmd.io/_uploads/SyUlzF_50.png) 但實際上在訊號傳遞時，並不會依照這種人為的方式走，應該強制把每個 stage 完成後做一個 ending 的動作，簡單來說就是把每個 stage 完成後的結果存起來，才能將 stage 完全切割。 :::info 搭配 small clock cycle 將每個 stage 的結果儲存 ::: 為此必須引進一些另外的 register 來儲存值這些中間結果，稱為 pipeline register (下圖橘色部分)，此外，透過 pipeline register 把結果儲存起來後，就可釋放該 stage 的資源並提供給下一個 cycle 使用 ![image](https://hackmd.io/_uploads/BJFVXFu9A.png) #### 2.1 Example: load 以 load 的執行過程為例: ![image](https://hackmd.io/_uploads/HyHTNF_9C.png) ##### IF: instruction fetch 從 instruction memory 中擷取 instruction ![image](https://hackmd.io/_uploads/r1JySKd5A.png) :::info pipeline register 中，左邊顏色代表 write，右邊顏色代表 read ::: ##### ID: register fetch and instruciton decode ID 時必須從 read port 事先 access register 中的資料，再根據 decode 的結果決定是否要使用它 ![image](https://hackmd.io/_uploads/ryR2SKucR.png) ##### EX: calculate the memory address 透過 ALU 計算記憶體位址 ![image](https://hackmd.io/_uploads/ByH0UKOqA.png) ##### MEM: read data from memory 根據 address 從 data memory 讀取資料 ![image](https://hackmd.io/_uploads/H1xkPtu9C.png) ##### WB: write the data back 透過 write port 將資料從 memory 寫入 register ![image](https://hackmd.io/_uploads/Bk5JDKO5R.png) :::danger 但此時有一個問題，舊的資料要從 memory 寫入 \$rt，但新的 instruction 也要從 \$rt 讀取資料，該如何選擇？(後面會說明) ::: 從上述步驟可看出，在 pipeline datapath 進行的過程中，會有 5 個 functional unit 參與這個過程 ### 3. Pipelining R-format and load 考慮 R-format 的執行如下: ![image](https://hackmd.io/_uploads/rkym_Fd5R.png) 但是當 R-format 與 load 接續執行時，在 WB 階段會發生 conflict 的問題 (如下圖)，此問題稱為 structural hazard，簡單來說就是分配時間不對所造成，是不同 instruction 搶同一個 resource 所引發的問題 ![image](https://hackmd.io/_uploads/Hk57ptO9R.png) 解決方式著重在以下兩點: * Each functional unit cna only be used once per instruction: 每個 functional unit 同一時間只能被一個 instruciton 使用 * Each functional unit must be used at the same stage for all instruction: 每個 instruction 中，相同的 stage 必須做相同的事情舉例來說，load 在第 5 個 stage 會使用 write regsiter port 來寫入資料，但 R-format 卻在第 4 個 stage 使用 write register port 來寫入資料，因此必須將 R-format 的第 4 個 stage 往後 delay 一個 cycle，使得每個 instruction 有相同的長度 ![image](https://hackmd.io/_uploads/SkAtJ5Oc0.png) 解決方式是將 R-format 新增一個 MEM 的 stage，讓長度對齊，但是這個 MEM stage 對 R-format 來說不會做任何事 (NOP stage)。 ![image](https://hackmd.io/_uploads/H1s_e5u5C.png) 其餘像是 store 與 branch equal 等指令也是以相同概念來對齊長度 ![image](https://hackmd.io/_uploads/Syuxb5_5A.png) #### 3.1 Example ![未命名](https://hackmd.io/_uploads/HkN_-cOcC.png) ![未命名](https://hackmd.io/_uploads/HkX9WqOqR.png) ![未命名](https://hackmd.io/_uploads/rJBsZ9_q0.png) ![未命名](https://hackmd.io/_uploads/SJ02Z9_9R.png) ![未命名](https://hackmd.io/_uploads/S1r4G9O9C.png) ![未命名](https://hackmd.io/_uploads/BkGJz5u5C.png) ## Sec. 4.9 Pipeline control ### 1. Group signals according to stages 在 stage 2 做完 indtruction decode 後，main control 會發送 signal 給不同的 MUX，但 MUX 分佈在許多不同的 stage 中，甚至有些 instruction 的 stage 還沒被執行到解決方式是根據 instruction 的 stage 將 main control 所發出的 control signal 進行分組。Main control 所發出的 control signal 同樣會傳到第 2 個 pipeline register 中，這些 signal 會根據當前所在的 stage 來決定是否要傳送 signal 到當前的 stage 的 MUX，如果沒用到較繼續傳到下一個 stage 的 pipeline register，如下圖所示，藍色圓圈處表示沒有用到的 signal，會繼續往下一個 stage 傳遞 ![未命名](https://hackmd.io/_uploads/SyTBB9dcA.png) 回顧 [2.1 Example: load](#21-Example-load) 的問題，當 load 指令在進行 WB 時要寫回 \$rt，但如果後面的 instruction 又傳入 \$rd 給 write register port，那到底要寫入哪一個 register 中？ ![image](https://hackmd.io/_uploads/ByoXv9ucC.png) 仿照 control signal 的方式，將 stage 2: ID 所解析出來的 instruction 要用到的 \$rt 往後傳，在 WB 階段回傳給 register file ，如下圖所示 ![未命名](https://hackmd.io/_uploads/SyBKY9uc0.png) 1. decode 完後決定哪個 register 要往後傳 2. 將 register number 放入 pipeline register 中 3. 回傳 \$rt 4. 2 個 \$rt 是指向不同的 register number，所以才要把前一個 instruction 的 \$rt 回傳 5. 但還是要搭配 write enable 來決定是否要將資料寫入 register ### 2. Example 以下列指令為例 ``` assenbly= lw $10, 20($1) sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9 ``` ![image](https://hackmd.io/_uploads/Skft55OcR.png) ![image](https://hackmd.io/_uploads/ByKt5q_qA.png) ![image](https://hackmd.io/_uploads/rJpFc5d9C.png) ![image](https://hackmd.io/_uploads/HkW99cO9A.png) ![image](https://hackmd.io/_uploads/SJr9ccdqC.png) ![image](https://hackmd.io/_uploads/ryi59qucA.png) ![image](https://hackmd.io/_uploads/HJescqu9A.png) ![image](https://hackmd.io/_uploads/HJXjc5_9C.png) ![image](https://hackmd.io/_uploads/SJIjc9u9A.png) ## Sec. 4.10 Hazards: types of hazard Pipeline 的 hazards 有分成 3 個不同類型 * Structural hazards: 在同一時間點，不同 instruction 使用相同的 resource (見 [Sec. 4.8](#3-Pipelining-R-format-and-load)) * Data hazards: 新的 instruction 在 stage 2: ID 階段要讀取 register 的資料時，前一個 instruction 還沒完成 stage 5: WB (新的資料尚未寫入) * Control hazards: 遇到 $\mathtt{beq}$ 時，stage 3: EX 階段尚未計算出是否要做 branch，但下一個 instruction 已經開始做 stage 1: IF 的 instruction fetch 遇到 hazards 的問題時，通常都可以依靠 waiting 來解決，但這樣做的速度太慢，一方面浪費 cycle，一方面也失去做 pipeline 的意義。在處理 hazards 的問題時通常會牽涉到兩個重點 * How to detect hazard ? (何時發生 hazard) * How to design datapath ? (如何設計新的 datapath) ### 1. Structural hazard: single memory 以下圖的 senquential instruction 為例，在第 4 個 cycle 時會有兩個 instruction 同時使用 memory，一個是做 data access 一個是做 instruction access 解決方式是將 memory 拆開，分為兩個不同的 memory，一個用資料的儲存與載入 (data memory) 一個用作 instruction 的儲存與載入 (instruction memory) ![未命名](https://hackmd.io/_uploads/BkKu3EocC.png) ### 2. Feedback path: data hazard 從平面的 datapath 觀察 data hazard 的發生如下: ![未命名](https://hackmd.io/_uploads/SJ2cCVocA.png) 1. 假設某個時間點的 instruction 要讀取 \$rs=\$2 的資料 2. 但前一個 instruction 尚未進行到 stage: 5 的 WB，還沒將新的資料寫入 \$rd=\$2 若以立體圖來看如下，中間三個 instruction: $\mathtt{and}$、$\mathtt{or}$ 與 $\mathtt{add}$ 會出現問題。第一個 $\mathtt{sub}$ 指令的資料尚未寫入，後面的 instruction 就開始讀取 ![image-2](https://hackmd.io/_uploads/S1JEJHicR.png) #### 2.1 Types of data hazard Data hazard 的發生情境又可以區分為三種狀況: (以 inst.1 與 inst.2 分別表示第一個與第二個 instruction) * **R**ead **A**fter **W**rite (RAW) * inst.2 tries to read operand before inst.1 writes it * 在 inst.1 寫入前，inst.2 就已經讀取資料 * 如上圖之情境 * **W**rite **A**fter **R**ead (WAR) * inst.2 tries to write operand before inst.1 reads it * 在 inst.1 讀取資料前，inst.2 就已經把資料寫入 * 這種情況在 MIPS 中不會發生，因為 MIPS 的 5-stage 切分時已經把 write 放在最後端 (stage-5)，而讀取資料是在 stage-2 發生 * **W**rite **A**fter **W**rite (WAW) * inst.2 tries to write operand before inst.1 writes it * inst.2 搶在 inst.1 前把資料先寫入 * 這種情況在 MIPS 中不會發生，因為 MIPS 的 5-stage 切分時已經把 write 放在最後端 (stage-5) ### 3. Feedback path: control hazard 從平面的 datapath 觀察 control hazard 的發生如下: ![未命名](https://hackmd.io/_uploads/B17UfBscR.png) 1. 第一個 instruction 在此階段才判斷完 condition 要不要做 branch 2. 但第二個 instruction 已經讀取了，且這指令可能是錯的 3. 同理第三個 instruction 也已經讀取了，這指令也可能是錯的 ## Sec. 4.11 Handling data hazards Data hazard 基本處理可透過以下方式 * 在 single-cycle 的前期 (stage-1: ID) 就進行 instruction fetch，可避免 WAR 的發生 * 在 single-cycle 的後半段再做 WB 的動作可避免 WAW 此外，在 register file 內部也可以設計 internal forwarding，在 clock cycle 的前半段先寫入資料，後半段再讀取資料，這樣可以解決前述圖中 (見[此節](#2-Feedback-path-data-hazard)第二張圖) $\mathtt{add}$ 指令的問題一般來說，解決 data hazard 的方式有 3 種: * Compiler insert NOP (software solution) * Forward (hardware solution) * Stall (hardware solution) 後續針對這 3 種方式做說明 ### 1. Inserting NOP (software) 以下列 asseembly program 為例 ```= sub $2, $1, $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) ``` 在 $\mathtt{sub}$ 時要把資料寫入 \$2 中，但下一個指令 $\mathtt{and}$ 又要從 \$2 讀取資料，但此時 $\mathtt{sub}$ 的結果還沒寫到 \$2 中，會導致 $\mathtt{and}$ 讀取到舊的資料，至少要 delay 兩個 stage 才可以避免資料讀取錯誤解決方式可以在 $\mathtt{sub}$ 與 $\mathtt{add}$ 之間插入兩個 NOP，見下圖 ![image](https://hackmd.io/_uploads/rJFeKrsqA.png) ### 2. Forwarding, R-type-use (hardware) 再次從以下 datapath 的立體圖觀察: (C.C. = clock cycle) ![未命名](https://hackmd.io/_uploads/Hy7sCBsqC.png) 1. $\mathtt{sub}$ 中 C.C. 5 欲儲存的結果早在 C.C. 3 的 ALU 時就已經算好了 2. 在 C.C. 3 中，$\mathtt{sub}$ 的 ALU 的計算結果會儲存在第 3 個 pipeline register 3. 在 C.C. 3 中，$\mathtt{and}$ 的 register 所讀取到的資料會送到 ALU 中計算 而 forward 指的是 4. ALU 計算後的結果不等 stage-5 的 WB，就先把結果往前傳給下一個 instruction 使用 5. 同理，對 $\mathtt{or}$ 來說，計算 ALU 所需要的資料應該在 $\mathtt{sub}$ 中的第 4 個 pipeline register，所以從這裡往前傳給 or 指令使用 重點要考慮兩件事: * 何時做 forward ? * 如何設計 datapath #### 2.1 Datapath with forwarding Data hazard 的 datapath 設計如下: ![未命名](https://hackmd.io/_uploads/SkUtP_o9R.png) 以 inst.n 表示第 n 個 instruction，下圖中 inst.1 進行到 stage-4 的 MEM/WB，inst.2 進行到 stage-3 的 EX/MEM register， inst.3 進行到 stage-2 的 ID/EX register 1. \$rs 的 MUX，有 3 個可能的輸入，因為與 \$rt 相同，此處以 \$rt 為例 2. \$rt 的 MUX，有 3 個可能的輸入，因為與 \$rs 相同，此處以 \$rt 為例 3. 第一種輸入，來自正常管道的 \$rt 的資料，signal control 以 00 表示 4. 第二種輸入，來自 MEM/WB register 回傳的資料 (inst.1 的結果)，signal control 以 01 表示 5. 第三種輸入，來自 EX/MEM register 回傳的資料 (inst.2 的結果)，signal control 以 10 表示 6. 所以 instruction 執行完 stage-2 後，讀取到的 \$rt 或 \$rd 資料會往後傳遞給下一個 stage，往後都以 RegisterRd 表示 forwarding unit 的可以歸類為 3 大類: 7. 來自 inst.3 剛載入的 \$rs 與 \$rt 8. 來自 inst.2 的 EX/MEM. RegisterRd 與 inst.1 的 MEM/WB. RegisterRd 9. EX/MEM. Register 的 control signa (來自 inst.2 ) 與 MEM/WB. Register 的 control signal (來自 inst.1) forwarding unit 內部的動作: 10. 此處會比較 forward unit 的前兩個 input 來決定是否要做 forward * 如果 inst.2 的 EX/MEM.RegisterRd 或 inst.1 的 MEM/WB. RegisterRd 與 inst.3 的 \$rs 或 \$rt 相同，表示資料尚未寫入，此時就需要做 forward * 另外要注意的是，是否要寫入還需要依據編號 9. 的 control signal 來判斷 write enable #### 2.2 Detecting data hazard 根據上述結果，data hazards 的發生條件可以分為: (1) inst.2 與 inst.3 發生衝突或 (2) inst.1 與 inst.3 發生衝突，主要判斷方式是觀察兩者資料是否相同，如果舊指令的 registerRd 與新指令的 registerRs (或 registerRt) 資料相同，表示舊指令計算後的結果尚未寫回到 register 中，那就可能會發生 data hazards 1. Conflict between inst.2 and inst.3 * EX/MEM.RegisterRd = ID/EX.RegisterRs * EX/MEM.RegisterRd = ID/EX.RegisterRt 2. Conflict between inst.1 and inst.3 * MEM/WB.RegisterRd = ID/EX.RegisterRs * MEM/WB.RegisterRd = ID/EX.RegisterRT 除此之外還需要注意兩件事: * 如果是不用執行寫入的 instruction 就不會發生 :arrow_right: RegWrite 這個 control signal 判斷 * 如果 destination register 是 \$0 也不會發生 data hazards，因為 \$0 不會被覆蓋 :arrow_right: 判斷RegisterRd 是否為 \$0 總結來說，判斷方式如下: * EX stage * (EX/MEM.RegWrite) and (EX/MEM.RegRd ≠ 0) and (EX/MEM.RegRd = ID/EX.RegRs) * (EX/MEM.RegWrite) and (EX/MEM.RegRd ≠ 0) and (EX/MEM.RegRd = ID/EX.RegRt) * MEM stage * (MEM/WB.RegWrite) and (MEM/WB.RegRd≠0) and (MEM/WB.RegRd = ID/EX.RegRs) * (MEM/WB.RegWrite) and (MEM/WB.RegRd≠0) and (MEM/WB.RegRd = ID/EX.RegRt) 完整的立體圖表示法如下 ![image](https://hackmd.io/_uploads/ryHusdjcC.png) #### 2.3 Forwarding logic 另外要注意的是，如果 inst.2 與 inst.1 同時與 inst.3 發生衝突時，要該要 forward EX/MEM.RegisterRd 的資料 (即 inst.2 的)，因為 inst.2 中的資料才是最新的 #### 2.4 Example 以下列 asseembly program 為例 (注意 \$rs 與 \$rt 的 MUX 該使用何種 control signal) ```= sub $2, $1, $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) ``` ![未命名](https://hackmd.io/_uploads/SytdnOj9A.png) ![未命名](https://hackmd.io/_uploads/Bkja2dic0.png) ![未命名](https://hackmd.io/_uploads/rJy7Tuic0.png) ![未命名](https://hackmd.io/_uploads/BkzKTds5A.png) ### 3. Stalls, load-use (hardware) 在遇到 data hazard 時，不可能永遠都使用 forward 來處理，以下圖舉例 ![未命名](https://hackmd.io/_uploads/HkPsPv290.png) 1. $\mathtt{lw}$ 的 ALU 不會計算任何結果，只計算 memory address，所以無法將計算後的結果做 forward 2. 至少需要等到 data memory 階段才能將資料傳入下一個 instruction 這種狀況通常發生在 $\mathtt{lw}$ 後接 R-format 的 instruction 解決方法: ![未命名](https://hackmd.io/_uploads/HJ0KFDh5A.png) 1. 可以將 inst.2 的 stage-2 再執行一次 2. inst.1 就可以透過 forwarding 把 access memory 所得到的結果傳入 inst.2 中 #### 3.1 Datapath with stalling unit 以下圖的平面 datapath 為例，同樣以 inst.1 表示第一個 load instruction，以 inst.2 表示第二個 R-format instruction ![未命名](https://hackmd.io/_uploads/Hyr65On9R.png) Hazard detectino unit 有 3 個主要的輸入: 1. 在 inst.1 的 stage-3 時，ID/EX.MemRead 會回傳一個 signal 表示這個 instruction 要做 load (enable write) 2. 從 ID/EX.Register 回傳，inst.1 在 stage-3 後所讀取到的 \$rt 3. inst.2 在 stage-2 所讀取到的 \$rs 或 \$rt Hazard detection unit 內部動作: 4. Hazard detection unit 會比較上述的後兩個輸入 * 如果 inst.1 回傳的 \$rt 與 inst.2 讀取到的 \$rs (或\$rt) 相同，表示新的資料尚未從 data memory 載入到 register，此時就會發生 data hazard Hazard detection unit 有 2 個輸出: 5. 傳送給 PC 的 signal 6. 傳送給 IF/ID.Register 的 signal 經過編號 4 的比較後，如果發生 data hazard，則上述兩個 signal 會控制讓 PC 跟 IF/ID.Register 保持上一步動作。此外會再發送編號 7 的 zero signal，這個 signal 的用意是創建一個空的 stage (稱為 bubble)，讓 inst.1 與 inst.2 中間的 pipeline 多一個 cycle 的緩衝有了這個 bubble 後 inst.1 在後面的 stage-5 才能把從 memory 載入的資料做 forwarding #### 3.2 Control: handling stalls 根據上述的結果，是否需要做 stalls 的判斷條件可列如下: `(if ID/EX.MemRead) and ((ID/EX.RegRt = IF/ID.RegRs) or (ID/EX.RegRt = IF/ID.RegRt))` :::info `ID/EX.MemRead = 1` 表示為 load instruction ::: 當發生需要做 stall (停止) 的狀況時，具體該如何執行 ? * 針對 inst.2 的 IF 與 ID 做 stall instruction * stall instruction 的動作表示不會改變 PC 與 IF/ID 內部的資料 * 所以這兩個位置的 pipeline 會再執行一次 (re-execute) * 真 inst.1 的 stage-3: EX 插入一個 NOP * 透過傳送一個 signal 0 給 ID/EX.Register 中 EX、MEM、WB 的 control field 來讓 register 都設為 0 * 因為這些 signal 是會繼續往後傳遞，所以讓這個緩衝區 (bubble) 繼續往後延伸 * 有了 bubble 後，後面再進行 forwarding #### 3.3 Example 以下列 assembly program 為例 ```= lw $2, 20($1) and $4, $2, $5 or $8, $2, $6 add $9, $4, $2 ``` ![未命名](https://hackmd.io/_uploads/Hy_A1Y2qA.png) ![未命名](https://hackmd.io/_uploads/Bkp4gY35R.png) ![未命名](https://hackmd.io/_uploads/rkgY-Fh5R.png) ![未命名](https://hackmd.io/_uploads/HkUJGYhcC.png) ![未命名](https://hackmd.io/_uploads/HylSfF3c0.png) ![未命名](https://hackmd.io/_uploads/ByCnGth50.png) ## Sec. 4.12 Handling branch hazards ### 1. Pipeline datapath with control signals ![未命名](https://hackmd.io/_uploads/HyZ9Dt3qA.png) 如上圖，branch hazard 的問題點是當遇到 $\mathtt{beq}$ 等狀 instruction 時，在 stage-4 才會發生 signal 來決定是否要做 branch。因為我們預設所有的 instruction 在 IF 階段都是透過 PC + 4 取得，一旦 inst.1 需要做 branch，則後面的 instruction 都可能是錯的要解決這個問題，考慮的點是能不能把 branch 的計算與判斷往前拉，提早決定 ? ![未命名](https://hackmd.io/_uploads/SyvMvt350.png) :::info 此處把需要做 branch 的狀況稱為 taken，不需要做 branch 稱為 not taken ::: ### 2. Handling branch hazard and pipeline branch hazard 的處理可以遵循以下幾點: * 預設 branch 不會發生 (not taken) * 但如果預測錯誤的話，則需要藉由添加其他的 hardware 來把錯誤的 instruction 刪除 (flushing) * 如果在 stage-4: MEM 階段才做出 branch 的決策判斷，則 branch 實際發生時就會需要刪除兩個 instruction (IF/ID.Register 與 ID/EX.Register)，刪除方式是將 control signal 設為 0 (與 stall 類似) * 將 branch 的決策判斷往前移動可以解決上述問題要刪除 instruction 的問題 1. 將 **branch address** 的計算移到 stage-2: ID 階段 2. 在 stage-2: ID 階段新增一個 XOR gate 比較 \$rs 與 \$rt，用來進行 branch 的決策判斷，這種方式只要 flushing 一個 instruction 即可 (IF/ID.Register) 3. 新增一個 control signal 來做 flushing，用來控制 IF/ID.Register 的內容設為 zero instruction (又稱為 bubble 或 NOP) ![未命名](https://hackmd.io/_uploads/BJcX6t29C.png) #### 2.1 Example ![未命名](https://hackmd.io/_uploads/SyuU0t2cA.png) ![未命名](https://hackmd.io/_uploads/H1T30Yhq0.png) ### 3. Dynamic branch 另一種處理 branch hazard 的方式是動態預測。比方說在進行 loop 時，執行到最後必定會做一個 branch，但因為 instruction fetch 預設的位置是 PC + 4，所以每次在 loop 結束前的 branch 決策判斷都是錯的，都要做 flushing，這樣會導致速度變慢動態預測，指的是將該次 taken 或 not taken 的結果儲存起來並設為下一次的預設。這個 taken or not 的結果會存在一個 branch prefiction buffer，當再次執行到 branch 時，會先去檢查這個表格，如果上一次是 not taken 則這次也是；上次是 taken 則這次同樣也是 taken ### 4. Delayed branch 上述的 branch 處理方式，都會有 1 個 cycle 的 penalty。預設 not taken 但實際是 taken，反之亦同。而 delayed branch 的目的是找到一個 instruction 使得 branch 過程是 0 cycle 的 penalty 這個 instruction 要做的是 predicted not taken 與 predicted taken 兩者同時都會計算，並透過 compiler 的技巧將這個 instruction 塞在 branch 之後。當 branch not taken 時，後面接續就是擷取 PC + 4 的 instruction；當 branch taken 時，後面接續就是擷取 target address 的 instruction 大約 50\% 的情況可以找到這個 instruction :::info 雖然說是要找**一個** instruction 讓 taken 與 not taken 都做，但實際上 branch 與塞在 branch 後的指令**一起看**。有點像是 compiler 去預測 branch 的結果，當預測 branch 是做 taken，則後面就是塞 target address 的 instruction；當預測 branch 的結果是 not taken，則後面就是塞 PC + 4 的 instruction ::: ## Sec. 4.13 Exceptions ### 1. Exception and interrupts 當 unexpected events 發生時，會需要做控制權的轉換來解決這類事情，發生時會跳到 OS 處理 (why:question:)。通常分成兩個不同狀況: Eeception 與 Interrupt * Exception * 通常由 CPU 所發起，例如: undefined opcode, overflow, syscall * Interrupt * 來自外部的 I/O controller 在 MIPS 中，exception 通常由 syetem control coprocessor (CP0) 控制，且 exception 的處理都會需要犧牲一些效能 ### 2. Handling exceptions #### 2.1 Before jump to OS 發生 unexpected events 且控制權轉換到 OS 前，需要做以下準備 * Save PC: 紀錄轉換前的 program counter * 對 interrupt 來說，因為是由 I/O 所發出的訊號，所以會紀錄下一個 PC * 對 exception 來說，因為是出現問題導致 CPU 發出訊號，所以會紀錄當前的 PC * 在 MIPS 中，會將 PC 儲存在 Exception Program Counter (EPC) * Save indication of the problem: 紀錄問題原因 * 在 MIPS 中，會將原因紀錄在被稱為 cause register 的 register 中 * 且使用 1-bit 來表示問題原因 * 0: undefined opcode * 1: overflow 紀錄完後，會將 address 跳到 8000 00180 的 handler 中，這個 8000 00180 其實是 OS 的入口，接著再依據不同的原因跳到個別的 handler 處理 #### 2.2 Handler actions 在 OS 中會進行以下動作 * 讀取發生原因並轉換到相關的 handler * 進行處理 * Restartable (解決後) * 進行正確的動作 * 並使用 EPC 跳到原來的 PC 位址如果不能解決... * 終止程式 (terminate program) * 回報錯誤 (report error) ### 3. Exception in pipeline 以 add 指令中，在 EX-stage 發生 overflow 為例，必須預防 \$1 的內容被影響 $$ \text{add}\ \ $1,\ \ $2,\ \ $1 $$ * 完成之前的指令 ( complete previous instruction) * 移除當前指令 (`add`) 與接續的指令 (flush add and subsequent instructions) * 紀錄原因 (set cause and EPC register values) * 轉交控制權 (transfer control to handler) 上述過程是 exceptions 發生時的 pipeline，大致流程與 mispredicted branch 類似 #### 3.1 Exception example ```assembly= ### exception on add in ### 40 sub $11, $2, $4 44 and $12, $2, $5 48 or $13, $2, $6 4c add $1, $2, $1 # 假設在這行指令的 EX-stage 發生 overflow 50 slt $15, $6, $7 54 lw $16, 50($7) ### handler ### 80000180 sw $25, 1000($0) # 發生 overflow 後跳到此處 80000184 sw $26, 1004($0) ``` 以上 assembly program 為例 (開頭數字是記憶體位址)，當第 5 行的 add 在進行 EX-stage 並發生 overflow 時，會跳到 80000180 (OS 的入口) 來進行錯誤處理 ![S__2326532_0](https://hackmd.io/_uploads/B1xIR8xaC.jpg) ![S__2326534_0](https://hackmd.io/_uploads/SJD8AIepR.jpg) ### 4. Multiple exceptions 所謂的 pipeline 是將一個 instructio 分成不同的階段，並讓處理器可以同時執行多個 instruction，但某些情況下可能會同時產生多個 exception 最簡單的方法，是從最早的 instruction 開始處理 exception，處理完最早的 exception 後再清除 (flush) 後續的指令另一種方法稱為 precise exception，指的是當 exception 發生時，讓處理器回復到 exception 發生前的狀況。然而因為大部分處理器的 pipeline 比較複雜 (每個 cycle 可能發出多個 instruction)，所以這種回復到原先狀態的方法比較困難，因為不知道是哪條 instruction 比較早執行 ### 5. Imprecise exceptions 如果是 imprecise exception，會交由 OS 來判斷誰該執行發生 exception 時，pipeline 會停止執行並儲存當前狀態以及 exception 發生的原因，並交給後續的 handler 處理 Handler 會解決發生 exception 的指令，並決定哪些指令需要完成，哪些指令需要清除，通常會需要手動來處理，也就是需要一些額外的軟體來處理，此時就不是由硬體設備來進行這種方式好處是可以簡化硬體的設計，但同時也會需要更複雜的軟體來檢查與處理 exception ## Sec. 4.14 Superscalar and dynamic pipelining ### 1. Instruction-level parallelism 為了能夠完成更多任務並最大化利用硬體資源，所以必須使用更平行化的方式來執行 instruction。而所謂的更平行化，指的就是同一時間有更多的 instruction 同時執行，可以分成兩種方法: (1) deeper pipeline (2) multiple issues :::info 不過其實 pipeline 本身就是平行化的一種，因為對 MIPS 來說可以同時有 5 個 instruction 在執行 ::: ##### Deeper pipeline 簡單來說就是把每個 instruction 切分成更細的 stage (less work per stage , then shorter clock cycle)。但這種做法的缺點是，如果後面的 stage 才做 `beq` 的決定，那必須付出的 penalty 也會更大。 ##### Multiple issue 原本一個 cycle 只會做一個 instruction，現在改為一個 clock cycle 會輸入 2 個或 2 個以上的 instruction，簡單來說就是會有更多的 pipeline 同時進行。如下圖所示，同一個 stage 有兩個不同任務 (一紅一藍) 的 instruction 在執行，每過一個 clock cycle 就會同時有兩個不同任務的 instruction 輸入，也就是同時進行兩個 pipeline。 ![image](https://hackmd.io/_uploads/HJ-NC_b6C.png) 此外，為了量化這種 multiple issues 的方式，所以發展一個新的指標來做評估，稱為 Instruction Per Cycle (IPC)，表示一個 cycle 可完成多少的 instruction。然而實際上還是要看 instruction 之間有無 dependency 才能決定是否能夠進行 multiple issues。 :::info 之前用的量化指標是 CPI (cycle per instruction)，表示一個 instruction 需要多少的 cycle 來執行，通常是 1 或 1 以上，與 IPC 剛好是倒數關係。以 4 GHz 4-way multiple-issue 的處理器來說，4 GHz 表示 clock cycle 是每秒 4 giga，亦即每秒有 $4 \times 10^9$ 的 clock cycle，4-way 表示每個 clock cycle 可以最多發出 4 條 instruction (假設 1 個 instruction 需要 1 個 cycle)，所以每秒可以同時進行 160 億的 instruction (16 Billion-Instruction-Per-Second, BIPS)。因此 peak CPI = 0.25 且 peak IPC = 4 ::: ### 2. Multiple issue multiple issue 檢查有無 dependency 的方式有兩種: * Static multiple issue * 在 compile 時就決定有無 dependency，並把能做 issue 的 instruction 組合起來 * 是 compiler 來進行 * Dynamic multiple issue * 在 runtime 時才決定 * 由 hardware 處理但不論哪種方式，都必須要猜測 (guess) 指令要做的事情，如果猜對了就完成後續指令，猜錯了就退回 (roll back) 上一步再進行正確動作。以下舉兩個例子說明 ##### Speculate on branch outcome 在做 `beq` 時，需要猜測是否需要做 branch，如果猜對就繼續執行後續指令，猜錯就做 roll back ##### Speculate on load 在做 store 與 load 時，如果是先做 store 再做 load，通常預設不會在相同的 address 進行這兩個動作，也就是說不會先把資料儲存在某個 memory address，再從這個 address 中取出資料做計算，因為這樣有點多此一舉。然而，如果真的出現先做 store 再做 load 的 instructions 時，就可以猜測要載入的資料與要儲存的資料是在不同的 address，所以可以先提前做 load 的計算(因為 load 比較耗時)，如果猜錯再做 roll back ### 3. Compiler/Hardware speculation 先說結論，compiler 可以重新調整 instruction 的順序 (reorder)，避免不必要的等待，如上述的先做 load 再做 store，如果猜錯也能夠做修正 (fix-up)；hareware 可以提前查看 (look-ahead) 指令並將執行的結果存在緩衝區 (buffer)，如果猜測錯誤再將緩衝區的資料清除 (flush) #### 3.1 Static multiple issue (compiler approach) Compiler 會將多個 instrudtion 組合 (group) 起來並打包 (package) 成一個 issue packets，這些被組合的 package 會在一個 single cycle 中被執行，但用這個方法的話，表示同時間會執行更多的 instruction，所以必須增加 pipeline 的資源。此外，這些 issue packet 可以被視為一種很長的 instruction (Very Long Instruction Word, VLIW) Compiler 的 static multiple issue 還牽涉到 scheduling (排程)，因為並不是隨便找幾個 instruction 就能夠組合在一起，compiler 還必須確保這些 instructions 沒有發生 hazards: * Reorder instructions into issue packets * No dependence with a packet * Pad with NOP if necessary ##### MIPS with static dual issue 以 MIPS 中把兩個 instruction 打包 (package) 成一個 issue 舉例。在 MIPS 中通常會把 ALU/branch 等 instruction 組成一個 group，load/store 組成一個 group。在做 2-issue packets 時，通常一個是 ALU/branch，另一個是 load/store 才比較不會發生 hazard。因為是 2-issue packets，所以有 64-bits 的長度需要對齊，必要時 (Ex: 無法打包) 就放入 NOP，如下圖所示。 ![未命名](https://hackmd.io/_uploads/H14q8PcTA.png) 如同前面所說的，為了增加多個 instruction 做平行化，所以必須增加 pipeline resource，如下圖所示，增加了一些 read port、write port 與用來計算 address 的 ALU ![未命名](https://hackmd.io/_uploads/rkkivw5TA.png) ##### Hazard in the dual-issue MIPS 在 multiple issue 時同樣會發生 hazard 的問題，且因為同時間會處理更多的 instruction，所以發生 hazardz 時代價更大以 data hazard 為例，single-issue 可以使用 forwarding 來避免停滯 (stall)，但如果是 dual-issue 的 instructions，就無法使用 forwarding，以下述指令為例: ```assembly= add $t0, $s0, $s1 load $s2, 0($t0) ``` load 指令從記憶體載入資料時，需等待 add 把記憶體位址計算後才可載入資料，如此一來就勢必會發生 stall，解決方式是將其拆成兩個 packets，但效果與 stall 是等價的。以 load-use hazard 為例，跟 single-instruction 一樣會有 1 個 cycle 的 penalty。但如果是 dual-issue 就會產生兩個 bubble (兩倍的 penalty)。由此可知，在 static multiple issue 中，排程 (scheduling) 是很重要的。 --- 以 dual-issue MIPS 舉例說明排程，下列組合語言主要做三個動作: * 取出 array 的值 * 加上一個 scalar * 存到下一個 index ![image](https://hackmd.io/_uploads/Skp_Tl1sxl.png)  從上述舉例可以發現 1. 前三個 instructions 有 dependency 的關係 2. 後兩個 instructions 有 dependency 的關係 3. `sw` 可以和 `addi` 或 `bne` 組合成 packet，因為兩者是屬於不同的 group 實際 scheduling 後的結果如下圖所示: ![image](https://hackmd.io/_uploads/rkybzs960.png) 在 4 個 cycle 中執行了 5 個 instruction，所以 IPC = 5/4 = 1.25 :::info IPC: Instruction Per Cycle ::: ##### Loop unrolling Loop unrolling (迴圈展開) 指的是增加每次迭代中執行指令的數量，以上述的例子來說: ```= Loop: lw $t0, 0($s1) addu $t0, $t0, $s2 sw $t0, 4($s1) addi $s1, $s1-4 bne $s1, $zero, Loop ``` 如果這個迴圈執行兩次，並使用 loop unrolling 的話，就會變成 ``` Loop: lw $t0, 0($s1) addu $t0, $t0, $s2 sw $t0, 4($s1) addi $s1, $s1-4 lw $t0, 0($s1) addu $t0, $t0, $s2 sw $t0, 4($s1) addi $s1, $s1-4 bne $s1, $zero, Loop ``` 使用 loop unrolling 的好處在於可以減少 branch predicted，假設迴圈執行 100 次，那每次結束時都要檢查，就需要檢查 1000 次，但如果把 loop body 擴展成 4 份，那只需要檢查 100 / 4 = 25 次 --- 同樣以上述的例子舉例 (見下圖)，並將 loop body 重複 4 次 (每 4 次檢查一次) ![image](https://hackmd.io/_uploads/rkF6Os96R.png) 使用 dual-issue MIPS 來做 schedule，並先考慮兩次的迭代 (如下圖)，上半部份的 `$t0` 與下半部份的 `t0` 雖然看起來是一樣的，但實際上是不一樣的內容，因為上半部份的 `$t0` 已經經過 store 了，所以不會發生 dependency 的問題，指示名稱剛好一樣而已 (考以稱為 name dependency)。 ![未命名](https://hackmd.io/_uploads/B1utFjcTA.png) 實際做了排程後如下表所示。因為每個迭代中都做相同的事，所以先把所有資料一次讀取進來 ![未命名](https://hackmd.io/_uploads/r1lVps9TR.png) 1. 第 2 個迭代開始載入資料時，第 1 個迭代就可以開始做加法。 2. 資料讀取完且加完後，第 1 個迭代可以開始儲存資料。 3. 最後做完 loop body 後再檢查 在 8 個 cycle 中塞了 14 個 instructions，所以 IPC = 14 / 8 = 1.75 --- 總結來說，Loop unrolling 的重點如下: * 減少 loop-control overhead (原本檢查 1000 次，擴展後改為檢查 250 次條件) * 在每次的擴展 (replicate loop body) 中，盡量使用不同的 register，這種方式稱為 register renaming * 這種設計方式可以避免 anti-dependencies，例如 store 後面使用 load，但此時 register 的值已經被改變，使用不同的 register 可以避免這個問題 #### 3.2 Dynamic multiple issue (hardware approach) Dynamic multiple issue 通常在超大型處理器 (superscalar processors) 使用。Dymanic multiple issue 可以在每個 clock cycle 發行多條 instruction，具體數量由 CPU 來決定，能夠避免 structural hazard 與 data hazard。 Dynamic multiple issue 最核心的技術是 dynamic pipeline scheduling，它可以允許 CPU 不依照原有的既定順序 (out of order)來執行 instructions，避免發生 stall，但是在儲存時，還是要依照 instruction 的順序存入 register，以下舉例 ```= lw $t0, 20($2) addu $t1, $t0, $t2 sub $s4, $s4, $t3 slti $t5, $s4, 20 ``` 第一行的 `lw` 指令與第二行的 `addu` 之間有 dependency，`addu` 必須等待 `lw` 的資料載入完成後才能做 (load 的速度較慢)。而後兩個 `sub` 與 `slti` 指令與前面不會發生 dependency，所以可以先做 ##### Dynamic scheduled CPU Dynamic scheduled CPU 包含 3 個階段與 4 個單元 ![image](https://hackmd.io/_uploads/S1iQsPCTC.png) * Stage - 1: In-order issue * 依照 compiler 所排序的順序來做 fetch、decoder 等 * Stage - 2: Out-of-order execute * Reservation station * 用來儲存即將要執行的 instruction 的 buffer * 控制哪些 instruction 需要做 pending (等待) * Functional units * 根據不同的形態來做相對應的運算 * Stage - 3: In-order commit * 將前一階段計算完後的結果，透過 commit unit 儲存在 reorder buffer 中 * 在 buffer 中會依照原執行的順序重新排列，並將計算結果寫入 register 或 memory 或是之前正在 reservation station 等待的 instructions ##### Why do dynamic scheduling ? > 為何不讓 compiler 去做 code 的排程就好 ? * 因為並不是所有的 stalls 都可以被判斷出來，某些情況下 (Ex: cache misses) 是無法判斷的 (memory access can't predicable) * 此外， branch 的結果也不能單靠 scheduling 來解決 * 不同的 ISA 有不同的實作方式，stage 也不大相同，也會遇到不同的 hazard，要重新做編譯 (re-compile) 的話非常麻煩