# Instruction Level Parallelism (ILP) ## 1. Introduction 通過同時執行多條指令來提高處理器的性能。利用程式中指令之間潛在的獨立性,使多條指令能夠在同一時間內在處理器的不同 unit 上並行處理 ### 1.1 方法 #### hardware-based dynamic approach → Superscalar processors - in run-time - use in-order execution if they are statically scheduled use out-of-order execution if they are dynamically scheduled - 應用於server, desktop processor, high-end phone等,要求高性能的設備 #### compiler-based static approach → VLIW processors - in compile-time 編譯時由編譯器發現並行性,編譯器嘗試 re-ordering,以最大化的平行執行指令 - issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with parallelism - 應用於 DSP processor ### 1.2 ILP 一定要處理 dependence 的問題 #### a. true dependence(data dependence) 真的要等 <img src="https://hackmd.io/_uploads/S1qwMAc3R.png" width="80%"> #### b. name dependence naming 的問題,rename register 就好了 (WAR, WAW) ![Untitled (17)](https://hackmd.io/_uploads/Hyh2MAq2C.png) #### c. control dependence ![Untitled (18)](https://hackmd.io/_uploads/BywWm0q2A.png) ## 2. 針對ILP的基本compiler 技術 (VLIW) ### 2.1 Implementation separate dependent instructions ![0033B486-4052-4AF5-AF30-1547AAD48BDF](https://hackmd.io/_uploads/S1Sb00ch0.jpg) **Step1: Insert stalls** ![Untitled (19)](https://hackmd.io/_uploads/SyN1ARc3C.png) **Step2: Loop unrolling without scheduling** ![D1C5B324-DAA9-4736-AC76-F07C1BC67568](https://hackmd.io/_uploads/H10RJyo3R.jpg) use different registers **step3: re-schedule** ![Untitled (20)](https://hackmd.io/_uploads/SJoxl1ihA.png) ### 2.2 VLIW Introduction - very long instruction word - use multiple, independent functional units - Loop unrolling and then code scheduling (just as above) ![70C17435-3173-44EA-B5FC-E9BC63BD419E (1)](https://hackmd.io/_uploads/SJ3XXyo3A.jpg) - Problem 1. increase in code size 為了填滿每個非常長的指令字,編譯器可能需要插入無操作指令(NOPs),特別是當它無法找到足夠的獨立指令來平行執行時。這導致程式碼大小增加,進而影響cache效能並增加記憶體使用。 2. compiler複雜且如果硬體改版,需要進行大量的重新編譯和優化 ## 3. 用進階的advanced branch prediction 來減少branch cost **→ dynamic branch prediction by hardware** branch prediction: 1. branch or not, 2. if branch, where is the target goal: 越早知道 branch outcome 越好 ### 3.1 Example ![829ADA67-AA27-4A77-9188-6FFDBD409F2B](https://hackmd.io/_uploads/By4rNks2A.jpg) ![FBC73F33-327B-40C6-A794-9CAA70C80986](https://hackmd.io/_uploads/Hk0HryjnA.jpg) ### 3.2 Branch History Table (BHT) correlating predictor/ 2-level (m, n) predictors #### Implementation - m: 前m個branch的outcome(前m個branch的taken/not taken所構成的table) n: predictor的bit數 p: branch address - record last m branches to select 2^m history tables with n-bit predictor total bits for (m, n) BHT prediction buffer with p-index: **2^m * n * 2^p** - use p bits of branch address to select row use m-bit shift register to select column get the n predictor bits in the entry to make the decision ![3A9EA83B-B504-47D3-B56A-E8E630C1315F](https://hackmd.io/_uploads/r1tVxfs2R.jpg) - Example ![Untitled (1)](https://hackmd.io/_uploads/HkzDgfo2C.png) (0, 1) predictor ![Untitled (2)](https://hackmd.io/_uploads/SkDtxzo2R.png) ### 3.3 Tournament predictor 結合多種預測方法的優勢來優化預測準確性(混合型) 1. Multiple predictors 常見的有Global Predictor和Local Predictor. Global Predictor使用一個global 的branch歷史記錄來做出預測,而Local Predictor則為每個branch單獨維護一個歷史記錄。 ![Untitled (3)](https://hackmd.io/_uploads/BkP-WGjnR.png) 3. Others: * McFarling’s Gshare Predictor: 按照過去的行為和現在的branch address做運算 * Tagged hybrid predictors(TAGE): have predictor for each branch and history ## 4. Dynamic scheduling for ILP (Superscalar) ### Introduction - 優點: 1. reduce stall: rearrange the instruction 2. 在run time 時執行,因在編譯時無法識別所有dependence的情況,所以superscalar效果比VLIW好,也使編譯器比較容易設計 3. explore **hardware speculation**,但需要高度精確的控制和復原機制。 缺點: 硬體設計複雜 - Implies **out-of-order execution and out-of-order completion** → WAR, WAW hazards → Register renaming ![Untitled (4) (1)](https://hackmd.io/_uploads/Hy12Mzsh0.png) - <img src="https://hackmd.io/_uploads/rkQrmfjnA.png" width="80%"> **structural and data hazards could be checked during ID stage** ID stage can be separated into 2 parts - Issue: 看有沒有available的hardware function unit, check for structural hazard in the manner of in-order issue - Read operands: wait until no data hazards, then read operand - Scoreboard vs. Tomasulo’s approach ### 4.1 Scoreboard #### Implementation [Scoreboard.pdf](https://drive.google.com/file/d/1r8DTNvHYp8O86no0RPDUkA9CijClfpJf/view?usp=sharing) - **In-order issue, out-of-order execution/ completion** - WAR, WAW → waiting. ex: cycle 17 in Scoreboard.pdf stall write-back until registers have been read can only read registers during read-operand stage - **No register renaming** - Centralized control ### 4.2 Tomasulo’s approach #### Implementation [Tomasulo.pdf](https://drive.google.com/file/d/1Z_MZQXM2HYqXntStbe87F00AgmzQ3Gee/view?usp=sharing) #### Architecture ![0E48BFC1-A657-46F2-9A49-6F6B684F12BB](https://hackmd.io/_uploads/HyQYPfs2R.jpg) - **registers in instruction are renamed** to reservation station (RS) → avoid WAR, WAW - distributed control FU, load/ store 有自己的 RS → distributed - RS紀錄的方法並非直接紀錄其register的名字,而是紀錄他在那一個buffer或是reservation stations上,透過這個方法達到register renaming if no RS is available → structural hazards → stall - FU 的結果透過common data bus (CDB) broadcast 回 RS,讓在RS等的資料可以做 #### 透過 speculation 更精進ILP hardware speculate on outcome of branches and executing program as if guesses were correct - dynamic scheduling: only fetches and issues instructions - speculation: fetch, issue, and execute instructions → 但可能會猜錯,所以要能復原回去 - Tomasulo’s challenge 1. 可能有多個function unit 同時做完 → need more CDB → more complex 2. branch 猜錯或exception 產生時,因為是out-of-order completion,很難復原回去 → solution: **Reorder Buffer(ROB)** Operation - Reorder Buffer Operation 寫進ROB 的順序和 in-order 的 issue 一樣 ![2FF252FE-6699-480C-81E9-3DFFE3533628](https://hackmd.io/_uploads/ByKicfih0.jpg) - 3 components of HW-based speculation 1. Dynamic branch prediction: choose instructions 2. Dynamic scheduling: manages the order of executing instruction (out-of-order) 3. Speculation: executes instructions **before control dependences are resolved + ability to undo when incorrectly speculate** - Adding ROB to Tomasulo - 多加一個state為instruction commit,且commit in-order,確定instruction不是speculative才update register file - 猜錯就flush掉 - provides precise exceptions - ROB 用處: holds the result of instruction that have finished execution but not committed - 影響: - Use ROB number instead of RS to indicate the source of operands when execution completes (but not committed) - ROB 有位置時 instruction 才能 issue - ROB內容 1. instruction type: branch/ store/ register operation 2. destination: 寫入reg的位置 ALU operations: 寫入哪個reg load: 寫入哪個reg store: memory address 3. value: value of instruction result 4. ready: 顯示instruction has completed execution, the value is ready 5. busy: 顯示instruction hasn’t completed execution - Four steps of ROB to Tomasulo 1. issue: If **reservation station and reorder buffer slot are free(缺一不可)**, issue instruction & send operands & reorder buffer no. for destination 2. Execution (EX) → When both operands ready, then execute → if not ready, watch CDB for result; when both in reservation station, execute 3. Write result (WB) mark reservation station available Write on 1. Common Data Bus and 2. reorder buffer 4. Commit * When instruction at head of reorder buffer & result present, update register with result (or store to memory), then remove instruction from reorder buffer. * if mispredicted → flush - architecuture ![AF47DCD5-AF44-4D58-AE61-7756AAC1F62B](https://hackmd.io/_uploads/r1svTGinR.jpg) - Use extended registers for renaming in both reservation stations(RS) and ROB - multiple issue (同時有多個instruction 被 issue) w/ or w/o speculation ![99D29840-B975-4BF4-847B-1B3B619BA0F8](https://hackmd.io/_uploads/rJuSkXi3R.jpg) `Id following bne can’t start execute earlier` ![7757A1D5-066F-43E8-BAF0-FF4A98872ABE](https://hackmd.io/_uploads/H1o2k7onR.jpg) `Id following bne can start execute earlier` #### BTB - challenge: 除了 predicting branches well,我們也希望能早點知道要不要branch和要branch到哪裡 <img src="https://hackmd.io/_uploads/BkG9xQs3C.png" width="80%"> `jr $ra ,常用於函數返回操作中,其中return address被儲存在一個register($ra)` - solution - Software 1. loop unrolling: 把branch拿掉 2. instruction scheduling - Hardware 1. predicated instruction 2. branch target buffer (BTB) → more common vs. BHT,BHT 沒算出 target - BTB 希望可以在 IF 就知道是 conditional instruction,並知道要跳去哪(know the address) → **use the instruction address rather than wait for decode** → access BTB during IF stage** ![Untitled (10)](https://hackmd.io/_uploads/SyDUWQi20.png) keep both the branch PC and target PC in the BTB ![Untitled (11)](https://hackmd.io/_uploads/B1sObQs2R.png) - 更精進BTB: 存target的地方,乾脆存target的instruction,而不是address,跳到target後就不用再去instruction memory fetch → Branch folding - BTB challenge → Return address predictors - indirect jump: destination address varies at run time。目標地址不是直接給定,而是存儲在寄存器或內存位置中,這使得跳轉目標在執行時才確定。 ex: nested loop → accuracy of BTB for procedure returns are low - solution: use a small buffer of return addresses operating as a stack ![Untitled (12)](https://hackmd.io/_uploads/ryoyfmonA.png) ![Untitled (13)](https://hackmd.io/_uploads/S14ef7ihC.png) - Combination of BTB and BHT: 使用BTB較貴但較快,entries也較少,所以BHT作為輔助,hold 更多entries ![Untitled (14)](https://hackmd.io/_uploads/r1KXzmj2A.png) ## 5. Summary - superscalar can do: dynamic scheduling, multiple issue, speculation 1. old codes still run: 已編譯的二進位程式碼(binaries)可以在不需要重新編譯的情況下繼續運行。(不怕compiler改版) 2. little impact on code density - superscalar multiple issue 的問題 1. multiple instructions are issued per cycle, so we deal with multiple instructions commit per cycle → CDB 要很多條 2. There may be many possible dependencies between instructions → instruction issue is the bottleneck in dynamically scheduled superscalar - Limits on ILP - **現有機制下的ILP可利用性**: 考慮到現有的硬件機制和增加的硬件預算,能夠達到多少程度的ILP是一個關鍵問題。 - dependency 問題,使ILP遇到bottleneck **→ Thread Level Parallelism (TLP)**