計組 L4 - HackMD

# 計組 L4 # 1st期中考 ## 1.ALU基本實現功能 #### 1-bit概念 ![](https://i.imgur.com/XuLfavU.png =70%x70%) - ### ADD :::info ![](https://i.imgur.com/saQEdwr.png) - ALU1的Cin(CarryIn)是ALU0的Cout(CarryOut) ::: ### SUB :::info ![](https://i.imgur.com/dWBGV9F.png) :::success ### 衍伸：NOR--迪摩根定律 ![](https://i.imgur.com/FR0l7M5.png) :::info ::: ### SLT - 要求a<b , 所以可寫成(a-b)<0 - 滿足時result = 1 (0...01, 從左至右依序是31-0bit) :::info ![](https://i.imgur.com/E66WKL0.png) ::: ### Overflow(溢位) ``` 如果最高位元的CarryIn與CarryOut一者為1一者為0，則有溢位。即： Overflow = Cin[N-1] XOR Cout[N-1] ``` --- ## 2.CPU Overview ### Multiplexer :::info ![](https://i.imgur.com/QQ7lVNV.png =80%x80%) 上面這是CPU的概觀，但不能把線都混在一起 →所以需要Mux(多工器), 以便完成多種功能 ::: ### Build a datapath. :::info #### - Instruction Fetch ![](https://i.imgur.com/eDufVFB.png =60%x60%) #### - R-format Instructions ![](https://i.imgur.com/lnhOchd.png =80%x80%) #### - Load/Store Instructions ![](https://i.imgur.com/EbLWS69.png =80%x80%) #### - Branch Instructions ![](https://i.imgur.com/iwMvBJo.png =80%x80%) ::: ### 問題：要怎麼把這四個電路圖combine在一起成為一個datapath? :::info - 用Mux(多工器)！ ![](https://i.imgur.com/urEhfuP.png =80%x80%) ::: ### "Write Register"的不同 :::info ![](https://i.imgur.com/AtdZVdV.png =70%x70%) - R-format 是 rd (addr[15:11]) - lw 是 rt (addr[20-16]) ::: ### So,the correct FULL Datapath is... :::info ![](https://i.imgur.com/x5CTq2y.png) > :::success ### 以下三種不同路線R-format, Load, Branch - R-format ![](https://i.imgur.com/cRv287x.png) - Load(以lw當example) ![](https://i.imgur.com/ZMi9e7y.png) - Store (rt寫入Data Memory的Write Data這個input port) ![](https://i.imgur.com/fzpIsG7.png) - Branch ![](https://i.imgur.com/lg9u9ci.png) ::: :smile::smile::smile::smile::smile::smile::smile::smile:1st期中考L4.P.35到這裡喔 --- # 2nd期中考 ### ALU功能: function field :::warning ![](https://i.imgur.com/Uv3H23n.png =60%x60%) **註：lw, sw 要加(ADD)immediate，beq 則是要把 rs, rt 相減(SUB)** ![](https://i.imgur.com/eU6H7I8.png =60%x60%) ::: ### Control的架構 :::success ![](https://i.imgur.com/zRdmso7.png =60%x60%) **Hint** - rt 在load，是當寫入的register - rd都是寫入 --- ### Control lines分別為0與1時的情況: | 控制信號 | 0 | 1 | | :--------: | :--------: | :--------: | | RegDst | 寫入目的暫存器: rt |寫入目的暫存器: rd| | RegWrite | None | Write register = Write Data寫入 | | ALUSrc | ALU input2 = Read Data2寫入 | ALU input2 = instruction[15:0]經過Sign Extension的值 | | PCSrc | PC = PC+4 | PC = Branch Target | MemRead | None | 允許要被讀取的資料輸出記憶體(lw) | | MemWrite | None | 允許要被寫入的資料輸入記憶體(sw) | | MemtoReg | Write Data是來自ALU計算結果 | Write Data是來自Data Memory 【原圖】 ![](https://i.imgur.com/V3Ljfu7.png =80%x80%) ::: ### R-format ![](https://i.imgur.com/YxX0XQ6.png) ### Load ![](https://i.imgur.com/7C19M4A.png) ### Branch ![](https://i.imgur.com/ecNzL7u.png) :::info ### 綜合比較: Control line數值 ![](https://i.imgur.com/GxAD367.png) ::: ### Jump ![](https://i.imgur.com/7qGZ4aQ.png =60%x60%) ![](https://i.imgur.com/87jWlAi.png) --- ### Issue: 效能缺陷分析 - Longest delay determines clock period -- **Critical path: load instruction** -- *Instruction memory > register file > ALU > data memory > register file* - Not feasible to vary period for different instructions - **Violates design principle:** -- **Making the common case fast** - We will improve performance by **pipelining** --- ### Pipelining: 將工作(job)執行劃分為數個stage(step) ![](https://i.imgur.com/4nlER4m.png =60%x60%) **`HINT`: 如果每個stage花費的時間大約相等，並有足夠的jobs時，Speedup ≒ Stage數目** ### MIPS Pipeline 1. IF: Instruction fetch from memory 2. ID: Instruction decode & register read 3. EX: Execute operation or calculate address 4. MEM: Access memory operand 5. WB: Write result back to register --- ### Single-cycle v.s Pipeline Assume following, ![](https://i.imgur.com/SUDWiaq.png =60%x60%) ![](https://i.imgur.com/vvIh4DI.png =60%x60%) For 3-load-word instructions, **Single-cycle: 2400ps Pipeline: 1400ps** --- ### Pipeline加速特性 - 增加整體jobs生產量(throughput)，但並不會減少單一job的延遲時間(latency) - 如果Stage時間不盡相等，Speedup會下降 ### Pipeline執行時間及加速公式假設一週期時間 T, 管線機器有 S 個階段, 若有 N 個指令要執行: - Execution Time = [ ( S-1 )+N ] * T - CPI = [ ( S-1 )+N ] / N - Speedup = S * N * T / Execution Time (Single-Cycle執行時間/Pipeline執行時間) ### 專為Pipleline設計的指令集-MIPS: 特點 - All instructions are 32-bits: Stage1指令擷取, Stage2解碼變簡單 - Few and regular instruction formats: Stage2解碼可同時讀取register file - Only Load/store accessing memory: 可以在Stage3 Calculate address, Stage4 access memory - Alignment of memory operands: 如此就不會有單一資料傳輸指令需要兩次資料記憶體存取 --- ### Hazards(危障): 管線中不能順利在下個週期, 執行下個instruction - `Structure Hazards`: 硬體資源不夠，導致同一時間要執行多個指令卻無法 - `Data Hazards`: 管線中某一指令需要用到前面指令(還在管線中)尚未產生的結果(**資料相依, data dependency**), 也可說執行指令所需的資料還未get - `Control(Branch) Hazards`: 　當分支指令尚未決定，是否分支至其他指令執行時，後續指令已經進入管線中，若分支發生則指令執行順序便會發生錯誤。　Control(Branch) Hazards的產生是由管線中的分支指令，及其他會改變Program Counter的指令所引發。 --- ### 解決Hazards: Stall(暫停) ![](https://i.imgur.com/HbhXKaV.png =50%x50%) . ### 解決Hazards: Forwarding(前饋) ![](https://i.imgur.com/YkoHG0V.png =50%x50%) ### 特例:解決Load-Use Data Hazards: Stall+Forwarding ![](https://i.imgur.com/XL5xAYE.png =50%x50%) `解釋：因為至少要Stall一個cycle，才能Forwarding到SUB指令的ALU的輸入，不然會無法往回forward!!!` --- ### 比較：三種Hazards解決方法 - Structure Hazards --**Stall** --Add more hardware - Data Hazards | 軟體方式 | 方法1.插入NOP指令(Stall)／方法2.重排指令順序 | | -------- | -------- | -------- | | 硬體方式 | 方法1.Forwarding(Non-Load-Use)／方法2.Stall+Forwarding(Load-Use) | `Hint:5個Stage的MIPS Pipeline，指令間距離>2就不會有Data Hazards` ![](https://i.imgur.com/2XMpTuA.png =50%x50%) - Control(Branch) Hazards **(下面Hazard Detection第2點有詳細解說)** --Add hardware to do it in **ID** stage --**Branch Prediction**: 1-bit(預測錯1次就改變預測位元), 2-bit(預測錯2次才改變預測位元) - In MIPS pipeline, it can predict branches not taken ![](https://i.imgur.com/pHjy9fC.png =50%x50%) --現實的Branch Prediction - Static branch prediction - `Based on typical branch behavior` - `Example: loop and if-statement branches` - Dynamic branch prediction - `Hardware measures actual branch behavior` - `Assume future behavior will continue the trend` --- ### Pipeline Datapath ![](https://i.imgur.com/oNwcx0y.png) ![](https://i.imgur.com/aFwI1FH.png) ### Pipeline Register詳解 `分左邊, 右邊著色表示現在是read還是write` #### :one: Load指令 ![](https://i.imgur.com/1hSCJGV.png =60%x60%) ![](https://i.imgur.com/bP7cKb7.png =60%x60%) ![](https://i.imgur.com/t8stdAB.png =60%x60%) ![](https://i.imgur.com/YIhgesF.png =60%x60%) :::warning ![](https://i.imgur.com/JNZo9Cx.png =60%x60%) #### `錯誤：WB for Load 寫data到register是下幾個指令當前讀到的registerNO.，錯了!!!!!` ![](https://i.imgur.com/nRCIyCi.png =80%x80%) #### `Solution：接一條線從IF/ID到MEM/WB以便得到正確的rt暫存器編號` ![](https://i.imgur.com/kMZgfue.png =80%x80%) #### `Load+R-format` ::: #### :two: Store指令比較 ![](https://i.imgur.com/CCOKu3E.png =80%x80%) `>圖中可以看到是去做write` ### Pipeline Control控制信號跟Single-Cycle CPU一樣，只是有用Pipeline Register ![](https://i.imgur.com/8cJSV6a.png) ![](https://i.imgur.com/zsM0zfh.png) --- ### Hazards Detection(危障偵測) `Hint:看一下指令間是否存在data dependencies` #### :a:Data Hazards . **下面是一些指令，存在data dependencies** ![](https://i.imgur.com/CT6ip8y.png =80%x80%) . #### :one: Fowarding **1. 會存在需要forwarding，代表RegWrite(寫資料到register)這個控制信號=1 2. 而且rd必須必須不是$zero暫存器** ![](https://i.imgur.com/mVvsvYu.png =50%x50%) . **見下圖，ForwardA 管 rs ／ ForwardB 管 rt** ![](https://i.imgur.com/WuMoGfK.png =80%x80%) . **接線示意圖** ![](https://i.imgur.com/44bYymK.png =80%x80%) . **Forwarding Control(ForwardA, ForwardB控制信號)** ![](https://i.imgur.com/gzbeL9q.png =80%x80%) . **Double Data Hazards(兩個以上的資料危障，forward會出現問題!) →MEM hazard forward只存在 EX hazard沒出現的情況** :::success ![](https://i.imgur.com/iCjJc8g.png =80%x80%) 所以，修改過的版本↓（加入not(EX hazard condition)) ![](https://i.imgur.com/hYKk0P8.png =60%x60%) ::: . 加入forwarding的電路圖Datapath ![](https://i.imgur.com/b5koZ8k.png) . **特例:Load-Use Hazard Detection** Recall that: Need stall+forwarding ![](https://i.imgur.com/vxwYB4N.png =80%x80%) ![](https://i.imgur.com/oAxBzav.png =60%x60%) . #### :two: Stall - `Force control values in ID/EX registerto 0(NOP):這個動作叫Flush ` - ![](https://i.imgur.com/WTEHzLZ.png =60%x60%) - bubble - 把EX, MEM, WB control value變0 - 保留IF及ID中的instruction for one more cycle - `Prevent update of (1)Program Counter (2)IF/ID register` - 藉由這點來保留指令在下個cycle開始繼續執行 #### :three: Datapath with Hazard Detection ![](https://i.imgur.com/dJHPx7x.png) --- #### :b:Control(Branch) Hazards . 如果Branch結果在MEM stage被決定跳到哪個PC ![](https://i.imgur.com/jR1HERp.png =60%x60%) . 但我們要降低Branch的Delay `>>在ID stage就決定!(因此需要增加一個module功能包含"1.目標address加法器/2.Register值比較器")` ![](https://i.imgur.com/yosKWsE.png =60%x60%) . 參考上圖指令藍色①處是Target Address Adder: `beq(PC=40)去(PC+4)+Offset(7*4)=72跳到lw指令(PC=72)位置` 藍色②處是Register Comparator . 以上①②皆在ID stage完成，成功降低了Branch delay ![](https://i.imgur.com/d3ebF6e.png) . 對於要跳過的指令，我們給他Control Value = 0 (圖中藍圈處) ![](https://i.imgur.com/NVD3Su3.png) . 因為ID stage時Branch就要compare兩個register的值(提早了，原本是在EX階段) 如果是**往上2個(含)以上的R-Format指令們** 可以用forwarding來解決data hazards ![](https://i.imgur.com/Zqe0Wvf.png =60%x60%) . 如果是**往上1個R-Format指令**／**往上2個load instruction** :arrow_forward: **需要Stall 一個cycle** 接著可以用forwarding來解決data hazards ![](https://i.imgur.com/x5EDuU6.png =60%x60%) :::success #### 動態Branch預測(分1-bit, 2-bit) ![](https://i.imgur.com/3wJnFSM.png =60%x60%) 2-bit預測只要change，再錯一次預測就會再change。 (因為上一次錯了，這一次再錯就change) ::: --- ### Pipeline危機狀況處理: Exceptions and Interrupts 簡介: | 中斷原因類型 | 發生性質 | | -------- | -------- | | Exception | 內部程式問題(overflow, undefined instruction....) | | Interrupt | 外部I/O devices | . 如何處理Exception(in MIPS)? 1. 把發生問題的指令位置存入 Exception PC 2. 把發生問題的原因寫入Cause Register(下圖假設1-bit cause register可以去分辨undefined opcode或是overflow) 3. 比較：Cause register是先分類再定義好／Vectored Interrupt是已經定義好直接放進去 ![](https://i.imgur.com/mw32yRp.png =60%x60%) ![](https://i.imgur.com/VpXsJoB.png =60%x60%) . 假設`add $1, $2, $1`有overflow產生 1. 我們要防止這個overflow錯誤的值存進$1 2. 前面的指令要繼續執行完 3. 這個add指令與接下來的指令都flush掉(control value設0) 4. Set Cause register跟 EPC register的value 5. jump to handler ![](https://i.imgur.com/AIdNf6u.png =60%x60%) . 複習：藍色⓪處是遇到hazard 藍色①處ID.Flush是處理Decoder遇到undefined instruction的情況此2個情況只要有一個發生(OR邏輯閘)，就會給那個MUX=0，產生Stall . 藍色②處EX.Flush是處理overflow的情況，因為要WB(Write data回去)，所以WB那邊要再接一個MUX=0 ![](https://i.imgur.com/AFgyQGU.png) . 來看一個Example ![](https://i.imgur.com/C68mQpR.png =40%x40%) ![](https://i.imgur.com/wEuTn9W.png =70%x70%) . 然後要Flush完變成 ![](https://i.imgur.com/nPjaZuz.png) --- ### 進階管線平行處理: Instuction-Level Parallelism (ILP) . 要讓指令層次平度提升有2種方法: 1.增加管線深度 2.讓每個階段執行更多指令>>IPC是CPI倒數 ![](https://i.imgur.com/5nDUIEY.png =60%x60%) . Multiple issue又可分為: 1.Static: Compiler重新排列instruction順序 2.Dynamic: 硬體之間的分工 ![](https://i.imgur.com/EFZFkkM.png =60%x60%) . Static Example: Dual-issue packets(指令成雙出現) `ALU或Branch指令必須在前面` `如果成對指令中有一個指令是不能使用的，則用NOP代替` ![](https://i.imgur.com/YEgZQV9.png =60%x60%) . 電路圖: 1.首先需要的是Registers上額外的存取ports，在一個clock cycle我們可能需要讀出兩個暫存器的值來做ALU運算，還要讀出兩個暫存器的值為儲存指令做運算。 2.並且需要一個ALU運算的寫入port及一個給載入指令的寫入port。 3.由於ALU這個模塊一定是拿來做ALU運算，所以我們還需要另一個加法器來計算資料傳送指令的有效位址。 ![](https://i.imgur.com/PoWjEPN.png) . Dual-issue 中 Hazards的情況解決: 1.EX data hazard要把R-format跟load/store分開成2個packet，因為無法forwarding 2.Load-use hazard一樣stall一個cycle，但是得到的好處是一次進來2個指令(一個packet進來) ![](https://i.imgur.com/U6mBWxv.png =60%x60%) --- ### Code Scheduling ![](https://i.imgur.com/YddT5oQ.png =70%x70%) . 前面3個指令有data dependencies ![](https://i.imgur.com/Rkf94Sx.png =20%x20%) 後面2個指令也data dependencies ![](https://i.imgur.com/Go83iav.png =20%x20%) . 重新排程後的結果如上圖(不做的指令填NOP) . #### :star: Loop Unrolling 多重分發管線之迴圈展開 . **Antidependence**: 純粹暫存器$t0的名字重複使用而已，不是真正的data dependency **Register Renaming**: 消除Antidependence的情況 ![](https://i.imgur.com/2OlwSOw.png =60%x60%) . 把他寫成Dual-Issue的指令packet ![](https://i.imgur.com/m4h7xf6.png =60%x60%) --- ### Dynamic Multiple Issue - 又叫"Superscalar" processors - 由CPU決定要issue幾個instruction/per cycle - 比較沒有那麼仰賴code scheduling - **由硬體來確保執行沒有錯誤(hardware support)** - out-of-order execution: 允許CPU不按順序執行指令來避免stalls - in-order commit: But commit result to registers in orde - 保持相依性 - reorder buffer: buffer in commit unit ![](https://i.imgur.com/GKGtW7X.png =70%x70%) :::success #### 為何要做Dynamic Scheduling? 1. 並不是所有stall你都可以預測去做到( ex. cache misses) 2. 不能總是schedule around branches (Branch outcome is dynamically determined) 3. 不同的的ISA架構有不同的latencies跟hazards ::: ### Power Efficiency ![](https://i.imgur.com/dVpf76o.png =60%x60%) --- ### L4總結 ![](https://i.imgur.com/sD3tzgk.png =60%x60%) --- ##### tags: `計組` `CO` `L4` --- :::warning #### 問答區/Chatting Room Q1: Code scheduling那個圖我不會by秉要看小考4 :::