# Computer Organization Final Exam Review

## The Processor

### CPU performance factors
* Instruction count (determined by ISA and compiler)
* CPI and cycle time (determined by CPU hardware)

### Instruction Execution
1. Fetch instruction: PC -> instruction memory
2. Read registers: register numbers -> register file
3. Others (depending on instruction class)
   * Use the ALU to calculate
   * Access data memory for load/store
   * PC <- target address or PC + 4

### CPU overview
Because we assume memory can perform only one read/write per clock cycle, memory is split into an instruction memory and a data memory.
![](https://i.imgur.com/i97eB6k.png =500x)

### CPU components used by each instruction

#### Instruction Fetch
![](https://i.imgur.com/9ZHc2XJ.png =400x)

#### R-type Datapath
| op | rs | rt | rd | shamt | funct |
| -- | -- | -- | -- | ----- | ----- |
| 6  | 5  | 5  | 5  | 5     | 6     |

![](https://i.imgur.com/qQgF9Cu.png =500x)

#### Load/Store Datapath
| op | rs | rt | immed |
| -- | -- | -- | ----- |
| 6  | 5  | 5  | 16    |

![](https://i.imgur.com/7Lgza7q.png =500x)

#### Branch Datapath
beq rs, rt, imm16
![](https://i.imgur.com/7qnVaF9.png =500x)

#### Jump
| op | address |
| -- | ------- |
| 6  | 26      |

Target address = {(PC+4)[31:28], address, 00}: the 26-bit field is shifted left by 2 and concatenated with the upper 4 bits of PC+4 (not added to it).

### Composing the datapath
![](https://i.imgur.com/ReVpzyt.png =600x)

### ALUop and ALU Control
The opcode first determines the instruction type. For lw, sw, and beq, the ALU control can be determined without the function code; for R-type instructions, the function code is also needed to determine the ALU control.
![](https://i.imgur.com/dP77pww.png =500x)
![](https://i.imgur.com/RhaCBgC.png =400x)

### Control signals
![](https://i.imgur.com/VDWzkeP.png =400x)

#### Logic of control signals
![](https://i.imgur.com/2u7CKtK.png =400x)

## Pipeline

### MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

![](https://i.imgur.com/NV6KZJX.png)
* Need registers between stages: to hold information produced in the previous cycle

![](https://i.imgur.com/4rEuAoL.png)

### Pipeline performance
Suppose ID and WB each take 100 ps and the other stages take 200 ps. The total time per instruction is then:
* lw: 800 ps
* sw: 700 ps
* R-format: 600 ps
* beq: 500 ps

Without pipelining, the cycle time must cover the slowest instruction, so it is 800 ps. With pipelining, the cycle time is set by the slowest stage, so it is 200 ps.

##### Throughput increases; latency stays the same

### Speedup
If all stages are balanced (i.e., all take the same time):
```
Time between instructions(pipelined) = Time between instructions(nonpipelined) / Number of stages
```
If not balanced, the speedup is less.

### MIPS ISA designed for pipelining
* All instructions are 32 bits (easier to fetch in one cycle)
* Few and regular instruction formats (can decode and read registers in one step)
* Load/store addressing
* Alignment of memory operands (memory access takes only one cycle)

### Load example
Within each clock cycle, the register file is written in the first half and read in the second half.
* Instruction fetch
![](https://i.imgur.com/CPxtSAr.png =550x)
* Instruction decode
![](https://i.imgur.com/U96FHPI.png =550x)
* Execution
![](https://i.imgur.com/stoT00H.png =550x)
* Memory access
![](https://i.imgur.com/ptl8R3k.png =550x)
* Write back
![](https://i.imgur.com/M0LDsYo.png =550x)

### Multi-Cycle Pipeline
![](https://i.imgur.com/1znBuy9.png =500x)
![](https://i.imgur.com/dEYHlHL.png =500x)
![](https://i.imgur.com/L1DDLWZ.png =500x)
![](https://i.imgur.com/gvNxjCC.png =500x)
![](https://i.imgur.com/tBBHbaL.png =500x)
![](https://i.imgur.com/AbmMMTv.png =500x)

### WB issue for load and R-type
The write-register number used at WB must be carried along through the pipeline registers and selected only when the instruction reaches WB; otherwise, with multiple instructions in flight, we would lose track of which register the write-back belongs to.
![](https://i.imgur.com/Rlx7Nfu.png =600x)
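A minimal C sketch of this idea, with illustrative struct and field names that are not from the course materials: each pipeline register carries the destination register number forward, so WB still knows where to write several cycles after decode.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative pipeline-register structs: each carries the destination
   register number (write_reg) forward, stage by stage. */
typedef struct { uint8_t reg_write, write_reg, mem_to_reg;
                 uint32_t alu_out, mem_data; } MemWb;

static uint32_t regs[32];

/* WB stage: the destination comes from the MEM/WB register, not from
   whatever instruction currently sits in ID. */
static void write_back(const MemWb *mw) {
    if (mw->reg_write && mw->write_reg != 0)
        regs[mw->write_reg] = mw->mem_to_reg ? mw->mem_data : mw->alu_out;
}

int main(void) {
    /* A lw that loaded 42 for $8 several cycles ago commits correctly. */
    MemWb mw = { .reg_write = 1, .write_reg = 8, .mem_to_reg = 1,
                 .mem_data = 42 };
    write_back(&mw);
    printf("$8 = %u\n", regs[8]);   /* prints: $8 = 42 */
    return 0;
}
```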
## Hazard
Situations that prevent starting the next instruction in the next cycle.

### Structural hazards
Conflict over the use of a resource.
* A load needs five stages (IF, ID, EX, MEM, WB)
* An R-type instruction needs only four stages (IF, ID, EX, WB)

If a load executes one instruction ahead of an R-type instruction, both would need WB in the same cycle! The solution is to make every instruction take **five stages**, no matter how many it actually needs.

### Data Hazards
An instruction depends on completion of data access by a previous instruction.
* RAW (i2 tries to read an operand before i1 writes it)
* WAR (i2 tries to write an operand before i1 reads it): cannot happen in the MIPS 5-stage pipeline
* WAW (i2 tries to write an operand before i1 writes it): occurs only in pipelines that write in more than one stage

#### Forwarding
Requires extra connections in the datapath. There are two cases: in the first, the result is forwarded from the end of the first instruction's EX stage to the EX input of the next instruction (EX hazard); in the second, it is forwarded from the end of the first instruction's MEM stage to the EX input of the instruction two behind it (MEM hazard).
![](https://i.imgur.com/ts1yZJN.png =600x)
* Forwarding conditions
  * EX hazard
```
if ( EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
     and (EX/MEM.RegisterRd = ID/EX.RegisterRs) )
  ForwardA = 10
if ( EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
     and (EX/MEM.RegisterRd = ID/EX.RegisterRt) )
  ForwardB = 10
```
  * MEM hazard
```
if ( MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
     and (MEM/WB.RegisterRd = ID/EX.RegisterRs) )
  ForwardA = 01
if ( MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
     and (MEM/WB.RegisterRd = ID/EX.RegisterRt) )
  ForwardB = 01
```
* Datapath with forwarding
EX/MEM.RegWrite, EX/MEM.RegisterRd, MEM/WB.RegWrite, MEM/WB.RegisterRd, and ID/EX.RegisterRs/Rt are fed into the **forwarding unit**, which computes ForwardA and ForwardB. A value of 10 selects the EX/MEM (MEM-stage) result as the ALU operand; 01 selects the MEM/WB (WB-stage) value.
![](https://i.imgur.com/Wh1Q5CA.png)

#### Double Data Hazard
Consider the following:
```
add $1, $1, $2
add $1, $1, $3
add $1, $1, $4
```
When the third instruction reaches EX, both the MEM-hazard and the EX-hazard conditions hold, but what we clearly want is the $1 produced by the second instruction. The MEM-hazard condition therefore needs a small revision: forward from MEM/WB only when the EX hazard does not also apply.
```
if ( MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
     and not( EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
              and (EX/MEM.RegisterRd = ID/EX.RegisterRs) )
     and (MEM/WB.RegisterRd = ID/EX.RegisterRs) )
  ForwardA = 01
if ( MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
     and not( EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
              and (EX/MEM.RegisterRd = ID/EX.RegisterRt) )
     and (MEM/WB.RegisterRd = ID/EX.RegisterRt) )
  ForwardB = 01
```

#### Load-Use Data Hazard
Can't always avoid stalls by forwarding.
![](https://i.imgur.com/69LRFvC.png =450x)
```
if ID/EX.MemRead and
   ( (ID/EX.RegisterRt = IF/ID.RegisterRs) or
     (ID/EX.RegisterRt = IF/ID.RegisterRt) )
  stall and insert bubble
```
* Datapath with load-use hazard detection
ID/EX.MemRead, ID/EX.RegisterRt, and IF/ID.RegisterRs/Rt are fed into the **hazard detection unit**. If the condition holds:
1. Set all control values in the ID/EX register to zero, so no instruction executes in the following EX, MEM, and WB stages (a bubble)
2. Prevent the PC from updating, so the using instruction is decoded again
3. Prevent the IF/ID register from updating, so the following instruction is fetched again

A 1-cycle stall lets MEM read the data for the lw, which can then be forwarded to the EX stage.
![](https://i.imgur.com/nYvqvr9.png)
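The conditions above translate almost directly into code. A minimal C sketch, with illustrative struct and function names not taken from the course materials: ForwardA with EX-hazard priority (which is exactly the double-data-hazard fix), plus the load-use stall test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative subsets of the pipeline-register fields. */
typedef struct { bool reg_write; uint8_t rd; } ExMem;
typedef struct { bool reg_write; uint8_t rd; } MemWb;
typedef struct { bool mem_read; uint8_t rs, rt; } IdEx;
typedef struct { uint8_t rs, rt; } IfId;

/* ForwardA: 2 (binary 10) = take the EX/MEM ALU result,
             1 (binary 01) = take the MEM/WB value,
             0 (binary 00) = take the register-file value.
   Checking the EX hazard first gives it priority over the MEM hazard,
   which is equivalent to the revised "and not(EX hazard)" condition. */
uint8_t forward_a(ExMem em, MemWb mw, IdEx ie) {
    if (em.reg_write && em.rd != 0 && em.rd == ie.rs) return 2;
    if (mw.reg_write && mw.rd != 0 && mw.rd == ie.rs) return 1;
    return 0;
}

/* Load-use hazard: the lw currently in EX produces a value that the
   instruction in ID needs next cycle -> stall and insert a bubble. */
bool load_use_stall(IdEx ie, IfId fi) {
    return ie.mem_read && (ie.rt == fi.rs || ie.rt == fi.rt);
}
```

ForwardB is identical with ID/EX.RegisterRt in place of ID/EX.RegisterRs.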
#### Code Scheduling
Reorder code to avoid using a load result in the next instruction. Since the order of some instructions does not affect the result, we can reorder them to reduce hazards.

### Branch (Control) Hazards
* Need to compare registers and compute the target early in the pipeline
Add hardware to do the branch execution (i.e., the comparison) in the **ID stage**
![](https://i.imgur.com/b7Pwepz.png)
By moving the comparison to the ID stage, **only one clock cycle** is wasted
![](https://i.imgur.com/UlXEl6r.png =500x)
* Data hazards for branch
  * Case 1: if the branch's rs and rt are the rd of the 2nd and 3rd instructions before it, forwarding to ID is enough
![](https://i.imgur.com/QoSIXSi.png =500x)
  * Case 2: if the branch's rs and rt are the rd of the 1st and 2nd instructions before it, 1 stall cycle is needed even with forwarding
![](https://i.imgur.com/mlf8o1V.png =500x)
  * Case 3: if the branch's rs or rt is the rt of an immediately preceding lw, 2 stall cycles are needed
![](https://i.imgur.com/k7fElYw.png =500x)

#### Branch Prediction
If the prediction is correct, no bubble is needed. There are two kinds of prediction:
1. Static branch prediction
Based on typical branch behavior
2. Dynamic branch prediction
* Branch prediction buffer (aka branch history table)
  * Indexed by recent branch instruction addresses
  * Stores outcome (taken/not taken)
* To execute a branch
  * Check the table, expect the same outcome
  * Start fetching from fall-through or target
  * If wrong, **flush pipeline and flip prediction**

1. 1-Bit Predictor
Start with an arbitrary guess (taken/not taken); after a misprediction, guess the other way the next time that branch executes. Consider a doubly nested for loop:
```
outer: ...
       ...
inner: ...
       ...
       beq ..., ..., inner
       ...
       beq ..., ..., outer
```
Suppose we initially predict taken. The first pass through the inner loop mispredicts once (on exit); every later pass mispredicts twice (on entry and on exit). This happens because a 1-bit predictor tolerates no error: as soon as it mispredicts once, it flips to the other prediction.
2. 2-Bit Predictor
Only change the prediction on **two successive mispredictions**
![](https://i.imgur.com/CxnpI8J.png =500x)
With this scheme, every pass through the inner loop above mispredicts only once (on exit).
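As a concrete sketch, here is a 2-bit saturating counter in C that replays the inner-loop example; the loop bounds and the initial state are made up for illustration. It mispredicts once per pass, on loop exit.

```c
#include <stdio.h>

/* 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict
   taken. One misprediction only nudges the state, so the prediction
   flips only after two successive mispredictions. */
typedef struct { int state; } Predictor;

int predict_taken(const Predictor *p) { return p->state >= 2; }

void update(Predictor *p, int taken) {
    if (taken  && p->state < 3) p->state++;
    if (!taken && p->state > 0) p->state--;
}

int main(void) {
    Predictor inner = { .state = 3 };    /* start: strongly taken */
    int misses = 0;
    for (int i = 0; i < 5; i++) {        /* 5 passes of the outer loop  */
        for (int j = 0; j < 10; j++) {   /* inner branch: taken 9 times */
            int taken = (j < 9);         /* then not taken on exit      */
            if (predict_taken(&inner) != taken) misses++;
            update(&inner, taken);
        }
    }
    printf("mispredictions: %d\n", misses);  /* prints 5: one per pass */
    return 0;
}
```

With a 1-bit predictor in the same experiment, the count would be 9: one miss on the first pass and two on each of the remaining four.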
#### Branch target buffer
* Cache of target addresses
* Indexed by PC when an instruction is fetched
If it hits and the instruction is a branch predicted taken, the target can be fetched immediately

### Exceptions and Interrupts
"Unexpected" events requiring a change in the flow of control
* Exception: arises within the CPU
* Interrupt: from an external I/O controller
Dealing with them without sacrificing performance is hard

#### Handling Exceptions
In MIPS, exceptions are managed by the System Control Coprocessor (CP0)
1. Exception occurs
2. Save the PC of the offending (or interrupted) instruction
In MIPS, save it in the **Exception Program Counter (EPC)**
3. Transfer control to the OS and perform some actions for the exception
Save an indication of the problem
   1. **Single entry**: in MIPS, the Cause register (jump to a **single address**, 0x80000180)
   2. **Multiple entry**: vectored interrupts
      * The CPU knows the addresses of the Interrupt Service Routines (ISRs) in advance
      * The interrupting device sends a vector to the CPU
      * The CPU compares the vector against an interrupt table in memory, then carries out the correct ISR for that device
4. If restartable (e.g., most I/O interrupts)
   * Take corrective action
   * Use the EPC to return to the program
   * **Otherwise: terminate the program**

#### Exception in a Pipeline
1. Exception occurs (e.g., overflow on add in EX)
2. Complete the previous instructions
3. **Flush the add and subsequent instructions**
4. Set the Cause and EPC register values
5. Transfer control to the handler

When an overflow occurs, EX.Flush is set to 1 and the MUX in front of the PC is switched to the handler address.
![](https://i.imgur.com/T0czFA4.png =700x)

## Instruction-Level Parallelism (ILP)
To increase ILP
* Deeper pipeline
* Issue multiple instructions in one cycle

### Multiple issue
* Static multiple issue
  * The compiler **groups instructions** to be issued together
  * The compiler detects and avoids hazards (e.g., two instructions with a hazard between them cannot go in the same group)
  * Think of an issue packet as a very long instruction word (VLIW)
* Dynamic multiple issue
  * The **CPU (hardware)** examines the instruction stream and chooses instructions to issue each cycle
  * The CPU resolves hazards using advanced techniques **at runtime**

### Speculation
The idea is to guess whether an instruction has a dependency on other instructions. If no dependency is apparent (no hazard that can be detected up front), start executing it early (but do not write results yet). If, once the earlier instructions finish, there really was no dependency, the instruction can safely complete; if the guess was wrong, flush it and redo.
* The compiler can reorder instructions
* Hardware can look ahead for instructions to execute
  * **Buffer results** until it determines they are actually needed
  * **Flush buffers on incorrect speculation**

### Scheduling Static Multiple Issue
* Reorder instructions into issue packets
* **No dependencies** within a packet
* **Pad with nop if necessary**

#### MIPS with Static Dual Issue
Two-issue packets
* One ALU/branch instruction
* One load/store instruction
* 64-bit aligned
![](https://i.imgur.com/6WjS4db.png =650x)

#### Additional hardware needed
The black lines are the datapath for the ALU/branch instruction; the blue lines are the datapath for the load/store instruction.
![](https://i.imgur.com/50rJcsh.png =650x)

#### Loop Unrolling
Unroll the loop to exploit as much parallelism as possible. The key is to use a different temporary register in each unrolled iteration (register renaming), which removes dependencies between iterations.

### Dynamic Multiple Issue
Four stages:
1. IF/ID
   * **In order**, very fast
2. Reservation stations
   * **Buffers holding** the operands and the operation
   * Once all operands and a functional unit are ready, send to the functional unit
3. Functional units
   * **Complete the operation**, send results to the commit unit and the reservation stations (because other instructions may be waiting for the result)
4. Commit unit
   * **Re-orders the register writes** (reorder buffer; the final writes must happen in program order)

![](https://i.imgur.com/yJbWSXt.png =600x)
* The complexity of dynamic scheduling and speculation costs power (a wider issue width and more speculation consume more energy)
* Multiple simpler cores may be better

## Cache
### Basic Principle of Locality
* Temporal locality
Items accessed recently are likely to be accessed again soon (recently used data can stay in the cache)
* Spatial locality
Items near those accessed recently are likely to be accessed soon (when fetching data, also fetch its neighbors, i.e., fetch a whole block at a time)

### Memory Hierarchy Levels
* Block (aka line): unit of copying
* If the accessed data is present in the upper level: hit
**Hit ratio: hits/accesses**
* If the accessed data is absent: miss
**Miss ratio: misses/accesses = 1 − hit ratio**
![](https://i.imgur.com/qDr2pJT.png =300x)

### Cache Memory
The level of the memory hierarchy closest to the CPU

### Direct Mapped Cache
* Location determined by address
* Direct mapped: only one choice
```
Cache index = (Block address) modulo (#Blocks in cache)
```
![](https://i.imgur.com/GqiP77d.png =500x)

#### Address Subdivision
1. The index selects the cache entry
2. Compare the tag
3. If the tag matches and the valid bit is set -> hit
4. Read out the data
![](https://i.imgur.com/gqoIQMK.png =500x)

#### Example
Suppose a cache has 64 blocks and each block holds 16 bytes.
1. To what block number does (byte) address 1200 map (using a direct mapped cache)?
Block address = floor(1200/16) = 75
Block number = 75 % 64 = **11**
2. How many bits for tag, index, and offset?
64 blocks -> 64 = 2^6 -> 6 bits for the index
16 bytes/block -> 16 = 2^4 -> 4 bits for the offset
The rest is the tag -> 32 − 6 − 4 = 22 bits
3. How many bits does the whole cache need?
#Blocks × (valid bit + tag + data) = 64 × (1 + 22 + 16×8) = 64 × 151 = 9664 bits
![](https://i.imgur.com/DRSeUDp.png)
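A minimal C sketch of the address breakdown in this example (64 blocks of 16 bytes, 32-bit addresses; the constants come from the example above, everything else is illustrative):

```c
#include <stdio.h>
#include <stdint.h>

/* 64 blocks x 16-byte blocks with 32-bit addresses
   -> 4 offset bits, 6 index bits, 22 tag bits. */
#define OFFSET_BITS 4
#define INDEX_BITS  6

int main(void) {
    uint32_t addr   = 1200;
    uint32_t block  = addr >> OFFSET_BITS;              /* block address: 75 */
    uint32_t index  = block & ((1u << INDEX_BITS) - 1); /* 75 % 64 = 11      */
    uint32_t tag    = block >> INDEX_BITS;              /* remaining 22 bits */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    printf("tag=%u index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```

For address 1200 this prints `tag=1 index=11 offset=0`, matching the worked answer above.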
### Cache Misses
* On a cache hit, the CPU proceeds normally
* On a cache miss
  * Stall the CPU pipeline
  * Fetch the block from the next level of the hierarchy
  * Instruction cache miss: restart instruction fetch
  * Data cache miss: complete the data access

### Data-write policies
#### Write-Through
* On a data-write hit, we could just update the block in the cache, but then **cache and memory would be inconsistent**
* Write-through: **also update memory**
* But it makes writes take longer
* Solution: **write buffer**
  * Holds data waiting to be written to memory
  * The CPU continues immediately
Only stalls on a write if the write buffer is already full

#### Write-Back
* Alternative: on a data-write hit, just update the block in the cache
* When a **dirty block is replaced**, write it back to memory (the block is written to memory only when it is evicted from the cache)

#### Write Miss
* For write-through: no-write allocation (aka write around): **don't fetch** the block on a write miss (write straight to memory without bringing the block into the cache)
  * Since programs often write a whole block before reading it (e.g., initialization)
* For write-back: usually **fetch** the block

### Memory bandwidth design
* One-word-wide memory organization
Simple; the least hardware needed
* Wide memory organization
Largest bandwidth
* Interleaved memory organization
![](https://i.imgur.com/KEjyWk2.png)

#### Miss Penalty of Different Designs
Assume
* 1 memory bus clock to send the address
* 15 memory bus clocks for each DRAM word access initiated
* 1 memory bus clock to send a DRAM word of data
* A cache block = 4 words

1. A one-word-wide bank of DRAMs
Miss penalty = 1 + 4 x 15 + 4 x 1 = 65 cycles
Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
2. A four-word-wide bank of DRAMs
Miss penalty = 1 + 1 x 15 + 1 = 17 cycles
Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
3. A four-bank, one-word-wide bus of DRAMs
Miss penalty = 1 + 1 x 15 + 4 x 1 = 20 cycles
Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle

### Measuring Cache Performance
CPU time can be divided into two parts
* Program execution cycles
* Memory stall cycles
```
Memory stall cycles = (Memory accesses / Program) x Miss rate x Miss penalty
```

### Average memory access time (AMAT)
AMAT = Hit time + Miss rate × Miss penalty
For example, with a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty: AMAT = 1 + 0.05 × 20 = 2 cycles.

###### tags: `exam review`