# CA Mid 2 (5/2)

<style>
.markdown-body li + li { padding-top: 0 !important; }
.markdown-body table th, .markdown-body table td { padding: 2px 10px; }
</style>

---

:::danger
:page_facing_up: ***With 1 sheet (2 pages) of A4 handwritten notes*** :page_facing_up:
R: 39~59, C: ~150
:::

:::info
**Videos**:
- [03/28](https://www.youtube.com/watch?v=_nilBb7Qudo)
- [04/11](https://www.youtube.com/watch?v=Hck32R8ph9U)
- [04/18](https://www.youtube.com/watch?v=vamvgCKJTSY)
- [04/25](https://www.youtube.com/watch?v=JrabBCuP9fI)
:::

:::info
**Other notes**
- https://hackmd.io/@sysprog/H1sZHv4R?type=view
:::

---

[TOC]

---

## ==TODO==

- Past
- Collect the slides' exercises
- [x] MIPS major inst formats

## ==TODO==: Must Write Down (Big Items)

- Load interlock & forwarding logic
- Scoreboard & Tomasulo & Tomasulo-with-HW-speculation logic

## 3. Pipelining

- **DLX Pipeline**: IF, ID, EX, MEM, WB

### Structural Hazard

- **Structural Hazard**: If some combination of insts cannot be accommodated because of **HW resource conflicts**, the machine is said to have a structural hazard.
    - E.g. only one memory port (**IF <-> MEM**)
    - If all other factors are equal, a processor without structural hazards will have a lower CPI
    - Why structural hazards are sometimes allowed:
        - **⬇ cost of the unit**
            - Supporting both an inst cache access and a data cache access every cycle requires twice the total memory bandwidth
            - A fully pipelined floating-point functional unit would cost much more HW
- A **Stall**: a pipeline bubble

### Data Hazard

- **Data Hazard**: When the pipeline changes the order of R/W accesses to operands so that the order differs from the order seen by sequentially executing insts on an un-pipelined machine.
- **3 Generic Data Hazards**: [Dependencies Notes](#Dependencies)
    - **RAW (Read After Write)**: Inst 2 reads an operand before inst 1 writes it
        - **(True/Data) Dependence**: Results from an actual need for communication.
    - **WAR (Write After Read)**: Inst 2 writes an operand before inst 1 reads it
        - **Anti-Dependence**: Results from the reuse of the name of the operand
        - Can't happen in the DLX 5-stage pipeline (not a complicated, dynamically scheduled pipeline):
            - All insts take 5 stages
            - Read: stage 2 (ID), Write: stage 5 (WB)
    - **WAW (Write After Write)**: Inst 2 writes an operand before inst 1 writes it
        - **Output Dependence**: Results from the reuse of the name of the operand
        - Can't happen in the DLX 5-stage pipeline (not a complicated, dynamically scheduled pipeline):
            - All insts take 5 stages
            - Write: stage 5 (WB)
- **Forwarding** (Bypassing): A HW technique to minimize (not eliminate all) data hazards
    - A result is forwarded from the output of one unit to the input of another
- **Load Interlock**: An inst right after a load cannot get the forwarded value before it is loaded from memory, so a stall may be needed
- **SW Scheduling**: Change the order of insts to reduce the stalls caused by interlocks
- **HW implementation** of the control for the **DLX** pipeline (a C sketch of these checks follows this section):
    - **Detecting interlocks early in the pipeline** reduces the HW complexity:
        - ![](https://i.imgur.com/BbYclu0.png =550x)
        - ***(| Inst 1 | Inst 2 | If true then a load interlock is detected |)***
          ![](https://i.imgur.com/MKwulWK.png =550x)
          ![](https://i.imgur.com/deina7I.png =300x)![](https://i.imgur.com/IlKnYUS.png =300x)
    - Detect the hazard or **forward at the beginning of the clock cycle that uses the operand**:
        - ***(| From p-reg | From inst | To p-reg | To inst | To input | If true then forward |)***
          ![](https://i.imgur.com/QbiQMhH.png)
          ![](https://i.imgur.com/iRXD3pH.png)
- [MIPS Inst Cheatsheet](https://www.kth.se/social/files/563c63c9f276547044e8695f/mips-ref-sheet.pdf)

| Type | Class       | Syntax          |
|:----:|:-----------:|:---------------:|
| R    | Reg-reg ALU | `rd, rs, rt`    |
| I    | ALU imm     | `rt, rs, imm`   |
| I    | Load/store  | `rt, imm(rs)`   |
| I    | Branch      | `rs, rt, label` |

- ==TODO: Exercise==
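A minimal C sketch of the two checks discussed above (load-interlock detection in ID and forwarding selection at the start of EX). The struct and field names are illustrative, not the book's exact tables; `dst` stands for whichever field (rd or rt) the inst actually writes, and register 0 is never forwarded.

```c
#include <stdbool.h>

typedef struct { bool is_load, reg_write; int rs, rt, dst; } PipeReg;

/* Load interlock, checked in ID: the inst in EX is a load whose result
 * is needed by the inst currently in ID -> insert one stall (bubble). */
bool load_interlock(const PipeReg *id_ex, const PipeReg *if_id) {
    return id_ex->is_load &&
           (id_ex->dst == if_id->rs || id_ex->dst == if_id->rt);
}

/* Forwarding to one ALU source (here: rs) at the beginning of EX.
 * Prefer the newer result in EX/MEM; otherwise take MEM/WB. */
typedef enum { FWD_NONE, FWD_FROM_EXMEM, FWD_FROM_MEMWB } Fwd;

Fwd forward_rs(const PipeReg *id_ex, const PipeReg *ex_mem, const PipeReg *mem_wb) {
    if (ex_mem->reg_write && ex_mem->dst != 0 && ex_mem->dst == id_ex->rs)
        return FWD_FROM_EXMEM;
    if (mem_wb->reg_write && mem_wb->dst != 0 && mem_wb->dst == id_ex->rs)
        return FWD_FROM_MEMWB;
    return FWD_NONE;
}
```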
### Control Hazard

- **Control Hazard**: When the pipeline makes a wrong decision on branch prediction and therefore brings insts into the pipeline that must subsequently be discarded
    - Can cause a greater performance loss for the DLX pipeline than data hazards do
- **Branch Stall Impact**:
    - Original arch: **+3 cycles penalty per branch**
    - 2-part solution:
        - **Determine** branch taken or not **sooner**, AND
        - **Compute** the taken-branch address **earlier**
    - **DLX** solution: **+1 cycle penalty per branch**
        - Move the **zero test to the ID** stage, and transmit the result **back to IF**
        - Add an adder to **calculate the new PC in the ID** stage, and transmit it **back to IF**
- **4 Branch Hazard Alternatives**:
    - **Stall** until the branch direction is clear
    - Predict Branch **Not Taken**:
        - Execute successor insts in sequence
        - Squash insts in the pipeline if the branch is actually taken
        - Advantage of the late pipeline state update (the state is not changed until the branch outcome is known)
        - PC + 4 is already calculated, so use it to fetch the next inst
    - Predict Branch **Taken**:
        - **The branch target address has not yet been calculated in DLX**
            - Still incurs a 1-cycle branch penalty
        - Other machines: branch target known before the branch outcome
    - **Delayed Branch**: used in **DLX**
        - Do not waste the stalled cycle (called the **delay slot**)
        - A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline
        - Insts in the slot may come from (**there must be no dependence** problems):
            - **From before** the branch: Improves perf **always**
                - The branch must **not depend on** the rescheduled insts
            - **From the target** (taken path): Improves perf when the branch is **taken**
                - Must be **fine to execute** the rescheduled insts if the branch is not taken
            - **From the fall-through** (untaken path): Improves perf when the branch is **not taken**
                - Must be **fine to execute** the rescheduled insts if the branch is taken
        - **Canceling branches** mechanism: cancel wrongly executed slot insts, allowing more slots to be filled
        - Depends on **compiler** effectiveness
    - **A more powerful prediction scheme is desired for deep pipelines**

### Extending Pipeline

- What makes **pipelining hard to implement**:
    - Dealing with **exceptions**
        - **Stopping and restarting** execution
    - Inst set **complications**
- **Exceptions of the DLX** pipeline:

| Stage | Problem exceptions occurring |
|:-----:|:---------------------------- |
| IF    | Page fault on inst fetch; misaligned memory access; memory-protection violation |
| ID    | Undefined or illegal op-code |
| EX    | Arithmetic exception |
| MEM   | Page fault on data fetch; misaligned memory access; memory-protection violation |
| WB    | None |

- **Latency** (clock cycles): intervening cycles between an inst that produces a result and an inst that uses it
- **Initiation interval** (repeat interval): clock cycles that must elapse between issuing two operations of a given type
- MIPS R4000: extended the DLX pipeline to handle multiple outstanding operations
    - **Deeper pipelines**: more speedup, but stalls from hazards are often more than 1 cycle
        - Need **better branch prediction**
    - **Longer-latency** pipelines:
        - The divide unit is **not fully pipelined**, so **structural hazards** can occur
        - **Varying running times**:
            - The number of reg writes required in one cycle can be > 1
            - WAW hazards are possible (insts do not reach WB in order)
            - Out-of-order completion causes problems with exceptions
        - **Stalls for RAW** will be more frequent
        - WAR is not possible in this case since reg reads always occur in ID

## 4. Advanced Pipelining and ILP

### Pipelining Basics

- **Hazards** limit performance:
    - **Structural**: need more HW
    - **Data**: need forwarding, compiler rescheduling
    - **Control**: early evaluation of the PC, delayed branch, prediction
- **Length** of the pipeline ⬆️ $\Rightarrow$ impact of **hazards** ⬆️
- Pipelining helps inst **bandwidth, not latency**
- Interrupts and FP insts make pipelining harder
- Compilers reduce the cost of data and control hazards (e.g. filling load delay slots)
- Longer pipelines: **more inst parallelism needed**, **need better branch prediction**

### Instruction-Level Parallelism

- **ILP (Instruction-Level Parallelism)**: The potential overlap among insts
    - Overlapping execution of **unrelated (no dependencies)** insts
    - **SW (compile time)**: ⬇ the impact of data and control hazards
    - **HW (run time)**: ⬆️ the ability of the processor to exploit parallelism
- Pipeline CPI (a small worked example follows the table below)
  = Ideal pipeline CPI + Structural stalls + Data stalls + Control stalls
  = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls

| S/H | Technique | Reduces |
|:---:|:----------------------------------------- |:---------------- |
| SW  | Loop Unrolling | Control stalls |
| SW  | Basic Pipeline Scheduling | RAW stalls |
| HW  | Dynamic Scheduling with Scoreboarding | RAW stalls |
| HW  | Dynamic Scheduling with Register Renaming | WAR & WAW stalls |
| HW  | Dynamic Branch Prediction | Control stalls |
| HW  | Issuing multiple insts per cycle | Ideal CPI |
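For intuition, a small worked example of the decomposition above, using **hypothetical, made-up stall rates** (not numbers from the course): assume an ideal pipeline CPI of $1.0$ and average stalls per inst of $0.05$ structural, $0.20$ data, and $0.15$ control. Then

$$\text{Pipeline CPI} = 1.0 + 0.05 + 0.20 + 0.15 = 1.4$$

i.e. the machine runs at $1.4\times$ the ideal CPI; each SW/HW technique in the table targets one of these stall terms.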
### Rescheduling

- **Rescheduling**:
    - **Preserve dependencies** where needed (e.g. by **adjusting offsets**)
    - Fill the **delayed branch slot**
- Static & Dynamic Scheduling:
    - **Static Scheduling**: Compiler techniques for scheduling the insts
    - **Dynamic Scheduling**: HW schemes
        - The compiler does not know the **real dependences** at compile time
        - **Keeps the compiler simpler**
        - Code compiled for one machine **runs well on another**
        - Allows **insts behind a stall to proceed**
        - Enables **out-of-order execution** $\rightarrow$ **out-of-order completion**

### Loop Unrolling

- **Loop Unrolling**: Exploit parallelism among iterations of a loop, ⬆️ the amount of available ILP
    - The most compact code may not have the best performance
    - Trade-off: **⬆️ the code size** to **⬇ the redundant loop control** (branch impact)
    - Straightforward way: **repeat the loop body** (basic block) several times, then do **rescheduling** (see the sketch after this list)
    - Useful if the loop **iterations are independent**
    - Check that it is **legal to move** insts and **adjust offsets**
    - **Use different regs** to avoid unnecessary constraints
    - **⬇ tests and branches**
        - Adjust the **loop termination and iteration** code
    - Check whether **loads & stores** in the unrolled loop are **independent and can be interchanged**
        - **Analyze memory addresses** to check that they do not refer to the same address
- **Compiler** perspective on code movement: **schedule to avoid hazards**
    - Concerned with dependences (**if dependent, insts cannot be executed in parallel**)
    - **Easy to determine for regs** (fixed names), **hard for memory**
- **3 Limits** to Loop Unrolling:
    - ⬇ amount of **overhead** amortized with each extra unrolling (Amdahl's Law)
    - ⬆️ **code size**
    - **Reg pressure**: potential shortfall in regs created by aggressive unrolling and scheduling
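A minimal C sketch of the idea (the loop and array are hypothetical): the unrolled version repeats the body four times, uses separate temporaries (the SW analogue of using different regs), adjusts the offsets, and keeps a fix-up loop for trip counts that are not a multiple of 4.

```c
#include <stddef.h>

/* Original loop: one add per iteration, loop overhead on every element. */
void scale_add(double *x, double s, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled by 4: independent iterations, distinct temporaries, adjusted
 * offsets, loop overhead amortized over four elements. */
void scale_add_unrolled(double *x, double s, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double t0 = x[i]     + s;   /* "renamed" temporaries avoid  */
        double t1 = x[i + 1] + s;   /* reusing a single reg name    */
        double t2 = x[i + 2] + s;
        double t3 = x[i + 3] + s;
        x[i]     = t0;
        x[i + 1] = t1;
        x[i + 2] = t2;
        x[i + 3] = t3;
    }
    for (; i < n; i++)              /* fix-up for leftover iterations */
        x[i] = x[i] + s;
}
```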
### Dependencies

- **Data/True dependence**: Inst 1 produces a result used by inst 2
    - **Passes down**:
        - Inst 3 is data dependent on inst 2 & inst 2 is data dependent on inst 1 $\Rightarrow$ Inst 3 is data dependent on inst 1
    - **Importance**:
        - Indicates the **possibility of a hazard**
        - Determines the **order** in which results must be calculated
        - Sets an **upper bound** on how much **parallelism** can possibly be exploited
    - **Overcoming it**:
        - Maintain the dependence but avoid a hazard
        - Eliminate the dependence by **transforming** the code
- **Name dependence**: 2 insts use the same name (reg or mem location) without exchanging data
    - **Anti-dependence**: Inst 2 writes a reg or mem loc that inst 1 reads from
    - **Output dependence**: Inst 1 & inst 2 write the same reg or mem loc
- **Control dependence**: Statements are control dependent on branches
    - **Constraints**:
        - An inst that is control dependent on a branch cannot be moved before the branch
            - Otherwise its execution would no longer be controlled by the branch
        - An inst that is not control dependent on a branch cannot be moved after the branch
            - Otherwise its execution would become controlled by the branch
    - **Two properties critical to program correctness**, normally preserved by control dependences:
        - **Exception behavior**: e.g. memory-protection exception
        - **Data flow**: A branch makes the data flow **dynamic**
    - Violating a control dependence may not affect the data flow
        - **Liveness**: whether a value will be used by an upcoming inst
        - **Compiler-based Speculation**: safely violating control dependences
- **Loop-carried dependence**: there are dependences between loop iterations (see the C sketch after this section)
    - **Recurrence**:
        - **Dependence distance**: dependent on the iteration some # of iterations away
            - ⬆️ distance, ⬆️ potential parallelism that can be obtained by unrolling the loop
    - **Circular or not**: e.g. if S1 depends on S2 (loop-carried) but S2 does not depend on S1, the dependence is not circular
    - **A non-circular loop can be made parallel (by loop unrolling)**:
        - Make the dependences not loop-carried by **shifting statements** across iterations
        - Iterations of the loop may then be overlapped, provided **the statements in each iteration are kept in order**
- ==TODO: Exercise==
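A small C illustration of the distinction above (hypothetical arrays): the first loop's iterations are independent, so they can be unrolled and overlapped; the second carries a dependence from one iteration to the next, which limits the parallelism.

```c
#include <stddef.h>

/* No loop-carried dependence: each iteration touches only its own element,
 * so iterations can be unrolled/overlapped freely. */
void independent(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + 1.0;
}

/* Loop-carried dependence (a recurrence with distance 1): iteration i
 * needs the value produced by iteration i-1. */
void recurrence(double *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + 1.0;
}
```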
### Issuing Multiple Instructions per Cycle

- Getting **CPI < 1**
- **SS (SuperScalar)**:
    - **Varying** number of insts per cycle (1~8)
    - Scheduled by **compiler or HW (Tomasulo)**
    - E.g. Superscalar DLX: 2 insts per cycle, 1 FP & 1 anything else
        - A 1-cycle load delay now covers **3 insts** in the superscalar (SS) pipeline
        - Needs additional reg ports
        - Needs pipelined or multiple FP units
- **VLIW (Very Long Instruction Word)**:
    - **Fixed** number of insts per cycle (4~16)
    - Scheduled by the **compiler**
    - Put operations into wide **templates**
        - Each "inst" has explicit coding for multiple operations
        - **Trade off inst space for simple decoding**
    - All operations the compiler puts in the long inst word are **independent** $\Rightarrow$ **execute in parallel**
    - Needs **compiler techniques** beyond the basic block (trace scheduling)
- **Limitations** of Multiple-Issue Processors:
    - Inherent limitations of available ILP in programs:
        - To **fill the issue slots** and **keep the functional units busy**
        - **Functional units** that are **pipelined** or have **multi-cycle latency** require a larger number of operations that can be executed in parallel
        - \# of independent operations needed $\approx$ the average **pipeline depth** $\times$ \# of **functional units**
    - Difficulties in building the underlying HW:
        - **HW cost**:
            - Multiple FP and integer **functional units**
            - ⬆️⬆️ **memory bandwidth** & **register-file bandwidth**

## 5. Scoreboard

### Scoreboard Introduction

- **Scoreboard**: A bookkeeping technique
    - A **dynamic scheduling** model
    - A **centralized** model
- Out-of-order execution divides the ID stage:
    1. **Issue**: Decode insts, check for structural hazards
    2. **Read operands**: Wait until no data hazards, then read operands
- **Insts execute whenever they are not dependent on previous insts and there are no hazards**
- E.g. CDC 6600 in 1963:
    - **In-order issue**
    - **Out-of-order execution**
    - **Out-of-order commit** (or completion)
    - Solution for **WAR hazards**:
        - **Stall** writeback until the regs have been read
        - Read regs only during the **Read Operands stage**
    - Solution for **WAW hazards**:
        - Detect the hazard and **stall** issue of the new inst until the other inst completes
    - **No forwarding!**
    - **No reg renaming!**
    - **Multiple or pipelined execution units**
    - **Keep track of dependencies** between insts already issued
    - **Replace ID, EX, WB with 4 stages**

### CDC 6600 Scoreboard Architecture

- **4 stages** of Scoreboard control:
    1. **Issue (ID1)**: Decode insts, check for structural hazards
        - Issue **in program order**
        - Do **not** issue (stall) on a **structural hazard**
        - Do **not** issue (stall) on a **WAW hazard**:
            - **Output dependent** on any previously issued but uncompleted inst
    2. **Read operands (ID2)**: Wait until no data hazards, then read operands
        - Resolves **RAW hazards**: wait for producing insts to write back their data
        - **No forwarding!**
    3. **Execution (EX)**: Operate on the operands
        - The functional unit begins execution upon receiving operands
        - When the result is **ready**, it **notifies the scoreboard** that it has completed execution
    4. **Write result (WB)**: Finish execution
        - **Stall** if there are **WAR hazards** with previous insts
- **3 parts** of the Scoreboard (see the C sketch after this section):
    - **Instruction status**: Which of the 4 steps the inst is in
    - **Functional unit status**: Indicates the state of the functional unit (FU), **9 fields** per FU
        - **Busy**: Whether the unit is busy or not
        - **Op**: Operation to perform in the unit (e.g. + or -)
        - **Fi**: Destination reg number
        - **Fj, Fk**: Source reg numbers
        - **Qj, Qk**: FUs producing source regs Fj, Fk (blank if Fj, Fk are ready)
        - **Rj, Rk**: Whether Fj, Fk are ready
    - **Register result status**: Which FU will write each reg, if one exists
        - Blank when no pending inst will write that reg
- Detailed Scoreboard **Pipeline Control**
  ![](https://i.imgur.com/F4RFRKM.png =500x)
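A minimal C sketch of the bookkeeping structures and the issue-stage check (stall on a structural hazard or a WAW hazard). The field names mirror the 9 FU-status fields listed above; everything else (sizes, the `-1` sentinel, the helper name) is illustrative.

```c
#include <stdbool.h>

#define NUM_REGS 32

typedef struct {            /* one functional-unit status entry */
    bool busy;              /* Busy */
    int  op;                /* Op */
    int  fi, fj, fk;        /* destination reg, source regs */
    int  qj, qk;            /* FUs producing Fj, Fk (-1 if ready) */
    bool rj, rk;            /* Fj, Fk ready flags */
} FUStatus;

/* Register result status: which FU will write each reg (-1 = none).
 * Initialize every entry to -1 before use. */
int reg_result[NUM_REGS];

/* Issue check: stall on a structural hazard (FU busy) or a WAW hazard
 * (some FU is already going to write the same destination reg). */
bool can_issue(const FUStatus *fu, int dest_reg) {
    if (fu->busy)                      /* structural hazard */
        return false;
    if (reg_result[dest_reg] != -1)    /* WAW hazard */
        return false;
    return true;
}
```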
### Scoreboard Summary

- **Limitations** of the CDC 6600 Scoreboard:
    - No forwarding
    - **\# of scoreboard entries**
    - **\# of functional units**, especially integer/load-store units
    - Does not issue on structural hazards
    - Waits on WAR hazards
    - Prevents WAW hazards by stalling issue
- Increasingly powerful (and complex) **dynamic mechanisms** for detecting and resolving hazards:
    - In-order pipeline, in-order op-fetch with reg reservations, in-order issue with a **scoreboard**
    - Out-of-order execution
    - **Facilitate multiple issue**
- **Compiler techniques** make it easier for HW to find the ILP:
    - **⬇ the impact of more sophisticated organizations**
    - Require a **larger architected namespace** (regs)
    - **Easier for more structured code**

## 6. Tomasulo

### Tomasulo Introduction

- **Tomasulo**: Dynamically scheduled instruction processing
    - A **dynamic scheduling** model
    - **In-order issue**
    - **Out-of-order execution**
    - **Out-of-order commit** (or completion)
- **Registers**:
    - **Short, absolute name** for a recently computed (or frequently used) value
    - **Fast, high-bandwidth storage** in the data path
    - Means of **broadcasting a computed value** to the set of insts that use the value
        - A series of insts is issued and waiting for a value to be produced by a logically preceding inst
        - CDC 6600 has each of them come back and read the value once it is placed in the reg file
        - Alternative: broadcast the value and reg \# to all the waiting insts
- **Tomasulo vs. Scoreboard**:
    - Control & buffers are **distributed** with the FUs vs. centralized in the scoreboard
        - The FU buffers are called **reservation stations (RS)**; they hold pending operands
    - Regs in insts are replaced by values or pointers to RSs; this is called **register renaming**
        - **Avoids WAR, WAW hazards**
        - More RSs than regs, so it can **do optimizations compilers can't**
    - Results go to FUs from RSs, **not through regs**, but over a **Common Data Bus** that broadcasts to all FUs
    - Loads and stores are treated as FUs with RSs as well
- Tomasulo's scheme has been **widely adopted in multiple-issue processors** since the 1990s:
    - Achieves high performance **without requiring the compiler** to target code to **a specific pipeline structure**
    - With **caches** present, delays are unpredictable (misses). Out-of-order execution allows the processor to continue executing insts while awaiting the completion of a cache miss (**hiding the cache miss penalty**)
    - Processors becoming **more aggressive in issue capability** demand reg renaming and dynamic scheduling
    - Dynamic scheduling is a key component of speculation; it was **adopted along with HW speculation** in the mid-1990s

### Tomasulo Architecture

- **Load buffer**:
    - Holds the components of the **effective address** until it is computed
    - Tracks **outstanding loads** that are **waiting on the memory**
    - Holds the **results** of completed loads
- **Store buffer**:
    - Holds the components of the **effective address** until it is computed
    - Holds the **destination memory addresses** of **outstanding stores** that are **waiting for the data value** to store
    - Holds the **address** and **value** to store **until the memory unit is available**
- **3 stages** of Tomasulo's Algorithm:
    1. **Issue (ID)**: get an inst from the FP op queue
        - **FIFO** order to ensure correct data flow
        - If an **RS is free (no structural hazard)**, control issues the inst & **sends the operands** (**renames regs**)
        - If no RS is empty: **structural hazard, stall** until a station or buffer is freed
        - If operands are not in the regs, **keep track of the FU that will produce the operands**
        - **Eliminates WAR & WAW** hazards
    2. **Execution (EX)**: operate on the operands
        - If **both operands are ready**, then execute
        - If not ready, watch the **Common Data Bus** for the result (**resolves RAW**)
        - Several insts could become ready **in the same clock cycle**
            - **Independent FUs** can begin execution in the same clock cycle for different insts
            - If more than one inst is waiting for **a single FU**, the unit chooses among them
    3. **Write result (WB)**: finish execution
        - Write on the Common Data Bus to all awaiting units, mark the RS available
        - Normal data bus: data + destination (goes **to** the bus)
        - **Common Data Bus**: data + **source** (**comes from** the bus)
            - 64 bits of data + 4 bits of FU **source** address
            - Write if it matches the expected FU (the one producing the result)
            - Does the broadcast
- **Reservation Station (RS)**: 7 fields (see the C sketch after this section)
    - **Op**: Operation to perform in the unit
    - **Vj, Vk**: **Values** of the source operands
        - Store buffers have a V field for the result to be stored
    - **Qj, Qk**: RSs producing the source regs (source regs to be written by other insts)
        - Qj, Qk = 0 indicates the source is already in Vj, Vk
    - **Busy**: the RS or FU is busy
    - **A**: Holds information for the **memory address calculation of a load or store**
        - Initially: the immediate field of the inst
        - After address calculation: the effective address
- **Register result status**: one field, **Qi**
    - \# of the RS (which FU) that will write the reg, if one exists
    - Blank when no pending inst will write that reg
- **Load and store buffers** each have a field **A**
    - Holds the effective address once the first step of execution has been completed
- Detailed **Tomasulo's Algorithm**:
  ![](https://i.imgur.com/keVnldV.png =550x)
  ![](https://i.imgur.com/3E4lzea.png =550x)
  ![](https://i.imgur.com/5NNfTXf.png =551x)
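A minimal C sketch of an RS entry with the 7 fields above and of the Write-result broadcast on the CDB (every waiting RS compares the broadcast tag against its Qj/Qk). Sizes, the tag encoding, and the register-file/Qi update (omitted here) are illustrative.

```c
#include <stdbool.h>

#define NUM_RS 8

typedef struct {            /* one reservation station entry */
    bool   busy;            /* Busy */
    int    op;              /* Op */
    double vj, vk;          /* operand values, valid when qj/qk == 0 */
    int    qj, qk;          /* tags of the RSs producing the operands (0 = ready) */
    long   a;               /* immediate / effective address for a load-store */
} RS;

RS rs[NUM_RS + 1];          /* index 0 unused so that tag 0 means "ready" */

/* Write-result step: broadcast (tag, value) on the CDB; every RS waiting
 * on that tag captures the value and marks the operand as ready. */
void cdb_broadcast(int tag, double value) {
    for (int i = 1; i <= NUM_RS; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].qj == tag) { rs[i].vj = value; rs[i].qj = 0; }
        if (rs[i].qk == tag) { rs[i].vk = value; rs[i].qk = 0; }
    }
    rs[tag].busy = false;   /* the producing RS becomes available again */
}
```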
### Hardware-based Speculation

- **Speculation**: fetch, issue, and execute insts as if the branch predictions were always correct
- Key ideas:
    - **Dynamic branch prediction**
    - Allow execution of insts **before the control dependencies are resolved**
    - **Dynamic scheduling of different combinations of basic blocks**
- Perform updates that cannot be undone only when the inst is **no longer speculative**
    - Update the reg file or memory only at **inst commit**
    - **Out-of-order execution** but **in-order commit**
    - **Separate execution completion** from **inst commit**
- Requires an additional HW buffer (the **Reorder Buffer, ROB**)
    - Holds the results of insts between completion and commit
    - **Finished execution but have not committed**
- **Reorder Buffer (ROB)**: 4 fields
    - **Instruction type**: Branch / Store / Reg operation
    - **Destination field**: Reg \# (or memory address for a store)
    - **Value field**: Output value
    - **Ready field**: Whether the inst has completed execution
- Tomasulo's algorithm with & without speculation:
    - Without speculation:
        - **Only partially overlaps basic blocks**, because it requires that a branch be resolved before actually executing any insts in the successor basic block
        - Once an inst writes its result, **any subsequently issued insts will find the result in the reg file**
    - With speculation:
        - **Dynamic scheduling of different combinations of basic blocks**
        - **The reg file is not updated until the inst commits**
        - **Integrates the function of the store buffer into the ROB**
- **Tomasulo's algorithm with speculation** (see the C sketch after this list):
    1. **Issue**:
        - Get the inst from the inst queue
        - Issue if there is an **empty RS and an empty slot in the ROB** (**stall if there is no empty entry**)
        - Send the operands to the RS if they are available
        - Update the control entries
            - The \# of the ROB entry allocated for the result is also sent to the RS
    2. **Execution**:
        - If one or more **operands is not yet available, monitor the CDB**
        - When both operands are available at an RS, execute the operation (**resolves RAW**)
    3. **Write result**:
        - When the result is available, **write it on the CDB, to any RSs waiting for this result, and from the CDB to the ROB**
    4. **Commit**:
        - **Branch with an incorrect prediction**: the speculation was wrong, so the **ROB is flushed** and execution is restarted at the correct successor of the branch
            - **If the branch was correctly predicted, the branch is finished**
        - **Store**: when the inst reaches the head of the ROB, **update the memory** with the result and remove the inst from the ROB
        - **Any other inst**: when the inst reaches the head of the ROB, **update the reg** with the result and remove the inst from the ROB
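A minimal C sketch of the in-order commit step at the head of the ROB, following the three cases above. The entry layout mirrors the 4 ROB fields; the array sizes, word-indexed memory, and `mispredicted` flag are illustrative assumptions.

```c
#include <stdbool.h>

typedef enum { ROB_BRANCH, ROB_STORE, ROB_REG } RobType;

typedef struct {            /* one ROB entry: the 4 fields above */
    RobType type;           /* instruction type */
    int     dest;           /* reg # (or word-indexed memory address) */
    long    value;          /* result value */
    bool    ready;          /* finished execution? */
    bool    mispredicted;   /* for branches only */
} RobEntry;

long regfile[32];
long memory[1024];

/* Commit the entry at the head of the ROB, in program order.
 * Does nothing if the head has not finished executing; returns true
 * only when the ROB must be flushed (branch mispredict). */
bool commit(const RobEntry *head) {
    if (!head->ready)
        return false;
    switch (head->type) {
    case ROB_BRANCH:
        if (head->mispredicted)       /* wrong speculation: flush the ROB, */
            return true;              /* restart at the correct successor  */
        break;                        /* correct prediction: just retire   */
    case ROB_STORE:
        memory[head->dest] = head->value;   /* memory updated only at commit */
        break;
    case ROB_REG:
        regfile[head->dest] = head->value;  /* reg file updated only at commit */
        break;
    }
    return false;
}
```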
## 7. Branch Prediction

### Branch Prediction Intro

- **Branch Prediction**: As we try to exploit more ILP, the accuracy of branch prediction becomes critical
    - Use profile information collected from **earlier runs**, and modify the prediction based on the **last run**
- **Dynamic Branch Prediction**:
    - Perf = f(accuracy, cost of mis-prediction, frequency of branches)
    - Why prediction works:
        - The underlying algorithm has regularities
        - The data being operated on has regularities
        - The inst sequence has redundancies that are artifacts of the way humans/compilers think about problems
    - Is dynamic branch prediction better than static branch prediction?
        - It seems to be
        - There is a small number of important branches in programs which have dynamic behavior

### Branch History Table

- **Branch History Table (BHT)** (**Branch-Prediction Buffer**):
    - A memory indexed by the lower bits of the PC address
    - Each entry contains 1 bit: the **1-bit prediction scheme**
        - Says whether or not the branch was taken last time
    - **No tag & no address check**: saves HW, but we don't know if it is the right branch (the entry may have been put there by another branch that has the same low-order address bits)
- **2-bit prediction schemes** (see the C sketch at the end of these notes)
  ![](https://i.imgur.com/vEqOrBP.png =450x)
    - In a loop, 1-bit branch prediction causes 2 mis-predictions:
        - Entering the loop
        - At the end of the loop
    - To remedy this weakness, **2-bit prediction schemes** are often used
        - A prediction must miss twice before it is changed
        - Uses a small amount of (hopefully) local information to predict behavior
    - Implementations:
        - Alternative 1: a small, special "cache" accessed with the **inst address** during the IF stage
        - Alternative 2: a pair of bits attached to **each block in the inst cache** and fetched with the inst
    - Mis-prediction cases:
        - Wrong guess for this branch
        - Got the branch history of the wrong branch when indexing the table
    - 4K entries are about as good as an infinite table for the older version of a subset of the SPEC benchmarks. **For the newer version, about 8K entries would be needed to match an infinite 2-bit predictor**
- **Correlating Predictors**: a 1-bit predictor plus **1 bit of correlation**; which prediction is used depends on whether the preceding branch was taken

    | Pred bits | Pred if last NT | Pred if last T |
    |:---------:|:---------------:|:--------------:|
    | NT/NT     | NT              | NT             |
    | NT/T      | NT              | T              |
    | T/NT      | T               | NT             |
    | T/T       | T               | T              |

    - Sometimes **behavior is correlated**, and we can do better by keeping track of the direction of related branches
    - Predictors that use the behavior of other branches to make a prediction are called **correlating predictors** or **2-level predictors**
        - Hypothesis: recent branches are correlated
        - Idea: **record the m most recently executed branches** as taken or not taken, and use that pattern to select the proper branch history table
    - In general, an **$(m,n)$ predictor** records the **last $m$ branches** to select among **$2^m$ history tables**, each with **$n$-bit counters**
        - e.g. a 2-bit BHT is a $(0,2)$ predictor
        - e.g. a $(2,2)$ predictor:
            - $m = 2,\ n = 2$
            - \# of history tables: $2^m = 4$
            - $2$ bits per branch-predictor entry
            - If $4$ low-order bits of the branch address are used to index: $2^4 = 16$ entries per table
            - Total size: $2^4 \times 2 \times 2^2 = 128$ bits
- **Branch Target Buffer (BTB)**: the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
  ![](https://i.imgur.com/QxJ1hbF.png =500x)
    - Must check for a branch match now, since we can't use the wrong branch's target address
      ![](https://i.imgur.com/wLWw6KE.png =500x)

### Dynamic Branch Prediction Summary

- Prediction is important
- **BHT**: 2 bits for loop accuracy
- **Correlation**: Recently executed branches are correlated with the next branch
    - Either different branches or different executions of the same branch
- **Tournament Predictor**: spend more resources on competing predictors and pick between them
- **BTB**: Includes the branch address & prediction
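A minimal C sketch of the 2-bit scheme discussed above: a tag-less table indexed by low-order PC bits, where each 2-bit saturating counter must miss twice before its prediction flips. The table size and index shift are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_BITS 12                       /* 4K entries, indexed by low PC bits */
#define BHT_SIZE (1u << BHT_BITS)

/* 2-bit saturating counters: 0,1 = predict not taken; 2,3 = predict taken. */
static uint8_t bht[BHT_SIZE];             /* no tags: aliasing between branches is possible */

static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_SIZE - 1);    /* drop the byte-offset bits of a word-aligned inst */
}

bool predict(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;
}

void update(uint32_t pc, bool taken) {
    uint8_t *ctr = &bht[bht_index(pc)];
    if (taken) { if (*ctr < 3) (*ctr)++; }    /* saturate at 3 (strongly taken)     */
    else       { if (*ctr > 0) (*ctr)--; }    /* saturate at 0 (strongly not taken) */
}
```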