# CA Mid 2 (5/2)
<style>
.markdown-body li + li
{
padding-top: 0 !important;
}
.markdown-body table th, .markdown-body table td
{
padding: 2px 10px;
}
</style>
---
:::danger
:page_facing_up: ***With 1 sheet (2 pages) of A4 handwritten notes*** :page_facing_up:
R: 39~59, C: ~150
:::
:::info
**Videos**:
- [03/28](https://www.youtube.com/watch?v=_nilBb7Qudo)
- [04/11](https://www.youtube.com/watch?v=Hck32R8ph9U)
- [04/18](https://www.youtube.com/watch?v=vamvgCKJTSY)
- [04/25](https://www.youtube.com/watch?v=JrabBCuP9fI)
:::
:::info
**Other notes**
- https://hackmd.io/@sysprog/H1sZHv4R?type=view
:::
---
[TOC]
---
## ==TODO==
- Past
- Collect Slide's exercises
- [x] MIPS major insts formats
## ==TODO==: Big tables to write down
- Load interlocks & forwarding logic
- Scoreboard & Tomasulo & Tomasulo with HW speculation logic
## 3. Pipelining
- **DLX Pipeline**: IF, ID, EX, MEM, WB
### Structural Hazard
- **Structural Hazard**: If some combination of insts cannot be accommodated because of **HW resource conflicts**, the machine is said to have a structural hazard.
- E.g. Only one memory port (**IF <-> MEM**)
- If all other factors are equal, a processor without structural hazards will have a lower CPI
- Why structural hazards are allowed (instead of being designed away):
    - **⬇ the cost of the unit**
        - Supporting both an inst-cache and a data-cache access every cycle requires twice the total memory bandwidth
        - Fully pipelining or duplicating the functional units that handle floating-point insts is costly
- A **Stall**: A pipeline bubble
### Data Hazard
- **Data Hazard**: When the pipeline changes the order of R/W accesses to operands so that the order differs from the order seen by sequentially executing insts on an un-pipelined machine.
- **3 Generic Data Hazards**: [Dependencies Notes](#Dependencies)
    - **RAW (Read After Write)**: Inst 2 reads an operand before inst 1 writes it
        - **(True/Data) Dependence**: Results from an actual need for communication
    - **WAR (Write After Read)**: Inst 2 writes an operand before inst 1 reads it
        - **Anti-Dependence**: Results from the reuse of the name of the operand
        - Can't happen in the DLX 5-stage pipeline (only in more complicated pipelines, e.g. with dynamic scheduling):
            - All insts take 5 stages
            - Reads happen in stage 2 (ID), writes in stage 5 (WB)
    - **WAW (Write After Write)**: Inst 2 writes an operand before inst 1 writes it
        - **Output Dependence**: Results from the reuse of the name of the operand
        - Can't happen in the DLX 5-stage pipeline (only in more complicated pipelines, e.g. with dynamic scheduling):
            - All insts take 5 stages
            - Writes happen only in stage 5 (WB)
- **Forwarding** (Bypassing): A HW technique to minimize (not all) Data Hazards
- A result is forwarded from the output of one unit to the input of another
- **Load Interlock**: Insts after load inst cannot immediately get forwarded value before it is loaded from memory, therefore there may be stalls
- **SW Scheduling**: Change the order of insts to reduce stalls by interlocks
- **HW implementation** of the control for the **DLX** pipeline (a sketch of this detection logic follows at the end of this subsection):
    - **Detecting interlocks early in the pipeline** reduces the HW complexity:
        - Load-interlock check (table in the slides): ***(| Inst 1 | Inst 2 | If true then a load interlock is detected |)***
    - Detect the hazard or **forwarding at the beginning of the clock cycle that uses an operand**:
        - Forwarding check (table in the slides): ***(| From p-reg | From inst | To p-reg | To inst | To input | If true then forward |)***
- [MIPS Inst Cheatsheet](https://www.kth.se/social/files/563c63c9f276547044e8695f/mips-ref-sheet.pdf)
| Type | Class | Syntax |
|:----:|:-----------:|:---------------:|
| R | Reg-reg ALU | `rd, rs, rt` |
| I | ALU imm | `rt, rs, imm` |
| I | Load/store | `rt, imm(rs)` |
| I | Branch | `rs, rt, label` |
- ==TODO: Exercise==
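
The two checks described above can be written compactly. This is a minimal C sketch with assumed pipeline-register fields (not the exact tables from the slides):

```c
#include <stdbool.h>

/* Opcode classes; names and fields are illustrative, not the slide encoding. */
enum { OP_NOP, OP_LOAD, OP_STORE, OP_ALU, OP_BRANCH };

typedef struct {
    int  op;          /* opcode class (see enum above)                        */
    int  rs, rt;      /* source register specifiers decoded from the inst     */
    bool writes_reg;  /* does this inst write a register?                     */
    int  dest;        /* destination register, valid only if writes_reg       */
} Inst;

/* Load interlock, checked in ID: the inst in EX is a load whose destination
 * matches a source of the inst in ID, so forwarding cannot help -> stall.    */
bool load_interlock(const Inst *in_ex, const Inst *in_id)
{
    return in_ex->op == OP_LOAD &&
           (in_ex->dest == in_id->rs || in_ex->dest == in_id->rt);
}

/* Forwarding, checked at the beginning of the cycle that uses the operand:
 * take the ALU input from the EX/MEM or MEM/WB pipeline register when the
 * older inst writes the register this inst reads (reg 0 is never forwarded). */
typedef enum { FWD_NONE, FWD_FROM_EXMEM, FWD_FROM_MEMWB } Fwd;

Fwd forward_src(const Inst *in_exmem, const Inst *in_memwb, int src_reg)
{
    if (src_reg == 0)
        return FWD_NONE;
    if (in_exmem->writes_reg && in_exmem->dest == src_reg)
        return FWD_FROM_EXMEM;               /* the most recent result wins   */
    if (in_memwb->writes_reg && in_memwb->dest == src_reg)
        return FWD_FROM_MEMWB;
    return FWD_NONE;
}
```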
### Control Hazard
- **Control Hazard**: When the pipeline makes wrong decisions on branch prediction and therefore brings insts into the pipeline that must subsequently be discarded
- Can cause a greater performance loss for DLX pipeline than do data hazards
- **Branch Stall Impact** (a worked speedup example follows at the end of this subsection):
    - Original arch: **+3 cycles penalty for branch**
- 2 parts solution:
- **Determine** branch taken or not **sooner**, AND
- **Compute** taken branch address **earlier**
- **DLX** Solution: **+1 cycle penalty for branch**
- Move **Zero test to ID** stage, and transmit **back to IF**
- Adder to **calculate new PC in ID** stage, and transmit **back to IF**
- **4 Branch Hazard Alternatives**:
    - **Stall** until the branch direction is clear
    - Predict Branch **Not Taken**:
        - Execute successor insts in sequence
        - Squash insts in the pipeline if the branch is actually taken
        - Advantage of late pipeline state update
        - PC + 4 is already calculated, so use it to fetch the next inst
    - Predict Branch **Taken**:
        - **In DLX the branch target address is not known any earlier than the branch outcome**
        - So it still incurs the 1-cycle branch penalty
        - Useful on machines where the branch target is known before the branch outcome
    - **Delayed Branch** (used by **DLX**):
        - Do not waste the stalled cycles (called **delay slots**)
        - A 1-slot delay is enough to make a proper decision and compute the branch target address in the 5-stage pipeline
        - Insts to fill the slot come from (**must have no harmful dependence**):
            - **From before** the branch: improves perf **always**
                - The branch must **not depend on** the rescheduled insts
            - **From the target** (taken path): improves perf when the branch is **taken**
                - Must be **OK to execute** the rescheduled insts if the branch is not taken
            - **From the fall-through** (untaken path): improves perf when the branch is **not taken**
                - Must be **OK to execute** the rescheduled insts if the branch is taken
        - **Canceling branch** mechanism: squash the wrong-path slot insts, allowing more slots to be filled
        - Effectiveness depends on how well the **compiler** can fill the slots
- **More powerful prediction scheme is desired for deep pipelines**
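
A worked example of the branch-stall impact (the speedup expression is the standard pipeline-speedup formula; the 20% branch frequency is an assumed number, not from the slides):

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

With a 5-stage pipeline and 20% branches: $\dfrac{5}{1 + 0.2 \times 3} \approx 3.1$ for the original 3-cycle penalty, vs. $\dfrac{5}{1 + 0.2 \times 1} \approx 4.2$ once the zero test and the target adder are moved into ID.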
### Extending Pipeline
- What makes **pipelining hard to implement**:
- Dealing with **exceptions**
- **Stopping and restarting** execution
- Inst set **complication**
- **Exceptions of DLX** pipeline:
| Stage | Problem exceptions occurring |
|:-----:|:---------------------------- |
| IF | Page fault on inst fetch; misaligned memory access; memory-protection violation |
| ID | Undefined or illegal opcode |
| EX | Arithmetic exception |
| MEM | Page fault on data fetch; misaligned memory access; memory-protection violation |
| WB | None |
- **Latency** (in clock cycles): the number of intervening cycles between an inst that produces a result and an inst that uses it
- **Initiation interval** (repeat interval): the number of clock cycles that must elapse between issuing two operations of a given type (see the example at the end of this subsection)
- MIPS R4000: Extended the DLX pipeline to handle multiple operations
    - **Deeper pipelines**: more speedup, but hazard stalls are often more than 1 cycle
- Need **better Branch prediction**
    - **Longer latency** pipelines (FP operations):
        - The divide unit is **not fully pipelined**, so **structural hazards** can occur
        - **Varying running times**:
            - The number of reg writes required in one cycle can be > 1
            - WAW hazards are possible (insts do not reach WB in order)
            - Out-of-order completion causes problems with exceptions
        - **Stalls for RAW** hazards will be more frequent
        - WAR hazards are not possible here, since reg reads always occur in ID
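
A quick illustration with assumed numbers (typical textbook values for a DLX-style FP pipeline, not from the slides): suppose an FP multiply has latency 6 and initiation interval 1, while the non-pipelined divide unit has initiation interval 25. Straight from the definitions above:

$$\text{RAW stall cycles} = \max(0,\ \text{latency} - \text{intervening cycles}) = \max(0,\ 6 - 0) = 6$$

$$\text{structural stall between back-to-back divides} = \text{initiation interval} - 1 = 25 - 1 = 24$$

so a dependent inst issued immediately after the multiply stalls 6 cycles, and a second divide issued on the very next cycle stalls 24 cycles.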
## 4. Advanced Pipeling and ILP
### Pipelining Basics
- **Hazards**: limit performance
- **Structural**: need more HW
- **Data**: need forwarding, compiler rescheduling
- **Control**: early evaluation PC, delayed branch, prediction
- **Length** of pipelining ⬆️ Impact of **Hazards** ⬆️
- Pipelining helps inst **bandwidth (throughput), not latency**
- Interrupts, FP Insts make pipelining harder
- Compilers reduce cost of data and control hazards (e.g. load delay slots)
- Longer pipelines: **More inst parallelism**, **needs better branch prediction**
### Instruction-Level Parallelism
- **ILP (Instruction-Level Parallelism)**: The potential overlap among insts
- Overlapping execution of **unrelated (no dependencies)** insts
- **SW (compile time)**: ⬇ the impact of data and control hazards
- **HW (run time)**: ⬆️ the ability of the processor to exploit parallelism
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data stalls + Control stalls (a numeric example follows the table below)
    - Data stalls = RAW stalls + WAR stalls + WAW stalls
| S/H | Technique | Reduces |
|:---:|:----------------------------------------- |:---------------- |
| SW | Loop Unrolling | Control stalls |
| SW | Basic Pipeline Scheduling | RAW stalls |
| HW | Dynamic Scheduling with Scoreboarding | RAW stalls |
| HW | Dynamic Scheduling with Register Renaming | WAR & WAW stalls |
| HW | Dynamic Branch Prediction | Control stalls |
| HW | Issuing multiple insts per cycle | Ideal CPI |
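
A small numeric illustration (the per-inst stall counts are assumed, not measured):

$$\text{Pipeline CPI} = \underbrace{1}_{\text{ideal}} + \underbrace{0.05}_{\text{structural}} + \underbrace{0.30}_{\text{data}} + \underbrace{0.15}_{\text{control}} = 1.5$$

Eliminating only the data-hazard stalls (e.g. forwarding plus scheduling) would give $\text{CPI} = 1.2$, a $1.5 / 1.2 = 1.25\times$ speedup.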
### Rescheduling
- **Rescheduling**:
    - **Preserve dependencies** as needed (e.g. by **adjusting offsets**)
    - Fill the **delayed branch slot**
- Static & Dynamic Scheduling:
    - **Static Scheduling**: compiler techniques for scheduling the insts
    - **Dynamic Scheduling**: HW schemes
        - Handles dependences that are **not known at compile time**
        - **Keeps the compiler simpler**
        - Code compiled for one machine **runs well on another**
        - Allows **insts behind a stall to proceed**
        - Enables **out-of-order execution** $\rightarrow$ **out-of-order completion**
### Loop Unrolling
- **Loop Unrolling**: exploit parallelism among iterations of a loop, ⬆️ the amount of available ILP (a C sketch follows at the end of this subsection)
    - The most compact code may not have the best performance
    - Trade-off: **⬆️ the code size** to **⬇ the redundant loop-control work** (branch impact)
    - Straightforward way: **repeat the loop body** (basic block)
    - Then do **rescheduling**:
        - Useful if the loop **iterations are independent**
        - Check that it is **legal to move** insts, and **adjust offsets**
        - **Use different regs** to avoid unnecessary constraints
        - **⬇ tests and branches**
            - Adjust the **loop termination and iteration** code
        - Check whether **loads & stores** in the unrolled loop are **independent and can be interchanged**
            - **Analyze memory addresses** to check that they do not refer to the same address
- **Compiler** perspective on Code Movement: **Schedule to avoid hazards**
- Concerns about dependencies (**If dependent, they cannot be executed in parallel**)
- **Easy to determine for regs** (fixed names), **hard for memory**
- **3 Limits** to Loop Unrolling:
- ⬇ amount of **overhead** amortized with each extra unrolling (Amdahl's Law)
- ⬆️ **code size**
- **Reg pressure**: potential shortfall in regs created by aggressive unrolling and scheduling
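
A minimal C-level illustration of unrolling by 4 with renamed temporaries (the loop and function names are hypothetical, not from the slides); a compiler does the same thing at the instruction level and then reschedules:

```c
#include <stddef.h>

/* Original loop: one add per element, plus loop overhead every iteration.   */
void add_scalar(double *x, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled by 4: the iterations are independent, so the four bodies can be
 * scheduled/overlapped freely; the test + branch is paid once per 4 elements.
 * The distinct temporaries play the role of "use different regs".            */
void add_scalar_unrolled(double *x, double s, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double t0 = x[i]     + s;
        double t1 = x[i + 1] + s;
        double t2 = x[i + 2] + s;
        double t3 = x[i + 3] + s;
        x[i]     = t0;
        x[i + 1] = t1;
        x[i + 2] = t2;
        x[i + 3] = t3;
    }
    for (; i < n; i++)          /* clean-up loop when n is not a multiple of 4 */
        x[i] = x[i] + s;
}
```

Note how the four temporaries raise register pressure, one of the three limits listed above.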
### Dependencies
- **Data/True dependence**: Inst 1 produces a result used by inst 2
- **Pass down**:
- Inst 3 is data dependent on inst 2 & inst 2 is data dependent on inst 1
$\Rightarrow$ Inst 3 is data dependent on inst 1
- **Importance**:
- Indicate the **possibility of a hazard**
- Determine the **order** in which result must be calculated
- Set an **upper bound** on how much **parallelism** can possibly be exploited
    - **Overcome**:
        - Maintain the dependence but avoid the hazard
        - Eliminate the dependence by **transforming** the code
- **Name dependence**: 2 insts use the same name (reg or mem location) without exchanging data
- **Anti-dependence**: Inst 2 writes a reg or mem loc that inst 1 reads from
- **Output dependence**: Inst 1 & inst 2 write the same reg or mem loc
- **Control dependence**: a statement is control dependent on a branch if the branch outcome determines whether the statement is executed
    - **Constraints**:
        - An inst that is control dependent on a branch cannot be moved **before** the branch (so that its execution is no longer controlled by the branch)
        - An inst that is not control dependent on a branch cannot be moved **after** the branch (so that its execution becomes controlled by the branch)
- **Two properties critical to program correctness**, normally preserved by control dependencies:
- **Exception Behavior**: e.g. memory protection exception
- **Data flow**: Branch makes the data flow **dynamic**
- Violating the control dependence may not affect the data flow
- **Liveness**: whether a value will be used by an upcoming inst
- **Compiler-based Speculation**: Safely violating control dependences
- **Loop-carried dependence**: there are dependences between iterations of a loop (see the sketch at the end of this subsection)
    - **Recurrence**: **dependence distance** = the result depends on a value computed # iterations away
        - ⬆️ distance, ⬆️ potential parallelism obtainable by unrolling the loop
    - **Circular or not**: e.g. if S1 depends on S2 (loop-carried) but S2 does not depend on S1, the dependence is **not circular**
    - **A loop whose loop-carried dependences are not circular can be made parallel (by unrolling)**:
        - Make the dependences not loop-carried by **shifting statements** across iterations
        - Iterations of the loop may then be overlapped, provided **the statements within each iteration are kept in order**
- ==TODO: Exercise==
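
A C version of the classic shift-statements transformation (array names and sizes are illustrative): in `before`, S1 uses `B[i]` written by S2 in the previous iteration (loop-carried, but not circular since S2 does not depend on S1); in `after`, the only remaining dependence is within a single iteration, so iterations can be overlapped or unrolled.

```c
/* Hypothetical arrays; sized so every index below stays in bounds. */
enum { N = 100 };
double A[N + 2], B[N + 2], C[N + 1], D[N + 1];

/* Before: S1 reads B[i], which S2 wrote in the *previous* iteration
 * (loop-carried dependence, distance 1), but S2 does not depend on S1.      */
void before(void)
{
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* After shifting statements across iterations: the S2 -> S1 dependence is now
 * inside one iteration, so different iterations no longer depend on each
 * other and the loop body can be unrolled/overlapped.                        */
void after(void)
{
    A[1] = A[1] + B[1];                    /* peeled S1 of iteration 1        */
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];            /* S2 of iteration i               */
        A[i + 1] = A[i + 1] + B[i + 1];    /* S1 of iteration i+1             */
    }
    B[N + 1] = C[N] + D[N];                /* peeled S2 of iteration N        */
}
```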
### Issuing Multiple Instructions per Cycle
- Getting **CPI < 1**
- **SS (SuperScalar)**:
- **Varying** number of insts per cycle (1~8)
- Scheduled by **compiler or HW (Tomasulo)**
- E.g. Superscalar DLX: 2 insts, 1 FP & 1 anything else
- 1 cycle load delay expands to **3 insts** in Superscalar (SS)
- Need additional reg ports
- Need pipelined or multiple FP units
- **VLIW (Very Long Instruction Words)**:
- **Fixed** number of insts per cycle (4~16)
- Scheduled by **compiler**
- Put operations into wide **templates**
- Each "insts" has explicit coding for multiple operations
- **Tradeoff inst space for simple decoding**
- All operations the compiler puts in the long inst word are **independent**
$\Rightarrow$ **execute in parallel**
- Need **compiler technique** beyond basic block (Trace scheduling)
- **Limitations** in Multiple-Issue Processors:
- Inherent limitations of available ILP in programs:
- To **fill the issue slots** and **keep the functional units busy**
- **Functional units** that are **pipelined** or with **multi-cycle latency** require a larger number of operations that can be executed in parallel
        - \# of independent operations $\approx$ the average **pipeline depth** $\times$ # of **functional units** (worked example at the end of this subsection)
- Difficulties in building the underlying HW:
- **HW cost**:
- Multiple FP and integer **functional units**
- ⬆️⬆️ **memory bandwidth** & **register-file bandwidth**
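
Plugging assumed numbers into the rule of thumb above (illustrative only):

$$\#\text{independent ops in flight} \approx \text{pipeline depth} \times \#\text{FUs} = 5 \times 4 = 20$$

i.e. roughly 20 mutually independent insts must be available to keep a machine with 4 functional units of average depth 5 busy.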
## 5. Scoreboard
### Scoreboard Introduction
- **Scoreboard**: A bookkeeping technique
- A **Dynamic scheduling** model
- **Centralized** model
    - Out-of-order execution divides the ID stage into two:
1. **Issue**: Decode insts, check for structural hazard
2. **Read operands**: Wait until no data hazards, then read operands
- **Insts execute whenever not dependent on previous insts and no hazards**
- E.g. CDC 6600 in 1963:
- **In order issue**
- **Out-of-order execution**
- **Out-of-order commit** (or completion):
- Solution for **WAR hazards**:
- **Stall** writeback until regs have been read
- Read regs only during **Read Operands stage**
- Solution for **WAW hazards**:
- Detect hazard and **stall** issue of new inst until other inst completes
- **No forwarding!**
- **No reg renaming!**
- **Multiple or pipelined execution units**
- **Keep track of dependencies** between insts already issued
- **Replace ID, EX, WB with 4 stages**
### CDC 6600 Scoreboard Architecture
- **4 stages** of Scoreboard Control:
1. **Issue (ID1)**: Decode insts, check for structural hazard
- Issue **in program order**
- **Not** issue (stall) if **structural hazard**
        - **Not** issue (stall) if **WAW hazards**:
- **Output dependent** on any previous issued but uncompleted inst
2. **Read operands (ID2)**: Wait until no data hazards, then read operands
- Resolve **RAW hazards**, wait for insts to write back data
- **No forwarding!**
3. **Execution (EX)**: Operate on operands
- The functional unit begins execution upon receiving operands
- When the result is **ready**, it **notifies the scoreboard** that it has completed execution
4. **Write result (WB)**: Finish execution
- **Stall** if **WAR hazards** with previous insts
- **3 parts** of the Scoreboard:
- **Instruction status**: Which of 4 steps the inst is in
    - **Functional unit status**: Indicates the state of the functional unit (FU), **9 fields** for each FU (sketched as a struct at the end of this subsection)
- **Busy**: Whether the unit is busy or not
- **Op**: Operation to perform in the unit (e.g. + or -)
- **Fi**: Destination reg number
- **Fj, Fk**: Source reg numbers
- **Qj, Qk**: FUs producing source regs Fj, Fk (blank if Fj, Fk are ready)
- **Rj, Rk**: Whether Fj, Fk are ready
- **Register result status**: Which FU will write each reg, if one exists
- Blank when no pending insts will write that reg
- Detailed Scoreboard **Pipeline Control** (table in the slides)
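
A minimal sketch of the scoreboard's bookkeeping state as C structs (field names follow the list above; the FU count and the issue check are illustrative assumptions):

```c
#include <stdbool.h>

#define NUM_FUS   5     /* assumed FU count (e.g. integer, FP add, mults, div) */
#define NUM_REGS  32

/* Functional unit status: the 9 fields tracked per FU. */
typedef struct {
    bool busy;          /* Busy: unit currently assigned an inst               */
    int  op;            /* Op:   operation to perform (e.g. add, sub)          */
    int  Fi;            /* Fi:   destination register number                   */
    int  Fj, Fk;        /* Fj/Fk: source register numbers                      */
    int  Qj, Qk;        /* Qj/Qk: FU producing Fj/Fk (-1 if none pending)      */
    bool Rj, Rk;        /* Rj/Rk: source operand ready and not yet read        */
} FUStatus;

typedef struct {
    FUStatus fu[NUM_FUS];
    int reg_result[NUM_REGS];   /* register result status: FU that will write
                                   each reg, -1 if no pending writer           */
} Scoreboard;

/* Issue check (stage 1): stall on a structural hazard (FU busy) or a WAW
 * hazard (some FU is already going to write the same destination register).  */
bool can_issue(const Scoreboard *sb, int fu, int dest_reg)
{
    return !sb->fu[fu].busy && sb->reg_result[dest_reg] == -1;
}
```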
### Scoreboard Summary
- **Limitations** of CDC 6600 Scoreboard:
- No forwarding
- **\# of scoreboard entries**
    - **\# of functional units**, especially integer/load-store units
- Not issue on structural hazards
- Wait for WAR hazards
- Prevent WAW hazards
- Increasingly powerful (and complex) **dynamic mechanism** for detecting and resolving hazards
- In-order pipeline, in-order op-fetch with reg reservations, in-order issue with **scoreboard**
- Out-of-order execution
- **Facilitate multiple issue**
- **Compiler techniques** make it easier for HW to find the ILP
- **⬇ the impact of more sophisticated organization**
- Requires a **larger architected namespace** (regs)
- **Easier for more structured code**
## 6. Tomasulo
### Tomasulo Introduction
- **Tomasulo**: Dynamically Scheduled Instruction Processing
- A **Dynamic scheduling** model
- **In order issue**
- **Out-of-order execution**
- **Out-of-order commit** (or completion)
- **Registers**:
- **Short, absolute name** for a recently computed (or frequently used) value
- **Fast, high bandwidth storage** in the data path
    - Means of **broadcasting a computed value** to the set of insts that use it
        - A series of insts may be issued and waiting for a value produced by a logically preceding inst
        - CDC 6600: each waiting inst comes back and reads the value once it is placed in the reg file
        - Alternative (Tomasulo): broadcast the value and reg # to all the waiting insts
- **Tomasulo vs. Scoreboard**:
- Control & buffers **distributed** with FU v.s. centralized in scoreboard
- FU buffers called **reservation stations (RS)**, hold pending operands
- Regs in insts replaced by values or pointers to RS; called **register renaming**
- **Avoid WAR, WAW hazards**
- More RS than regs, so it can **do optimizations compilers can't**
- Results to FU from RS, **not through regs**, but over **Common Data Bus** that broadcasts to all FUs
- Load and Stores treated as FUs with RSs as well
- Tomasulo's scheme **widely adopted in multiple-issue processors** after 1990s:
- Achieving high performance **without requiring the compiler** to target code to **a specific pipeline structure**
- With the presence of **cache**, the delays are unpredictable (miss). Out-of-order execution allows the processors to continue executing insts while awaiting the completion of a cache miss (**hiding cache miss penalty**)
- Processors becoming **more aggressive in issue capability** demands regs renaming and dynamic scheduling
    - Dynamic scheduling is a key component of speculation, so it was **adopted along with HW speculation** in the mid-1990s
### Tomasulo Architecture
- **Load buffer**:
- Hold the components of the **effective address** until it is computed
- Track **outstanding loads** that are **waiting on the memory**
- Hold the **results** of completed loads
- **Store buffer**:
- Hold the components of the **effective address** until it is computed
- Hold the **destination memory addresses** of **outstanding stores** that are **waiting for data value** to store
- Hold the **address** and **value** to store **until the memory unit is available**
- **3 stages** of the Tomasulo Algorithm:
    1. **Issue (ID)**: get the inst from the FP op queue
        - **FIFO** order to ensure the correct data flow
        - If an **RS is free (no structural hazard)**, control issues the inst & **sends the operands** (**renames regs**)
        - If no RS is empty: **structural hazard, stall** until a station or buffer is freed
        - If the operands are not in the regs, **keep track of the FU that will produce them**
        - **Eliminates WAR & WAW** hazards
    2. **Execution (EX)**: operate on operands
        - If **both operands are ready**, execute
        - If not ready, watch the **Common Data Bus** for the result (**resolves RAW**)
        - Several insts can become ready **in the same clock cycle**
            - **Independent FUs** can begin execution in the same clock cycle for different insts
            - If more than one inst is waiting for **a single FU**, the unit chooses among them
    3. **Write result (WB)**: finish execution
        - Write on the Common Data Bus to all awaiting units, mark the RS available
        - Normal data bus: data + destination ("go to" bus)
        - **Common Data Bus (CDB)**: data + **source** ("come from" bus)
            - 64 bits of data + 4 bits of FU **source** address
            - A waiting unit captures the value if the broadcast source matches the FU it expects (the producer of its result)
            - The CDB does the broadcast
- **Reservation Station (RS)**: 7 fields (sketched as a struct at the end of this subsection)
    - **Op**: operation to perform in the unit
    - **Vj, Vk**: **values** of the source operands
        - Store buffers have a V field: the value to be stored
    - **Qj, Qk**: the RSs producing the source operands (source regs to be written by other insts)
        - Qj, Qk = 0 indicates the source is already available in Vj, Vk
    - **Busy**: RS or FU is busy
    - **A**: holds information for the **memory address calculation of a load or store**
        - Initially: the immediate field of the inst
        - After address calculation: the effective address
- **Register result status**: one field, **Qi**
    - \# of the RS (which FU) that will write the reg, if one exists
    - Blank when no pending inst will write that reg
- **Load and store buffers** each have a field **A**
    - Holds the effective address once the first step of execution has completed
- Detailed **Tomasulo's Algorithm** (table in the slides)
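
A minimal sketch of the per-RS state, register status, and CDB broadcast as C code (tag conventions and sizes are assumptions, not the exact slide tables):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_RS    8      /* assumed number of reservation stations             */
#define NUM_REGS  32

/* One reservation station entry (the 7 fields). Qj/Qk hold the tag (RS #) of
 * the producer; 0 means "value already in Vj/Vk", matching the notes above.   */
typedef struct {
    bool    busy;
    int     op;          /* Op: operation to perform                           */
    double  Vj, Vk;      /* Vj/Vk: operand values (valid when Qj/Qk == 0)      */
    int     Qj, Qk;      /* Qj/Qk: RS # producing the operand, 0 if ready      */
    int32_t A;           /* A: immediate, then effective address (load/store)  */
} RS;

typedef struct {
    RS     rs[NUM_RS + 1];     /* index 0 unused so tag 0 can mean "ready"     */
    int    Qi[NUM_REGS];       /* register status: RS # that will write reg    */
    double regs[NUM_REGS];
} Tomasulo;

/* CDB broadcast: when RS `tag` finishes with `result`, every waiting RS and
 * every register whose Qi matches the tag captures the value.                 */
void cdb_broadcast(Tomasulo *t, int tag, double result)
{
    for (int r = 1; r <= NUM_RS; r++) {
        if (t->rs[r].Qj == tag) { t->rs[r].Vj = result; t->rs[r].Qj = 0; }
        if (t->rs[r].Qk == tag) { t->rs[r].Vk = result; t->rs[r].Qk = 0; }
    }
    for (int i = 0; i < NUM_REGS; i++)
        if (t->Qi[i] == tag) { t->regs[i] = result; t->Qi[i] = 0; }
    t->rs[tag].busy = false;   /* mark the producing RS available again        */
}
```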
### Hardware-based Speculation
- **Speculation**: fetch, issue, and execute insts as if the branch predictions were always correct
- Key ideas:
- **Dynamic branch prediction**
- Allow execution of insts **before the control dependencies are resolved**
- **Dynamic scheduling of different combinations of basic blocks**
        - Perform updates that cannot be undone only when the inst is **no longer speculative**
- Update the reg file or memory: **inst commit**
- **Out-of-order execution** but **in-order commit**
- **Separate execution complete** from **inst commit**
- Require additional HW buffers (**Reorder buffer, ROB**)
- Hold the results of inst between completion and commit
- **Finish execution but have not committed**
- **Reorder buffer (ROB)**: 4 fields (sketched as a struct at the end of this subsection)
- **Instruction type**: Branch/Store/Reg
- **Destination field**: Reg #
- **Value field**: Output value
- **Ready field**: Whether completed execution
- Tomasulo's algorithm with & without speculation:
- Without speculation:
        - **Only partially overlaps basic blocks**, because it requires a branch to be resolved before actually executing any insts in the successor basic block
        - Once an inst writes its result, **any subsequently issued insts will find the result in the reg file**
- With speculation:
- **Dynamic scheduling of different combinations of basic blocks**
- **The reg file is not updated until the inst commits**
- **Integrate the function of store buffer into ROB**
- **Tomasulo's algorithm with speculation**:
1. **Issue**:
- Get the inst from the inst queue
- Issue if there is an **empty RS and empty slot in ROB** (**stall if no empty entry**)
- Send the operands to RS if they are available
- Update the control entries
        - The \# of the ROB entry allocated for the result is also sent to the RS
2. **Execution**:
- If one or more **operands is not yet available, monitor the CDB**
- When both operands are available at a RS, execute the operation (**Eliminate RAW**)
3. **Write result**:
- When the result is available, **write it on the CDB, any RSs waiting for this result, and from the CDB to ROB**
4. **Commit**:
- **Branch with an incorrect prediction**: the speculation is wrong, so the **ROB is flushed** and the execution is restarted at the correct successor of the branch
- **If the branch is correctly predicted, the branch is finished**
- **Store**: when an inst reaches the head of ROB, **update the memory** with the result and removes the inst from the ROB
- **Any other inst**: when an inst reaches the head of ROB, **update the reg** with the result and removes the inst from the ROB
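
A minimal sketch of a ROB entry and the in-order commit step in C (entry count, the extra `mispredicted` flag, and the flush mechanism are illustrative assumptions):

```c
#include <stdbool.h>

#define ROB_SIZE 16
enum InstType { TYPE_REG, TYPE_STORE, TYPE_BRANCH };

/* One ROB entry: the 4 fields listed above, plus a misprediction flag. */
typedef struct {
    enum InstType type;   /* Instruction type: branch / store / reg op          */
    int    dest;          /* Destination: reg # (store address assumed here)    */
    double value;         /* Value to write to the reg or to memory             */
    bool   ready;         /* Ready: execution completed, result in `value`      */
    bool   mispredicted;  /* for branches: the prediction turned out wrong      */
} ROBEntry;

typedef struct {
    ROBEntry entry[ROB_SIZE];
    int head, tail, count;     /* circular buffer; commit only from the head    */
    double regs[32];
} ROB;

/* Commit: only the inst at the head of the ROB may update the reg file or
 * memory, giving in-order commit even though execution was out of order.      */
void commit_one(ROB *rob)
{
    if (rob->count == 0 || !rob->entry[rob->head].ready)
        return;                                  /* nothing ready to commit     */
    ROBEntry *e = &rob->entry[rob->head];
    if (e->type == TYPE_BRANCH && e->mispredicted) {
        rob->head = rob->tail = rob->count = 0;  /* flush all speculative work  */
        /* ...then restart fetch at the correct successor of the branch         */
        return;
    }
    if (e->type == TYPE_STORE) {
        /* memory[e->dest] = e->value;  memory is updated only at commit time   */
    } else if (e->type == TYPE_REG) {
        rob->regs[e->dest] = e->value;           /* reg file updated at commit  */
    }
    rob->head = (rob->head + 1) % ROB_SIZE;      /* correctly predicted branches
                                                    simply retire here          */
    rob->count--;
}
```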
## 7. Branch Prediction
### Branch Prediction Intro
- **Branch Prediction**: As we try to exploit more ILP, the accuracy of branch prediction becomes critical
    - Can use profile information collected from **earlier runs**, and modify the prediction based on the **last run**
- **Dynamic Branch Prediction**:
- Perf = f(acc, cost of mis-pred, freq of branch)
- Why does prediction work:
- Underlying algorithm has regularities
- Data that is being operated on has regularities
- Inst sequence has redundancies that are artifacts of the way that humans/compilers think about problems
- Is dynamic branch prediction better than static branch prediction?
- Seems to be
- There are a small number of important branches in programs which have dynamic behavior
### Branch History Table
- **Branch History Table (BHT) (Branch-Prediction Buffer)**:
- A memory indexed by the lower bits of PC address
- The memory contains 1 bit: **1-bit prediction scheme**
- Says whether or not the branch taken last time
- **No tag & no address check**: It saves HW, but we don't know if it is the right branch. (It may have been put there by another branch that has the same low-order address bits)
- **2-bit prediction schemes** (saturating-counter state diagram in the slides; a C sketch follows at the end of this subsection)
- In a loop, 1-bit branch prediction will cause 2 mis-predictions
- Enter the loop
- End of loop
- To remedy this weakness, **2-bit prediction schemes** are often used
- A prediction must miss twice before it is changed
- Use a small amount of (hopefully) local information to predict behavior
- Implementations:
- Alternative 1: a small, special "cache" accessed with the **inst address** during the IF stage
- Alternative 2: a pair of bits attached to **each block in the inst cache** and fetched with the inst
- Mis-prediction:
- Wrong guess for the branch
- Got branch history of wrong branch when indexing the table
- 4K entries about as good as infinite table for older version of a subset of SPEC benchmark. **For newer version, about 8K entries would be needed to match infinite 2-bit predictor**
- **Correlating Predictors**: a 1-bit predictor plus **1 bit of correlation**: which of the two prediction bits is used depends on whether the last branch was taken

| Pred bits | Pred if last NT | Pred if last T |
|:---------:|:---------------:|:--------------:|
| NT/NT | NT | NT |
| NT/T | NT | T |
| T/NT | T | NT |
| T/T | T | T |
- Sometimes branch **behavior is correlated**, so we can do better by keeping track of the direction of related branches
    - Predictors that use the behavior of other branches to make a prediction are called **correlating predictors** or **2-level predictors**
    - Hypothesis: recently executed branches are correlated
    - Idea: **record the m most recently executed branches** as taken or not taken, and use that pattern to select the proper branch history table
- In general, an **$(m,n)$ predictor** records the **last $m$ branches** to select between **$2^m$ history tables**, each with **$n$-bit counters**
    - e.g. a 2-bit BHT is a $(0,2)$ predictor
    - e.g. a $(2,2)$ predictor (see the sketch at the end of this subsection)
        - $m = 2,\ n = 2$
        - Number of history tables: $2^m = 4$
        - $2$ bits per predictor entry
        - If $4$ lower-order bits of the branch address index each table: $2^4 = 16$ entries per table
        - Total size: $2^4 \times 2 \times 2^2 = 128$ bits
- **Branch Target Buffer (BTB)**: the branch inst address indexes the buffer to get the prediction AND the branch target address (used if predicted taken)
    - Must check for a branch (tag) match now, unlike the BHT, since we can't use the wrong branch's target address
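
A minimal C sketch of a 2-bit saturating-counter BHT and a $(2,2)$ correlating predictor (table sizes and the PC hashing are illustrative assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 16          /* indexed by 4 low-order bits of the PC       */

/* 2-bit saturating counters: 0,1 predict not-taken; 2,3 predict taken.
 * A prediction must be wrong twice before the direction flips.                */
static uint8_t bht[BHT_ENTRIES];

bool bht_predict(uint32_t pc)
{
    return bht[(pc >> 2) & (BHT_ENTRIES - 1)] >= 2;
}

void bht_update(uint32_t pc, bool taken)
{
    uint8_t *c = &bht[(pc >> 2) & (BHT_ENTRIES - 1)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

/* (2,2) correlating predictor: a 2-bit global history of the last 2 branches
 * selects one of 2^2 = 4 tables of 2-bit counters (16 entries each here, so
 * 16 * 4 * 2 = 128 bits of state, as computed in the notes above).            */
static uint8_t corr[4][BHT_ENTRIES];
static uint8_t ghist;           /* last 2 branch outcomes, newest in bit 0     */

bool corr_predict(uint32_t pc)
{
    return corr[ghist & 3][(pc >> 2) & (BHT_ENTRIES - 1)] >= 2;
}

void corr_update(uint32_t pc, bool taken)
{
    uint8_t *c = &corr[ghist & 3][(pc >> 2) & (BHT_ENTRIES - 1)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghist = (uint8_t)(((ghist << 1) | (taken ? 1 : 0)) & 3);
}
```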
### Dynamic Branch Prediction Summary
- Prediction is important
- **BHT**: 2 bits for loop accuracy
- **Correlation**: Recently executed branches correlated with next branch
- Either different branches or different executions of the same branches
- **Tournament Predictor**: more resources for competitive solutions and pick between them
- **BTB**: Include branch address & prediction