# CA Mid 2 Mini (5/2)
<style>
.markdown-body li + li
{
padding-top: 0 !important;
}
.markdown-body table th, .markdown-body table td
{
padding: 2px 10px;
}
</style>
---
[TOC]
---
## 3. Pipelining
### Structural Hazard
- **Structural Hazard**: When some combination of insts cannot be accommodated because of HW resource conflicts
- **Why allowed**: ⬇ cost of the unit (e.g. supporting both an inst cache and a data cache every cycle requires twice the total memory bandwidth; a single, not-fully-ppled FU for FP insts)
### Data Hazard
- **Data Hazard**: When the ppl changes the order of R/W accesses to operands so that the order differs from the order seen by sequentially executing insts on an un-ppled machine.
- **3 Generic Data Hazards**:
1. **RAW**: (True Dep), Inst 2 reads operand before inst 1 writes it
2. **WAR**: (Anti-Dep), Inst 2 writes operand before inst 1 reads it
3. **WAW**: (Output Dep), Inst 2 writes operand before inst 1 writes it
- **WAR, WAW** cannot occur in the DLX 5-stage ppl (they appear in more complicated ppls, e.g. with dynamic scheduling): all insts take 5 stages, reads in stage 2 (ID), writes in stage 5 (WB)
- **Load Interlock**: The inst after a load cannot get the value by forwarding before it is loaded from mem, so a stall may be required
- **SW Scheduling**: Change the order of insts to ⬇ stalls by interlocks
- **HW implementation** of the control for the DLX ppl:
- **MIPS Inst Syntax**: R-r ALU (`rd, rs, rt`), ALU imm (`rt, rs, imm`), L.S. (`rt, imm(rs)`), Branch (`rs, rt, label`)
- **Detecting interlocks** early in the ppl ⬇ the HW complexity
- **Forwarding** (Bypassing): A HW tech to minimize (but not eliminate all) data hazard stalls; a result is forwarded from the output of one unit to the input of another
- **Detect the hazard / the need to forward** at the beginning of the clock cycle that uses the operand (see the sketch below)
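A minimal C sketch of the classic forward-to-EX check for a 5-stage MIPS/DLX-style ppl. The latch names (EX/MEM, MEM/WB) are the usual textbook ones; the struct fields and the return-value encoding are illustrative assumptions, not taken from these notes.

```c
#include <stdbool.h>

/* Fields of a pipeline latch that matter for forwarding (names assumed). */
typedef struct {
    bool regwrite;   /* will this older inst write a register? */
    int  rd;         /* its destination register number        */
} Latch;

/* Decide where the EX stage should take a source operand from.
 * Checking EX/MEM first gives priority to the most recent result.
 * Returns 2 = forward from EX/MEM, 1 = forward from MEM/WB,
 *         0 = no hazard, read the register file normally.      */
int forward_src(int src_reg, Latch ex_mem, Latch mem_wb) {
    if (ex_mem.regwrite && ex_mem.rd != 0 && ex_mem.rd == src_reg)
        return 2;
    if (mem_wb.regwrite && mem_wb.rd != 0 && mem_wb.rd == src_reg)
        return 1;
    return 0;
}
```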
### Control Hazard
- **Control Hazard**: When branches cause stalls, or the ppl makes a wrong branch pred and the insts already fetched into the ppl must be discarded
- **Resolve branch stall**: Determine branch taken or not sooner, AND Compute taken branch address earlier
- **DLX** solution: 1-cycle branch penalty (originally 3 cycles): move the zero test into the ID stage and move the adder that calculates the branch target PC into ID, then transmit the new PC back to IF
- **4 Branch Hazard Alternatives**:
1. **Stall** until direction is clear
2. **Pred Not Taken**: Execute successor insts in sequence and discard them if taken
3. **Pred Taken**: No advantage in DLX, since the branch target address is not calculated any earlier than the branch outcome (both evaluated in ID)
4. **Delayed Branch** (DLX): Do not waste the stalled cycle (the **delay slot**); a 1-slot delay is enough for the 5-stage ppl since the branch decision and target address are ready by then; deeper ppls need more slots, so a more powerful pred scheme is desired
- **Insts in delayed slots**: Must resolve deps
- **From before**: ⬆️ perf always, branch must not depend on the rescheduled insts
- **From target** (taken part): ⬆️ perf when taken, must be fine to execute rescheduled insts if not taken
- **From fall-through** (untaken part): ⬆️ perf when not taken, must be fine to execute rescheduled insts if taken
- **Canceling (nullifying) branches**: squash the inst in the delay slot when the pred is wrong, which allows more slots to be filled
### Extending Pipeline
- **Ppling hard to implement**: dealing with exceptions and interrupts, stopping and restarting execution, inst set complications, FP insts
- **Latency** (clock cycle): intervening cycles between an inst that produces a result and an inst that uses it
- **Initiation interval** (Repeat interval): # of clock cycles that must elapse between issuing two operations of a given type
- **(Longer) Deeper Ppl**:
- **Pros**: More inst parallelism, A stage can be implemented with simpler circuitry, which may let the processor clock run faster, Can handle multiple operations in parallel with varying running times
- **Cons**: It needs better branch pred, The divide unit is not fully ppled, structural hazards can occur, It has varying running times and out-of-order completion which may raise the frequency of hazards and stalls (longer latency)
## 4. Advanced Pipelining and ILP
### Pipeline Basics
- **Pipelining**: helps inst bandwidth (throughput), not latency; compilers ⬇ the cost of data and control hazards (e.g. filling load delay slots)
### ILP
- **ILP (Instruction-Level Parallelism)**: The potential overlap among insts, Overlapping execution of unrelated (no deps) insts
- **SW (compile time)**: ⬇ the impact of data and control hazards
- **HW (run time)**: ⬆️ the ability of the processor to exploit parallelism
- **Pipeline CPI** = Ideal pipeline CPI + Stalls (Structural + RAW + WAR + WAW + Control)
- **ILP techs**: (S) Loop Unrolling: Control; (S) Basic Ppl Scheduling: RAW; (H) D.Sch. with Scoreboarding: RAW; (H) D.Sch. with Reg Renaming: WAR, WAW; (H) D. Branch Pred: Control; (H) Issuing multiple insts per cycle: Ideal CPI
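A quick worked example of the ppl CPI formula above; the per-inst stall counts are made up purely for illustration:

$$\text{Pipeline CPI} = \underbrace{1}_{\text{ideal}} + \underbrace{0.05}_{\text{structural}} + \underbrace{0.20}_{\text{RAW}} + \underbrace{0}_{\text{WAR}} + \underbrace{0}_{\text{WAW}} + \underbrace{0.15}_{\text{control}} = 1.40$$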
### Rescheduling
- **Rescheduling**: Preserve the deps needed (e.g. by adjusting offsets), Fill the delayed branch slot
- **Static Scheduling**: Compiler
- **Dynamic Scheduling**: HW schemes, Do not know real dep at compile time, Keep compiler simpler, Compiled code for one machine runs well on another, Allow insts behind stall to proceed, Enables out-of-order execution $\rightarrow$ out-of-order completion
### Loop Unrolling
- **Loop Unrolling**: ⬆️ the amount of available ILP, ⬇ the negative branch impact, ⬆️ the code size, the overhead saved by each extra unrolling ⬇ (diminishing returns; Amdahl's Law), Reg pressure: potential shortfall in regs created by aggressive unrolling and scheduling
- **Loop Unrolling**: Useful if the loop iters were indep (preserve deps)
- Repeat the statement (basic block), adjust offsets and the loop termination and iteration code, use different regs to avoid unnecessary constraints (see the C sketch after this list)
- Check load & store deps: analyze mem addresses to check that they do not refer to the same address
- **Compiler** perspective on Code Movement: Schedule to avoid hazards, Concerns about deps (If dep, they cannot be executed in parallel), Easy to determine for regs (fixed names), hard for memory
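A minimal C sketch of unrolling by 4, patterned on the classic add-a-scalar-to-a-vector loop; the function and array names are illustrative, and the trip count 1000 is assumed divisible by 4 so no cleanup loop is needed.

```c
/* Original loop: one add per iteration, plus loop overhead every iteration. */
void add_scalar(double *x, double s) {
    for (int i = 0; i < 1000; i++)
        x[i] = x[i] + s;
}

/* Unrolled by 4: offsets adjusted, loop overhead amortized over 4 bodies.
 * The 4 statements are independent, so the scheduler can interleave them
 * (and use different regs) to hide load/FP latency.                        */
void add_scalar_unrolled(double *x, double s) {
    for (int i = 0; i < 1000; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```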
### Dependences
- **True dep**: Inst 1 produces a result used by inst 2
- **Pass down (transitive)**: Inst 3 is data dep on inst 2 & inst 2 is data dep on inst 1 $\Rightarrow$ Inst 3 is data dep on inst 1
- **Importance**: Indicate the possibility of a hazard, Determine the **order** in which result must be calculated, Set an upper bound on how much parallelism can possibly be exploited
- **Overcome**: Maintaining the dep but avoiding a hazard, Eliminating a dep by transforming the code
- **Name dep**: 2 insts use the same name (reg or mem loc) without exchanging data
- **Anti-dep**: Inst 2 writes a reg or mem loc that inst 1 reads from
- **Output dep**: Inst 1 & inst 2 write the same reg or mem loc
- **Control dep**: Statements are control dep on controls (branches)
- **Constraints**: An inst that is control dep on a branch cannot be moved before the branch (its execution would no longer be controlled by the branch); an inst that is not control dep on a branch cannot be moved after the branch (its execution would become controlled by the branch)
- **Two properties critical to program correctness** (normally preserved by control deps):
1. **Exception Behavior**: e.g. memory protection exception
2. **Data flow**: Branches make the data flow dynamic; violating a control dep is acceptable only when it does not change the data flow
- **Liveness**: whether a value will be used by an upcoming inst
- **Compiler-based Speculation**: Safely violating control deps
- **Loop-carried dep**: there are deps between loop iters
- **Recurrence**: a value in one iter depends on one computed in an earlier iter; **Dep distance**: # of iterations the dep reaches back; ⬆️ distance ⬆️ potential parallelism obtainable by unrolling the loop
- **Non-circular**: e.g. S1 depends on S2 (loop-carried), but S2 does not depend on S1
- **A loop with non-circular loop-carried deps can be made parallel (loop unrolling)**: Make the dep not loop-carried by transforming the code (shift statements; see the C sketch below); iterations of the loop may then be overlapped, provided the statements in each iteration are kept in order
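A C sketch of the statement-shifting transformation described above, patterned on the classic textbook example; the arrays are assumed 1-indexed with at least 102 elements.

```c
/* Before: S1 uses B[i], which S2 produced in the PREVIOUS iteration,
 * so there is a loop-carried dep from S2 to S1 (but it is not circular). */
void before(double *A, double *B, double *C, double *D) {
    for (int i = 1; i <= 100; i++) {
        A[i]     = A[i] + B[i];     /* S1 */
        B[i + 1] = C[i] + D[i];     /* S2 */
    }
}

/* After shifting: peel S1's first copy and S2's last copy out of the loop.
 * Inside the loop, S1 now consumes the B[i+1] produced in the SAME
 * iteration, so no loop-carried dep remains and iterations can overlap.  */
void after(double *A, double *B, double *C, double *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= 99; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[101] = C[100] + D[100];
}
```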
### Issuing Multiple Instructions per Cycle
- **Issuing Multiple Insts/Cycle**: Get CPI < 1
- **SS (SuperScalar)**: Varying # of insts/cycle, Scheduled by compiler or HW (Tomasulo), may ⬆️ load delay, may need additional reg ports, pipelined or multiple FP units
- **VLIW (Very Long Instruction Word)**: Fixed # of insts/cycle, Scheduled by compiler, Put operations into wide templates, Each "inst" has explicit coding for multiple operations, Tradeoff inst space for simple decoding, All operations the compiler puts in the long inst word are indep $\Rightarrow$ execute in parallel, Need compiler tech beyond basic block (Trace scheduling)
- **Limitations** in Multiple-Issue Processors: To fill the issue slots and keep the FUs busy, FUs that are ppled or with multi-cycle latency require a larger # of operations that can be executed in parallel, \# of indep operations $\approx$ the average ppl depth $\times$ # of FUs
## 5. Scoreboard
- **Scoreboard**: A bookkeeping tech, A dynamic scheduling model, centralized, Insts execute whenever not dep on previous insts and no hazards
- **Scoreboard Limits**: No forwarding; # of scoreboard entries; # of functional units, especially integer/load-store units; Not issue on structural hazards; Wait for WAR hazards; Prevent WAW hazards
- **CDC 6600**: In order issue; Out-of-order execution & completion; no forwarding; no reg renaming; Multiple or ppled execution units; Keep track of deps between insts already issued
- **Resolve WAR**: Stall writeback until regs have been read; Read regs only during Read Operands stage
- **Resolve WAW**: Detect hazard and stall issue of new inst until other inst completes
- **CDC 6600 4 stages**:
1. **Issue** (Decode insts, check for structural hazard): Issue in program order; Not issue (stall) on structural & WAW hazards (output dep on any previously issued but uncompleted inst)
2. **Read operands** (Wait until no data hazards, then read operands): Resolve RAW hazards, wait for insts to write back data; No forwarding!
3. **Execution** (Operate on operands): The functional unit begins execution upon receiving operands; When the result is ready, it notifies the scoreboard that it has completed execution
4. **Write result** (Finish execution): Stall if WAR hazards with previous insts
- **CDC 6600 3 parts** (struct sketch below):
    1. **Instruction status**
    2. **FU status**: Busy; Op; Fi (dest reg #); Fj, Fk (source reg #s); Qj, Qk (FUs producing Fj, Fk, blank if ready); Rj, Rk (flags: Fj, Fk ready)
    3. **Register result status**: which FU will write each reg
- **Compiler techs** make it easier for HW to find the ILP: ⬇ the impact of more sophisticated org; a larger architected namespace (regs); Easier for more structured code
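A struct sketch of the scoreboard bookkeeping above; the field names follow the notes, while the C types, the -1 "blank" encoding, and the 32-reg file size are assumptions.

```c
#include <stdbool.h>

/* One FU-status entry of the scoreboard (one per functional unit). */
typedef struct {
    bool busy;       /* is this FU busy?                                 */
    int  op;         /* operation to perform (encoded opcode)            */
    int  Fi;         /* destination register number                      */
    int  Fj, Fk;     /* source register numbers                          */
    int  Qj, Qk;     /* FU that will produce Fj / Fk (-1 = value ready)  */
    bool Rj, Rk;     /* is Fj / Fk ready?                                */
} FUStatus;

/* Register result status: which FU will write each register (-1 = none). */
int reg_result_status[32];
```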
## 6. Tomasulo
- **Register**: Short, absolute name for a recently computed (or frequently used) value; Fast, high bandwidth storage in the data path; Means of broadcasting a computed value to set of insts that use the value
- **Tomasulo**: A Dynamic scheduling model; Distributed; In order issue; Out-of-order execution & completion
- **Tomasulo vs. Scoreboard**: Control & buffers (RS) distributed with FUs; Regs in insts replaced by values or pointers to RSs (called reg renaming), avoiding WAR & WAW hazards; More RSs than regs, so it can do optimizations compilers can't; Results go to FUs from RSs, not through regs, but over the CDB that broadcasts to all FUs; Loads and stores treated as FUs with RSs as well
- **Tomasulo popular**: Achieving high perf without requiring the compiler to target code to a specific ppl structure; out-of-order execution hiding cache miss penalty; Processors becoming more aggressive in issue capability; adopted along with HW speculation
- **T Load buffer**: Hold the components of the effective address until it is computed; Track outstanding loads that are waiting on the memory; Hold the results of completed loads
- **T Store buffer**: Hold the components of the effective address until it is computed; Hold the dest memory addresses of outstanding stores that are waiting for data value to store; Hold the address and value to store until the memory unit is available
- **T 3 stages**:
1. **Issue** (get inst from FP Op Queue): FIFO order to ensure the correct data flow; If RS free (no structural hazard), control issues inst & sends operands (renames regs); If no empty RS (structural hazard) then stall until available; If operands are not in the regs, keep track of the FU that will produce the operands; Eliminate WAR & WAW
2. **Execution** (operate on operands): If both operands ready then execute; If not ready, watch CDB for result (Resolve RAW); Several insts could become ready in the same clock cycle; Indep FU could begin execution in the same clock cycle for different insts; If multiple insts are waiting for a single FU, it will choose among them
3. **Write result** (finish execution): Write on CDB to all awaiting units, mark RS available
- **Common Data Bus (CDB)**: data + source ("come from" bus; does the broadcast); **Normal data bus**: data + destination ("go to" bus)
- **Reservation station (RS, FU buffers)** (struct sketch at the end of this section): Op; Vj, Vk: values of the source operands (store buffers have a V field for the value to be stored); Qj, Qk: RSs producing the source regs, blank if ready; Busy; A: holds info for the mem addr calc of a load/store (initially the imm field of the inst; after addr calc, the effective addr)
- **Register result status**: which FU will write the reg, Blank if none
- **Load & store buffers**: **A**: Hold the result of the effective address once the first step of execution has been completed
- **HW-based Speculation**:
- **Separate execution complete from inst commit**: Out-of-order execution but in-order commit; Perform updates that cannot be undone only when the inst is no longer speculative; Update the reg file or memory when the inst commits
- **With**: Dynamic branch prediction; Dynamic scheduling of different combinations of basic blocks
- **Reorder buffer, ROB**: Hold the results of inst between completion and commit; 4 fields: Instruction type, Destination field, Value field, Ready field
- **T with speculation**:
1. **Issue**: Issue if there is an empty RS and an empty slot in the ROB, otherwise stall; the # of the ROB entry allocated for the result is also sent to the RS
2. **Execute**: As in basic Tomasulo: wait for the operands (watching the CDB), execute when both are ready (resolves RAW)
3. **Write result**: When the result is available, write it on the CDB (to any RSs waiting for it) and from the CDB into the ROB
4. **Commit**: When an inst reaches the head of the ROB, update the reg/memory with the result and remove the inst from the ROB; on a wrong branch pred, flush the ROB and restart execution at the correct successor of the branch
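Struct sketches of an RS entry and a ROB entry, per the field lists above; the C types and the 0 / -1 "blank" encodings are assumptions.

```c
#include <stdbool.h>

/* One reservation station entry (fields as listed above).              */
typedef struct {
    bool   busy;
    int    op;        /* operation to perform on Vj, Vk                  */
    double Vj, Vk;    /* operand values (valid only when Qj/Qk == 0)     */
    int    Qj, Qk;    /* RS # that will produce each operand, 0 = ready  */
    int    A;         /* imm field at issue; effective addr after calc   */
} RS;

/* Register result status: which RS will write each reg (0 = none).     */
int reg_status[32];

/* One reorder-buffer entry for Tomasulo with speculation (4 fields).    */
typedef struct {
    int    type;      /* inst type: branch / store / reg operation       */
    int    dest;      /* destination: reg # or memory address            */
    double value;     /* result value, held until commit                 */
    bool   ready;     /* has the inst completed execution?               */
} ROBEntry;
```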
## 7. Branch Prediction
### Branch Prediction Intro
- **Branch Pred (static, profile-based)**: Use profile info collected from earlier runs, and adjust the pred based on the last run
- **Why Dynamic Branch Pred works**: The underlying algorithm & data have regularities; inst sequences contain redundancies that are artifacts of how humans & compilers write code; some important branches have dynamic behavior
- **Branch History Table (BHT) (Branch-Prediction Buffer)**: A small memory indexed by the lower bits of the branch inst address, holding prediction bits; no tag & no address check
- **2-bit predictor scheme**: A prediction must miss twice before it is changed; ⬆️ accuracy on loops; about 8K entries perform about as well as an infinite table (C sketch at the end of this section)
- **Predictor Implementations**: 1: a small, special “cache” accessed with the inst address during the IF stage; 2: a pair of bits attached to each block in the inst cache and fetched with the inst
- **Mis-prediction**: Wrong guess for the branch; Got branch history of wrong branch when indexing the table
- **Non-correlating branch pred**: uses only the recent results of the same branch; **Correlating branch pred**: also uses the recent behavior of other branches
- **Why correlating**: Sometimes the behavior is correlated, and we can do better by keep track of direction of related branches; Predictors that use the behavior of other branches to make a prediction are called correlating predictors or 2-level predictors; E.g. `if(d==0) d=1; if(d==1) ...`
- **$(m,n)$ predictor**: records the last $m$ branches to select between $2^m$ history tables, each with $n$-bit counters
- **⬆️ entries in BHT?**: ⬇ the frequency of getting the branch history of the wrong branch when indexing the table; beyond a point, further ⬆️ in entries gives a very insignificant ⬇ in miss rate
- **Branch Target Buffer (BTB)**: Address of branch index to get prediction AND branch address (if taken); Must check branch address
- **Tournament Predictor**: spend more resources on competing predictors and dynamically pick between them
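A minimal C sketch of a BHT of 2-bit saturating counters; the 4096-entry size, the word-aligned-PC indexing, and the function names are illustrative assumptions. An $(m,n)$ correlating predictor would additionally keep the last $m$ branch outcomes in a global history shift register and use them, together with the PC bits, to pick among $2^m$ such tables.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 4096                 /* assumed size; must be a power of 2 */

/* 2-bit saturating counters: 0,1 = predict not taken; 2,3 = predict taken.
 * No tag / address check: different branches may alias to the same entry. */
static uint8_t bht[BHT_ENTRIES];

/* Index with the low bits of the PC (>> 2 drops the word-offset bits). */
static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);
}

static bool predict_taken(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;
}

/* Update with the actual outcome: the counter saturates at 0 and 3, so a
 * prediction must be wrong twice in a row before the direction flips.    */
static void update(uint32_t pc, bool taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
}
```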