###### tags: `計結`
# 1062 - Computer Architecture - 4~8_page20

## (4) Advanced Pipelining and ILP

### Review : Summary of Pipelining Basics
* **Hazards** limit performance
    * Structural: need more hardware
    * Data: need forwarding, compiler rescheduling
    * Control: early evaluation of the PC, delayed branch, prediction
    * Increasing the length of the pipeline increases the impact of hazards
* Pipelining helps instruction **bandwidth**, **not latency**
* Interrupts and floating-point instructions make pipelining harder
* Compilers reduce the cost of data and control hazards (e.g., load delay slots)
* Longer pipelines (R4000): **more instruction parallelism**, but ***need* better branch prediction**

### Advanced Pipelining and Instruction Level Parallelism (**ILP**)
* The potential overlap among instructions is called **ILP**
* **ILP**: overlap the execution of unrelated instructions
* Techniques that reduce the impact of data and control hazards (**software approaches**)
    * gcc: about 17% of instructions are control transfers, i.e., roughly 5 instructions + 1 branch per basic block
    * Must look beyond a single basic block to get more instruction level parallelism
* Increasing the ability of the processor to exploit parallelism (**hardware approaches**)

### CPI of a Pipelined machine
* **Pipeline CPI** = Ideal pipeline CPI + Structural stalls + Data stalls + Control stalls
  = Ideal pipeline CPI + Structural stalls + (RAW stalls + WAR stalls + WAW stalls) + Control stalls
* ![](https://i.imgur.com/ShpCMy9.png)

### **Loop Unrolling**
* ![](https://i.imgur.com/y50QPC1.png)
* Summary (a C sketch of these steps appears after the loop-carried dependence examples below)
    * Determine whether it is **legal** to move the instructions and **adjust the offsets**
    * Determine that unrolling the loop would be useful by finding that the loop **iterations are independent**
    * Use different registers to avoid unnecessary constraints
    * Eliminate the extra tests and branches
    * Determine that the loads and stores in the unrolled loop are independent and can be interchanged (analyze the **memory addresses** and find that they **do not refer to the same address**)
    * Schedule the code, preserving any **dependences** needed

### **Dependency**
* Data dependence (true dependence, **RAW**)
* Name dependence
    * Anti-dependence (**WAR**)
    * Output dependence (**WAW**)
* Control dependence
    ```python=
    if p1:
        S1
    if p2:
        S2
    ```
    * `S1` is control dependent on `p1`
    * `S2` is control dependent on `p2` but **not on** `p1`
* ![](https://i.imgur.com/8OvHjJu.png)

### **Loop carried dependence**
* Dependence distance of 1
    ```c=
    for (i = 2; i <= 100; i++) {
        Y[i] = Y[i-1] + Y[i];
    }
    ```
* Dependence distance of 5
    ```c=
    for (i = 6; i <= 100; i++) {
        Y[i] = Y[i-5] + Y[i];
    }
    ```
* The **larger the distance**, the **more potential parallelism** can be obtained by unrolling the loop.
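Tying the unrolling summary to the dependence discussion: a minimal C sketch of unrolling a loop whose iterations are independent, by a factor of 4. The array, scalar, and unroll factor are illustrative, and the trip count is assumed to be a multiple of 4.

```c
/* Unrolled by 4: one loop test and branch per four elements; the
   separate temporaries t0..t3 mimic using different registers, so a
   scheduler can freely interleave the four independent operations. */
void add_scalar_unrolled(double *Y, double s, int n) {
    for (int i = 0; i < n; i += 4) {
        double t0 = Y[i]     + s;
        double t1 = Y[i + 1] + s;
        double t2 = Y[i + 2] + s;
        double t3 = Y[i + 3] + s;
        Y[i]     = t0;
        Y[i + 1] = t1;
        Y[i + 2] = t2;
        Y[i + 3] = t3;
    }
}
```

Note that unrolling is only legal here because the loop has no loop-carried dependence; a dependence distance of 1, as in the first example above, would forbid this reordering.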
### 3 Limits to Loop Unrolling
* Decrease in the amount of overhead amortized with each extra unrolling
    * Amdahl's Law
* **Growth in code size**
    * For larger loops, the concern is that it increases the instruction-cache miss rate
* **Register pressure**: potential shortfall in registers created by aggressive unrolling and scheduling
    * If it is not possible to allocate all live values to registers, the code may lose some or all of its advantage

### Review : Loop Unrolling Decisions
* Determine that loop unrolling is useful by finding that the loop iterations are **independent**
* Use **different registers** to avoid unnecessary constraints forced by using the same registers for different computations
* Eliminate the extra test and branch instructions and **adjust the loop termination and iteration code**
* Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent
    * The transformation requires **analyzing memory addresses** and finding that they do not refer to the same address
* Schedule the code, preserving any dependences needed to yield the same result as the original code

### SuperScalar
* Superscalar DLX: 2 instructions per cycle, 1 FP & 1 anything else
    * Fetch 64 bits/clock cycle; Int on the left, FP on the right
    * Can only issue the 2nd instruction if the 1st instruction issues
    * More ports on the FP registers to do an FP load & an FP op as a pair
* ![](https://i.imgur.com/yNbu3rb.png)

### Very Long Instruction Word (**VLIW**)
* Each "instruction" has explicit coding for multiple operations
    * The long instruction word has room for many operations
    * By definition, all the operations the compiler puts in the long instruction word are **independent** => execute in parallel
    * E.g., 2 int ops, 2 FP ops, 2 memory refs, 1 branch
        * 16 to 24 bits per field
* Needs a **compiling technique** that schedules across several branches (trace scheduling)
* ![](https://i.imgur.com/dWt3WaP.png)

### Limitations in Multiple-Issue Processors
* Inherent limitations of available ILP in programs
    * Need enough parallelism to fill the issue slots and keep the functional units busy
    * **Functional units** that are **pipelined** or have a **multi-cycle latency** require a larger number of operations that can execute in parallel
    * The number of independent operations needed is roughly equal to the **average pipeline depth** times the **number of functional units** (see the sketch below)
* Difficulties in building the underlying hardware: **hardware cost**
    * Multiple floating-point and integer functional units
    * Large increase in the **memory bandwidth** and **register-file bandwidth**
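A back-of-the-envelope instance of that rule; both machine parameters below are hypothetical numbers chosen only to make the arithmetic concrete.

```c
/* Rough estimate of the ILP needed to keep a multiple-issue machine busy:
   independent operations in flight ~= pipeline depth x functional units. */
enum {
    AVG_PIPELINE_DEPTH   = 5,  /* average functional-unit pipeline depth */
    NUM_FUNCTIONAL_UNITS = 4,  /* functional units operating in parallel */
    INDEPENDENT_OPS_NEEDED = AVG_PIPELINE_DEPTH * NUM_FUNCTIONAL_UNITS  /* 20 */
};
```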
## (5) Scoreboard

### Static and Dynamic Scheduling
* Static scheduling: **compiler techniques** for scheduling
* Dynamic scheduling: **hardware schemes**
    * Works when we don't know the real dependences at compile time
    * Keeps the compiler simpler
    * Code compiled for one machine runs well on another
    * Allows instructions behind a stall to proceed
    * Enables out-of-order execution -> out-of-order completion

### Scoreboard : a bookkeeping technique
* Out-of-order execution divides the **ID** stage:
    * **Issue**: decode instructions, check for structural hazards
    * **Read operands**: wait until no data hazards, then read operands
* Instructions execute whenever they are not dependent on previous instructions and there are no hazards
* Example: CDC 6600
    * **In-order issue**
    * **Out-of-order execution**
    * **Out-of-order commit (or completion)**
    * No forwarding!
    * Imprecise interrupt/exception model

### Scoreboard Implications
* Out-of-order completion $\rightarrow$ **WAR** and **WAW** hazards
* Solutions for **WAR**:
    * **Stall** writeback until the registers have been read
    * Read registers only during the **Read Operands** stage
* Solution for **WAW**:
    * Detect the hazard and stall issue of the new instruction until the other instruction completes
* No register renaming!
* Need to have multiple instructions in the execution phase
    * **Multiple execution units** or **pipelined execution units**
* The scoreboard keeps **track of dependences between instructions** that have already issued
* The scoreboard replaces ID, EX, WB with 4 stages
* ![](https://i.imgur.com/GWynG4g.png)

### Four Stages of Scoreboard Control
* **Issue** (ID1): decode instructions & check for structural hazards
    * Instructions are issued in program order (for hazard checking)
    * Don't issue if there is a **structural hazard**
    * Don't issue if the instruction is output dependent (**WAW**) on any previously issued but uncompleted instruction
* **Read operands** (ID2): wait until no data hazards, then read operands
    * All true dependences (**RAW**) are resolved in this stage, since we wait for instructions to write back their data
    * **No forwarding of data** in this model
* **Execution** (EX): operate on operands
    * The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
* **Write result** (WB): finish execution
    * Stall until there is no **WAR** hazard with previous instructions:
    * ![](https://i.imgur.com/XNPeUyM.png)

### Three Parts of the Scoreboard
* Instruction status: which of the 4 steps the instruction is in
* Functional unit status: indicates the state of the functional unit (FU); 9 fields for each functional unit
    * Busy: indicates whether the unit is busy or not
    * Op: operation to perform in the unit (e.g., + or *)
    * Fi: destination register
    * Fj, Fk: source-register numbers
    * Qj, Qk: functional units producing source registers Fj, Fk
    * Rj, Rk: flags indicating when Fj, Fk are ready
* Register result status: indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register
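These bookkeeping structures map naturally onto plain records; a minimal C sketch, with field names taken from the list above (the table sizes are hypothetical).

```c
#define NUM_FUS  5    /* hypothetical number of functional units */
#define NUM_REGS 32

/* Functional-unit status: the nine fields listed above. */
struct fu_status {
    int busy;       /* Busy: is the unit in use?               */
    int op;         /* Op: operation to perform                */
    int fi;         /* Fi: destination register                */
    int fj, fk;     /* Fj, Fk: source-register numbers         */
    int qj, qk;     /* Qj, Qk: FUs producing Fj, Fk (0 = none) */
    int rj, rk;     /* Rj, Rk: flags - are the operands ready? */
};

struct fu_status fu[NUM_FUS];

/* Register result status: which FU will write each register
   (0 = no pending write, i.e. the "blank" case). */
int reg_result[NUM_REGS];
```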
### Scoreboard Example
* ![](https://i.imgur.com/44SYbzR.png)
* ![](https://i.imgur.com/ZyO6vfh.png)

### Limitations of CDC 6600 Scoreboard
* No forwarding hardware
* **Limited to instructions in a basic block** (small window)
* Limited by the **number of scoreboard entries**
* Limited by the **number of functional units** (structural hazards), especially the integer/load-store units
* Does not issue on structural hazards
* Waits for WAR hazards
* Prevents WAW hazards

### Summary
* An increasingly powerful (and complex) dynamic mechanism for detecting and resolving hazards
    * In-order pipeline, in-order op-fetch with register reservations, in-order issue with a **scoreboard**
* Allows later instructions to proceed around ones that are stalled
* Facilitates multiple issue
* Compiler techniques make it easier for HW to find the ILP
    * Reduces the impact of a more sophisticated organization
    * Requires a larger architected namespace
    * Easier for more structured code

## (6) Tomasulo's Algorithm

### What do registers offer?
* A short, absolute name for a recently computed (or frequently used) value
* Fast, high-bandwidth storage in the datapath
* A means of broadcasting a computed value to the set of instructions that use the value

### Broadcasting result value
* A series of instructions is issued and waits for a value to be produced by a logically preceding instruction
* The CDC 6600 has each instruction come back and read the value once it is placed in the register file
* Alternative: **broadcast** the value and register number to all the waiting instructions

### Tomasulo's Algorithm
* Control & buffers are distributed with the functional units (FUs)
    * The FU buffers are called **reservation stations** (RSs); they hold pending operands
* Registers in instructions are **replaced by values or pointers to RSs** (register renaming)
    * Avoids WAR and WAW hazards
    * There are more RSs than registers, so it can do optimizations compilers can't
* Results go to the FUs from the RSs, not through the registers, over the **Common Data Bus** (CDB), which broadcasts results to all FUs
* Loads and stores are treated as FUs with RSs as well
    * **Load buffer**
        * Holds the components of the **effective address** until it is computed
        * Tracks **outstanding loads** that are ==waiting on the memory==
        * Holds the **results** of completed loads
    * **Store buffer**
        * Holds the components of the **effective address** until it is computed
        * Holds the **destination memory addresses** of **outstanding stores** that are ==waiting for the data value== to store
        * Holds the **address** and **value** to store ==until the memory unit is available==
* The basic structure of a MIPS FP unit using Tomasulo's algorithm
    * ![](https://i.imgur.com/fhQv5wn.png)
### Three Stages of the Tomasulo Algorithm
Each stage can take an arbitrary number of clock cycles.
* **Issue**: get an instruction from the FP op queue
    * **FIFO** order, to ensure correct data flow
    * If an RS is free (no structural hazard), control issues the instruction & **sends the operands** (renames the registers)
    * If no RS is empty: structural hazard, **stall** until a station or buffer is freed
    * If the operands are not in the registers, **keep track of the FUs that will produce the operands**
    * $\rightarrow$ Eliminates **WAR** and **WAW** hazards
* **Execution** (EX): operate on operands
    * If **both operands are ready**, execute
    * If they are not ready, watch the CDB for the result (resolves RAW)
    * Several instructions can become ready **in the same clock cycle**
        * **Independent FUs** can begin execution in the same clock cycle for different instructions
        * If more than one instruction is waiting for **a single FU**, the unit chooses among them
* **Write result** (WB): finish execution
    * Write the result on the CDB to all awaiting units, then mark the RS available
    * Write if it matches the expected FU (the one producing the result)
    * Does the broadcast

### Data Bus
* Normal data bus: data + destination (a "go to" bus)
* Common data bus: data + source (a "come from" bus)
    * 64 bits of data + 4 bits of FU source address

### Reservation Station: Seven Fields
* **Op**: operation to perform in the unit
* **Vj**, **Vk**: the ==values== of the source operands
    * The store buffer has a V field for the result to be stored
* **Qj**, **Qk**: the RSs producing the source registers (source registers to be written by other instructions)
    * Note: Qj, Qk = 0 indicates that the source is already in Vj or Vk (or is unnecessary)
* **Busy**: indicates that the RS or FU is busy
* **Address**: holds the information for the memory-address calculation of a load or store
    * Initially: the immediate field of the instruction
    * After address calculation: the effective address

### Register result status: one field Qi
* Indicates the number of the RS (which functional unit) that will write the register, if one exists
* Blank when there is no pending instruction that will write that register
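As with the scoreboard, these fields map onto a simple record; a minimal C sketch of the reservation stations, the Qi register status, and the tag-matching broadcast on the CDB (names and sizes are hypothetical).

```c
#define NUM_RS   8    /* hypothetical number of reservation stations */
#define NUM_REGS 32

/* Reservation station: the seven fields listed above. */
struct rs_entry {
    int    busy;      /* Busy: station in use?                          */
    int    op;        /* Op: operation to perform                       */
    double vj, vk;    /* Vj, Vk: operand values (valid when Qj/Qk = 0)  */
    int    qj, qk;    /* Qj, Qk: RSs producing the operands (0 = none)  */
    long   address;   /* Address: immediate, then the effective address */
};

struct rs_entry rs[NUM_RS];
int qi[NUM_REGS];     /* Qi: RS that will write each register (0 = none) */

/* CDB broadcast: every waiting station compares the broadcast tag
   against its Qj/Qk and captures the value on a match. */
void cdb_broadcast(int tag, double value) {
    for (int i = 0; i < NUM_RS; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].qj == tag) { rs[i].vj = value; rs[i].qj = 0; }
        if (rs[i].qk == tag) { rs[i].vk = value; rs[i].qk = 0; }
    }
}
```

The "come from" nature of the CDB shows up as the tag comparison: each consumer identifies its producer, rather than the producer addressing its consumers.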
### Tomasulo Algorithm Example
* [Rule 1](https://i.imgur.com/Yy89GeC.png)
* [Rule 2](https://i.imgur.com/iaF2aQT.png)
* [Rule 3](https://i.imgur.com/Ze05sr8.png)
* ![](https://i.imgur.com/lmKn2Q7.png)
* Compare with Scoreboard
    * ![](https://i.imgur.com/HpiZMb2.png)

### Tomasulo vs. Scoreboard
* IBM 360/91 vs. CDC 6600
* ![](https://i.imgur.com/YFcMZ5z.png)
* Tomasulo drawbacks
    * Complexity
    * Performance limited by the **CDB**

### Why Tomasulo's scheme became widely adopted in multiple-issue processors after the 1990s
* Achieves high performance **without requiring the compiler** to target code to **a specific pipeline structure**
* In the presence of ==caches==, delays are unpredictable (misses). **Out-of-order execution** allows the processor to continue executing instructions while awaiting the completion of a cache miss (**hiding the cache-miss penalty**)
* Processors becoming **more aggressive in issue capability** demand register renaming and dynamic scheduling
* Dynamic scheduling is a key component of speculation; it was **adopted along with hardware speculation** in the mid-1990s

### Hardware-Based Speculation
* Speculation: fetch, issue, and execute instructions as if **the branch predictions were always correct**
* Key ideas
    * Dynamic branch prediction
    * Allow execution of instructions before the control dependences are resolved
    * Dynamic scheduling of different combinations of basic blocks
* Perform updates that cannot be undone only when the instruction is **no longer speculative**
    * Updates to the register file or memory: instruction commit
    * **Execute out of order** but **commit in order**
    * Separate **execution completion** from **instruction commit**
* Requires an additional hardware buffer: the **reorder buffer** (ROB)
    * Holds the results of instructions **between completion and commit**
    * Four fields:
        * Instruction type: branch/store/register
        * Destination field: register number
        * Value field: output value
        * Ready field: completed execution?
* Modified RSs:
    * The operand source is now the ROB instead of the FU

### Differences between Tomasulo's algorithm with and without speculation
* Without speculation
    * **Only partially overlaps basic blocks**, because it requires that a branch be resolved before actually executing any instructions in the successor basic block
    * Once an instruction writes its result, **any subsequently issued instructions will find the result in the register file**
* With speculation
    * Dynamic scheduling of different combinations of basic blocks
    * The register file is not updated **until the instruction commits**
    * **Integrates the function of the store buffer into the ROB**

### Four steps of hardware-based speculation
* **Issue**
    * Get the instruction from the instruction queue
    * Issue if there is an **empty reservation station and an empty slot in the ROB** (**stall if there is no empty entry**)
    * Send the operands to the RS if they are available
    * Update the control entries
        * The number of the ROB entry allocated for the result is also sent to the RS
* **Execute**
    * If one or more operands are not yet available, monitor the CDB
    * When both operands are available at an RS, execute the operation (**eliminates RAW hazards**)
* **Write result**
    * When the result is available, write it on the CDB, to any RSs waiting for this result, and from the CDB into the ROB
* **Commit** (see the sketch after this list)
    * **Branch with an incorrect prediction**: the speculation was wrong, so the ROB is flushed and execution restarts at the correct successor of the branch
        * If the branch was correctly predicted, the branch is finished
    * **Store**: when the instruction reaches the head of the ROB, **update the memory** with the result and remove the instruction from the ROB
    * **Any other instruction**: when the instruction reaches the head of the ROB, **update the register** with the result and remove the instruction from the ROB
* ![](https://i.imgur.com/glcbHxe.png)
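A minimal C sketch of the commit step: retire in program order from the head of a circular ROB, and flush on a mispredicted branch. The field and type names are hypothetical, continuing the earlier sketches.

```c
#define ROB_SIZE 16
enum itype { BRANCH, STORE, REGOP };  /* the "instruction type" field */

struct rob_entry {
    enum itype type;     /* branch / store / register operation     */
    int    dest;         /* destination: register number or address */
    double value;        /* value produced by the instruction       */
    int    ready;        /* completed execution?                    */
    int    mispredicted; /* for branches: was the prediction wrong? */
};

struct rob_entry rob[ROB_SIZE];
int rob_head = 0, rob_count = 0;

/* Commit: retire instructions in program order from the ROB head;
   stop at the first entry that has not finished executing. */
void commit(double regs[], double mem[]) {
    while (rob_count > 0 && rob[rob_head].ready) {
        struct rob_entry *e = &rob[rob_head];
        if (e->type == BRANCH && e->mispredicted) {
            rob_head = rob_count = 0;  /* flush the ROB; hardware would   */
            return;                    /* refetch at the correct successor */
        }
        if (e->type == STORE) mem[e->dest]  = e->value;  /* update memory   */
        if (e->type == REGOP) regs[e->dest] = e->value;  /* update register */
        rob_head = (rob_head + 1) % ROB_SIZE;
        rob_count--;
    }
}
```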
### Summary : Tomasulo's algorithm
* **RS**s: *renaming* to a larger set of registers + buffering of source operands
    * Prevents the registers from becoming a bottleneck
    * Avoids the WAR and WAW hazards of the scoreboard
    * Allows loop unrolling in hardware
* **Not limited to basic blocks** (the integer unit gets ahead, beyond branches)
* Helps with cache misses as well
* **Lasting contributions**
    * Dynamic scheduling
    * Register renaming

## (7-1) Branch Prediction

### Branch Prediction
* As we try to exploit more ILP, the accuracy of branch prediction becomes critical
* To reorder code around branches, we need to predict the branch statically at compile time
* The simplest scheme is to predict a branch as taken
    * Average misprediction = untaken branch frequency = 34%

### Dynamic Branch Prediction
* Why does prediction work?
    * The underlying algorithm has regularities
    * The data being operated on has regularities
    * The instruction sequence has redundancies that are artifacts of the way that humans/compilers think about problems
* Is dynamic branch prediction better than static prediction?
    * It seems to be
    * There is a small number of important branches in programs that have dynamic behavior

### Branch History Table (Branch-Prediction Buffer)
* A memory indexed by the lower bits of the PC address
* The memory contains 1 bit: the **1-bit prediction scheme**
    * Says whether or not the branch was taken last time
* **No tag and no address check**: saves hardware, but we don't know if it is the right branch (the bit may have been put there by another branch that has the same low-order address bits)
* Implementation:
    * Alternative 1
        * A small, special "cache" accessed with the instruction address during the IF stage
    * Alternative 2
        * A pair of bits attached to each block in the instruction cache and fetched with the instruction

### Adding more prediction bits : 2-bit Dynamic Branch Prediction
* ![](https://i.imgur.com/LZteDRu.png)

### Correlating Predictors
* 2-bit prediction uses a small amount of (hopefully) local information to predict behaviour
* Sometimes the behaviour is correlated, and we can do better by keeping track of the direction of related branches; for example, consider the following code:
    ```c=
    if (d == 0) {
        d = 1;
    }
    if (d == 1) {
        // something
    }
    ```
* ![](https://i.imgur.com/htV8Rwp.png)
* **If the 1st branch is not taken, neither is the 2nd**
* Predictors that use the behaviour of other branches to make a prediction are called **correlating predictors** or **two-level predictors**

### Correlating Branches
* Hypothesis: **recent branches are correlated**; i.e., the behavior of recently executed branches affects the prediction of the current branch
* Idea: **record the $m$ most recently executed branches** as taken or not taken, and use that pattern to select the proper branch history table
* **(m,n) predictor**:
    * Record the last $m$ branches to select between $2^m$ history tables, each with $n$-bit counters
    * The 2-bit BHT is a (0,2) predictor

### The number of bits in an (m,n) predictor
* Total bits $= 2^m \times$ number of prediction **entries** $\times\ n$
    * $2^m$
        * The last $m$ branches select between $2^m$ history tables
    * Number of prediction **entries**
        * $k$ lower-order address bits $\rightarrow$ $2^k$ entries per history table
    * $n$
        * $n$ bits per entry
    * E.g., a (2,2) predictor with $2^{10} = 1K$ entries per table needs $2^2 \times 1K \times 2 = 8K$ bits in total (this matches the sketch below)
* Example (2,2) predictor
    * ![](https://i.imgur.com/G69f1MH.png)
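A minimal C sketch of an (m,n) predictor with $m=2$ bits of global history and $n=2$-bit saturating counters, indexed by the $k$ low-order PC bits. The table sizes and names are hypothetical; with these sizes the storage is exactly the 8K bits computed above.

```c
#define K 10                 /* low-order PC bits used as index: 2^k entries */
#define ENTRIES (1 << K)
#define M 2                  /* bits of global branch history */
#define TABLES (1 << M)

/* 2^m tables of 2-bit saturating counters; counter >= 2 predicts taken. */
static unsigned char bht[TABLES][ENTRIES];
static unsigned history;     /* outcomes of the last m branches */

int predict_taken(unsigned pc) {
    return bht[history][pc & (ENTRIES - 1)] >= 2;
}

void train(unsigned pc, int taken) {
    unsigned char *c = &bht[history][pc & (ENTRIES - 1)];
    if (taken)  { if (*c < 3) (*c)++; }  /* saturate at "strongly taken"   */
    else        { if (*c > 0) (*c)--; }  /* saturate at "strongly untaken" */
    history = ((history << 1) | (unsigned)taken) & (TABLES - 1);
}
```

Setting `M = 0` collapses this to a single table of 2-bit counters, i.e. the (0,2) predictor that the plain 2-bit BHT corresponds to.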
### Branch Target Buffer (**BTB**)
* The address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
* Note: must now check that the entry matches this branch, since we can't afford to use the wrong branch address
* ![](https://i.imgur.com/fmXeCu0.png)
* ![](https://i.imgur.com/WGqOllZ.png)

### Summary : Dynamic Branch Prediction
* Prediction is important
* **Branch History Table**: 2 bits for loop accuracy
* **Correlation**: recently executed branches are correlated with the next branch
    * Either different branches or different executions of the same branch
* **Tournament Predictor**: give more resources to competing solutions and pick between them
    * ![](https://i.imgur.com/GgiSxpi.png)
* **Branch Target Buffer**: include the branch address & prediction

## (7-2) Exceptions and Interrupts

### Interrupt
* ![](https://i.imgur.com/dNOLslz.png)

### Polling - [Wikipedia](https://zh.wikipedia.org/wiki/%E8%BC%AA%E8%A9%A2)
* ![](https://i.imgur.com/Q2nAY1y.png)

### Comparison
* Polling is faster than interrupts because
    * The compiler knows which registers are in use at the polling point, so it does not need to save and restore registers (or not as many)
    * Other interrupt overhead is avoided (pipeline flush, trap priorities, etc.)
* Polling is slower than interrupts because
    * The overhead of the polling instructions is incurred regardless of whether or not the handler is run. This can add to inner-loop delay.
    * The device may have to wait a long time for service

### When to use one or the other?
* **Polling**: good for frequent/regular events, **as long as the device can be controlled at user level** (see the sketch below)
* **Interrupts**: good for infrequent/irregular events
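A minimal C sketch of user-level polling on a memory-mapped device; the register addresses and the ready bit are hypothetical.

```c
#include <stdint.h>

/* Hypothetical memory-mapped device registers. */
#define DEV_STATUS ((volatile uint32_t *)0x10000000)
#define DEV_DATA   ((volatile uint32_t *)0x10000004)
#define READY_BIT  0x1u

/* Spin on the status register until the device is ready, then read the
   data: no register save/restore, no pipeline flush, but the checking
   overhead is paid on every iteration whether or not data arrives. */
uint32_t poll_read(void) {
    while ((*DEV_STATUS & READY_BIT) == 0)
        ;  /* busy-wait */
    return *DEV_DATA;
}
```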
### Exceptions and Interrupts
* Arithmetic trap (overflow, divide by zero, etc.)
* Using an undefined instruction
* Hardware malfunction
* Invoking an OS service from a user program
* I/O device request

### Classifications : Exceptions, Interrupts
* **Exceptions**: relevant to the current process
    * Faults, arithmetic traps
    * Invoke software on behalf of the currently executing process
* **Interrupts**: caused by asynchronous, outside events
    * I/O devices requiring service (disk, network)

### Classifications : Synchronous, Asynchronous
* **Synchronous**: related to the instruction stream, i.e., occurring during the execution of an instruction
    * **Must stop the instruction that is currently executing**
    * Page fault on a load or store instruction
    * Arithmetic exception
    * Software trap instructions
* **Asynchronous**: unrelated to the instruction stream, i.e., caused by an outside event
    * **Does not have to disrupt** instructions that are already executing
    * Interrupts are asynchronous

### Precise Exceptions/Interrupts
* An interrupt or exception is considered **precise** if there is a **single** instruction (or **interrupt point**) for which:
    * All instructions before it have committed their state
    * No following instructions (including the interrupting instruction) have modified any state
* This means that you can **restart execution at the interrupt point** and "get the right answer"
* In the previous example: the **interrupt point** is at the first `lw` instruction
* ![](https://i.imgur.com/5p8y0vd.png)

## (8) Multiprocessors (page_1 to page_21)

### Why Multiprocessors?
* A growth in data-intensive applications
* A growing interest in server markets
* The complexity of current microprocessors
    * Combining several processors is much easier than redesigning one
* Steady improvement in parallel software (scientific apps, databases, OS)
* The long-term goal of the field:
    * Scale the number of processors to the size of the budget and the desired performance

### Categories
```
S = Single
M = Multiple
I = Instruction stream
D = Data stream
```
* **SISD**
    * Uni-processor
* **SIMD**
    * **Data-level parallelism**
    * Each processor **has its own data memory**, but there is a **single instruction memory and control processor**, which fetches and dispatches instructions
    * Multimedia, graphics performance
* **MISD**
    * No commercial product of this type
* **MIMD**
    * Each processor fetches its own instructions and operates on its own data
    * **==Thread==-level parallelism**

### Data Level Parallelism: **SIMD**
* Vector architectures
* SIMD extensions
* Graphics Processing Units (GPUs)
* **Advantages**
    * SIMD architectures can exploit **significant data-level parallelism** for:
        * **Matrix-oriented** scientific computing
        * **Media-oriented** image and sound processing
    * SIMD is **more energy efficient** than MIMD
        * Only needs to fetch one instruction per data operation
        * Makes SIMD attractive for personal mobile devices
    * SIMD allows the programmer to continue to think sequentially

### Vector Architectures
* Basic idea
    * Read sets of data elements into "**vector registers**"
    * Operate on those registers
    * Disperse the results back into memory
* The registers are controlled by the compiler
    * Used to hide memory latency
    * Leverage memory bandwidth
* The basic structure of a vector architecture
    * ![](https://i.imgur.com/nqbMmW3.png)

### Example architecture: **VMIPS**
* **Vector registers**
    * Each register holds a 64-element vector, 64 bits/element
    * The register file has 16 read ports and 8 write ports
* **Vector functional units**
    * Fully pipelined
    * Data and control hazards are detected
* **Vector load-store unit**
    * Fully pipelined
    * One word per clock cycle after the initial latency
* **Scalar registers**
    * 32 general-purpose registers
    * 32 floating-point registers

### VMIPS Instructions
* `ADDVV.D Vd, Vs, Vt`
    * add the two vectors `Vs` and `Vt`, then put each result in `Vd`
* `ADDVS.D Vd, Vs, Ft`
    * add the scalar `Ft` to **each** element of vector `Vs`, then put each result in `Vd`
* `SUBVV.D Vd, Vs, Vt`
* `SUBVS.D Vd, Vs, Ft`
* `SUBSV.D Vd, Fs, Vt`
* `MULVV.D`, `MULVS.D`, `DIVVV.D`, `DIVVS.D`, `DIVSV.D`
* `LV Vd, (Rs)`, `SV (Rd), Vs`
    * vector load / store from the address in `Rs` / `Rd`
* [For more instructions](http://www.cburch.com/cs/330/reading/vmips-ref.pdf)
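The element-wise semantics of the vector-vector and vector-scalar forms can be written out in scalar C; a minimal sketch for `ADDVV.D` and `ADDVS.D`, using the 64-element register length described above.

```c
#define VLEN 64  /* elements per VMIPS vector register */

/* ADDVV.D Vd, Vs, Vt : element-wise vector-vector add */
void addvv_d(double vd[VLEN], const double vs[VLEN], const double vt[VLEN]) {
    for (int i = 0; i < VLEN; i++)
        vd[i] = vs[i] + vt[i];
}

/* ADDVS.D Vd, Vs, Ft : add the scalar ft to each element */
void addvs_d(double vd[VLEN], const double vs[VLEN], double ft) {
    for (int i = 0; i < VLEN; i++)
        vd[i] = vs[i] + ft;
}
```

On real vector hardware the whole loop is a single instruction; the C loop only spells out what each element lane computes.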
### **SAXPY** (Scalar Alpha X Plus Y)
* $Y = \alpha X + Y$
    * $X, Y$ are vectors, initially resident in memory
    * $\alpha$ is a scalar
* For different data types, there are:
    * **`SAXPY`**: single precision
    * **`DAXPY`**: double precision
    * `CAXPY`: complex numbers with single precision
    * `ZAXPY`: complex numbers with double precision
* This is the so-called SAXPY or DAXPY loop found in benchmarks
    ```c=
    for (int i = m; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
    ```
* SAXPY MIPS code
    ```=
          L.D    F0, a        ; load scalar a
          DADDIU R4, Rx, #512 ; last address to load
    Loop: L.D    F2, 0(Rx)    ; load X[i]
          MUL.D  F2, F2, F0   ; a*X[i]
          L.D    F4, 0(Ry)    ; load Y[i]
          ADD.D  F4, F4, F2   ; a*X[i] + Y[i]
          S.D    F4, 0(Ry)    ; store into Y[i]
          DADDIU Rx, Rx, #8   ; increment index to X
          DADDIU Ry, Ry, #8   ; increment index to Y
          DSUBU  R20, R4, Rx  ; compute bound
          BNEZ   R20, Loop    ; check if done
    ```
* SAXPY VMIPS code
    ```=
    L.D     F0, a      ; load scalar a
    LV      V1, Rx     ; load vector X
    MULVS.D V2, V1, F0 ; vector-scalar multiply
    LV      V3, Ry     ; load vector Y
    ADDVV.D V4, V2, V3 ; add
    SV      Ry, V4     ; store the result
    ```
* Requires 6 instructions vs. almost 600 for MIPS

### Comparison : MIPS, VMIPS
* The overhead instructions (index increments, bound check, branch) are not present in the VMIPS code
* The compiler produces vector instructions for such a sequence: the code is said to be **vectorized** or **vectorizable**
* Loops can be vectorized if there are **no loop-carried dependences**
* VMIPS has fewer **pipeline load interlocks**
    * Vector architects call the **forwarding of element-dependent operations** "==chaining=="

### Vector Execution Time
* Execution time depends on three factors:
    * Length of the operand vectors
    * Structural hazards
    * Data dependences
* VMIPS functional units consume one element per clock cycle
    * Execution time is approximately the vector length
* **Convoy**
    * A set of vector instructions that could potentially execute together

### Chimes
* Sequences with **RAW** dependences can be in the same convoy via chaining
* **Chaining**
    * Allows a vector operation to start as soon as the individual elements of its vector source operand become available
* **Chime**
    * The unit of time to execute one convoy
    * $m$ convoys execute in $m$ chimes
    * For a vector length of $n$, this requires $m \times n$ clock cycles

### Example : Total clock cycles with chimes
```=
LV      V1, Rx     ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV      V3, Ry     ; load vector Y
ADDVV.D V4, V2, V3 ; add two vectors
SV      Ry, V4     ; store the sum
```
* Convoys: 3 convoys
    ```=
    LV    MULVS.D
    LV    ADDVV.D
    SV
    ```
* Total clock cycles = 3 chimes $\times$ 64 elements per vector = 192 clock cycles

### Multiple Lanes
* Using parallel pipelines
* Allows for multiple hardware lanes
* Example
    * ![](https://i.imgur.com/zRyvxg8.png)
* Structure of a vector unit containing 4 lanes
    * ![](https://i.imgur.com/eyWa8N3.png)

### Vector Length Register
* What if the vector length is not known at compile time?
* Use the **Vector Length Register** (**VLR**)
* Use ==strip mining== for vectors longer than the **maximum vector length** (**MVL**)
    * ![](https://i.imgur.com/BeCuwtZ.png)
    * For a vector length $n$, the odd-sized first piece handles $m = n \bmod \text{MVL}$ elements and every remaining piece handles exactly MVL elements, so the length of each operation is $\le$ MVL, as the picture shows (a C sketch of strip mining appears after the vector-mask example below)

### Vector Mask Registers
* Consider:
    ```c=
    for (i = 0; i < 64; i++) {
        if (X[i] != 0) {
            X[i] = X[i] - Y[i];
        }
    }
    ```
* Use the vector mask register to "**disable**" elements:
    ```=
    LV      V1, Rx     ; load vector X into V1
    LV      V2, Ry     ; load vector Y into V2
    L.D     F0, #0     ; load FP zero into F0
    SNEVS.D V1, F0     ; sets VM(i) to 1 if V1(i) != F0
    SUBVV.D V1, V1, V2 ; subtract under vector mask
    SV      Rx, V1     ; store the result in X
    ```
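As referenced in the Vector Length Register section, a minimal C sketch of strip mining; `MVL` and the DAXPY-style body are illustrative stand-ins for the vector hardware.

```c
#define MVL 64  /* maximum vector length of the machine */

/* Strip-mined DAXPY: each piece is at most MVL elements, so the inner
   loop stands in for one vector operation executed with VLR = len.
   The first piece handles the odd-sized n % MVL elements (possibly 0);
   every later piece is a full MVL-element operation. */
void daxpy_strip_mined(int n, double a, const double *x, double *y) {
    int len = n % MVL;  /* odd-sized first piece */
    for (int i = 0; i < n; i += len, len = MVL) {
        for (int j = i; j < i + len; j++)  /* "VLR = len" for this piece */
            y[j] = a * x[j] + y[j];
    }
}
```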