###### tags: `計結`
# 1062 - Computer Architecture - 4~8_page20
## (4) Advanced Pipelining and ILP
### Review: Summary of Pipelining Basics
* **Hazards** limit performance
    * Structural: need more hardware
    * Data: need forwarding, compiler rescheduling
    * Control: early evaluation of the PC, delayed branch, prediction
* Increasing the length of the pipeline increases the impact of hazards
* Pipelining helps instruction **bandwidth**, **not latency**
* Interrupts and floating-point instructions make pipelining harder
* Compilers reduce the cost of data and control hazards (e.g. load delay slots)
* Longer pipelines (R4000): **more instruction parallelism**, but ***need* better branch prediction**
### Advanced Pipelining and Instruction Level Parallelism (**ILP**)
* The potential overlap among instructions is called **ILP**
* **ILP**: overlap the execution of unrelated instructions
* Techniques that reduce the impact of data and control hazards (**software approaches**)
    * gcc: 17% of instructions are control transfers
        * about 5 instructions + 1 branch
    * Must go beyond a single basic block to get more instruction-level parallelism
* Increasing the ability of the processor to exploit parallelism (**hardware approaches**)
### CPI of a Pipelined Machine
* **Pipeline CPI** = Ideal pipeline CPI + Structural stalls + Data stalls + Control stalls
  = Ideal pipeline CPI + Structural stalls + (RAW stalls + WAR stalls + WAW stalls) + Control stalls
* ![](https://i.imgur.com/ShpCMy9.png)
### **Loop Unrolling**
* ![](https://i.imgur.com/y50QPC1.png)
* Summary
    * Determine whether it is **legal** to move the instructions and **adjust the offsets**
    * Determine whether unrolling the loop would be useful by finding whether the loop **iterations are independent**
    * Use different registers to avoid unnecessary constraints
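These unrolling steps can be sketched in plain Python on an independent-iteration loop (a DAXPY-style update). This is a toy illustration with made-up names, not compiler output: the unrolled body pays one index update and one loop test per four iterations, and a cleanup loop handles lengths not divisible by 4.

```python
def daxpy_unrolled(a, x, y):
    """y[i] = a*x[i] + y[i], unrolled by a factor of 4 (toy sketch)."""
    n = len(x)
    i = 0
    while i + 4 <= n:           # unrolled body: four independent iterations,
        y[i]     = a * x[i]     + y[i]      # only one loop test and one
        y[i + 1] = a * x[i + 1] + y[i + 1]  # index update per trip
        y[i + 2] = a * x[i + 2] + y[i + 2]
        y[i + 3] = a * x[i + 3] + y[i + 3]
        i += 4
    while i < n:                # cleanup loop for leftover iterations
        y[i] = a * x[i] + y[i]
        i += 1
    return y
```

Because the iterations are independent, the four copies in the unrolled body could be scheduled freely, which is exactly what the adjusted offsets and distinct registers enable in the assembly version.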
    * Eliminate the extra tests and branches
    * Determine that the loads and stores in the unrolled loop are independent and can be interchanged (analyze the **memory addresses** and find that they **do not refer to the same address**)
    * Schedule the code, preserving any **dependences** needed
### **Dependency**
* Data dependence (true dependence, **RAW**)
* Name dependence
    * Anti-dependence (**WAR**)
    * Output dependence (**WAW**)
* Control dependence
    ```python=
    if p1:
        S1
    if p2:
        S2
    ```
    * `S1` is control dependent on `p1`
    * `S2` is control dependent on `p2` but **not on** `p1`
* ![](https://i.imgur.com/8OvHjJu.png)
### **Loop-Carried Dependence**
* Dependence distance of 1
    ```c=
    for (i = 2; i <= 100; i++) {
        Y[i] = Y[i-1] + Y[i];
    }
    ```
* Dependence distance of 5
    ```c=
    for (i = 6; i <= 100; i++) {
        Y[i] = Y[i-5] + Y[i];
    }
    ```
* The **larger the distance**, the **more potential parallelism** can be obtained by unrolling the loop.
### 3 Limits to Loop Unrolling
* Decrease in the amount of overhead amortized with each extra unrolling
    * Amdahl's Law
* **Growth in code size**
    * For larger loops, the concern is that it increases the instruction cache miss rate
* **Register pressure**: a potential shortfall in registers created by aggressive unrolling and scheduling
    * If it is not possible to allocate all live values to registers, the code may lose some or all of its advantage
### Review: Loop Unrolling Decisions
* Determine that loop unrolling is useful by finding that the loop iterations are **independent**
* Use **different registers** to avoid unnecessary constraints forced by using the same registers for different computations
* Eliminate the extra test and branch instructions and **adjust the loop termination and iteration code**
* Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent
    * This transformation requires **analyzing memory addresses** and finding that they do not refer to the same address
* Schedule the code, preserving any dependences needed to yield the same result as the original code
### Superscalar
* Superscalar DLX: 2 instructions, 1 FP & 1 anything else
    * Fetch 64 bits/clock cycle; int on the left, FP on the right
    * Can only issue the 2nd instruction if the 1st instruction issues
    * More ports on the FP registers to do an FP load & FP op in a pair
* ![](https://i.imgur.com/yNbu3rb.png)
### Very Long Instruction Word (**VLIW**)
* Each "instruction" has explicit coding for multiple operations
    * The long instruction word has room for many operations
    * By definition, all the operations the compiler puts in the long instruction word are **independent** => execute in parallel
    * E.g., 2 int ops, 2 FP ops, 2 memory refs, 1 branch
        * 16 to 24 bits per field
* Needs a **compiling technique** that schedules across several branches (trace scheduling)
* ![](https://i.imgur.com/dWt3WaP.png)
### Limitations in Multiple-Issue Processors
* Inherent limitations of the available ILP in programs
    * Need to fill the issue slots and keep the functional units busy
    * **Functional units** that are **pipelined** or have **multi-cycle latency** require a larger number of operations that can execute in parallel
    * The number of independent operations needed roughly equals the **average pipeline depth** times the **number of functional units**
* Difficulties in building the underlying hardware: **hardware cost**
    * Multiple floating-point and integer functional units
    * Large increase in **memory bandwidth** and **register-file bandwidth**
## (5) Scoreboard
### Static and Dynamic Scheduling
* Static scheduling: **compiler techniques** for scheduling
* Dynamic scheduling: **hardware schemes**
    * Works when we don't know the real dependences at compile time
    * Keeps the compiler simpler
    * Code compiled for one machine runs well on another
    * Allows instructions behind a stall to proceed
    * Enables out-of-order execution -> out-of-order completion
### Scoreboard: a bookkeeping technique
* Out-of-order execution divides the **ID** stage:
    * **Issue**: decode instructions, check for structural hazards
    * **Read operands**: wait until no data hazards, then read operands
* Instructions execute whenever they are not dependent on previous instructions and there are no hazards
* Example: CDC 6600
    * **In-order issue**
    * **Out-of-order execution**
    * **Out-of-order commit (or completion)**
    * No forwarding!
    * Imprecise interrupt/exception model
### Scoreboard Implications
* Out-of-order completion $\rightarrow$ **WAR**, **WAW** hazards
* Solutions for **WAR**:
    * **Stall** writeback until the registers have been read
    * Read registers only during the **Read operands** stage
* Solution for **WAW**:
    * Detect the hazard and stall issue of the new instruction until the other instruction completes
* No register renaming!
* Need to have multiple instructions in the execution phase
    * **Multiple execution units** or **pipelined execution units**
* The scoreboard keeps **track of dependencies between instructions** that have already issued
* The scoreboard replaces ID, EX, WB with 4 stages
* ![](https://i.imgur.com/GWynG4g.png)
### Four Stages of Scoreboard Control
* **Issue** (ID1): decode instructions & check for structural hazards
    * Instructions are issued in program order (for hazard checking)
    * Don't issue on a **structural hazard**
    * Don't issue if the instruction is output dependent (**WAW**) on any previously issued but uncompleted instruction
* **Read operands** (ID2): wait until no data hazards, then read operands
    * All true dependences (**RAW**) are resolved in this stage, since we wait for instructions to write back their data
    * **No forwarding of data** in this model
* **Execution** (EX): operate on operands
    * The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
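The Issue-stage rules above (don't issue on a structural hazard, don't issue on a WAW) can be sketched as a small predicate. The dictionaries stand in for the scoreboard's functional-unit status and register-result status tables; the names are illustrative, not CDC 6600 detail.

```python
def can_issue(fu, dest, fu_busy, reg_result):
    """Scoreboard Issue-stage check (sketch).

    fu_busy:    functional-unit status (Busy field only).
    reg_result: register result status -- which FU, if any, will write
                each register; a pending writer of `dest` is a WAW hazard.
    """
    structural_hazard = fu_busy.get(fu, False)
    waw_hazard = reg_result.get(dest) is not None
    return not structural_hazard and not waw_hazard

# Hypothetical snapshot: MULT1 is busy and will write F10.
fu_busy = {"MULT1": True, "ADD": False}
reg_result = {"F10": "MULT1"}
```

With this snapshot, an add writing F6 may issue, an add writing F10 must stall (WAW), and anything needing MULT1 must stall (structural hazard).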
* **Write result** (WB): finish execution
    * Stall until there is no **WAR** hazard with previous instructions
    * ![](https://i.imgur.com/XNPeUyM.png)
### Three Parts of the Scoreboard
* Instruction status: which of the 4 steps the instruction is in
* Functional unit status: indicates the state of the functional unit (FU); 9 fields per functional unit
    * Busy: indicates whether the unit is busy or not
    * Op: operation to perform in the unit (e.g., + or *)
    * Fi: destination register
    * Fj, Fk: source-register numbers
    * Qj, Qk: functional units producing source registers Fj, Fk
    * Rj, Rk: flags indicating when Fj, Fk are ready
* Register result status: indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register
### Scoreboard Example
* ![](https://i.imgur.com/44SYbzR.png)
* ![](https://i.imgur.com/ZyO6vfh.png)
### Limitations of the CDC 6600 Scoreboard
* No forwarding hardware
* **Limited to instructions in a basic block** (small window)
* Limited by the **number of scoreboard entries**
* Limited by the **number of functional units** (structural hazards), especially the integer/load-store units
* Does not issue on structural hazards
* Waits for WAR
* Prevents WAW
### Summary
* An increasingly powerful (and complex) dynamic mechanism for detecting and resolving hazards
    * In-order pipeline, in-order op-fetch with register reservations, in-order issue with a **scoreboard**
* Allows later instructions to proceed around ones that are stalled
* Facilitates multiple issue
* Compiler techniques make it easier for hardware to find the ILP
    * Reduces the impact of more sophisticated organization
    * Requires a larger architected namespace
    * Easier for more structured code
## (6) Tomasulo's Algorithm
### What do registers offer?
* A short, absolute name for a recently computed (or frequently used) value
* Fast, high-bandwidth storage in the datapath
* A means of broadcasting a computed value to the set of instructions that use it
### Broadcasting the result value
* A series of instructions is issued and waiting for a value to be produced by a logically preceding instruction
* The CDC 6600 has each of them come back and read the value once it is placed in the register file
* Alternative: **broadcast** the value and register number to all the waiting instructions
### Tomasulo's Algorithm
* Control & buffers are distributed with the functional units (FUs)
    * The FU buffers are called **reservation stations** (RSs); they hold pending operands
* Registers in instructions are **replaced by values or pointers to RSs** (register renaming)
    * Avoids WAR and WAW hazards
    * There are more RSs than registers, so it can do optimizations compilers can't
* Results go from the RSs to the FUs, not through the registers, over the **Common Data Bus** (CDB) that broadcasts results to all FUs
* Loads and stores are treated as FUs with RSs as well
    * **Load buffer**
        * Holds the components of the **effective address** until it is computed
        * Tracks **outstanding loads** that are ==waiting on the memory==
        * Holds the **results** of completed loads
    * **Store buffer**
        * Holds the components of the **effective address** until it is computed
        * Holds the **destination memory addresses** of **outstanding stores** that are ==waiting for the data value== to store
        * Holds the **address** and **value** to store ==until the memory unit is available==
* The basic structure of a MIPS FP unit using Tomasulo's algorithm
    * ![](https://i.imgur.com/fhQv5wn.png)
### Three Stages of the Tomasulo Algorithm
Each stage can take an arbitrary number of clock cycles.
* **Issue**: get an instruction from the FP op queue
    * **FIFO** order to ensure correct data flow
    * If an RS is free (no structural hazard), control issues the instruction & **sends the operands** (renames registers)
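The renaming done at issue can be sketched as follows: each source register becomes either a value (Vj/Vk, if the register file already holds it) or a tag naming the reservation station that will produce it (Qj/Qk). A minimal sketch; `qi`, `regfile`, and the RS name `MULT1` are illustrative.

```python
def rename_operand(reg, qi, regfile):
    """Tomasulo issue-stage renaming of one source operand (sketch).

    qi maps a register to the RS that will write it (the Qi field of the
    register result status); absence means the value is already valid.
    """
    producer = qi.get(reg)
    if producer is None:
        return ("V", regfile[reg])   # value available now (goes in Vj/Vk)
    return ("Q", producer)           # wait for this RS tag on the CDB

regfile = {"F2": 3.5, "F4": 1.0}
qi = {"F4": "MULT1"}                 # MULT1 will produce F4
```

Here F2 renames to its value, while F4 renames to the tag `MULT1`; any instruction holding that tag simply watches the CDB, which is why WAR and WAW hazards disappear.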
    * If there is no empty RS: structural hazard, so **stall** until a station or buffer is freed
    * If the operands are not in the registers, **keep track of the FUs that will produce the operands**
    * $\rightarrow$ Eliminates **WAR** and **WAW** hazards
* **Execution** (EX): operate on operands
    * If **both operands are ready**, execute
    * If not ready, watch the CDB for the result (resolves RAW)
    * Several instructions may become ready **in the same clock cycle**
        * **Independent FUs** can begin execution in the same clock cycle for different instructions
        * If more than one instruction is waiting for **a single FU**, the unit chooses among them
* **Write result** (WB): finish execution
    * Write the result on the CDB to all awaiting units, then mark the RS available
    * Write if it matches the expected FU (the one producing the result)
    * Does the broadcast
### Data Bus
* Normal data bus: data + destination (a "go to" bus)
* Common data bus: data + source (a "come from" bus)
    * 64 bits of data + 4 bits of FU source address
### Reservation Station: Seven Fields
* **Op**: operation to perform in the unit
* **Vj**, **Vk**: ==value== of the source operands
    * Store buffers have a V field: the result to be stored
* **Qj**, **Qk**: the RSs producing the source registers (source registers to be written by other instructions)
    * Note: Qj, Qk = 0 indicates the source is already in Vj or Vk (or is unnecessary)
* **Busy**: indicates the RS or FU is busy
* **Address**: holds information for the memory-address calculation of a load or store
    * Initially: the immediate field of the instruction
    * After address calculation: the effective address
### Register result status: one field, Qi
* Indicates the number of the RS (which functional unit) that will write the register, if one exists
* Blank when no pending instruction will write that register
### Tomasulo Algorithm Example
* [Rule 1](https://i.imgur.com/Yy89GeC.png)
* [Rule 2](https://i.imgur.com/iaF2aQT.png)
* [Rule 3](https://i.imgur.com/Ze05sr8.png)
* ![](https://i.imgur.com/lmKn2Q7.png)
* Compare with the scoreboard
    * ![](https://i.imgur.com/HpiZMb2.png)
### Tomasulo vs. Scoreboard
* IBM 360/91 vs. CDC 6600
* ![](https://i.imgur.com/YFcMZ5z.png)
* Tomasulo drawbacks
    * Complexity
    * Performance limited by the **CDB**
### Why was Tomasulo's scheme widely adopted in multiple-issue processors after the 1990s?
* It achieves high performance **without requiring the compiler** to target code to **a specific pipeline structure**
* With the presence of ==caches==, delays are unpredictable (misses); **out-of-order execution** lets the processor continue executing instructions while awaiting the completion of a cache miss (**hiding the cache miss penalty**)
* Processors becoming **more aggressive in issue capability** demand register renaming and dynamic scheduling
* Dynamic scheduling is a key component of speculation, so it was **adopted along with hardware speculation** in the mid-1990s
### Hardware-Based Speculation
* Speculation: fetch, issue, and execute instructions as if **the branch predictions were always correct**
* Key ideas
    * Dynamic branch prediction
    * Allowing execution of instructions before the control dependences are resolved
    * Dynamic scheduling of different combinations of basic blocks
* Perform updates that cannot be undone only when the instruction is **no longer speculative**
    * Updating the register file or memory: instruction commit
    * **Execute out of order** but **commit in order**
    * Separate **execution complete** from **instruction commit**
* Requires an additional hardware buffer: the **reorder buffer** (ROB)
    * Holds the result of an instruction **between completion and commit**
    * Four fields:
        * Instruction type: branch/store/register
        * Destination field: register number
        * Value field: output value
        * Ready field: has the instruction completed execution?
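The ROB fields and in-order commit can be sketched in a few lines of Python. This is a toy model, not the actual hardware bookkeeping: entries carry the four fields above, and retirement happens strictly from the head, and only once an entry is ready.

```python
from collections import deque

def commit_ready(rob, regfile):
    """Retire ready entries from the ROB head, in order (sketch).

    Stops at the first not-ready entry: completed instructions behind it
    must wait, which is what makes the commit point precise.
    """
    committed = []
    while rob and rob[0]["ready"]:
        entry = rob.popleft()
        if entry["type"] == "register":
            regfile[entry["dest"]] = entry["value"]  # update at commit only
        committed.append(entry["dest"])
    return committed

# Hypothetical state: F0 and F4 have completed execution, F2 has not.
rob = deque([
    {"type": "register", "dest": "F0", "value": 2.0, "ready": True},
    {"type": "register", "dest": "F2", "value": 5.0, "ready": False},
    {"type": "register", "dest": "F4", "value": 7.0, "ready": True},
])
regfile = {}
```

Running `commit_ready` on this state commits only F0: F2 at the head is not ready, so the already-completed F4 behind it waits (execute out of order, commit in order).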
* Modify the RSs:
    * The operand source is now the ROB instead of an FU
### Differences between Tomasulo's algorithm with and without speculation
* Without speculation
    * **Only partially overlaps basic blocks**, because it requires that a branch be resolved before actually executing any instructions in the successor basic block
    * Once an instruction writes its result, **any subsequently issued instructions will find the result in the register file**
* With speculation
    * Dynamic scheduling of different combinations of basic blocks
    * The register file is not updated **until the instruction commits**
    * **Integrates the function of the store buffer into the ROB**
### Four steps of hardware-based speculation
* **Issue**
    * Get the instruction from the instruction queue
    * Issue if there is an **empty reservation station and an empty slot in the ROB** (**stall if there is no empty entry**)
    * Send the operands to the RS if they are available
    * Update the control entries
        * The number of the ROB entry allocated for the result is also sent to the RS
* **Execute**
    * If one or more operands are not yet available, monitor the CDB
    * When both operands are available at an RS, execute the operation (**eliminates RAW**)
* **Write result**
    * When the result is available, write it on the CDB, to any RSs waiting for this result, and from the CDB into the ROB
* **Commit**
    * **Branch with an incorrect prediction**: the speculation is wrong, so the ROB is flushed and execution restarts at the correct successor of the branch
        * If the branch is correctly predicted, the branch is finished
    * **Store**: when the instruction reaches the head of the ROB, **update the memory** with the result and remove the instruction from the ROB
    * **Any other instruction**: when the instruction reaches the head of the ROB, **update the register** with the result and remove the instruction from the ROB
* ![](https://i.imgur.com/glcbHxe.png)
### Summary: Tomasulo's algorithm
* **RS**s: *renaming* to a larger set of registers + buffering of source operands
    * Prevents the registers from becoming a bottleneck
* Avoids the WAR and WAW hazards of the scoreboard
* Allows loop unrolling in hardware
* **Not limited to basic blocks** (integer units get ahead, beyond branches)
* Helps with cache misses as well
* **Lasting contributions**
    * Dynamic scheduling
    * Register renaming
## (7-1) Branch Prediction
### Branch Prediction
* As we try to exploit more ILP, the accuracy of branch prediction becomes critical
* To reorder code around branches, we need to predict branches statically at compile time
* The simplest scheme is to predict every branch as taken
    * Average misprediction = untaken branch frequency = 34%
### Dynamic Branch Prediction
* Why does prediction work?
    * The underlying algorithm has regularities
    * The data being operated on has regularities
    * The instruction sequence has redundancies that are artifacts of the way humans/compilers think about problems
* Is dynamic branch prediction better than static prediction?
    * It seems to be
    * There is a small number of important branches in programs that have dynamic behavior
### Branch History Table (Branch-Prediction Buffer)
* A memory indexed by the lower bits of the PC address
* Each entry contains 1 bit: the **1-bit prediction scheme**
    * Says whether or not the branch was taken last time
* **No tag and no address check**: saves hardware, but we don't know if it is the right branch (the bit may have been put there by another branch that has the same low-order address bits)
* Implementation:
    * Alternative 1: a small, special "cache" accessed with the instruction address during the IF stage
    * Alternative 2: a pair of bits attached to each block in the instruction cache and fetched with the instruction
### Adding more prediction bits: 2-bit Dynamic Branch Prediction
* ![](https://i.imgur.com/LZteDRu.png)
### Correlating Predictors
* 2-bit prediction uses a small amount of (hopefully) local information to predict behaviour
* Sometimes the behaviour is correlated, and we can do better by keeping track of the direction of related branches; for example, consider the following code:
    ```c=
    if (d == 0) {
        d = 1;
    }
    if (d == 1) {
        // something
    }
    ```
* ![](https://i.imgur.com/htV8Rwp.png)
* **If the 1st branch is not taken, neither is the 2nd**
* Predictors that use the behaviour of other branches to make a prediction are called **correlating predictors** or **two-level predictors**
### Correlating Branches
* Hypothesis: **recent branches are correlated**; i.e. the behavior of recently executed branches affects the prediction of the current branch
* Idea: **record the $m$ most recently executed branches** as taken or not taken, and use that pattern to select the proper branch history table
* **(m,n) predictor**:
    * Record the last $m$ branches to select between $2^m$ history tables, each with $n$-bit counters
    * The 2-bit BHT is a (0,2) predictor
### The number of bits in an (m,n) predictor
* Total bits $= 2^m \times$ number of prediction **entries** $\times\ n$
    * $2^m$: the last $m$ branches select between $2^m$ history tables
    * Number of prediction **entries**: $k$ lower-order address bits $\rightarrow$ $2^k$ entries per history table
    * $n$: $n$ bits per entry
* Example: (2,2) predictor
    * ![](https://i.imgur.com/G69f1MH.png)
### Branch Target Buffer (**BTB**)
* The address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
* Note: we must check for a branch match now, since we can't use the wrong branch address
* ![](https://i.imgur.com/fmXeCu0.png)
* ![](https://i.imgur.com/WGqOllZ.png)
### Summary: Dynamic Branch Prediction
* Prediction is important
* **Branch History Table**: 2 bits for loop accuracy
* **Correlation**: recently executed branches are correlated with the next branch
    * Either different branches or different executions of the same branch
* **Tournament Predictor**: spend more resources on competing predictors and pick between them
    * ![](https://i.imgur.com/GgiSxpi.png)
* **Branch Target Buffer**: includes the branch address & prediction
## (7-2) Exceptions and Interrupts
### Interrupt
* ![](https://i.imgur.com/dNOLslz.png)
### Polling - [Wikipedia](https://zh.wikipedia.org/wiki/%E8%BC%AA%E8%A9%A2)
* ![](https://i.imgur.com/Q2nAY1y.png)
### Comparison
* Polling is faster than interrupts because
    * The compiler knows which registers are in use at the polling point, so it does not need to save and restore registers (or not as many)
    * Other interrupt overhead is avoided (pipeline flush, trap priorities, etc.)
* Polling is slower than interrupts because
    * The overhead of the polling instructions is incurred whether or not the handler is run; this can add to inner-loop delay
    * A device may have to wait a long time for service
### When to use one or the other?
* **Polling**: frequent/regular events, **as long as the device can be controlled at user level**
* **Interrupts**: good for infrequent/irregular events
### Exceptions and Interrupts
* Arithmetic trap (overflow, divide by zero, etc.)
* Use of an undefined instruction
* Hardware malfunction
* Invocation of an OS service from a user program
* I/O device request
### Classifications: Exceptions, Interrupts
* **Exceptions**: relevant to the current process
    * Faults, arithmetic traps
    * Invoke software on behalf of the currently executing process
* **Interrupts**: caused by asynchronous, outside events
    * I/O devices requiring service (disk, network)
### Classifications: Synchronous, Asynchronous
* **Synchronous**: related to the instruction stream, i.e. occurring during the execution of an instruction
    * **Must stop the instruction that is currently executing**
    * Page fault on a load or store instruction
    * Arithmetic exception
    * Software trap instructions
* **Asynchronous**: unrelated to the instruction stream, i.e. caused by an outside event
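Returning to the branch-prediction material above, the 2-bit scheme shown in the state-machine figure can be sketched as a saturating counter (a minimal model; the class name is ours). Counter values 0 and 1 predict not-taken, 2 and 3 predict taken, so a single anomalous outcome cannot flip a strongly biased prediction, which is what gives the scheme its loop accuracy.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter, as in a single BHT entry (sketch)."""

    def __init__(self):
        self.counter = 0              # start at strongly not-taken

    def predict(self):
        return self.counter >= 2      # True = predict taken

    def update(self, taken):
        # Each outcome nudges the counter one step, saturating at 0 and 3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)
```

For a loop branch that is taken repeatedly, the counter saturates at 3; the single not-taken outcome at loop exit drops it only to 2, so the predictor still predicts taken when the loop is next entered.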
* **Does not have to disrupt** instructions that are already executing * Interrupts are asynchronous ### Precise Exceptions/Interrupts * An interrupt or exception is considered **precise** if there is a **single** instruction (or **interrupt point**) for which: * All instructions before have committed their state * No following instructions (including the interrupting instruction) have modified any state * This means, that you can **restart execution at the interrupt point** and "get the right answer" * In previous example: **Interrupt point** is at first `lw` instruction * ![](https://i.imgur.com/5p8y0vd.png) ## (8) Multiprocessors (page_1 to page_21) ### Why Multiprocessors? * A growth in data-intensive applications * A growing interest in server markets * Complexity of current microprocessors * Collecting several much easier than redesigning one * Steady improvement in parallel software (scientific apps, databases, OS) * Long term goal of the field: * Scale number processors to size of budget, desired performance ### Categories ``` S = Single M = Multiple I = Instruction stream D = Data stream ``` * **SISD** * Uni-processor * **SIMD** * **Data level parallelism** * Each processor **has its own data memory**, but there is a **single instruction memory and control processor**, which fetches and dispatches instructions * Multimedia, graphics performance * **MISD** * No commercial product of this type * **MIMD** * Each processor has its own instructions and operates on its own data * **==Thread== level parallelism** ### Data Level Parallelism: **SIMD** * Vector architectures * SIMD extensions * Graphics Processor Units (GPUs) * **Advantage** * SIMD architectures can exploit **significant data level parallelism** for: * **Matrix-oriented** scientific computing * **Media-oriented** image and sound processors * SIMD is **more energy efficient** than MIMD * Only needs to fetch one instruction per data operation * Makes SIMD attractive for personal mobile devices * SIMD 
allows the programmer to continue to think sequentially

### Vector Architectures

* Basic idea
    * Read sets of data elements into "**vector registers**"
    * Operate on those registers
    * Disperse the results back into memory
* Registers are controlled by the compiler
    * Used to hide memory latency
    * Leverage memory bandwidth
* The basic structure of a vector architecture
    * ![](https://i.imgur.com/nqbMmW3.png)

### Example architecture: **VMIPS**

* **Vector registers**
    * Each register holds a 64-element, 64 bits/element vector
    * Register file has 16 read ports and 8 write ports
* **Vector functional units**
    * Fully pipelined
    * Data and control hazards are detected
* **Vector load-store unit**
    * Fully pipelined
    * One word per clock cycle after initial latency
* **Scalar registers**
    * 32 general-purpose registers
    * 32 floating-point registers

### VMIPS Instructions

* `ADDVV.D Vd, Vs, Vt`
    * add two vectors `Vs` and `Vt`, then put each result in `Vd`
* `ADDVS.D Vd, Vs, Ft`
    * add scalar `Ft` to **each** element of vector `Vs`, then put each result in `Vd`
* `SUBVV.D Vd, Vs, Vt`
* `SUBVS.D Vd, Vs, Ft`
* `SUBSV.D Vd, Fs, Vt`
* `MULVV.D`, `MULVS.D`, `DIVVV.D`, `DIVVS.D`, `DIVSV.D`
* `LV Vd, (Rs)`, `SV (Rd), Vs`
    * vector load / store from the address in `Rs` / `Rd`
* [For more instructions](http://www.cburch.com/cs/330/reading/vmips-ref.pdf)

### **SAXPY** (Scalar Alpha X Plus Y)

* $Y = \alpha X + Y$
    * $X, Y$ are vectors, initially resident in memory
    * $\alpha$ is a scalar
* For different data types, there are:
    * **`SAXPY`**: single precision
    * **`DAXPY`**: double precision
    * `CAXPY`: complex numbers, single precision
    * `ZAXPY`: complex numbers, double precision
* This loop is the so-called SAXPY or DAXPY loop in benchmarks

    ```c=
    for (int i = m; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
    ```

* SAXPY MIPS code

    ```=
            L.D     F0, a         ; load scalar a
            DADDIU  R4, Rx, #512  ; last address to load
    Loop:   L.D     F2, 0(Rx)     ; load X[i]
            MUL.D   F2, F2, F0    ; a * X[i]
            L.D     F4, 0(Ry)     ; load Y[i]
            ADD.D   F4, F4, F2    ; a * X[i] + Y[i]
            S.D     F4, 0(Ry)     ; store into Y[i]
            DADDIU  Rx, Rx, #8    ; increment index to X
            DADDIU  Ry, Ry, #8    ; increment index to Y
            DSUBU   R20, R4, Rx   ; compute bound
            BNEZ    R20, Loop     ; check if done
    ```

* SAXPY VMIPS code

    ```=
            L.D     F0, a       ; load scalar a
            LV      V1, Rx      ; load vector X
            MULVS.D V2, V1, F0  ; vector-scalar multiply
            LV      V3, Ry      ; load vector Y
            ADDVV.D V4, V2, V3  ; add
            SV      Ry, V4      ; store the result
    ```

* Requires 6 instructions vs. almost 600 for MIPS

### Comparison: MIPS vs. VMIPS

* The overhead instructions are not present in the VMIPS code
* When the compiler produces vector instructions for such a sequence, the code is said to be **vectorized** or **vectorizable**
* Loops can be vectorized if there are **no loop-carried dependences**
* VMIPS has fewer **pipeline load interlocks**
* Vector architects call **forwarding of element-dependent operations** "==chaining=="

### Vector Execution Time

* Execution time depends on three factors:
    * Length of the operand vectors
    * Structural hazards
    * Data dependencies
* VMIPS functional units consume one element per clock cycle
    * Execution time is approximately the vector length
* **Convoy**
    * Set of vector instructions that could potentially execute together

### Chimes

* Sequences with **RAW** dependency hazards can be in the same convoy via chaining
* **Chaining**
    * Allows a vector operation to start as soon as the individual elements of its vector source operand become available
* **Chime**
    * Unit of time to execute one convoy
    * $m$ convoys execute in $m$ chimes
    * For a vector length of $n$, this requires approximately $m \times n$ clock cycles

### Example: Total clock cycles with chimes

```=
LV      V1, Rx      ; load vector X
MULVS.D V2, V1, F0  ; vector-scalar multiply
LV      V3, Ry      ; load vector Y
ADDVV.D V4, V2, V3  ; add two vectors
SV      Ry, V4      ; store the sum
```

* Convoys: 3 convoys

    ```=
    LV      MULVS.D
    LV      ADDVV.D
    SV
    ```

* Total clock cycles = 3 chimes × 64 elements per vector = 192 clock cycles

### Multiple Lanes

* Using parallel pipelines
    * Allows for multiple hardware lanes
* Example
    * ![](https://i.imgur.com/zRyvxg8.png)
* Structure of a vector unit containing 4 lanes
    * ![](https://i.imgur.com/eyWa8N3.png)

### Vector Length Register

* Vector length not known at compile time?
    * Use the **Vector Length Register** (**VLR**)
    * Use ==strip mining== for vectors longer than the **maximum vector length** (**MVL**)
    * ![](https://i.imgur.com/BeCuwtZ.png)
    * For vector length $n$, the first strip handles $m = n \bmod MVL$ elements, so the length of each operation is $\le$ MVL, as in the picture

### Vector Mask Registers

* Consider:

    ```c=
    for (i = 0; i < 64; i++) {
        if (X[i] != 0) {
            X[i] = X[i] - Y[i];
        }
    }
    ```

* Use a vector mask register to "**disable**" elements:

    ```=
    LV      V1, Rx      ; load vector X into V1
    LV      V2, Ry      ; load vector Y into V2
    L.D     F0, #0      ; load FP zero into F0
    SNEVS.D V1, F0      ; sets VM(i) to 1 if V1(i) != F0
    SUBVV.D V1, V1, V2  ; subtract under vector mask
    SV      Rx, V1      ; store the result in X
    ```
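The chime estimate above ($m$ convoys $\times$ $n$ elements) can be checked with a trivial helper. This is only the first-order model from the notes: start-up latency and memory-bank stalls are ignored, and `chime_cycles` is a hypothetical name, not part of any real toolchain.

```c
/* First-order chime model: m convoys over n-element vectors take
   about m * n clock cycles (start-up overhead ignored). */
int chime_cycles(int convoys, int vector_length) {
    return convoys * vector_length;
}
```

For the SAXPY example, 3 convoys over 64-element vectors give `chime_cycles(3, 64) == 192`, matching the 192 cycles computed above.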
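The strip-mining scheme described under the Vector Length Register can be sketched in plain C: the first pass covers the $n \bmod MVL$ leftover elements, and every later pass covers a full MVL-element strip, so no single vector operation exceeds MVL. The function name `daxpy_strip_mined` and the use of DAXPY as the payload are illustrative choices, and the inner `for` loop stands in for one vector operation of length `vl` (i.e., setting VLR = `vl` in VMIPS).

```c
#define MVL 64  /* assumed maximum vector length, as in VMIPS */

/* Strip-mined DAXPY: y[i] = a * x[i] + y[i] for i in [0, n). */
void daxpy_strip_mined(int n, double a, const double *x, double *y) {
    int i = 0;
    int vl = n % MVL;                 /* length of the first (short) strip */
    if (vl == 0 && n > 0)
        vl = MVL;                     /* n is a multiple of MVL */
    while (i < n) {
        /* one "vector operation" of length vl (VLR = vl) */
        for (int j = 0; j < vl; j++)
            y[i + j] = a * x[i + j] + y[i + j];
        i += vl;
        vl = MVL;                     /* all remaining strips are full */
    }
}
```

For `n = 130` and `MVL = 64`, this runs strips of 2, 64, and 64 elements, which is exactly the schedule the strip-mining figure depicts.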
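The vector-mask sequence above can also be emulated element by element in C, which makes the semantics explicit: one pass plays the role of `SNEVS.D` (building the mask), and a second pass plays the role of `SUBVV.D` executed under that mask. This is a scalar sketch of the idea, not how the hardware iterates; `masked_sub` and the fixed length of 64 are assumptions for the example.

```c
#define VLEN 64  /* vector length, matching the loop in the notes */

/* X[i] = X[i] - Y[i] only where X[i] != 0, via an explicit mask. */
void masked_sub(double *x, const double *y) {
    unsigned char vm[VLEN];            /* the vector mask register */
    for (int i = 0; i < VLEN; i++)     /* SNEVS.D: VM(i) = (X[i] != 0) */
        vm[i] = (x[i] != 0.0);
    for (int i = 0; i < VLEN; i++)     /* SUBVV.D under the mask */
        if (vm[i])
            x[i] = x[i] - y[i];
}
```

Note that, as in the VMIPS version, elements with a cleared mask bit are left untouched rather than written with a garbage result.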
