###### tags: `計結`

# 1062 - Computer Architecture - 1~3

## (1) Fundamentals Of Computer Design

### Fundamentals of Computer Design
* Reduced Instruction Set Computer (RISC)
    * Emerged in the early 1980s
    * Focused the attention of designers on
        * Instruction Level Parallelism
        * The use of caches
    * Features
        * Far fewer transistors
        * Uniform instruction format
        * Identical general-purpose registers
        * Simple addressing modes
    * Multiple processors per chip rather than uniprocessors

### Classes of computers
* Not important; read this part on your own

### Defining Computer Architecture
* Issues of Instruction Set Architecture (ISA)
    * Class of ISA
        * ISAs today: general-purpose register architectures
        * 80x86
            1. 16 general-purpose registers, 16 floating-point registers
            2. Register-memory
        * MIPS
            1. 32 general-purpose registers, 32 floating-point registers
            2. Load-store
        * Recent ISAs are load-store
    * Memory Addressing
        * Byte addressing
        * MIPS requires that objects be aligned (aligned accesses are faster)
        * 80x86 does not require alignment
    * Addressing modes
        * Register
        * Immediate (for constants)
        * Displacement
        * Register Indirect
        * Indexed
        * Direct/Absolute
        * Memory Indirect
    * Types and sizes of operands
        * Both MIPS and 80x86 support:
            1. 8-bit (ASCII character)
            2. 16-bit (Unicode character or half word)
            3. 32-bit (integer or word)
            4. 64-bit (double word or long integer)
            5. IEEE 754 floating point in 32-bit (single precision)
            6. 64-bit (double precision)
        * 80x86 also supports:
            1. 80-bit floating point (extended double precision)
    * Operations
        * Control flow instructions
            * Conditional branches
            * Unconditional jumps
            * Procedure calls
            * Returns
            * PC-relative addressing
    * Encoding an ISA
        * MIPS
            1. Fixed-length encoding
            2. 32-bit
            3. Simplifies instruction decoding
        * 80x86
            1. Variable-length encoding
            2. 1 to 18 bytes
            3. Takes less space than fixed-length instructions

### Trends in Technology
* Dynamic Power (watts) $= \dfrac{1}{2}(Capacitive\ load)(Voltage)^2(Frequency\ switched)$
* Energy (joules) $= (Capacitive\ load)(Voltage)^2$

### Integrated Circuits Costs
* Terminology:
    * Wafer = the whole silicon disc
    * Die = an individual chip cut from the wafer
* Integrated circuit costs
    * ![](https://i.imgur.com/MwHqLkG.png)
    * $\dfrac{Dies}{Wafer} = \dfrac{Wafer\ area}{Die\ area} - \dfrac{Wafer\ circumference}{Die\ diagonal}\ [-\ test\ dies]$
    * ![](https://i.imgur.com/NYd1oyB.png)
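A minimal sketch of the dies-per-wafer estimate above. The wafer diameter (300 mm) and die area (1.5 cm²) below are illustrative assumptions, not values from the slides:

```python
import math

def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float, test_dies: int = 0) -> int:
    """Estimate dies per wafer: wafer area / die area, minus an edge-loss term
    (wafer circumference / die diagonal), optionally minus dies used for testing."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    circumference = math.pi * wafer_diameter_cm
    die_diagonal = math.sqrt(2 * die_area_cm2)   # diagonal of a square die of that area
    return int(wafer_area / die_area_cm2 - circumference / die_diagonal) - test_dies

# Illustrative numbers only: a 30 cm (300 mm) wafer and a 1.5 cm^2 die.
print(dies_per_wafer(wafer_diameter_cm=30.0, die_area_cm2=1.5))   # ~416 dies
```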
### Define and quantify dependability
* Service Level Agreement (**SLA**): systems alternate between two states of service with respect to an SLA:
    1. **Service Accomplishment** = the service is delivered as specified in the SLA
    2. **Service Interruption** = the delivered service is different from the SLA
* **Failure** = $state_1 \rightarrow state_2$
* **Restoration** = $state_2 \rightarrow state_1$
* Module reliability = a measure of continuous service accomplishment (or of the time to failure)
    * Mean Time To Failure (**MTTF**)
    * Failures In Time (**FIT**) = **failure rate** = 1/**MTTF**
        * Traditionally reported as failures per billion ($10^9$) hours of operation
* Mean Time To Repair (**MTTR**) measures service interruption
* Mean Time Between Failures (**MTBF**) = **MTTF** + **MTTR**
* Module availability = **MTTF** / (**MTTF** + **MTTR**)
    * It measures service as it alternates between the two states (a number between 0 and 1)
* Examples:
    * [Calculating Reliability](https://i.imgur.com/dPsxSDK.png)
    * [Redundant Power Supplies](https://i.imgur.com/nLyWUMB.png)
    * [Quiz 1 - Combination of the above](https://i.imgur.com/rNAK8yi.png)

### Performance
* X is **n times faster** than Y means: $$n = \dfrac{ExecTime_Y}{ExecTime_X} = \dfrac{Performance_X}{Performance_Y}$$
* **Benchmarks** - evaluation of performance:
    * Real programs: including input, output, and options in the UI
    * Kernels: small, key pieces from real programs
    * Toy benchmarks: small, easy to type, and able to run on almost any computer, such as "sorting"
    * Synthetic benchmarks: created to match an average execution profile
* **SPECRatio** $= \dfrac{ExecTime\ on\ reference\ computer}{ExecTime\ on\ computer\ being\ rated} \\ \rightarrow \dfrac{SPECRatio_A}{SPECRatio_B} = \dfrac{ExecTime_B}{ExecTime_A} = \dfrac{Performance_A}{Performance_B}$
    * and we use the **geometric mean** of the **SPECRatio**s $$Geometric\ Mean = \sqrt[n]{\prod_{i=1}^{n}{SPECRatio_i}}$$
    * so $$\dfrac{Geometric\ Mean_A}{Geometric\ Mean_B} = \dfrac{\sqrt[n]{\prod_{i=1}^{n}{SPECRatioA_i}}}{\sqrt[n]{\prod_{i=1}^{n}{SPECRatioB_i}}} = \sqrt[n]{\prod_{i=1}^{n}{\dfrac{SPECRatioA_i}{SPECRatioB_i}}} \\ = \sqrt[n]{\prod_{i=1}^{n}{\dfrac{ExecTimeB_i}{ExecTimeA_i}}} = \sqrt[n]{\prod_{i=1}^{n}{\dfrac{PerformanceA_i}{PerformanceB_i}}}$$
* **Amdahl's Law**
    * $ExecTime_{new} = ExecTime_{old}[(1-Fraction_{Enhanced}) + \dfrac{Fraction_{Enhanced}}{Speedup_{Enhanced}}] \\ \rightarrow Speedup_{Overall} = \dfrac{ExecTime_{old}}{ExecTime_{new}} = \dfrac{1}{(1-Fraction_{Enhanced}) + \dfrac{Fraction_{Enhanced}}{Speedup_{Enhanced}}}$
* **CPU Performance**
    * Terminology:
        * Instruction Count (**IC**)
        * Clock cycles Per Instruction (**CPI**)
    * **CPU time** = (CPU clock cycles for a program)(Clock cycle time) = (**IC**)(**CPI**)(Clock cycle time) = (**IC**)(**CPI**)/(Clock rate)
* **Amdahl's Law** with different **CPI**
    * $Fraction_{Enhanced}$ must be measured as a fraction of clock cycles rather than of instructions: $$Fraction\ of\ clock\ cycles_{Enhanced} = \dfrac{Fraction\ of\ instructions_{Enhanced} \times CPI_{Enhanced}}{Average\ CPI}$$
    * Example: instruction A appears 25% of the time with CPI = 5; instruction B appears 75% of the time with CPI = 1. Total clock cycles (per 100 instructions) = 125 + 75 = 200.
        * The "real" fraction contributed by A = 125/200 = 62.5%
        * The "real" fraction contributed by B = 75/200 = 37.5%
        * In other words, the fraction used in Amdahl's Law is only correct when viewed in terms of clock cycles (see the sketch below)
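A minimal sketch of the example above, combining Amdahl's Law with per-instruction CPI. The instruction mix and CPIs come from the notes; the 2x speedup applied to instruction A is a hypothetical enhancement added for illustration:

```python
# Instruction mix from the example: A appears 25% of the time with CPI 5, B 75% with CPI 1.
freq = {"A": 0.25, "B": 0.75}
cpi  = {"A": 5,    "B": 1}

avg_cpi = sum(freq[i] * cpi[i] for i in freq)                   # 0.25*5 + 0.75*1 = 2.0
cycle_fraction = {i: freq[i] * cpi[i] / avg_cpi for i in freq}  # fraction of *clock cycles*
print(cycle_fraction)                                           # {'A': 0.625, 'B': 0.375}

# Amdahl's Law using the cycle fraction of A, with a hypothetical 2x speedup of A.
f, s = cycle_fraction["A"], 2.0
print(round(1 / ((1 - f) + f / s), 3))                          # 1.455 overall speedup
```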
## (2) Instruction Set

### Instruction Set Principles
* ![](https://i.imgur.com/qGb79Mj.png)
* Properties of a good abstraction
    * Lasts through many generations (**portability**)
    * Used in many different ways (**generality**)
    * Provides **convenient** functionality to higher levels
    * Permits an **efficient** implementation at lower levels
* Classifying Instruction Set Architectures
    * ![](https://i.imgur.com/rtnQpem.png)
    * Two different conventions for ordering the bytes:
        * Little Endian
        * Big Endian
* Addressing Modes
    * ![](https://i.imgur.com/tr4TY9s.png)
    * ![](https://i.imgur.com/WN6G9Gt.png)
* Operations in the Instruction Set
    * ![](https://i.imgur.com/Q4R2vUN.png)
    * PC-relative
        * A displacement is **added** to the program counter
        * Advantageous because the target is often **near** the current instruction
        * Permits the code to run independently of where it is loaded (position independence)
    * Register indirect is useful for
        * Virtual functions or methods
        * Higher-order functions or function pointers
        * **Dynamically** shared libraries
* Conditional Branch Options
    * ![](https://i.imgur.com/UPzArDB.png)
* Encoding an Instruction Set
    * ![](https://i.imgur.com/GJ58llj.png)

### Examples
* **MIPS** instructions
    * Arithmetic/logical: `add, addu, sub, subu, and, or, xor, nor`
    * Memory access: `Load, Store`
    * Control: `Conditional Branch, Unconditional Jump`
* **MIPS** instruction types
    * ![](https://i.imgur.com/ZlCw4Z3.png)
* Compiler Perspective
    * ![](https://i.imgur.com/WNave5D.png)
* Single-Cycle Control and Datapath
    * ![](https://i.imgur.com/LR5tlAn.png)
    * Why is a single-cycle implementation not used?
        * Inefficient
        * The clock cycle is determined by the longest possible path
            * A load instruction uses five functional units in series: instruction memory, register file, ALU, data memory, register file
            * Other instructions could fit in a shorter clock cycle
        * Although CPI = 1, the overall performance is not good
* Multi-Cycle Implementation
    * ![](https://i.imgur.com/7c5I9Vy.png)
    * Main difference compared to the single-cycle implementation
        * Registers are added after every major functional unit to hold the output of that unit until the value is used in a subsequent clock cycle
            1. Instruction register
            2. Memory data register
            3. A and B registers
            4. ALUOut register
    * Every instruction can be implemented in at most 5 clock cycles
    * ![](https://i.imgur.com/X2APiZJ.png)

## (3) Pipeline

### A Pipelined Datapath
* Five stages, one step per stage
* Need to add pipeline registers between stages
* Visualizing Pipelining
    * ![](https://i.imgur.com/tnLa077.png)
* [Example - Basic Performance issues in pipelining](https://i.imgur.com/A3V60Ge.png)

### Limits to pipelining - ***Hazard***
* A **hazard** prevents the next instruction from executing during its designated clock cycle
    * **Structural hazards**: the *hardware cannot support this combination of instructions*
    * **Data hazards**: an instruction depends on the *result of a prior instruction still in the pipeline*
    * **Control hazards**: caused by the delay between the fetching of instructions and decisions about changes in *control flow*
* **Stall**
    * ![](https://i.imgur.com/PFalGwC.png)
* Speedup Equation for Pipelining
    * ![](https://i.imgur.com/SomndU2.png)

### Structural Hazard
* **Def**
    * Some combination of instructions cannot be accommodated because of resource conflicts
* **Solution**
    * **Stall** the pipeline until there is no hazard; the stall is commonly called a **pipeline bubble**
    * ![](https://i.imgur.com/RgaTD3D.png)
* **Discussion**
    * If all other factors are equal, a processor without structural hazards will have a lower CPI (see the sketch after this section)
    * Why would a designer allow structural hazards?
        * To reduce the cost of the unit
            * Supporting both an instruction cache access and a data cache access every cycle requires **twice** the total memory bandwidth
            * Floating-point functional units (e.g., a divide unit that is not fully pipelined)
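A minimal sketch of how a structural hazard shows up in CPI and pipeline speedup. The 40% data-reference frequency and 1-cycle penalty are made-up illustrative numbers; the speedup relation assumes an ideal pipelined CPI of 1, so speedup ≈ pipeline depth / (1 + stall cycles per instruction):

```python
def effective_cpi(ideal_cpi: float, stall_freq: float, stall_penalty: float) -> float:
    """CPI including stalls: ideal CPI + (fraction of instructions that stall)
    * (stall cycles per occurrence)."""
    return ideal_cpi + stall_freq * stall_penalty

def pipeline_speedup(depth: int, stalls_per_instr: float) -> float:
    """Speedup over an unpipelined machine, assuming an ideal pipelined CPI of 1."""
    return depth / (1 + stalls_per_instr)

# Illustrative assumptions: a single memory port, 40% of instructions access data,
# and each conflict with an instruction fetch costs 1 cycle.
print(effective_cpi(ideal_cpi=1.0, stall_freq=0.4, stall_penalty=1))  # 1.4 vs. 1.0 without the hazard
print(pipeline_speedup(5, 0.4))                                       # ~3.57x for a 5-stage pipeline
print(pipeline_speedup(5, 0.0))                                       # 5.0x with no stalls
```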
### Data Hazard
* **Def**
    * Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen when executing the instructions sequentially on an unpipelined machine
* **Three generic types** (instruction `I` occurs before instruction `J` in program order)
    * Read After Write (**RAW**): `J` tries to read an operand before `I` writes it
        * ![](https://i.imgur.com/dCTgdc3.png)
        * Called a "**Dependence**"
        * This hazard results from an actual need for communication
    * Write After Read (**WAR**): `J` writes an operand before `I` reads it
        * ![](https://i.imgur.com/UQl5PPa.png)
        * Called an "**Anti-Dependence**"
        * ***Can't happen*** in the DLX 5-stage pipeline because **reads** are always in stage 2 and **writes** are always in stage 5
    * Write After Write (**WAW**): `J` writes an operand before `I` writes it
        * ![](https://i.imgur.com/oMhlB91.png)
        * Called an "**Output Dependence**"
        * ***Can't happen*** in the DLX 5-stage pipeline because **writes** are always in stage 5
    * We will see **WAR** and **WAW** in more complicated pipelines
* **Solution**
    * **Forwarding** (also called **Bypassing**)
        * A result is forwarded from the output of one unit to the input of another
        * ***Not all*** potential data hazards can be handled by forwarding
        * [Forwarding to Avoid Data Hazard](https://i.imgur.com/bS1QPsT.png)
        * [Data Hazard Even with Forwarding - Insert Bubble](https://i.imgur.com/ggZ3nNW.png)
    * **Software Scheduling**
        * ![](https://i.imgur.com/VJ6fFmN.png)
* **Implementing**
    * Detecting **interlocks** early in the pipeline reduces the hardware complexity
    * Detect the hazard or forwarding need at the **beginning** of the clock cycle that uses the operand
    * Logic to detect the load interlocks
        * ![](https://i.imgur.com/ZFnAn1Z.png)
    * Forwarding conditions
        * ![](https://i.imgur.com/3x01Kai.png)
    * Hardware changes for forwarding
        * ![](https://i.imgur.com/AyudoH5.png)

### Control Hazard (Branch Hazard)
* **Def**
    * When a **branch** is executed, it may or may not change the PC: **taken** or **untaken**
    * Can cause a greater performance loss for the DLX pipeline than data hazards do
* **Solution**
    * Flush: causes a three-cycle stall
        * ![](https://i.imgur.com/jQyNbNU.png)
        * ![](https://i.imgur.com/BugmKUt.png)
    * Determine whether the branch is taken sooner
    * Compute the taken-branch address earlier
    * Solution for DLX (branches test whether a register = 0)
        * Move the zero test to the `ID` stage
        * Add an adder to calculate the new PC in the `ID` stage
        * 1 clock cycle of branch penalty instead of 3
* **Four Branch Hazard Alternatives**
    1. Stall until the branch direction is clear
    2. Predict Branch *Not Taken*
        * Execute **successor** instructions in sequence
        * **Squash** instructions in the pipeline if the branch is actually taken
        * Advantage of late pipeline state update
        * 47% of DLX branches are not taken on average
        * PC+4 is already calculated, so use it to fetch the next instruction
    3. Predict Branch *Taken*
        * 53% of DLX branches are taken on average
        * But the branch target address has not yet been calculated in DLX
            * DLX still incurs a 1-cycle branch penalty
            * Other machines: branch target known before the outcome
    4. Delayed Branch
        * ![](https://i.imgur.com/26nTDv9.png)
        * ![](https://i.imgur.com/mzppigW.png)
        * Compiler effectiveness for a single branch delay slot: it fills about 60% of branch delay slots, and about 80% of the instructions executed in branch delay slots are useful in the computation $\rightarrow$ about 50% (60% x 80%) of slots are usefully filled
* Evaluating Branch Alternatives
    * ![](https://i.imgur.com/vQut0ix.png)

### What makes pipelining hard to implement
* Dealing with exceptions
    * Stopping and restarting execution
* Instruction set complications
* Exceptions in the DLX pipeline
    * ![](https://i.imgur.com/bU51VGh.png)

### Extending the DLX pipeline
* Handle multiple operations
    * ![](https://i.imgur.com/sbKMsMR.png)
* Latencies and initiation intervals
    * ![](https://i.imgur.com/DABNGl5.png)
    * $\rightarrow$ the EX stage may need **more than one cycle**
* Supports multiple outstanding FP operations
    * ![](https://i.imgur.com/XXnmH3a.png)
* Problems
    * The divide unit is not fully pipelined, so structural hazards can occur
    * The instructions have varying running times, so the number of register writes required in a cycle can be larger than 1
    * Instructions can complete in a different order than they were issued
    * Stalls for RAW hazards are more frequent
* Multiple write-backs happening simultaneously
    * ![](https://i.imgur.com/ME1pgha.png)
* RAW hazards
    * The longer pipeline raises the frequency of stalls (see the sketch below)
    * ![](https://i.imgur.com/RnGd1eV.png)
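A minimal sketch of why longer FP latencies make RAW stalls more frequent. The latency values below (FP multiply 6, FP add 3, load 1, integer ALU 0, in the sense of intervening cycles needed to avoid a stall) are illustrative assumptions, not read from the slide images:

```python
# Assumed (illustrative) latencies: the number of intervening cycles needed
# between a producer and a dependent consumer to avoid stalling.
LATENCY = {"fp_mul": 6, "fp_add": 3, "load": 1, "int_alu": 0}

def raw_stalls(producer: str, issue_distance: int) -> int:
    """Stall cycles seen by a dependent instruction issued `issue_distance`
    instructions after the producer (assuming no other stalls in between)."""
    return max(0, LATENCY[producer] - (issue_distance - 1))

# A dependent consumer issued right after an FP multiply stalls far longer than
# one that follows an integer ALU operation, so RAW stalls become more common.
print(raw_stalls("fp_mul", 1))   # 6
print(raw_stalls("fp_add", 1))   # 3
print(raw_stalls("int_alu", 1))  # 0
print(raw_stalls("fp_mul", 4))   # 3 -> scheduling independent work in between helps
```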
### MIPS R4000 Pipeline
* The 8 stages of the MIPS R4000
    * ![](https://i.imgur.com/ov5sN5h.png)
    * ![](https://i.imgur.com/zzQXONe.png)
* A two-cycle load delay
    * ![](https://i.imgur.com/6Q4ueOj.png)
* The basic branch delay
    * ![](https://i.imgur.com/ZyE2M4Z.png)

### Summary: Longer latency pipelines
* The divide unit is not fully pipelined, so structural hazards can occur
* Varying running times
    * The number of register writes required in one cycle can be > 1
    * WAW hazards are possible (instructions do not reach WB in order)
    * Out-of-order completion causes problems with exceptions
* Stalls for RAW hazards will be more frequent
* WAR hazards are not possible in this case, since register reads always occur in ID
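A minimal sketch of how the R4000's deeper pipeline turns its load and branch delays into effective CPI. The two-cycle load delay is from the section above; the three-cycle branch delay and the instruction-mix/stall frequencies are illustrative assumptions:

```python
def effective_cpi(base_cpi: float, stalls: dict) -> float:
    """Base CPI plus the sum of (event frequency * stall cycles per event)."""
    return base_cpi + sum(freq * penalty for freq, penalty in stalls.values())

# Illustrative assumptions: 25% loads with half of them immediately followed by a
# dependent instruction (2-cycle load delay); 15% branches, each paying an average
# 3-cycle delay when nothing useful hides it.
stalls = {
    "load_use": (0.25 * 0.5, 2),   # (frequency, stall cycles per occurrence)
    "branch":   (0.15, 3),
}
print(effective_cpi(1.0, stalls))  # 1.0 + 0.25 + 0.45 = 1.70
```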