###### tags: `計結`
# 1062 - Computer Architecture - 1~3
## (1) Fundamentals Of Computer Design
### Fundamentals of Computer Design
* Reduced Instruction Set Computer (RISC)
* in early 1980s
* Focused the attention of designers on
* Instruction Level Parallelism
* The Use of Caches
* Features
* far fewer Transistors
* Uniform Instruction Format
* identical General Purpose Registers
* simple Addressing Modes
* Multiple processor per chip > Uni-processors
### Classes of computers
* 不重要,自己看
### Defining Computer Architecture
* Issues of Instruction Set Architecture (ISA)
* Class of ISA
* ISAs today: General-purpose register architecture
* 80x86
1. 16 general purpose registers, 16 floating point registers
2. Register-memory
* MIPS
1. 32 general purpose registers, 32 floating point registers
2. Load-store
* Recent ISAs are load-store
* Memory Addressing
* Byte addressing
* MIPS requires that objects must be aligned (accesses are faster)
* 80x86 does not require alignment
* Addressing modes
* Register
* Immediate (for constants)
* Displacement
* Register Indirect
* Indexed
* Direct/Absolute
* Memory Indirect
* Types and sizes of operands
* both MIPS and 80x86 support:
1. 8-bit (ASCII character)
2. 16-bit (Uni-code character or half word)
3. 32-bit (integer or word)
4. 64-bit (double word or long integer)
5. IEEE 754 floating point in 32-bit (single precision)
6. 64-bit (double precision)
* 80x86 also supports:
1. 80-bit floating point (extended double precision)
* Operations
* Control flow instructions
* Conditional branches
* Unconditional jumps
* Procedure Calls
* Returns
* PC-relative addressing
* Encoding an ISA
* MIPS
1. Fixed length encoding
2. 32-bit
3. Simplifies instruction decoding
* 80x86
1. Variable length encoding
2. 1 to 18 bytes
3. Takes less space than fixed-length instructions
### Trends in Technology
* Dynamic Power (watts): $= \dfrac{1}{2}(Capacitive\ load)(Voltage)^2(Frequency\ switched)$
* Energy (joules): $= (Capacitive\ load)(Voltage)^2$
### Integrated Circuits Costs
* 名詞:
* Wafer = 晶圓
* Die = 晶粒、晶片
* Integrated Circuits Costs
* 
* $\dfrac{晶片數}{晶圓} = \dfrac{晶圓面積}{晶片面積} - \dfrac{晶圓圓周長}{晶片對角線長度}[ - 測試晶片數]$
* 
### Define and quantity dependability
* Service Level Agreements (**SLA**): Systems alternate between 2 states of service with respect to an SLA:
1. **Service Accomplishment** = where the service is delivered as specified in SLA
2. **Service Interruption** = where the delivered service is different from the SLA
* **Failure** = $state_1 \rightarrow state_2$
* **Restoration** = $state_2 \rightarrow state_1$
* Module reliability = measure of continuous service accomplishment (or time to failure).
* Mean Time To Failure (**MTTF**)
* Failures In Time (**FIT**) = **Failure Rate** = 1/**MTTF**
* Traditionally reported as failures per billion($10^9$) hours of operation
* Mean Time To Repair (**MTTR**) measures Service Interruption
* Mean Time Between Failures (**MTBF**) = **MTTF** + **MTTR**
* Module availability = **MTTF** / (**MTTF** + **MTTR**)
* It measures service as alternate between the 2 states (number between 0 and 1)
* Examples:
* [Calculating Reliability](https://i.imgur.com/dPsxSDK.png)
* [Redundant Power Supplies](https://i.imgur.com/nLyWUMB.png)
* [Quiz 1 - Combination of the above](https://i.imgur.com/rNAK8yi.png)
### Performance
* X is **n times faster** than Y means: $$n = \dfrac{ExcuTime_Y}{ExcuTime_X} = \dfrac{Performance_X}{Performance_Y}$$
* **Benchmarks** - Evaluation of Performance:
* Real Programs: Including input, output, options in UI
* Kernels: Small, key pieces from real programs
* Toy Benchmarks: Small, easy to type, and run on almost any computer such as "sorting".
* Synthetic Benchmarks: Created to match an average execution profile
* **SPECRatio** $= \dfrac{ExcuTime\ on\ reference\ computer}{ExcuTime\ on\ computer\ being\ rated} \\ \rightarrow \dfrac{SPECRatio_A}{SPECRatio_B} = \dfrac{ExcuTime_B}{ExcuTime_A} = \dfrac{Performance_A}{Performance_B}$
* and we use **Geometric Mean** on **SPECRatio** $$Geometric\ Mean = \sqrt[n]{\prod_{i=1}^{n}{SPECRatio_i}}$$
* so $$\dfrac{Geometric\ Mean_A}{Geometric\ Mean_B} = \dfrac{\sqrt[n]{\prod_{i=1}^{n}{SPECRatioA_i}}}{\sqrt[n]{\prod_{i=1}^{n}{SPECRatioB_i}}} = \sqrt[n]{\prod_{i=1}^{n}{\dfrac{SPECRatioA_i}{SPECRatioB_i}}} \\ = \sqrt[n]{\prod_{i=1}^{n}{\dfrac{ExcuTimeB_i}{ExcuTimeA_i}}} = \sqrt[n]{\prod_{i=1}^{n}{\dfrac{PerformanceA_i}{PerformanceB_i}}}$$
* **Amdahl’s Law**
* $ExcuTime_{new} = ExcuTime_{old}[(1-Fraction_{Enhanced}) + \dfrac{Fraction_{Enhanced}}{Speedup_{Enhanced}}] \\ \rightarrow Speedup_{Overall} = \dfrac{ExcuTime_{old}}{ExcuTime_{new}} = \dfrac{1}{(1-Fraction_{Enhanced}) + \dfrac{Fraction_{Enhanced}}{Speedup_{Enhanced}}}$
* **CPU Performance**
* 名詞:
* Instruction Count (**IC**)
* Clock cycles Per Instruction (**CPI**)
* **CPU time** = (CPU clock cycles for a program)(Clock cycle time)
= (**IC**)(**CPI**)(Clock cycle time)
= (**IC**)(**CPI**)/(Clock rate)
* **Amdahl’s Law** with different **CPI**
* $Fraction_{Enhanced}$ 要修改成$$\ Fraction\ of \ Clock\ cycles_{Enhanced} = Fraction\ of \ Instruction_{Enhanced} * CPI$$
* 如:指令A出現頻率25%,CPI = 5;指令B出現頻率75%,CPI = 1。總Clock Cycle = 125 + 75 = 200。
* 「真正」A的出現頻率 = 125/200 = 62.5%
* 「真正」B的出現頻率 = 75/200 = 37.5%
* 意思是:出現頻率要以「=Clock Cycle」的角度下去看才是正確的
## (2) Instruction Set
### Instruction Set Principles
* 
* Properties of a good abstraction
* Lasts through many generations (**portability**)
* Used in many different ways (**generality**)
* Provides **convenient** functionality to higher levels
* Permits an **efficient** implementation at lower level
* Classifying Instruction Set Architectures
* 
* Two different conventions for ordering the bytes:
* Little Endian
* Big Endian
* Addressing Modes
* 
* 
* Operations in the Instruction Set
* 
* PC-relative
* A displacement is **added** to the program counter
* advantageous because the target is often **near** the current instruction
* Permits the code to run independently of where it is loaded (position independence)
* Register indirect is useful for
* Virtual functions or methods
* High-order functions or function pointers
* **Dynamically** shared libraries
* Conditional Branch Options
* 
* Encoding an Instruction Set
* 
### Examples
* **MIPS** instructions
* Arithmetic logical: `add, addu, sub, subu, and, or, xor, nor`
* Memory Access: `Load, Store`
* Control:`Conditional Branch, Unconditional Jump`
* **MIPS** instruction types
* 
* Compiler Perspective
* 
* Single Cycle Control and Data path
* 
* Why a Single-Cycle Implementation is not used?
* Inefficient
* The clock cycle is determined by the longest possible path
* Load instruction: use five functional units in series: Instruction memory, register file, ALU, data memory, register file
* Other instructions could fit in a shorter clock cycle
* Although CPI=1, the overall performance is not good
* Multi-cycle Implementation
* 
* Main difference compared to single cycle implementation
* Registers are added after every major functional unit to hold the output of that unit until the value is used in a subsequent clock cycle
1. Instruction register
2. Memory data register
3. A and B registers
4. ALUOut register
* Every instruction can be implemented in at most 5 clock cycles
* 
## (3) Pipeline
### A Pipelined Datapath
* Five stages, one step per stage
* Need to add pipeline registers between stages
* Visualizing Pipelining
* 
* [Example - Basic Performance issues in pipelining](https://i.imgur.com/A3V60Ge.png)
### Limits to pipelining - ***Hazard***
* **Hazard** prevents next instruction from executing during its designated clock cycle
* **Structural hazards**: *Hardware cannot support this combination of instructions*
* **Data hazards**: Instruction depends on *result of prior instruction still in the pipeline*
* **Control hazards**: Caused by delay between the fetching of instructions and decisions about changes in *control flow*
* **Stall**
* 
* Speed Up Equation for Pipelining
* 
### Structural Hazard
* **Def**
* Some combination of instructions cannot be accommodated because of resource conflicts
* **Solution**
* **Stall** the pipeline until no hazard, commonly called a **pipeline bubble**
* 
* **Discussion**
* If all other factors are equal, a processor without structural hazards will have a lower CPI
* Why would a designer allow structural hazard?
* To reduce cost of the unit
* Supporting both an instruction cache and data cache every cycle requires **twice** the total memory bandwidth
* Functional unit to deal with Floating point instruction
### Data Hazard
* **Def**
* Data hazards occur when the pipeline changes the order or read/write accesses to operands so that the order differs from the order seen sequentially executing instructions on an un-pipelined machine
* **Three Generic**
* Read After Write (**RAW**): `J` tries to read operand before `I` writes it
* 
* Called an "**Dependence**"
* This hazard results from an actual need for communication.
* Write After Read (**WAR**): `J` writes operand before `I` reads it
* 
* Called an "**Anti-Dependence**"
* ***Can’t happen*** in DLX 5 stage pipeline because: **Reads** are always in stage 2, and **Writes** are always in stage 5
* Write After Write (**WAW**): `J` writes operand before `I` writes it
* 
* Called an "**Output Dependence**"
* ***Can’t happen*** in DLX 5 stage pipeline because: **Writes** are always in stage 5
* Will see **WAR** and **WAW** in more complicated pipelines
* **Solution**
* **Forwarding** (also called **Bypassing**)
* A result is forwarded from the output of one unit to the input of another
* ***Not all** potential data hazards can be handled by Forwarding*
* [Forwarding to Avoid Data Hazard](https://i.imgur.com/bS1QPsT.png)
* [Data Hazard Even with Forwarding - Insert Bubble](https://i.imgur.com/ggZ3nNW.png)
* **Software Scheduling**
* 
* **Implementing**
* Detecting **interlocks** early in the pipeline reduces the hardware complexity
* Detect the hazard or forwarding at the **beginning** of a clock cycle that uses an operand
* Logic to detect the Load Interlocks
* 
* Forwarding Conditions
* 
* Hardware change for Forwarding
* 
### Control Hazard (Branch Hazard)
* **Def**
* When a **branch** is executed, it may or may not change the PC: **Taken** or **Untaken**
* Can cause a greater performance loss for the DLX pipeline than do data hazards
* **Solution**
* Flush: cause three stage Stalls
* 
* 
* Determine branch taken or not sooner
* Compute taken branch address earlier
* Solution for DLX (branch test whether register = 0)
* Move Zero test to `ID` stage
* Adder to calculate new PC in `ID` stage
* 1 clock cycle penalty for branch versus 3
* **Four Branch Hazard Alternatives**
1. Stall until branch direction is clear
2. Predict Branch *Not Taken*
* Execute **successor** instructions in sequence
* **Squash** instructions in pipeline if branch actually taken
* Advantage of late pipeline state update
* 47% DLX branches not taken on average
* PC+4 already calculated, so use it to get next instruction
3. Predict Branch *Taken*
* 53% DLX branches taken on average
* But haven’t calculated branch target address in DLX
* DLX still incurs 1 cycle branch penalty
* Other machines: branch target known before outcome
4. Delayed Branch
* 
* 
* Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots useful in computation
$\rightarrow$ About 50% (60% x 80%) of slots usefully filled
* Evaluating Branch Alternatives
* 
### What makes pipelining hard to implement
* Dealing with exceptions
* Stopping and restarting execution
* Instruction set complication
* Exceptions of DLX pipeline
* 
### Extending the DLX pipeline
* Handle multiple operations
* 
* Latencies and initiation intervals
* 
* $\rightarrow$ EX stage **needs not only one cycle**
* Supports multiple outstanding FP operations
* 
* Problems
* The divide unit is not fully pipelined. Structural hazards can occur.
* The instructions have varying running times. The number of register writes required in a cycle can be larger than 1.
* Instructions can complete in a different order than they were issued.
* Stalls for RAW hazards are more frequent.
* Multiple Write Back Simultaneously
* 
* RAW hazards
* The longer pipeline raises the frequency of stalls.
* 
### MIPS R4000 Pipeline
* 8 stages of MIPS R4000
* 
* 
* A two-cycle load delay
* 
* The basic branch delay
* 
### Summary: Longer latency pipelines
* The divide unit is not fully pipelined, structural hazards can occur.
* Varying running time
* Number of register writes required in one cycle can be > 1
* WAW hazards are possible (not reach WB in order)
* Out of order completion causing problems with exception
* Stalls for RAW will be more frequent
* WAR is not possible in this case since the register reads always occurs in ID