# 計算機組織期末整理
## The Processor
### CPU performance factors
* Instruction count (determined by ISA and compiler)
* CPI and cycle time (determined by CPU hardware)
### Instruction Execution
1. Fetch instruction: PC -> instruction memory
2. Read registers: register numbers -> register file
3. Others (depending on instruction class)
* Use ALU to calculate
* Access data memory for load/store
* PC <- target address or PC + 4
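
A minimal Python sketch of these steps, assuming a toy setup where `imem` is a list of 32-bit instruction words and `regs` is the register file (both hypothetical names):
```
# One step of the fetch/decode cycle for a toy MIPS-like machine.
def step(pc, imem, regs):
    instr = imem[pc // 4]          # 1. fetch: PC indexes instruction memory
    rs = (instr >> 21) & 0x1F      # 2. register numbers come from fixed fields
    rt = (instr >> 16) & 0x1F
    a, b = regs[rs], regs[rt]      #    read the register file
    # 3. what happens next (ALU, data memory, PC update) depends on the class
    return pc + 4                  # default next PC; branch/jump would override
```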
### CPU overview
Because we assume memory supports only one read/write per clock cycle, memory is split into an instruction memory and a data memory.

### 各指令需要用到的CPU component
#### Instruction Fetch

#### R-type Datapath
| op | rs | rt | rd | shamt | funct |
| -- | -- | -- | -- | ----- | ----- |
| 6 | 5 | 5 | 5 | 5 | 6 |

#### Load/Store Datapath
| op | rs | rt | immed |
| -- | -- | -- | -- |
| 6 | 5 | 5 | 16 |

#### Branch Datapath
beq rs, rt, imm16

#### Jump
| op | address |
| -- | -- |
| 6 | 26 |
Target = the upper 4 bits of (PC+4) concatenated with (address × 4); i.e., the 26-bit field is shifted left 2 bits, and the top 4 bits come from PC+4.
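
A sketch of this computation in Python, assuming `pc` is the address of the jump instruction and `addr26` is the 26-bit field:
```
# J-type target: top 4 bits of PC+4 concatenated with the 26-bit field << 2.
def jump_target(pc, addr26):
    return ((pc + 4) & 0xF0000000) | (addr26 << 2)

print(hex(jump_target(0x80001000, 0x2)))  # 0x80000008
```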
### Composing

### ALUop and ALU Control
The opcode first determines the instruction class. For lw, sw, and beq, the ALU control can be decided without the function code;
for R-type instructions, the function code is also needed to determine the ALU control.
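
A runnable sketch of this two-level decode, using the standard MIPS ALUOp encoding (00 for lw/sw, 01 for beq, 10 for R-type) and the usual 4-bit ALU control values:
```
# funct field -> 4-bit ALU control, for R-type instructions.
R_TYPE_FUNCT = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}

def alu_control(alu_op, funct):
    if alu_op == 0b00:          # lw/sw: address calculation -> add
        return 0b0010
    if alu_op == 0b01:          # beq: comparison -> subtract
        return 0b0110
    return R_TYPE_FUNCT[funct]  # R-type: the funct field decides
```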


### Control signal

#### logic of control signal

## Pipeline
### MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

* Need registers between stages: To hold information produced in previous cycle

### pipeline performance
Suppose ID and WB each take 100 ps while the other stages take 200 ps. The time each instruction class needs:
* lw: 800ps
* sw: 700ps
* R-format: 600ps
* beq: 500ps
Without pipelining, the cycle time must fit the longest instruction, so it is 800 ps; with pipelining, the cycle time is set by the slowest stage, so it is 200 ps.
##### Throughput increases; latency (time per instruction) does not decrease
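
A quick check of these numbers (stage times as in the example above; `n` is an arbitrary instruction count):
```
stages = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}  # ps

cycle_single = sum(stages.values())   # 800 ps: must fit the longest instruction (lw)
cycle_pipe = max(stages.values())     # 200 ps: set by the slowest stage

n = 1000                              # instructions
t_single = n * cycle_single                   # 800,000 ps
t_pipe = (len(stages) + n - 1) * cycle_pipe   # fill the pipe, then 1 per cycle
print(t_single / t_pipe)              # ~3.98x, approaching the 4x bound (800/200)
```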
### speedup
If all stages are balanced (i.e., all take the same time)
If not balanced, speedup is less
```
Time between instructions (pipelined)
  = Time between instructions (nonpipelined) / Number of stages
```
### MIPS ISA designed for pipelining
* All instructions are 32 bits (Easier to fetch in one cycle)
* Few and regular instruction formats (Can decode and read registers in one step)
* Load/store addressing
* Alignment of memory operands (Memory access takes only one cycle)
### Load example
Within each clock cycle, the register file is written in the first half and read in the second half (so a value written in WB can be read by ID in the same cycle).
* Instruction fetch

* Instruction decode

* Execution

* Memory access

* Write back

### Multi-Cycle Pipeline






### WB issue for load and R-type
The write-register number used in WB should be decided only when the instruction actually reaches WB (so it must be carried along through the pipeline registers); otherwise, in a multi-cycle pipeline, which register to write back would be lost.

## Hazard
Situations that prevent starting the next instruction in the next cycle
### Structural hazards
Conflict for use of a resource
* A load needs five stages (IF, ID, EX, MEM, WB)
* An R-type instruction needs only four stages (IF, ID, EX, WB)

When a load runs one instruction ahead of an R-type, both would need WB in the same cycle! The solution is to force **every instruction to take five stages**, no matter how many it actually needs.
### Data Hazards
An instruction depends on completion of data access by a previous instruction
* RAW: i2 tries to read an operand before i1 writes it
* WAR: i2 tries to write an operand before i1 reads it
  * Can't happen in the MIPS 5-stage pipeline (reads happen early in ID, writes late in WB)
* WAW: i2 tries to write an operand before i1 writes it
  * Occurs only in pipelines that write in more than one stage
#### Forwarding
Requires extra connections in the datapath
There are two cases: (1) **EX hazard**: the result at the end of one instruction's EX stage is forwarded back to the EX inputs of the next instruction; (2) **MEM hazard**: the result at the end of the MEM stage is forwarded to the EX inputs of the instruction two positions later.

* Forwarding Condition
* EX hazard
```
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 10

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 10
```
* MEM hazard
```
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 01

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 01
```
* Datapath with Forwarding
Feed EX/MEM.RegWrite, EX/MEM.RegisterRd, MEM/WB.RegWrite, MEM/WB.RegisterRd, and ID/EX.RegisterRs/Rt into the **Forwarding unit**, which computes ForwardA and ForwardB.
A value of 10 selects the result from the MEM stage (the EX/MEM register); 01 selects the result from the WB stage (the MEM/WB register); 00 selects the value read from the register file.
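
The same conditions as a runnable Python sketch of the forwarding unit. Checking the EX hazard first gives it priority over the MEM hazard, which is exactly the refinement discussed in the next subsection:
```
# Returns the 2-bit mux select for one ALU operand.
# reg_src is ID/EX.RegisterRs (ForwardA) or ID/EX.RegisterRt (ForwardB).
def forward_select(reg_src, ex_mem_regwrite, ex_mem_rd,
                   mem_wb_regwrite, mem_wb_rd):
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == reg_src:
        return 0b10   # EX hazard: take the EX/MEM (MEM-stage) result
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == reg_src:
        return 0b01   # MEM hazard: take the MEM/WB (WB-stage) result
    return 0b00       # no hazard: take the register-file value
```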

#### Double Data Hazard
Consider the following situation:
```
add $1, $1, $2
add $1, $1, $3
add $1, $1, $4
```
When the third instruction executes, both the EX hazard and the MEM hazard conditions match, but we clearly want the $1 produced by the second instruction. So the MEM hazard condition must be revised:
only forward from MEM/WB when the EX hazard does not also apply.
```
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and not (EX/MEM.RegWrite
             and (EX/MEM.RegisterRd ≠ 0)
             and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 01

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and not (EX/MEM.RegWrite
             and (EX/MEM.RegisterRd ≠ 0)
             and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 01
```
#### Load-Use Data Hazard
Can’t always avoid stalls by forwarding

```
if (ID/EX.MemRead
    and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
         or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
        stall and insert bubble
```
* Datapath with Load-use hazard detection
Feed ID/EX.MemRead, ID/EX.RegisterRt, and IF/ID.RegisterRs/Rt into the **Hazard detection unit**.
If the condition above is true:
1. Set all control values in the ID/EX register to zero, so the following EX, MEM, and WB stages execute nothing (a bubble)
2. Prevent the PC from updating
(the using instruction is decoded again)
3. Prevent the IF/ID register from updating
(the following instruction is fetched again)

The 1-cycle stall allows MEM to read the data for lw, which can subsequently be forwarded to the EX stage.

#### Code Scheduling
Reorder code to avoid use of load result in the next instruction
Since some instructions can be executed in a different order without changing the result, we can reorder the code to reduce hazards.
### Branch(Control) Hazards
* Need to compare registers and compute target early in the pipeline
Add hardware to do the branch execution (i.e., comparison) in **ID stage**

By moving the comparison to ID stage, **only one clock cycle** is wasted

* Data Hazard for Branch
* case 1: If the branch's rs/rt is the destination of the 2nd or 3rd preceding ALU instruction, forwarding into ID is enough

* case 2: If it is the destination of the immediately preceding ALU instruction or of the 2nd preceding load, even with forwarding 1 stall cycle is needed

* case 3: If it is the destination (rt) of the immediately preceding load (lw), 2 stall cycles are needed

#### Branch Prediction
If the prediction is correct, no bubble is needed.
Two prediction approaches:
1. Static branch prediction
Based on typical branch behavior
2. Dynamic branch prediction
* Branch prediction buffer (aka branch history table)
* Indexed by recent branch instruction addresses
* Stores outcome (taken/not taken)
* To execute a branch
* Check table, expect the same outcome
* Start fetching from fall-through or target
* If wrong, **flush pipeline and flip prediction**
1. 1-Bit Predictor
Start with an arbitrary prediction (taken / not taken); after a misprediction, predict the opposite the next time that branch executes.
Consider a doubly nested loop:
```
outer: ...
...
inner: ...
...
beq ..., ..., inner
...
beq ..., ..., outer
```
Suppose we initially predict taken. Apart from the first pass through the inner loop, which mispredicts once (on exit), every later pass mispredicts twice (on entry and on exit).
This happens because the 1-bit predictor tolerates no error: a single misprediction immediately flips the prediction.
2. 2-Bit Predictor
Only change prediction on **two successive mispredictions**

With this scheme, each pass through the inner loop in the example above mispredicts only once (on exit).
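
A minimal Python model of the 2-bit saturating counter, run on the inner-loop branch from the example (four taken iterations, then an exit):
```
# States 0,1 predict not taken; 2,3 predict taken.
class TwoBitPredictor:
    def __init__(self, state=3):                 # start "strongly taken"
        self.state = state
    def predict(self):
        return self.state >= 2                   # True = predict taken
    def update(self, taken):
        self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

p = TwoBitPredictor()
misses = 0
for outcome in [True, True, True, True, False] * 2:  # two passes of the inner loop
    misses += p.predict() != outcome
    p.update(outcome)
print(misses)   # 2: one miss per pass, both on loop exit
```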
#### Branch target buffer
* Cache of target addresses
* Indexed by PC when instruction fetched
If it hits and the instruction is a branch predicted taken, the target can be fetched immediately
### Exceptions and Interrupts
"Unexpected” events requiring change in flow of control
* Exception: Arises within the CPU
* Interrupt: From an external I/O controller
Dealing with them without sacrificing performance is hard
#### Handling Exceptions
In MIPS, exceptions managed by a System Control Coprocessor (CP0)
1. Exception occurs
2. Save PC of offending (or interrupted) instruction
In MIPS save in **Exception Program Counter (EPC)**
3. Transfer control to OS and perform some actions for exception
Save indication of the problem
1. **Single entry**: In MIPS: Cause register (jump to a **single address**, 8000 0180 in hex)
2. **Multiple entry**: Vectored Interrupts
* The CPU knows the addresses of the Interrupt Service Routines (ISR) in advance
* The interrupting device sends a vector to CPU
* The CPU compares the vector to an interrupt table in memory, and then carries out the correct ISR for that device
4. If restartable (e.g., most I/O interrupts)
* Take corrective action
* Use EPC to return to program
* **Otherwise: Terminate program**
#### Exception in a Pipeline
1. Exception occurs (e.g., overflow on add in EX)
2. Complete previous instructions
3. **Flush add and subsequent instructions**
4. Set Cause and EPC register values
5. Transfer control to handler
When an overflow occurs, EX.Flush is set to 1, and the MUX in front of the PC is switched to the exception handler address.

## Instruction-Level Parallelism (ILP)
To increase ILP
* Deeper pipeline
* Issue multiple instructions in one cycle
### Multiple issue
* Static multiple issue
* Compiler **groups instructions** to be issued together
* Compiler detects and avoids hazards (e.g., two instructions with a hazard between them cannot go in the same group)
* Think of an issue packet as a very long instruction word (VLIW)
* Dynamic multiple issue
* **CPU(hardware)** examines instruction stream and chooses instructions to issue each cycle
* CPU resolves hazards using advanced techniques **at runtime**
### Speculation
The idea is to guess whether an instruction depends on other instructions. If at first glance there is no dependency (no hazard that can be detected up front), the instruction is executed early (without writing its results). Once the earlier instructions complete and there really was no dependency, the instruction can safely finish; if the guess turns out wrong, flush and redo.
* Compiler can reorder instructions
* Hardware can look ahead for instructions to execute
* **Buffer results** until it determines they are actually needed
* **Flush buffers on incorrect speculation**
### Scheduling Static Multiple Issue
* Reorder instructions into issue packets
* **No dependencies** within a packet
* **Pad with nop if necessary**
#### MIPS with Static Dual Issue
Two-issue packets
* One ALU/branch instruction
* One load/store instruction
* 64-bit aligned

#### Additional hardware needed
The black lines are the datapath for the ALU/branch instruction;
the blue lines are the datapath for the load/store instruction.

#### Loop Unrolling
Unroll the loop to expose as much parallelism as possible.
The key is to use a different temporary register for each unrolled copy of the loop body (register renaming), which removes the dependencies between copies.
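
A Python analogue of the idea: unrolling by 4 with a separate accumulator per copy plays the role of renaming registers, so the four additions in each round are independent (assumes the length is a multiple of 4):
```
# Rolled: every iteration depends on the single accumulator s.
def sum_rolled(a):
    s = 0
    for x in a:
        s += x
    return s

# Unrolled by 4 with distinct accumulators ("renamed registers").
def sum_unrolled(a):            # assumes len(a) % 4 == 0
    s0 = s1 = s2 = s3 = 0
    for i in range(0, len(a), 4):
        s0 += a[i]
        s1 += a[i + 1]
        s2 += a[i + 2]
        s3 += a[i + 3]
    return s0 + s1 + s2 + s3

assert sum_rolled(list(range(8))) == sum_unrolled(list(range(8)))
```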
### Dynamic Multiple Issue
Four stages:
1. IF/ID
* **Execute in order**, very fast
2. Reservation station
* **Buffers holding** operands and operation
* Once all operands and functional unit are ready, send to functional units
3. Functional units
* **Completes the operation**, sends results to the commit unit and reservation stations (because other instructions may be waiting for the result)
4. Commit unit
* **Re-orders the register writes** (reorder buffer: the final writes must happen in program order)

* The complexity of dynamic scheduling and speculation costs power (wider issue and more speculation consume more energy)
* Multiple simpler cores may be better
## Cache Basic
Principle of Locality
* Temporal locality
Items accessed recently are likely to be accessed again soon (keep just-used items in the cache)
* Spatial locality
Items near those accessed recently are likely to be accessed soon (when fetching data, also fetch its neighbors, i.e., fetch one block at a time)
### Memory Hierarchy Levels
* Block (aka line): unit of copying
* If accessed data is present in upper level: Hit
**Hit ratio: hits/accesses**
* If accessed data is absent: Miss
**Miss ratio: misses/accesses = 1 – hit ratio**

### Cache Memory
The level of the memory hierarchy closest to the CPU
### Direct Mapped Cache
* Location determined by address
* Direct mapped: only one choice
```
Cache block index = (Block address) modulo (#Blocks in cache)
```

#### Address Subdivision
1. Use the index to select a cache entry
2. Compare the tag
3. If both match (and the valid bit is set) -> Hit
4. Read out the data

#### Example
Suppose a cache has 64 blocks and each block holds 16 bytes.
1. To what block number does (byte) address 1200 map(using direct mapped cache)?
Block address = floor(1200/16) = 75
Block number = 75 % 64 = **11**
2. How many bits for tag, index, and offset?
64 blocks -> 64 = 2^6 -> 6 bits index
16 bytes/block -> 16 = 2^4 -> 4 bits for offset
The remaining is tag -> 32 – 6 – 4 = 22 bits
3. How many bits does the cache need in total?
#blocks * bits per entry ->
#blocks * (valid bit + tag + data) -> 64 * (1 + 22 + 16*8) = 9664 bits
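
A sketch that reproduces all three answers, assuming 32-bit addresses:
```
NUM_BLOCKS, BLOCK_BYTES, ADDR_BITS = 64, 16, 32

block_addr = 1200 // BLOCK_BYTES               # 75
index = block_addr % NUM_BLOCKS                # 11: the cache block number

offset_bits = BLOCK_BYTES.bit_length() - 1     # 4
index_bits = NUM_BLOCKS.bit_length() - 1       # 6
tag_bits = ADDR_BITS - index_bits - offset_bits  # 22

total_bits = NUM_BLOCKS * (1 + tag_bits + BLOCK_BYTES * 8)  # valid + tag + data
print(index, tag_bits, total_bits)             # 11 22 9664
```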

### Cache Misses
* On cache hit, CPU proceeds normally
* On cache miss
* Stall the CPU pipeline
* Fetch block from next level of hierarchy
* Instruction cache miss
Restart instruction fetch
* Data cache miss
Complete data access
### Data-write policies
#### Write-Through
* On data-write hit, could just update the block in cache, but then **cache and memory would be inconsistent**
* Write through: **also update memory**
* But makes writes take longer
* Solution: **write buffer**
* Holds data waiting to be written to memory
* CPU continues immediately
Only stalls on write if write buffer is already full
#### Write-Back
* Alternative: On data-write hit, just update the block in cache
* When a **dirty block is replaced**, write it back to memory (memory is updated only when the block is evicted from the cache)
#### Write Miss
* For Write-Through: No-write allocation (aka Write around): **don't fetch** the block on a write miss (write directly to memory without bringing the block into the cache)
* Since programs often write a whole block before reading it (e.g., initialization)
* For write-back: Usually **fetch** the block
### Memory bandwidth design
* One-word-wide memory organization
Simple; requires the least hardware
* Wide memory organization
Largest bandwidth
* Interleaved memory organization

#### Miss Penalty of Different Designs
Assume:
* 1 memory bus clock to send the address
* 15 memory bus clocks for each DRAM word access initiated
* 1 memory bus clock to send a DRAM word of data
* A cache block = 4 words
1. A one-word-wide bank of DRAMs
Miss penalty = 1 + 4 x 15 + 4 x 1 = 65 cycles
Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
2. A four-word-wide bank of DRAMs
Miss penalty = 1 + 1 x 15 + 1 = 17 cycles
Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
3. A four-bank, one-word-wide bus of DRAMs
Miss penalty = 1 + 1 x 15 + 4 x 1 = 20 cycles
Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
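
The same three designs computed from the stated timing assumptions:
```
ADDR, ACCESS, XFER, WORDS = 1, 15, 1, 4   # bus cycles; block = 4 words = 16 bytes

penalties = {
    "one-word-wide":         ADDR + WORDS * ACCESS + WORDS * XFER,  # 65
    "four-word-wide":        ADDR + 1 * ACCESS + 1 * XFER,          # 17
    "four-bank interleaved": ADDR + 1 * ACCESS + WORDS * XFER,      # 20
}
for name, p in penalties.items():
    print(name, p, "cycles,", 16 / p, "B/cycle")
```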
### Measuring Cache Performance
CPU time can be divided into two parts
* Program execution cycles
* Memory stall cycles
```
Memory stall cycles = (Memory accesses / Program) x Miss rate x Miss penalty
```
### Average memory access time (AMAT)
AMAT = Hit time + Miss rate × Miss penalty
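
A worked sketch of both formulas, with assumed numbers (1-cycle hit time, 5% miss rate, 100-cycle miss penalty, and a hypothetical access count):
```
hit_time, miss_rate, miss_penalty = 1, 0.05, 100   # assumed values, in cycles

amat = hit_time + miss_rate * miss_penalty          # 6 cycles on average

accesses_per_program = 1_000_000                    # hypothetical
stall_cycles = accesses_per_program * miss_rate * miss_penalty  # 5,000,000
print(amat, stall_cycles)
```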
###### tags: `考試複習`