---
# System prepended metadata

title: 計組 L4

---

# 計組 L4 
# 1st期中考
## 1.ALU基本實現功能
#### 1-bit概念
![](https://i.imgur.com/XuLfavU.png =70%x70%)
-
### ADD
:::info
![](https://i.imgur.com/saQEdwr.png)
- ALU1的Cin(CarryIn)是ALU0的Cout(CarryOut)
:::
### SUB
:::info
![](https://i.imgur.com/dWBGV9F.png) 
:::success
### 衍伸：NOR--迪摩根定律
![](https://i.imgur.com/FR0l7M5.png)
:::info
:::

### SLT
- 要求a<b , 所以可寫成(a-b)<0 
- 滿足時result = 1 (0...01, 從左至右依序是31-0bit)
:::info
![](https://i.imgur.com/E66WKL0.png)
:::

### Overflow(溢位)
```
如果最高位元的CarryIn與CarryOut一者為1一者為0，則有溢位。
即： Overflow = Cin[N-1] XOR Cout[N-1]
```
---
## 2.CPU Overview
### Multiplexer
:::info
![](https://i.imgur.com/QQ7lVNV.png =80%x80%)
上面這是CPU的概觀，但不能把線都混在一起
→所以需要Mux(多工器), 以便完成多種功能
:::
### Build a datapath.
:::info
#### - Instruction Fetch
![](https://i.imgur.com/eDufVFB.png =60%x60%)
#### - R-format Instructions
![](https://i.imgur.com/lnhOchd.png =80%x80%)

#### - Load/Store Instructions
![](https://i.imgur.com/EbLWS69.png =80%x80%)

#### - Branch Instructions
![](https://i.imgur.com/iwMvBJo.png =80%x80%)

:::

### 問題：要怎麼把這四個電路圖combine在一起成為一個datapath?
:::info
- 用Mux(多工器)！
![](https://i.imgur.com/urEhfuP.png =80%x80%)
:::
### "Write Register"的不同
:::info
![](https://i.imgur.com/AtdZVdV.png =70%x70%)
- R-format 是 rd (addr[15:11])
- lw 是 rt (addr[20-16])
:::
### So,the correct FULL Datapath is...
:::info
![](https://i.imgur.com/x5CTq2y.png)
>
:::success
### 以下三種不同路線R-format, Load, Branch
- R-format
![](https://i.imgur.com/cRv287x.png)
- Load(以lw當example)
![](https://i.imgur.com/ZMi9e7y.png)
- Store (rt寫入Data Memory的Write Data這個input port)
![](https://i.imgur.com/fzpIsG7.png)
- Branch
![](https://i.imgur.com/lg9u9ci.png)

:::

:smile::smile::smile::smile::smile::smile::smile::smile:1st期中考L4.P.35到這裡喔

---
# 2nd期中考
### ALU功能: function field
:::warning
![](https://i.imgur.com/Uv3H23n.png =60%x60%)
<font color="black">**註：lw, sw 要加(ADD)immediate，beq 則是要把 rs, rt 相減(SUB)**</font>

![](https://i.imgur.com/eU6H7I8.png =60%x60%)
:::
### Control的架構
:::success
![](https://i.imgur.com/zRdmso7.png =60%x60%)
<font color="black">
**Hint**
- rt 在load，是當寫入的register
- rd都是寫入
</font>


---

### Control lines分別為0與1時的情況:


| 控制信號 | 0 | 1 |
| :--------: | :--------: | :--------: |
| RegDst    | 寫入目的暫存器: rt |寫入目的暫存器: rd|
| RegWrite | None | Write register = Write Data寫入 |
| ALUSrc | ALU input2 = Read Data2寫入 | ALU input2 = instruction[15:0]經過Sign Extension的值 |
| PCSrc | PC = PC+4 | PC = Branch Target
| MemRead | None | 允許要被讀取的資料輸出記憶體(lw) |
| MemWrite | None | 允許要被寫入的資料輸入記憶體(sw) |
| MemtoReg | Write Data是來自ALU計算結果 | Write Data是來自Data Memory

【原圖】
![](https://i.imgur.com/V3Ljfu7.png =80%x80%)

:::
### R-format
![](https://i.imgur.com/YxX0XQ6.png)
### Load
![](https://i.imgur.com/7C19M4A.png)
### Branch
![](https://i.imgur.com/ecNzL7u.png)
:::info
### <font color="black">綜合比較: Control line數值</font>
![](https://i.imgur.com/GxAD367.png)

:::
### Jump
![](https://i.imgur.com/7qGZ4aQ.png =60%x60%)
![](https://i.imgur.com/87jWlAi.png)

---

### Issue: 效能缺陷分析
- Longest delay determines clock period
-- **Critical path: load instruction**
-- *Instruction memory > register file > ALU > data memory > register file*
- Not feasible to vary period for different instructions
- **Violates design principle:**
-- **Making the common case fast**
- We will improve performance by **pipelining**

---
### Pipelining: 將工作(job)執行劃分為數個stage(step)
![](https://i.imgur.com/4nlER4m.png =60%x60%)
**`HINT`: 如果每個stage花費的時間大約相等，並有足夠的jobs時，Speedup ≒ Stage數目**
### MIPS Pipeline
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

---
### Single-cycle v.s Pipeline
Assume following,
![](https://i.imgur.com/SUDWiaq.png =60%x60%)
![](https://i.imgur.com/vvIh4DI.png =60%x60%)
For 3-load-word instructions,
**Single-cycle: 2400ps
Pipeline: 1400ps**

---
### Pipeline加速特性
- 增加整體jobs生產量(throughput)，但並不會減少單一job的延遲時間(latency)
- 如果Stage時間不盡相等，Speedup會下降
### Pipeline執行時間及加速公式
假設一週期時間 T, 管線機器有 S 個階段, 若有 N 個指令要執行:
- Execution Time = [ ( S-1 )+N ] * T
- CPI = [ ( S-1 )+N ] / N
- Speedup = S * N * T / Execution Time (Single-Cycle執行時間/Pipeline執行時間)
### 專為Pipleline設計的指令集-MIPS: 特點
- All instructions are 32-bits: Stage1指令擷取, Stage2解碼變簡單
- Few and regular instruction formats: Stage2解碼可同時讀取register file
- Only Load/store accessing memory: 可以在Stage3 Calculate address, Stage4 access memory
- Alignment of memory operands: 如此就不會有單一資料傳輸指令需要兩次資料記憶體存取

---
### Hazards(危障): 管線中不能順利在下個週期, 執行下個instruction
- `Structure Hazards`: 硬體資源不夠，導致同一時間要執行多個指令卻無法
- `Data Hazards`: 管線中某一指令需要用到前面指令(還在管線中)尚未產生的結果(**資料相依, data dependency**), 也可說執行指令所需的資料還未get
- `Control(Branch) Hazards`: 
　當分支指令尚未決定，是否分支至其他指令執行時，後續指令已經進入管線中，若分支發生則指令執行順序便會發生錯誤。
　Control(Branch) Hazards的產生是由管線中的分支指令，及其他會改變Program Counter的指令所引發。

---
### 解決Hazards: Stall(暫停)
![](https://i.imgur.com/HbhXKaV.png =50%x50%)
.
### 解決Hazards: Forwarding(前饋)
![](https://i.imgur.com/YkoHG0V.png =50%x50%)

### 特例:解決Load-Use Data Hazards: Stall+Forwarding
![](https://i.imgur.com/XL5xAYE.png =50%x50%)
`解釋：因為至少要Stall一個cycle，才能Forwarding到SUB指令的ALU的輸入，不然會無法往回forward!!!`

---
### 比較：三種Hazards解決方法
- Structure Hazards
--**Stall**
--Add more hardware
- Data Hazards
    | 軟體方式 | 方法1.插入NOP指令(Stall)／方法2.重排指令順序 |
    | -------- | -------- | -------- |
    | 硬體方式     | 方法1.Forwarding(Non-Load-Use)／方法2.Stall+Forwarding(Load-Use)   |
    `Hint:5個Stage的MIPS Pipeline，指令間距離>2就不會有Data Hazards`
    ![](https://i.imgur.com/2XMpTuA.png =50%x50%)


- Control(Branch) Hazards **(下面Hazard Detection第2點有詳細解說)**
--Add hardware to do it in **ID** stage
--**Branch Prediction**: 1-bit(預測錯1次就改變預測位元), 2-bit(預測錯2次才改變預測位元)
    - In MIPS pipeline, it can predict branches not taken
![](https://i.imgur.com/pHjy9fC.png =50%x50%)

    --現實的Branch Prediction
    - Static branch prediction
        - `Based on typical branch behavior`
        - `Example: loop and if-statement branches`
    - Dynamic branch prediction
        - `Hardware measures actual branch behavior`
        - `Assume future behavior will continue the trend`

---
### Pipeline Datapath
![](https://i.imgur.com/oNwcx0y.png)
![](https://i.imgur.com/aFwI1FH.png)

### Pipeline Register詳解
`分左邊, 右邊著色表示現在是read還是write`
#### :one: Load指令
![](https://i.imgur.com/1hSCJGV.png =60%x60%)
![](https://i.imgur.com/bP7cKb7.png =60%x60%)
![](https://i.imgur.com/t8stdAB.png =60%x60%)
![](https://i.imgur.com/YIhgesF.png =60%x60%)
:::warning
![](https://i.imgur.com/JNZo9Cx.png =60%x60%)
#### `錯誤：WB for Load 寫data到register是下幾個指令當前讀到的registerNO.，錯了!!!!!`
![](https://i.imgur.com/nRCIyCi.png =80%x80%)
#### `Solution：接一條線從IF/ID到MEM/WB以便得到正確的rt暫存器編號`
![](https://i.imgur.com/kMZgfue.png =80%x80%)
#### `Load+R-format`
:::

#### :two: Store指令比較
![](https://i.imgur.com/CCOKu3E.png =80%x80%)
`>圖中可以看到是去做write`

### Pipeline Control控制信號
跟Single-Cycle CPU一樣，只是有用Pipeline Register
![](https://i.imgur.com/8cJSV6a.png)
![](https://i.imgur.com/zsM0zfh.png)

---
### Hazards Detection(危障偵測)
`Hint:看一下指令間是否存在data dependencies`
#### <font color="blue">:a:Data Hazards
.
**下面是一些指令，存在data dependencies**
![](https://i.imgur.com/CT6ip8y.png =80%x80%)
.
#### :one: Fowarding
**1. 會存在需要forwarding，代表RegWrite(寫資料到register)這個控制信號=1
2. 而且rd必須必須不是$zero暫存器**
![](https://i.imgur.com/mVvsvYu.png =50%x50%)

.
**見下圖，ForwardA 管 rs ／ ForwardB 管 rt**
![](https://i.imgur.com/WuMoGfK.png =80%x80%)
.
**接線示意圖**
![](https://i.imgur.com/44bYymK.png =80%x80%)
.
**Forwarding Control(ForwardA, ForwardB控制信號)**
![](https://i.imgur.com/gzbeL9q.png =80%x80%)
.
**Double Data Hazards(兩個以上的資料危障，forward會出現問題!)
→MEM hazard forward只存在 EX hazard沒出現的情況**
:::success
![](https://i.imgur.com/iCjJc8g.png =80%x80%)
<font color ="black"> 所以，修改過的版本↓（加入not(EX hazard condition)) </font>
![](https://i.imgur.com/hYKk0P8.png =60%x60%)
:::
.
加入forwarding的電路圖Datapath
![](https://i.imgur.com/b5koZ8k.png)
.
**特例:Load-Use Hazard Detection**
Recall that: Need stall+forwarding
![](https://i.imgur.com/vxwYB4N.png =80%x80%)
![](https://i.imgur.com/oAxBzav.png =60%x60%)
.
#### :two: Stall
- `Force control values in ID/EX registerto 0(NOP):這個動作叫Flush `
    - ![](https://i.imgur.com/WTEHzLZ.png =60%x60%)
    - bubble - 把EX, MEM, WB control value變0
    - 保留IF及ID中的instruction for one more cycle
- `Prevent update of (1)Program Counter (2)IF/ID register`
    - 藉由這點來保留指令在下個cycle開始繼續執行

#### :three: Datapath with Hazard Detection
![](https://i.imgur.com/dJHPx7x.png)
</font>

---
#### <font color="red">:b:Control(Branch) Hazards
.
如果Branch結果在MEM stage被決定跳到哪個PC
![](https://i.imgur.com/jR1HERp.png =60%x60%)
.
但我們要降低Branch的Delay
`>>在ID stage就決定!(因此需要增加一個module功能包含"1.目標address加法器/2.Register值比較器")`
![](https://i.imgur.com/yosKWsE.png =60%x60%)
.
參考上圖指令
藍色①處是Target Address Adder: 
`beq(PC=40)去(PC+4)+Offset(7*4)=72跳到lw指令(PC=72)位置`
藍色②處是Register Comparator
.
以上①②皆在ID stage完成，成功降低了Branch delay
![](https://i.imgur.com/d3ebF6e.png)
.
對於要跳過的指令，我們給他Control Value = 0 (圖中藍圈處)
![](https://i.imgur.com/NVD3Su3.png)
.
因為ID stage時Branch就要compare兩個register的值(提早了，原本是在EX階段)
如果是**往上2個(含)以上的R-Format指令們**
可以用forwarding來解決data hazards
![](https://i.imgur.com/Zqe0Wvf.png =60%x60%)
.
如果是**往上1個R-Format指令**／**往上2個load instruction**
:arrow_forward: **需要Stall 一個cycle**
接著可以用forwarding來解決data hazards
![](https://i.imgur.com/x5EDuU6.png =60%x60%)
</font>

:::success
#### 動態Branch預測(分1-bit, 2-bit)
![](https://i.imgur.com/3wJnFSM.png =60%x60%)
2-bit預測只要change，再錯一次預測就會再change。
(因為上一次錯了，這一次再錯就change)
:::

---

### Pipeline危機狀況處理: Exceptions and Interrupts
簡介:
| 中斷原因類型 | 發生性質 | 
| -------- | -------- |
| Exception     |  內部程式問題(overflow, undefined instruction....)    | 
| Interrupt | 外部I/O devices |
.
如何處理Exception(in MIPS)?
1. 把發生問題的指令位置存入 Exception PC 
2. 把發生問題的原因寫入Cause Register(下圖假設1-bit cause register可以去分辨undefined opcode或是overflow)
3. 比較：Cause register是先分類再定義好／Vectored Interrupt是已經定義好直接放進去
![](https://i.imgur.com/mw32yRp.png =60%x60%)
![](https://i.imgur.com/VpXsJoB.png =60%x60%)

.
假設`add $1, $2, $1`有overflow產生
1. 我們要防止這個overflow錯誤的值存進$1
2. 前面的指令要繼續執行完
3. 這個add指令與接下來的指令都flush掉(control value設0)
4. Set Cause register跟 EPC register的value
5. jump to handler 
![](https://i.imgur.com/AIdNf6u.png =60%x60%)

.
複習：藍色⓪處是遇到hazard
藍色①處ID.Flush是處理Decoder遇到undefined instruction的情況
此2個情況只要有一個發生(OR邏輯閘)，就會給那個MUX=0，產生Stall
.
藍色②處EX.Flush是處理overflow的情況，因為要WB(Write data回去)，所以WB那邊要再接一個MUX=0
![](https://i.imgur.com/AFgyQGU.png)

.
來看一個Example
![](https://i.imgur.com/C68mQpR.png =40%x40%)
![](https://i.imgur.com/wEuTn9W.png =70%x70%)
.
然後要Flush完變成
![](https://i.imgur.com/nPjaZuz.png)

---
### 進階管線平行處理: Instuction-Level Parallelism (ILP)
.
要讓指令層次平度提升有2種方法:
1.增加管線深度
2.讓每個階段執行更多指令>>IPC是CPI倒數
![](https://i.imgur.com/5nDUIEY.png =60%x60%)
.
Multiple issue又可分為:
1.Static: Compiler重新排列instruction順序
2.Dynamic: 硬體之間的分工
![](https://i.imgur.com/EFZFkkM.png =60%x60%)
.
Static Example: Dual-issue packets(指令成雙出現)
`ALU或Branch指令必須在前面`
`如果成對指令中有一個指令是不能使用的，則用NOP代替`
![](https://i.imgur.com/YEgZQV9.png =60%x60%)
.
電路圖:
1.首先需要的是Registers上額外的存取ports，在一個clock cycle我們可能需要讀出兩個暫存器的值來做ALU運算，還要讀出兩個暫存器的值為儲存指令做運算。
2.並且需要一個ALU運算的寫入port及一個給載入指令的寫入port。
3.由於ALU這個模塊一定是拿來做ALU運算，所以我們還需要另一個加法器來計算資料傳送指令的有效位址。
![](https://i.imgur.com/PoWjEPN.png)
.
Dual-issue 中 Hazards的情況解決:
1.EX data hazard要把R-format跟load/store分開成2個packet，因為無法forwarding
2.Load-use hazard一樣stall一個cycle，但是得到的好處是一次進來2個指令(一個packet進來)
![](https://i.imgur.com/U6mBWxv.png =60%x60%)

---
### Code Scheduling
![](https://i.imgur.com/YddT5oQ.png =70%x70%)
.
前面3個指令有data dependencies
![](https://i.imgur.com/Rkf94Sx.png =20%x20%)
後面2個指令也data dependencies
![](https://i.imgur.com/Go83iav.png =20%x20%)
.
重新排程後的結果如上圖(不做的指令填NOP)
.
#### :star: Loop Unrolling 多重分發管線之迴圈展開
.
**Antidependence**: 純粹暫存器$t0的名字重複使用而已，不是真正的data dependency
**Register Renaming**: 消除Antidependence的情況
![](https://i.imgur.com/2OlwSOw.png =60%x60%)
.
把他寫成Dual-Issue的指令packet
![](https://i.imgur.com/m4h7xf6.png =60%x60%)


---

### Dynamic Multiple Issue
- 又叫"Superscalar" processors
- 由CPU決定要issue幾個instruction/per cycle
- 比較沒有那麼仰賴code scheduling
- **由硬體來確保執行沒有錯誤(hardware support)**
- out-of-order execution: 允許CPU不按順序執行指令來避免stalls
- in-order commit: But commit result to registers in orde
    - 保持相依性
- reorder buffer: buffer in commit unit

![](https://i.imgur.com/GKGtW7X.png =70%x70%)

:::success
#### 為何要做Dynamic Scheduling?
1. 並不是所有stall你都可以預測去做到( ex. cache misses)
2. 不能總是schedule around branches (Branch outcome is dynamically determined)
3. 不同的的ISA架構有不同的latencies跟hazards
:::

### Power Efficiency
![](https://i.imgur.com/dVpf76o.png =60%x60%)

---

### L4總結
![](https://i.imgur.com/sD3tzgk.png =60%x60%)

---
##### tags: `計組` `CO` `L4`
---

:::warning
#### <font color="black">問答區/Chatting Room
Q1: Code scheduling那個圖我不會by秉 要看小考4
    
    
</font>
:::