# Instruction Level Parallelism (ILP)
## 1. Introduction
ILP improves processor performance by executing multiple instructions simultaneously. It exploits the latent independence between instructions in a program, so that several instructions can be processed in parallel on different units of the processor at the same time.
### 1.1 方法
#### hardware-based dynamic approach → Superscalar processors
- exploited at run time
- use in-order execution if statically scheduled; out-of-order execution if dynamically scheduled
- used in servers, desktop processors, high-end phones, and other devices that demand high performance
#### compiler-based static approach → VLIW processors
- at compile time: the compiler discovers the parallelism and tries to re-order instructions to maximize parallel execution
- issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet, with the parallelism explicitly indicated
- used in DSP processors
### 1.2 ILP must handle the dependence problem
#### a. true dependence (data dependence)
the consumer really has to wait for the producer's result
<img src="https://hackmd.io/_uploads/S1qwMAc3R.png" width="80%">
#### b. name dependence
only a naming conflict; renaming the register removes it
(WAR, WAW)
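The three hazard kinds can be checked mechanically. A minimal sketch in Python (the `classify` helper and the register names are illustrative, not from any real toolchain):

```python
# Classify the hazards a later instruction has on an earlier one.
# Each instruction is (destination, [source registers]).
def classify(first, second):
    d1, srcs1 = first
    d2, srcs2 = second
    hazards = set()
    if d1 in srcs2:   # true (data) dependence: the read must wait (RAW)
        hazards.add("RAW")
    if d2 in srcs1:   # anti-dependence: just a naming conflict
        hazards.add("WAR")
    if d1 == d2:      # output dependence: also a naming conflict
        hazards.add("WAW")
    return hazards

# add r1, r2, r3  then  sub r4, r1, r5  -> must wait on r1
print(classify(("r1", ["r2", "r3"]), ("r4", ["r1", "r5"])))  # {'RAW'}
# add r1, r2, r3  then  sub r2, r4, r5  -> renaming r2 removes it
print(classify(("r1", ["r2", "r3"]), ("r2", ["r4", "r5"])))  # {'WAR'}
```

Renaming the second writer's destination eliminates WAR/WAW, while a RAW dependence survives any renaming.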

#### c. control dependence
an instruction's execution depends on the outcome of a branch

## 2. Basic compiler techniques for ILP (VLIW)
### 2.1 Implementation
separate dependent instructions

**Step 1: Insert stalls**

**Step 2: Loop unrolling without scheduling**

use different registers in each unrolled copy
**Step 3: Re-schedule**
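The effect of the three steps can be seen with a toy in-order pipeline model. A sketch in Python; the latencies, opcodes, and register names are assumptions for illustration, not the notes' exact numbers:

```python
# Toy in-order pipeline that stalls on RAW hazards.
# Assumed result latencies in cycles (illustrative only).
LATENCY = {"ld": 2, "fadd": 3, "sd": 1}

def total_cycles(instrs):
    """instrs: list of (op, dest_or_None, [sources]). Returns total cycles."""
    ready = {}                      # register -> cycle its value is available
    cycle = 0
    for op, dest, srcs in instrs:
        cycle += 1                                         # issue 1 instr/cycle
        stall = max([ready.get(s, 0) - cycle for s in srcs] + [0])
        cycle += stall                                     # Step 1: stalls
        if dest:
            ready[dest] = cycle + LATENCY[op]
    return cycle

# two iterations, neither unrolled nor scheduled: each fadd waits on its ld
naive = [("ld", "f0", ["r1"]), ("fadd", "f4", ["f0", "f2"]), ("sd", None, ["f4", "r1"]),
         ("ld", "f6", ["r2"]), ("fadd", "f8", ["f6", "f2"]), ("sd", None, ["f8", "r2"])]
# Steps 2-3: unroll by 2 with different registers, then interleave the copies
scheduled = [("ld", "f0", ["r1"]), ("ld", "f6", ["r2"]),
             ("fadd", "f4", ["f0", "f2"]), ("fadd", "f8", ["f6", "f2"]),
             ("sd", None, ["f4", "r1"]), ("sd", None, ["f8", "r2"])]
print(total_cycles(naive), total_cycles(scheduled))  # 12 7
```

Interleaving independent copies hides the load and FP-add latencies, so the scheduled version spends no cycles stalled.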

### 2.2 VLIW Introduction
- very long instruction word
- use multiple, independent functional units
- Loop unrolling and then code scheduling (just as above)

- Problems
1. increase in code size
To fill every very long instruction word, the compiler may need to insert no-op instructions (NOPs), especially when it cannot find enough independent instructions to execute in parallel. This increases code size, which in turn hurts cache performance and raises memory usage.
2. the compiler is complex, and a hardware revision forces extensive recompilation and re-optimization
## 3. Using advanced branch prediction to reduce branch cost
**→ dynamic branch prediction by hardware**
branch prediction answers: 1. branch or not; 2. if taken, where is the target
goal: learn the branch outcome as early as possible
### 3.1 Example


### 3.2 Branch History Table (BHT)
correlating predictor/ 2-level (m, n) predictors
#### Implementation
- m: outcomes of the previous m branches (their taken/not-taken pattern indexes the table)
n: number of bits per predictor
p: number of branch-address bits used as the index
- record the last m branches to select among 2^m history tables with n-bit predictors
total bits for an (m, n) BHT prediction buffer with p-bit index: **2^m * n * 2^p**
- lookup:
use p bits of the branch address to select the row
use the m-bit shift register to select the column
use the n predictor bits in that entry to make the decision

- Example

(0, 1) predictor: no branch history, a single 1-bit predictor per entry
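The lookup procedure and the total-bits formula above can be sketched directly. The counter initialization and the saturating-counter update below are standard choices assumed for illustration:

```python
class CorrelatingPredictor:
    """(m, n) predictor indexed by p bits of the branch address (sketch)."""
    def __init__(self, m, n, p):
        self.m, self.n, self.p = m, n, p
        self.history = 0                     # m-bit global shift register
        # 2^p rows x 2^m columns of n-bit saturating counters (weakly not-taken)
        self.table = [[(1 << (n - 1)) - 1] * (1 << m) for _ in range(1 << p)]

    def predict(self, pc):
        # p address bits pick the row, the m-bit history picks the column
        ctr = self.table[pc & ((1 << self.p) - 1)][self.history]
        return ctr >= (1 << (self.n - 1))    # counter MSB set -> predict taken

    def update(self, pc, taken):
        row = pc & ((1 << self.p) - 1)
        ctr = self.table[row][self.history]
        if taken:
            self.table[row][self.history] = min(ctr + 1, (1 << self.n) - 1)
        else:
            self.table[row][self.history] = max(ctr - 1, 0)
        self.history = ((self.history << 1) | taken) & ((1 << self.m) - 1)

def table_bits(m, n, p):
    return (1 << m) * n * (1 << p)           # matches 2^m * n * 2^p

print(table_bits(2, 2, 4))  # 2^2 * 2 * 2^4 = 128
```

A (0, 1) predictor is the special case of one column and 1-bit counters.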

### 3.3 Tournament predictor
combines the strengths of several prediction methods to improve accuracy (a hybrid)
1. Multiple predictors
Common choices are a Global Predictor and a Local Predictor: the Global Predictor predicts from a single global branch-history record, while the Local Predictor maintains a separate history for each branch.

2. Others:
* McFarling’s Gshare Predictor: combines past behavior (global history) with the current branch address to index the predictor
* Tagged hybrid predictors (TAGE): keep a predictor per branch-and-history combination
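A minimal gshare sketch in Python; the 10-bit index, the XOR hash, and 2-bit counters are assumptions consistent with McFarling's scheme:

```python
class Gshare:
    """index = (PC XOR global history), 2-bit saturating counters."""
    def __init__(self, bits=10):
        self.bits = bits
        self.history = 0
        self.table = [1] * (1 << bits)   # start weakly not-taken

    def _index(self, pc):
        return (pc ^ self.history) & ((1 << self.bits) - 1)

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        self.history = ((self.history << 1) | taken) & ((1 << self.bits) - 1)
```

On a steadily taken loop branch, the history soon saturates, the index stabilizes, and the predictor locks onto "taken".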
## 4. Dynamic scheduling for ILP (Superscalar)
### Introduction
- Advantages:
1. reduce stalls: rearrange the instructions
2. works at run time, where dependences that cannot be identified at compile time become visible, so a superscalar performs better than a VLIW and the compiler is easier to design
3. explores **hardware speculation**, which requires highly precise control and recovery mechanisms
- Disadvantage: complex hardware design
- Implies **out-of-order execution and out-of-order completion**
→ WAR, WAW hazards → Register renaming

<img src="https://hackmd.io/_uploads/rkQrmfjnA.png" width="80%">
**structural and data hazards could be checked during ID stage**
ID stage can be separated into 2 parts
- Issue: check whether an available hardware function unit exists; check for structural hazards in the manner of in-order issue
- Read operands: wait until no data hazards, then read operand
- Scoreboard vs. Tomasulo’s approach
### 4.1 Scoreboard
#### Implementation
[Scoreboard.pdf](https://drive.google.com/file/d/1r8DTNvHYp8O86no0RPDUkA9CijClfpJf/view?usp=sharing)
- **In-order issue, out-of-order execution/completion**
- WAR, WAW → handled by waiting. ex: cycle 17 in Scoreboard.pdf
stall write-back until the registers have been read
registers can only be read during the read-operands stage
- **No register renaming**
- Centralized control
### 4.2 Tomasulo’s approach
#### Implementation
[Tomasulo.pdf](https://drive.google.com/file/d/1Z_MZQXM2HYqXntStbe87F00AgmzQ3Gee/view?usp=sharing)
#### Architecture

- **registers in instruction are renamed** to reservation station (RS)
→ avoid WAR, WAW
- distributed control
each FU and load/store unit has its own RS → distributed
- an RS records not the source register’s name but which buffer or reservation station will produce the value; this indirection is how register renaming is achieved
if no RS is available → structural hazard → stall
- the FU broadcasts its result back to the RSs over the common data bus (CDB), so instructions waiting in an RS can get their operands
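The renaming idea can be sketched in a few lines of Python; the tag names (`RS1`, `RS2`, …) and the instruction tuples are illustrative assumptions:

```python
def rename(instrs):
    """instrs: list of (op, dest, srcs). Returns the tagged form.
    Each destination gets a fresh reservation-station tag; each source
    refers to the tag of its latest producer (or the register itself)."""
    tag_of = {}                 # architectural register -> tag of latest writer
    out, next_tag = [], 1
    for op, dest, srcs in instrs:
        tagged_srcs = [tag_of.get(s, s) for s in srcs]
        tag = f"RS{next_tag}"
        next_tag += 1
        tag_of[dest] = tag      # later readers wait on this tag, not the name
        out.append((op, tag, tagged_srcs))
    return out

prog = [("div", "f0", ["f2", "f4"]),
        ("add", "f6", ["f0", "f8"]),    # RAW on f0: reads the div's tag
        ("mul", "f0", ["f10", "f8"])]   # WAW on f0: gets its own tag
print(rename(prog))
```

The `mul` no longer conflicts with the `div` over `f0`: each write owns a distinct tag, so only the true RAW dependence remains.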
#### Refining ILP further with speculation
the hardware speculates on the outcomes of branches and executes the program as if its guesses were correct
- dynamic scheduling: only fetches and issues instructions
- speculation: fetch, issue, and execute instructions
→ but a guess may be wrong, so we must be able to recover
- Tomasulo’s challenges
1. several function units may finish at the same time → need more CDBs → more complex
2. when a branch is mispredicted or an exception occurs, recovery is hard because completion is out of order
→ solution: **Reorder Buffer(ROB)** Operation
- Reorder Buffer Operation
instructions are written into the ROB in the same order as the in-order issue

- 3 components of HW-based speculation
1. Dynamic branch prediction: choose instructions
2. Dynamic scheduling: manages the order of executing instruction (out-of-order)
3. Speculation: executes instructions **before control dependences are resolved + ability to undo when incorrectly speculate**
- Adding ROB to Tomasulo
- adds one more stage, instruction commit; commit is in order, and the register file is updated only once an instruction is known to be non-speculative
- on a misprediction, flush
- provides precise exceptions
- ROB purpose: holds the results of instructions that have finished execution but have not yet committed
- Effects:
- use the ROB number instead of the RS to indicate the source of operands once execution completes (but has not committed)
- an instruction can issue only when a ROB slot is available
- ROB entry contents
1. instruction type: branch / store / register operation
2. destination:
ALU operations and loads: which register to write
store: memory address
3. value: value of the instruction result
4. ready: indicates the instruction has completed execution and the value is ready
5. busy: indicates the instruction has not yet completed execution
- Four steps of ROB to Tomasulo
1. issue:
If **both a reservation station and a reorder-buffer slot are free (neither may be missing)**, issue the instruction & send the operands & the reorder-buffer number for the destination
2. Execution (EX)
→ When both operands ready, then execute
→ if not ready, watch CDB for result; when both in reservation station, execute
3. Write result (WB)
mark reservation station available
Write on 1. Common Data Bus and 2. reorder buffer
4. Commit
* When instruction at head of reorder buffer & result present, update register with result (or store to memory), then remove instruction from reorder buffer.
* if mispredicted → flush
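The commit rules above can be sketched as a toy ROB in Python; the entry fields and the squash-on-flush policy are simplified assumptions:

```python
from collections import deque

class ROB:
    """Entries complete out of order but commit strictly in order;
    a mispredicted branch flushes every younger entry behind it."""
    def __init__(self):
        self.buf = deque()               # entries in issue (program) order

    def issue(self, name):
        entry = {"name": name, "ready": False, "flush": False}
        self.buf.append(entry)
        return entry

    def commit(self):
        done = []
        while self.buf and self.buf[0]["ready"]:   # only the head may commit
            head = self.buf.popleft()
            done.append(head["name"])
            if head["flush"]:                      # mispredicted branch
                self.buf.clear()                   # squash younger entries
                break
        return done
```

An instruction that finishes early simply waits in the buffer until everything older has committed, which is what makes exceptions precise.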
- Architecture

- Use extended registers for renaming in both reservation stations(RS) and ROB
- multiple issue (several instructions issued in the same cycle)
w/ or w/o speculation

`ld` following `bne` can’t start executing earlier (without speculation)

`ld` following `bne` can start executing earlier (with speculation)
#### BTB
- challenge: besides predicting branches well, we also want to know as early as possible whether to branch and where the branch goes
<img src="https://hackmd.io/_uploads/BkG9xQs3C.png" width="80%">
`jr $ra` is commonly used for function returns, where the return address is stored in a register ($ra)
- solution
- Software
1. loop unrolling: removes the branch
2. instruction scheduling
- Hardware
1. predicated instruction
2. branch target buffer (BTB) → more common
vs. BHT: the BHT does not compute the target
- BTB
we want to know already during IF that this is a conditional (branch) instruction, and where it jumps (know the address)
→ **use the instruction address rather than wait for decode**
→ **access the BTB during the IF stage**

keep both the branch PC and target PC in the BTB

- refining the BTB: in the target field, store the target *instruction* itself rather than its address, so after taking the branch there is no need to fetch from instruction memory again
→ Branch folding
- BTB challenge → Return address predictors
- indirect jump: the destination address varies at run time; the target is not given directly but stored in a register or memory location, so it is known only at execution time. ex: procedure returns
→ the accuracy of the BTB for procedure returns is low
- solution: use a small buffer of return addresses operating as a stack


- Combination of BTB and BHT: the BTB is faster but more expensive and holds fewer entries, so the BHT serves as a backup that holds more entries
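Toy Python sketches of both structures discussed above; the dict-based BTB and the 8-entry stack size are assumptions:

```python
class BTB:
    """Branch target buffer, looked up during IF by instruction address."""
    def __init__(self):
        self.entries = {}                  # branch PC -> predicted target PC

    def lookup(self, pc):                  # during IF, before decode
        return self.entries.get(pc)        # hit: predicted target; miss: None

    def update(self, pc, target):          # fill/refresh on a taken branch
        self.entries[pc] = target

class ReturnAddressStack:
    """Predicts `jr $ra`-style returns, which a plain BTB handles poorly."""
    def __init__(self, size=8):
        self.size, self.stack = size, []

    def on_call(self, return_addr):
        if len(self.stack) == self.size:   # overflow: drop the oldest entry
            self.stack.pop(0)
        self.stack.append(return_addr)

    def on_return(self):                   # predicted return target (LIFO)
        return self.stack.pop() if self.stack else None
```

The stack works because calls and returns nest: the most recent unmatched call is always the one the next return goes back to.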

## 5. Summary
- superscalar can do: dynamic scheduling, multiple issue, speculation
1. old code still runs: already-compiled binaries keep running without recompilation (no fear of compiler updates)
2. little impact on code density
- superscalar multiple issue 的問題
1. multiple instructions are issued per cycle, so multiple instructions may commit per cycle → many CDBs are needed
2. There may be many possible dependencies between instructions → instruction issue is the bottleneck in dynamically scheduled superscalar
- Limits on ILP
- **exploitable ILP under existing mechanisms**: a key question is how much ILP can actually be achieved with current hardware mechanisms and a growing hardware budget
- dependences remain, so ILP hits a bottleneck
**→ Thread Level Parallelism (TLP)**