<style>
.r {
color: rgb(226, 60, 47);
font-weight: bold;
}
.g {
color: rgb(40, 151, 72);
font-weight: bold;
}
.b {
color: rgb(46, 123, 237);
font-weight: bold;
}
</style>
# The Processor
:::warning
**男女友 & 父母二選一**
A: Day mod 5 == 0 BF/GF
B: Day mod 7 == 0 parent
A and B == 0 BF/GF
B and ~A parent
:::
## Introduction
### Preliminary
- The performance of a machine was determined by three key factors
- **Instruction count, clock cycle time, CPI**
- Both the clock cycle time and the CPI are determined by the implementation of the processor
- In this chapter
- To construct the datapath and control unit of the MIPS instruction set
- A highly abstract and simplified overview
- To build up a datapath and construct a simple version of a processor
- A more realistic pipelined MIPS implementation
- To implement more complex instruction sets (x86)
- **Make the common case fast**
- **Simplicity favors regularity**
:::info
為了一致性經常會執行一些額外的行為,只要不對 register 裡的值造成影響就沒關係。
:::
### A basic MIPS implementation
- An implementation includes the following instruction
- ***Memory-reference instructions:*** `lw`, `sw`
- ***Arithmetic-logical instructions:*** `add`, `sub`, `and`, `or`, `slt`
- ***Branch instructions:*** `beq`, `j`
- An overview of the implementation
- Execution steps
- 每個步驟前三步都很相似
- ***Step 1***: send the **program counter (PC)** to the memory to **fetch** the instruction
- ***Step 2***: read one or two registers(一般都直接讀兩個以保持一致性)
- Using fields of the instruction to select the registers to read
- **`lw`: read one register**
- Other instructions: read two registers
- ***Step 3***: use arithmetic-logical unit (ALU)
- <font class='b'>`lw`, `sw`: address calculation</font>
- <font class='g'>`add`, `sub`, `and`, `or`, `slt`: operation execution</font>
- `beq`: comparison
- <font class='r'>`j`: doesn’t use ALU</font>
- ***Step 4 & 5***: actions depend on different instruction types
- **store 要 4 個 step, load 要 5 個**
- `lw`, `sw`: access the memory
- `add`, `sub`, `and`, `or`, `slt`: write the register file
- `beq`, `j`: change the **PC**, otherwise the PC should be **incremented by 4**
- The abstract view of a MIPS implementation
- **Multiplexors**, which select from among several inputs based on the setting of its control lines, are omitted
- The control unit and control lines are omitted

- The datapath of the MIPS implementation
- Three required multiplexors are added
- A **control unit** and control lines for the major functional units are added

(黑色為資料傳輸線,藍色是控制訊號線,通常會全部畫黑色的)
## Logic Design Conventions
### The datapath elements in the MIPS implementation
- Logic element that operates on **data values**
- Combinational logic (no internal storage)
- The outputs ==**depend only on the current inputs**==(輸出只跟當前輸入有關)
- Given the same input always produces the same output
- Ex: ALU
- Logic element that contains **state (state element)**
- **Sequential** logic (**internal storage**)
- The outputs depend on both inputs and contents of the internal state
- **At least two inputs and one output**
- Input: data value to be written into the element
- Input: clock
- Used to determine when the state element should be written
- The state element **can be read at any time**
- Output: the value **written in an earlier clock cycle** (rising edge)
- Ex: Instruction/data memory, register
### Clocking methodology
- An approach to determine when data is **valid and stable relative to the clock**
- To define when signals can be read and written
- If a signal is written and read at the same time, the value of the read will be **unpredictable**!
- **Edge-triggered clocking**
- 另一種是 level-trigger
- Any values stored in a state element are updated only on a clock edge
- For a combinational logic
- Inputs and outputs must be **state elements**
- Inputs: values written in a previous clock cycle
- Outputs: values can be used in a following clock cycle
- All signals must propagate from inputs, through the combinational logic, and to outputs in the time of one clock cycle
- The time necessary for the **signals to reach outputs** defines the length of the clock cycle
- 因為所有指令必須在一個 clock cycle 內完成,才來得及在下一個 edge 把結果寫入 register(沒有及時寫入資料會不見)

- **Control signal vs. Data signal**
- Control signal
- Used for **multiplexor selection** or for **directing the operation** of a functional unit
- The state element is changed only when the **write control** signal is asserted and a clock edge occurs
- If a state element is written on **every active clock edge**, the control signal is often omitted(有寫入功能的要裝開關,若要一直開著可以不用裝)
- If a state element is not updated on every clock, then an explicit write control signal is required
- clock edge 到達 (rising edge) + 開關打開才能寫入(通常隨時都能讀)
- Data signal
- To contain information that is operated on by a functional unit
- Terminologies
- **Asserted/deasserted**: a signal is logically high/low
- **Assert/deassert**: a signal should be driven logically high/low
- To allow a state element to be read and written in the **same clock cycle**

- The following steps are done in the same clock cycle
1. To **read** the contents of a register
2. To send the value through some combinational logic
3. To **write** the same register
- No feedback within a single clock cycle
## Building a Datapath
### The datapath elements in the MIPS implementation
Elements used to ***fetch*** instructions and ***increment*** the ***PC***.
- ***Instruction memory***
- A memory unit to store instruction of a program and supply instructions given an address
- **只存指令,只讀不寫**
- ***Program counter (PC)***
- A register holds the address of the current instruction
- ***Adder***
- To increment the **PC** to the address of the next sequential instruction
- An ALU that the control always specifies an add operation
- 負責 PC + 4

- Considering the **R-type** instructions
- ```
add/sub/and/or/slt $t1, $t2, $t3
```
- To read two register, perform an ALU operation, write the result to a register
- **Register file**
- A collection of 32 32-bit general-purpose registers
- Any register can be read or written by specifying its register number
- To read a data word
- An **input** to specify the register number to be read (5 bits)
- An **output** to carry the value been read (**32 bits**)
- The register file always outputs the contents of whatever register numbers are on the Read register inputs
- To write a data word
- An **input** to specify the register number to be written
- An **input** to supply the data to be written into the register
- To occur at the clock edge and the write control signal is asserted
- $\Rightarrow$ A total of four inputs and two outputs are needed
- **Arithmetic logic unit (ALU)**
- To take **two 32-bit inputs** and produce a 32-bit result
- A 1-bit **Zero** signal to indicate the result is 0
- ALU result 為 0,Zero 輸出為 1,反之則為 0
- 4-bit control signals (ALU operation)

- Considering the memory reference instructions (`lw` and `sw`)
- **`lw/sw $t1, offset($t2)`**
- To compute a memory address by adding `$t2` to the 16-bit signed offset
- `lw`: memory $\rightarrow$ `$t1`
- `sw`: `$t1` $\rightarrow$ memory
- **Data memory**
- A state element to store data
- Two inputs for **address** and the **write data**, and one output for the read result
- Separate read and write controls
- To read the value of an invalid address can cause problems
- **Sign extension unit**: 16 bits $\rightarrow$ 32 bits
- To increase the size of a data item
- To replicate the high-order sign bit of the original data item in the high-order bits of the larger, destination data item

### For different classes of instruction
- `add/sub/and/or/slt $t1, $t2, $t3`
- Fetching instructions and incrementing the **PC**
- Register file: to read operands in `$t2` and `$t3`, to write result to `$t1`
- ALU: to operate
- `lw/sw $t1, offset($t2)`
- Fetching instructions and incrementing the **PC**
- Register file: to read the **base register** `$t2`, to write/read data to/from `$t1`
- **Sign extension**: to sign-extend the 16-bit offset to a 32-bit signed value
- ALU: to compute a memory address (base register `$t2` + offset)
- Data memory: to read/write data
- `beq $t1, $t2, offset`
- Fetching instructions and incrementing the **PC**
- Register file: to read registers `$t1` and `$t2` for comparison
- **Sign extension**: to sign-extend the 16-bit offset to a 32-bit signed value
- Adder: to compute the branch target address (PC + offset)
- ***Shifter***: to left shift the (sign-extended) **offset by 2**
- ALU: to compare contents of registers `$t1` and `$t2` by subtraction
- Using the zero signal to indicate whether the operands are equal or not
- **Branch taken**: Zero = 1 $\Rightarrow$ **new PC = branch target address**
- **Branch not taken** (untaken branch): Zero = 0 $\Rightarrow$ new PC = PC
- `j offset`
- Fetching instructions and incrementing the PC
- hifter: to left shift the **offset by 2** (26 bits $\rightarrow$ 28 bits)
- To replace the lower 28 bits of PC

### Create a single datapath
- To execute all instructions in **one clock cycle**
- No datapath resource can be used more than once per instruction
- 所有元件只能用一次
- Any element **needed more than once** must be **duplicated**
- **Separate instruction and data memories**
- Some elements can be shared by different instruction flows
- Using a **multiplexor** and **control signal** to select among the multiple inputs
- Example: Building a datapath
- To build a datapath for **memory references** and **R-type instructions** using a single register file and a single ALU
- Using the ALU
- R-type instruction: inputs coming from two registers
- Memory reference: the 2nd input is the sign-extended 16-bit offset
- $\Rightarrow$ A multiplexor is needed for the 2nd ALU input
- The value stored into a destination register
- R-type instruction: from the ALU
- Memory reference: from the memory (for a load)
- $\Rightarrow$ A multiplexor is needed for the data stored into the register file

- To combine all the pieces to make a simple datapath
- A datapath for, `add`, `sub`, `and`, `or`, `slt`, `lw`, `sw`, and `beq`
- A multiplexor is required to select either the **PC + 4** or the branch target address to be written into the **PC**
- A **control unit** is added to take inputs and generate
- a **write signal** for each state element
- the **selector control** for each multiplexor
- the **ALU control**

## A Simple Implementation Scheme
### Multiple levels of decoding
- The **main control unit** generates the **ALUOp** bits
- **ALUOp** bits are used as input to the **ALU control unit** that generate the actual signals to control the ALU unit
- Can reduce the size of the main control unit
- Potentially increase the speed of the control unit
- The speed of the control unit is often critical to clock cycle time

### The ALU control
- The ALU has 4 control inputs but only 6 combinations are used
- A small ALU control unit to generate the 4-bit ALU control input
- Inputs: the **funct** field of the instruction and a 2-bit **ALUOp**
- **ALUOp** = 00: addition for `lw`/`sw`
- **ALUOp** = 01: subtraction for `beq` (相減比較是否相等)
- **ALUOp** = 10: for R-type instructions, determined by the operation encoded in the funct field
- Output: a 4-bit signal that directly controls the ALU

- To implement the ALU control unit: truth table
- Only show the truth table entries for which the ALU control must have a specific value
- The full truth table has $2^8 = 256$ entries
- Don’t-care term
- The output does not depend on the value of all the inputs
- Useful to keep the table compact

### Designing the main control unit
#### Observations about instruction formats
- ***op*** field (**opcode**) in bits `31:26`
- To denote the operation and format of an instruction (`Op [5:0]`)
- ***rs*** and ***rt*** fields at positions `25:21` and `20:16`
- To specify the two registers to be read, for R-type, `beq`, `sw` instructions
- The base register for `lw` and `sw` is in bits 25:21 (**rs**)
- The 16-bit offset for `beq`, `lw`, and `sw` is in positions 15:0
- Destination register
- For `lw` is in bits `20:16` (**rt**)
- For R-type is in bits `15:11` (**rd**)
- $\Rightarrow$ A multiplexor is needed to select which field of the instruction is used to indicate the register number to be written

#### The datapath with multiplexors and control signals
- ALU control unit, 4 multiplexors, 7 single-bit control lines, and a 2-bit ALUOp control signal
- All the multiplexors have two inputs and require a single control line

- The function of each of the control signals
- All but **PCSrc** can be set based solely on the opcode
- **PCSrc** should be asserted if the instruction is `beq` and the **Zero** output of the ALU is asserted (PCSrc = **Branch** AND **Zero**)


- All 9 control signals can now be set on the basis of **opcode** to the control unit

#### Operation of the datapath
- R-type instruction: `add/sub/and/or/slt $t1, $t2, $t3`
- The instruction is fetched, **PC** is incremented
- `$t2` and `$t3` are read from the register file; main control unit sets control lines
- The ALU operates, using the ***funct*** field of the instruction (bits `5:0`) to generate the ALU function
- The result from the ALU is written into the register file (bits `15:11`)

- `lw $t1, offset($t2)`
- The instruction is fetched, **PC** is incremented
- `$t2` is read from the register file;main control unit sets control lines
- The ALU performs addition to compute the address(`$t2` + sign-extended offset)
- To read the data memory
- The data from the memory unit is **written into the register** file (bits `20:16`)

- `beq $t1, $t2, offset`
- The instruction is fetched, **PC** is incremented
- `$t2` and `$t3` are read from the register file; main control unit sets control lines
- The ALU performs subtraction for equality comparison; the adder computes the branch target address(PC + sign-extended offset shifted left by 2)
- The ***Zero*** result from the ALU is used to decide which adder result to store into the PC

#### Finalizing control
- Truth table of the main control unit
- Input: the 6-bit ***opcode*** field `Op[5:0]`
- Output: 9 control lines(7 1-bit control lines, 2-bit ***ALUOp***)
- Example: Implementing jumps
- Jump address
- [31:28] of the current PC + 4, 26-bit immediate filed of the jump instruction, bits 002
- An additional control signal ***Jump***


### Why a single-cycle implementation is not used today
- The single-cycle design is **inefficient**
- CPI = 1
- The clock cycle is determined by the longest possible path in the machine
- Example: Performance of single-cycle machines
- Problem
- Operation times(時間僅用於比較相對關係,不一定是實際比例)
- Memory units 200 ps, ALU & adder 100 ps, register file read/write 50 ps
- Instruction mix
- 25% loads, 10% stores, 45% ALU instruction, 15% branches, 5% jumps
- Implementation 1
- Every instruction operates in 1 clock cycle of a **fixed length**
- Implementation 2
- Every instruction executes in 1 clock cycle using a **variable-length** clock
- To compare the performance between implementations 1 and 2

- Solution
- To compute the critical path for different instruction classes
- Clock cycle of implementation 1: 600 ps
- Clock cycle of implementation 2:
$600 \times 25\% + 550 \times 10\% + 400 \times 45\% + 350 \times 15\% + 200 \times 5\% = 447.5 ps$
- Performance ratio:
$$
\frac{\text{CPU clock cycle}_\text{single clock}}{\text{CPU clock cycle}_\text{variable clock}} = \frac{600}{447.5} = 1.34
$$
- Using the single-cycle design with a fixed clock cycle
- Acceptable for small and simple instruction set
- The clock cycle is equal to the worst-case delay for all instructions
- Each functional unit can be used only once per clock
- $\Rightarrow$ **Multicycle** implementation
- Shorter clock cycle
- Each instruction requires multiple clock cycles to complete
## An Overview of Pipelining
### Pipeline
- An inplementation technique in which multiple instructions are **overlapped** in execution
(讓指令可以重疊執行,非齊頭式,與 superscalar 不同)

> Fig. 4.25 The laundry analogy for pipelining
- Observations of the laundry example
- Pipelining is faster for many loads because everything is working in parallel
- All stages are operating concurrently
- We have separate resources for each stage
- Pipelining improves ==<font class='r'>*throughput*</font>==
- The <font class="b">total time</font> to complete the work is <font class="b">decreased</font>
- The time to complete a <font class='g'>single task</font> is <font class='g'>not decreased</font>
- The ideal ***speedup*** due to pipelining is equal to the number of ***stages*** in the pipeline
- 以下兩個條件要**同時符合**。
- If all stages take about the same amount of time (所有 stage 花的時間均相同。)
- There is enough work to do (工作要夠多)
- In Fig. 4.25, the speedup is only 2.3, because there are only 4 loads
- The pipeline is not completely full **at the beginning and end** of the workload
(pipeline 啟動跟結束時的資源並不是滿載的)
- This start-up (開始) and wind-down (結束) affects performance when the number of tasks is not large compared to the number of stages in the pipeline
- In Fig. 4.25, if the number of loads is much larger than 4, the stages will be full most of the time and the speedup will be close to 4
- Example: **Single-cycle vs. Pipelined performance**
- Problem
- 8 instructions: **`lw`**, **`sw`**, **`add`**, **`sub`**, **`and`**, **`or`**, **`slt`**, **`beq`**
- Memory access: 200 ps
- ALU operation: 200 ps
- Register file R/W: 100 ps
- To compare the average time between instructions of a single-cycle implementation to a pipelined implementation
- Solution
- The pipeline execution clock cycle must have the worst-case clock cycle of 200 ps
- To execute three **`lw`** instructions
- The time between the 1^st^ and 4^th^ instruction: 2400 ps vs. 600 ps
- The total execution time: 2400 ps vs. 1400 ps


- Pipelining speed-up discussion
- Assuming ideal conditions: the stages are perfectly balanced
$$
\Rightarrow \text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instruction}_{\text{nonpipelined}}}{\text{Number of pipes stages}}
$$
- **Under ideal conditions and with a large number of instructions**
- $\rightarrow$ **The speedup is approximately equal to the number of pipe stages**
- In previous example: 800 ps should be improved to 160 ps
- Pipelining involves some <font class='r'>overhead</font>
- The time per instruction in the pipelined processor will exceed the minimum possible
- The speed-up will be less than the number of pipeline stages
- In the previous example: 2400 ps $\rightarrow$ 1400 ps
- 3 **`lw`** instructions: speedup= 2400/1400 ≈ 1.71
- 1,000,003 **`lw`** instructions: speedup = 800,002,400/200,001,400 ≈ 4
- Summary
- Pipelining improves performance by increasing instruction <font class='r'>*throughput*</font>, as opposed to decreasing the execution time of an individual instruction.
- Instruction throughput is the important metric because real programs execute billions of instructions.
### Designing instruction sets for pipelining
- MIPS instruction set is **designed for pipelined execution**
- Characteristics of MIPS instruction set
- <font class='b'>All MIPS instructions are the same length (32-bit)</font>
- Easier to fetch and decode instructions
- Intel x86 instruction set
- Instructions vary from 1~15 bytes
- To translate x86 instructions into simple micro-operation
- To pipeline the micro-operations rather than the native x86 instructions
- MIPS has **only a few instruction formats**, with the source register fields being located at the same place in each instruction
- **The 2^nd^ stage can begin reading the register** file at the same time that the hardware is determining what type of instruction was fetched
- **Memory operands only appear in loads and stores**
- We can use the execute stage to calculate the memory address and then access memory in the following stage
- Operands must be **aligned in memory**
- A single data transfer instruction will not require two data memory accesses
- The requested data can be transferred in a single pipeline stage
### Pipeline hazards (Pipeline 代價)
- Hazards are situations in pipelining when the next instruction **cannot execute in the following clock cycle**
(Hazard: 下一個指令無法在下一個 clock cycle 進來。)
- **Structural hazards** (結構性危險)
- The hardware cannot support the combination of instructions that we want to execute in the same clock cycle(硬體不足,設計不良)
- Example
- Suppose we had a single memory instead of two memories
- In the same clock cycle the 1^st^ instruction is **accessing data** from memory while the 4^th^ instruction is **fetching an instruction** from the **same memory**

- **Data hazards**
- The pipeline must be ***stalled*** (停滯) because one step must **wait for another to complete**
(等前面的資料完成)
- Data that is needed to execute the instruction is not yet available
- Arising from the <font class='b'>dependence</font> of one instruction on an earlier one that is still in the pipeline (資料相依)
- 
- ```
add $s0, $t0, $t1
sub $t2, $s0, $t3 # 3 clock cycles are wasted in the pipeline
```
- Compiler can remove data hazards but its results is not satisfactory
(讓 compiler 調動指令的順序,因為 compiler 很保守所以效果並不好。)
- These dependences happen too often
- The delay is too long to expect the compiler to rescue us from this dilemma
- Solution: ***forwarding/bypassing***
- Adding extra hardware to retrieve the missing item early from the internal resource
- Forwarding paths are valid only if the destination stage is **later** in time than the source stage

- Forwarding cannot prevent all pipeline stalls
- ==**Load-use data hazard**==
- The data being loaded by a **`lw`** instruction has not yet become available when it is needed by another instruction
- **Pipeline stall (bubble)**
- A stall initiated in order to resolve a hazard

- Example: Reordering code to avoid pipeline stalls
- a = b + e; c = b + f;
- Original
```
lw $t1, 0($t0) # load b
lw $t2, 4($t0) # load e
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0) # load f
add $t5, $t1, $t4
sw $t5, 16($t0)
```

- Reordering
```
lw $t1, 0($t0) # load b
lw $t2, 4($t0) # load e
lw $t4, 8($t0) # load f
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
```
- Each MIPS instruction writes at most one result and does this in the last stage of the pipeline
- Forwarding is harder if
- There are multiple results to forward per instruction
- There is a need to write a result early in instruction execution

- **Control hazards (branch hazard)**
- Arising from the need to make a decision based on the results of one instruction while others are executing $\Rightarrow$ branch instruction
- The instruction that was fetched is not the one that is needed
- The flow of instruction addresses is not what the pipeline expected
- Solution 1: **stall on branch**
- 停下來等,想辦法在下一個 bubble 就知道。
- To stall immediately after fetching a branch, and waiting until the pipeline determines the outcome of the branch
- Conservative option, slow
- Example
- Suppose the branch can be resolve in the 2^nd^ stage
- If the branch cannot be resolved in the 2^nd^ stage, we will see an even larger slowdown

- Solution 2: **prediction**
- 用猜ㄉ,猜對可以節省時間,猜錯要 recover
- Simple approach: always predict that branches will be **untaken**
- Untaken: pipeline proceeds at full speed
- **Taken**: pipeline stall
- Sophisticated version: **branch prediction**

- **Branch prediction**
- 目前準確率可以到 90%
- To assume a given outcome for the branch and proceed from that assumption rather than waiting to ascertain the actual outcome
- Some branches predicted as taken and some as untaken
- **Dynamic** hardware predictor
- Making their guess depending on the behavior of each branch
- May change predictions for a branch over the life of a program
- When the guess is wrong
- To ensure the instructions following the wrongly guessed branch have no effect
- To restart the pipeline from the proper branch address
- **Pipeline overview summary**
- Pipelining is a technique
- To exploit ***parallelism*** among instructions in a sequential instruction stream
- To increase the number of simultaneously executing instructions
- To increase the rate at which instructions are started and completed
- To improve instruction ***throughput*** rather than individual instruction ***execution time*** or ***latency***
- **Invisible to programmer**
## Pipelined Datapath and Control
### Pipelined datapath
- 5 stages: <font class='r'>IF, ID, EX, MEM, WB</font>
- **Instruction fetch, Instruction decode / register file read, Execute / address calculation, Memory access, Write Back**
- Up to 5 instructions will be in execution during any single clock cycle
- 每個 stage 1 個 clock cycle,每個指令都要 5 個 clock cycle
- 同一個時間點,會有 5 個指令同時在工作。(滿載時)

- **Pipeline register**
- To retain the value of an instruction for its other four stages
- All instructions advance during each clock cycle from one pipeline register to the next
- The registers are **named for the two stages separated by that register**
- **PC** can be thought of as a pipeline register
- **PC** is part of the visible architectural state

- Ex: **`lw`** goes through the 5 stages of pipelined execution
- Highlight of registers or memory
- Right half: read
- Left half:write
- **Instruction fetch (IF)**
- fetch 指令,然後 PC + 4
- To read the instruction using the address in the **PC** and place in the **IF/ID**
- To increment **PC** by 4,save to **PC** and **IF/ID**

- **Instruction decode and register file read (ID)**
- The 16-bit **immediate** field of the instruction supplied from **IF/ID** is sign-extended to 32 bit
- Register numbers supplied from **IF/ID** is used to read two registers
- All above three values and **PC** are placed in the **ID/EX**

- **Execute or address calculation (EX)**
- Using the ALU to calculate the load address
- To add the contents of register 1 and the sign-extended immediate from the **ID/EX**
- ALU result is placed in the **EX/MEM**

- **Memory access (MEM)**
- To read the data memory using the address from the **EX/MEM**
- To load the data into the **MEM/WB**

- **Write back (WB)**
- To read the data from the **MEM/WB**
- To write the data into the register file

- Ex: **`sw`** goes through the 5 stages of pipelined execution
- Fig. 4.39 and 4.40
- Two key points in the pipelined datapath
- To pass something from an early pipe stage to a later pipe stage, the information must be placed in a pipeline register
- Otherwise the information is lost when the next instruction enters that pipeline stage
- For the **`sw`** instruction, the data been stored was first placed in the **ID/EX** and then passed to the **EX/MEM**
- Each logical component of the datapath can be used only within a ***single*** pipeline stage
- Otherwise a structural hazard will occur
- A bug in this design: **the write register number** of `lw` is not supplied!
$\Rightarrow$ To preserve the destination register number in `lw` instruction
- **`lw`** must pass the write register number from the **ID/EX** through **EX/MEM** to the **MEM/WB**


### Graphically representing pipelines
- **Multiple-clock-cycle pipeline diagram**
- Simpler, do not contain all the details
- Occupying the proper clock cycles
- Time advances from left to right
- Instructions advance from the top to the bottom

- **Single-clock-cycle pipeline diagram**
- To show the details of what is happening within the pipeline during each clock cycle
- Typically the drawings appear in groups to show pipeline operation over a sequence of clock cycles

### Pipelined control
- To label the control lines
- Similar as those used in the single-cycle implementation
- No separate write signals for **PC** and pipeline registers
- **PC** and pipeline registers are written on each clock cycle



- Control lines corresponding to 5 pipeline stages
- Instruction fetch & Instruction decode/register file read
- No optional control lines to set
- Execution/address calculation
- ***RegDst***, ***ALUOp***, ***ALUSrc***
- Memory access
- ***Branch***, ***PCSrc***, ***MemRead***, ***MemWrite***
- Write back
- ***MemtoReg***, ***RegWrite***

- To implement control
- To extend the pipeline registers to include control information
- To create the control information during instruction decode


## Data Hazards: Forwarding versus Stalling
### Data hazards and forwarding
- Example
```
sub $2, $1, $3 // only $2 here is write
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
```
- Value of `$2` changes from 10 to -20 at **CC5**
- `sw` gets the correct value
- Maybe `add` can also get the correct value
- `and`, `or` get the incorrect value because dependence lines go **backward** in time

- To resolve data hazards
- The design of the register file hardware
- Write: 1^st^ half of the cycle Read: 2^nd^ half of the cycle
- The read delivers what is written
- ***Forwarding***
- To forward the data as soon as it is available to any units that need it before it is available to read from the register file
- The data is available at the end of the ***EX*** (R-type) or ***MEM*** (load)stage
- To forward the data to an operation in the ***EX*** stage
- ALU operation, effective address calculation
- To detect data hazards
- Hazard conditions
- R**d**: destination

- Example: Dependence detection
- All instructions detect data hazards in the (beginning of) ***EX*** stage
- **sub-and**
- CC 4: instruction and is in the EX stage
- Hazard type 1a: ***EX/MEM.RegisterRd*** = ID/EX.RegisterRs = `$2`
- **sub-or**
- CC 5: instruction or is in the EX stage
- Hazard type 2b: MEM/WB.RegisterRd = ID/EX.RegisterRt = `$2`
- **sub-add**
- CC 6: instruction add is in the EX stage
- Not a hazard, the register file supplies the proper data
- **sub-sw**: no data hazard
- Above hazard condition is inaccurate
- If an instruction does not write registers, its forwarding is unnecessary
- Examining the WB control field of the pipeline register during the EX and MEM stages to determine whether RegWrite is asserted
- If `$0` is used as destination register, its value will not be forwarded
- MIPS requires that every use of `$0` as an operand must yield an operand value of 0
- Adding EX/MEM.RegisterRd $\ne$ 0 or MEM/WB.RegisterRd $\ne$ 0


- To forward the proper data
- The pipeline registers holding the data to be forwarded
- Adding multiplexors and controls to the input of the ALU
- ***Forwarding control*** is the <font class='r'>EX</font> stage




- Hazard detection and control signals
- ***EX*** hazard
- ```
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
```
- ```
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
```

- ***MEM*** hazard (original)
- ```
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
```
- ```
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
```

- ***MEM*** hazard (revised)
- EX and MEM hazards occurs simultaneously
- $\rightarrow$ Want to use the most recent value
- $\rightarrow$ Only forwards if EX hazard condition is <font class='r'>not true</font>

- ```
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
```
- ```
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
```


### Data hazards and stalls
- Forwarding cannot resolve all data hazards
- An instruction tries to read a register following a `lw` that writes the same register
- $\Rightarrow$ Something must ***stall*** the pipeline for the combination of `lw` followed by an instruction that reads its result

- ***Hazard detection unit***
- It operates during the ***ID*** stage to insert the ***stall*** between the load and its use
- Condition
- ```
if (ID/EX.MemRead
and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
or (ID/EX.RegisterRt = IF/ID.RegisterRt))) stall the pipeline
```
- *If the condition holds, the instruction followed `lw` stalls one clock cycle*
- The instruction in the ***IF*** stage must also be stalled
- To prevent the ***PC*** and ***IF***/***ID*** pipeline register from changing
- The instruction in the ***IF*** stage will continue to be read using the same ***PC***
- The registers in the ***ID*** stage will continue to be read using the same instruction fields in the ***IF***/***ID*** pipeline register
- The back half of the `lw` pipeline starting with the ***EX*** stage $\Rightarrow$ ***nops***
- An instruction that does no operation to change stage
- How to insert ***nops***
- To deassert all 9 control signals in the ***EX***, ***MEM***, and ***WB*** stages
- To change the ***EX***, ***MEM***, and ***WB*** control fields of the ***ID/EX*** pipeline register to 0
- The changed control signals will **percolate** through the pipeline
- No registers or memories are written if the control values are all 0
- Example
- `and` becomes a nop, and all instructions following and are delayed one cycle
- `and` and `or` repeat in clock cycle 4 what they did in clock cycle 3
- A stall bubble delays everything behind it


 
## Control Hazard
### Control hazards (Branch Hazard)
- The delay in determining the proper instruction to fetch
- An instruction must be fetched at every clock cycle to sustain the pipeline
- The decision about whether to branch doesn't occur until the ***MEM*** stage
- Relatively simple and occur less frequently than data hazards
### Assume branch not taken
- Stalling until the branch is complete is too slow
- To ***predict*** the branch will be ***untaken***
- To continue execution down the sequential instruction stream
- If the branch is **untaken**
- The prediction is correct, and the pipeline is full-filled
- If the branch is **taken**
- The prediction is wrong
- To discard instruction that are fetched and being exectuted
- Execution continues at the branch target
- To discard instructions (***flush*** instructions)
- In previous design, the `beq` decides whether to branch in the ***MEM*** stage
- Three insturctions in ***IF***, ***ID***, and ***EX*** stages must be flushed
- 3 clock cycles delay
- Just change control to 0 in the ***ID*** is not sufficient

### Reducing the delay of branches
- If the branch execution is moved earlier in the pipeline, then fewer instructions need be flushed
- Many branched rely only on simple tests
- Do not require a full ALU operation but can be done with at most a few gates
- A separate instruction that uses an ALU for a more complex branch decision
- To move the branch decision up
- Branch address calculation
- To move the branch adder from the ***EX*** stage to the ***ID*** stage
- To be performed for all instructions but only used when needed
- Branch decision evaluation
- To perform equality comparison in ***ID*** stage using XOR an OR gates
- Additional forwarding and hazard detection hardware
- To forward results to the equality test logic that operates during ***ID***
- A data hazard can occur and a stall will be needed
- ALU instruction immediately preceding a branch

- To reduce the penalty of a branch to one instruction if the branch is taken
- To flush instructions in the ***IF*** stage
- ***IF.Flush*** is added to zero the instruction field of the ***IF/ID*** pipeline register
- The fetched instruction is transformed into a ***nop***
- Example: Pipelined branch
- Assumption
- The pipeline is optimized for branches that are untaken
- Branch execution is moved to the ***ID*** stage
- ```
36 sub $10, $4, $8
40 beq $1, $3, 7 # PC-relative branch to 40 + 4 + 7 * 4 = 72
44 and $12, $2, $5
48 or $13, $2, $6
52 and $14, $4, $2
56 slt $15, $6, $7
...
72 lw $4, 50($7)
```
- Only one pipeline bubble on a taken branch


### Dynamic branch prediction
- In an aggressive pipeline, a simple **static prediction** scheme may waste too much performance
- Prediction of branches at runtime using information
- Prediction is a hint that we hope is correct
- Fetching begins in the predicted direction (taken or untaken)
- **Branch prediction buffer (branch history table)**
- A small memory indexed by the low-order address bits of branch
- To contain one or more bits indicating whether the branch was recently taken or not
- Several branched with the same low-order address bits will share the same entry in the branch prediction buffer

- Dynamic branch prediction schemes
- **1-bit prediction** scheme
- Using 1 bit to record and predict
- Example: loops and prediction
- Loop branch, branch taken 9 times and then untaken once
- Mispredict on the first and last loop iteration
- ***Prediction accuracy***: 80%
- **2-bit prediction** scheme
- Using 2 bits to record and predict
- A prediction must be wrong twice before it is changed

- **Correlating predictor**
- Using information about both a local branch and the global behavior of recently executed branches
- Greater prediction accuracy for the same number of prediction bits
- **Tournament branch predictor**
- Multiple predictor for each branch
- A selection mechanism that chooses which predictor to enable for a given branch
- **Branch target buffer**
- A structure that caches the destination PC or destination instruction for a branch
- Usually organized as a cache with tags
- More costly than a simple prediction buffer
- A branch prediction buffer can be accessed during the ***IF*** stage
- Predict taken: the target address can be known as early as the ***ID*** stage
- Predict untaken: sequential fetching
- If the prediction turns out to be wrong, the prediction bits are changed based on the prediction scheme it used
- **Branch delay slot**
- The slot after a delayed branch instruction
- Compilers and assemblers try to place an instruction that always executes after the branch
- To make the successor instructions valid and useful
- Three ways to schedule the branch delay slot

### Example: Comparing performance of several control schemes
- Problem
- To compare performance for single-cycle, multicycle, and pipeline
- Instruction mix
- 25% loads, 10% stores, 11% branches, 2% jumps, 52% ALU instuctions
- Function unit times
- Memory access: 200ps
- ALU operation: 100ps
- Register file R/W: 50ps
- Pipelined
- Half of the loads are immediately followed by an instruction that uses the results
- Branch delay on misprediction is 1 clock cycle
- 1/4 branches are mispredited
- Jumps always pay 1 full clock cycle (average time is 2 clock cycle)
- Solution
- Single-cycle
- Clock cycle: 200 + 50 + 100 + 200 + 50 = 600 ps
- CPI: 1
- Multicycle
- Clock cycle: 200 px
- CPI: 0.25 * 5 + 0.1 * 4 + 0.11 * 3 + 0.02 * 3 + 0.52 * 4 = 4.12
- Pipelined
- Clock cycle: 200 ps
- CPI:
- Loads: 0.5 * 1 + 0.5 * 2 = 1.5
- Stores & ALU instruction: 1
- Branches: 0.75 * 1 + 0.25 * 2 = 1.25
- Jumps: 2
- $\Rightarrow$ 0.25 * 1.5 + 0.1 * 1 + 0.11 * 1.25 + 0.02 * 2 + 0.52 * 1 = 1.17
- Average instruction time
- Single-cycle: 600 * 1 = 600 ps
- Multicycle: 200 * 4.12 = 824 ps
- Pipelined: 200 * 1.17 = 234 ps

## Exceptions
### Preliminary
- Control is the most challenging aspect of processor design
- The hardest part to get right and make fast
- **Exception (interrupt)**
- Events other than branches or jumps that change the normal flow of instruction execution
- Exception
- Any unexpected change in control flow without distinguishing whether the cause is **internal or external**
- Interrupt
- An exception that comes from **outside of the processor**
- To detect exceptional conditions and taking the appropriate action is often on the critical timing path of a processor

### How exceptions are handled in the MIPS architechture
- Execution of an **undefined instruction**, an **arithmetic overflow**
- When an exception occurs
- To save the address of the offending instruction in the **exception program counter (EPC)**
- To determine where to restart the execution of the program
- To transfer control to the operating system at some **specified address**
- To provide some service to the user program
- To take some **predefined action** in response to an overflow
- To stop the execution and report an error
- To **terminate or continue** the program
- Two ways to communicate the reason for an exception
- Using a ***Cause*** register to indicate the reason for the exception
- **Vectored interrupts**
- The address to which control is transferred is determined by the cause of the exception
- The addresses are separated by **32 bytes or eight instructions**
- When the exception is not vectored
- A single entry point for all exceptions
- The OS decodes the status register to find the cause

- To implement the exception system used in the MIPS architecture
- The single entry point being the address 8000 0180~hex~
- **EPC** register
- A 32-bit register used to hold **the address of the affected instruction**
- **Cause** register
- A 32-bit register used to record the cause of the exception
- Ex:
- 10: undefined instruction
- 12: arithmetic overflow
### Exceptions in a pipelined implementation
- To treat exceptions as another form of control hazard
- Steps to deal with exception (using an arithmetic overflow as an example)
- To transfer control to the exception routine immediately
- To flush instructions that follow the instrction caused exception
- ***IF*** stage
- To zero the instruction field of the ***IF/ID*** pipeline register (turing into a **nop** instruction)
- Similar as steps for branch misprediction
- ***ID*** stage
- To zero control signals using the multiplexor already in the ***ID*** stage
- Similar as steps for stall
- ***ID.Flush*** is ORed with the stall signal from the hazard detection unit
- ***EX*** stage
- ***EX.Flush*** is added to cause new multiplexors to zero the control lines
- To prevent the instruction in the ***EX*** stage from writing its result in the ***WB*** stage
- To start fetching instructions from location 8000 0180~hex~
- To add an additional input to the ***PC*** multiplexor that sends 8000 0180~hex~ to the ***PC***
- To stop or complete the instruction that caused the exception
- To stop exection in the middle of the instruction if necessary
- If the instrction requires to be eventually completed, the instruction will be restarted from the beginning after the exception is handled
- To save the address of the offending instruction in the ***EPC*** register
- In reality it saves the address + 4
- The exception handling routine must first subtract 4 from the saved value

- Example: Exception in a pipelined computer
- Instruction sequence
- ```
hex
40 sub $11, $2, $4
44 and $12, $2, $5
48 or $13, $2, $6
4C add $1, $2, $1
50 slt $15, $6, $7
54 lw $16, 50($7)
...
```
- Instruction to be invoked on an exception
- ```
80000180 sw $25, 1000($0)
80000184 sw $26, 1004($0)
```
- Assume an overflow exception occurs in the `add` instruction


- Multiple exceptions can occur simultaneously
- To prioritize the exceptions (by the hardware)
- I/O device requests and hardware malfunctions are not associated with a specific instruction
- The implementation has some flexibility as to when to interrupt the pipeline
- To match the exception to the instruction
- The ***EPC*** register captures the address of the interrupted instructions
- The ***Cause*** register records all possible exceptions in a clock cycle
- Knowing in which pipeline stage a type of exception can occur
- An undefined instruction is discovered in the ID stage
- Invoking the operating system occurs in the EX stage
- The difficulty of always associating the correct exception with the correct instruction in pipelined computers
- **Imprecise interrupt (imprecise exception)**
- An interrupt or execution that is not associated with the exact instruction that was the cause the interrupt or exception
- The operating system will determine which instruction caused the problem
- **Precise interrupt (precise exception)**
- An interrupt or execution that is always associated with the correct instruction