Ch04: The Processor - HackMD

<style> .r { color: rgb(226, 60, 47); font-weight: bold; } .g { color: rgb(40, 151, 72); font-weight: bold; } .b { color: rgb(46, 123, 237); font-weight: bold; } </style> # The Processor :::warning **男女友 & 父母二選一** A: Day mod 5 == 0 BF/GF B: Day mod 7 == 0 parent A and B == 0 BF/GF B and ~A parent ::: ## Introduction ### Preliminary - The performance of a machine was determined by three key factors - **Instruction count, clock cycle time, CPI** - Both the clock cycle time and the CPI are determined by the implementation of the processor - In this chapter - To construct the datapath and control unit of the MIPS instruction set - A highly abstract and simplified overview - To build up a datapath and construct a simple version of a processor - A more realistic pipelined MIPS implementation - To implement more complex instruction sets (x86) - **Make the common case fast** - **Simplicity favors regularity** :::info 為了一致性經常會執行一些額外的行為，只要不對 register 裡的值造成影響就沒關係。 ::: ### A basic MIPS implementation - An implementation includes the following instruction - ***Memory-reference instructions:*** `lw`, `sw` - ***Arithmetic-logical instructions:*** `add`, `sub`, `and`, `or`, `slt` - ***Branch instructions:*** `beq`, `j` - An overview of the implementation - Execution steps - 每個步驟前三步都很相似 - ***Step 1***: send the **program counter (PC)** to the memory to **fetch** the instruction - ***Step 2***: read one or two registers（一般都直接讀兩個以保持一致性） - Using fields of the instruction to select the registers to read - **`lw`: read one register** - Other instructions: read two registers - ***Step 3***: use arithmetic-logical unit (ALU) - <font class='b'>`lw`, `sw`: address calculation</font> - <font class='g'>`add`, `sub`, `and`, `or`, `slt`: operation execution</font> - `beq`: comparison - <font class='r'>`j`: doesn’t use ALU</font> - ***Step 4 & 5***: actions depend on different instruction types - **store 要 4 個 step, load 要 5 個** - `lw`, `sw`: access the memory - `add`, `sub`, `and`, `or`, `slt`: write the register file - `beq`, `j`: change the **PC**, otherwise the PC should be **incremented by 4** - The abstract view of a MIPS implementation - **Multiplexors**, which select from among several inputs based on the setting of its control lines, are omitted - The control unit and control lines are omitted ![An abstract view of the implementation of the MIPS subset showing the major functional units and the major connections between them](https://hackmd.io/_uploads/S1Rb4n1mp.png) - The datapath of the MIPS implementation - Three required multiplexors are added - A **control unit** and control lines for the major functional units are added ![The basic implementation of the MIPS subset, including the necessary multiplexors and control lines](https://hackmd.io/_uploads/Sy1VV3JQ6.png) （黑色為資料傳輸線，藍色是控制訊號線，通常會全部畫黑色的） ## Logic Design Conventions ### The datapath elements in the MIPS implementation - Logic element that operates on **data values** - Combinational logic (no internal storage) - The outputs ==**depend only on the current inputs**==（輸出只跟當前輸入有關） - Given the same input always produces the same output - Ex: ALU - Logic element that contains **state (state element)** - **Sequential** logic (**internal storage**) - The outputs depend on both inputs and contents of the internal state - **At least two inputs and one output** - Input: data value to be written into the element - Input: clock - Used to determine when the state element should be written - The state element **can be read at any time** - Output: the value **written in an earlier clock cycle** (rising edge) - Ex: Instruction/data memory, register ### Clocking methodology - An approach to determine when data is **valid and stable relative to the clock** - To define when signals can be read and written - If a signal is written and read at the same time, the value of the read will be **unpredictable**! - **Edge-triggered clocking** - 另一種是 level-trigger - Any values stored in a state element are updated only on a clock edge - For a combinational logic - Inputs and outputs must be **state elements** - Inputs: values written in a previous clock cycle - Outputs: values can be used in a following clock cycle - All signals must propagate from inputs, through the combinational logic, and to outputs in the time of one clock cycle - The time necessary for the **signals to reach outputs** defines the length of the clock cycle - 因為所有指令必須在一個 clock cycle 內完成，才來得及在下一個 edge 把結果寫入 register（沒有及時寫入資料會不見） ![Combinational logic, state elements, and the clock are closely related](https://hackmd.io/_uploads/HJMINhy7T.png) - **Control signal vs. Data signal** - Control signal - Used for **multiplexor selection** or for **directing the operation** of a functional unit - The state element is changed only when the **write control** signal is asserted and a clock edge occurs - If a state element is written on **every active clock edge**, the control signal is often omitted（有寫入功能的要裝開關，若要一直開著可以不用裝） - If a state element is not updated on every clock, then an explicit write control signal is required - clock edge 到達 (rising edge) + 開關打開才能寫入（通常隨時都能讀） - Data signal - To contain information that is operated on by a functional unit - Terminologies - **Asserted/deasserted**: a signal is logically high/low - **Assert/deassert**: a signal should be driven logically high/low - To allow a state element to be read and written in the **same clock cycle** ![image.png](https://hackmd.io/_uploads/rJPcNhkma.png) - The following steps are done in the same clock cycle 1. To **read** the contents of a register 2. To send the value through some combinational logic 3. To **write** the same register - No feedback within a single clock cycle ## Building a Datapath ### The datapath elements in the MIPS implementation Elements used to ***fetch*** instructions and ***increment*** the ***PC***. - ***Instruction memory*** - A memory unit to store instruction of a program and supply instructions given an address - **只存指令，只讀不寫** - ***Program counter (PC)*** - A register holds the address of the current instruction - ***Adder*** - To increment the **PC** to the address of the next sequential instruction - An ALU that the control always specifies an add operation - 負責 PC + 4 ![image](https://hackmd.io/_uploads/BkeNMbAYHp.png) - Considering the **R-type** instructions - ``` add/sub/and/or/slt $t1, $t2, $t3 ``` - To read two register, perform an ALU operation, write the result to a register - **Register file** - A collection of 32 32-bit general-purpose registers - Any register can be read or written by specifying its register number - To read a data word - An **input** to specify the register number to be read (5 bits) - An **output** to carry the value been read (**32 bits**) - The register file always outputs the contents of whatever register numbers are on the Read register inputs - To write a data word - An **input** to specify the register number to be written - An **input** to supply the data to be written into the register - To occur at the clock edge and the write control signal is asserted - $\Rightarrow$ A total of four inputs and two outputs are needed - **Arithmetic logic unit (ALU)** - To take **two 32-bit inputs** and produce a 32-bit result - A 1-bit **Zero** signal to indicate the result is 0 - ALU result 為 0，Zero 輸出為 1，反之則為 0 - 4-bit control signals (ALU operation) ![The two elements needed to implement R-format ALU operations are the register file and the ALU.](https://hackmd.io/_uploads/S1G2WAYSp.png) - Considering the memory reference instructions (`lw` and `sw`) - **`lw/sw $t1, offset($t2)`** - To compute a memory address by adding `$t2` to the 16-bit signed offset - `lw`: memory $\rightarrow$ `$t1` - `sw`: `$t1` $\rightarrow$ memory - **Data memory** - A state element to store data - Two inputs for **address** and the **write data**, and one output for the read result - Separate read and write controls - To read the value of an invalid address can cause problems - **Sign extension unit**: 16 bits $\rightarrow$ 32 bits - To increase the size of a data item - To replicate the high-order sign bit of the original data item in the high-order bits of the larger, destination data item ![The two units needed to implement loads and stores, in addition to the register file and ALU, are the data memory unit and the sign extension unit](https://hackmd.io/_uploads/Hk_fzCtr6.png) ### For different classes of instruction - `add/sub/and/or/slt $t1, $t2, $t3` - Fetching instructions and incrementing the **PC** - Register file: to read operands in `$t2` and `$t3`, to write result to `$t1` - ALU: to operate - `lw/sw $t1, offset($t2)` - Fetching instructions and incrementing the **PC** - Register file: to read the **base register** `$t2`, to write/read data to/from `$t1` - **Sign extension**: to sign-extend the 16-bit offset to a 32-bit signed value - ALU: to compute a memory address (base register `$t2` + offset) - Data memory: to read/write data - `beq $t1, $t2, offset` - Fetching instructions and incrementing the **PC** - Register file: to read registers `$t1` and `$t2` for comparison - **Sign extension**: to sign-extend the 16-bit offset to a 32-bit signed value - Adder: to compute the branch target address (PC + offset) - ***Shifter***: to left shift the (sign-extended) **offset by 2** - ALU: to compare contents of registers `$t1` and `$t2` by subtraction - Using the zero signal to indicate whether the operands are equal or not - **Branch taken**: Zero = 1 $\Rightarrow$ **new PC = branch target address** - **Branch not taken** (untaken branch): Zero = 0 $\Rightarrow$ new PC = PC - `j offset` - Fetching instructions and incrementing the PC - hifter: to left shift the **offset by 2** (26 bits $\rightarrow$ 28 bits) - To replace the lower 28 bits of PC ![The datapath for a branch uses the ALU to evaluate the branch condition and a separate adder to compute the branch target as the sum of the incremented PC and the sign-extended, lower 16 bits of the instruction, shifted left 2 bits.](https://hackmd.io/_uploads/SkNJWm5Ba.png) ### Create a single datapath - To execute all instructions in **one clock cycle** - No datapath resource can be used more than once per instruction - 所有元件只能用一次 - Any element **needed more than once** must be **duplicated** - **Separate instruction and data memories** - Some elements can be shared by different instruction flows - Using a **multiplexor** and **control signal** to select among the multiple inputs - Example: Building a datapath - To build a datapath for **memory references** and **R-type instructions** using a single register file and a single ALU - Using the ALU - R-type instruction: inputs coming from two registers - Memory reference: the 2nd input is the sign-extended 16-bit offset - $\Rightarrow$ A multiplexor is needed for the 2nd ALU input - The value stored into a destination register - R-type instruction: from the ALU - Memory reference: from the memory (for a load) - $\Rightarrow$ A multiplexor is needed for the data stored into the register file ![The datapath for the memory instructions and the R-type instructions](https://hackmd.io/_uploads/Sklr2sOQp.png) - To combine all the pieces to make a simple datapath - A datapath for, `add`, `sub`, `and`, `or`, `slt`, `lw`, `sw`, and `beq` - A multiplexor is required to select either the **PC + 4** or the branch target address to be written into the **PC** - A **control unit** is added to take inputs and generate - a **write signal** for each state element - the **selector control** for each multiplexor - the **ALU control** ![The simple datapath for the MIPS architecture combines the elements required by different instruction classes](https://hackmd.io/_uploads/ByGwC2_Qp.png) ## A Simple Implementation Scheme ### Multiple levels of decoding - The **main control unit** generates the **ALUOp** bits - **ALUOp** bits are used as input to the **ALU control unit** that generate the actual signals to control the ALU unit - Can reduce the size of the main control unit - Potentially increase the speed of the control unit - The speed of the control unit is often critical to clock cycle time ![image.png](https://hackmd.io/_uploads/HJykg3dQT.png =300x) ### The ALU control - The ALU has 4 control inputs but only 6 combinations are used - A small ALU control unit to generate the 4-bit ALU control input - Inputs: the **funct** field of the instruction and a 2-bit **ALUOp** - **ALUOp** = 00: addition for `lw`/`sw` - **ALUOp** = 01: subtraction for `beq` （相減比較是否相等） - **ALUOp** = 10: for R-type instructions, determined by the operation encoded in the funct field - Output: a 4-bit signal that directly controls the ALU ![How the ALU control bits are set depends on the ALUOp control bits and the different function codes for the R-type instruction.](https://hackmd.io/_uploads/By6hDhcSa.png) - To implement the ALU control unit: truth table - Only show the truth table entries for which the ALU control must have a specific value - The full truth table has $2^8 = 256$ entries - Don’t-care term - The output does not depend on the value of all the inputs - Useful to keep the table compact ![The truth table for the three ALU control bits (called Operation)](https://hackmd.io/_uploads/H1rLF3qHa.png) ### Designing the main control unit #### Observations about instruction formats - ***op*** field (**opcode**) in bits `31:26` - To denote the operation and format of an instruction (`Op [5:0]`) - ***rs*** and ***rt*** fields at positions `25:21` and `20:16` - To specify the two registers to be read, for R-type, `beq`, `sw` instructions - The base register for `lw` and `sw` is in bits 25:21 (**rs**) - The 16-bit offset for `beq`, `lw`, and `sw` is in positions 15:0 - Destination register - For `lw` is in bits `20:16` (**rt**) - For R-type is in bits `15:11` (**rd**) - $\Rightarrow$ A multiplexor is needed to select which field of the instruction is used to indicate the register number to be written ![image](https://hackmd.io/_uploads/B10R2h5H6.png) #### The datapath with multiplexors and control signals - ALU control unit, 4 multiplexors, 7 single-bit control lines, and a 2-bit ALUOp control signal - All the multiplexors have two inputs and require a single control line ![image.png](https://hackmd.io/_uploads/Hks-mh_Xp.png) - The function of each of the control signals - All but **PCSrc** can be set based solely on the opcode - **PCSrc** should be asserted if the instruction is `beq` and the **Zero** output of the ALU is asserted (PCSrc = **Branch** AND **Zero**) ![image](https://hackmd.io/_uploads/rk5psn5Sp.png) ![image](https://hackmd.io/_uploads/Byum2hqr6.png) - All 9 control signals can now be set on the basis of **opcode** to the control unit ![image.png](https://hackmd.io/_uploads/SJezP2OQ6.png) #### Operation of the datapath - R-type instruction: `add/sub/and/or/slt $t1, $t2, $t3` - The instruction is fetched, **PC** is incremented - `$t2` and `$t3` are read from the register file; main control unit sets control lines - The ALU operates, using the ***funct*** field of the instruction (bits `5:0`) to generate the ALU function - The result from the ALU is written into the register file (bits `15:11`) ![image](https://hackmd.io/_uploads/Sk5FUJsB6.png) - `lw $t1, offset($t2)` - The instruction is fetched, **PC** is incremented - `$t2` is read from the register file;main control unit sets control lines - The ALU performs addition to compute the address(`$t2` + sign-extended offset) - To read the data memory - The data from the memory unit is **written into the register** file (bits `20:16`) ![image](https://hackmd.io/_uploads/B1Xp8yiST.png) - `beq $t1, $t2, offset` - The instruction is fetched, **PC** is incremented - `$t2` and `$t3` are read from the register file; main control unit sets control lines - The ALU performs subtraction for equality comparison; the adder computes the branch target address(PC + sign-extended offset shifted left by 2) - The ***Zero*** result from the ALU is used to decide which adder result to store into the PC ![image](https://hackmd.io/_uploads/ByUgv1jrp.png) #### Finalizing control - Truth table of the main control unit - Input: the 6-bit ***opcode*** field `Op[5:0]` - Output: 9 control lines(7 1-bit control lines, 2-bit ***ALUOp***) - Example: Implementing jumps - Jump address - [31:28] of the current PC + 4, 26-bit immediate filed of the jump instruction, bits 002 - An additional control signal ***Jump*** ![image](https://hackmd.io/_uploads/ry2uaJjr6.png) ![image.png](https://hackmd.io/_uploads/SyuuZTdXp.png) ### Why a single-cycle implementation is not used today - The single-cycle design is **inefficient** - CPI = 1 - The clock cycle is determined by the longest possible path in the machine - Example: Performance of single-cycle machines - Problem - Operation times（時間僅用於比較相對關係，不一定是實際比例） - Memory units 200 ps, ALU & adder 100 ps, register file read/write 50 ps - Instruction mix - 25% loads, 10% stores, 45% ALU instruction, 15% branches, 5% jumps - Implementation 1 - Every instruction operates in 1 clock cycle of a **fixed length** - Implementation 2 - Every instruction executes in 1 clock cycle using a **variable-length** clock - To compare the performance between implementations 1 and 2 ![image.png](https://hackmd.io/_uploads/ryBJfa_7T.png) - Solution - To compute the critical path for different instruction classes - Clock cycle of implementation 1: 600 ps - Clock cycle of implementation 2: $600 \times 25\% + 550 \times 10\% + 400 \times 45\% + 350 \times 15\% + 200 \times 5\% = 447.5 ps$ - Performance ratio: $$ \frac{\text{CPU clock cycle}_\text{single clock}}{\text{CPU clock cycle}_\text{variable clock}} = \frac{600}{447.5} = 1.34 $$ - Using the single-cycle design with a fixed clock cycle - Acceptable for small and simple instruction set - The clock cycle is equal to the worst-case delay for all instructions - Each functional unit can be used only once per clock - $\Rightarrow$ **Multicycle** implementation - Shorter clock cycle - Each instruction requires multiple clock cycles to complete ## An Overview of Pipelining ### Pipeline - An inplementation technique in which multiple instructions are **overlapped** in execution (讓指令可以重疊執行，非齊頭式，與 superscalar 不同) ![image](https://hackmd.io/_uploads/r1K7YhW4a.png =500x) > Fig. 4.25 The laundry analogy for pipelining - Observations of the laundry example - Pipelining is faster for many loads because everything is working in parallel - All stages are operating concurrently - We have separate resources for each stage - Pipelining improves ==<font class='r'>*throughput*</font>== - The <font class="b">total time</font> to complete the work is <font class="b">decreased</font> - The time to complete a <font class='g'>single task</font> is <font class='g'>not decreased</font> - The ideal ***speedup*** due to pipelining is equal to the number of ***stages*** in the pipeline - 以下兩個條件要**同時符合**。 - If all stages take about the same amount of time (所有 stage 花的時間均相同。) - There is enough work to do (工作要夠多) - In Fig. 4.25, the speedup is only 2.3, because there are only 4 loads - The pipeline is not completely full **at the beginning and end** of the workload (pipeline 啟動跟結束時的資源並不是滿載的) - This start-up (開始) and wind-down (結束) affects performance when the number of tasks is not large compared to the number of stages in the pipeline - In Fig. 4.25, if the number of loads is much larger than 4, the stages will be full most of the time and the speedup will be close to 4 - Example: **Single-cycle vs. Pipelined performance** - Problem - 8 instructions: **`lw`**, **`sw`**, **`add`**, **`sub`**, **`and`**, **`or`**, **`slt`**, **`beq`** - Memory access: 200 ps - ALU operation: 200 ps - Register file R/W: 100 ps - To compare the average time between instructions of a single-cycle implementation to a pipelined implementation - Solution - The pipeline execution clock cycle must have the worst-case clock cycle of 200 ps - To execute three **`lw`** instructions - The time between the 1^st^ and 4^th^ instruction: 2400 ps vs. 600 ps - The total execution time: 2400 ps vs. 1400 ps ![image](https://hackmd.io/_uploads/rkbAchbNp.png =600x) ![image](https://hackmd.io/_uploads/S16ei3WNp.png =600x) - Pipelining speed-up discussion - Assuming ideal conditions: the stages are perfectly balanced $$ \Rightarrow \text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instruction}_{\text{nonpipelined}}}{\text{Number of pipes stages}} $$ - **Under ideal conditions and with a large number of instructions** - $\rightarrow$ **The speedup is approximately equal to the number of pipe stages** - In previous example: 800 ps should be improved to 160 ps - Pipelining involves some <font class='r'>overhead</font> - The time per instruction in the pipelined processor will exceed the minimum possible - The speed-up will be less than the number of pipeline stages - In the previous example: 2400 ps $\rightarrow$ 1400 ps - 3 **`lw`** instructions: speedup= 2400/1400 ≈ 1.71 - 1,000,003 **`lw`** instructions: speedup = 800,002,400/200,001,400 ≈ 4 - Summary - Pipelining improves performance by increasing instruction <font class='r'>*throughput*</font>, as opposed to decreasing the execution time of an individual instruction. - Instruction throughput is the important metric because real programs execute billions of instructions. ### Designing instruction sets for pipelining - MIPS instruction set is **designed for pipelined execution** - Characteristics of MIPS instruction set - <font class='b'>All MIPS instructions are the same length (32-bit)</font> - Easier to fetch and decode instructions - Intel x86 instruction set - Instructions vary from 1~15 bytes - To translate x86 instructions into simple micro-operation - To pipeline the micro-operations rather than the native x86 instructions - MIPS has **only a few instruction formats**, with the source register fields being located at the same place in each instruction - **The 2^nd^ stage can begin reading the register** file at the same time that the hardware is determining what type of instruction was fetched - **Memory operands only appear in loads and stores** - We can use the execute stage to calculate the memory address and then access memory in the following stage - Operands must be **aligned in memory** - A single data transfer instruction will not require two data memory accesses - The requested data can be transferred in a single pipeline stage ### Pipeline hazards (Pipeline 代價) - Hazards are situations in pipelining when the next instruction **cannot execute in the following clock cycle** (Hazard: 下一個指令無法在下一個 clock cycle 進來。) - **Structural hazards** (結構性危險) - The hardware cannot support the combination of instructions that we want to execute in the same clock cycle（硬體不足，設計不良） - Example - Suppose we had a single memory instead of two memories - In the same clock cycle the 1^st^ instruction is **accessing data** from memory while the 4^th^ instruction is **fetching an instruction** from the **same memory** ![image](https://hackmd.io/_uploads/Skk8ChWET.png) - **Data hazards** - The pipeline must be ***stalled*** (停滯) because one step must **wait for another to complete** (等前面的資料完成) - Data that is needed to execute the instruction is not yet available - Arising from the <font class='b'>dependence</font> of one instruction on an earlier one that is still in the pipeline (資料相依) - ![image](https://hackmd.io/_uploads/r1_IQJfNa.png =400x) - ``` add $s0, $t0, $t1 sub $t2, $s0, $t3 # 3 clock cycles are wasted in the pipeline ``` - Compiler can remove data hazards but its results is not satisfactory (讓 compiler 調動指令的順序，因為 compiler 很保守所以效果並不好。) - These dependences happen too often - The delay is too long to expect the compiler to rescue us from this dilemma - Solution: ***forwarding/bypassing*** - Adding extra hardware to retrieve the missing item early from the internal resource - Forwarding paths are valid only if the destination stage is **later** in time than the source stage ![image](https://hackmd.io/_uploads/HkayxpbE6.png =500x) - Forwarding cannot prevent all pipeline stalls - ==**Load-use data hazard**== - The data being loaded by a **`lw`** instruction has not yet become available when it is needed by another instruction - **Pipeline stall (bubble)** - A stall initiated in order to resolve a hazard ![image](https://hackmd.io/_uploads/B1aKiZjHT.png =500x) - Example: Reordering code to avoid pipeline stalls - a = b + e; c = b + f; - Original ``` lw $t1, 0($t0) # load b lw $t2, 4($t0) # load e add $t3, $t1, $t2 sw $t3, 12($t0) lw $t4, 8($t0) # load f add $t5, $t1, $t4 sw $t5, 16($t0) ``` ![image](https://hackmd.io/_uploads/HyH_bpZVT.png) - Reordering ``` lw $t1, 0($t0) # load b lw $t2, 4($t0) # load e lw $t4, 8($t0) # load f add $t3, $t1, $t2 sw $t3, 12($t0) add $t5, $t1, $t4 sw $t5, 16($t0) ``` - Each MIPS instruction writes at most one result and does this in the last stage of the pipeline - Forwarding is harder if - There are multiple results to forward per instruction - There is a need to write a result early in instruction execution ![image](https://hackmd.io/_uploads/S1NYMTZN6.png) - **Control hazards (branch hazard)** - Arising from the need to make a decision based on the results of one instruction while others are executing $\Rightarrow$ branch instruction - The instruction that was fetched is not the one that is needed - The flow of instruction addresses is not what the pipeline expected - Solution 1: **stall on branch** - 停下來等，想辦法在下一個 bubble 就知道。 - To stall immediately after fetching a branch, and waiting until the pipeline determines the outcome of the branch - Conservative option, slow - Example - Suppose the branch can be resolve in the 2^nd^ stage - If the branch cannot be resolved in the 2^nd^ stage, we will see an even larger slowdown ![image](https://hackmd.io/_uploads/Sk3bm6WE6.png) - Solution 2: **prediction** - 用猜ㄉ，猜對可以節省時間，猜錯要 recover - Simple approach: always predict that branches will be **untaken** - Untaken: pipeline proceeds at full speed - **Taken**: pipeline stall - Sophisticated version: **branch prediction** ![image](https://hackmd.io/_uploads/H1wLmT-Np.png) - **Branch prediction** - 目前準確率可以到 90% - To assume a given outcome for the branch and proceed from that assumption rather than waiting to ascertain the actual outcome - Some branches predicted as taken and some as untaken - **Dynamic** hardware predictor - Making their guess depending on the behavior of each branch - May change predictions for a branch over the life of a program - When the guess is wrong - To ensure the instructions following the wrongly guessed branch have no effect - To restart the pipeline from the proper branch address - **Pipeline overview summary** - Pipelining is a technique - To exploit ***parallelism*** among instructions in a sequential instruction stream - To increase the number of simultaneously executing instructions - To increase the rate at which instructions are started and completed - To improve instruction ***throughput*** rather than individual instruction ***execution time*** or ***latency*** - **Invisible to programmer** ## Pipelined Datapath and Control ### Pipelined datapath - 5 stages: <font class='r'>IF, ID, EX, MEM, WB</font> - **Instruction fetch, Instruction decode / register file read, Execute / address calculation, Memory access, Write Back** - Up to 5 instructions will be in execution during any single clock cycle - 每個 stage 1 個 clock cycle，每個指令都要 5 個 clock cycle - 同一個時間點，會有 5 個指令同時在工作。（滿載時） ![Fig. 4.33 The single-cycle datapath (similar to Fig. 4.17)](https://hackmd.io/_uploads/S1oyLJM4T.png) - **Pipeline register** - To retain the value of an instruction for its other four stages - All instructions advance during each clock cycle from one pipeline register to the next - The registers are **named for the two stages separated by that register** - **PC** can be thought of as a pipeline register - **PC** is part of the visible architectural state ![image](https://hackmd.io/_uploads/BJKrqyfVp.png) - Ex: **`lw`** goes through the 5 stages of pipelined execution - Highlight of registers or memory - Right half: read - Left half:write - **Instruction fetch (IF)** - fetch 指令，然後 PC + 4 - To read the instruction using the address in the **PC** and place in the **IF/ID** - To increment **PC** by 4,save to **PC** and **IF/ID** ![image](https://hackmd.io/_uploads/ByHFWlGVa.png) - **Instruction decode and register file read (ID)** - The 16-bit **immediate** field of the instruction supplied from **IF/ID** is sign-extended to 32 bit - Register numbers supplied from **IF/ID** is used to read two registers - All above three values and **PC** are placed in the **ID/EX** ![image](https://hackmd.io/_uploads/BJXc3yGNp.png) - **Execute or address calculation (EX)** - Using the ALU to calculate the load address - To add the contents of register 1 and the sign-extended immediate from the **ID/EX** - ALU result is placed in the **EX/MEM** ![image](https://hackmd.io/_uploads/HyOahkzVa.png) - **Memory access (MEM)** - To read the data memory using the address from the **EX/MEM** - To load the data into the **MEM/WB** ![image](https://hackmd.io/_uploads/S11L0yMNa.png) - **Write back (WB)** - To read the data from the **MEM/WB** - To write the data into the register file ![image](https://hackmd.io/_uploads/Sy4FAkfNp.png) - Ex: **`sw`** goes through the 5 stages of pipelined execution - Fig. 4.39 and 4.40 - Two key points in the pipelined datapath - To pass something from an early pipe stage to a later pipe stage, the information must be placed in a pipeline register - Otherwise the information is lost when the next instruction enters that pipeline stage - For the **`sw`** instruction, the data been stored was first placed in the **ID/EX** and then passed to the **EX/MEM** - Each logical component of the datapath can be used only within a ***single*** pipeline stage - Otherwise a structural hazard will occur - A bug in this design: **the write register number** of `lw` is not supplied! $\Rightarrow$ To preserve the destination register number in `lw` instruction - **`lw`** must pass the write register number from the **ID/EX** through **EX/MEM** to the **MEM/WB** ![image](https://hackmd.io/_uploads/ryzBkefVp.png) ![image](https://hackmd.io/_uploads/BkgvqrjSa.png) ### Graphically representing pipelines - **Multiple-clock-cycle pipeline diagram** - Simpler, do not contain all the details - Occupying the proper clock cycles - Time advances from left to right - Instructions advance from the top to the bottom ![image](https://hackmd.io/_uploads/Sk6vgefVT.png) - **Single-clock-cycle pipeline diagram** - To show the details of what is happening within the pipeline during each clock cycle - Typically the drawings appear in groups to show pipeline operation over a sequence of clock cycles ![image](https://hackmd.io/_uploads/H1vnxez4a.png) ### Pipelined control - To label the control lines - Similar as those used in the single-cycle implementation - No separate write signals for **PC** and pipeline registers - **PC** and pipeline registers are written on each clock cycle ![image](https://hackmd.io/_uploads/r1Y1WlGN6.png) ![image](https://hackmd.io/_uploads/HJve-lzNa.png) ![image](https://hackmd.io/_uploads/B1l-blMN6.png) - Control lines corresponding to 5 pipeline stages - Instruction fetch & Instruction decode/register file read - No optional control lines to set - Execution/address calculation - ***RegDst***, ***ALUOp***, ***ALUSrc*** - Memory access - ***Branch***, ***PCSrc***, ***MemRead***, ***MemWrite*** - Write back - ***MemtoReg***, ***RegWrite*** ![image](https://hackmd.io/_uploads/B1a7A0q46.png) - To implement control - To extend the pipeline registers to include control information - To create the control information during instruction decode ![image](https://hackmd.io/_uploads/ryQA0C94T.png) ![image](https://hackmd.io/_uploads/SkgVyys4p.png) ## Data Hazards: Forwarding versus Stalling ### Data hazards and forwarding - Example ``` sub $2, $1, $3 // only $2 here is write and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) ``` - Value of `$2` changes from 10 to -20 at **CC5** - `sw` gets the correct value - Maybe `add` can also get the correct value - `and`, `or` get the incorrect value because dependence lines go **backward** in time ![image](https://hackmd.io/_uploads/SJB9k1jE6.png =500x) - To resolve data hazards - The design of the register file hardware - Write: 1^st^ half of the cycle Read: 2^nd^ half of the cycle - The read delivers what is written - ***Forwarding*** - To forward the data as soon as it is available to any units that need it before it is available to read from the register file - The data is available at the end of the ***EX*** (R-type) or ***MEM*** (load)stage - To forward the data to an operation in the ***EX*** stage - ALU operation, effective address calculation - To detect data hazards - Hazard conditions - R**d**: destination ![image](https://hackmd.io/_uploads/BkMPIzsVp.png) - Example: Dependence detection - All instructions detect data hazards in the (beginning of) ***EX*** stage - **sub-and** - CC 4: instruction and is in the EX stage - Hazard type 1a: ***EX/MEM.RegisterRd*** = ID/EX.RegisterRs = `$2` - **sub-or** - CC 5: instruction or is in the EX stage - Hazard type 2b: MEM/WB.RegisterRd = ID/EX.RegisterRt = `$2` - **sub-add** - CC 6: instruction add is in the EX stage - Not a hazard, the register file supplies the proper data - **sub-sw**: no data hazard - Above hazard condition is inaccurate - If an instruction does not write registers, its forwarding is unnecessary - Examining the WB control field of the pipeline register during the EX and MEM stages to determine whether RegWrite is asserted - If `$0` is used as destination register, its value will not be forwarded - MIPS requires that every use of `$0` as an operand must yield an operand value of 0 - Adding EX/MEM.RegisterRd $\ne$ 0 or MEM/WB.RegisterRd $\ne$ 0 ![image](https://hackmd.io/_uploads/Bke7jfsVT.png) ![image](https://hackmd.io/_uploads/SkHUsfs46.png) - To forward the proper data - The pipeline registers holding the data to be forwarded - Adding multiplexors and controls to the input of the ALU - ***Forwarding control*** is the <font class='r'>EX</font> stage ![image](https://hackmd.io/_uploads/HkY7dSVr6.png) ![image](https://hackmd.io/_uploads/ryVcdHNBp.png) ![image](https://hackmd.io/_uploads/Syt5drEB6.png) ![image](https://hackmd.io/_uploads/r1f2uBNH6.png) - Hazard detection and control signals - ***EX*** hazard - ``` if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10 ``` - ``` if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10 ``` ![image](https://hackmd.io/_uploads/H19J5HESa.png) - ***MEM*** hazard (original) - ``` if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 ``` - ``` if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 ``` ![image](https://hackmd.io/_uploads/Bkn6-wiST.png) - ***MEM*** hazard (revised) - EX and MEM hazards occurs simultaneously - $\rightarrow$ Want to use the most recent value - $\rightarrow$ Only forwards if EX hazard condition is <font class='r'>not true</font> ![image](https://hackmd.io/_uploads/S1RgpHNr6.png) - ``` if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 ``` - ``` if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 ``` ![image](https://hackmd.io/_uploads/Hyf-Y-UH6.png) ![image](https://hackmd.io/_uploads/SJmGF-UHp.png) ### Data hazards and stalls - Forwarding cannot resolve all data hazards - An instruction tries to read a register following a `lw` that writes the same register - $\Rightarrow$ Something must ***stall*** the pipeline for the combination of `lw` followed by an instruction that reads its result ![image](https://hackmd.io/_uploads/SJRec-Ira.png) - ***Hazard detection unit*** - It operates during the ***ID*** stage to insert the ***stall*** between the load and its use - Condition - ``` if (ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))) stall the pipeline ``` - *If the condition holds, the instruction followed `lw` stalls one clock cycle* - The instruction in the ***IF*** stage must also be stalled - To prevent the ***PC*** and ***IF***/***ID*** pipeline register from changing - The instruction in the ***IF*** stage will continue to be read using the same ***PC*** - The registers in the ***ID*** stage will continue to be read using the same instruction fields in the ***IF***/***ID*** pipeline register - The back half of the `lw` pipeline starting with the ***EX*** stage $\Rightarrow$ ***nops*** - An instruction that does no operation to change stage - How to insert ***nops*** - To deassert all 9 control signals in the ***EX***, ***MEM***, and ***WB*** stages - To change the ***EX***, ***MEM***, and ***WB*** control fields of the ***ID/EX*** pipeline register to 0 - The changed control signals will **percolate** through the pipeline - No registers or memories are written if the control values are all 0 - Example - `and` becomes a nop, and all instructions following and are delayed one cycle - `and` and `or` repeat in clock cycle 4 what they did in clock cycle 3 - A stall bubble delays everything behind it ![image](https://hackmd.io/_uploads/BJU1JM8HT.png) ![image](https://hackmd.io/_uploads/HkSC8f8r6.png) ![image](https://hackmd.io/_uploads/rkUCiz8r6.png) ![image](https://hackmd.io/_uploads/Syfz2MLr6.png) ## Control Hazard ### Control hazards (Branch Hazard) - The delay in determining the proper instruction to fetch - An instruction must be fetched at every clock cycle to sustain the pipeline - The decision about whether to branch doesn't occur until the ***MEM*** stage - Relatively simple and occur less frequently than data hazards ### Assume branch not taken - Stalling until the branch is complete is too slow - To ***predict*** the branch will be ***untaken*** - To continue execution down the sequential instruction stream - If the branch is **untaken** - The prediction is correct, and the pipeline is full-filled - If the branch is **taken** - The prediction is wrong - To discard instruction that are fetched and being exectuted - Execution continues at the branch target - To discard instructions (***flush*** instructions) - In previous design, the `beq` decides whether to branch in the ***MEM*** stage - Three insturctions in ***IF***, ***ID***, and ***EX*** stages must be flushed - 3 clock cycles delay - Just change control to 0 in the ***ID*** is not sufficient ![image](https://hackmd.io/_uploads/rkHbgXIHT.png) ### Reducing the delay of branches - If the branch execution is moved earlier in the pipeline, then fewer instructions need be flushed - Many branched rely only on simple tests - Do not require a full ALU operation but can be done with at most a few gates - A separate instruction that uses an ALU for a more complex branch decision - To move the branch decision up - Branch address calculation - To move the branch adder from the ***EX*** stage to the ***ID*** stage - To be performed for all instructions but only used when needed - Branch decision evaluation - To perform equality comparison in ***ID*** stage using XOR an OR gates - Additional forwarding and hazard detection hardware - To forward results to the equality test logic that operates during ***ID*** - A data hazard can occur and a stall will be needed - ALU instruction immediately preceding a branch ![image](https://hackmd.io/_uploads/ryvc-7Lra.png) - To reduce the penalty of a branch to one instruction if the branch is taken - To flush instructions in the ***IF*** stage - ***IF.Flush*** is added to zero the instruction field of the ***IF/ID*** pipeline register - The fetched instruction is transformed into a ***nop*** - Example: Pipelined branch - Assumption - The pipeline is optimized for branches that are untaken - Branch execution is moved to the ***ID*** stage - ``` 36 sub $10, $4, $8 40 beq $1, $3, 7 # PC-relative branch to 40 + 4 + 7 * 4 = 72 44 and $12, $2, $5 48 or $13, $2, $6 52 and $14, $4, $2 56 slt $15, $6, $7 ... 72 lw $4, 50($7) ``` - Only one pipeline bubble on a taken branch ![image](https://hackmd.io/_uploads/HysxuMwH6.png) ![image](https://hackmd.io/_uploads/BkhEufvSp.png) ### Dynamic branch prediction - In an aggressive pipeline, a simple **static prediction** scheme may waste too much performance - Prediction of branches at runtime using information - Prediction is a hint that we hope is correct - Fetching begins in the predicted direction (taken or untaken) - **Branch prediction buffer (branch history table)** - A small memory indexed by the low-order address bits of branch - To contain one or more bits indicating whether the branch was recently taken or not - Several branched with the same low-order address bits will share the same entry in the branch prediction buffer ![image](https://hackmd.io/_uploads/HJVzqzDST.png) - Dynamic branch prediction schemes - **1-bit prediction** scheme - Using 1 bit to record and predict - Example: loops and prediction - Loop branch, branch taken 9 times and then untaken once - Mispredict on the first and last loop iteration - ***Prediction accuracy***: 80% - **2-bit prediction** scheme - Using 2 bits to record and predict - A prediction must be wrong twice before it is changed ![image](https://hackmd.io/_uploads/BkAS3MwHp.png) - **Correlating predictor** - Using information about both a local branch and the global behavior of recently executed branches - Greater prediction accuracy for the same number of prediction bits - **Tournament branch predictor** - Multiple predictor for each branch - A selection mechanism that chooses which predictor to enable for a given branch - **Branch target buffer** - A structure that caches the destination PC or destination instruction for a branch - Usually organized as a cache with tags - More costly than a simple prediction buffer - A branch prediction buffer can be accessed during the ***IF*** stage - Predict taken: the target address can be known as early as the ***ID*** stage - Predict untaken: sequential fetching - If the prediction turns out to be wrong, the prediction bits are changed based on the prediction scheme it used - **Branch delay slot** - The slot after a delayed branch instruction - Compilers and assemblers try to place an instruction that always executes after the branch - To make the successor instructions valid and useful - Three ways to schedule the branch delay slot ![image](https://hackmd.io/_uploads/ryFhCGPBp.png) ### Example: Comparing performance of several control schemes - Problem - To compare performance for single-cycle, multicycle, and pipeline - Instruction mix - 25% loads, 10% stores, 11% branches, 2% jumps, 52% ALU instuctions - Function unit times - Memory access: 200ps - ALU operation: 100ps - Register file R/W: 50ps - Pipelined - Half of the loads are immediately followed by an instruction that uses the results - Branch delay on misprediction is 1 clock cycle - 1/4 branches are mispredited - Jumps always pay 1 full clock cycle (average time is 2 clock cycle) - Solution - Single-cycle - Clock cycle: 200 + 50 + 100 + 200 + 50 = 600 ps - CPI: 1 - Multicycle - Clock cycle: 200 px - CPI: 0.25 * 5 + 0.1 * 4 + 0.11 * 3 + 0.02 * 3 + 0.52 * 4 = 4.12 - Pipelined - Clock cycle: 200 ps - CPI: - Loads: 0.5 * 1 + 0.5 * 2 = 1.5 - Stores & ALU instruction: 1 - Branches: 0.75 * 1 + 0.25 * 2 = 1.25 - Jumps: 2 - $\Rightarrow$ 0.25 * 1.5 + 0.1 * 1 + 0.11 * 1.25 + 0.02 * 2 + 0.52 * 1 = 1.17 - Average instruction time - Single-cycle: 600 * 1 = 600 ps - Multicycle: 200 * 4.12 = 824 ps - Pipelined: 200 * 1.17 = 234 ps ![image](https://hackmd.io/_uploads/rJRMRIKHp.png) ## Exceptions ### Preliminary - Control is the most challenging aspect of processor design - The hardest part to get right and make fast - **Exception (interrupt)** - Events other than branches or jumps that change the normal flow of instruction execution - Exception - Any unexpected change in control flow without distinguishing whether the cause is **internal or external** - Interrupt - An exception that comes from **outside of the processor** - To detect exceptional conditions and taking the appropriate action is often on the critical timing path of a processor ![image](https://hackmd.io/_uploads/SJUNSPFHT.png) ### How exceptions are handled in the MIPS architechture - Execution of an **undefined instruction**, an **arithmetic overflow** - When an exception occurs - To save the address of the offending instruction in the **exception program counter (EPC)** - To determine where to restart the execution of the program - To transfer control to the operating system at some **specified address** - To provide some service to the user program - To take some **predefined action** in response to an overflow - To stop the execution and report an error - To **terminate or continue** the program - Two ways to communicate the reason for an exception - Using a ***Cause*** register to indicate the reason for the exception - **Vectored interrupts** - The address to which control is transferred is determined by the cause of the exception - The addresses are separated by **32 bytes or eight instructions** - When the exception is not vectored - A single entry point for all exceptions - The OS decodes the status register to find the cause ![image](https://hackmd.io/_uploads/H1RmhDtSa.png) - To implement the exception system used in the MIPS architecture - The single entry point being the address 8000 0180~hex~ - **EPC** register - A 32-bit register used to hold **the address of the affected instruction** - **Cause** register - A 32-bit register used to record the cause of the exception - Ex: - 10: undefined instruction - 12: arithmetic overflow ### Exceptions in a pipelined implementation - To treat exceptions as another form of control hazard - Steps to deal with exception (using an arithmetic overflow as an example) - To transfer control to the exception routine immediately - To flush instructions that follow the instrction caused exception - ***IF*** stage - To zero the instruction field of the ***IF/ID*** pipeline register (turing into a **nop** instruction) - Similar as steps for branch misprediction - ***ID*** stage - To zero control signals using the multiplexor already in the ***ID*** stage - Similar as steps for stall - ***ID.Flush*** is ORed with the stall signal from the hazard detection unit - ***EX*** stage - ***EX.Flush*** is added to cause new multiplexors to zero the control lines - To prevent the instruction in the ***EX*** stage from writing its result in the ***WB*** stage - To start fetching instructions from location 8000 0180~hex~ - To add an additional input to the ***PC*** multiplexor that sends 8000 0180~hex~ to the ***PC*** - To stop or complete the instruction that caused the exception - To stop exection in the middle of the instruction if necessary - If the instrction requires to be eventually completed, the instruction will be restarted from the beginning after the exception is handled - To save the address of the offending instruction in the ***EPC*** register - In reality it saves the address + 4 - The exception handling routine must first subtract 4 from the saved value ![image](https://hackmd.io/_uploads/H1LI5uFHp.png) - Example: Exception in a pipelined computer - Instruction sequence - ``` hex 40 sub $11, $2, $4 44 and $12, $2, $5 48 or $13, $2, $6 4C add $1, $2, $1 50 slt $15, $6, $7 54 lw $16, 50($7) ... ``` - Instruction to be invoked on an exception - ``` 80000180 sw $25, 1000($0) 80000184 sw $26, 1004($0) ``` - Assume an overflow exception occurs in the `add` instruction ![image](https://hackmd.io/_uploads/HyoTjdKSa.png) ![image](https://hackmd.io/_uploads/Hyn-hOYSp.png) - Multiple exceptions can occur simultaneously - To prioritize the exceptions (by the hardware) - I/O device requests and hardware malfunctions are not associated with a specific instruction - The implementation has some flexibility as to when to interrupt the pipeline - To match the exception to the instruction - The ***EPC*** register captures the address of the interrupted instructions - The ***Cause*** register records all possible exceptions in a clock cycle - Knowing in which pipeline stage a type of exception can occur - An undefined instruction is discovered in the ID stage - Invoking the operating system occurs in the EX stage - The difficulty of always associating the correct exception with the correct instruction in pipelined computers - **Imprecise interrupt (imprecise exception)** - An interrupt or execution that is not associated with the exact instruction that was the cause the interrupt or exception - The operating system will determine which instruction caused the problem - **Precise interrupt (precise exception)** - An interrupt or execution that is always associated with the correct instruction