> [!note] AI Tools Usage
> I used Gemini for researching, brainstorming, and proofreading.

# Why must branches be resolved early in the pipeline?

> Reference: [Lectures 21-23: RISC-V 5-Stage Pipeline](https://docs.google.com/presentation/d/1v-Squx8lK-oOrflFOwBZh-ue94seVHZudqDOgJmFf5Q/edit?slide=id.g30cd39f6c3e_86_146#slide=id.g30cd39f6c3e_86_146)

![image](https://hackmd.io/_uploads/r1rCi39Hbe.png)

In the lecture, the 5-stage pipeline CPU resolves branches in the **EX stage**. Branch resolution consists of two tasks: determining the branch outcome (taken or not taken) and calculating the branch target address.

**Branch Comp**: A branch comparator in the EX stage determines whether the branch condition is met (e.g., `beq`, `bge`, `blt`).

**Branch Target Address**: This can be calculated by the ALU (or a separate adder).

However, branches can also be resolved earlier, in the **ID stage**. In the following sections, I'll introduce the concept of ID-stage resolution.

```text
KEY Concept: Why resolve branches early?
  -> To reduce branch penalties.
Trade-off: Early resolution introduces more hazards.
  -> The earlier the branch resolution, the more hazards.
```

## Trade-off: Hazards and Penalties

### Control Hazard

In a 5-stage pipeline CPU, branch resolution typically occurs in one of two stages (ID or EX).

* Resolution in the **EX stage**: The branch decision is known at the end of the EX stage.
  - If mispredicted, the instructions in `IF` and `ID` must be flushed.
  - Penalty: **2 cycles**.

For example,

```asm=
0x10: beq t0, t1, LABEL_if   # Assume Branch Taken
0x14: add t2, t3, t4         # Fetched (in ID) -> FLUSHED!
0x18: sub t5, t6, t7         # Fetched (in IF) -> FLUSHED!
```

* Resolution in the **ID stage**: The branch decision is known at the end of the ID stage.
  - If mispredicted, only the instruction in `IF` must be flushed.
  - Penalty: **1 cycle**.

For example,

```asm=
0x10: beq t0, t1, LABEL_if   # Assume Branch Taken
0x14: add t2, t3, t4         # Fetched (in IF) -> FLUSHED!
0x18: sub t5, t6, t7         # Not Fetched
```

### Data Hazard

As mentioned, earlier branch resolution introduces more hazards. Here, we analyze the data hazards that arise when branch resolution is moved to the **ID** stage.

**Load-Use Data Hazard**

* Resolution in the EX stage:

```asm=
0x10: lw  t0, 0(sp)           # Producer (in MEM stage)
0x14: beq t0, t1, LABEL_if    # Consumer (in EX stage)
```

**Analysis:** There is a **LOAD-USE data hazard** on register `t0`. Since the `lw` instruction retrieves its data at the end of the **MEM** stage, even with full forwarding the `beq` instruction cannot get the required data at the beginning of the **EX** stage for comparison. Therefore, the pipeline must **stall for 1 cycle** (or insert a NOP) until the data becomes available.

* Resolution in the ID stage (same example):

```asm=
0x10: lw  t0, 0(sp)           # Producer (in EX stage)
0x14: beq t0, t1, LABEL_if    # Consumer (in ID stage)
```

**Analysis:** There is a **LOAD-USE data hazard** on register `t0`. The `beq` instruction needs `t0` in the **ID** stage to determine whether the branch is taken. However, `lw` is currently in the **EX** stage, and memory cannot be accessed in the EX stage. Therefore, **2 stalls** (or NOPs) are needed:

* Stall 1: Since `lw` is in the EX stage, we must insert a NOP to wait for `lw` to reach the **MEM** stage.
* Stall 2: Even when `lw` is in the MEM stage, it retrieves its data at the **end** of that cycle, while the `beq` instruction (still stalled in ID) needs the data at the **beginning** of the cycle. Thus, a second NOP is required to wait for `lw` to reach the **WB** stage.
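Putting the two stalls together, a hazard-detection unit for ID-stage resolution has to hold the branch in ID whenever one of its source registers is still being produced in EX, or is still being loaded in MEM. Below is a minimal Python sketch of that per-cycle decision; the function and field names are my own illustration, not signals from the lecture design.

```python
def must_stall_branch_in_id(branch_srcs, ex, mem):
    """Decide, for one cycle, whether a branch sitting in ID must stall.

    branch_srcs: set of register numbers the branch compares (rs1, rs2).
    ex, mem:     the instructions currently in EX and MEM, described as
                 {'rd': int, 'writes_reg': bool, 'is_load': bool}, or None.
    All names here are illustrative only.
    """
    def produces_a_source(stage):
        return (stage is not None and stage['writes_reg']
                and stage['rd'] != 0 and stage['rd'] in branch_srcs)

    # Producer still in EX: an ALU result appears only at the end of EX and
    # load data only at the end of MEM, so nothing can reach the ID
    # comparator this cycle.
    if produces_a_source(ex):
        return True

    # Load one stage ahead in MEM: its data shows up at the end of MEM,
    # too late for the comparator in ID, so wait one more cycle and
    # forward from MEM/WB.
    if produces_a_source(mem) and mem['is_load']:
        return True

    return False

# lw t0 (x5) in EX, beq comparing t0/t1 (x5, x6) in ID -> stall now
print(must_stall_branch_in_id({5, 6},
                              {'rd': 5, 'writes_reg': True, 'is_load': True},
                              None))   # True
```

Evaluated every cycle, this check reproduces the counts above: a load in EX stalls the branch twice (once while the load is in EX, once while it is in MEM), while an ALU producer stalls it only once.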
**Conclusion:** Only when `lw` reaches the **WB** stage can we forward its data to `beq` and resolve the branch.

```asm=
lw  t0, 0(sp)
nop                      # Stall (wait for EX -> MEM)
nop                      # Stall (wait for MEM -> WB)
beq t0, t1, LABEL_if
```

However, what happens if it is merely a **RAW data hazard**, not a LOAD-USE hazard?

**RAW Data Hazard**

Consider the code snippet below, where a **RAW data hazard** occurs:

```asm=
0x10: addi t0, t1, 0x7F        # Producer
0x14: beq  t0, t2, LABEL_else  # Consumer
```

A similar situation arises. Even though `addi` is not a memory instruction, it computes its result in the **EX** stage. Since the `beq` instruction needs the data in the **ID** stage, the result is not yet ready. Therefore, the `beq` instruction must **stall for 1 cycle**.

## Verification

### Branch Resolution in ID stage

* Control Hazard:

```asm=
.text
.globl main
main:
    li   t0, 1
    li   t1, 1
    beq  t0, t1, Label_Taken    # Branch taken (resolved in ID)
    addi t2, x0, 0x11           # To be flushed
    addi t3, x0, 0x22
    j    Test_Fail
Label_Taken:
    addi t4, x0, 0x33
    li   t5, 0xCAFE
    j    Exit
Test_Fail:
    li   t5, 0xDEAD
Exit:
    li   a0, 0
```

![image](https://hackmd.io/_uploads/rkzd_o2HWl.png)

* `beq t0, t1, Label_Taken` = `0x00628863`
* `addi t2, x0, 0x11` = `0x01100393`
* `addi t4, x0, 0x33` = `0x03300E93`

`0x00628863` corresponds to `beq t0, t1, Label_Taken`. Since the branch is taken and **resolved in the ID stage**, the instruction fetched in the IF stage (`0x01100393`, namely `addi t2, x0, 0x11`) is flushed. Consequently, a NOP (`0x00000013`) appears in the ID stage in the following cycle, so the penalty is 1 cycle.

:::success
Control Hazard Verified
:::

* LOAD-USE Data Hazard

```asm=
.text
.globl main
main:
    addi t1, x0, 5
    addi t2, x0, 5
    sw   t2, 0(sp)
    lw   t0, 0(sp)              # Producer
    beq  t0, t1, Label_Taken    # LOAD-USE data hazard on t0
    addi t2, x0, 0x11           # To be flushed
    addi t3, x0, 0x22
    j    Test_Fail
Label_Taken:
    addi t4, x0, 0x33
    li   t5, 0xCAFE
    j    Exit
Test_Fail:
    li   t5, 0xDEAD
Exit:
    li   a0, 0
```

![image](https://hackmd.io/_uploads/H19I3CnBZg.png)

* `lw t0, 0(sp)` = `0x00012283`
* `beq t0, t1, Label_Taken` = `0x00628863`
* `addi t2, x0, 0x11` = `0x01100393`

When the `beq` instruction is in the ID stage, it depends on the `lw` instruction (currently in the EX stage). Since `lw` retrieves its data at the end of the **MEM** stage, but `beq` needs to resolve the branch condition in the ID stage, the data is not yet available. Consequently, `beq` must stall for **two cycles** in the ID stage to wait for `lw` to complete.

:::success
LOAD-USE Hazard Verified
:::

* RAW Data Hazard

```asm=
.text
.globl main
main:
    addi t1, x0, 5
    addi t2, x0, 6
    addi t0, t1, 1              # Producer
    beq  t0, t2, Label_Taken    # RAW data hazard on t0
    addi t2, x0, 0x11           # To be flushed
    addi t3, x0, 0x22
    j    Test_Fail
Label_Taken:
    addi t4, x0, 0x33
    li   t5, 0xCAFE
    j    Exit
Test_Fail:
    li   t5, 0xDEAD
Exit:
    li   a0, 0
```

![image](https://hackmd.io/_uploads/BJrsCT2rZe.png)

* `addi t0, t1, 1` = `0x00130293`
* `beq t0, t2, Label_Taken` = `0x00728863`
* `addi t2, x0, 0x11` = `0x01100393`
* `addi t4, x0, 0x33` = `0x03300E93`

When the `beq` instruction is in the ID stage, a RAW data hazard occurs on register `t0`. Consequently, `beq` must stall for **one** cycle in the ID stage, which allows the required data to be forwarded from the **MEM** stage in the next cycle. Since the branch is taken, `addi t2, x0, 0x11` is flushed.

:::success
RAW Data Hazard Verified
:::
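Having verified the penalties and stalls, it is worth putting rough numbers on the trade-off stated at the beginning. The sketch below compares the average cycles lost per branch under EX-stage versus ID-stage resolution; every rate in it is an assumption made purely for illustration, not a measurement.

```python
# Assumed, illustrative workload characteristics (not measurements):
taken_rate = 0.6  # fraction of branches that are taken (and therefore flush)
alu_dep    = 0.2  # fraction of branches fed by the ALU instruction right before them
load_dep   = 0.2  # fraction of branches fed by the load right before them

# EX-stage resolution: a taken branch flushes IF and ID (2 cycles); with
# full forwarding, only the load-use case still costs 1 stall.
penalty_ex = taken_rate * 2 + load_dep * 1

# ID-stage resolution: a taken branch flushes only IF (1 cycle), but an ALU
# producer right before the branch costs 1 stall and a load costs 2.
penalty_id = taken_rate * 1 + alu_dep * 1 + load_dep * 2

print(f"avg cycles lost per branch -- EX: {penalty_ex:.2f}, ID: {penalty_id:.2f}")
```

With these made-up rates, ID-stage resolution still comes out slightly ahead; whether it wins on real code depends on how often a branch immediately consumes a value produced by the preceding instruction.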
"C" Extension for Compressed Instructions, Version 2.0 ](https://docs.riscv.org/reference/isa/v20240411/_attachments/riscv-unprivileged.pdf) > This chapter describes the RISC-V standard compressed instruction-set extension, named "C", which reduces static and dynamic code size by adding short 16-bit instruction encodings for common operations. The C extension can be added to any of the base ISAs (RV32I, RV32E, RV64I, RV64E), and we use the generic term "RVC" to cover any of these. Typically, 50%-60% of the RISC-V instructions in a program can be replaced with RVC instructions, resulting in a 25%-30% code-size reduction. RISC-V Compressed Instruction introduces **short 16-bit** instructions, RVC allows **common** (namely, frequently used) 32-bit instructions to encode into 16-bit. > Note: RVC does not add any new instructions, each 16-bit instruction maps one-to-one to an existing 32-bit instruction. Furthermore, 32-bit and 16-bit instructions can be intermixed. The benefits of RVC is that the same I-cache can store more instructions, which reduces instruction cache misses (**spatial locality**). > Quote from [Chapter 26. "C" Extension for Compressed Instructions, Version 2.0 ](https://docs.riscv.org/reference/isa/v20240411/_attachments/riscv-unprivileged.pdf) > The C extension is compatible with all other standard instruction extensions. The C extension allows 16-bit instructions to be freely intermixed with 32-bit instructions, with the latter now able to start on any 16-bit boundary, i.e., IALIGN=16. With the addition of the C extension, no instructions can raise instruction-address-misaligned exceptions. ## How RVC Compress? RVC does not employ complex compression algorithm. Instead, it utilizes the **regularity** of code. It achieves the 16-bit target by mixing and matching specific constraints: 1. Small Immediates: Standard instructions allocate 12 bits for immediates. RVC shrinks this to 5~6 bits. 2. The registers `x0` (Zero), `x1` (ra) and `x2` (sp) are used most frequently. Since these three registers are essential for control flow and stack, RVC hard-codes them into the OPCode. The Opcode implies these registers, saving the 5 bits usually required for register addressing. 3. The destination register and the first source register are identical. Many instructions are of this type, for example, `addi s0, s0, 1`. In this case, RVC encodes the register index only once. 4. Popular Register Only: Many RVC instructions can only access the 8 most frequently used registers (`x8`-`x15`, namely `s0-s1` and `a0`-`a5`). This allows the register to be specified using only 3 bits instead of 5. > Note: These conditions are **strategies**, not simultaneous requirements. An instruction usually needs to use 1 or 2 of these strategies to fit into 16 bits. 
```text
// Compressed 16-bit RVC instruction formats
_____________________________________________________________________________________
| Format | Meaning              | 15 14 13 12 | 11 10 9 8 7 | 6 5 4 3 2      | 1 0 |
-------------------------------------------------------------------------------------
| CR     | Register             |   funct4    |   rd/rs1    |      rs2       | op  |
| CI     | Immediate            | funct3| imm |   rd/rs1    |      imm       | op  |
| CSS    | Stack-relative Store | funct3|         imm       |      rs2       | op  |
| CIW    | Wide Immediate       | funct3|            imm             |  rd'  | op  |
| CL     | Load                 | funct3| imm |    rs1'     |  imm   |  rd'  | op  |
| CS     | Store                | funct3| imm |    rs1'     |  imm   | rs2'  | op  |
| CA     | Arithmetic           |      funct6       |rd'/rs1'|funct2 | rs2'  | op  |
| CB     | Branch/Arithmetic    | funct3| offset     |rd'/rs1'|    offset    | op  |
| CJ     | Jump                 | funct3|           jump target             | op  |
-------------------------------------------------------------------------------------
-> rd/rs1:   Standard 5-bit register field (can access x0-x31)
-> rd'/rs1': Compressed 3-bit register field (can only access x8-x15)
-> imm:      Immediate values narrower than 12 bits
```

## My Thoughts

### Effects on Instruction Fetch

Since instructions now come in two sizes (32-bit and 16-bit), there are two impacts on the instruction fetch stage:

1. The PC update logic requires an additional **MUX** to select between `PC + 4` and `PC + 2`. Originally, the PC always incremented by 4, but now the increment depends on the current instruction's length.
2. We must determine the instruction length in the **IF** stage. The hardware does not need to fully decode the instruction here; it only has to check the **lower 2 bits** to decide the length.

In conclusion, RVC increases the hardware complexity of the **IF** stage.

## Analysis

### Spatial Locality

Assume a direct-mapped L1 instruction cache of **32 KB** with a cache block size of **64 Bytes**.

* In standard RISC-V (worst case), a 64-byte cache block can store:
$$ 64 \ \text{Bytes} / 4 \ \text{Bytes} = 16 \ \text{instructions} $$
* In RVC (best case), if all instructions were compressed to 16 bits, a 64-byte cache block can store:
$$ 64 \ \text{Bytes} / 2 \ \text{Bytes} = 32 \ \text{instructions} $$
* In reality, programs are a mix. Assume that one 32-bit instruction is mixed with two 16-bit instructions (1:2 ratio).
  - Size of these 3 instructions: $4 + 2 + 2 = 8 \ \text{Bytes}$
  - A 64-byte cache block can store:
$$ (64 \ \text{Bytes} / 8 \ \text{Bytes}) \times 3 \ \text{instructions} = 24 \ \text{instructions} $$

### Miss Rate

Based on the calculation above, we can estimate the instruction cache miss rate (assuming instructions are executed sequentially):

* **Standard RISC-V:** The CPU has to fetch a new cache block from main memory once every 16 instructions.
$$ \text{Miss Rate} = \frac{1 \ \text{Miss}}{16 \ \text{Instructions}} = 6.25\% $$
* **RVC (Mixed):** The CPU has to fetch a new cache block from main memory once every 24 instructions.
$$ \text{Miss Rate} = \frac{1 \ \text{Miss}}{24 \ \text{Instructions}} \approx 4.17\% $$

### How to validate?

To validate my analysis, we can implement RVC on the 4-SoC MyCPU:

1. Modify the **IF** stage: add a **MUX** to select between `PC + 4` and `PC + 2`, controlled by checking the **lower 2 bits** of the current instruction (a small software sketch of this check follows this list).
2. Compilation: use the `riscv-none-elf-gcc` toolchain with the `-march=rv32ic` flag to produce the binaries (`.asmbin`).
3. Simulation: run the simulation and use `gtkwave` to analyze the waveforms, then calculate the total duration of IF-stage stalls.
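Referring to step 1 above, the length check itself is tiny and can be prototyped in software before touching the RTL. Here is a minimal Python sketch; the instruction values are only examples, while the lower-2-bit rule comes from the spec.

```python
def is_compressed(parcel: int) -> bool:
    """A 16-bit parcel starts a compressed instruction unless its lower
    two bits are 0b11, which marks a standard 32-bit encoding."""
    return (parcel & 0b11) != 0b11

def next_pc(pc: int, parcel: int) -> int:
    """The extra MUX from step 1: select PC+2 for RVC, PC+4 otherwise.
    (Sequential fetch only -- taken branches and jumps override this.)"""
    return pc + 2 if is_compressed(parcel) else pc + 4

# Illustrative stream: a 16-bit c.addi followed by a 32-bit addi.
print(hex(next_pc(0x00, 0x0405)))   # 0x2  (compressed, PC + 2)
print(hex(next_pc(0x02, 0x0413)))   # 0x6  (lower bits 0b11, PC + 4)
```

In hardware this corresponds to the extra MUX plus a 2-bit compare in IF; the full expansion of the 16-bit encoding into its 32-bit equivalent can be left to the decode stage.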
# Reference

[Debunking CISC vs RISC code density](https://www.bitsnbites.eu/cisc-vs-risc-code-density/)

["C" Extension for Compressed Instructions, Version 2.0](https://github.com/riscv/riscv-isa-manual/blob/main/src/c-st-ext.adoc)

# AI Usage

[Gemini - RISC-V Compressed Instructions Explained](https://gemini.google.com/share/c2a851e7e5a2)