> [!note] AI Tools Usage
> I used Gemini for researching, brainstorming and proofreading.
# Why must branches be resolved early in the pipeline?
> Reference: [Lectures 21-23: RISC-V 5-Stage Pipeline](https://docs.google.com/presentation/d/1v-Squx8lK-oOrflFOwBZh-ue94seVHZudqDOgJmFf5Q/edit?slide=id.g30cd39f6c3e_86_146#slide=id.g30cd39f6c3e_86_146)

In the lecture, the 5-stage pipeline CPU resolves branches in the **EX stage**. Branch resolution consists of two tasks: determining the branch outcome (taken or not taken) and calculating the branch target address.
**Branch Comp**: A branch comparator in the EX stage determines whether the branch condition is met (e.g., `beq`, `bge`, `blt`).
**Branch Target Address**: This can be calculated by the ALU (or a separate adder).
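Note that the target is PC-relative, so computing it needs only the branch's own PC and the sign-extended B-type immediate (no register values), which is why a small dedicated adder is enough:
$$
\text{Target} = \text{PC}_{\text{branch}} + \text{SignExt}(\text{imm}_{B})
$$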
However, branch handling can be optimized by resolving the branch earlier, in the **ID stage**. In the following section, I'll introduce the concept of ID-stage resolution.
```text
KEY Concept: Why resolve branch early? -> To reduce branch penalties.
Trade-off: Early resolution introduces more hazards. -> The earlier the branch is resolved, the more hazards arise.
```
## Trade-off: Hazards and Penalties
### Control Hazard
In a 5-stage pipeline CPU, branch resolution typically occurs in one of two stages: ID or EX.
* Resolution in the **EX stage**: the branch decision is known at the end of the EX stage.
    - If the branch is mispredicted, the instructions already in `IF` and `ID` must be flushed.
    - Penalty: **2 cycles**.
For example,
```asm=
0x10: beq t0, t1, LABEL_if # Assume Branch Taken
0x14: add t2, t3, t4 # Fetched (in ID) -> FLUSHED!
0x18: sub t5, t6, t7 # Fetched (in IF) -> FLUSHED!
```
* Resolution in the **ID stage**: the branch decision is known at the end of the ID stage.
    - If the branch is mispredicted, the instruction in `IF` must be flushed.
    - Penalty: **1 cycle**.
For example,
```asm=
0x10: beq t0, t1, LABEL_if # Assume Branch Taken
0x14: add t2, t3, t4 # Fetched (in IF) -> FLUSHED!
0x18: sub t5, t6, t7 # Not Fetched
```
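The penalty difference can be visualized cycle by cycle (a sketch, assuming the correct target is fetched in the cycle right after the branch is resolved):
```text
Resolution in EX (branch taken):
            C1   C2   C3   C4   C5
0x10 beq    IF   ID   EX   MEM  WB
0x14 add         IF   ID   flush
0x18 sub              IF   flush
target                     IF         -> 2 wasted cycles

Resolution in ID (branch taken):
            C1   C2   C3   C4
0x10 beq    IF   ID   EX   MEM
0x14 add         IF   flush
target                IF              -> 1 wasted cycle
```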
### Data Hazard
As mentioned, earlier branch resolution introduces more hazards. Here, we analyze the data hazards that arise when branch resolution is moved to the **ID** stage.
**Load-Use Data Hazard**
* Resolution in EX stage:
```asm=
0x10: lw t0, 0(sp) # Producer (In MEM stage)
0x14: beq t0, t1, LABEL_if # Consumer (In EX stage)
```
**Analysis:** There is a **LOAD-USE data hazard** on register `t0`. Since the `lw` instruction retrieves its data at the end of the **MEM** stage, even with full forwarding the `beq` instruction cannot obtain the required value at the beginning of the **EX** stage for the comparison. Therefore, the pipeline must **stall for 1 cycle** (or insert a NOP) and wait for the data to become available.
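With EX-stage resolution, the single bubble looks like this (a sketch in the same style as the schedule shown below for the ID-stage case; after the stall, `lw` is in WB and its data can be forwarded to `beq` in EX):
```asm=
lw t0, 0(sp)
nop # Stall (Wait for MEM -> WB, then forward to EX)
beq t0, t1, LABEL_if
```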
* Resolution in ID stage: using the same example,
```asm=
0x10: lw t0, 0(sp) # Producer (In EX stage)
0x14: beq t0, t1, LABEL_if # Consumer (In ID stage)
```
**Analysis:** There is a **LOAD-USE data hazard** on register `t0`. The `beq` instruction requires `t0` in the **ID** stage to determine whether the branch is taken. However, `lw` is still in the **EX** stage and has not yet accessed memory. Therefore, **2 stall cycles** (or NOPs) are needed.
    * Stall 1: `lw` is in the EX stage, so a bubble is inserted while `lw` advances to the **MEM** stage.
    * Stall 2: even when `lw` is in the MEM stage, it retrieves the data at the **end** of that cycle, while the `beq` instruction (still stalled in ID) needs the value at the **beginning** of the cycle. A second bubble is required while `lw` advances to the **WB** stage.
**Conclusion:** Only when `lw` reaches the **WB** stage can its data be forwarded to `beq` to resolve the branch.
```asm=
lw t0, 0(sp)
nop # Stall (Wait for EX -> MEM)
nop # Stall (Wait for MEM -> WB)
beq t0, t1, LABEL_if
```
However, what happens if the producer is an ordinary ALU instruction, i.e., a plain **RAW data hazard** rather than a load-use hazard?
**RAW Data Hazard**
Consider the below code snippet, where a **RAW data hazard** occurs:
```asm=
0x10: addi t0, t1, 0x7F # Producer
0x14: beq t0, t2, LABEL_else # Consumer
```
A similar situation occurs. Even though `addi` is not a memory instruction, it produces its result in the **EX** stage. Since the `beq` instruction needs the value in the **ID** stage, the result is not ready in time, so `beq` must **stall for 1 cycle**; after the stall, the result can be forwarded from the **MEM** stage to the ID-stage comparator.
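The corresponding single-bubble schedule looks like this (a sketch in the same style as the load-use case above):
```asm=
addi t0, t1, 0x7F
nop # Stall (Wait for EX -> MEM, then forward to ID)
beq t0, t2, LABEL_else
```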
## Verification
### Branch Resolution in ID stage
* Control Hazard:
```asm=
.text
.globl main
main:
li t0, 1
li t1, 1
beq t0, t1, Label_Taken
# To be flushed
addi t2, x0, 0x11
addi t3, x0, 0x22
j Test_Fail
Label_Taken:
addi t4, x0, 0x33
li t5, 0xCAFE
j Exit
Test_Fail:
li t5, 0xDEAD
Exit:
li a0, 0
```

* `beq t0, t1, Label_Taken` = `0x00628863`
* `addi t2, x0, 0x11` = `0x01100393`
* `addi t4, x0, 0x33` = `0x03300E93`
`0x00628863` corresponds to `beq t0, t1, Label_Taken`. Since the branch is taken and **resolved in the ID stage**, the instruction fetched in the IF stage (`0x01100393`, namely `addi t2, x0, 0x11`) is flushed. Consequently, a NOP (`0x00000013`) appears in the ID stage in the following cycle; therefore, the penalty is 1 cycle.
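In the waveform, the expected IF/ID contents around the branch are (a sketch of what should be observed):
```text
cycle n  : IF = 0x01100393 (addi t2, x0, 0x11)   ID = 0x00628863 (beq, resolved: taken)
cycle n+1: IF = 0x03300E93 (addi t4, x0, 0x33)   ID = 0x00000013 (NOP, flushed slot)
```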
:::success
Control Hazard Verified
:::
* LOAD-USE Data Hazard
```asm=
.text
.globl main
main:
addi t1, x0, 5
addi t2, x0, 5
sw t2, 0(sp)
lw t0, 0(sp) # LOAD-USE Data Hazard
beq t0, t1, Label_Taken
# To be flushed
addi t2, x0, 0x11
addi t3, x0, 0x22
j Test_Fail
Label_Taken:
addi t4, x0, 0x33
li t5, 0xCAFE
j Exit
Test_Fail:
li t5, 0xDEAD
Exit:
li a0, 0
```

* `lw t0, 0(sp)` = `0x00012283`
* `beq t0, t1, Label_Taken` = `0x00628863`
* `addi t2, x0, 0x11` = `0x01100393`
When the `beq` instruction is in the ID stage, it depends on the `lw` instruction (currently in the EX stage). Since `lw` retrieves its data at the end of the **MEM** stage, but `beq` needs to resolve the branch condition in the ID stage, the data is not yet available. Consequently, `beq` must stall for **two cycles** in the ID stage to wait for `lw` to complete.
:::success
LOAD-USE Hazard Verified
:::
* RAW Data Hazard
```asm=
.text
.globl main
main:
addi t1, x0, 5
addi t2, x0, 6
addi t0, t1, 1 # RAW Data Hazard
beq t0, t2, Label_Taken
# To be flushed
addi t2, x0, 0x11
addi t3, x0, 0x22
j Test_Fail
Label_Taken:
addi t4, x0, 0x33
li t5, 0xCAFE
j Exit
Test_Fail:
li t5, 0xDEAD
Exit:
li a0, 0
```

* `addi t0, t1, 1` = `0x00130293`
* `beq t0, t2, Label_Taken` = `0x00728863`
* `addi t2, x0, 0x11` = `0x01100393`
* `addi t4, x0, 0x33` = `0x03300E93`
When the `beq` instruction is in the ID stage, a RAW data hazard occurs on register `t0`. Consequently, `beq` must stall for **one** cycle in the ID stage, which allows the required data to be forwarded from the **MEM** stage in the next cycle.
Since the branch is taken, `addi t2, x0, 0x11` is flushed.
:::success
RAW Data Hazard Verified
:::
# How can compressed (C) instructions affect pipeline fetch?
## RVC (RISC-V Compressed Instruction)
[Compressed Instruction Set](https://en.wikipedia.org/wiki/Compressed_instruction_set)
> Quote from [Chapter 26. "C" Extension for Compressed Instructions, Version 2.0](https://docs.riscv.org/reference/isa/v20240411/_attachments/riscv-unprivileged.pdf)
> This chapter describes the RISC-V standard compressed instruction-set extension, named "C", which reduces static and dynamic code size by adding short 16-bit instruction encodings for common operations. The C extension can be added to any of the base ISAs (RV32I, RV32E, RV64I, RV64E), and we use the generic term "RVC" to cover any of these. Typically, 50%-60% of the RISC-V instructions in a program can be replaced with RVC instructions, resulting in a 25%-30% code-size reduction.
The RISC-V compressed extension introduces **short 16-bit** instructions: RVC allows **common** (i.e., frequently used) 32-bit instructions to be encoded in 16 bits.
> Note: RVC does not add any new operations; each 16-bit instruction maps one-to-one to an existing 32-bit instruction.
Furthermore, 32-bit and 16-bit instructions can be freely intermixed. The benefit of RVC is that the same I-cache can hold more instructions, which reduces instruction cache misses (better **spatial locality**).
> Quote from [Chapter 26. "C" Extension for Compressed Instructions, Version 2.0](https://docs.riscv.org/reference/isa/v20240411/_attachments/riscv-unprivileged.pdf)
> The C extension is compatible with all other standard instruction extensions. The C extension allows 16-bit instructions to be freely intermixed with 32-bit instructions, with the latter now able to start on any 16-bit boundary, i.e., IALIGN=16. With the addition of the C extension, no instructions can raise instruction-address-misaligned exceptions.
## How Does RVC Compress?
RVC does not employ a complex compression algorithm. Instead, it exploits the **regularity** of typical code, reaching the 16-bit target by mixing and matching specific constraints:
1. Small Immediates: Standard instruction formats allocate 12 bits for immediates; RVC shrinks this to 5-6 bits.
2. Implicit Registers: The registers `x0` (zero), `x1` (`ra`) and `x2` (`sp`) are used most frequently. Since these three registers are essential for control flow and the stack, RVC hard-codes them into the opcode, saving the 5 bits usually required for register addressing.
3. Shared Destination and Source: In many instructions the destination register and the first source register are identical, for example `addi s0, s0, 1`. In this case, RVC encodes the register index only once.
4. Popular Registers Only: Many RVC instructions can only access the 8 most frequently used registers (`x8`-`x15`, namely `s0`-`s1` and `a0`-`a5`). This allows a register to be specified using only 3 bits instead of 5.
> Note: These conditions are **strategies**, not simultaneous requirements. An instruction usually needs to use 1 or 2 of these strategies to fit into 16 bits.
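A few concrete mappings illustrate these strategies (a sketch; the exact compressed forms depend on what the assembler selects when `-march` includes `c`):
```asm=
addi sp, sp, -16   # -> c.addi16sp sp, -16  (sp implied by the opcode, small immediate)
addi s0, s0, 1     # -> c.addi s0, 1        (rd == rs1, small immediate)
add  a0, a0, a5    # -> c.add a0, a5        (rd == rs1)
lw   a5, 8(a0)     # -> c.lw a5, 8(a0)      (3-bit register fields x8-x15, small immediate)
```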
```text
// Compressed 16-bit RVC instruction formats
_________________________________________________________________________________
|Format| Meaning |15 14 13 12 |11 10 9 8 7 | 6 5 4 3 2 | 1 0 |
---------------------------------------------------------------------------------
| CR | Register | funct4 | rd/rs1 | rs2 | op |
| CI | Immediate |funct3| imm | rd/rs1 | imm | op |
| CSS | Stack-relative Store |funct3| imm | rs2 | op |
| CIW | Wide Immediate |funct3| imm | rd' | op |
| CL | Load |funct3| imm | rs1' | imm | rd' | op |
| CS | Store |funct3| imm | rs1' | imm | rs2' | op |
| CA | Arithmetic | funct6 |rd'/rs1'|funct2| rs2' | op |
| CB | Branch/Arithmetic |funct3| offset |rd'/rs1'|offset | op |
| CJ | Jump |funct3| jump target | op |
---------------------------------------------------------------------------------
-> rd/rs1: Standard 5-bit register field (Can access x0-x31)
-> rd'/rs1': Compressed 3-bit register field (Can only access x8-x15)
-> imm: Immediate values that are less than 12 bits
```
## My Thoughts
### Effects on Instruction Fetch
Since instructions now come in two different sizes (32-bit and 16-bit), there are two impacts on the instruction fetch stage:
1. The PC update logic requires an additional **MUX** to select between `PC + 4` and `PC + 2`. Originally the PC always incremented by 4, but now the increment depends on the current instruction's length.
2. We must determine the instruction length in the **IF** stage. The hardware does not need to fully decode the instruction here; it only has to check the **lower 2 bits** to decide the length, as shown in the rule below.
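Concretely, the length check needs only the two least-significant bits of the fetched instruction:
```text
inst[1:0] == 11          -> 32-bit instruction      -> next PC = PC + 4
inst[1:0] == 00, 01, 10  -> 16-bit RVC instruction  -> next PC = PC + 2
```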
In conclusion, RVC increases the hardware complexity of the **IF** stage.
## Analysis
### Spatial Locality
Assume a direct-mapped L1 instruction cache of **32 KB** with a cache block size of **64 Bytes**.
* In standard RISC-V (Worst Case), a 64-byte cache block can store:
$$
64 \ \text{Bytes} / 4 \ \text{Bytes} = 16 \ \text{instructions}
$$
* In RVC (Best Case), if all instructions were compressed to 16-bit, a 64-byte cache block can store:
$$
64 \ \text{Bytes} / 2 \ \text{Bytes} = 32 \ \text{instructions}
$$
* In reality, programs are a mix. Assume that one 32-bit instruction is mixed with two 16-bit instructions (1:2 ratio).
- Size of these 3 instructions: $4 + 2 + 2 = 8 \ \text{Bytes}$
- A 64-byte cache block can store:
$$(64 \ \text{Bytes} / 8 \ \text{Bytes}) \times 3 \ \text{instructions} = 24 \ \text{instructions}$$
### Miss Rate
Based on the calculation above, we can analyze the cache miss rate (assuming instructions are executed sequentially):
* **Standard RISC-V:** The CPU has to fetch a new cache block from main memory once every 16 instructions.
$$
\text{Miss Rate} = \frac{1 \ \text{Miss}}{16 \ \text{Instructions}} = 6.25\%
$$
* **RVC (Mixed):** The CPU has to fetch a new cache block from main memory once every 24 instructions.
$$
\text{Miss Rate} = \frac{1 \ \text{Miss}}{24 \ \text{Instructions}} \approx 4.17\%
$$
### How to validate?
To validate my analysis, we can implement RVC support on the 4-SoC MyCPU:
1. Modify the **IF** stage: add a **MUX** to select between `PC+4` and `PC+2`, controlled by checking the **lower 2 bits** (the `op` field of RVC instructions) of the current instruction.
2. Compilation: use the `riscv-none-elf-gcc` toolchain with the `-march=rv32ic` flag to generate the assembly binaries (`.asmbin`).
3. Simulation: run the simulation and use `gtkwave` to analyze the waveforms, then measure the total duration of IF-stage stalls.
# Reference
[Debunking CISC vs RISC code density](https://www.bitsnbites.eu/cisc-vs-risc-code-density/)
["C" Extension for Compressed Instructions, Version 2.0](https://github.com/riscv/riscv-isa-manual/blob/main/src/c-st-ext.adoc)
# AI Usage
[Gemini - RISC-V Compressed Instructions Explained](https://gemini.google.com/share/c2a851e7e5a2)