# Assignment3: Your Own RISC-V CPU
> contributed by < [Shaoen-Lin](https://github.com/Shaoen-Lin) >
## Explanation of Hello World in Chisel
```scala
import chisel3._

class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  // Toggle threshold: 25,000,000 cycles (0.5 s at 50 MHz), counted from 0
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg = RegInit(0.U(32.W)) // cycle counter
  val blkReg = RegInit(0.U(1.W))  // current LED state

  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  io.led := blkReg
}
```
### Circuit Behavior
The module implements a blinking LED controller functioning as a frequency divider to generate a visually perceptible blinking effect. It is a synchronous sequential circuit where state transitions are driven by the system clock. A counter (`cntReg`) increments on every clock tick, and upon reaching a predefined threshold (`CNT_MAX`), the control logic resets the counter and inverts the LED state register (`blkReg`), effectively toggling the LED output.
A critical implementation detail lies in the calculation of `CNT_MAX`, defined as `(50000000 / 2 - 1)`. Assuming a system clock frequency of 50 MHz, a 0.5 second interval requires exactly 25,000,000 clock cycles. The subtraction of 1 is essential because the counter is **zero-indexed** (initializing at 0.U). To count $N$ cycles, the counter must range from $0$ to $N-1$.
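
The threshold arithmetic can be checked with a small C sketch. The helper names `cnt_max` and `cycles_between_toggles` are illustrative and do not appear in the Chisel design; the 50 MHz clock is the assumption stated above.

```c
#include <stdint.h>

/* Hypothetical helper: compare value at which the LED register toggles.
 * clock_hz  : system clock frequency (assumed 50 MHz in the text)
 * toggle_hz : number of LED state changes per second
 * The counter runs 0 .. N-1, so the compare value is N - 1. */
uint32_t cnt_max(uint32_t clock_hz, uint32_t toggle_hz) {
    uint32_t cycles_per_toggle = clock_hz / toggle_hz; /* N cycles */
    return cycles_per_toggle - 1u;                     /* zero-indexed */
}

/* Counts clock edges from a counter value of 0 until the compare value is
 * hit, mirroring the when(cntReg === CNT_MAX) logic in software. */
uint32_t cycles_between_toggles(uint32_t max) {
    uint32_t cnt = 0, cycles = 0;
    for (;;) {
        cycles++;                       /* one clock edge */
        if (cnt == max) return cycles;  /* this edge toggles the LED */
        cnt++;
    }
}
```

With two toggles per second, the LED completes one full on/off period every second.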
### Enhancement: External Enable Control
To augment the interactivity and controllability of the design, I introduced a control input `sw` (Switch) to the `IO Bundle`. This transforms the circuit from a free-running oscillator into a controlled sequential circuit.
```scala
import chisel3._

class HelloImproved extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
    val sw  = Input(Bool())
  })
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg = RegInit(0.U(32.W))
  val blkReg = RegInit(0.U(1.W))

  // Enhancement: the counter and LED only run while the switch is on
  when(io.sw) {
    cntReg := cntReg + 1.U
    when(cntReg === CNT_MAX) {
      cntReg := 0.U
      blkReg := ~blkReg
    }
  }.otherwise {
    // When the switch is off, reset the counter and force the LED off
    blkReg := 0.U
    cntReg := 0.U
  }
  io.led := blkReg
}
```
* **Active State (`sw` is High):** The circuit behaves normally. The counter increments on every clock cycle, and the inversion logic toggles the LED state upon reaching `CNT_MAX`.
* **Inactive State (`sw` is Low):** The `.otherwise` block takes effect. The circuit enters a reset state where `cntReg` is cleared to 0 and `blkReg` (the LED output) is forced low. This ensures the LED is extinguished on the next clock edge after the switch is turned off, rather than freezing in whatever state it happened to hold.
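
The two states can be mirrored in a small software model. This is only a behavioral sketch in C: the struct `hello_state` and the function `hello_step` are hypothetical names, and the threshold is passed as a parameter so that a small value can be used for experiments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software copy of the two registers in HelloImproved. */
typedef struct {
    uint32_t cnt; /* mirrors cntReg */
    uint8_t  blk; /* mirrors blkReg (0 or 1) */
} hello_state;

/* Advance the model by one clock edge. */
void hello_step(hello_state *s, bool sw, uint32_t cnt_max) {
    if (sw) {
        if (s->cnt == cnt_max) {      /* threshold reached: wrap and toggle */
            s->cnt = 0;
            s->blk = (uint8_t)(s->blk ^ 1u);
        } else {
            s->cnt += 1u;
        }
    } else {                          /* switch off: clear both registers */
        s->cnt = 0;
        s->blk = 0;
    }
}
```

Stepping the model with `sw` high and a threshold of 2 toggles `blk` on the third edge; a single step with `sw` low clears both fields.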
## What should we know before designing a CPU?
### 1. Chisel
Chisel is a **Domain-Specific Language (DSL)** embedded in Scala (it is implemented as a Scala library and makes use of Scala's macro features), designed specifically for hardware description. Consequently, circuit logic must be expressed with the constructs provided by the Chisel library (such as `when` and `:=`) rather than ordinary Scala control flow such as `if`/`else`, which is evaluated only while the generator program runs.
The true power of Chisel lies in its embedding within Scala. When the Scala program is executed, it undergoes an **"Elaboration"** process: the modules, IO ports, registers, and other components you write are instantiated and connected in memory, forming a complete circuit graph. During this phase, the high-level Chisel modules are translated into an intermediate representation (**FIRRTL**), from which standard Verilog code is ultimately generated.

It is important to note that Chisel is not a direct replacement for Verilog, but rather a **Hardware Generator Language (HGL)**.
### 2. Verilator
Verilator functions as a compiler that translates Verilog hardware descriptions into highly optimized C++ models. It allows you to use a C++ program on a regular computer to simulate the behavior of the CPU hardware you designed.
* **Mechanism:** It is a cycle-based simulator. This means it evaluates the circuit state only at clock boundaries, effectively ignoring intra-cycle timing delays and signal glitches.
* **Advantage:** By abstracting away these physical timing details, Verilator achieves extremely high simulation speeds.
### 3. Simulation and Verification Framework
To validate the functionality of our RISC-V CPU, we established a complete cross-compilation and simulation workflow. The verification process involves the interaction between the software toolchain and the hardware simulation environment, as illustrated below:
* **Software Toolchain Flow**
    * **Compilation:** Source code written in C (`.c`) or assembly (`.S`) is compiled with the RISC-V GNU Toolchain (`riscv64-unknown-elf-gcc`) to generate an executable ELF file.
    * **Binary Extraction:** Since the hardware simulation requires raw machine code, we use `objcopy` to extract the instruction stream from the ELF file. In our build system, this generates a file with the `.asmbin` extension.
* **Hardware Simulation Flow**
    * **RTL Generation:** As previously described, the Chisel source code is elaborated and compiled into synthesizable Verilog (RTL).
    * **Simulation Execution:** We use Verilator, a high-performance open-source simulator, to instantiate the CPU design. The testbench loads the generated binary (`.asmbin`) into the CPU's memory model and executes the program cycle by cycle.
    * **Waveform Generation:** During execution, the simulator records signal transitions into a VCD (Value Change Dump) file.
* **Verification and Debugging**
    * **Visual Analysis:** We employ Surfer (or GTKWave) to visualize the `.vcd` waveform.
    * **Cross-Referencing:** By comparing the Program Counter (PC) and register states in the waveform against the instruction sequence in the program's disassembly, we can verify whether the pipeline executes the instructions correctly, handles hazards as expected, and writes back the correct results.
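
To make the "load the binary into the memory model" step concrete, here is a minimal sketch in C of packing a raw byte image into 32-bit words. The function `load_words` is hypothetical and is not the loader used by the MyCPU testbench; it only assumes the little-endian byte order that RISC-V uses for instruction words.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies a raw little-endian byte image into an array of 32-bit words.
 * A trailing partial word is zero-padded. Returns the number of words
 * written, which never exceeds mem_words. */
size_t load_words(const uint8_t *image, size_t n_bytes,
                  uint32_t *mem, size_t mem_words) {
    size_t n_words = (n_bytes + 3) / 4;
    if (n_words > mem_words) n_words = mem_words;
    for (size_t w = 0; w < n_words; w++) {
        uint32_t word = 0;
        for (size_t b = 0; b < 4; b++) {
            size_t idx = w * 4 + b;
            if (idx < n_bytes)
                word |= (uint32_t)image[idx] << (8 * b); /* byte 0 is the LSB */
        }
        mem[w] = word;
    }
    return n_words;
}
```

The words produced this way are what the fetch stage would see at consecutive word addresses, which is also the form in which instructions appear in a disassembly listing.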
### 4. Execution Workflow
* **Step 1 - Add your test file to `csrc`**
> cd ca2025-mycpu/3-pipeline/csrc
> (Place your test file in this directory)
* **Step 2 - Generate the `.asmbin` file**
> make update
* **Step 3 - Navigate back to the project root (ca2025)**
> cd ..
* **Step 4 - Run your test program on MyCPU**
> make test
* **Step 5 - Simulate, generate the `Vtop` model, and record the `.vcd` waveform**
> make sim SIM_ARGS="-instruction src/main/resources/xxx.asmbin"
* **Step 6 - Run RISCOF compliance tests**
> make compliance
## Architecture Overview

(Note: The diagram above illustrates the Single Cycle Machine architecture schematic.)
The RISC-V CPU we are designing can execute a core subset of RISC-V instructions (RV32I):
| **Type** | **Function Description** | **Typical Instructions** |
| -------- | --------------------------------------- | ------------------------ |
| **R** | Register-to-register operations | `add`, `sub`, `slt` |
| **I** | Immediate operations / Load / JALR | `addi`, `lw`, `jalr` |
| **S** | Store to memory | `sb`, `sw` |
| **B** | Conditional branch | `beq`, `bne` |
| **U** | Upper immediate (high 20-bit immediate) | `lui`, `auipc` |
| **J** | Unconditional jump | `jal` |
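
The mapping from opcode to instruction format in the table can be sketched as a lookup in C. The enum and the function name `rv32i_format` are illustrative; only the opcodes for the instructions listed in the table are covered, and everything else is reported as unknown.

```c
#include <stdint.h>

typedef enum { FMT_R, FMT_I, FMT_S, FMT_B, FMT_U, FMT_J, FMT_UNKNOWN } rv_format;

/* Classifies an RV32I instruction word by its 7-bit opcode (bits 6:0). */
rv_format rv32i_format(uint32_t inst) {
    switch (inst & 0x7fu) {
    case 0x33: return FMT_R;        /* OP: add, sub, slt        */
    case 0x13:                      /* OP-IMM: addi             */
    case 0x03:                      /* LOAD: lw                 */
    case 0x67: return FMT_I;        /* JALR                     */
    case 0x23: return FMT_S;        /* STORE: sb, sw            */
    case 0x63: return FMT_B;        /* BRANCH: beq, bne         */
    case 0x37:                      /* LUI                      */
    case 0x17: return FMT_U;        /* AUIPC                    */
    case 0x6f: return FMT_J;        /* JAL                      */
    default:   return FMT_UNKNOWN;  /* not covered by the table */
    }
}
```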
We divide the instruction execution into five different stages:
1. Instruction Fetch: Fetching the instruction data from memory.
2. Instruction Decode: Understanding the meaning of the instruction and reading register data.
3. Execute: Calculating the result using the ALU.
4. Memory Access (load/store instructions): Reading from and writing to memory.
5. Write-back (for all instructions except stores and branches): Writing the result back to registers.
We build the datapath components step by step according to the stages above, then instantiate and connect them in the CPU top-level module. We implement three CPU variants, `1-single-cycle`, `2-MMIO-trap`, and `3-pipeline`, whose source code can be found in my repository.
Furthermore, we successfully executed a comprehensive suite of test programs, in `3-pipeline` and across all three CPU variants, to demonstrate computational capability and robustness. The specific test files are categorized as follows:
* **Algorithmic Benchmarks:** `fibonacci.c` and `quicksort.c` were used to verify general logic and C-level execution.
* **Pipeline & Hazard Validation:** `hazard.S`, `hazard_extended.S`, and `sb.S` were employed to strictly validate data dependency resolution and byte-level memory operations.
* **System Integration:** `uart.c`, `irqtrap.c`, and `mmio.h` were used to confirm I/O communication and interrupt handling.
In addition to these custom tests, we verified architectural compliance by passing the **RISCOF** test suite. With the functional correctness established by this extensive verification, the following section details the micro-architectural implementation and waveform analysis.
## Adaptation of Homework 2 for MyCPU (3-pipeline)
Based on Homework 2, I adapted three programs (`LeetCode 260`, `BFloat16 operations`, and `fast_rsqrt`) and successfully executed them on **MyCPU**. In the following section, I conduct a detailed waveform analysis of the `fast_rsqrt` program across the five pipeline stages. I specifically used the `FiveStageFinal` architecture for this implementation (the table below lists the different implementations in `3-pipeline`). This design was chosen because it represents the most robust configuration, combining **data forwarding** to minimize stalls caused by RAW hazards, refined **flush** logic for accurate control flow handling, and an optimized **CLINT/CSR handshake** to support system-level operations.
| Implementation | Stages | Highlight |
| --- | --- | --- |
| `ImplementationType.ThreeStage` | IF → ID → EX/MEM/WB (folded) | Minimal pipeline that introduces control-flow redirection and CLINT interaction with a single execute stage. |
| `ImplementationType.FiveStageStall` | IF → ID → EX → MEM → WB | Classic five-stage design that resolves data hazards with interlocks (bubbles) and performs branch resolution in EX. |
| `ImplementationType.FiveStageForward` | IF → ID → EX → MEM → WB | Adds bypass paths from MEM/WB back to EX to reduce stalls caused by RAW hazards. |
| `ImplementationType.FiveStageFinal` | IF → ID → EX → MEM → WB | Combines forwarding, refined flush logic, and the optimized CLINT/CSR handshake that matches the interrupt-capable single-cycle core. |
Before presenting the waveform analysis, I would like to detail the specific code revisions required to successfully port the `fast_rsqrt` algorithm to the MyCPU bare-metal environment:
* **Adaptation for Bare-Metal Execution:** Since the CPU operates in a bare-metal environment without an operating system, standard libraries such as `<stdio.h>` are unavailable. Consequently, I removed all standard I/O dependencies. Instead of using `printf`, I verified execution results by writing data to specific memory addresses (e.g., storing the result at `0x100`). Furthermore, to signal program completion to the simulation testbench, I implemented a termination mechanism by writing a signature value to a specific address:
```c
*((volatile int *) 0x4) = 0xDEADBEEF; // Signal simulation termination
while(1); // Halt execution
```
* **Testbench Verification Strategy (`PipelineProgramTest.scala`):** I modified the testbench configuration in `src/test/scala/riscv/PipelineProgramTest.scala`. Directly comparing floating-point results in simulation is problematic, primarily because the reference values printed by the C program are often **truncated or rounded** for display, which causes mismatches with the precise internal representation. To address this, I converted the expected results into a **fixed-point format with 16 fractional bits** (the value scaled by $2^{16}$, so 1.0 is stored as 65536; referred to here as Q0.16). This approach allows for deterministic integer-based verification, ensuring that the `fast_rsqrt` implementation produces the correct approximation within the expected error margin.
```scala
it should "correctly run Assignment 2 (FastRsqrt Tests)" in {
  val expectedAnswers = Seq(
    65536,
    46341,
    37836,
    32768,
    20724,
    6553,
    2072,
    655,
    1
  )
  runProgram("fast_rsqrt.asmbin", cfg) { c =>
    c.clock.setTimeout(0)
    for (_ <- 0 until 500) {
      c.clock.step(1000)
      c.io.mem_debug_read_address.poke(0.U)
    }
    for ((expected, i) <- expectedAnswers.zipWithIndex) {
      val addr = 0x100 + i * 4
      c.io.mem_debug_read_address.poke(addr.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(expected.U, s"Failed at index $i")
    }
  }
}
```
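
As a sanity check on the expected values, the ideal result $\lfloor 2^{16}/\sqrt{x} \rfloor$ can be computed with integers only and compared against the list above. The sketch below is an independent cross-check, not part of the testbench; the helper names `isqrt64`, `ideal_rsqrt_q16`, and `within_tolerance` are hypothetical. For the nine inputs used here, the expected values differ from the ideal by at most 1 LSB.

```c
#include <stdint.h>

/* Integer square root: largest r with r*r <= n (bit-by-bit method). */
uint32_t isqrt64(uint64_t n) {
    uint64_t r = 0;
    for (int bit = 31; bit >= 0; bit--) {
        uint64_t cand = r | (1ull << bit);
        if (cand * cand <= n) r = cand;
    }
    return (uint32_t)r;
}

/* floor(2^16 / sqrt(x)), computed as floor(sqrt(2^32 / x)); x must be > 0. */
uint32_t ideal_rsqrt_q16(uint32_t x) {
    return isqrt64((1ull << 32) / x);
}

/* Returns 1 if got and ideal differ by at most max_err LSBs. */
int within_tolerance(uint32_t got, uint32_t ideal, uint32_t max_err) {
    uint32_t diff = got > ideal ? got - ideal : ideal - got;
    return diff <= max_err;
}
```

For example, `ideal_rsqrt_q16(3)` yields 37837, while the program produces 37836, a difference of one LSB.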
### Modified `fast_rsqrt.c`
```c
#include <stdint.h>

/* Approximately 2^16 / sqrt(2^exp) for exp = 0..31 (16 fractional bits) */
static const uint16_t rsqrt_table[32] = {
    65535, 46341, 32768, 23170, 16384, 11585, 8192, 5792,
    4096, 2896, 2048, 1448, 1024, 724, 512, 362,
    256, 181, 128, 91, 64, 45, 32, 23,
    16, 11, 8, 6, 4, 3, 2, 1};

/* Count leading zeros of a 32-bit value (returns 32 for x == 0) */
static int clz(uint32_t x) {
    if (!x) return 32;
    int n = 0;
    if (!(x & 0xFFFF0000u)) { n += 16; x <<= 16; }
    if (!(x & 0xFF000000u)) { n += 8; x <<= 8; }
    if (!(x & 0xF0000000u)) { n += 4; x <<= 4; }
    if (!(x & 0xC0000000u)) { n += 2; x <<= 2; }
    if (!(x & 0x80000000u)) { n += 1; }
    return n;
}

/* 32x32 -> 64-bit shift-and-add multiply (RV32I has no multiply instruction) */
static uint64_t mul32(uint32_t a, uint32_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 32; i++) {
        if (b & (1u << i))
            r += ((uint64_t)a << i);
    }
    return r;
}

uint32_t fast_rsqrt(uint32_t x) {
    if (x == 0) return 0xFFFFFFFF;
    if (x == 1) return 65536;

    /* Initial estimate: linear interpolation between two table entries */
    int exp = 31 - clz(x);
    uint32_t y_base = rsqrt_table[exp];
    uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
    uint32_t diff = y_base - y_next;
    uint32_t base = 1u << exp;
    uint32_t frac = ((x - base) << 16) >> exp;
    uint32_t y = y_base - (uint32_t)(mul32(diff, frac) >> 16);

    /* Two refinement steps: y = y * (3 - x * y^2) / 2 in fixed point */
    for (int i = 0; i < 2; i++) {
        uint32_t y2 = (uint32_t)(mul32(y, y));
        uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16);
        uint32_t term = (3u << 16) - xy2;
        uint32_t prod = (uint32_t)(mul32(y, term) >> 16);
        y = prod >> 1;
    }
    return y;
}

int main(void) {
    uint32_t test_inputs[] = {1, 2, 3, 4, 10, 100, 1000, 10000, 4294967295};
    int num_tests = 9;
    for (int i = 0; i < num_tests; i++) {
        uint32_t res = fast_rsqrt(test_inputs[i]);
        /* Results are stored at 0x100, 0x104, ... for the testbench to read */
        volatile int* output_addr = (volatile int*)(0x100 + i * 4);
        *output_addr = res;
    }
    *((volatile int *) 0x4) = 0xDEADBEEF; /* Signal simulation termination */
    while(1);                             /* Halt execution */
}
```
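
The refinement loop in `fast_rsqrt` has the shape of a Newton-Raphson iteration for $f(y) = y^{-2} - x$. The derivation below is added as a reading aid; the fixed-point symbol $Y = 2^{16} y$ is introduced here for illustration.

$$
y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)} = y_n - \frac{y_n^{-2} - x}{-2\,y_n^{-3}} = y_n + \frac{y_n - x\,y_n^{3}}{2} = \frac{y_n\,(3 - x\,y_n^{2})}{2}
$$

With $Y = 2^{16} y$, the variables in the loop correspond to (ignoring truncation from the right shifts):

$$
\begin{aligned}
\texttt{y2} &= Y^2 = 2^{32} y^2, \\
\texttt{xy2} &= \frac{x\,Y^2}{2^{16}} = 2^{16}\,x\,y^2, \\
\texttt{term} &= 3 \cdot 2^{16} - \texttt{xy2} = 2^{16}\,(3 - x\,y^2), \\
\texttt{prod} &= \frac{Y \cdot \texttt{term}}{2^{16}} = 2^{16}\,y\,(3 - x\,y^2), \\
Y_{\text{new}} &= \frac{\texttt{prod}}{2} = 2^{16} \cdot \frac{y\,(3 - x\,y^2)}{2}.
\end{aligned}
$$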
With the adaptation complete, we now turn to the waveform analysis:
### Stage 1: IF (Instruction Fetch)
Unlike the subsequent stages (ID, EX), where we analyze the datapath based on specific instruction types (R, I, S, etc.), the IF stage requires a different approach. At this point in the pipeline, the hardware is **instruction-agnostic**: it treats the fetched data simply as a raw 32-bit sequence without knowledge of its format or function. The critical logic in this stage is defined not by what the instruction performs, but by how the Program Counter (PC) is updated.
Consequently, instead of categorizing by instruction type, we structure the following analysis around the two fundamental behaviors of the fetch logic: **Initialization and Sequential Fetching** and **Control Flow Redirection and Flush Logic**.
#### Analysis of Initialization and Sequential Fetching:

As shown in the waveform, the simulation begins with the reset signal asserted, holding the **Program Counter (PC) at `0x0`**. Upon the de-assertion of reset (at **3** ps), the PC is correctly initialized to the system's boot address, `0x1000`. Following this initialization, the IF stage enters a steady state where the PC increments by 4 bytes per cycle (`0x1000` → `0x1004` → `0x1008`...), fetching instructions sequentially. This behavior confirms that the IF stage sustains a throughput of one instruction per cycle (IPC = 1) under linear execution flow, free of structural hazards.
#### Analysis of Control Flow Redirection and Flush Logic:

A critical control hazard resolution is observed when the PC reaches `0x1020` (at **33** ps). As this instruction enters the **ID** stage, the control logic detects a jump requirement:
* **Jump Detection:** The `io_jump_flag_id` signal asserts, indicating that the instruction flow must be redirected.
* **Target Resolution:** The target address is simultaneously provided via `io_jump_address_id` as `0x102c`.
* **Refined Flush Mechanism:** The `io_stall_flag_ctrl` signal goes high. This is characteristic of the `FiveStageFinal` implementation's refined flush logic. It ensures that the instruction sequentially fetched at `0x1024` is flushed or stalled, preventing invalid execution.
* **Result:** The PC transitions from `0x1020` to `0x102c` in the subsequent cycle, completing the control transfer without executing the invalid path.
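
The two fetch behaviors reduce to a single next-PC selection. The following C sketch is a simplified model (the function `next_pc` is hypothetical, and stall handling is deliberately left out):

```c
#include <stdbool.h>
#include <stdint.h>

/* Next-PC selection of a simplified fetch stage:
 * a taken jump overrides the sequential address pc + 4. */
uint32_t next_pc(uint32_t pc, bool jump_flag, uint32_t jump_address) {
    return jump_flag ? jump_address : pc + 4u;
}
```

With the values from the waveform, `next_pc(0x1000, false, 0)` gives `0x1004`, and `next_pc(0x1020, true, 0x102c)` gives `0x102c`.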
### Stage 2: ID (Instruction Decode)
In this section, we focus on describing how different instruction types affect key signals in the corresponding components during the decode stage. These instructions will also be discussed in later sections.
#### R type

The waveform at **241 ps** captures the decoding of a standard R-Type arithmetic instruction at PC `0x161c`.
* **Instruction Identification:** The opcode is `0x33` which is the standard opcode for register-to-register operations (like ADD, SUB, SLT). The funct3 value is `0x0`, indicating an **ADD or SUB** operation.
* **Operand Fetch:** The ID stage correctly identifies the source register rs1 as `0x0f` (which is register **a5**). The waveform shows `reg2_data` providing the value `0x00400000`, ready for the ALU.
* **Destination:** The destination register rd is decoded as `0x0f` (which is register **a5**), meaning the result will be written back to the same register.
* **Control Signals:** Unlike I, S, or U types, the immediate value is not the primary operand. While `io_ex_immediate` shows a value, the Control Unit will signal the Mux to ignore the immediate and use `reg2_data` instead.
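
The field extraction performed by the decoder can be sketched in C using the standard RV32I bit positions. The struct and function names are illustrative. The word `0x00e787b3` in the usage note is a hand-assembled `add a5, a5, a4` chosen for illustration; the waveform above does not state which register serves as rs2.

```c
#include <stdint.h>

typedef struct {
    uint32_t opcode, rd, funct3, rs1, rs2, funct7;
} rv_fields;

/* Splits a 32-bit instruction word into the fixed R-type fields. */
rv_fields rv_decode_fields(uint32_t inst) {
    rv_fields f;
    f.opcode = inst & 0x7fu;          /* bits  6:0  */
    f.rd     = (inst >> 7)  & 0x1fu;  /* bits 11:7  */
    f.funct3 = (inst >> 12) & 0x7u;   /* bits 14:12 */
    f.rs1    = (inst >> 15) & 0x1fu;  /* bits 19:15 */
    f.rs2    = (inst >> 20) & 0x1fu;  /* bits 24:20 */
    f.funct7 = (inst >> 25) & 0x7fu;  /* bits 31:25 */
    return f;
}
```

Decoding `0x00e787b3` yields opcode `0x33`, funct3 `0x0`, and rs1 = rd = `0x0f`, the same opcode, funct3, and register indices described above.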
#### I type

The waveform at **29 ps** captures the decoding of an I-Type arithmetic instruction (ADDI) at PC `0x101c`.
* **Instruction Identification:** The opcode is `0x13` (which corresponds to OP-IMM) and the funct3 value is `0x0`, confirming this is an **ADDI** instruction.
* **Operand Fetch:** The ID stage correctly identifies the source register rs1 as `0x06` (which is register **t1**). This creates a data dependency on the previous instruction. Instead of a second register value, the 12-bit immediate is **sign-extended**, and `io_ex_immediate` shows the value `0x00000050`.
* **Destination:** The instruction targets a destination register rd, meaning the result of the addition (the value of rs1 plus `0x50`) will be written back to the Register File in the WB stage.
* **Control Signals:** Unlike R-Type instructions, the Control Unit will signal the ALU Mux to ignore `reg2_data` and instead select the immediate value (`io_ex_immediate`) as the second operand for the calculation.
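
Sign extension of the 12-bit I-type immediate can be sketched as follows. The function name `imm_i` is hypothetical, and the instruction words in the usage note are hand-assembled examples (`addi t1, t1, 80` and `addi x1, x0, -1`), not values read from the waveform.

```c
#include <stdint.h>

/* Extracts the I-type immediate (bits 31:20) and sign-extends it to
 * 32 bits. The result is returned as the raw 32-bit pattern. */
uint32_t imm_i(uint32_t inst) {
    uint32_t imm = inst >> 20;             /* 12-bit field       */
    if (imm & 0x800u) imm |= 0xfffff000u;  /* replicate bit 11   */
    return imm;
}
```

`imm_i(0x05030313)` returns `0x00000050`, the value seen on `io_ex_immediate`, while `imm_i(0xfff00093)` returns `0xffffffff` (that is, -1).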
#### S type

The waveform at **85 ps** illustrates the ID stage handling a Store instruction at PC `0x15a4`.
* **Instruction Identification:** The opcode is `0x23` and funct3 is `0x2`, which decodes to **SW**.
* **Address Calculation:** The decoder identifies the base address register rs1 as `0x2` (sp, Stack Pointer). The store offset is reconstructed on the `io_ex_immediate` signal as `0x00000048`.
* **Data Preparation:** For S-Type instructions, the value to be stored is read from the second source register. The `reg2_data` signal carries the valid data `0x00001050`, ready to be written to memory in the MEM stage.
* **Result:** This confirms the decoding of an instruction like `sw rs2, 72(sp)`, where the CPU prepares to store the value `0x1050` into the stack at offset 72.
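
The S-type offset is "reconstructed" because its bits are split across two fields of the instruction word. A minimal C sketch (the function name `imm_s` is hypothetical, and the test words in the usage note are hand-assembled `sw` encodings used only for illustration):

```c
#include <stdint.h>

/* Reassembles the S-type immediate from imm[11:5] (bits 31:25) and
 * imm[4:0] (bits 11:7), then sign-extends it to 32 bits. */
uint32_t imm_s(uint32_t inst) {
    uint32_t imm = ((inst >> 25) << 5)       /* imm[11:5] */
                 | ((inst >> 7) & 0x1fu);    /* imm[4:0]  */
    if (imm & 0x800u) imm |= 0xfffff000u;    /* sign-extend from bit 11 */
    return imm;
}
```

For the encoding `0x04112423` (`sw ra, 72(sp)`), `imm_s` returns `0x48`; for `0xfe112e23` (`sw ra, -4(sp)`), it returns `0xfffffffc`.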
#### B type

The waveform at **33 ps** illustrates the ID stage's handling of a B-Type instruction, specifically a conditional branch located at PC `0x1020`.
* **Decoding:** The control unit identifies the instruction format via opcode `0x63` (BRANCH) and funct3 `0x7`, which decodes to **BGEU**.
* **Operand Retrieval:** The ID stage retrieves the operands for comparison. `reg1_data` carries the value `0x00002064`, and `reg2_data` provides `0x00002014`.
* **Branch Decision:** The internal comparator evaluates the condition (`0x2064` >= `0x2014`). Since the condition is true, the `io_if_jump_flag` signal asserts high immediately.
* **Target Calculation:** Simultaneously, the branch target address is calculated and output via `io_if_jump_address` as `0x0000102c`. This signal is sent to the IF stage to override the next PC, executing the jump analyzed in the previous section.
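
The branch offset is stored in scrambled order inside the instruction word. The sketch below reassembles it in C; `imm_b` is a hypothetical name, and the words in the usage note are hand-assembled examples (`bgeu t1, t2, 16` and `beq x0, x0, -4`) rather than values taken from the waveform.

```c
#include <stdint.h>

/* Reassembles the B-type immediate. Bit 0 is always zero because
 * branch targets are 2-byte aligned. Result is sign-extended. */
uint32_t imm_b(uint32_t inst) {
    uint32_t imm = (((inst >> 31) & 0x1u)  << 12)   /* imm[12]   */
                 | (((inst >> 7)  & 0x1u)  << 11)   /* imm[11]   */
                 | (((inst >> 25) & 0x3fu) << 5)    /* imm[10:5] */
                 | (((inst >> 8)  & 0xfu)  << 1);   /* imm[4:1]  */
    if (imm & 0x1000u) imm |= 0xffffe000u;          /* sign-extend */
    return imm;
}
```

`imm_b(0x00737863)` returns 16 and `imm_b(0xfe000ee3)` returns `0xfffffffc` (that is, -4).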
#### U type

The waveform at **25 ps** captures the decoding of a U-Type instruction (AUIPC) at PC `0x1018`.
* **Instruction Identification:** The opcode is `0x17`, which identifies the instruction as **AUIPC** (Add Upper Immediate to PC).
* **Operand Generation:** Unlike R or I types, U-Type instructions do not access source registers (rs1 or rs2). The decoder extracts the upper 20-bit immediate value and shifts it, generating the value `0x00001000` on the `io_ex_immediate` signal.
* **Destination:** The destination register rd is decoded as `0x06` (which is register **t1**). The result of the calculation (PC + Immediate) will be written back to this register.
* **Control Signals:** The Control Unit will configure the ALU Mux signals to select the Program Counter (PC) and the Immediate value (`io_ex_immediate`) as operands, effectively bypassing the register file outputs (`reg1_data` and `reg2_data`).
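
For U-type instructions, the immediate already sits in the upper 20 bits of the instruction word, so "shifting" amounts to masking off the low 12 bits. A minimal C sketch with hypothetical helper names; `0x00001317` is a hand-assembled `auipc t1, 0x1`:

```c
#include <stdint.h>

/* U-type immediate: bits 31:12 of the instruction, low 12 bits zero. */
uint32_t imm_u(uint32_t inst) {
    return inst & 0xfffff000u;
}

/* AUIPC result: PC of the instruction plus the U-type immediate. */
uint32_t auipc_result(uint32_t pc, uint32_t inst) {
    return pc + imm_u(inst);
}
```

`imm_u(0x00001317)` returns `0x00001000`, matching the value on `io_ex_immediate`.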
#### J type

The waveform at **73 ps** highlights the decoding of an unconditional jump.
* **Instruction Identification:** The opcode is `0x6f`. This corresponds to the **JAL** instruction.
* **Link Register:** The destination register rd is decoded as `0x01`, which is the conventional ra register.
* **Offset Generation:** The J-Type immediate (jump offset) is reconstructed and sign-extended on the `io_ex_immediate` bus, showing the value `0x00000550`.
* **Target Calculation:** The ID stage simultaneously calculates the target address by adding the current PC (`0x104c`) and the offset (`0x550`), resulting in the jump target `0x0000159c`.
* **Control Flow:** The `io_if_jump_flag` signal asserts high immediately, and the calculated target address (`0x159c`) is sent to the IF stage to override the next PC fetch. This jump is unconditional.
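
The J-type offset reconstruction and target calculation can be sketched in C. The function name `imm_j` is hypothetical; `0x550000ef` is a hand-assembled `jal ra, 0x550` that reproduces the offset shown above, and `0xffdff06f` is `jal x0, -4`.

```c
#include <stdint.h>

/* Reassembles the J-type immediate and sign-extends it from bit 20. */
uint32_t imm_j(uint32_t inst) {
    uint32_t imm = (((inst >> 31) & 0x1u)   << 20)   /* imm[20]    */
                 | (((inst >> 12) & 0xffu)  << 12)   /* imm[19:12] */
                 | (((inst >> 20) & 0x1u)   << 11)   /* imm[11]    */
                 | (((inst >> 21) & 0x3ffu) << 1);   /* imm[10:1]  */
    if (imm & 0x100000u) imm |= 0xffe00000u;         /* sign-extend */
    return imm;
}
```

Adding `imm_j(0x550000ef)` to the PC `0x104c` gives `0x159c`, the jump target observed in the waveform.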
### Stage 3: EX (Execute)
#### R type

The execution stage processes the R-Type instruction previously decoded at PC `0x161c`. Execution begins at the clock edge at **245 ps**.
* **Operation:** The control unit selects the ALU function from the instruction's funct3 (which was 0) and funct7, which here results in an **ADD** operation.
* **Operand Fetch:** The EX stage receives the two source operands from the pipeline registers.
    * **ALU Op1:** The first operand (`alu_op1`) is the content of rs1 (register a5), carrying the value `0xfffffff0`.
    * **ALU Op2:** The second operand (`alu_op2`) is the content of rs2 (a different register), carrying the value `0x00400000`.
* **Result Calculation:** The ALU performs the addition `0xfffffff0 + 0x00400000`. The carry out of bit 31 is discarded, so the sum wraps modulo $2^{32}$.
    * **ALU Result:** The result (`alu_result`) is `0x003ffff0`. This result is passed to the MEM and WB stages.
* **Implication:** Since this is an R-Type instruction, the EX stage does not calculate a memory address or branch target; its sole function is to compute the result to be written back to the register file.
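
The 32-bit wrap-around in this addition can be reproduced in C, where unsigned arithmetic is defined to wrap modulo $2^{32}$. The function `alu_add_sub` is a hypothetical sketch of the ADD/SUB selection (bit 5 of funct7, i.e. instruction bit 30, distinguishes SUB from ADD in RV32I); it is not the ALU from the Chisel source.

```c
#include <stdint.h>

/* ADD/SUB for R-type instructions with funct3 == 0.
 * funct7 == 0x20 selects SUB, funct7 == 0x00 selects ADD. */
uint32_t alu_add_sub(uint32_t op1, uint32_t op2, uint32_t funct7) {
    return (funct7 & 0x20u) ? op1 - op2 : op1 + op2;
}
```

`alu_add_sub(0xfffffff0, 0x00400000, 0x00)` returns `0x003ffff0`, the value seen on `alu_result`.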
#### I type

The waveform at **33 ps** captures the execution phase of the I-type ADDI instruction at PC `0x1020`.
* **Operation:** The ALU Control receives opcode `0x13` and funct3 `0x0`, configuring the ALU for an **ADDI** operation.
* **Operand Fetch:** The ALU inputs are prepared from the ID stage:
    * **ALU Op1:** The first operand (`alu_io_op1`) carries the value `0x00002014`. This comes from the source register rs1.
    * **ALU Op2:** The second operand (`alu_io_op2`) carries the immediate value `0x00000050`. For I-Type instructions, the immediate is used directly as the second operand.
* **Execution Result:** The ALU performs the addition (`0x2014 + 0x50`):
    * **ALU Result:** The calculated result (`alu_io_result`) is `0x00002064`.
    * **Outcome:** This result is the new value that will be written back to rd in the WB stage.
#### S type

The waveform at **89 ps** captures the execution phase of the Store Word instruction (S-Type).
* **Operation:** The ALU Control receives opcode `0x23` and funct3 `0x2`, configuring the ALU for an **SW** operation.
* **Address Calculation:** The primary role of the ALU for a Store instruction is to calculate the effective memory address.
    * **ALU Op1:** The first operand (`alu_io_op1`) carries the base address from the stack pointer (sp), with the value `0x003fffb0`.
    * **ALU Op2:** The second operand (`alu_io_op2`) carries the sign-extended immediate offset `0x0000004c`.
    * **ALU Result:** The ALU performs the addition (`0x3fffb0 + 0x4c`). The resulting effective address (`alu_io_result`) is `0x003ffffc`. This is the memory address where the data will be written in the next stage.
* **Data Preparation:** Unlike other instructions, the S-Type must also pass along the data to be stored. The signal `io_mem_reg2_data` (or an equivalent data bus derived from rs2) holds the value `0x00001050`. This data is forwarded to the memory stage alongside the calculated address.
* **Control Flow:** The `alu_ctrl_io_opcode` is `0x23`, confirming the Store operation is active throughout the execution cycle.
#### B type

The waveform at **41 ps** details the execution of a Conditional Branch instruction.
* **Instruction Identification:** The `alu_ctrl_io_opcode` is `0x63` and the funct3 value is `0x7`, indicating a **BGEU (Branch if Greater or Equal Unsigned)** instruction.
* **Target Address Calculation:** In this execution cycle, the ALU calculates the potential jump target address.
    * **ALU Op1:** The first operand (`alu_io_op1`) carries the Program Counter (PC) value of the current instruction: `0x0000101c`.
    * **ALU Op2:** The second operand (`alu_io_op2`) carries the sign-extended immediate offset: `0x00000010`.
    * **ALU Result:** The ALU performs the addition (PC + offset).
    * **Result:** The calculated `alu_io_result` is `0x0000102c`. This is the target address the PC will jump to if the branch condition is met.
* **Execution Logic:** While the ALU calculates the target address, the branch comparison (checking whether the condition is met) is handled in the ID stage by comparator logic, which determines whether to assert the jump flag for the next fetch cycle.
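
The difference between the unsigned comparison of BGEU and the signed comparison of BGE, together with the target calculation, can be sketched in C. All three function names are hypothetical, and the cast to `int32_t` assumes the usual two's-complement representation.

```c
#include <stdbool.h>
#include <stdint.h>

/* BGEU: compare the operands as unsigned 32-bit values. */
bool bgeu_taken(uint32_t rs1, uint32_t rs2) {
    return rs1 >= rs2;
}

/* BGE: compare the same bit patterns as signed values
 * (assumes two's-complement conversion). */
bool bge_taken(uint32_t rs1, uint32_t rs2) {
    return (int32_t)rs1 >= (int32_t)rs2;
}

/* Next PC after a branch: target if taken, otherwise fall through. */
uint32_t branch_next_pc(uint32_t pc, uint32_t offset, bool taken) {
    return taken ? pc + offset : pc + 4u;
}
```

With the operands from the ID stage, `bgeu_taken(0x2064, 0x2014)` is true, and `branch_next_pc(0x101c, 0x10, true)` gives `0x102c`. The two comparisons disagree for a pattern such as `0xffffffff` versus `1`, which is large when unsigned but negative when signed.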
#### U type

The waveform at **29 ps** details the execution of the AUIPC instruction at PC `0x1014`.
* **Instruction Identification:** The `alu_ctrl_io_opcode` reads `0x17`, which is the unique opcode for the **AUIPC** instruction.
* **Operand Fetch:** The ALU receives the operands required for the PC-relative calculation.
    * **ALU Op1:** The first operand (`alu_io_op1`) carries the Program Counter (PC) of the current instruction: `0x00001014`.
    * **ALU Op2:** The second operand (`alu_io_op2`) carries the U-Type immediate value, which has already been shifted left by 12 bits: `0x00001000`.
* **ALU Operation:** The ALU performs a standard addition (PC + immediate), which is `0x00001014 + 0x00001000`.
* **Execution Result:**
    * **ALU Result:** The calculated result (`alu_io_result`) is `0x00002014`.
    * **Outcome:** This value represents a PC-relative address (or a high-part constant) that will be written into rd in the subsequent WB stage.
#### J type

The waveform at **77 ps** captures the execution of the **JAL (Jump and Link)** instruction originally at PC `0x104c`.
* **Instruction Identification:** The `alu_ctrl_io_opcode` is `0x6f`, confirming the instruction type in the execution stage.
* **Role of ALU:** Since the jump target address (`0x159c`) was already calculated and taken in the previous ID stage, the ALU's task in the EX stage is to compute the return address to store in ra.
* **Operand Fetch:**
    * **ALU Op1:** The first operand (`alu_io_op1`) carries the PC of the JAL instruction itself: `0x0000104c`.
    * **ALU Op2:** The second operand (`alu_io_op2`) carries the constant value 4 (`0x00000004`). This represents the standard instruction size.
* **Execution Result:**
    * **ALU Result:** The ALU performs the addition (PC + 4). The result (`alu_io_result`) is `0x00001050`.
    * **Outcome:** This value (`0x1050`) is the address of the instruction sequentially following the jump. It will be passed to the Write Back stage to be stored in register 1 (ra), allowing a future `ret` instruction to return to the correct location.
### Stage 4: MEM (Memory Access)
In this stage, with the exception of Data Memory access instructions (such as **LW and SW**), other instructions simply bypass this stage and proceed directly to the WB stage. Therefore, we analyze the behavior in two scenarios:
#### No Memory Access (R-Type Example)

The waveform at **249 ps** captures the MEM stage of the **ADD** instruction.
Apart from stores (S-Type, such as SW) and loads (I-Type, such as LW), instructions (R-Type, arithmetic I-Type, and B, U, and J types) do not access data memory. For them, the MEM stage acts as a conduit and simply passes values through.
* **Control Signals:** As seen in the waveform, both `io_memory_read_enable` and `io_memory_write_enable` signals remain low. This confirms that no interaction with the data memory occurs.
* **Data Forwarding:** The primary task in this scenario is to preserve the execution result. The ALU result (`0x003ffff0`), which was calculated in the previous EX stage, is passed directly through the `io_forward_data` bus.
* **Result:** The value is simply forwarded to the Write Back stage to be written into rd.
#### Memory Access (S-Type Example)

The waveform at **93 ps** captures the MEM stage of the **SW** instruction.
For instructions like S-Type (Store Word) or I-Type (Load Word), this stage is actively involved in reading from or writing to the memory.
* **Control Signals:**
    * The `io_memory_write_enable` signal asserts High, clearly indicating that a write operation is in progress.
    * The write strobes (`io_bundle_write_strobe_0` through `io_bundle_write_strobe_3`) are **all high**, confirming a 4-byte (Word) write.
* **Addressing:** The effective memory address, calculated by the ALU in the EX stage as `0x003ffffc`, is applied to the memory address bus (`io_bundle_address`).
* **Data Storage:**
    * The value to be stored, `0x00001050` (retrieved from register rs2 in the ID stage and forwarded), appears on the write data bus (`io_bundle_write_data`). This value is then written to the specified address `0x003ffffc`.
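
One common way to derive per-byte write strobes from the store width and the low address bits is sketched below in C. This is an assumption for illustration: the function `write_strobe` is hypothetical, and the source does not show how MyCPU generates its strobe signals.

```c
#include <stdint.h>

/* Returns a 4-bit mask with one bit per byte lane of a 32-bit word.
 * funct3: 0 = sb, 1 = sh, 2 = sw (other values produce no write).
 * Halfword stores are assumed to be 2-byte aligned. */
uint32_t write_strobe(uint32_t funct3, uint32_t address) {
    uint32_t lane = address & 0x3u;           /* byte offset within word */
    switch (funct3) {
    case 0x0: return 0x1u << lane;            /* sb: one lane            */
    case 0x1: return 0x3u << (lane & 0x2u);   /* sh: lower or upper half */
    case 0x2: return 0xfu;                    /* sw: all four lanes      */
    default:  return 0x0u;
    }
}
```

For the store analyzed above, `write_strobe(0x2, 0x003ffffc)` returns `0xf`, which corresponds to all four strobes being high.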
### Stage 5: WB (Write Back)
In the final stage of the pipeline, the instruction determines whether to write the execution or memory results back to the Register File. We categorize this behavior into two scenarios.
#### Write Back to Register (R, I, U, J Types): I-Type Example

The waveform at **41 ps** captures the WB stage of the **ADDI** instruction.
Instructions such as R-Type, I-Type, U-Type, and J-Type require updating rd.
* **Scenario Analysis:** Taking the I-Type instruction at PC `0x101c` as an example, the instruction has passed through the ID, EX, and MEM stages.
* **Data Selection:** The ALU result calculated in the EX stage was `0x00002064`. Since this is an arithmetic instruction, this ALU result is selected as the write-back data.
* **Write Operation:** The waveform shows the `io_regs_write_data` signal carrying the value `0x00002064`. The `io_regs_write_enable` signal asserts High, allowing this value to be written into the destination register rd.
#### No Write Back (S, B Types): B-Type Example

The waveform at **45 ps** captures the WB stage of the **BGEU** instruction.
Instructions such as S-Type (Store) and B-Type (Branch) do not produce a result that needs to be stored in the General Purpose Registers.
* **Scenario Analysis:** The instruction in this stage is the branch instruction described earlier at PC `0x00001020`.
* **Behavior:** The purpose of this instruction, deciding whether to jump, was already resolved earlier in the pipeline. It does not target any destination register (rd).
* **Control Signal:**
    * Consequently, the `io_regs_write_data` signal shows `0x00000000`, confirming that no valid data is being written back to the register file.
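
The two write-back scenarios can be summarized as a function of the opcode. The C sketch below covers only the opcodes discussed in this report; `writes_rd` is a hypothetical name and does not correspond to a signal in the Chisel source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if an instruction with this opcode writes a result to rd. */
bool writes_rd(uint32_t opcode) {
    switch (opcode & 0x7fu) {
    case 0x33:  /* R-type OP       */
    case 0x13:  /* I-type OP-IMM   */
    case 0x03:  /* I-type LOAD     */
    case 0x67:  /* JALR            */
    case 0x37:  /* LUI             */
    case 0x17:  /* AUIPC           */
    case 0x6f:  /* JAL             */
        return true;
    default:    /* 0x23 STORE, 0x63 BRANCH, and anything not modeled */
        return false;
    }
}
```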
## (TODO) Propose effective approaches to further compress the Nyancat program
## Reference
* [Lab3: Construct a RISC-V CPU with Chisel](https://hackmd.io/@sysprog/B1Qxu2UkZx)
* [Assignment2: Complete Applications](https://hackmd.io/vVgWuZHLSpW3sMXR-zEXhQ)
* [Assignment3: Your Own RISC-V CPU](https://hackmd.io/@sysprog/2025-arch-homework3)