# Assignment 3
## Chisel
### **Operation of "Hello World in Chisel**
The "Hello World in Chisel" example demonstrates that Chisel is a hardware construction language rather than a software programming language. When the program is executed, Chisel elaborates the design and generates the corresponding Verilog code. The `println` statement is executed during the elaboration phase, not during hardware runtime. This example illustrates the fundamental difference between software execution and hardware generation in Chisel.
### **Enhanced "Hello World in Chisel" with an Adder Circuit**
To enhance the original "Hello World in Chisel" example, a combinational adder circuit was integrated into the design to demonstrate actual hardware computation.
In the enhanced module, two 8-bit unsigned input signals `a` and `b` are defined using `UInt(8.W)`. The output signal `sum` is defined as a 9-bit unsigned value to prevent overflow. The addition operation is described using the following assignment:`io.sum := io.a + io.b`

The line `println((new ChiselStage).emitVerilog(new HelloAdder))` triggers the Chisel elaboration process and prints out the generated Verilog code of the `HelloAdder` module. This confirms that the high-level Chisel description is successfully translated into a corresponding hardware representation in Verilog.
### What I Have Learned from the Chisel Bootcamp
Through the Chisel Bootcamp, I built a solid foundation in hardware design using Chisel and gradually developed a clearer understanding of how digital hardware is constructed from a high-level perspective. Compared with traditional Verilog, Chisel feels much more flexible. By using Scala features such as parameterization and object-oriented programming, I was able to design hardware modules that are more reusable and easier to scale.
## CA25
### 1. Single-Cycle CPU
#### Exercise 1 – Immediate Extension (InstructionDecode)
This exercise implements immediate value generation for all RISC-V RV32I instruction formats.
Implemented immediate types include:
- I-type: Used for load, immediate arithmetic, and JALR.
- S-type: Used for store instructions.
- B-type: Used for branch instructions with PC-relative offset.
- U-type: Used for LUI and AUIPC.
- J-type: Used for JAL.
Each immediate follows the official RISC-V bit reordering rule and is properly sign-extended to 32 bits using Fill.
#### Exercise 2 – Control Signal Generation (InstructionDecode)
This exercise completes the control logic that determines:
- Write-back source selection
- ALU operand source selection
- Immediate type selection
Key implemented signals include:
- Write-back source
- ALU result (default)
- Memory read data (load)
- PC + 4 (JAL/JALR)
ALU operand 1 source
- Register value
- Program counter (for branch, AUIPC, JAL)
ALU operand 2 source
- Register value
- Immediate value
Immediate type selector
- Correctly assigns I, S, B, U, J based on instruction type
This control logic ensures that each instruction type routes correct operands and data paths in the single-cycle CPU.
#### Exercise 3 – ALU Control Logic (ALUControl)
This exercise implements ALU function decoding based on:
- opcode
- funct3
- funct7
Implemented features:
- Differentiation of ADD vs SUB using funct7(5)
- Differentiation of SRL vs SRA using funct7(5)
Support for all RV32I arithmetic and logical operations:
- ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT, SLTU
This enables correct ALU behavior for both R-type and I-type instructions.
#### Exercise 4 – Branch Comparison Logic (Execute)
This exercise implements all six RISC-V branch conditions:
- BEQ: rs1 == rs2
- BNE: rs1 ≠ rs2
- BLT: signed less than
- BGE: signed greater or equal
- BLTU: unsigned less than
- BGEU: unsigned greater or equal
Signed comparisons were implemented using .asSInt, while unsigned branches use direct UInt comparison.
Correct branch evaluation is critical for control flow correctness.
#### Exercise 5 – Jump Target Address Calculation (Execute)
This exercise implements target address generation for:
- Branch: PC + immediate
- JAL: PC + immediate
- JALR: (rs1 + immediate) & ~1
The least significant bit of JALR target is cleared to enforce 2-byte alignment.
This ensures correct control flow transfer without misaligned jumps.
#### Exercise 6 – Load Data Extension (MemoryAccess)
This exercise implements proper sign extension and zero extension for all load instructions:
- LB → sign-extended 8-bit
- LBU → zero-extended 8-bit
- LH → sign-extended 16-bit
- LHU → zero-extended 16-bit
- LW → full 32-bit
The implementation extracts byte/halfword using the memory address index and applies Fill + Cat to correctly extend data to 32 bits.
#### Exercise 7 – Store Data Alignment (MemoryAccess)
This exercise implements store alignment and byte strobe control:
- SB: enables one byte strobe based on address index, shifts 8-bit data.
- SH: enables two adjacent bytes based on mem_address_index(1):
- lower halfword → bytes 0–1
- upper halfword → bytes 2–3
- SW: enables all four byte strobes, writes full 32-bit data.
This ensures correct partial and full memory writes.
#### Exercise 8 – WriteBack Source Selection (WriteBack)
This exercise completes the final write-back data multiplexer:
- Default: ALU result
- Load instruction: memory read data
- JAL/JALR: PC + 4 return address
Correct write-back behavior guarantees architectural register correctness.
#### Exercise 9 – PC Update Logic (InstructionFetch)
This exercise implements program counter update logic:
- If branch/jump is taken : PC = jump target address
- Otherwise : PC = PC + 4
- If instruction is invalid : PC holds its value
This enables both sequential execution and correct control flow redirection.
**RISC-V compliance tests (make compliance)**
Next, I ran the full RISC-V compliance suite for RV32I by executing make compliance inside the 1-single-cycle directory.
This target uses the riscv-arch-test programs under` tests/riscv-arch-test/riscv-test-suite/rv32i_m/`.
Each test is written in assembly (e.g., add-01.S, beq-01.S, lb-align-01.S), compiled by the RISC-V toolchain into ELF binaries, and then converted into signatures.
The MyCPU test harness runs each program on my single-cycle core and compares the final memory signature with the reference implementation.
Initially, all compliance tests failed with the error message:
`java.lang.Exception: No RISC-V toolchain found. Set $RISCV or install to $HOME/riscv/toolchain`
To fix this, I installed a RV32I cross-compiler toolchain and set the RISCV environment variable so that riscv32-unknown-elf-readelf and related tools are available on the PATH.
After configuring the toolchain, I re-ran make compliance, and all 41 RV32I tests (arithmetic/logic, branches, jumps, loads/stores with alignment, LUI/AUIPC, and privilege corner cases) passed successfully.
The final HTML report (1-single-cycle/results/report.html) shows every test case marked as Passed.
**Final Result**
`make test`

After completing all exercises in 1-single-cycle, all unit tests passed successfully
`make compliance`


### 2. MMIO & Trap
#### Exercise 1 – Immediate Extension (InstructionDecode.scala)
This exercise implements RISC-V immediate generation for all instruction formats:
- I-type (ADDI, LW, JALR)
- S-type (SW, SH, SB)
- B-type (BEQ, BLT, etc.)
- U-type (LUI, AUIPC)
- J-type (JAL)
Each immediate type requires:
- Proper bit reordering
- Correct sign extension
- Correct alignment bit insertion (for B-type and J-type)
This ensures that:
- Branch targets are computed correctly
- Jump offsets are interpreted correctly
- Memory accesses use correct address offsets
This exercise is foundational for correct control flow and memory behavior.
#### CA25: Exercise 3 – ALU Control Logic (ALUControl.scala)
This exercise completes the ALU operation decoding for:
- I-type arithmetic instructions (ADDI, XORI, ORI, ANDI, SLLI, SRLI, SRAI)
- R-type arithmetic instructions (ADD, SUB, XOR, OR, AND, SLL, SRL, SRA)
Key behaviors implemented:
- Differentiation of ADD vs SUB via funct7[5]
- Differentiation of SRL vs SRA via funct7[5]
- Correct mapping of funct3 to ALU operations
This guarantees that all integer ALU instructions behave exactly as specified by RV32I.
#### CA25: Exercise 4 – Branch Comparison Logic (Execute.scala)
This exercise implements all six RISC-V branch conditions:
Instruction Comparison Type
- BEQ Equality
- BNE Inequality
- BLT Signed less-than
- BGE Signed greater-or-equal
- BLTU Unsigned less-than
- BGEU Unsigned greater-or-equal
Signed comparisons require explicit type reinterpretation, while unsigned comparisons use raw values.
This logic controls whether a branch is taken, directly affecting:
- Program counter (PC) redirection
- Control flow correctness
- Compliance test behavior
#### CA25: Exercise 5 – Jump Target Address Calculation (Execute.scala)
This exercise completes:
- Branch target address calculation
- JAL target calculation
- JALR target calculation with alignment masking
Key logic:
- Branch and JAL targets use PC-relative immediate addition
- JALR target clears the lowest bit for alignment
- Jump decision sets if_jump_flag correctly
- This guarantees correctness of:
- Subroutine calls (JAL, JALR)
- Returns
- Indirect jumps
#### CA25: Exercise 6 – Control Signal Generation (InstructionDecode.scala)
This exercise determines:
- Which instructions use rs1/rs2
- Which instructions write to rd
Which source is used in write-back stage:
- ALU Result
- Memory Data
- CSR Data
- PC+4 (for JAL/JALR)
It also configures:
- ALU operand selection
- Memory read/write enables
- CSR write enables
This is the central routing logic of the pipeline.
#### CA25: Exercise 10 – CSR Address Mapping (CSR.scala)
This exercise builds the CSR register lookup table, mapping CSR addresses to physical registers:
- mstatus
- mie
- mtvec
- mscratch
- mepc
- mcause
- cycle (split into CycleL and CycleH)
This table enables:
- CSR read instructions to return correct values
- Debug access to inspect CSR state
#### CA25: Exercise 11 – CSR Write Priority Logic (CSR.scala)
This exercise resolves write conflicts between CPU and CLINT:
- When both attempt to write the same cycle : CLINT has absolute priority
Implemented for:
- mstatus
- mepc
- mcause
This is critical for precise interrupt handling, ensuring that traps are not corrupted by simultaneous CSR instructions.
#### CA25: Exercise 12 – WriteBack Source Selection (WriteBack.scala)
This extends write-back from 3 sources → 4 sources:
- Source Used By
- ALU Result Arithmetic
- Memory Data Loads
- CSR Read Data CSR Instructions
- PC+4 JAL, JALR
This ensures:
- CSR read-modify-write instructions work correctly
- Return addresses are saved correctly
#### CA25: Exercise 12 – Load Data Extension (MemoryAccess.scala)
This exercise implements LB/LH/LW/LBU/LHU:
- LB/LH use sign extension
- LBU/LHU use zero extension
- Correct byte/halfword is selected using mem_address_index
This guarantees:
- All load instructions behave correctly under unaligned addresses
- Data interpretation matches the RISC-V specification
#### CA25: Exercise 13 – Store Data Alignment (MemoryAccess.scala)
This exercise implements SB, SH, SW using:
- Byte strobe vectors
- Correct data shifting
Behavior:
- SB enables exactly one byte lane
- SH enables two adjacent byte lanes
- SW enables all four lanes
This ensures that:
- Memory writes update only the intended bytes
- MMIO devices receive correct byte-level updates
#### CA25: Exercise 13 – Interrupt Entry (CLINT.scala)
- This exercise implements trap entry behavior:
- Save current PC into mepc
- Update mcause
- Update mstatus interrupt bits
- Redirect PC to mtvec
This establishes full machine-mode interrupt support.
#### CA25: Exercise 14 – Trap Return (CLINT.scala)
This exercise restores CPU state on mret:
- Restores interrupt enable from mstatus
- Redirects PC back to mepc
This allows:
- Clean return from interrupt handlers
- Correct nested trap behavior
#### CA25: Exercise 15 – PC Update with Interrupt Priority (InstructionFetch.scala)
This exercise implements PC update priority logic:
- Interrupt highest priority
- Jump/Branch second
- Sequential PC+4 lowest
This ensures:
- No jump can override an interrupt
- No interrupt can be masked by a branch
Fully deterministic control flow
During development, multiple implementation bugs were encountered and resolved:
| Issue | Root Cause | Fix |
| ------------------------- | -------------------------------------- | -------------------------------------- |
| Load sign extension error | Incorrect LB/LH bit replication | Corrected Fill + Cat logic |
| Store byte misalignment | Strobe index mismatch | Aligned strobes with mem_address_index |
| CSR not updating | Priority conflict with CLINT | Enforced CLINT write priority |
| JALR wrong target | LSB not cleared | Applied alignment mask |
| PC update incorrect | Interrupt priority handled incorrectly | Implemented nested Mux |
**Final Result**
`make test`

`make compliance`

**Nyancat VGA Rendering Verification**
The Nyancat demo was executed using the Verilator-based simulation by running:
`make demo`

The VGA window was successfully displayed, and the Nyancat animation was rendered correctly with continuous updates. The rainbow trail and cat sprite appeared with proper colors and alignment, without visual artifacts such as flickering or corrupted pixels. This confirms that:
- The CPU correctly performs MMIO store operations to the VGA frame buffer.
- Store byte and store halfword alignment logic works correctly.
- The trap and CSR mechanisms are stable during long-running graphical execution.
**Proposed Nyancat Program Compression Methods**
Although the current Nyancat program runs correctly, its size can be further reduced using the following methods:
1. Loop-based drawing
Replace fully unrolled pixel stores with nested loops to draw rows and columns, significantly reducing instruction count.
2. Tile-based pattern reuse
Store small reusable image blocks (tiles) and copy them to multiple screen locations instead of storing full-frame pixel data.
3. Symmetry optimization
Since the Nyancat image is horizontally symmetric, only half of the sprite needs to be stored, and the other half can be generated by mirroring.
4. Run-Length Encoding (RLE)
Encode long horizontal regions of the same color using (color, length) pairs to reduce frame data size.
These techniques can greatly reduce both the program size and the memory footprint while preserving the same VGA output.
### 3.Pipeline
#### Exercise 16 – ALU Completion for Pipeline CPU
In this exercise, the missing ALU operations required by the pipelined CPU were completed. These included shift instructions (SLL, SRL, SRA), comparison instructions (SLT, SLTU), and logical operations (XOR, OR, AND). All shift instructions correctly use only the lower 5 bits of the shift amount as specified by RV32I. Signed operations were implemented using proper SInt casting. This ensures that all arithmetic and logical instructions function correctly in later pipeline stages.
#### Exercise 17 – Data Forwarding to EX Stage
This exercise implements EX-stage data forwarding to resolve classic RAW (Read-After-Write) hazards without stalling. If the destination register in MEM or WB matches rs1_ex or rs2_ex, and the register is not x0, the correct value is forwarded directly to the ALU inputs. Forwarding from MEM has higher priority than from WB. This allows back-to-back dependent instructions such as:
```
ADD x1, x2, x3
SUB x4, x1, x5
```
to execute without pipeline stalls.
#### Exercise 18 – Data Forwarding to ID Stage for Branch
This exercise adds early forwarding to the ID stage specifically for branch instructions. Since branch comparisons are performed in ID, operands may still be waiting in MEM or WB. By forwarding values to rs1_id and rs2_id, the branch condition can be evaluated earlier. This reduces the branch penalty from 2 cycles to 1 cycle, significantly improving performance for control-heavy programs.
#### Exercise 19 – Hazard Detection and Stall Control
This exercise implements the main hazard detection logic. The control unit detects:
- Load-use hazards (load in EX, dependent instruction in ID)
- Jump-register dependencies
- Load–jump dependencies across MEM stage
When such hazards are detected, the following control actions occur:
- pc_stall = 1 → freeze PC
- if_stall = 1 → freeze IF/ID
- id_flush = 1 → insert a bubble into ID/EX
This guarantees correct execution when forwarding alone is insufficient.
#### Exercise 20 – IF/ID Pipeline Register with Stall and Flush
This exercise completes the IF2ID pipeline register using PipelineRegister modules. Three registers were implemented:
- Instruction register (flush outputs NOP)
- Instruction address register (flush outputs entry PC)
- Interrupt flag register (flush clears interrupt)
Each register correctly responds to:
- Normal operation → pass-through
- Stall → hold value
- Flush → reset to default
This ensures correct interaction with hazard detection and branch control.
#### Exercise 21 – Hazard Detection Summary & Waveform Analysis
This final exercise focuses on conceptual understanding and verification:
- Key conclusions from this design : Load-use hazards must stall because load data is only available after MEM stage.
- Stall vs Flush : Stall freezes pipeline state.
- Flush discards wrong-path instructions.
- Jump with register dependency must stall because jump target computation depends on EX/MEM results.
Branch penalty is only 1 cycle because:
- Branch is resolved in ID stage
- Only IF stage is flushed
Without hazard detection, the pipeline would:
- Read stale values
- Jump to wrong addresses
- Fail both program tests and compliance
Using Verilator + GTKWave, I observed:
- Load-use stalls
- One-cycle branch flush behavior
- Correct jump dependency stall
- Proper forwarding signal activity
These waveforms confirm that the implemented hazard and forwarding logic works exactly as designed.
`make test`
After completing the exercises, make test runs all ScalaTest suites:
1. PipelineProgramTest
2. Three-stage pipeline
3. Five-stage pipeline with stalling only
4. Five-stage pipeline with forwarding
5. Five-stage pipeline with reduced branch delay (final version)
For each core configuration, the tests check:
1. Recursive Fibonacci – verifies arithmetic, control flow, and stack operations.
2. Quicksort – verifies memory loads/stores and deeper recursion.
3. Byte load/store – checks LB/LH/LW/LBU/LHU and SB/SH/SW.
4. Hazard programs – specifically designed to trigger data hazards, control hazards, and CSR/trap paths.
5. Machine-mode traps – ensures correct mepc, mcause, and mret behavior even in the pipelined design.
All tests passed for the final five-stage implementation, confirming that forwarding and hazard detection behave correctly.

`make compliance`
Running make compliance in 3-pipeline invokes RISCOF on the pipelined CPU : The test suite executes the official RV32I architectural compliance tests.
It checks:
1. ALU operations (add/sub/shift/logic/compare),
2. Loads and stores with different alignments and sizes,
3. Branches and jumps,
4. System and CSR instructions (including ecall, mret).
My pipelined design passed all compliance tests.
This confirms that the forwarding, stalling, and flushing logic does not break the RISC-V ISA semantics.

`make sim`
This Generates Verilog for the five-stage reduced-branch-delay CPU,executes the simulation and dumps a VCD file to:
```
3-pipeline/trace.vcd
```
I then opened trace.vcd in GTKWave and added key signals to analyze hazards.
I inserted at least the following wires:
**Load-Use / Jump Stall Observation**
Add these wires:
```
pc
io_instruction
io_output_instruction
```

Same pc value appears twice consecutively in waveform.
These are enough to see:
which instruction is in each stage
when a NOP/bubble gets injected (instruction becomes 0x00000013)
when PC stops (stall) or jumps to a new address.
**Branch / Jump Flush Observation**
Add these wires:
```
io_if_jump_flag
io_if_jump_address
if2id_io_output_instruction
```

Same pc value appears twice consecutively in waveform.
This stall is necessary because the required data is not yet available from the memory stage.
**Flush Behavior**
Add these wires:
```
pc
io_instruction
io_output_instruction
```

Because branch decision is resolved in the ID stage with early forwarding, only one bubble (NOP) is inserted instead of two cycles.
Only one NOP appears between valid instructions.
## From HW2 - Hanoi tower
**First, make and sbt test are all pass !!**


**Goal of This Exercise**
In this exercise, I run an **iterative Tower of Hanoi (n = 3)** program on the 5-stage pipelined CPU (`fivestage_stall`) and:
1. Identify load–use data hazards in the baseline RISC-V implementation.
2. Optimize the assembly to eliminate unnecessary load–use stalls by instruction reordering (without changing program semantics).
3. Compare the cycle counts before and after optimization and interpret the results from a pipeline point of view.
**assembly code(before opt)**
```
.data
move_fmt: .asciz "Move Disk %c from %c to %c\n"
cycle_fmt: .asciz "CSR cycles: %llu\n"
.text
.align 2
.global main
.globl _start
_start:
j main
.global tower_of_hanoi_iterative
tower_of_hanoi_iterative:
addi sp, sp, -40
sw ra, 36(sp)
sw s0, 32(sp)
sw s1, 28(sp)
sw s2, 24(sp)
sw s3, 20(sp)
sw s4, 16(sp)
li t0, 0
sw t0, 12(sp)
sw t0, 8(sp)
sw t0, 4(sp)
li s0, 1
sll s0, s0, a0
li s1, 1
.loop_start:
bge s1, s0, .loop_end
# g_curr = gray(step);
addi t0, s1, 0
srli t1, t0, 1
xor t2, t0, t1
# g_prev = gray(step - 1);
addi t3, s1, -1
srli t4, t3, 1
xor t5, t3, t4
# diff = g_curr ^ g_prev;
xor s3, t2, t5
# disk: while ((diff >>= 1)) disk++;
li s2, 0
addi t0, s3, 0
.disk_loop:
srli t0, t0, 1
beqz t0, .disk_loop_end
addi s2, s2, 1
j .disk_loop
.disk_loop_end:
slli t0, s2, 2
addi t0, t0, 12
add t0, t0, sp
lw t1, 0(t0)
addi t6, t1, 0
bnez s2, .else_disk_not_zero
# disk == 0
addi t2, t1, 1
li t5, 3
rem t2, t2, t5
addi t6, t2, 0
j .printf_call
.else_disk_not_zero:
li t2, 3
lw s4, 12(sp)
sub t2, t2, t1
sub t6, t2, s4
.printf_call:
la a0, move_fmt
li t3, 'A'
add a1, s2, t3
add a2, t6, t3
add a3, t6, t3
jal ra, printf
sw t6, 0(t0)
addi s1, s1, 1
j .loop_start
.loop_end:
lw ra, 36(sp)
lw s0, 32(sp)
lw s1, 28(sp)
lw s2, 24(sp)
lw s3, 20(sp)
lw s4, 16(sp)
addi sp, sp, 40
ret
main:
addi sp, sp, -8
sw ra, 4(sp)
rdcycle t0
addi s6, t0, 0
li a0, 3
jal ra, tower_of_hanoi_iterative
rdcycle t1
sub a1, t1, s6
- start
la a0, cycle_fmt
li a2, 0
jal ra, printf
li a0, 0
lw ra, 4(sp)
addi sp, sp, 8
ret
.globl printf
printf:
ret
```

Hazard explanation:
1. Hazard 1: lw t1, 0(t0) followed immediately by addi t6, t1, 0, In a 5-stage pipeline with no forwarding, the result of lw is only available at the end of MEM stage.
The very next instruction tries to use t1 in EX stage → creates a load–use hazard, and the hazard unit must insert a bubble (stall).
2. Hazard 2: lw s4, 12(sp) followed immediately by sub t6, t2, s4
Same pattern: lw followed by an instruction that needs s4 on the very next cycle → another load–use stall.
So for each loop iteration, we pay two stall cycles that do not perform useful work.

From the waveform, we can observe that the value of the program counter (PC) remains unchanged across consecutive clock cycles. Instead of increasing by 4 as in normal sequential execution, the PC repeats the same value (e.g., 0x00001014 → 0x00001014).
This behavior indicates that a pipeline stall has occurred.
**Optimized Hanoi – Removing Load–Use Hazards**
1. Reorder instructions so that a load is followed by at least one independent instruction that does not use the loaded register.
2. Use that “gap” to perform other useful work (branch decisions or arithmetic), so we are filling the bubble with real work instead of letting the hardware insert a stall.
**assembly code(after opt)**
```
.section .text
.globl main
.globl tower_of_hanoi_iterative
# ----------------------------------------------------------------------
# void tower_of_hanoi_iterative(int n);
# Uses Gray code to generate moves. pegs[disk] is stored on stack.
# ----------------------------------------------------------------------
tower_of_hanoi_iterative:
# Prologue: allocate stack frame
addi sp, sp, -40
sw ra, 36(sp)
sw s0, 32(sp)
sw s1, 28(sp)
sw s2, 24(sp)
sw s3, 20(sp)
sw s4, 16(sp)
# pegs[0..2] = 0
li t0, 0
sw t0, 12(sp)
sw t0, 8(sp)
sw t0, 4(sp)
# s0 = total_steps = 1 << n
li s0, 1
sll s0, s0, a0 # s0 = 2^n
# s1 = step = 1
li s1, 1
.loop_start:
# if (step >= total_steps) break;
bge s1, s0, .loop_end
# g_curr = gray(step) = step ^ (step >> 1)
addi t0, s1, 0
srli t1, t0, 1
xor t2, t0, t1
# g_prev = gray(step-1)
addi t3, s1, -1
srli t4, t3, 1
xor t5, t3, t4
# diff = g_curr ^ g_prev
xor s3, t2, t5
# disk: while ((diff >>= 1)) disk++;
li s2, 0
addi t0, s3, 0
.disk_loop:
srli t0, t0, 1
beqz t0, .disk_loop_end
addi s2, s2, 1
j .disk_loop
.disk_loop_end:
# t0 = &pegs[disk]
slli t0, s2, 2
addi t0, t0, 4
add t0, t0, sp
lw t1, 0(t0)
# *** Hazard 1 removed ***
# Instead of using t1 immediately, first decide on disk==0 branch.
bnez s2, .else_disk_not_zero # does not use t1 → no load-use hazard
# ---------- Case: disk == 0 ----------
addi t6, t1, 0
addi t2, t6, 1
li t5, 3
rem t2, t2, t5
addi t6, t2, 0
j .have_to
.else_disk_not_zero:
li t2, 3
lw s4, 4(sp)
sub t2, t2, t1
sub t6, t2, s4
.have_to:
sw t6, 0(t0)
# la a0, move_fmt
# li t7, 'A'
# add a1, s2, t7
# add a2, t1, t7
# add a3, t6, t7
# jal ra, printf
addi s1, s1, 1
j .loop_start
.loop_end:
# Epilogue
lw ra, 36(sp)
lw s0, 32(sp)
lw s1, 28(sp)
lw s2, 24(sp)
lw s3, 20(sp)
lw s4, 16(sp)
addi sp, sp, 40
ret
main:
addi sp, sp, -8
sw ra, 4(sp)
li a0, 3
jal ra, tower_of_hanoi_iterative
li a0, 0
lw ra, 4(sp)
addi sp, sp, 8
ret
```
In the baseline version of the loop, There are two load–use hazards in the hot path:
1. lw t1, 0(t0) → next instruction uses t1
2. lw s4, 12(sp) → next instruction uses s4
So each iteration of the main loop pays 2 extra stall cycles.
In the optimized version, We carefully reorder instructions so that:
1. After lw t1, 0(t0), the next instruction is bnez s2, ... (which only reads s2).
2. After lw s4, 4(sp), the next instruction is sub t2, t2, t1 (which does not use s4).
Therefore, the hazard detection logic no longer sees a load–use RAW dependency on the next cycle, and no load–use stalls are inserted.

In this waveform, the program counter (PC) increases normally by 4 every clock cycle (for example, from 0x00001014 to 0x00001018) without any repeated values. This confirms that no pipeline stall occurs in the optimized version of the program.
Here is the compare table:
| Metric | Original Code (w/ Hazard) | Optimized Code (No Hazard) | Difference |
| :--- | :---: | :---: | :---: |
| **Instruction Count (IC)** | 223 | 223 | 0 |
| **Data Hazards (Load-Use)** | 7|0|-7 |
| **Control Hazards (Branch/Jump)** | 40 | 40 | 0 |
| **Total Cycles** | 270 | 263 | **-7** |
Performance Improvement Summary
| Performance Metric | Value |
| :--- | :--- |
| **Cycles Saved** | **7 Cycles** |
| **Percentage Reduction** | **2.6%** |
| **CPI (Cycles Per Instruction)** | **1.21** $\to$ **1.18** |
| **Speedup** | **1.027x** |
(To accurately determine the cycle count and evaluate the optimization, this analysis assumes the use of a Standard 5-Stage RISC-V Pipeline architecture featuring Data Forwarding)