arch2025-homework3

# Homework 3 Chisel ## Introduction This report documents my implementation of a RISC-V RV32I CPU in Chisel. The project includes three main versions: - **1 single cycle**: Basic CPU that executes one instruction per cycle - **2 mmio trap**: Extended version with CSR registers and interrupt handling - **3 pipeline**: Pipelined CPU with forwarding and hazard detection for better performance --- ## Chisel Bootcamp Learnings ### What is Chisel? Chisel is a hardware description language embedded in Scala. ### Key Concepts Learned useful for the project #### 1. Wire, Reg and interface Understanding the difference between combinational and sequential logic: ```scala val sum = Wire(UInt(32.W)) sum := a + b val counter = RegInit(0.U(32.W)) counter := counter + 1.U // updates on clock // IO - Interface val io = IO(new Bundle { val input = Input(UInt(8.W)) val output = Output(UInt(8.W)) }) ``` #### 2. Chisel Values ```scala 5.U // Width automatic 5.U(8.W) // 8 bits, value 5 "b1010".U // Binary "hFF".U // Hex ``` #### 3. Mux ```scala val result = Mux(a > b, a, b) // Equivalent to: // if (a > b) // result = a // else // result = b ``` #### 4. Conditional Logic ```scala // when / elsewhen / otherwise when(cond) { result := v1 }.elsewhen(condition2) { result := v2 }.otherwise { result := dV } // MuxLookup: finds exact match val aluOp = MuxLookup(funct3, 0.U, Seq( "b000".U -> 0.U, // ADD "b001".U -> 1.U, // SLL "b010".U -> 2.U, // SLT "b100".U -> 4.U // XOR )) ``` #### 5. Bit Manipulation ```scala // Extract bits val byte0 = data(7, 0) // Bits 7 to 0 val signBit = data(31) // Single bit // Sign extension val signExtended = Cat(Fill(24, byte(7)), byte) ``` --- ## Project 1: Single-Cycle CPU ### Architecture Overview The single-cycle CPU executes one instruction completely in a single clock cycle. All five stages (IF, ID, EX, MEM, WB) operate combinationally. ### Exercises Summary | Exercise | File | Description | |----------|------|-------------| | 1 | InstructionDecode.scala | Extract S & B & J type immediates with sign extension | | 2 | InstructionDecode.scala | Generate control signals (WB source, ALU operands) | | 3 | ALUControl.scala | Map opcode/funct3/funct7 to ALU operations | | 4 | Execute.scala | Implement 6 branch conditions (BEQ/BNE/BLT/BGE/BLTU/BGEU) | | 5 | Execute.scala | Calculate jump targets for Branch/JAL/JALR | | 6-7 | MemoryAccess.scala | Load/Store with sign extension and byte strobes | --- #### Unit Tests | Test File | Test Name | What It Tests | Expected | |-----------|-----------|---------------|----------| | RegisterFileTest.scala | "read previously written register values" | Write x1=0xdeadbeef, read x1 | 0xdeadbeef | | RegisterFileTest.scala | "keep x0 hardwired to zero" | Write x0=0xdeadbeef, read x0 | 0 | | RegisterFileTest.scala | "support write-through" | Read during write cycle | 0xdeadbeef | | InstructionFetchTest.scala | "correctly update PC and handle jumps" | 100 random PC updates | PC+4 or jump target | | InstructionDecoderTest.scala | "decode RV32I instructions" | 9 instruction types | Correct control signals | | ExecuteTest.scala | "execute ALU operations and branch logic" | 100 random | Correct sum or branch flag | --- ## Project 2: MMIO-Trap CPU # ### Exercises Summary | Exercise | File | Description | |----------|------|-------------| | 8-9 | InstructionDecode.scala | Decode CSR instructions (CSRRW, CSRRS, CSRRC) | | 10 | CSR.scala | CSR register lookup (mstatus, mie, mtvec, mepc, mcause) | | 11 | CSR.scala | CLINT write priority over CPU writes for atomic trap entry | | 12 | InstructionFetch.scala | Interrupt priority in PC update logic | | 13-14 | CLINT.scala | Timer interrupt detection and trap entry | | 15 | WriteBack.scala | Add CSR read as write-back source | --- ### Nyancat VGA Display Demo #### Running the Nyancat Demo The Nyancat animation runs using Verilator with SDL2 when we use `make demo`. #### Simulation Output ``` [SDL2] Window opened: 640x480 'VGA Display - MyCPU' [SDL2] Press ESC or close window to stop simulation early Simulation progress: 1% Simulation progress: 2% ... Simulation progress: 100% ``` #### VGA Display Result The Nyancat animation renders on a 640×480 VGA display through SDL2: ![image](https://hackmd.io/_uploads/HyArjZjWbl.png) *Nyancat animation on VGA display (640×480@72Hz)* #### Compression Ideas The current `nyancat.asmbin` is 8,891 bytes. Here are some ways to make it smaller: **1.** Many consecutive pixels have the same color (especially the blue background). Instead of storing each pixel individually: ``` Before: [blue][blue][blue][blue][blue][red][red][red] After: [blue, 5][red, 3] ``` **2. Delta** Most pixels don't change between frames. Store only what changes: - First frame: full image - Next frames: only changed pixels This would work well since the cat animation and stars are small compared to the background. **3. Huffman Encoding** Huffman encoding would assign shorter codes to frequent colors (like the blue background). **4. Combined Approach** The best compression would probably combine everythings. ### Test Results All 9 tests pass successfully. --- ## Project 3: Pipeline CPU ### Exercises Summary | Exercise | File | Key Concepts | |----------|------|--------------| | 16 | ALU.scala | 10 ALU operations. Shifts masked to 5 bits; SRA needs signed conversion | | 17-18 | Forwarding.scala | Data forwarding to resolve hazards. MEM has priority over WB; never forward from x0 | | 19 | Control.scala | Hazard detection: stall for load-use and jump dependencies; flush for branches | | 20 | IF2ID.scala | Pipeline registers with stall (hold value), flush (NOP), and normal modes | ### Test Results **4 CPU variants tested:** ThreeStage, FiveStageStall, FiveStageForward, FiveStageFinal | Test | What It Tests | Expected | |------|---------------|----------| | Fibonacci | Recursion, JAL/JALR, stack | mem[4] = 55 | | Quicksort | Loops, comparisons, memory | mem[4..40] = sorted [0..9] | | SB/LB | Byte store/load | x5=0xdeadbeef, x6=0xef, x1=0x15ef | | Hazard | Basic hazards | x1=cfg.hazardX1, mem[4]=1, mem[8]=3 | | Hazard Extended | 10 hazard scenarios | See table below | **Hazard Extended Test (10 sections):** | Section | mem addr | Expected | Hazard Type | |---------|----------|----------|-------------| | 1 | 0x10 | 2 | WAW | | 2 | 0x14 | 0xAB | Store-Load Forwarding | | 3 | 0x18 | 0 | Multiple Consecutive Loads | | 4 | 0x1C | 10 | Branch Condition RAW | | 5 | 0x20 | (addr) | JAL Return Address | | 6 | 0x24 | 0x1888-0x1900 | CSR RAW | | 7 | 0x28 | 5 | Long Dependency Chain | | 8 | 0x2C | 7 | WB Stage Forwarding | | 9 | 0x30 | 0 | Load-to-Store Forwarding | | 10 | 0x34 | 20 | Multi-RAW Branch | ### Exercise 21: Hazard Detection Analysis #### Q1: Why do we need to stall for load-use hazards? Because the load instruction produces result in the MEM stage, but the other instruction needs the data in the EX. Forwarding cannot jump this 1 cycle, so we must stall. #### Q2: What is the difference between "stall" and "flush"? Stall freezes the pipeline stages, inserting a bubble. Flush clears the pipeline register to NOP, discarding the wrong path after a branch. #### Q3: Why does jump with register dependency need stall? JALR needs the register value to compute the jump target address. If the register value is not ready, we must stall until the value is available. #### Q4: Why is branch penalty only 1 cycle instead of 2? Branch resolution happens in ID stage instead of EX stage. With ID-forwarding, branch operands are available earlier, so only the instruction in IF needs to be flushed. #### Q5: What would happen without hazard detection? The processor would produce incorrect results (using stale register values) and control hazards (executing wrong-path instructions after branches). --- ## Homework 2 Integration: Tower of Hanoi ### Adaptation from Homework 2 The original `homework2.s` used Linux syscalls (`ecall`) for I/O. Since MyCPU doesn't have OS support, I modified it to write results directly to memory instead. **Key changes:** - Removed all `ecall` instructions - Results stored at specific memory addresses for verification - Added infinite loop at end (needed for bare-metal to prevent executing garbage) ### Test Verification | Check | Address | Expected | Description | |-------|---------|----------|-------------| | Total moves | mem[4] | 7 | 2³-1 moves for 3 disks | | Disk 0 final | mem[0x90] | 2 | Smallest disk 2| | Disk 1 final | mem[0x94] | 2 | Medium disk 2 | | Disk 2 final | mem[0x98] | 2 | Largest disk 2 | | Move 1: disk | mem[8] | 0 | First move is disk 0 | | Move 1: from | mem[12] | 0 | From 0 | | Move 1: to | mem[16] | 2 | To 2 | ### Results All 4 pipeline variants pass the test: **ThreeStage**, **FiveStageStall**, **FiveStageForward**, **FiveStageFinal** ## VGA Controller Analysis The VGA peripheral in `VGA.scala` drives a 640×480@72Hz display. ### How it works - CPU clock: handles MMIO writes (palette, framebuffer upload, control) - Pixel clock (31.5 MHz): generates sync signals, reads framebuffer, outputs pixels **Framebuffer**: 12 frames × 64×64 pixels, 4-bit palette index each. Stored as 8 pixels per 32-bit word. Display is 6× upscaled to 384×384, centered on screen. **Pixel pipeline** (2 cycles): 1. Calculate address from screen coordinates 2. Read framebuffer word, extract 4-bit pixel 3. Lookup 6-bit color in palette, output to VGA pins --- **MMIO registers** at 0x30000000: - CTRL: enable display, select frame (0-11) - STREAM_DATA: write 8 pixels at once - PALETTE[0-15]: 16 colors, 6-bit each (RRGGBB) **Animation**: CPU uploads frames, timer interrupt triggers frame change by writing to CTRL register. **SDL2 simulation**: C++ captures VGA signals, converts 6-bit color to 24-bit RGB, renders to window. --- ## Nyancat Compression Current size: 8,891 bytes (with delta encoding already applied , 91% reduction from raw). Delta encoding stores only pixels that changed between frames. Since background stays mostly static, this is very effective for animation. Further compression ideas: RLE for repeated colors, Huffman coding (frequent colors get shorter codes), tile-based rendering. ---