# My Own RISC-V CPU

## Single Cycle Design

### Tests summary

The single-cycle design of the RISC-V CPU (`1-single-cycle`) comes with five tests that examine the functionality of our Chisel implementation. When a program is executed, the compiled machine code is loaded into the instruction memory and then fetched into the CPU datapath. The datapath is composed of Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB), so the five tests match the five components of the datapath.

* First, `InstructionFetchTest.scala` verifies that the fetch logic handles the program counter (PC) correctly. The PC behavior covers the regular update (fetching the next instruction of the loaded program) and the jump-based update (basic control flow).
* Second, `InstructionDecoderTest.scala` checks the behavior of all RISC-V instruction types case by case. RISC-V instructions fall into the R, I, S, B, J, and U types. To examine the implemented logic thoroughly, `InstructionDecoderTest.scala` checks the operand sources, the immediates, the destination register, the memory enable signals, and the updated instruction address. The contents of memory and registers are left to other tests, which run more meaningful programs (e.g., Fibonacci or quicksort).
* Third, `ExecuteTest.scala` checks the logic in EX, that is, the arithmetic logic unit (ALU). It therefore covers addition, subtraction, logic operators, and the comparisons used by branch instructions.
* Fourth, `RegisterFileTest.scala` checks whether the WB logic works correctly. For example, if an R-type instruction updates a register, we expect the updated value to appear in that register in the next cycle. Unlike the pipeline design, the write-back happens in the same cycle as the execution.
* Last, `CPUTest.scala` runs three programs, `fibonacci.asmbin`, `quicksort.asmbin`, and `sb.asmbin`, to evaluate the whole CPU. Since checking the behavior of these programs amounts to observing the final memory contents, `CPUTest.scala` also effectively checks the memory-update logic of our CPU.

### Development

In my first try [0c0dc24](https://github.com/kyl6092/ca2025-mycpu/commit/0c0dc24133f0c7d419cb6d710fe226711dfc216a), I followed the single-cycle CPU datapath taught in the lessons and filled in the remaining parts of the code skeleton. I started with `InstructionDecode.scala` instead of `InstructionFetch.scala` because decoding was what I was most familiar with at the time (I had just reviewed it). Then, after tracing the code structure and understanding how the program counter (PC) propagates, I implemented `InstructionFetch.scala`, `WriteBack.scala`, `InstructionDecode.scala`, and `ALUControl.scala`, since these parts were now straightforward. However, I ran into trouble in `Execute.scala` and `MemoryAccess`: signed/unsigned comparison issues appeared in the branch instructions, and the 2-byte alignment for `jal`/`jalr` gave me a hard time. After modifying the `Makefile` and simulating with Verilator, I found that my PC-update code was wrong and that I had confused `1.U` with `1.U(32.W)`. The modified version is here: [f1ba3a8](https://github.com/kyl6092/ca2025-mycpu/commit/f1ba3a8895b93a8a0bb84ea8d4ac75ce0d11bbaf).
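To make the last pitfall concrete, here is a minimal Chisel sketch (not the project's code; `WidthPitfall` and its port are made up for illustration). A register's width is inferred from its reset value, so initializing it with `1.U` yields a 1-bit register, while `1.U(32.W)` gives the 32-bit register a PC needs.

```scala
import chisel3._

// Illustrative sketch only: `1.U` is a 1-bit literal, `1.U(32.W)` is 32 bits.
class WidthPitfall extends Module {
  val io = IO(new Bundle {
    val pc = Output(UInt(32.W))
  })
  // val pcNarrow = RegInit(1.U)   // width inferred from the literal: 1 bit only
  val pcWide = RegInit(1.U(32.W))  // explicit 32-bit width, as a PC requires
  pcWide := pcWide + 4.U           // a proper 32-bit sequential update
  io.pc := pcWide
}
```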
## MMIO & Trap Design

### Tests summary

A basic RISC-V CPU contains neither memory-mapped input/output (MMIO) nor a trap mechanism. In the `2-mmio-trap` project we extend our CPU to support both, so the corresponding tests are included. There are five tests covering MMIO and traps: `TimerTest.scala`, `UartMMIOTest.scala`, `CLINTCSRTest.scala`, `ExecuteTest.scala`, and `CPUTest.scala`. We use a timer and a UART as the MMIO devices.

* MMIO maps the control registers of I/O devices into the memory address space, which makes accessing the devices easier: ordinary load/store instructions modify the specific memory locations mapped to the control registers. `TimerTest.scala` and `UartMMIOTest.scala` directly poke the enable signal together with the address to modify the mapped register contents, and then verify the results.
* `CLINTCSRTest.scala` and `ExecuteTest.scala` similarly poke the interrupt flag to simulate an interrupt from an external device and check the expected values in the CSR registers, verifying the CSR-update behavior.
* `CPUTest.scala` is additionally extended with an `irqtrap.asmbin` program to verify correct jumping between the regular flow and the interrupt handler.

### Development

After succeeding with `1-single-cycle`, I copied some code snippets from `1-single-cycle` into the `2-mmio-trap` project. This time I carefully read the hints provided in the comments and successfully passed most of the tests. [1c6c26b](https://github.com/kyl6092/ca2025-mycpu/commit/1c6c26b46b149c506a52f24298b844153adc5635)

There was one mistake in transplanting my `1-single-cycle` code into `2-mmio-trap`, located in `InstructionFetch.scala`. When an interrupt takes control away from the regular flow, the instruction fetch address no longer follows the original rule: with the interrupt flag `interrupt_assert` set, the PC has to be switched to `interrupt_handler_address`. After modifying the multiplexer, all tests passed. [4a14a25](https://github.com/kyl6092/ca2025-mycpu/commit/4a14a2599b698b90d142c958729f5d3a538525dc)
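The change can be pictured as one extra multiplexer in front of the PC register. The sketch below is illustrative (the module name, ports, and the reset vector `0x1000` are assumptions, not the project's exact code); only `interrupt_assert` and `interrupt_handler_address` mirror the signal names mentioned above.

```scala
import chisel3._

// Illustrative sketch of the PC-selection mux described above.
class FetchPcSelect extends Module {
  val io = IO(new Bundle {
    val interrupt_assert          = Input(Bool())
    val interrupt_handler_address = Input(UInt(32.W))
    val jump_flag                 = Input(Bool())
    val jump_address              = Input(UInt(32.W))
    val pc                        = Output(UInt(32.W))
  })
  val pcReg = RegInit(0x1000.U(32.W))  // assumed reset vector for this sketch
  val normalNext = Mux(io.jump_flag, io.jump_address, pcReg + 4.U)
  // The fix: when an interrupt is asserted, fetch from the handler instead.
  pcReg := Mux(io.interrupt_assert, io.interrupt_handler_address, normalNext)
  io.pc := pcReg
}
```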
### Nyancat Compression

The key optimization for compressing the Nyancat program is how the stored frames are handled. By default, run-length encoding (RLE) is used so that the 4096 pixels of each frame compress into fewer than 4096 operation codes (opcodes). The encoder counts consecutive occurrences of a symbol and records the count together with the symbol as opcodes, which compresses long runs well. For example, 7 `,` symbols followed by 3 `.` symbols, i.e. `,,,,,,,...`, are encoded as `7,3.`, reducing 10 pixels to 4 opcodes.

After analyzing all pixel values that occur in the Nyancat animation, I found that there are only 13 distinct symbols. That is, each of them can be represented by a 4-bit value, in the spirit of dictionary encoding. I also observed that in `compress_frame_opcode_rle`, the way the color is recorded wastes 4 bits, since it stores each color pixel as a full byte. A grouping scheme can therefore save space: I pack two consecutive different colors into one location with simple bitwise operations (a standalone sketch of this packing appears after the modified code below).

Originally, the opcodes array stores values like

```python
opcodes: [0, 49, 37]
# 0 is the background color
# 49 and 37 are run-length counters (encoded in different ways)
```

After the next color is recorded, the opcodes array becomes

```python
opcodes: [0, 49, 37, 1, 32]
# 1 is the star color
# 32 is its run-length counter
```

My modification makes the opcodes store the colors like this:

```python
# Initially (a zero is inserted at index 0 as a placeholder)
opcodes: [0, 49, 37]
# When the next color arrives, the opcodes become
opcodes: [1, 49, 37, 32]
```

Here, I take the stored background color `current_color = 0` and compute `refined_opcodes = (current_color << 4 | color)` when the next, different color (`color = 1`) is handled. Since I record the index where the packed byte should be stored, the result `1 = 0 << 4 | 1` is written back at index 0, yielding the opcodes `[1, 49, 37, 32]`. The modified code snippets are listed below:

```python
def compress_frame_opcode_rle(pixels: List[str]) -> List[int]:
    """
    Compress frame using opcode-based RLE (baseline compression).
    Returns list of opcodes (integers 0-255).
    """
    if len(pixels) != 4096:
        print(f"Error: Frame must have 4096 pixels, got {len(pixels)}", file=sys.stderr)
        sys.exit(1)

    opcodes_refine = []     # Added: list for the refined opcodes
    color_buffer_idx = 0    # Added: index of the byte holding the packed color pair
    color_current_idx = -1  # Added: index of the last byte written to opcodes_refine
    opcodes = []
    i = 0
    current_color = -1
    encode_count = 0
    group = 0               # Added: toggles between first/second color of a pair

    while i < len(pixels):
        color = map_color_to_palette(pixels[i])

        # Set color if different from current
        if color != current_color:
            opcodes.append(0x00 | color)
            if group == 1:
                # if two color run lengths are determined,
                # store both colors together in one byte
                opcodes_refine[color_buffer_idx] = (0x00 | (current_color << 4 | color))
            else:
                opcodes_refine.append(0x00 | 0x00)            # store a placeholder initially
                color_buffer_idx = color_current_idx + 1      # record where to store the pair
                color_current_idx += 1                        # update pointer
            group = (~group) & 1
            current_color = color
```

```python
        # (continuation of the while loop above)
        # Count consecutive pixels of same color
        count = 1
        while i + count < len(pixels) and map_color_to_palette(pixels[i + count]) == color:
            count += 1

        # Encode run length with appropriate opcodes (may need multiple for long runs)
        remaining = count
        while remaining > 0:
            if remaining <= 16:
                opcodes.append(0x20 | (remaining - 1))
                opcodes_refine.append(0x20 | (remaining - 1))  # Added update statement
                remaining = 0
                color_current_idx += 1
            elif remaining <= 256:
                chunks = min(remaining // 16, 16)
                if chunks > 0:
                    opcodes.append(0x30 | (chunks - 1))
                    opcodes_refine.append(0x30 | (chunks - 1))  # Added update statement
                    remaining -= chunks * 16
                    color_current_idx += 1
            else:
                opcodes.append(0x30 | 0x0F)
                opcodes_refine.append(0x30 | 0x0F)  # Added update statement
                remaining -= 256
                color_current_idx += 1
        # ... (the rest of the loop body is unchanged and omitted here)
```

Compared to the original RLE encoding, my refined encoding combines dictionary encoding with this packed storage, resulting in a better compression ratio.
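For completeness, here is the packing arithmetic as a standalone sketch, written in plain Scala (the project's language) with made-up names. It only demonstrates the `(current_color << 4) | color` trick and how a decoder could split such a byte back into two 4-bit palette indices.

```scala
// Standalone sketch of the nibble packing: two 4-bit palette indices per byte.
object NibblePack {
  def pack(first: Int, second: Int): Int = (first << 4) | second
  def unpack(byte: Int): (Int, Int) = ((byte >> 4) & 0xF, byte & 0xF)

  def main(args: Array[String]): Unit = {
    val packed = pack(0, 1)            // background (0) followed by star (1)
    println(f"packed = 0x$packed%02X") // 0x01, matching the walkthrough above
    println(unpack(packed))            // (0,1): both colors recovered
  }
}
```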
The per-frame results of the original RLE and the refined method are:

```bash
Original RLE
Frame  0: 4096 pixels → 576 opcodes (86% reduction)
Frame  1: 4096 pixels → 545 opcodes (87% reduction)
Frame  2: 4096 pixels → 570 opcodes (87% reduction)
Frame  3: 4096 pixels → 567 opcodes (87% reduction)
Frame  4: 4096 pixels → 541 opcodes (87% reduction)
Frame  5: 4096 pixels → 573 opcodes (87% reduction)
Frame  6: 4096 pixels → 584 opcodes (86% reduction)
Frame  7: 4096 pixels → 551 opcodes (87% reduction)
Frame  8: 4096 pixels → 551 opcodes (87% reduction)
Frame  9: 4096 pixels → 585 opcodes (86% reduction)
Frame 10: 4096 pixels → 547 opcodes (87% reduction)
Frame 11: 4096 pixels → 525 opcodes (88% reduction)

Refined Method
Frame  0: 4096 pixels → 444 opcodes (90% reduction)
Frame  1: 4096 pixels → 419 opcodes (90% reduction)
Frame  2: 4096 pixels → 438 opcodes (90% reduction)
Frame  3: 4096 pixels → 436 opcodes (90% reduction)
Frame  4: 4096 pixels → 417 opcodes (90% reduction)
Frame  5: 4096 pixels → 442 opcodes (90% reduction)
Frame  6: 4096 pixels → 449 opcodes (90% reduction)
Frame  7: 4096 pixels → 423 opcodes (90% reduction)
Frame  8: 4096 pixels → 424 opcodes (90% reduction)
Frame  9: 4096 pixels → 450 opcodes (90% reduction)
Frame 10: 4096 pixels → 422 opcodes (90% reduction)
Frame 11: 4096 pixels → 404 opcodes (91% reduction)
```

## Pipeline Design

### Tests summary

The pipeline design improves the performance of the single-cycle CPU. The key concept is the pipeline registers, which buffer the in-flight instructions and results so that each stage can start working on the next instruction. However, data dependences occur in every program and need to be addressed: if a buffered instruction has not completed, an instruction that depends on it would read a meaningless value. Handling this with NOP instructions alone causes a large overhead, so in `3-pipeline` we implement a forwarding mechanism and the flush/stall rules.

* `PipelineRegisterTest.scala` is the first thing to consider. Because the biggest difference between `3-pipeline` and `1-single-cycle` is the presence of the pipeline registers (built from flip-flops, FFs), we care about their instantiation and updating. This test verifies that the register can be updated and flushed; the updating behavior is further covered by other tests in a more meaningful way (correct program results).
* In addition to the original programs `fibonacci.asmbin`, `quicksort.asmbin`, and `sb.asmbin`, `PipelineProgramTest.scala` executes a dedicated program that contains the different kinds of data hazards.

### Development

The pipeline design in the mycpu project is a 5-stage pipeline. That is, four pipeline registers, IF/ID, ID/EX, EX/MEM, and MEM/WB, buffer the instructions and intermediate results. These pipeline registers are also used to detect data hazards and perform forwarding. My concrete work in `3-pipeline` was to work out when the circuit should forward previous results to the instruction currently being issued; in other words, my main task was to handle the data hazards and control hazards. The instructor's hints were very useful for writing down the conditions, and I also referred to [Computer Organization and Design RISC-V Edition](https://www.cs.sfu.ca/~ashriram/Courses/CS295/assets/books/HandP_RISCV.pdf) to enrich my understanding. A more detailed analysis follows in ***Hazard Detection Summary***; the sketch below gives a first taste of the forwarding condition.
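The sketch is a minimal Chisel version of the textbook forwarding check; all names are illustrative rather than the project's. If the instruction in EX/MEM or MEM/WB writes the register that the instruction now in EX reads, the operand is taken from the corresponding pipeline register instead of the (stale) register file.

```scala
import chisel3._

// Illustrative forwarding-unit condition for one EX source operand.
class ForwardSelect extends Module {
  val io = IO(new Bundle {
    val ex_rs1          = Input(UInt(5.W))   // source register read in EX
    val exmem_rd        = Input(UInt(5.W))   // destination of the instruction in MEM
    val exmem_reg_write = Input(Bool())
    val memwb_rd        = Input(UInt(5.W))   // destination of the instruction in WB
    val memwb_reg_write = Input(Bool())
    val forward_a       = Output(UInt(2.W))  // 0: register file, 1: from EX/MEM, 2: from MEM/WB
  })
  io.forward_a := 0.U
  when(io.memwb_reg_write && io.memwb_rd =/= 0.U && io.memwb_rd === io.ex_rs1) {
    io.forward_a := 2.U
  }
  // The EX/MEM check comes last so the youngest result wins on a double match.
  when(io.exmem_reg_write && io.exmem_rd =/= 0.U && io.exmem_rd === io.ex_rs1) {
    io.forward_a := 1.U
  }
}
```

The `=/= 0.U` guards keep `x0` from ever being forwarded, and checking the EX/MEM match last gives it priority when both older instructions write the same register.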
### Hazard Detection Summary

#### Data hazard

Several instruction combinations cause data hazards in a pipelined CPU. We conventionally call reading a register "read" (R, or use) and updating a register "write" (W); loading a value from memory into a register is abbreviated as load (L). Common combinations include read-after-write (RAW), RAR, WAR, and WAW. Since a program often loads values from memory into registers before operating on them, load-use occurs frequently as well.

Some combinations are not critical because the registers involved have no real data dependence. RAR, WAR, and WAW are harmless here: a register can be read several times, a register can be overwritten at any time, and repeated writes merely indicate some redundant code or a device that requires multiple writes. RAW, however, makes the read uncertain, because the value read may be out of date. Similarly, load-use means a register is being updated from memory while an access to it follows almost immediately. I will therefore concentrate on these two kinds of combinations.

Read-after-write (RAW) does not only mean a write followed by a read on the very next line; any read that follows a write to the same register can be seen as RAW. For example,

```riscv
add  t0, x0, a0
addi t1, t0, -1        # RAW!
beq  t0, x0, to_zero   # RAW!
add  t2, a2, t0        # RAW!
```

Similarly, load-use flags any access to a register right after an instruction loads it from memory:

```riscv
lw   t0, 0(a0)
addi t1, t0, -1        # load-use!
beq  t0, x0, to_zero   # load-use!
add  t2, a2, t0        # load-use!
```

The way to avoid reading such out-of-date values is to modify the logic circuits in the CPU, either stalling the instruction and the program counter or bypassing the just-computed or just-loaded data to where it is needed. We do not want to modify the written or compiled program, so instructions are still fetched in sequence. The forwarding technique routes the data at the end of the execute stage (EX) back to the beginning of EX, so the next instruction gets the up-to-date value.

![image](https://hackmd.io/_uploads/rJGmVaNTzbe.png)

However, a loaded value only becomes available in the MEM stage, so a stall is still required for load-use.

![image](https://hackmd.io/_uploads/rJkLCNTGZe.png)

#### Control hazard

A pipelined CPU buffers instructions and intermediate results so that it can fetch the next instructions as early as possible. For control flow, however, these eagerly fetched instructions may be invalid, because the correct following instructions live somewhere else. The target address and the jump decision are determined in the EX stage. Once the CPU decides to jump to another address, the instructions fetched in the meantime are invalid and must be eliminated, and the buffered results in the pipeline registers need to be flushed as well. Unlike a stall, where the PC is held, here the PC is updated according to the instruction, and the flushed contents are replaced by no-operations (NOPs); a register-level sketch of this appears at the end of this subsection. Take this code snippet for example:

```riscv
add t0, x0, a0
beq t0, x0, to_zero  # RAW!
add t2, a2, t0       # RAW!
.
.
.
to_zero:
add t2, a3, t0
```

![image](https://hackmd.io/_uploads/ryo5IHTM-x.png)

Besides, since the immediate offset is calculated in the ID stage, jump instructions that also carry a data hazard need an extra stall even when forwarding is used. In this design I implemented forwarding from the EX stage to the ID stage, so only one extra stall is sufficient (with only EX-to-EX forwarding, two stalls would be needed).

![image](https://hackmd.io/_uploads/BkX4BBTMbe.png)
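To make the hold-on-stall and flush-to-NOP behavior concrete, here is a minimal Chisel sketch of an IF/ID-style pipeline register. The names are illustrative and the project's actual `PipelineRegister` module may look different; the point is only the priority of flush over stall and the NOP bubble.

```scala
import chisel3._

// Illustrative IF/ID pipeline register with stall and flush control.
class IfIdStage extends Module {
  val io = IO(new Bundle {
    val stall           = Input(Bool())
    val flush           = Input(Bool())
    val instruction_in  = Input(UInt(32.W))
    val instruction_out = Output(UInt(32.W))
  })
  val Nop = 0x00000013.U(32.W)       // addi x0, x0, 0
  val buffered = RegInit(Nop)
  when(io.flush) {
    buffered := Nop                  // squash the wrongly fetched instruction
  }.elsewhen(!io.stall) {
    buffered := io.instruction_in    // normal advance
  }                                  // on stall: hold the current value
  io.instruction_out := buffered
}
```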
In summary, a pipelined CPU buffers instructions and intermediate results. Instruction combinations such as RAW and load-use make the CPU read out-of-date values and produce incorrect data, and because instructions are fetched as early as possible, wrong addresses may be fetched once a jump is taken. With hazard detection, we can insert a large number of stalls to keep these issues from affecting the final result; with additional forwarding logic, far fewer stalls are needed to get the correct answer, improving overall performance.

### Handwritten RISC-V Assembly Code Deployment

#### Issues

The biggest issue I encountered in this assignment is stack pointer maintenance. My code does a lot of procedure calls, so the `sp` register is updated in several places. However, the simulation in the `my-cpu` project does not allow accessing addresses such as `0xffffff1c, -0xfffffffc, etc.`, which broke my program. I spent a lot of time debugging this and finally adjusted my code from assignment two [25d7afa](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/25d7afa6176ce7230c13376e6880f5db7777c34b). The successful version is here: [9c63fab](https://github.com/kyl6092/ca2025-mycpu/commit/9c63fabe3edce694702d7db2b7b6afeea40f6c38). Below I extract part of the problematic snippets to illustrate the modification.

```riscv
.globl _start
_start:
    # a0 a (return)
    # a1 b
    # a2 c
    addi sp, sp, -12
    li x28, 0x3f8d
    li x29, 0x3f9a
    li x30, 0x3fa6
    sw x28, 0(sp)
    sw x29, 4(sp)
    sw x30, 8(sp)
    addi a3, x0, 25
    addi a4, x0, 7
    addi a5, x0, 15
    addi a6, x0, 15
    add a0, x0, x28
    add a1, x0, x29
    jal ra, my_add   # a+b
    lw a1, 8(sp)     # problem here
```

![Screenshot from 2025-12-14 17-49-18](https://hackmd.io/_uploads/r1sKOWnGZg.png)

The problem is that when I want to load `0x3fa6` into `a1`, the log says the address cannot be accessed. I checked the waveform and confirmed that the load indeed failed.

```riscv
.globl _start
_start:
    # a0 a (return)
    # a1 b
    # a2 c
    addi sp, sp, 12
    li x28, 0x3f8d
    li x29, 0x3f9a
    li x30, 0x3fa6
    sw x28, 0(sp)
    sw x29, -4(sp)
    sw x30, -8(sp)
    addi a3, x0, 25
    addi a4, x0, 7
    addi a5, x0, 15
    addi a6, x0, 15
    add a0, x0, x28
    add a1, x0, x29
    jal ra, my_add   # a+b
    lw a1, -8(sp)
```

![image](https://hackmd.io/_uploads/rkeFtW3fbg.png)

After modifying the addresses, I can see `0x3fa6` appear at the position indicated by the marker. In the process, I also found that my `my_mul` had a critical issue: an undefined store (line 30 of the first listing below). In the last assignment, `my_mul` was called as a function in my rv32emu simulation, and I always passed something in `a7`, so the store address was valid. However, `my-cpu` sees `a7` as `0x00000000`, so the `sw t0, 0(a7)` is very dangerous for my program. After debugging with the waveform, I modified the code snippet.
```riscv=
# === my_mul ===
my_mul:
    # a0 out1 (in1)
    # a1 in2
    # a7 (out2)
    add x29, x0, a0
    beq a0, zero, my_mul_ret
    addi a0, x0, 0
    addi t0, x0, 0
    addi x19, x0, 0
    addi x18, x0, 0
    beq a1, zero, my_mul_ret
my_mul_loop:
    slli x19, x19, 1
    add x19, x19, x18
    andi x28, a1, 1
    beq x28, zero, my_mul_loop_1
    add x28, a0, x29
    sltu x18, x28, a0
    add t0, t0, x18
    add t0, t0, x19
    add a0, x0, x28
my_mul_loop_1:
    slli x28, x29, 1
    sltu x18, x28, x29
    add x29, x0, x28
    srli a1, a1, 1
    bne a1, zero, my_mul_loop
my_mul_ret:
    sw t0, 0(a7)
    ret
```

The fixed version below drops the store through `a7` and uses `t1` instead of `t0`:

```riscv
# === my_mul ===
my_mul:
    # a0 out1 (in1)
    # a1 in2
    # a7 (out2)
    add x29, x0, a0
    beq a0, zero, my_mul_ret
    addi a0, x0, 0
    addi t1, x0, 0
    addi x19, x0, 0
    addi x18, x0, 0
    beq a1, zero, my_mul_ret
my_mul_loop:
    slli x19, x19, 1
    add x19, x19, x18
    andi x28, a1, 1
    beq x28, zero, my_mul_loop_1
    add x28, a0, x29
    sltu x18, x28, a0
    add t1, t1, x18
    add t1, t1, x19
    add a0, x0, x28
my_mul_loop_1:
    slli x28, x29, 1
    sltu x18, x28, x29
    add x29, x0, x28
    srli a1, a1, 1
    bne a1, zero, my_mul_loop
my_mul_ret:
    ret
```

## Appendix

### RISCOF Tests Results (Just for recording)

![image](https://hackmd.io/_uploads/Bydahi8Gbx.png)
![image](https://hackmd.io/_uploads/ByMu2s8GZg.png)
![image](https://hackmd.io/_uploads/Bywc2iLzZg.png)

### NyanCat Animation (Just for recording)

![image](https://hackmd.io/_uploads/ryqeA2LG-g.png)