# Assignment3: single-cycle RISC-V CPU
contributed by < [HenryChaing](https://github.com/HenryChaing/ca2023-lab3) >

## Single Cycle Datapath setup
* This time, we are going to learn how to design a single-cycle RISC-V processor using Chisel, a hardware construction language embedded in `Scala`. Throughout the process, we need to complete the code for each stage. It is crucial to understand how each `Module` specifies its inputs and outputs, and then to implement them in Chisel. Commonly used constructs include `when`, `Mux`, and so on. We also need to refer to the example diagrams to wire the signals between stages.
* Next is `Run Assembly Code` on the processor. We place the previously written homework onto the CPU simulated with Chisel and check the values at specific memory addresses against the expected results. This verifies whether the CPU executes the various instructions properly. Finally, we use `GtkWave` to observe the signals within the CPU.

:::danger
:warning: Read [the requirements](https://hackmd.io/@sysprog/2023-arch-homework3) carefully!
:::

### Instruction Fetch
* InstructionFetch.scala
  In the incomplete instruction fetch stage, the most obvious gap is that the PC value is never updated. As the instructor describes in Lab3, jump and branch instructions update the PC differently from the normal PC+4 update. I therefore used a multiplexer-like approach, as shown in the diagram, so that the PC is updated to either `PC+4` or the `jump address`.

![SCD_instruction_fetch](https://hackmd.io/_uploads/rkBfzZQHa.png)

:::info
[Hint from Lab3] When an instruction is valid, the current instruction pointed to by the PC is fetched. If a jump is required, the PC is directed to the jump address; otherwise, it is incremented to PC + 4.
:::

* Instruction Fetch Process Overview:
  This module updates two things: the PC value and the fetched instruction. It first checks whether the instruction is valid. If it is invalid, the PC value remains unchanged and the instruction is replaced with a "nop". If it is valid, the module then checks whether a jump is required. If so, the PC is updated to the jump address; otherwise, it is updated to PC+4.

### Instruction Decode
* InstructionDecode.scala
  The decoder's inputs and outputs show which signals are still unhandled, and these are the parts to complete in the Instruction Decode stage. The unhandled signals are `Mem_Read_Enable` and `Mem_Write_Enable`. The implementation checks whether the instruction is a Load/Store instruction. If so, the corresponding enable line is set to true.

![SCD_instruction_decode](https://hackmd.io/_uploads/ByRFE-Xr6.png)

:::info
I used `Mux()` to solve this: the select condition is simply the Boolean result of comparing the opcode against instruction type L.
:::

* Instruction Decode Process Overview:
  The decoder produces two kinds of outputs, both determined by the instruction: control signal lines and data lines. For instance, `ALU_Op` decides the source for the ALU input, while `MEM_RE/WE` determines whether to read from or write to memory. As for `RegWS`, `Reg1/2RA`, and `immediate`, they specify the register write-back, the source registers `rs1` and `rs2`, and the immediate value, so that the following stages can read the right data and execute properly.

### Execution
* Execute.scala
  This stage is more complex than the previous two. The parts to implement include the ALU and ALUControl, as explained in Lab3, plus our additional ALUSrcJudge. Initially, the ALU unit has none of its inputs or outputs connected. Therefore, we first connect the result of ALUControl, `alu_ctrl.io.alu_funct`, to the ALU unit.
  Next, we specify the ALU operands. While the source of `alu.io.op1` is the same regardless of the instruction, the source of `alu.io.op2` may be either `immediate` or `register_2`. Hence, we implement the ALUSrcJudge component to select the operand source. My implementation uses a multiplexer whose select signal is `io.aluop2_source`, sent by the Decoder. With this, the ALU inputs are set up, and `io.mem_alu_result` carries the correct result.

![SCD_execution](https://hackmd.io/_uploads/ByA3B4XHT.png)

* Execute Process Overview:
  We start with ALU control. ALU control determines the operation the ALU performs based on the instruction, such as addition for `add`, subtraction for `sub`, and logical left shift for `sll`. ALUSrcJudge, on the other hand, decides the ALU inputs based on `ALU_Op_Source`, choosing between `rs2` and `immediate`. Finally, `JumpJudge` determines which jump instruction is being executed and drives the output control signal that decides whether to branch.

### MyCPU
* CPU.scala
  A considerable amount of time went into verifying the connections between the modules. In the end, I only checked the inputs of each module and found that the missing connections were all inputs to the Execute module. I then connected these inputs to the outputs of the other modules to complete this stage.

![SCD](https://hackmd.io/_uploads/ryeWmBXH6.png)

:::info
For this implementation, you also need to understand `CPUBundle`, `RAMBundle`, and the `Flipped` function.
:::

* CPU Process Overview:
  This is a schematic diagram of the single-cycle datapath. The `PC value` is first sent to the `instruction memory`. The instruction fetched from there is then sent to the `Decoder`, which generates the corresponding control signal lines and data lines.
  At the same time, the corresponding register values are read from the `Register File`. Before the `ALU` operates, `ALUControl` and `SrcJudge` determine the ALU operation and inputs. The `ALU` then generates its output signals. If a `Memory Access` is required, the `Memory Control` decides whether to read or write based on `RE` and `WE`. The multiplexer in the `WriteBack` stage determines whether the ALU result or the memory read result is written back to the `Register File`. These are the actions the CPU performs in one cycle.

## Run Hand Written Assembly On MyCPU
### introduction
Because the input/output rules for HW2 were somewhat unclear, I chose someone else's HW1 as the test subject, [<SUE3k>](https://hackmd.io/@-plyrukoRemmLd0FT8Qy3A/SkbtANuJ6/edit). The topic of that post is 'Find the position of MSB': given a hexadecimal number, it returns the position of the Most Significant Bit (MSB) of that number (counting from 0).

* implementation
  For the main program, I retained only two functions, `clz` and `main`, and, referring to `Fibonacci.c`, stored the final result at memory address `105`. As for `CPUTest.scala`, I followed the `FibonacciTest` class. The test checks whether memory address `105` holds MSB (0x00010011) = `16`. The problems I initially ran into are described in the "problem faced" section.
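As a cross-check on the expected value, here is a minimal host-side sketch in plain C (it is not the program that runs on the CPU). It finds the MSB position with a simple shift loop. The name `msb_position_ref` is hypothetical and not part of the original homework, and the helper assumes a non-zero 32-bit input.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference helper (not from the original homework):
 * returns the 0-based position of the most significant set bit.
 * Assumption: x is non-zero. */
unsigned msb_position_ref(uint32_t x)
{
    assert(x != 0);
    unsigned pos = 0;
    /* Each right shift that leaves a non-zero value means the top
     * set bit was at least one position higher. */
    while ((x >>= 1) != 0)
        pos++;
    return pos;
}
```

For the three inputs in `test_data` (`0x00000011`, `0x00001101`, `0x00010011`) this gives 4, 12, and 16. Since `main` stores every result to the same address, the value left at `105` is the last one, 16.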
:::spoiler CPUTest_scala code
```scala=66
class MSBTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "calculate MSB(0x00010011)" in {
    test(new TestTopModule("main.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 50) {
        c.clock.step(1000)
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      // Read back the result that main() stored at address 105
      c.io.mem_debug_read_address.poke(105.U)
      c.clock.step()
      c.io.mem_debug_read_data.expect(16.U)
    }
  }
}
```
:::

:::spoiler main_c code
```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

uint16_t count_leading_zeros(uint64_t x)
{
    /* Smear the highest set bit into every lower position */
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);

    /* Count the set bits (population count) */
    x -= ((x >> 1) & 0x55555555);
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);

    return (32 - (x & 0x7f));
}

int main()
{
    uint32_t test_data[] = {0x00000011, 0x00001101, 0x00010011};
    for (int i = 0; i < sizeof(test_data) / sizeof(test_data[0]); i++) {
        uint32_t clz = count_leading_zeros(test_data[i]);
        if (clz < 32) {
            uint32_t msb = (uint32_t)(31 - clz);
            /* Store the result where CPUTest.scala reads it back */
            *((volatile int*)(105)) = msb;
        }
    }
    return 0;
}
```
:::

:::info
Use a high clock step count. If the step count is too low, the result may not have been computed yet when the test reads it.
:::

### problem faced (c code)
> [info] <span style="color:green">MSBTest:</span>
> [info] <span style="color:green">Single Cycle CPU</span>
> [info] <span style="color:red">- should calculate MSB(0x00010011) *** FAILED ***</span>
> [info] <span style="color:red">  io_mem_debug_read_data=15461 (0x3c65) did not equal expected=16 (0x10) (lines in CPUTest.scala: 77, 69) (CPUTest.scala:77)</span>

```shell
~/ca2023-lab3$ make && make update
```

### problem fix
I created a project under `rv32emu/test/` for the last assignment and used the `Makefile` from the previous assignment to generate the `ELF file`. Finally, I used `riscv-none-elf-objcopy` to convert it to `main.asmbin`.
However, even though the file was present under `./resources`, it did not run correctly during CPUTest. Later, I discovered that the original Makefile did not use the Lab3 linker script, which resulted in different contents in the ELF file. Alternatively, I might have forgotten to run `make update` after `make`, so the files under `./resources` were not updated.

```makefile
# Makefile (part of the content)
%.elf: %.c init.o
	$(CC) $(CFLAGS) -c -o $(@:.elf=.o) $<
	$(CROSS_COMPILE)ld -o $@ -T link.lds $(LDFLAGS) $(@:.elf=.o) init.o
```

## Waveform
### Analyze hello.elf wave
#### wave in Instruction Fetch stage
![6](https://hackmd.io/_uploads/Syw0CQ8Bp.jpg)

Here, `jump_flag_id` is 1 and the current PC value is still `1004`. Because this is a jump instruction, the next PC value should be the value of `jump_address_id`, which currently reads `15D8`. Accordingly, the PC value in the next cycle is updated to `15D8`.

#### wave in Instruction Decode stage
![4](https://hackmd.io/_uploads/SJttSJwST.jpg)

Here, you can see the decode unit at work. The instruction being read is `addi x8, x2, 16`, and the extracted `opcode` is `0x13`. Since opcode `0x13` corresponds to an `I-type` instruction that writes back to a register, `io_reg_write_enable` is 1, and the result is written back to register x8.

#### wave in Execute stage
![3](https://hackmd.io/_uploads/SynsC1wB6.jpg)

In this section, we look at the data source of `alu_op2`, which can be either the data read from the register file or the immediate extracted from the instruction. The choice is made from information provided by the decode unit, specifically `alu_op2_source`, which selects `register2` or `immediate` as the source for this operation.
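The operand selection described above can be modelled in a few lines of C. This is only a behavioural sketch: the function name `select_alu_op2` and the enum encoding (0 for register, 1 for immediate) are assumptions made for illustration, and the values used in the lab's Chisel code may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding of the select signal, for illustration only. */
typedef enum {
    OP2_FROM_REGISTER  = 0,
    OP2_FROM_IMMEDIATE = 1
} alu_op2_source_t;

/* Behavioural model of a two-input multiplexer on the ALU's second operand. */
uint32_t select_alu_op2(alu_op2_source_t source,
                        uint32_t reg2_data,
                        uint32_t immediate)
{
    assert(source == OP2_FROM_REGISTER || source == OP2_FROM_IMMEDIATE);
    return (source == OP2_FROM_IMMEDIATE) ? immediate : reg2_data;
}
```

For an instruction such as `addi x8, x2, 16`, the select signal picks the immediate, so the second operand is 16 regardless of what the register read port returns.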
#### wave in Memory stage
![2_1](https://hackmd.io/_uploads/HkRiYEUB6.jpg)
![2_2](https://hackmd.io/_uploads/S1mYTNUrp.jpg)

Here, we look at memory reads and writes by observing the `read_enable` and `write_enable` signals sent from the decode unit. In the top diagram, `read_enable` is 1, so `wb_memory_read_data` shows the data read out of memory. In the bottom diagram, `write_enable` is 1 for this cycle, so `memory_bundle_write_data` carries the data to be written to memory.

#### wave in WriteBack stage
![5](https://hackmd.io/_uploads/HJgsB-wBT.jpg)

The WriteBack stage has only one multiplexer, and the source of the value written back to the register is selected by `io_regs_write_source`. In this example, `write_source` is `00`, so the value to be written comes from `alu_result`. Here the result is `1000`, so the value written back, `regs_write_data` for this cycle, is also `1000`.

### Analyze main.elf wave
#### riscv-none-elf-objdump
```
00001000 <_start>:
    1000:   00001137    lui   sp,0x1
    1004:   5a8000ef    jal   15ac <main>
    ...
000015ac <main>:
    15ac:   fd010113    add   sp,sp,-48
    15b0:   02112623    sw    ra,44(sp)
    15b4:   02812423    sw    s0,40(sp)
    15b8:   03212223    sw    s2,36(sp)
    ...
```

The listing above was produced with the GNU toolchain from Lab2. The command is `riscv-none-elf-objdump`, which shows the instructions in the ELF file. I selected two instructions that are easy to identify for analysis: the `jal` and `sw` instructions at addresses `1004` and `15b0`, respectively.

#### simulation of wave and instruction
**jal instruction**
![9](https://hackmd.io/_uploads/H19aPBwHa.jpg)

This shows how the `jal` instruction appears on the various signal lines of the CPU.
In the EX stage, the CPU determines whether the instruction is a jump or branch and sets the signal line `io_jump_flag_id` to 1. It then updates the PC value to `io_jump_address_id`. You can observe that the PC value for the next cycle has been updated to `15AC`, completing the `jal` operation.

**sw instruction**
![8](https://hackmd.io/_uploads/H1TFurDHT.jpg)

Now we examine the `sw` instruction. Since `sw` writes to memory, `io_memory_write_enable` is set to 1. In `io_memory_bundle_write_data`, you can observe the value being written to memory, which is the value of the `rs2` register, here `ra` (0x1008), the return address saved by the `jal` at `1004`. This completes the write to memory.
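The two encodings from the objdump listing can also be decoded by hand to confirm what the waveform shows. The sketch below follows the RV32I J-type and S-type immediate layouts. The helper names (`jal_offset`, `store_offset`, `rs1_field`, `rs2_field`) are hypothetical and are not taken from the lab code.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of v (v must already be masked). */
static int32_t sign_extend(uint32_t v, unsigned bits)
{
    uint32_t m = 1u << (bits - 1);
    return (int32_t)(v ^ m) - (int32_t)m;
}

/* J-type offset: imm[20|10:1|11|19:12] = inst[31|30:21|20|19:12] */
int32_t jal_offset(uint32_t inst)
{
    assert((inst & 0x7f) == 0x6f);            /* opcode must be JAL */
    uint32_t imm = (((inst >> 31) & 0x1)   << 20) |
                   (((inst >> 12) & 0xff)  << 12) |
                   (((inst >> 20) & 0x1)   << 11) |
                   (((inst >> 21) & 0x3ff) << 1);
    return sign_extend(imm, 21);
}

/* S-type offset: imm[11:5|4:0] = inst[31:25|11:7] */
int32_t store_offset(uint32_t inst)
{
    assert((inst & 0x7f) == 0x23);            /* opcode must be STORE */
    uint32_t imm = (((inst >> 25) & 0x7f) << 5) | ((inst >> 7) & 0x1f);
    return sign_extend(imm, 12);
}

/* Register-number fields shared by the R/S/B formats */
unsigned rs1_field(uint32_t inst) { return (inst >> 15) & 0x1f; }
unsigned rs2_field(uint32_t inst) { return (inst >> 20) & 0x1f; }
```

With the encodings from the listing, `0x1004 + jal_offset(0x5a8000ef)` gives `0x15ac`, and `0x02112623` decodes to an offset of 44 with `rs1 = x2` (`sp`) and `rs2 = x1` (`ra`), matching `sw ra,44(sp)`.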