# Assignment 3: Single-Cycle RISC-V CPU

contributed by < [fewletter](https://github.com/fewletter/ca2023-lab3) >

## Environment setup

### Operating System

I use Ubuntu Linux 20.04.1 as my operating system.

```shell
$ uname -a
Linux fewletter 5.15.0-89-generic #99~20.04.1-Ubuntu SMP Thu Nov 2 15:16:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```

### Install sbt

Following the commands in [lab3](https://hackmd.io/@sysprog/r1mlr3I7p#Install-sbt), I use sdkman to install sbt.

```shell
# Install sdkman
$ curl -s "https://get.sdkman.io" | bash
$ source "$HOME/.sdkman/bin/sdkman-init.sh"
# Install Eclipse Temurin JDK 11
$ sdk install java 11.0.21-tem
$ sdk install sbt
```

### Chisel Bootcamp - Local Installation in Mac/Linux

I follow the commands in [Local Installation - Mac/Linux](https://github.com/freechipsproject/chisel-bootcamp/blob/master/Install.md#local-installation---maclinux). Note that the commands above installed Eclipse Temurin JDK 11. However, the **Note** in Chisel Bootcamp says that JDK 8 should be installed to initialize the Chisel Bootcamp.

> Note: Make sure you are using Java 8 (NOT Java 9) and have the JDK8 installed. Coursier/jupyter-scala does not appear to be compatible with Java 9 yet as of January 2018.

Following the hint in the **Note**, I check which Java version I have.

```shell
$ java -version
openjdk version "11.0.21" 2023-10-17
OpenJDK Runtime Environment Temurin-11.0.21+9 (build 11.0.21+9)
OpenJDK 64-Bit Server VM Temurin-11.0.21+9 (build 11.0.21+9, mixed mode)
```

Since I installed JDK 11 on my system, the Java version is `11.0.21`. I then change the Java version with the following commands.
```shell
$ sdk list java
         |     | 20.0.2    | tem  |           | 20.0.2-tem
         |     | 20.0.1    | tem  |           | 20.0.1-tem
         |     | 17.0.9    | tem  |           | 17.0.9-tem
         |     | 17.0.8    | tem  |           | 17.0.8-tem
         |     | 17.0.8.1  | tem  |           | 17.0.8.1-tem
         |     | 17.0.7    | tem  |           | 17.0.7-tem
         |     | 11.0.21   | tem  | installed | 11.0.21-tem
         |     | 11.0.20   | tem  |           | 11.0.20-tem
         |     | 11.0.20.1 | tem  |           | 11.0.20.1-tem
         |     | 11.0.19   | tem  |           | 11.0.19-tem
         | >>> | 8.0.392   | tem  | installed | 8.0.392-tem
         |     | 8.0.382   | tem  |           | 8.0.382-tem
         |     | 8.0.372   | tem  |           | 8.0.372-tem
 Tencent |     | 17.0.9    | kona |           | 17.0.9-kona
         |     | 17.0.8    | kona |           | 17.0.8-kona
         |     | 17.0.7    | kona |           | 17.0.7-kona
         |     | 11.0.21   | kona |           | 11.0.21-kona
         |     | 11.0.20   | kona |           | 11.0.20-kona
         |     | 11.0.19   | kona |           | 11.0.19-kona
         |     | 8.0.392   | kona |           | 8.0.392-kona
         |     | 8.0.382   | kona |           | 8.0.382-kona
         |     | 8.0.372   | kona |           | 8.0.372-kona
$ sdk install java 8.0.392-tem
$ sdk use java 8.0.392-tem
Using java version 8.0.392-tem in this shell.
$ sdk current
Using:
java: 8.0.392-tem
sbt: 1.9.7
```

Then I open another terminal to start the Jupyter notebook.

```
$ cd chisel-bootcamp
/chisel-bootcamp$ mkdir -p ~/.jupyter/custom
/chisel-bootcamp$ cp source/custom.js ~/.jupyter/custom/custom.js
/chisel-bootcamp$ jupyterbook
```

Take **Module 2.2: Combinational Logic** as an example.

![2023-11-28 16-08-33 screenshot](https://hackmd.io/_uploads/Hk8GxmQrp.png)

It seems to work well.

## Single-Cycle RISC-V CPU in Chisel

Four files, `InstructionFetch.scala`, `InstructionDecode.scala`, `Execute.scala` and `CPU.scala`, need to be filled in with code so that the tests pass.

![image](https://hackmd.io/_uploads/Bk5njdXHT.png)

### InstructionFetch

In this part, there are two outputs, `InsAddr` and `Ins`. `InsAddr` depends on whether a branch is detected in the execute phase. `Ins` carries the instruction data that has been read.

![2023-11-28 22-40-25 screenshot](https://hackmd.io/_uploads/rySg2O7Hp.png)

To validate the result, the following command generates the `.vcd` file so that the waveform can be viewed.
```shell
fewletter@fewletter:~/ca2023-lab3$ WRITE_VCD=1 sbt "testOnly riscv.singlecycle.InstructionFetchTest"
```

The waveform is shown below.

![2023-11-28 23-23-41 screenshot](https://hackmd.io/_uploads/rkqf8YXBa.png)

### InstructionDecode

In this part, the main idea is to parse the information from the instruction, as in the figure.

![2023-11-29 15-16-59 screenshot](https://hackmd.io/_uploads/ByttSv4BT.png)

Therefore, to fill in the code, we should focus on the L and S instruction types. These two types are the ones that let an instruction read from or write to memory.

![2023-11-28 23-07-33 screenshot](https://hackmd.io/_uploads/HyjSfFXBp.png)

As the following code shows, `InstructionDecoderTest` only tests S-type, U-type and R-type instructions, so `io.memory_read_enable` is never exercised, as the waveform shows.

```scala
...
c.io.instruction.poke(0x00a02223L.U) // S-type
c.io.ex_aluop1_source.expect(ALUOp1Source.Register)
c.io.ex_aluop2_source.expect(ALUOp2Source.Immediate)
c.io.regs_reg1_read_address.expect(0.U)
c.io.regs_reg2_read_address.expect(10.U)
c.clock.step()

c.io.instruction.poke(0x000022b7L.U) // lui
c.io.regs_reg1_read_address.expect(0.U)
c.io.ex_aluop1_source.expect(ALUOp1Source.Register)
c.io.ex_aluop2_source.expect(ALUOp2Source.Immediate)
c.clock.step()

c.io.instruction.poke(0x002081b3L.U) // add
c.io.ex_aluop1_source.expect(ALUOp1Source.Register)
c.io.ex_aluop2_source.expect(ALUOp2Source.Register)
c.clock.step()
```

![2023-11-29 15-20-48 screenshot](https://hackmd.io/_uploads/SyrvIPVHT.png)

### Execute

To finish this part, it is important to parse the input instruction. In the figure below, the ALU operation is determined by the `ALUFunct` signal, and the ALU inputs depend on `ALUOp1Src` and `ALUOp2Src`, which select between the register data and the instruction address or immediate.
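The operand selection described above can be sketched as a plain Scala software model (not Chisel hardware). This is only an illustration under stated assumptions: the names `OperandSelectModel`, `ExecuteInputs`, `Op1Source`, `Op2Source` and `selectOperands` are hypothetical and do not come from the lab code.

```scala
// Software model of the ALU operand multiplexers (hypothetical names).
object OperandSelectModel {
  // Where the first ALU operand comes from.
  sealed trait Op1Source
  case object Op1Register extends Op1Source
  case object Op1InstructionAddress extends Op1Source

  // Where the second ALU operand comes from.
  sealed trait Op2Source
  case object Op2Register extends Op2Source
  case object Op2Immediate extends Op2Source

  // Values that reach the execute phase from the other phases.
  final case class ExecuteInputs(
      instructionAddress: Long,
      reg1Data: Long,
      reg2Data: Long,
      immediate: Long
  )

  // Returns (op1, op2) as chosen by the two source selectors.
  def selectOperands(in: ExecuteInputs, s1: Op1Source, s2: Op2Source): (Long, Long) = {
    val op1 = s1 match {
      case Op1Register           => in.reg1Data
      case Op1InstructionAddress => in.instructionAddress
    }
    val op2 = s2 match {
      case Op2Register  => in.reg2Data
      case Op2Immediate => in.immediate
    }
    (op1, op2)
  }
}
```

For a store such as the S-type instruction in the decoder test, the model would pick the register value as the first operand and the immediate as the second.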
![2023-11-28 23-41-46 screenshot](https://hackmd.io/_uploads/HypBcYXBa.png)

The waveform shows the crucial part of the Execute phase: the ALU operation depends on `ALUFunct`.

![2023-11-30 15-02-08 screenshot](https://hackmd.io/_uploads/Hy2tX2SHa.png)

### CPU

To finish this part, it is important to figure out the inputs and outputs of the different phases of the CPU. Looking at `CPU.scala` first, it is clear that the wiring of the execute phase is missing.

![2023-11-28 23-41-46 screenshot](https://hackmd.io/_uploads/HypBcYXBa.png)

So, to complete the file, it is necessary to look at the execute phase of the CPU. There is no need to care about the relationship between its outputs and inputs, because `Execute.scala` already handles that. Instead, the focus is on how its inputs arrive from the other phases.

```scala
val io = IO(new Bundle {
  val instruction         = Input(UInt(Parameters.InstructionWidth))
  val instruction_address = Input(UInt(Parameters.AddrWidth))
  val reg1_data           = Input(UInt(Parameters.DataWidth))
  val reg2_data           = Input(UInt(Parameters.DataWidth))
  val immediate           = Input(UInt(Parameters.DataWidth))
  val aluop1_source       = Input(UInt(1.W))
  val aluop2_source       = Input(UInt(1.W))

  val mem_alu_result  = Output(UInt(Parameters.DataWidth))
  val if_jump_flag    = Output(Bool())
  val if_jump_address = Output(UInt(Parameters.DataWidth))
})
```

Take `sb.S` as an example: the file tests whether register `t0` has the value `0xDEADBEEF` and register `s2` has the value `0x15`.

```
# mycpu is freely redistributable under the MIT License. See the file
# "LICENSE" for information on usage and redistribution of this file.
.global _start
_start:
    li a0, 0x4
    li t0, 0xDEADBEEF
    sb t0, 0(a0)
    lw t1, 0(a0)
    li s2, 0x15
    sb s2, 1(a0)
    lw ra, 0(a0)
loop:
    j loop
```

Therefore, focus on how the execute phase gets its inputs in the following two examples:

* `li t0, 0xDEADBEEF`

![2023-11-30 17-54-26 screenshot](https://hackmd.io/_uploads/Hyo1n0SHT.png)

* `li s2, 0x15`

![2023-11-30 17-55-03 screenshot](https://hackmd.io/_uploads/Skef2AHHT.png)

## Run HW2 on Mycpu

### Setup

The code from [HW2](https://hackmd.io/@fewletter/riscvtoolchain) doesn't fit the ISA supported by Mycpu, so I remove `get_cycles` and the system call from the assembly code.

:::spoiler modified code
```c
.org 0
# Provide program starting address to linker
.global _start

.data
data_1: .word 0x12345678
data_2: .word 0xffffdddd
mask_1: .word 0x55555555
mask_2: .word 0x33333333
mask_3: .word 0x0f0f0f0f

.text
_start:
    lw s0, data_1           #s0 = A
    lw s1, data_2           #s1 = B
    mv a0, s0
    jal ra, CLZ
    mv t5, a0               #A's CLZ -> t5
    mv a0, s1
    jal ra, CLZ
    mv t6, a0               #B's CLZ -> t6
    slt t0, t5, t6          # if A's zero less than B's, t0=1
    li a0, 32
    jal ra, get_cycles
    mv a4, a3
    bne t0, zero, start_mul
start_mul:
    #reset
    mv t0, s0               #A ^= B;
    mv s0, s1               #B ^= A;
    mv s1, t0               #A ^= B;
    mv t6, t5
    sub a0, a0, t6
    li t0, 0
    li t1, 0
    li t2, 0
    li s2, 0                #s2: high 32 of number
    li s3, 0                #s3: low 32 of number
    li s4, 0                #used to check how many bit should shift
int_mul:
    slt t1, s4, a0
    beq t1, zero, exit
    srl t0, s1, s4
    andi t0, t0, 0x00000001 #check B's rightmost bit
    beq t0, zero, skip      #if(rightmost bit is zero) jump
    sll s5, s0, s4          #s0 is A, s5 the low bit i want
    li t2, 32
    sub t2, t2, s4
    srl s6, s0, t2          #s0 is A, s6 the high bit i want
    add s7, s3, s5          #s7 is 32_low + low bit i want
    sltu t3, s7, s3
    mv s3, s7
    beq t3, zero, no_overflow # if not jump --> overflow
    add s2, s2, s6
    addi s2, s2, 1
    addi s4, s4, 1
no_overflow:
    add s2, s2, s6
    jal skip
skip:
    addi s4, s4, 1
    jal int_mul
CLZ:
    #a0: the num(x) you want to count CLZ
    #t0: shifted x
    srli t0, a0, 1          # t0 = x >> 1
    or a0, a0, t0           # x |= x >> 1
    srli t0, a0, 2          # t0 = x >> 2
    or a0, a0, t0           # x |= x >> 2
    srli t0, a0, 4          # t0 = x >> 4
    or a0, a0, t0           # x |= x >> 4
    srli t0, a0, 8          # t0 = x >> 8
    or a0, a0, t0           # x |= x >> 8
    srli t0, a0, 16         # t0 = x >> 16
    or a0, a0, t0           # x |= x >> 16
    #start_mask
    lw t2, mask_1
    srli t0, a0, 1          # t0 = x >> 1
    and t1, t0, t2          # t1 = (x >> 1) & mask1
    sub a0, a0, t1          # x -= ((x >> 1) & mask1)
    lw t2, mask_2           # load mask2 to t2
    srli t0, a0, 2          # t0 = x >> 2
    and t1, t0, t2          # (x >> 2) & mask2
    and a0, a0, t2          # x & mask2
    add a0, t1, a0          # ((x >> 2) & mask2) + (x & mask2)
    srli t0, a0, 4          # t0 = x >> 4
    add a0, a0, t0          # x + (x >> 4)
    lw t2, mask_3           # load mask3 to t2
    and a0, a0, t2          # ((x >> 4) + x) & mask3
    srli t0, a0, 8          # t0 = x >> 8
    add a0, a0, t0          # x += (x >> 8)
    srli t0, a0, 16         # t0 = x >> 16
    add a0, a0, t0          # x += (x >> 16)
    andi t0, a0, 0x3f       # t0 = x & 0x3f
    li a0, 32               # a0 = 32
    sub a0, a0, t0          # 32 - (x & 0x3f)
    ret
exit:
    j exit
```
:::

Then I change the Makefile in `ca2023-lab3/csrc`, so that it generates the `.asmbin` file from the `.elf` directly.

```diff
...
 BINS = \
     fibonacci.asmbin \
     hello.asmbin \
     mmio.asmbin \
     quicksort.asmbin \
     sb.asmbin \
+    mul_clz.asmbin
...
```

Whenever `mul_clz.S` is modified, the following commands generate a new `.asmbin` file and update the copy in the main test directory.

```
csrc$ make
riscv-none-elf-as -R -march=rv32i_zicsr -mabi=ilp32 -o mul_clz.o mul_clz.S
mul_clz.S: Assembler messages:
mul_clz.S: Warning: end of file not at end of a line; newline inserted
riscv-none-elf-ld -o mul_clz.elf -T link.lds --oformat=elf32-littleriscv mul_clz.o
riscv-none-elf-objcopy -O binary -j .text -j .data mul_clz.elf mul_clz.asmbin
rm mul_clz.elf
csrc$ make update
cp -f fibonacci.asmbin hello.asmbin mmio.asmbin quicksort.asmbin sb.asmbin mul_clz.asmbin ../src/main/resources
```

### Run and Debug HW2 on CPUTest

To test the assembly code of HW2, I prepare a `mul_clzTest` to check whether the result is stored in registers `s2` and `s3` correctly.
If the result is correct, the test should fail, because the test expects `0x0` and the real result is not `0x0`.

```scala
class mul_clzTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "multiply two numbers with counting leading zeros" in {
    test(new TestTopModule("mul_clz.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 500) {
        c.clock.step()
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      c.io.regs_debug_read_address.poke(18.U) // s2
      c.io.regs_debug_read_data.expect(0x0.U)
      c.io.regs_debug_read_address.poke(19.U) // s3
      c.io.regs_debug_read_data.expect(0x0.U)
    }
  }
}
```

Here is the result. All tests pass, which doesn't match my expectation, so I start looking for the problem.

```
$ WRITE_VCD=1 sbt test
...
[info] mul_clzTest:
[info] Single Cycle CPU
[info] - should multiply two numbers with counting leading zeros
[info] ByteAccessTest:
[info] Single Cycle CPU
[info] - should store and load a single byte
[info] FibonacciTest:
[info] Single Cycle CPU
[info] - should recursively calculate Fibonacci(10)
[info] ExecuteTest:
[info] Execution of Single Cycle CPU
[info] - should execute correctly
[info] QuicksortTest:
[info] Single Cycle CPU
[info] - should perform a quicksort on 10 numbers
[info] RegisterFileTest:
[info] Register File of Single Cycle CPU
[info] - should read the written content
[info] - should x0 always be zero
[info] - should read the writing content
[info] Run completed in 13 seconds, 333 milliseconds.
[info] Total number of tests run: 10
[info] Suites: completed 8, aborted 0
[info] Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 18 s, completed 2023年11月30日 下午4:45:22
```

I modify the assembly code so that it only counts the leading zeros of the data, and I modify the **CPUTest** to check whether registers `t5` and `t6` hold `0x3` and `0x7`.
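As a cross-check on the expected register values, the CLZ routine can be modeled in plain Scala. This is a sketch that assumes the same smear-then-popcount steps as the `CLZ` label in the assembly; the names `ClzModel` and `clz` are hypothetical. Under this model, `0x12345678` has 3 leading zeros, while `0xffffdddd` has its top bit set and therefore has 0.

```scala
// Software model of the CLZ routine (hypothetical names, not lab code).
object ClzModel {
  def clz(value: Int): Int = {
    var x = value
    // Smear the highest set bit into every lower bit position.
    x |= x >>> 1
    x |= x >>> 2
    x |= x >>> 4
    x |= x >>> 8
    x |= x >>> 16
    // Population count using the same three masks as the assembly.
    x -= (x >>> 1) & 0x55555555
    x = ((x >>> 2) & 0x33333333) + (x & 0x33333333)
    x = ((x >>> 4) + x) & 0x0f0f0f0f
    x += x >>> 8
    x += x >>> 16
    // Leading zeros = 32 - number of bits set after smearing.
    32 - (x & 0x3f)
  }
}
```

Calling `ClzModel.clz` on the two `.data` words gives the values that `t5` and `t6` should hold if the CPU executes the routine to the end.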
:::spoiler modified code
```
.org 0
# Provide program starting address to linker
.global _start

.data
data_1: .word 0x12345678
data_2: .word 0xffffdddd
mask_1: .word 0x55555555
mask_2: .word 0x33333333
mask_3: .word 0x0f0f0f0f

.text
_start:
    lw s0, data_1           #s0 = A
    lw s1, data_2           #s1 = B
    mv a0, s0
    jal ra, CLZ
    mv t5, a0               #A's CLZ -> t5
    mv a0, s1
    jal ra, CLZ
    mv t6, a0               #B's CLZ -> t6
    slt t0, t5, t6          # if A's zero less than B's, t0=1
loop:
    j loop
CLZ:
    #a0: the num(x) you want to count CLZ
    #t0: shifted x
    srli t0, a0, 1          # t0 = x >> 1
    or a0, a0, t0           # x |= x >> 1
    srli t0, a0, 2          # t0 = x >> 2
    or a0, a0, t0           # x |= x >> 2
    srli t0, a0, 4          # t0 = x >> 4
    or a0, a0, t0           # x |= x >> 4
    srli t0, a0, 8          # t0 = x >> 8
    or a0, a0, t0           # x |= x >> 8
    srli t0, a0, 16         # t0 = x >> 16
    or a0, a0, t0           # x |= x >> 16
    #start_mask
    lw t2, mask_1
    srli t0, a0, 1          # t0 = x >> 1
    and t1, t0, t2          # t1 = (x >> 1) & mask1
    sub a0, a0, t1          # x -= ((x >> 1) & mask1)
    lw t2, mask_2           # load mask2 to t2
    srli t0, a0, 2          # t0 = x >> 2
    and t1, t0, t2          # (x >> 2) & mask2
    and a0, a0, t2          # x & mask2
    add a0, t1, a0          # ((x >> 2) & mask2) + (x & mask2)
    srli t0, a0, 4          # t0 = x >> 4
    add a0, a0, t0          # x + (x >> 4)
    lw t2, mask_3           # load mask3 to t2
    and a0, a0, t2          # ((x >> 4) + x) & mask3
    srli t0, a0, 8          # t0 = x >> 8
    add a0, a0, t0          # x += (x >> 8)
    srli t0, a0, 16         # t0 = x >> 16
    add a0, a0, t0          # x += (x >> 16)
    andi t0, a0, 0x3f       # t0 = x & 0x3f
    li a0, 32               # a0 = 32
    sub a0, a0, t0          # 32 - (x & 0x3f)
    ret
```
:::

**CPUTest**

```
...
c.io.regs_debug_read_address.poke(30.U) // t5
c.io.regs_debug_read_data.expect(0x3.U)
c.io.regs_debug_read_address.poke(31.U) // t6
c.io.regs_debug_read_data.expect(0x7.U)
...
```

Here is the result. The test still does not pass.
```
[info] mul_clzTest:
[info] Single Cycle CPU
[info] - should multiply two numbers with counting leading zeros *** FAILED ***
[info]   io_regs_debug_read_data=0 (0x0) did not equal expected=7 (0x7) (lines in CPUTest.scala: 128, 120) (CPUTest.scala:128)
```

So I view the waveform to see how these two marked lines behave.

```
mv a0, s0
jal ra, CLZ
mv t5, a0 #A's CLZ -> t5 <-- 1
mv a0, s1
jal ra, CLZ
mv t6, a0 #B's CLZ -> t6 <-- 2
```

At 433 ns, the value in `a0` (`register_10`) is moved to `t5` (`register_30`), but `t6` (`register_31`) is always zero. Therefore, it seems that the bug appears in these instructions.

![2023-11-30 18-40-50 screenshot](https://hackmd.io/_uploads/H1r0IyIB6.png)

#### Time matters

Since the value in a specific register isn't right, I view the waveform to see what happens.

![2023-11-30 23-48-45 screenshot](https://hackmd.io/_uploads/H1SzeVUHa.png)

The final instruction executed in the **CPUTest** is `0c030c63`, which decodes to `beq t1, zero, init_mul`. This means that **only part of the assembly code is executed**, which is why I can't get the right result in the specific register. Finally, I increase the number of cycles simulated in the **CPUTest**, so the assembly code can pass the test with the correct values.
```diff
 class mul_clzTest extends AnyFlatSpec with ChiselScalatestTester {
   behavior.of("Single Cycle CPU")
   it should "multiply two numbers with counting leading zeros" in {
     test(new TestTopModule("mul_clz.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
-      for (i <- 1 to 500) {
+      for (i <- 1 to 5000) {
         c.clock.step()
         c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
       }
       c.io.regs_debug_read_address.poke(30.U) // t5
       c.io.regs_debug_read_data.expect(0x3.U)
+      c.io.regs_debug_read_address.poke(18.U) // s2
+      c.io.regs_debug_read_data.expect(0x1234540a.U)
     }
   }
 }
```

The whole assembly program takes 3577 ns to complete, and the result is the same as the value I got in [HW2](https://hackmd.io/ncRkOZMfQlq9zJMb1hFgZQ?view#Print-in-HEX-form).

![2023-11-30 23-50-00 screenshot](https://hackmd.io/_uploads/SkgEzVLBa.png)
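To double-check the value expected in `s2`, the 32-bit by 32-bit shift-and-add multiplication can be modeled in plain Scala. This is a sketch under the assumption that `s2` holds the high 32 bits and `s3` the low 32 bits of the product, as the comments in the assembly state; `MulModel` and `mul32x32` are hypothetical names, and the model is not a line-by-line translation of the assembly.

```scala
// Software model of a 32x32 -> 64 bit shift-and-add multiply
// (hypothetical names, not lab code).
object MulModel {
  private val Mask32 = 0xFFFFFFFFL

  // a and b are unsigned 32-bit values stored in Long.
  // Returns (high 32 bits, low 32 bits) of a * b.
  def mul32x32(a: Long, b: Long): (Long, Long) = {
    var hi = 0L
    var lo = 0L
    for (i <- 0 until 32) {
      if (((b >> i) & 1L) == 1L) {
        // Split (a << i) into the part that stays in the low word
        // and the part that spills into the high word.
        val addLo = (a << i) & Mask32
        val addHi = if (i == 0) 0L else (a >> (32 - i)) & Mask32
        val sum   = lo + addLo
        // A sum wider than 32 bits means a carry into the high word.
        val carry = if (sum > Mask32) 1L else 0L
        lo = sum & Mask32
        hi = (hi + addHi + carry) & Mask32
      }
    }
    (hi, lo)
  }
}
```

With the inputs `0x12345678` and `0xffffdddd` from the `.data` section, the high word of this model is `0x1234540a`, which agrees with the value checked in the test.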