# Assignment3: Single-Cycle RISC-V CPU contributed by [kk908676](https://github.com/kk908676) ## Environment #### 1. Install sbt Follow the instructions from [here](https://www.scala-sbt.org/release/docs/Installing-sbt-on-Linux.html) #### 2. Install Java (If you have not downloaded before) $sudo apt install default-jdk #### 3. Install the dependent packages $ sudo apt install build-essential verilator gtkwave #### 4. Chisel $ git clone https://github.com/ucb-bar/chisel-tutorial Before testing your system, ensure that you have sbt (the Scala build tool) installed. ```shell $ sbt run ``` #### 5. Assignment $ git clone https://github.com/sysprog21/ca2023-lab3 ## Single-cycle RISC-V CPU The following image is the single-cycle CPU that we will be implementing in this assignment. * Full ![Full](https://hackmd.io/_uploads/r1PKdsaNT.png) * Simplifed ![Simplifed](https://hackmd.io/_uploads/Hyljds646.png) As the instructor has already completed the majority of the content, I only need to make modifications to the following file : * InstructionFetch.scala * InstructionDecode.scala * Execute.scala * CPU.scala After completion, you can obtain the following information. >[info] <font color="green">ExecuteTest:</font> [info] <font color="green">Execution of Single Cycle CPU</font> [info] <font color="green">- should execute correctly</font> [info] <font color="green">ByteAccessTest:</font> [info] <font color="green">Single Cycle CPU</font> [info] <font color="green">- should store and load a single byte</font> [info] <font color="green">FibonacciTest:</font> [info] <font color="green">Single Cycle CPU</font> [info] <font color="green">- should recursively calculate Fibonacci(10)</font> [info] <font color="green">InstructionDecoderTest:</font> [info] <font color="green">InstructionDecoder of Single Cycle CPU</font> [info] <font color="green">- should produce correct control signal</font> [info] <font color="green">QuicksortTest:</font> [info] <font color="green">Single Cycle CPU</font> [info] <font color="green">- should perform a quicksort on 10 numbers</font> [info] <font color="green">InstructionFetchTest:</font> [info] <font color="green">InstructionFetch of Single Cycle CPU</font> [info] <font color="green">- should fetch instruction</font> [info] <font color="green">RegisterFileTest:</font> [info] <font color="green">Register File of Single Cycle CPU</font> [info] <font color="green">- should read the written content</font> [info] <font color="green">- should x0 always be zero</font> [info] <font color="green">- should read the writing content</font> [info] <font color="blue">Run completed in 20 seconds, 443 milliseconds.</font> [info] <font color="blue">Total number of tests run: 9</font> [info] <font color="blue">Suites: completed 7, aborted 0</font> [info] <font color="blue">Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0</font> >[info] <font color="green">All tests passed.</font> #### The next task is to run the code from the previous assignment on this CPU. * made some modifications to the content before running it. I removed the code for print. * also added the process of placing the output into specific memory to ensure successful retrieval during testing Below is the complete C code. ```scala #include <stdint.h> #include <stdio.h> uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); x -= ((x >> 1) & 0x5555555555555555); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (64 - (x & 0x7f)); } int find_string(uint64_t x, int n){ int clz; // leading zeros of x int pos = 0; // position of fist '1' bit from significant bit while(x != 0){ clz = count_leading_zeros(x); x = x << clz; pos = pos + clz; clz = count_leading_zeros(~x); if (clz >= n) return pos; x = x << clz; pos = pos + clz; } return -1; } int main() { uint64_t test_data[] = {0x0f00000000000000, 0x0000000000000000, 0x0123456789abcdef}; *((volatile int *) (4)) = find_string(test_data[0], 4);//4 *((volatile int *) (8)) = find_string(test_data[1], 4);//-1 *((volatile int *) (12)) = find_string(test_data[2], 4);//29 return 0; } ``` Generating Waveform Files During Testing: While running tests, if you set the environment variable `WRITE_VCD` to 1, waveform files will be generated. ```shell $ WRITE_VCD=1 sbt test ``` ### Verilator After the first run and every time you modify the Chisel code, you need to execute the following command in the project's root directory to generate Verilog files: ```shell $ make verilator ``` For example, load the `hello.asmbin` file, simulate for 1000 cycles, and save the simulation waveform to the `dump.vcd` file ```shell $ ./run-verilator.sh -instruction src/main/resources/hello.asmbin -time 2000 -vcd dump.vcd ``` ### GTKWave #### InstructionFetch <a href="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/blob/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/IF.PNG"> <img src="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/raw/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/IF.PNG" width="100%" title="click to open the link" alt="InstructionFetch gtkwave"></a><br> <br>Observing the pattern, it is evident that with each clock cycle, our program counter (pc) increments by 4. #### InstructionDecode <a href="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/blob/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/ID.PNG"><img src="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/raw/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/ID.PNG" width="100%" title="click to open the link" alt="InstructionDecode gtkwave"></a><br> <br>Taking clock cycle 3 for example, the instruction is 0x22B7 (equivalent to 'lui x5, 2'). Therefore, the io_reg_write_address is 5, and io_ex_immediate is 2. #### Execute <a href="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/blob/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/EX.PNG"><img src="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/raw/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/EX.PNG" width="100%" title="click to open the link" alt="Execute gtkwave"></a><br> <br>Here, with an opcode of 33 (equivalent to addu), the operation involves adding io_op1 (0x05B3E659) to io_op2 (0x0380F18C), and the result is stored in io_result (0x934D7E5). Later, we need to add our own test data in **CPUTest.scala**. ```scala class Find_stringTest extends AnyFlatSpec with ChiselScalatestTester { behavior.of("Single Cycle CPU") it should "execute calculate find_string" in { test(new TestTopModule("main.asmbin")).withAnnotations(TestAnnotations.annos) { c => for (i <- 1 to 50000) { c.clock.step() c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout } c.io.mem_debug_read_address.poke(4.U) // #1 c.clock.step() c.io.mem_debug_read_data.expect(4.U) c.io.mem_debug_read_address.poke(8.U) // #2 c.clock.step() c.io.mem_debug_read_data.expect("hffffffff".U) c.io.mem_debug_read_address.poke(12.U) // #3 c.clock.step() c.io.mem_debug_read_data.expect(29.U) } } } ``` <br>Finally, use **$ sbt test** to ensure that our modifications have passed compilation.<br> >[info] <font color="green">Find_stringTest:</font> [info] <font color="green">Single Cycle CPU</font> >[info] <font color="green">- should execute calculate find_string</font> ## Modify the handwritten RISC-V assembly code in Homework2 * made some modifications to the content before running it. I removed the code for ecall. :::spoiler assembly code ```scala .global main .data t1_u: .word 0x0f000000 # upper bits of test data 1, test_data1[0~31] t1_l: .word 0x00000000 # lower bits of test data 2, test_data2[32~63] t2_u: .word 0x00000000 t2_l: .word 0x00000000 t3_u: .word 0x01234567 t3_l: .word 0x89abcdef buffer: .word 0 .text main: # initial setting la t0, t1_u # load address of upper bits of test data 1 into s0 la t1, t2_u la t2, t3_u addi sp, sp, -12 sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) add s0, zero, t0 # s0 = & test_data_upper add s1, zero, zero # int i (used for test data loop control) addi s2, zero, 3 # upper bound of i (used for loop control) addi s3, zero, 4 addi s4, zero, -1 # be used to do not operation addi s10, zero, 3 main_for_loop: # call finding_string procedure #mv a0, s0 # a0 = & test_data_1_upper begin: jal ra, fs output: mv s5, a0 addi s9, s9, 1 addi s0, s0, 8 blt s9, s10, begin li a7, 11 j Exit fs: addi sp, sp, -16 sw ra, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) #sw s4, 16(sp) addi s1, s0, 4 # s1 = & test_date_lower lw a1, 0(s0) # a1 = value of test_data upper lw a2, 0(s1) # a2 = value of test_date lower, test_data = [a1, a2] #li s2, 0 # s2 = clz = 0 li s2, 0 # s2 = pos = 0 x_equal_0_check: bne a1, zero, x_not_equal_0 bne a2, zero, x_not_equal_0 x_eqaul_0: addi a0, zero, -1 j fs_end x_not_equal_0: jal ra, CLZ # x = x << clz li t0, 32 sub t0, t0, a0 srl a4, a2, t0 sll a3, a1, a0 or a3, a3, a4 # a1 = a1 << clz sll a4, a2, a0 # a2 = a2 << clz, x([a3, a4]) = x([a1, a2]) << clz # pos = pos + clz add s2, s2, a0 # x = -x, [a3, a4] = - [a1, a2] xor a1, a3, s4 xor a2, a4, s4 jal ra, CLZ # check: clz > n bge a0, s3, finish ## < case # x = x << clz sub t0, t0, a0 srl a2, a4, t0 sll a1, a3, a0 or a1, a1, a2 # a1 = a3 << clz sll a2, a4, a0 # a2 = a4 << clz, x([a1, a2]) = x([a3, a4]) << clz # pos = pos + clz add s2, s2, a0 j x_equal_0_check ## >= base finish: mv a0, s2 j fs_end fs_end: lw ra, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw s3, 12(sp) addi sp, sp, 16 j output CLZ: addi sp, sp, -4 sw ra, 0(sp) mv t0, a1 mv t1, a2 li t4, 0x55555555 li t5, 0x33333333 li t6, 0x0f0f0f0f li a5, 1 li a6, 32 loop: sub a7, a6, a5 srl t3, t1, a5 # shift lower bits of test data right with n bit sll t2, t0, a7 # shift upper bits of test data left with 31-n bits or t3, t2, t3 # combine to get new lower bits of test data srl t2, t0, a5 # shift upper bound of test data right with n bit or t0, t0, t2 # [0~31]x | [0~31](x >> n) or t1, t1, t3 # [32~63]x | [32~63](x >> n) slli a5, a5, 1 beq a5, a6, CE j loop CE: # x |= (x>>32) li t2, 0 add t3, t0, zero or t0, t0, t2 or t1, t1, t3 # x -= ((x>>1) & 0x5555555555555555) ## [t2, t3] = x>>1 ([t0, t1]>>1) srli t3, t1, 1 slli t2, t0, 31 or t3, t2, t3 srli t2, t0, 1 ## (x>>1) & 0x5~ and t2, t2, t4 and t3, t3, t4 # [t2, t3] = (x>>1)&0x5~ sub t3, t1, t3 add t1, t3, zero # t1=t3 sub t0, t0, t2 # no underflow at lower bits, [t0, t1]=> x -= ((x>>1) & 0x5555555555555555) # x = ((x>>2)&0x333333333333333) + (x & 0x3333333333333333) ## [t2, t3] = x>>2 ([t0, t1]>>2) srli t3, t1, 2 slli t2, t0, 30 or t3, t3, t2 srli t2, t0, 2 # [t2, t3] = x>>2 ## (x>>1) & 0x3~ and t2, t2, t5 and t3, t3, t5 # [t2, t3] = ((x>>2)&0x3~) ## x & 0x3~ and t0, t0, t5 and t1, t1, t5 # [t0, t1] = (x & 0x3~) add t1, t1, t3 add t0, t0, t2 # x += ((x>>4)+x) & 0x0f~0f ## [t2, t3] = x>>4 ([t0, t1]>>4) srli t3, t1, 4 slli t2, t0, 28 or t3, t3, t2 srli t2, t0, 4 ## (x>>4) + x add t1, t1, t3 add t0, t0, t2 ## ((x>>4) + x) & 0x0f~0f and t0, t0, t6 and t1, t1, t6 # x += x(x>>8) srli t3, t1, 8 slli t2, t0, 24 or t3, t3, t2 srli t2, t0, 8 # [t2, t3] = x>>8 add t0, t0, t2 add t1, t1, t3 # x += x(x>>16) srli t3, t1, 16 slli t2, t0, 16 or t3, t3, t2 srli t2, t0, 16 # [t2, t3] = x>>8 add t0, t0, t2 add t1, t1, t3 # x += (x>>32) add t3, t0, zero add t2, zero, zero add t0, t0, t2 add t1, t1, t3 # 64 - (x & (0x7f)) li t4, 0x7f li a0, 64 and t1, t1, t4 sub a0, a0, t1 #lw ra, 0(sp) addi sp, sp, 4 jalr ra Exit: j Exit ``` ::: <br>Afterwards, perform the same test as above by checking the values inside the registers to ensure their correctness<br> ```scala class Find_stringTest_for_main_opt extends AnyFlatSpec with ChiselScalatestTester { behavior.of("Single Cycle CPU") it should "execute calculate find_string for main_opt" in { test(new TestTopModule("main_opt.asmbin")).withAnnotations(TestAnnotations.annos) { c => for (i <- 1 to 3000) { c.clock.step() c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout } c.io.regs_debug_read_address.poke(21.U) // #3 c.clock.step() c.io.regs_debug_read_data.expect(0x1d.U) } } } ``` ## GTKWave ### verify generate Wave ```scala ./run-verilator.sh -instruction csrc/main_opt.asmbin -time 4000 -vcd dump_opt.vcd -time 4000 ``` <a href="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/blob/cedb0933d2a3ee319476a11d4059a91c849ae6a6/Single_img/redister_21.PNG"><img src="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/raw/cedb0933d2a3ee319476a11d4059a91c849ae6a6/Single_img/redister_21.PNG" width="100%" title="click to open the link" alt="Execute gtkwave"></a><br> <br>At the 3ns timestamp, it is evident from our inspection of register_21 that the observed value perfectly aligns with our expectations, being '29' in hexadecimal (0x1d). ## Reference * [Lab3: Construct a single-cycle RISC-V CPU with Chisel](https://github.com/sysprog21/ca2023-lab3)