Assignment3: Single-Cycle RISC-V CPU

contributed by < fewletter >

Environment setup

Operating System

I use Ubuntu Linux 20.04.1 as my operating system.

$ uname -a
Linux fewletter 5.15.0-89-generic #99~20.04.1-Ubuntu SMP Thu Nov 2 15:16:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Install sbt

Following the commands in Lab3, I use SDKMAN to install sbt.

# Install sdkman
$ curl -s "https://get.sdkman.io" | bash
$ source "$HOME/.sdkman/bin/sdkman-init.sh"

# Install Eclipse Temurin JDK 11
$ sdk install java 11.0.21-tem 
$ sdk install sbt

Chisel Bootcamp - Local Installation in Mac/Linux

Follow the commands in Local Installation - Mac/Linux. Note that the commands above installed Eclipse Temurin JDK 11. However, the Note in Chisel Bootcamp says that JDK 8 should be installed to run the Chisel Bootcamp.

Note: Make sure you are using Java 8 (NOT Java 9) and have the JDK8 installed. Coursier/jupyter-scala does not appear to be compatible with Java 9 yet as of January 2018.

Following the hint in the Note, I check which Java version I have.

$ java -version
openjdk version "11.0.21" 2023-10-17
OpenJDK Runtime Environment Temurin-11.0.21+9 (build 11.0.21+9)
OpenJDK 64-Bit Server VM Temurin-11.0.21+9 (build 11.0.21+9, mixed mode)

Since I installed JDK 11, the Java version is 11.0.21. I then switch the Java version with the following commands.

$ sdk list java
               |     | 20.0.2       | tem     |            | 20.0.2-tem          
               |     | 20.0.1       | tem     |            | 20.0.1-tem          
               |     | 17.0.9       | tem     |            | 17.0.9-tem          
               |     | 17.0.8       | tem     |            | 17.0.8-tem          
               |     | 17.0.8.1     | tem     |            | 17.0.8.1-tem        
               |     | 17.0.7       | tem     |            | 17.0.7-tem          
               |     | 11.0.21      | tem     | installed  | 11.0.21-tem         
               |     | 11.0.20      | tem     |            | 11.0.20-tem         
               |     | 11.0.20.1    | tem     |            | 11.0.20.1-tem       
               |     | 11.0.19      | tem     |            | 11.0.19-tem         
               | >>> | 8.0.392      | tem     | installed  | 8.0.392-tem         
               |     | 8.0.382      | tem     |            | 8.0.382-tem         
               |     | 8.0.372      | tem     |            | 8.0.372-tem         
 Tencent       |     | 17.0.9       | kona    |            | 17.0.9-kona         
               |     | 17.0.8       | kona    |            | 17.0.8-kona         
               |     | 17.0.7       | kona    |            | 17.0.7-kona         
               |     | 11.0.21      | kona    |            | 11.0.21-kona        
               |     | 11.0.20      | kona    |            | 11.0.20-kona        
               |     | 11.0.19      | kona    |            | 11.0.19-kona        
               |     | 8.0.392      | kona    |            | 8.0.392-kona        
               |     | 8.0.382      | kona    |            | 8.0.382-kona        
               |     | 8.0.372      | kona    |            | 8.0.372-kona        

$ sdk install java 8.0.392-tem
$ sdk use java 8.0.392-tem

Using java version 8.0.392-tem in this shell.
$ sdk current

Using:

java: 8.0.392-tem
sbt: 1.9.7

Then I open another terminal to start Jupyter Notebook.

$ cd chisel-bootcamp
/chisel-bootcamp$ mkdir -p ~/.jupyter/custom
/chisel-bootcamp$ cp source/custom.js ~/.jupyter/custom/custom.js
/chisel-bootcamp$ jupyter notebook

Take Module 2.2: Combinational Logic as an example.

It seems to work well.

Single-Cycle RISC-V CPU in Chisel

Four files, InstructionFetch.scala, InstructionDecode.scala, Execute.scala, and CPU.scala, need to be completed so that the tests pass.

InstructionFetch

In this part, there are two outputs, InsAddr and Ins. InsAddr depends on whether a branch is signalled by the execute phase. Ins carries the instruction data read at that address.
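The address selection described above can be sketched as a small host-side model. This is only a Python illustration, not the Chisel code from the lab: the function name `next_pc` and its parameters are hypothetical, and it assumes the usual single-cycle behaviour where the next address is either the jump target from the execute phase or the current address plus 4.

```python
def next_pc(pc, jump_flag, jump_address):
    """Model of the PC update: take the jump target if flagged, else pc + 4.

    All names here are illustrative; they are not taken from the lab source.
    """
    target = jump_address if jump_flag else pc + 4
    return target & 0xFFFFFFFF  # keep the address within 32 bits


print(hex(next_pc(0x1000, False, 0x2000)))  # 0x1004 (sequential fetch)
print(hex(next_pc(0x1000, True, 0x2000)))   # 0x2000 (jump taken)
```

In the waveform, the same pattern should be visible as InsAddr stepping by 4 until the jump flag is raised.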

To validate the result, the following command generates the .vcd file so that the waveform can be viewed.

fewletter@fewletter:~/ca2023-lab3$ WRITE_VCD=1 sbt "testOnly riscv.singlecycle.InstructionFetchTest"

The waveform shows that

InstructionDecode

In this part, the main idea is to parse the information in the instruction, as shown in the figure.

Therefore, to fill in the code, we should focus on the load (L) and store (S) instruction types. Only these two types read from or write to memory.

As the following code shows, InstructionDecoderTest only tests S-type, U-type, and R-type instructions, so io.memory_read_enable is never exercised, as the waveform shows.

...
c.io.instruction.poke(0x00a02223L.U) // S-type
c.io.ex_aluop1_source.expect(ALUOp1Source.Register)
c.io.ex_aluop2_source.expect(ALUOp2Source.Immediate)
c.io.regs_reg1_read_address.expect(0.U)
c.io.regs_reg2_read_address.expect(10.U)
c.clock.step()

c.io.instruction.poke(0x000022b7L.U) // lui
c.io.regs_reg1_read_address.expect(0.U)
c.io.ex_aluop1_source.expect(ALUOp1Source.Register)
c.io.ex_aluop2_source.expect(ALUOp2Source.Immediate)
c.clock.step()

c.io.instruction.poke(0x002081b3L.U) // add
c.io.ex_aluop1_source.expect(ALUOp1Source.Register)
c.io.ex_aluop2_source.expect(ALUOp2Source.Register)
c.clock.step()
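To see why none of these three test vectors can raise io.memory_read_enable, the instruction words can be split into their RV32I fields on the host. The Python sketch below is only an illustration (the helper name `decode_fields` is hypothetical); it relies on the standard RV32I field positions and on 0x03 being the load opcode.

```python
LOAD_OPCODE = 0x03  # RV32I opcode shared by lb, lh, lw, lbu, lhu


def decode_fields(word):
    """Extract the fixed-position fields of a 32-bit RV32I instruction."""
    return {
        "opcode": word & 0x7F,
        "rd": (word >> 7) & 0x1F,
        "funct3": (word >> 12) & 0x7,
        "rs1": (word >> 15) & 0x1F,
        "rs2": (word >> 20) & 0x1F,
    }


# The three words poked by InstructionDecoderTest: S-type, lui, add.
for word in (0x00A02223, 0x000022B7, 0x002081B3):
    fields = decode_fields(word)
    kind = "load" if fields["opcode"] == LOAD_OPCODE else "not a load"
    print(f"{word:#010x} opcode={fields['opcode']:#04x} "
          f"rs1={fields['rs1']} rs2={fields['rs2']} -> {kind}")
```

The S-type word decodes to rs1 = 0 and rs2 = 10, matching the register addresses the test expects, and none of the three opcodes (0x23, 0x37, 0x33) is a load.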

Execute

To finish this part, it is important to parse the input instruction. In the figure below, the ALU operation is determined by the ALUFunct signal, and the ALU inputs are selected by ALUOp1Src and ALUOp2Src: operand 1 is either the register data or the instruction address, and operand 2 is either the register data or the immediate.

The waveform shows the crucial part of the Execute phase: the ALU operation depends on ALUFunct.
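The operand selection can be modelled on the host as two multiplexers. The following Python sketch is an assumption-laden illustration rather than the lab's Chisel code: the function name `alu_operands` is hypothetical, and the encoding (0 selects the register, 1 selects the instruction address or the immediate) is assumed, so it should be checked against the ALUOp1Source and ALUOp2Source definitions in the project.

```python
def alu_operands(reg1_data, reg2_data, instruction_address, immediate,
                 aluop1_source, aluop2_source):
    """Pick the two ALU inputs from the select signals.

    Assumed encoding (not confirmed by the source text):
      aluop1_source: 0 = register, 1 = instruction address
      aluop2_source: 0 = register, 1 = immediate
    """
    op1 = instruction_address if aluop1_source == 1 else reg1_data
    op2 = immediate if aluop2_source == 1 else reg2_data
    return op1, op2


# R-type (e.g. add): both operands come from registers.
print(alu_operands(7, 9, 0x1000, 4, 0, 0))  # (7, 9)
# Branch or jump target: instruction address plus immediate.
print(alu_operands(7, 9, 0x1000, 4, 1, 1))  # (4096, 4)
```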

CPU

To finish this part, it is important to figure out the inputs and outputs of the different phases of the CPU. A first look at CPU.scala shows that the wiring of the execute phase is missing.

So, to complete the file, it is necessary to look at the execute phase of the CPU. There is no need to care about how the outputs are derived from the inputs, because Execute.scala already handles that. Instead, the focus is on which other phases the inputs come from.

val io = IO(new Bundle {
    val instruction         = Input(UInt(Parameters.InstructionWidth))
    val instruction_address = Input(UInt(Parameters.AddrWidth))
    val reg1_data           = Input(UInt(Parameters.DataWidth))
    val reg2_data           = Input(UInt(Parameters.DataWidth))
    val immediate           = Input(UInt(Parameters.DataWidth))
    val aluop1_source       = Input(UInt(1.W))
    val aluop2_source       = Input(UInt(1.W))

    val mem_alu_result  = Output(UInt(Parameters.DataWidth))
    val if_jump_flag    = Output(Bool())
    val if_jump_address = Output(UInt(Parameters.DataWidth))
  })

Take sb.S as an example: the test checks that register t0 holds 0xDEADBEEF and register s2 holds 0x15.

# mycpu is freely redistributable under the MIT License. See the file                                                                                        
# "LICENSE" for information on usage and redistribution of this file.

.global _start
_start:
    li a0, 0x4
    li t0, 0xDEADBEEF
    sb t0, 0(a0)
    lw t1, 0(a0)
    li s2, 0x15
    sb s2, 1(a0)
    lw ra, 0(a0)
loop:
    j loop

Therefore, focus on how the execute phase gets its inputs in the following two examples:

li t0, 0xDEADBEEF
li s2, 0x15
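Both lines are pseudo-instructions, so what the execute phase actually sees is a lui and/or addi. The Python sketch below shows how a 32-bit constant splits into the lui upper part and the sign-extended addi immediate. The helper name `split_li` is hypothetical, and the exact sequence emitted by a given assembler version may differ.

```python
def split_li(value):
    """Split a 32-bit constant into (lui upper 20 bits, addi signed 12 bits)."""
    value &= 0xFFFFFFFF
    low12 = value & 0xFFF
    if low12 >= 0x800:       # addi sign-extends its immediate
        low12 -= 0x1000      # so the upper part must compensate
    upper20 = ((value - low12) >> 12) & 0xFFFFF
    return upper20, low12


for constant in (0xDEADBEEF, 0x15):
    upper20, low12 = split_li(constant)
    print(f"{constant:#x}: lui {upper20:#x}, addi {low12}")
```

For 0xDEADBEEF this gives lui 0xDEADC followed by addi -273, so the execute phase receives the immediate 0xDEADC000 first and then adds -273 to t0. For 0x15 the upper part is zero, so a single addi from the zero register is enough.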

Run HW2 on Mycpu

Setup

The HW2 code does not fit the ISA supported by Mycpu, so I remove get_cycles and the system calls from the assembly code.

Modified code

.org 0
# Provide program starting address to linker
.global _start

.data
    data_1: .word 0x12345678
    data_2: .word 0xffffdddd
    mask_1: .word 0x55555555
    mask_2: .word 0x33333333
    mask_3: .word 0x0f0f0f0f
.text
    
_start:
    lw s0, data_1   #s0 = A
    lw s1, data_2   #s1 = B
    
    mv a0, s0    
    jal ra, CLZ
    mv t5, a0    #A's CLZ ->  t5
    mv a0, s1
    jal ra, CLZ
    mv t6, a0    #B's CLZ ->  t6
    slt t0, t5, t6 # if A's zero less than B's, t0=1
    li a0, 32
    bne t0, zero, start_mul
    
start_mul:
    #reset
    mv t0, s0      #A ^= B;
    mv s0, s1      #B ^= A;
    mv s1, t0      #A ^= B;
    mv t6, t5
    sub a0, a0, t6
    li t0, 0
    li t1, 0
    li t2, 0    
    li s2, 0        #s2: high 32 of number
    li s3, 0        #s3: low 32 of number
    li s4, 0        #used to check how many bit should shift   

int_mul:
    slt t1, s4, a0
    beq t1, zero, exit
    srl t0, s1, s4
    andi t0, t0, 0x00000001       #check B's rightest bit
    beq t0, zero, skip            #if(rightest bit is zero) jump
    sll s5,s0,s4                  #s0 is A,S5 the low bit i want
    li t2, 32
    sub t2, t2, s4
    srl s6, s0, t2             #s0 is A, S6 the high bit i want
    add s7, s3, s5             #s7 is 32_low + low bit i want

    sltu t3, s7, s3
    mv s3, s7
    beq t3, zero, no_overflow
    # if not jump  -->  overflow
    add s2, s2, s6
    addi s2, s2, 1
    addi s4, s4 ,1
    
no_overflow:
    add s2, s2, s6
    jal skip
    
skip:
    addi s4, s4 ,1
    jal int_mul

CLZ:
    #a0: the num(x) you want to count CLZ
    #t0: shifted x
    srli t0, a0, 1    # t0 = x >> 1
    or a0, a0, t0     # x |= x >> 1
    srli t0, a0, 2    # t0 = x >> 2
    or a0, a0, t0     # x |= x >> 2
    srli t0, a0, 4    # t0 = x >> 4
    or a0, a0, t0     # x |= x >> 4
    srli t0, a0, 8    # t0 = x >> 8
    or a0, a0, t0     # x |= x >> 8
    srli t0, a0, 16   # t0 = x >> 16
    or a0, a0, t0     # x |= x >> 16
    #start_mask
    lw t2, mask_1
    srli t0, a0, 1    # t0 = x >> 1
    and t1, t0, t2    # t1 = (x >> 1) & mask1
    sub a0, a0, t1    # x -= ((x >> 1) & mask1)
    lw t2, mask_2     # load mask2 to t2
    srli t0, a0, 2    # t0 = x >> 2
    and t1, t0, t2    # (x >> 2) & mask2
    and a0, a0, t2    # x & mask2
    add a0, t1, a0    # ((x >> 2) & mask2) + (x & mask2)
    srli t0, a0, 4    # t0 = x >> 4
    add a0, a0, t0    # x + (x >> 4)
    lw t2, mask_3      # load mask3 to t2
    and a0, a0, t2    # ((x >> 4) + x) & mask4
    srli t0, a0, 8    # t0 = x >> 8
    add a0, a0, t0    # x += (x >> 8)
    srli t0, a0, 16   # t0 = x >> 16
    add a0, a0, t0    # x += (x >> 16)
    andi t0, a0, 0x3f # t0 = x & 0x3f
    li a0, 32         # a0 = 32
    sub a0, a0, t0    # 32 - (x & 0x3f)
    ret
    
exit:
    j exit

Then I add mul_clz.asmbin to the Makefile in ca2023/csrc; the Makefile generates the .asmbin file directly from the .elf.

...
BINS = \
	fibonacci.asmbin \
	hello.asmbin \
	mmio.asmbin \
	quicksort.asmbin \
	sb.asmbin \
+	mul_clz.asmbin
...

Every time mul_clz.S is modified, the following commands generate a new .asmbin file and update the copy in ../src/main/resources.

csrc$ make
riscv-none-elf-as -R -march=rv32i_zicsr -mabi=ilp32 -o mul_clz.o mul_clz.S
mul_clz.S: Assembler messages:
mul_clz.S: Warning: end of file not at end of a line; newline inserted
riscv-none-elf-ld -o mul_clz.elf -T link.lds --oformat=elf32-littleriscv mul_clz.o
riscv-none-elf-objcopy -O binary -j .text -j .data mul_clz.elf mul_clz.asmbin
rm mul_clz.elf
csrc$ make update
cp -f fibonacci.asmbin hello.asmbin mmio.asmbin quicksort.asmbin sb.asmbin mul_clz.asmbin ../src/main/resources

Run and Debug HW2 on CPUTest

To test the assembly code of HW2, I prepare mul_clzTest to check whether the result is stored correctly in registers s2 and s3. If the result is correct, the test should fail, because the test expects 0x0 and a correct product is nonzero.

class mul_clzTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "multiply two numbers with counting leading zeros" in {
    test(new TestTopModule("mul_clz.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 500) {
        c.clock.step()
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      c.io.regs_debug_read_address.poke(18.U) // s2
      c.io.regs_debug_read_data.expect(0x0.U)
      c.io.regs_debug_read_address.poke(19.U) // s3
      c.io.regs_debug_read_data.expect(0x0.U)
    }
  }
}

Here is the result. The test passes, which means s2 and s3 still hold 0x0. That does not match my expectation, so I start looking for the problem.

$ WRITE_VCD=1 sbt test
...
[info] mul_clzTest:
[info] Single Cycle CPU
[info] - should multiply two numbers with counting leading zeros
[info] ByteAccessTest:
[info] Single Cycle CPU
[info] - should store and load a single byte
[info] FibonacciTest:
[info] Single Cycle CPU
[info] - should recursively calculate Fibonacci(10)
[info] ExecuteTest:
[info] Execution of Single Cycle CPU
[info] - should execute correctly
[info] QuicksortTest:
[info] Single Cycle CPU
[info] - should perform a quicksort on 10 numbers
[info] RegisterFileTest:
[info] Register File of Single Cycle CPU
[info] - should read the written content
[info] - should x0 always be zero
[info] - should read the writing content
[info] Run completed in 13 seconds, 333 milliseconds.
[info] Total number of tests run: 10
[info] Suites: completed 8, aborted 0
[info] Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 18 s, completed Nov 30, 2023 4:45:22 PM

I modify the assembly code so that it only counts the leading zeros of the data, and change the CPUTest to check whether registers t5 and t6 hold 0x3 and 0x7.

Modified code

.org 0
# Provide program starting address to linker
.global _start


.data
    data_1: .word 0x12345678
    data_2: .word 0xffffdddd
    mask_1: .word 0x55555555
    mask_2: .word 0x33333333
    mask_3: .word 0x0f0f0f0f
.text
    
_start:
    lw s0, data_1   #s0 = A
    lw s1, data_2   #s1 = B
    
    mv a0, s0    
    jal ra, CLZ
    mv t5, a0    #A's CLZ ->  t5
    mv a0, s1
    jal ra, CLZ
    mv t6, a0    #B's CLZ ->  t6
    slt t0, t5, t6 # if A's zero less than B's, t0=1

loop:
    j loop

CLZ:
    #a0: the num(x) you want to count CLZ
    #t0: shifted x
    srli t0, a0, 1    # t0 = x >> 1
    or a0, a0, t0     # x |= x >> 1
    srli t0, a0, 2    # t0 = x >> 2
    or a0, a0, t0     # x |= x >> 2
    srli t0, a0, 4    # t0 = x >> 4
    or a0, a0, t0     # x |= x >> 4
    srli t0, a0, 8    # t0 = x >> 8
    or a0, a0, t0     # x |= x >> 8
    srli t0, a0, 16   # t0 = x >> 16
    or a0, a0, t0     # x |= x >> 16
    #start_mask
    lw t2, mask_1
    srli t0, a0, 1    # t0 = x >> 1
    and t1, t0, t2    # t1 = (x >> 1) & mask1
    sub a0, a0, t1    # x -= ((x >> 1) & mask1)
    lw t2, mask_2     # load mask2 to t2
    srli t0, a0, 2    # t0 = x >> 2
    and t1, t0, t2    # (x >> 2) & mask2
    and a0, a0, t2    # x & mask2
    add a0, t1, a0    # ((x >> 2) & mask2) + (x & mask2)
    srli t0, a0, 4    # t0 = x >> 4
    add a0, a0, t0    # x + (x >> 4)
    lw t2, mask_3      # load mask3 to t2
    and a0, a0, t2    # ((x >> 4) + x) & mask4
    srli t0, a0, 8    # t0 = x >> 8
    add a0, a0, t0    # x += (x >> 8)
    srli t0, a0, 16   # t0 = x >> 16
    add a0, a0, t0    # x += (x >> 16)
    andi t0, a0, 0x3f # t0 = x & 0x3f
    li a0, 32         # a0 = 32
    sub a0, a0, t0    # 32 - (x & 0x3f)
    ret

CPUTest

...
c.io.regs_debug_read_address.poke(30.U) // t5
c.io.regs_debug_read_data.expect(0x3.U)
c.io.regs_debug_read_address.poke(31.U) // t6
c.io.regs_debug_read_data.expect(0x7.U)
...

Here is the result. The test still does not pass.

[info] mul_clzTest:
[info] Single Cycle CPU
[info] - should multiply two numbers with counting leading zeros *** FAILED ***
[info]   io_regs_debug_read_data=0 (0x0) did not equal expected=7 (0x7) (lines in CPUTest.scala: 128, 120) (CPUTest.scala:128)

So I view the waveform to see how these two lines behave.

mv a0, s0    
jal ra, CLZ
mv t5, a0    #A's CLZ ->  t5      <-- 1
mv a0, s1
jal ra, CLZ
mv t6, a0    #B's CLZ ->  t6      <-- 2

At 433 ns, the value in a0 (register_10) is moved to t5 (register_30), but t6 (register_31) stays zero throughout. It therefore seems that the bug lies in these lines.
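As a cross-check that does not depend on the waveform, the CLZ routine can be transcribed into a host-side model. The Python sketch below follows the assembly step by step (the function name `clz32` is hypothetical). Under this model, data_1 = 0x12345678 gives 3, while data_2 = 0xffffdddd gives 0 because its top bit is already set, so a zero in t6 is what the routine itself returns for that input.

```python
MASK32 = 0xFFFFFFFF


def clz32(x):
    """Count leading zeros of a 32-bit value, mirroring the CLZ assembly."""
    x &= MASK32
    # Smear the highest set bit into every lower position.
    for shift in (1, 2, 4, 8, 16):
        x |= x >> shift
    # Population count of the smeared value (masks as in mask_1..mask_3).
    x -= (x >> 1) & 0x55555555
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
    x = ((x >> 4) + x) & 0x0F0F0F0F
    x += x >> 8
    x += x >> 16
    return 32 - (x & 0x3F)


print(clz32(0x12345678))  # 3
print(clz32(0xFFFFDDDD))  # 0
```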

Time matters

Since the value in the target register is not right, I view the waveform to see what happens.

The last instruction executed in the CPUTest is 0c030c63, which decodes to beq t1, zero, exit (the loop-exit check in int_mul). This means that only part of the assembly code has been executed, which is why the registers do not hold the final result. Finally, I increase the number of clock cycles in the CPUTest so that the program runs to completion and the test passes with the correct value.

class mul_clzTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "multiply two numbers with counting leading zeros" in {
    test(new TestTopModule("mul_clz.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
-     for (i <- 1 to 500) {
+     for (i <- 1 to 5000) {
        c.clock.step()
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      c.io.regs_debug_read_address.poke(30.U) // t5
      c.io.regs_debug_read_data.expect(0x3.U)

+     c.io.regs_debug_read_address.poke(18.U) // s2
+     c.io.regs_debug_read_data.expect(0x1234540a.U)
    }
  }
}
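The instruction word 0c030c63 read from the waveform can be decoded by hand or with a few lines of host code. The Python sketch below uses the standard RV32I B-type layout; the helper name `decode_b_type` is hypothetical. It confirms a beq on x6 (t1) and x0 (zero) with a forward offset of 216 bytes, i.e. a branch out of the multiplication loop.

```python
def decode_b_type(word):
    """Decode an RV32I B-type instruction into (opcode, funct3, rs1, rs2, offset)."""
    opcode = word & 0x7F
    funct3 = (word >> 12) & 0x7
    rs1 = (word >> 15) & 0x1F
    rs2 = (word >> 20) & 0x1F
    # The branch offset is scattered over four bit ranges of the word.
    imm = (((word >> 31) & 0x1) << 12) \
        | (((word >> 7) & 0x1) << 11) \
        | (((word >> 25) & 0x3F) << 5) \
        | (((word >> 8) & 0xF) << 1)
    if imm & 0x1000:  # sign-extend the 13-bit offset
        imm -= 0x2000
    return opcode, funct3, rs1, rs2, imm


opcode, funct3, rs1, rs2, offset = decode_b_type(0x0C030C63)
print(hex(opcode), funct3, rs1, rs2, offset)  # 0x63 0 6 0 216
```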

The whole program takes 3577 ns to complete, and the result is the same as the value I got in HW2.
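The expected value in s2 can also be reproduced on the host. The Python sketch below is a simplified shift-and-add model of a 32x32-bit unsigned multiplication that keeps the result in two 32-bit halves and detects the carry with an unsigned comparison, as sltu does in the assembly. It is not a line-by-line transcription of mul_clz.S, and the function name `mul32_shift_add` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def mul32_shift_add(a, b):
    """Return (high, low) 32-bit halves of the unsigned product a * b."""
    high, low = 0, 0
    for shift in range(32):
        if (b >> shift) & 1 == 0:
            continue  # this bit of the multiplier contributes nothing
        part_low = (a << shift) & MASK32
        part_high = a >> (32 - shift)  # bits shifted out of the low word
        new_low = (low + part_low) & MASK32
        carry = 1 if new_low < low else 0  # unsigned overflow check
        low = new_low
        high = (high + part_high + carry) & MASK32
    return high, low


high, low = mul32_shift_add(0x12345678, 0xFFFFDDDD)
print(hex(high))  # 0x1234540a
```

The high word matches the 0x1234540a expected for s2 in the test above.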