Try   HackMD

Assignment3: Single-Cycle RISC-V CPU

contributed by kk908676

Environment

1. Install sbt

Follow the instructions from here

2. Install Java (If you have not downloaded before)

​​​​$sudo apt install default-jdk

3. Install the dependent packages

​​​​$ sudo apt install build-essential verilator gtkwave

4. Chisel

​​​​$ git clone https://github.com/ucb-bar/chisel-tutorial

Before testing your system, ensure that you have sbt (the Scala build tool) installed.

$ sbt run

5. Assignment

​​​​$ git clone https://github.com/sysprog21/ca2023-lab3

Single-cycle RISC-V CPU

The following image is the single-cycle CPU that we will be implementing in this assignment.

  • Full

    Image Not Showing Possible Reasons
    • The image was uploaded to a note which you don't have access to
    • The note which the image was originally uploaded to has been deleted
    Learn More →

  • Simplifed

    Image Not Showing Possible Reasons
    • The image was uploaded to a note which you don't have access to
    • The note which the image was originally uploaded to has been deleted
    Learn More →

As the instructor has already completed the majority of the content, I only need to make modifications to the following file :

  • InstructionFetch.scala
  • InstructionDecode.scala
  • Execute.scala
  • CPU.scala

After completion, you can obtain the following information.

[info] ExecuteTest:
[info] Execution of Single Cycle CPU
[info] - should execute correctly
[info] ByteAccessTest:
[info] Single Cycle CPU
[info] - should store and load a single byte
[info] FibonacciTest:
[info] Single Cycle CPU
[info] - should recursively calculate Fibonacci(10)
[info] InstructionDecoderTest:
[info] InstructionDecoder of Single Cycle CPU
[info] - should produce correct control signal
[info] QuicksortTest:
[info] Single Cycle CPU
[info] - should perform a quicksort on 10 numbers
[info] InstructionFetchTest:
[info] InstructionFetch of Single Cycle CPU
[info] - should fetch instruction
[info] RegisterFileTest:
[info] Register File of Single Cycle CPU
[info] - should read the written content
[info] - should x0 always be zero
[info] - should read the writing content
[info] Run completed in 20 seconds, 443 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 7, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

The next task is to run the code from the previous assignment on this CPU.

  • made some modifications to the content before running it. I removed the code for print.
  • also added the process of placing the output into specific memory to ensure successful retrieval during testing

Below is the complete C code.

#include <stdint.h>
#include <stdio.h>

uint16_t count_leading_zeros(uint64_t x)
{
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);

    x -= ((x >> 1) & 0x5555555555555555);
    x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);

    return (64 - (x & 0x7f));
}

int find_string(uint64_t x, int n){
    int clz;    // leading zeros of x
    
    int pos = 0;    // position of fist '1' bit from significant bit
    
    while(x != 0){
        clz = count_leading_zeros(x);
        x = x << clz;
        pos = pos + clz;
        clz = count_leading_zeros(~x);
        if (clz >= n)
            return pos;
        x = x << clz;
        pos = pos + clz;
    }

    return -1;
}

int main() {
    
    uint64_t test_data[] = {0x0f00000000000000, 
                            0x0000000000000000, 
                            0x0123456789abcdef};
    
    *((volatile int *) (4)) = find_string(test_data[0], 4);//4
    *((volatile int *) (8)) = find_string(test_data[1], 4);//-1
    *((volatile int *) (12)) = find_string(test_data[2], 4);//29
    return 0;  
}

Generating Waveform Files During Testing:
While running tests, if you set the environment variable WRITE_VCD to 1, waveform files will be generated.

$ WRITE_VCD=1 sbt test

Verilator

After the first run and every time you modify the Chisel code, you need to execute the following command in the project's root directory to generate Verilog files:

$ make verilator

For example, load the hello.asmbin file, simulate for 1000 cycles, and save the simulation waveform to the dump.vcd file

$ ./run-verilator.sh -instruction src/main/resources/hello.asmbin
-time 2000 -vcd dump.vcd

GTKWave

InstructionFetch



Observing the pattern, it is evident that with each clock cycle, our program counter (pc) increments by 4.

InstructionDecode



Taking clock cycle 3 for example, the instruction is 0x22B7 (equivalent to 'lui x5, 2'). Therefore, the io_reg_write_address is 5, and io_ex_immediate is 2.

Execute



Here, with an opcode of 33 (equivalent to addu), the operation involves adding io_op1 (0x05B3E659) to io_op2 (0x0380F18C), and the result is stored in io_result (0x934D7E5).

Later, we need to add our own test data in CPUTest.scala.

class Find_stringTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "execute calculate find_string" in {
    test(new TestTopModule("main.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 50000) {
        c.clock.step()
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }

      c.io.mem_debug_read_address.poke(4.U) // #1
      c.clock.step()
      c.io.mem_debug_read_data.expect(4.U)
      
      c.io.mem_debug_read_address.poke(8.U) // #2
      c.clock.step()
      c.io.mem_debug_read_data.expect("hffffffff".U)
      
      c.io.mem_debug_read_address.poke(12.U) // #3
      c.clock.step()
      c.io.mem_debug_read_data.expect(29.U)
    }
  }
}


Finally, use $ sbt test to ensure that our modifications have passed compilation.

[info] Find_stringTest:
[info] Single Cycle CPU
[info] - should execute calculate find_string

Modify the handwritten RISC-V assembly code in Homework2

  • made some modifications to the content before running it. I removed the code for ecall.
assembly code
.global main

.data
    t1_u: .word 0x0f000000        # upper bits of test data 1, test_data1[0~31]
    t1_l: .word 0x00000000        # lower bits of test data 2, test_data2[32~63]
    t2_u: .word 0x00000000
    t2_l: .word 0x00000000
    t3_u: .word 0x01234567
    t3_l: .word 0x89abcdef
    buffer: .word 0 
    
.text
main:
    # initial setting
    la    t0, t1_u                # load address of upper bits of test data 1 into s0 
    la    t1, t2_u
    la    t2, t3_u
    addi  sp, sp, -12
    sw    t0, 0(sp)
    sw    t1, 4(sp)
    sw    t2, 8(sp)
    add   s0, zero, t0            # s0 = & test_data_upper
    add   s1, zero, zero        # int i (used for test data loop control)            
    addi  s2, zero, 3            # upper bound of i (used for loop control)
    addi  s3, zero, 4
    addi  s4, zero, -1            # be used to do not operation
    
    addi  s10, zero, 3
    
main_for_loop:
    # call finding_string procedure
    #mv   a0, s0        # a0 = & test_data_1_upper
begin:
    jal   ra, fs
output:    
    mv    s5, a0
    
    addi  s9, s9, 1
    addi  s0, s0, 8
    blt   s9, s10, begin
    li    a7, 11
 
    j     Exit
    

fs:
    addi  sp, sp, -16
    sw    ra, 0(sp)
    sw    s1, 4(sp)
    sw    s2, 8(sp)
    sw    s3, 12(sp)
    #sw    s4, 16(sp)
    addi  s1, s0, 4    # s1 = & test_date_lower
    lw    a1, 0(s0)    # a1 = value of test_data upper
    lw    a2, 0(s1)    # a2 = value of test_date lower, test_data = [a1, a2]
    #li    s2, 0        # s2 = clz = 0
    li    s2, 0        # s2 = pos = 0   
   
x_equal_0_check:
    bne   a1, zero, x_not_equal_0
    bne   a2, zero, x_not_equal_0 
    
x_eqaul_0:
    addi  a0, zero, -1
    j     fs_end
    
x_not_equal_0:
    jal   ra, CLZ
    # x = x << clz
    li    t0, 32
    sub   t0, t0, a0    
    srl   a4, a2, t0 
    sll   a3, a1, a0
    or    a3, a3, a4    # a1 = a1 << clz 
    sll   a4, a2, a0    # a2 = a2 << clz, x([a3, a4]) = x([a1, a2]) << clz
    # pos = pos + clz
    add   s2, s2, a0
    # x = -x, [a3, a4] = - [a1, a2]
    xor   a1, a3, s4
    xor   a2, a4, s4
    jal   ra, CLZ
    # check: clz > n
    bge   a0, s3, finish
    ## < case
    # x = x << clz
    sub   t0, t0, a0    
    srl   a2, a4, t0 
    sll   a1, a3, a0
    or    a1, a1, a2    # a1 = a3 << clz 
    sll   a2, a4, a0    # a2 = a4 << clz, x([a1, a2]) = x([a3, a4]) << clz
    # pos = pos + clz
    add   s2, s2, a0
    j     x_equal_0_check
    ## >= base
finish:    
    mv    a0, s2
    j     fs_end
    
fs_end:
    lw    ra, 0(sp)
    lw    s1, 4(sp)
    lw    s2, 8(sp)
    lw    s3, 12(sp)
    addi  sp, sp, 16
    j  output
CLZ:
    addi  sp, sp, -4
    sw    ra, 0(sp)
    mv    t0, a1
    mv    t1, a2
    li    t4, 0x55555555
    li    t5, 0x33333333
    li    t6, 0x0f0f0f0f
    li    a5, 1
    li    a6, 32
loop:
    sub   a7, a6, a5
    srl   t3, t1, a5    # shift lower bits of test data right with n bit
    sll  t2, t0, a7   # shift upper bits of test data left with 31-n bits 
    or    t3, t2, t3   # combine to get new lower bits of test data
    srl   t2, t0, a5    # shift upper bound of test data right with n bit
    or    t0, t0, t2   # [0~31]x | [0~31](x >> n)
    or    t1, t1, t3   # [32~63]x | [32~63](x >> n)
    slli  a5, a5, 1
    beq   a5, a6, CE 
    j     loop
CE:
    # x |= (x>>32)
    li    t2, 0
    add   t3, t0, zero
    or    t0, t0, t2
    or    t1, t1, t3 
    
    # x -= ((x>>1) & 0x5555555555555555)
    ## [t2, t3] = x>>1 ([t0, t1]>>1)
    srli  t3, t1, 1    
    slli  t2, t0, 31   
    or    t3, t2, t3  
    srli  t2, t0, 1   
    ## (x>>1) & 0x5~
    and   t2, t2, t4
    and   t3, t3, t4        # [t2, t3] = (x>>1)&0x5~
    sub   t3, t1, t3    
    add   t1, t3, zero    # t1=t3
    sub   t0, t0, t2    # no underflow at lower bits, [t0, t1]=> x -= ((x>>1) & 0x5555555555555555)
    
    
    # x = ((x>>2)&0x333333333333333) + (x & 0x3333333333333333) 
    ## [t2, t3] = x>>2 ([t0, t1]>>2)
    srli    t3, t1, 2
    slli    t2, t0, 30
    or      t3, t3, t2
    srli    t2, t0, 2    # [t2, t3] = x>>2
    ## (x>>1) & 0x3~
    and     t2, t2, t5
    and     t3, t3, t5    # [t2, t3] = ((x>>2)&0x3~)
    ## x & 0x3~
    and     t0, t0, t5
    and     t1, t1, t5    # [t0, t1] = (x & 0x3~)    
    add     t1, t1, t3
    add     t0, t0, t2
    
    
    # x += ((x>>4)+x) & 0x0f~0f
    ## [t2, t3] = x>>4 ([t0, t1]>>4)
    srli    t3, t1, 4
    slli    t2, t0, 28
    or      t3, t3, t2
    srli    t2, t0, 4
    ## (x>>4) + x
    add t1, t1, t3
    add t0, t0, t2
    
    ## ((x>>4) + x) & 0x0f~0f
    and t0, t0, t6
    and t1, t1, t6

    # x += x(x>>8)
    srli    t3, t1, 8
    slli    t2, t0, 24
    or      t3, t3, t2
    srli    t2, t0, 8    # [t2, t3] = x>>8 
    add     t0, t0, t2
    add     t1, t1, t3
    
    
     # x += x(x>>16)
    srli    t3, t1, 16
    slli    t2, t0, 16
    or      t3, t3, t2
    srli    t2, t0, 16    # [t2, t3] = x>>8 
    add     t0, t0, t2
    add     t1, t1, t3
    
    # x += (x>>32)
    add     t3, t0, zero
    add     t2, zero, zero
    add     t0, t0, t2
    add     t1, t1, t3
    
    
    # 64 - (x & (0x7f))
    li      t4, 0x7f   
    li      a0, 64
    and     t1, t1, t4
    sub     a0, a0, t1
    #lw      ra, 0(sp)
    addi    sp, sp, 4
    jalr    ra
    

Exit:
    j Exit


Afterwards, perform the same test as above by checking the values inside the registers to ensure their correctness

class Find_stringTest_for_main_opt extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "execute calculate find_string for main_opt" in {
    test(new TestTopModule("main_opt.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 3000) {
        c.clock.step()
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }     
      c.io.regs_debug_read_address.poke(21.U) // #3
      c.clock.step()
      c.io.regs_debug_read_data.expect(0x1d.U)
    }
  }
}

GTKWave

verify

generate Wave

./run-verilator.sh -instruction csrc/main_opt.asmbin -time 4000 -vcd dump_opt.vcd
-time 4000



At the 3ns timestamp, it is evident from our inspection of register_21 that the observed value perfectly aligns with our expectations, being '29' in hexadecimal (0x1d).

Reference