# Assignment3: Single-Cycle RISC-V CPU
contributed by [kk908676](https://github.com/kk908676)
## Environment
#### 1. Install sbt
Follow the instructions from [here](https://www.scala-sbt.org/release/docs/Installing-sbt-on-Linux.html)
#### 2. Install Java (If you have not downloaded before)
$sudo apt install default-jdk
#### 3. Install the dependent packages
$ sudo apt install build-essential verilator gtkwave
#### 4. Chisel
$ git clone https://github.com/ucb-bar/chisel-tutorial
Before testing your system, ensure that you have sbt (the Scala build tool) installed.
```shell
$ sbt run
```
#### 5. Assignment
$ git clone https://github.com/sysprog21/ca2023-lab3
## Single-cycle RISC-V CPU
The following image is the single-cycle CPU that we will be implementing in this assignment.
* Full

* Simplifed

As the instructor has already completed the majority of the content, I only need to make modifications to the following file :
* InstructionFetch.scala
* InstructionDecode.scala
* Execute.scala
* CPU.scala
After completion, you can obtain the following information.
>[info] <font color="green">ExecuteTest:</font>
[info] <font color="green">Execution of Single Cycle CPU</font>
[info] <font color="green">- should execute correctly</font>
[info] <font color="green">ByteAccessTest:</font>
[info] <font color="green">Single Cycle CPU</font>
[info] <font color="green">- should store and load a single byte</font>
[info] <font color="green">FibonacciTest:</font>
[info] <font color="green">Single Cycle CPU</font>
[info] <font color="green">- should recursively calculate Fibonacci(10)</font>
[info] <font color="green">InstructionDecoderTest:</font>
[info] <font color="green">InstructionDecoder of Single Cycle CPU</font>
[info] <font color="green">- should produce correct control signal</font>
[info] <font color="green">QuicksortTest:</font>
[info] <font color="green">Single Cycle CPU</font>
[info] <font color="green">- should perform a quicksort on 10 numbers</font>
[info] <font color="green">InstructionFetchTest:</font>
[info] <font color="green">InstructionFetch of Single Cycle CPU</font>
[info] <font color="green">- should fetch instruction</font>
[info] <font color="green">RegisterFileTest:</font>
[info] <font color="green">Register File of Single Cycle CPU</font>
[info] <font color="green">- should read the written content</font>
[info] <font color="green">- should x0 always be zero</font>
[info] <font color="green">- should read the writing content</font>
[info] <font color="blue">Run completed in 20 seconds, 443 milliseconds.</font>
[info] <font color="blue">Total number of tests run: 9</font>
[info] <font color="blue">Suites: completed 7, aborted 0</font>
[info] <font color="blue">Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0</font>
>[info] <font color="green">All tests passed.</font>
#### The next task is to run the code from the previous assignment on this CPU.
* made some modifications to the content before running it. I removed the code for print.
* also added the process of placing the output into specific memory to ensure successful retrieval during testing
Below is the complete C code.
```scala
#include <stdint.h>
#include <stdio.h>
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
x -= ((x >> 1) & 0x5555555555555555);
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (64 - (x & 0x7f));
}
int find_string(uint64_t x, int n){
int clz; // leading zeros of x
int pos = 0; // position of fist '1' bit from significant bit
while(x != 0){
clz = count_leading_zeros(x);
x = x << clz;
pos = pos + clz;
clz = count_leading_zeros(~x);
if (clz >= n)
return pos;
x = x << clz;
pos = pos + clz;
}
return -1;
}
int main() {
uint64_t test_data[] = {0x0f00000000000000,
0x0000000000000000,
0x0123456789abcdef};
*((volatile int *) (4)) = find_string(test_data[0], 4);//4
*((volatile int *) (8)) = find_string(test_data[1], 4);//-1
*((volatile int *) (12)) = find_string(test_data[2], 4);//29
return 0;
}
```
Generating Waveform Files During Testing:
While running tests, if you set the environment variable `WRITE_VCD` to 1, waveform files will be generated.
```shell
$ WRITE_VCD=1 sbt test
```
### Verilator
After the first run and every time you modify the Chisel code, you need to execute the following command in the project's root directory to generate Verilog files:
```shell
$ make verilator
```
For example, load the `hello.asmbin` file, simulate for 1000 cycles, and save the simulation waveform to the `dump.vcd` file
```shell
$ ./run-verilator.sh -instruction src/main/resources/hello.asmbin
-time 2000 -vcd dump.vcd
```
### GTKWave
#### InstructionFetch
<a href="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/blob/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/IF.PNG">
<img src="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/raw/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/IF.PNG" width="100%" title="click to open the link" alt="InstructionFetch gtkwave"></a><br>
<br>Observing the pattern, it is evident that with each clock cycle, our program counter (pc) increments by 4.
#### InstructionDecode
<a href="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/blob/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/ID.PNG"><img src="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/raw/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/ID.PNG" width="100%" title="click to open the link" alt="InstructionDecode gtkwave"></a><br>
<br>Taking clock cycle 3 for example, the instruction is 0x22B7 (equivalent to 'lui x5, 2'). Therefore, the io_reg_write_address is 5, and io_ex_immediate is 2.
#### Execute
<a href="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/blob/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/EX.PNG"><img src="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/raw/7797f9e660678e4ab7b5aa0012737a04f239cde0/Single_img/EX.PNG" width="100%" title="click to open the link" alt="Execute gtkwave"></a><br>
<br>Here, with an opcode of 33 (equivalent to addu), the operation involves adding io_op1 (0x05B3E659) to io_op2 (0x0380F18C), and the result is stored in io_result (0x934D7E5).
Later, we need to add our own test data in **CPUTest.scala**.
```scala
class Find_stringTest extends AnyFlatSpec with ChiselScalatestTester {
behavior.of("Single Cycle CPU")
it should "execute calculate find_string" in {
test(new TestTopModule("main.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
for (i <- 1 to 50000) {
c.clock.step()
c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
}
c.io.mem_debug_read_address.poke(4.U) // #1
c.clock.step()
c.io.mem_debug_read_data.expect(4.U)
c.io.mem_debug_read_address.poke(8.U) // #2
c.clock.step()
c.io.mem_debug_read_data.expect("hffffffff".U)
c.io.mem_debug_read_address.poke(12.U) // #3
c.clock.step()
c.io.mem_debug_read_data.expect(29.U)
}
}
}
```
<br>Finally, use **$ sbt test** to ensure that our modifications have passed compilation.<br>
>[info] <font color="green">Find_stringTest:</font>
[info] <font color="green">Single Cycle CPU</font>
>[info] <font color="green">- should execute calculate find_string</font>
## Modify the handwritten RISC-V assembly code in Homework2
* made some modifications to the content before running it. I removed the code for ecall.
:::spoiler assembly code
```scala
.global main
.data
t1_u: .word 0x0f000000 # upper bits of test data 1, test_data1[0~31]
t1_l: .word 0x00000000 # lower bits of test data 2, test_data2[32~63]
t2_u: .word 0x00000000
t2_l: .word 0x00000000
t3_u: .word 0x01234567
t3_l: .word 0x89abcdef
buffer: .word 0
.text
main:
# initial setting
la t0, t1_u # load address of upper bits of test data 1 into s0
la t1, t2_u
la t2, t3_u
addi sp, sp, -12
sw t0, 0(sp)
sw t1, 4(sp)
sw t2, 8(sp)
add s0, zero, t0 # s0 = & test_data_upper
add s1, zero, zero # int i (used for test data loop control)
addi s2, zero, 3 # upper bound of i (used for loop control)
addi s3, zero, 4
addi s4, zero, -1 # be used to do not operation
addi s10, zero, 3
main_for_loop:
# call finding_string procedure
#mv a0, s0 # a0 = & test_data_1_upper
begin:
jal ra, fs
output:
mv s5, a0
addi s9, s9, 1
addi s0, s0, 8
blt s9, s10, begin
li a7, 11
j Exit
fs:
addi sp, sp, -16
sw ra, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
#sw s4, 16(sp)
addi s1, s0, 4 # s1 = & test_date_lower
lw a1, 0(s0) # a1 = value of test_data upper
lw a2, 0(s1) # a2 = value of test_date lower, test_data = [a1, a2]
#li s2, 0 # s2 = clz = 0
li s2, 0 # s2 = pos = 0
x_equal_0_check:
bne a1, zero, x_not_equal_0
bne a2, zero, x_not_equal_0
x_eqaul_0:
addi a0, zero, -1
j fs_end
x_not_equal_0:
jal ra, CLZ
# x = x << clz
li t0, 32
sub t0, t0, a0
srl a4, a2, t0
sll a3, a1, a0
or a3, a3, a4 # a1 = a1 << clz
sll a4, a2, a0 # a2 = a2 << clz, x([a3, a4]) = x([a1, a2]) << clz
# pos = pos + clz
add s2, s2, a0
# x = -x, [a3, a4] = - [a1, a2]
xor a1, a3, s4
xor a2, a4, s4
jal ra, CLZ
# check: clz > n
bge a0, s3, finish
## < case
# x = x << clz
sub t0, t0, a0
srl a2, a4, t0
sll a1, a3, a0
or a1, a1, a2 # a1 = a3 << clz
sll a2, a4, a0 # a2 = a4 << clz, x([a1, a2]) = x([a3, a4]) << clz
# pos = pos + clz
add s2, s2, a0
j x_equal_0_check
## >= base
finish:
mv a0, s2
j fs_end
fs_end:
lw ra, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
addi sp, sp, 16
j output
CLZ:
addi sp, sp, -4
sw ra, 0(sp)
mv t0, a1
mv t1, a2
li t4, 0x55555555
li t5, 0x33333333
li t6, 0x0f0f0f0f
li a5, 1
li a6, 32
loop:
sub a7, a6, a5
srl t3, t1, a5 # shift lower bits of test data right with n bit
sll t2, t0, a7 # shift upper bits of test data left with 31-n bits
or t3, t2, t3 # combine to get new lower bits of test data
srl t2, t0, a5 # shift upper bound of test data right with n bit
or t0, t0, t2 # [0~31]x | [0~31](x >> n)
or t1, t1, t3 # [32~63]x | [32~63](x >> n)
slli a5, a5, 1
beq a5, a6, CE
j loop
CE:
# x |= (x>>32)
li t2, 0
add t3, t0, zero
or t0, t0, t2
or t1, t1, t3
# x -= ((x>>1) & 0x5555555555555555)
## [t2, t3] = x>>1 ([t0, t1]>>1)
srli t3, t1, 1
slli t2, t0, 31
or t3, t2, t3
srli t2, t0, 1
## (x>>1) & 0x5~
and t2, t2, t4
and t3, t3, t4 # [t2, t3] = (x>>1)&0x5~
sub t3, t1, t3
add t1, t3, zero # t1=t3
sub t0, t0, t2 # no underflow at lower bits, [t0, t1]=> x -= ((x>>1) & 0x5555555555555555)
# x = ((x>>2)&0x333333333333333) + (x & 0x3333333333333333)
## [t2, t3] = x>>2 ([t0, t1]>>2)
srli t3, t1, 2
slli t2, t0, 30
or t3, t3, t2
srli t2, t0, 2 # [t2, t3] = x>>2
## (x>>1) & 0x3~
and t2, t2, t5
and t3, t3, t5 # [t2, t3] = ((x>>2)&0x3~)
## x & 0x3~
and t0, t0, t5
and t1, t1, t5 # [t0, t1] = (x & 0x3~)
add t1, t1, t3
add t0, t0, t2
# x += ((x>>4)+x) & 0x0f~0f
## [t2, t3] = x>>4 ([t0, t1]>>4)
srli t3, t1, 4
slli t2, t0, 28
or t3, t3, t2
srli t2, t0, 4
## (x>>4) + x
add t1, t1, t3
add t0, t0, t2
## ((x>>4) + x) & 0x0f~0f
and t0, t0, t6
and t1, t1, t6
# x += x(x>>8)
srli t3, t1, 8
slli t2, t0, 24
or t3, t3, t2
srli t2, t0, 8 # [t2, t3] = x>>8
add t0, t0, t2
add t1, t1, t3
# x += x(x>>16)
srli t3, t1, 16
slli t2, t0, 16
or t3, t3, t2
srli t2, t0, 16 # [t2, t3] = x>>8
add t0, t0, t2
add t1, t1, t3
# x += (x>>32)
add t3, t0, zero
add t2, zero, zero
add t0, t0, t2
add t1, t1, t3
# 64 - (x & (0x7f))
li t4, 0x7f
li a0, 64
and t1, t1, t4
sub a0, a0, t1
#lw ra, 0(sp)
addi sp, sp, 4
jalr ra
Exit:
j Exit
```
:::
<br>Afterwards, perform the same test as above by checking the values inside the registers to ensure their correctness<br>
```scala
class Find_stringTest_for_main_opt extends AnyFlatSpec with ChiselScalatestTester {
behavior.of("Single Cycle CPU")
it should "execute calculate find_string for main_opt" in {
test(new TestTopModule("main_opt.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
for (i <- 1 to 3000) {
c.clock.step()
c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
}
c.io.regs_debug_read_address.poke(21.U) // #3
c.clock.step()
c.io.regs_debug_read_data.expect(0x1d.U)
}
}
}
```
## GTKWave
### verify
generate Wave
```scala
./run-verilator.sh -instruction csrc/main_opt.asmbin -time 4000 -vcd dump_opt.vcd
-time 4000
```
<a href="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/blob/cedb0933d2a3ee319476a11d4059a91c849ae6a6/Single_img/redister_21.PNG"><img src="https://github.com/kk908676/Single-Cycle-RISC-V-CPU/raw/cedb0933d2a3ee319476a11d4059a91c849ae6a6/Single_img/redister_21.PNG" width="100%" title="click to open the link" alt="Execute gtkwave"></a><br>
<br>At the 3ns timestamp, it is evident from our inspection of register_21 that the observed value perfectly aligns with our expectations, being '29' in hexadecimal (0x1d).
## Reference
* [Lab3: Construct a single-cycle RISC-V CPU with Chisel](https://github.com/sysprog21/ca2023-lab3)