# Assignment3: Single-Cycle RISC-V CPU
contributed by < [fewletter](https://github.com/fewletter/ca2023-lab3) >
## Environment setup
### Operating System
I use Ubuntu Linux 20.04.1 as my operating system.
```shell
$ uname -a
Linux fewletter 5.15.0-89-generic #99~20.04.1-Ubuntu SMP Thu Nov 2 15:16:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```
### Install sbt
Following the commands in [lab3](https://hackmd.io/@sysprog/r1mlr3I7p#Install-sbt), I use SDKMAN to install sbt.
```shell
# Install sdkman
$ curl -s "https://get.sdkman.io" | bash
$ source "$HOME/.sdkman/bin/sdkman-init.sh"
# Install Eclipse Temurin JDK 11
$ sdk install java 11.0.21-tem
$ sdk install sbt
```
### Chisel Bootcamp - Local Installation in Mac/Linux
Follow the commands in [Local Installation - Mac/Linux](https://github.com/freechipsproject/chisel-bootcamp/blob/master/Install.md#local-installation---maclinux). Keep in mind that the commands above installed Eclipse Temurin JDK 11. However, the **Note** in Chisel Bootcamp says that JDK 8 should be installed to initialize the Chisel Bootcamp.
> Note: Make sure you are using Java 8 (NOT Java 9) and have the JDK8 installed. Coursier/jupyter-scala does not appear to be compatible with Java 9 yet as of January 2018.
Following the hint in the **Note**, I check which Java version I have.
```shell
$ java -version
openjdk version "11.0.21" 2023-10-17
OpenJDK Runtime Environment Temurin-11.0.21+9 (build 11.0.21+9)
OpenJDK 64-Bit Server VM Temurin-11.0.21+9 (build 11.0.21+9, mixed mode)
```
Since I installed JDK 11 on my system, the Java version is `11.0.21`. I then switch the Java version with the following commands.
```shell
$ sdk list java
| | 20.0.2 | tem | | 20.0.2-tem
| | 20.0.1 | tem | | 20.0.1-tem
| | 17.0.9 | tem | | 17.0.9-tem
| | 17.0.8 | tem | | 17.0.8-tem
| | 17.0.8.1 | tem | | 17.0.8.1-tem
| | 17.0.7 | tem | | 17.0.7-tem
| | 11.0.21 | tem | installed | 11.0.21-tem
| | 11.0.20 | tem | | 11.0.20-tem
| | 11.0.20.1 | tem | | 11.0.20.1-tem
| | 11.0.19 | tem | | 11.0.19-tem
| >>> | 8.0.392 | tem | installed | 8.0.392-tem
| | 8.0.382 | tem | | 8.0.382-tem
| | 8.0.372 | tem | | 8.0.372-tem
Tencent | | 17.0.9 | kona | | 17.0.9-kona
| | 17.0.8 | kona | | 17.0.8-kona
| | 17.0.7 | kona | | 17.0.7-kona
| | 11.0.21 | kona | | 11.0.21-kona
| | 11.0.20 | kona | | 11.0.20-kona
| | 11.0.19 | kona | | 11.0.19-kona
| | 8.0.392 | kona | | 8.0.392-kona
| | 8.0.382 | kona | | 8.0.382-kona
| | 8.0.372 | kona | | 8.0.372-kona
$ sdk install java 8.0.392-tem
$ sdk use java 8.0.392-tem
Using java version 8.0.392-tem in this shell.
$ sdk current
Using:
java: 8.0.392-tem
sbt: 1.9.7
```
Then open another terminal to start Jupyter Notebook.
```
$ cd chisel-bootcamp
/chisel-bootcamp$ mkdir -p ~/.jupyter/custom
/chisel-bootcamp$ cp source/custom.js ~/.jupyter/custom/custom.js
/chisel-bootcamp$ jupyter notebook
```
Take **Module 2.2: Combinational Logic** as an example.
![2023-11-28 16-08-33 screenshot](https://hackmd.io/_uploads/Hk8GxmQrp.png)
It seems to work well.
## Single-Cycle RISC-V CPU in Chisel
Four files, `InstructionFetch.scala`, `InstructionDecode.scala`, `Execute.scala`, and `CPU.scala`, need to be filled in so that the tests pass.
![image](https://hackmd.io/_uploads/Bk5njdXHT.png)
### InstructionFetch
In this part, there are two outputs, `InsAddr` and `Ins`. `InsAddr` depends on whether a branch or jump is signaled by the execute phase. `Ins` is the instruction read from memory at that address.
![2023-11-28 22-40-25 screenshot](https://hackmd.io/_uploads/rySg2O7Hp.png)
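The next-address selection described above can be sketched as a small software model. This is only an illustrative Python sketch, not the Chisel code from the lab: the names `next_pc`, `jump_flag`, and `jump_address` are assumptions, and the 4-byte step assumes fixed-width RV32I instructions.

```python
def next_pc(pc: int, jump_flag: bool, jump_address: int) -> int:
    """Model of the fetch-stage address update for one clock cycle."""
    if jump_flag:
        # The execute phase requested a branch or jump: take its target.
        return jump_address & 0xFFFFFFFF
    # Otherwise fall through to the next sequential 32-bit instruction.
    return (pc + 4) & 0xFFFFFFFF


print(hex(next_pc(0x1000, False, 0x2000)))
print(hex(next_pc(0x1000, True, 0x2000)))
```

In the waveform, the same behavior should appear as the instruction address either increasing by 4 or loading the jump target on the cycle after the jump flag is raised.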
To validate the result, the following command generates the `.vcd` file so that the waveform can be viewed.
```shell
fewletter@fewletter:~/ca2023-lab3$ WRITE_VCD=1 sbt "testOnly riscv.singlecycle.InstructionFetchTest"
```
The waveform is shown below.
![2023-11-28 23-23-41 screenshot](https://hackmd.io/_uploads/rkqf8YXBa.png)
### InstructionDecode
In this part, the main idea is to parse the fields of the instruction, as shown in the figure below.
![2023-11-29 15-16-59 screenshot](https://hackmd.io/_uploads/ByttSv4BT.png)
Therefore, to fill in the code, we should focus on the L-type (load) and S-type (store) instructions. Only these two types allow an instruction to read from or write to memory.
![2023-11-28 23-07-33 screenshot](https://hackmd.io/_uploads/HyjSfFXBp.png)
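As a rough illustration of the point above, the two memory enables can be derived from the 7-bit opcode alone. The sketch below is a Python model under that assumption, not the lab's Chisel decoder. The function name `memory_enables` is hypothetical, while the opcode values are the standard RV32I ones for loads and stores.

```python
OPCODE_LOAD = 0b0000011   # lb, lh, lw, lbu, lhu
OPCODE_STORE = 0b0100011  # sb, sh, sw


def memory_enables(instruction: int) -> tuple:
    """Return (memory_read_enable, memory_write_enable) for a 32-bit instruction."""
    opcode = instruction & 0x7F  # bits [6:0]
    return opcode == OPCODE_LOAD, opcode == OPCODE_STORE


# 0x00052303 encodes `lw t1, 0(a0)`; 0x00a02223 is a store word.
print(memory_enables(0x00052303))
print(memory_enables(0x00A02223))
```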
According to the following code, `InstructionDecoderTest` only tests S-type, U-type, and R-type instructions, so `io.memory_read_enable` is never exercised, as the waveform shows.
```scala
...
c.io.instruction.poke(0x00a02223L.U) // S-type
c.io.ex_aluop1_source.expect(ALUOp1Source.Register)
c.io.ex_aluop2_source.expect(ALUOp2Source.Immediate)
c.io.regs_reg1_read_address.expect(0.U)
c.io.regs_reg2_read_address.expect(10.U)
c.clock.step()
c.io.instruction.poke(0x000022b7L.U) // lui
c.io.regs_reg1_read_address.expect(0.U)
c.io.ex_aluop1_source.expect(ALUOp1Source.Register)
c.io.ex_aluop2_source.expect(ALUOp2Source.Immediate)
c.clock.step()
c.io.instruction.poke(0x002081b3L.U) // add
c.io.ex_aluop1_source.expect(ALUOp1Source.Register)
c.io.ex_aluop2_source.expect(ALUOp2Source.Register)
c.clock.step()
```
![2023-11-29 15-20-48 screenshot](https://hackmd.io/_uploads/SyrvIPVHT.png)
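To double-check which instruction types the test pokes, the three instruction words can be split into fields by hand. The following Python sketch does this with plain bit slicing; `decode_fields` is a hypothetical helper, and the field positions are the standard RV32I ones.

```python
def decode_fields(instruction: int) -> dict:
    """Slice the fixed-position fields out of a 32-bit RV32I instruction word."""
    return {
        "opcode": instruction & 0x7F,        # bits [6:0]
        "rd": (instruction >> 7) & 0x1F,     # bits [11:7]
        "funct3": (instruction >> 12) & 0x7, # bits [14:12]
        "rs1": (instruction >> 15) & 0x1F,   # bits [19:15]
        "rs2": (instruction >> 20) & 0x1F,   # bits [24:20]
    }


for word in (0x00A02223, 0x000022B7, 0x002081B3):
    fields = decode_fields(word)
    print(f"{word:#010x}: opcode={fields['opcode']:#04x} "
          f"rs1={fields['rs1']} rs2={fields['rs2']}")
```

The opcodes come out as `0x23` (store), `0x37` (lui), and `0x33` (register-register ALU). None of them is the load opcode `0x03`, which is consistent with the read enable staying low throughout the test.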
### Execute
To finish this part, it is important to parse the input instruction. In the figure below, the ALU operation is determined by the `ALUFunct` signal, and the ALU inputs depend on `ALUOp1Src` and `ALUOp2Src`, which select between the register data and the instruction address or the immediate.
![2023-11-28 23-41-46 screenshot](https://hackmd.io/_uploads/HypBcYXBa.png)
The waveform shows the crucial part of the Execute phase: the ALU operation depends on `ALUFunct`.
![2023-11-30 15-02-08 screenshot](https://hackmd.io/_uploads/Hy2tX2SHa.png)
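The operand selection can be summarized with a small Python model. This is a sketch under assumptions, not the lab's implementation: the constant names and the encoding (0 selects the register, 1 selects the alternative source) are assumed here for illustration.

```python
# Assumed encodings for the two 1-bit select signals (illustrative only).
ALU_OP1_REGISTER, ALU_OP1_INSTRUCTION_ADDRESS = 0, 1
ALU_OP2_REGISTER, ALU_OP2_IMMEDIATE = 0, 1


def select_alu_operands(reg1_data, reg2_data, instruction_address, immediate,
                        aluop1_source, aluop2_source):
    """Pick the two ALU inputs the way a pair of 2-to-1 multiplexers would."""
    op1 = instruction_address if aluop1_source == ALU_OP1_INSTRUCTION_ADDRESS else reg1_data
    op2 = immediate if aluop2_source == ALU_OP2_IMMEDIATE else reg2_data
    return op1, op2


# Register-register instruction: both operands come from the register file.
print(select_alu_operands(7, 9, 0x1000, 16, ALU_OP1_REGISTER, ALU_OP2_REGISTER))
# Branch/jump target computation: instruction address plus immediate.
print(select_alu_operands(7, 9, 0x1000, 16, ALU_OP1_INSTRUCTION_ADDRESS, ALU_OP2_IMMEDIATE))
```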
### CPU
To finish this part, it is important to figure out the inputs and outputs of the different phases of the CPU. Looking at `CPU.scala`, it is clear that the wiring for the execute phase is missing.
![2023-11-28 23-41-46 screenshot](https://hackmd.io/_uploads/HypBcYXBa.png)
So, to complete the file, it is necessary to look at the execute phase of the CPU. There is no need to care about how the outputs are derived from the inputs, because `Execute.scala` already handles that. Instead, the focus is on which other phases the inputs come from.
```scala
val io = IO(new Bundle {
  val instruction         = Input(UInt(Parameters.InstructionWidth))
  val instruction_address = Input(UInt(Parameters.AddrWidth))
  val reg1_data           = Input(UInt(Parameters.DataWidth))
  val reg2_data           = Input(UInt(Parameters.DataWidth))
  val immediate           = Input(UInt(Parameters.DataWidth))
  val aluop1_source       = Input(UInt(1.W))
  val aluop2_source       = Input(UInt(1.W))
  val mem_alu_result      = Output(UInt(Parameters.DataWidth))
  val if_jump_flag        = Output(Bool())
  val if_jump_address     = Output(UInt(Parameters.DataWidth))
})
```
Take `sb.S` as an example. This program is used to test whether register `t0` holds the value `0xDEADBEEF` and register `s2` holds the value `0x15`.
```
# mycpu is freely redistributable under the MIT License. See the file
# "LICENSE" for information on usage and redistribution of this file.
.global _start
_start:
li a0, 0x4
li t0, 0xDEADBEEF
sb t0, 0(a0)
lw t1, 0(a0)
li s2, 0x15
sb s2, 1(a0)
lw ra, 0(a0)
loop:
j loop
```
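To see what the byte stores in `sb.S` should leave in memory, the program can be replayed on a tiny software memory. The Python sketch below is only a model: `ByteMemory` is a hypothetical helper, and it assumes a little-endian memory that starts out zeroed.

```python
class ByteMemory:
    """Minimal little-endian byte-addressable memory, initially all zeros."""

    def __init__(self, size: int = 64) -> None:
        self.data = bytearray(size)

    def store_byte(self, address: int, value: int) -> None:
        # `sb` writes only the lowest 8 bits of the source register.
        self.data[address] = value & 0xFF

    def load_word(self, address: int) -> int:
        # `lw` reads four consecutive bytes as one little-endian word.
        return int.from_bytes(self.data[address:address + 4], "little")


mem = ByteMemory()
a0 = 0x4
t0 = 0xDEADBEEF
mem.store_byte(a0 + 0, t0)   # sb t0, 0(a0)
t1 = mem.load_word(a0)       # lw t1, 0(a0)
s2 = 0x15
mem.store_byte(a0 + 1, s2)   # sb s2, 1(a0)
ra = mem.load_word(a0)       # lw ra, 0(a0)
print(hex(t1), hex(ra))
```

Under these assumptions only the low byte `0xEF` of `t0` reaches memory, and the second store lands in the next byte of the same word.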
Therefore, focus on how the execute phase gets its inputs in the following two examples:
* `li t0, 0xDEADBEEF`
![2023-11-30 17-54-26 screenshot](https://hackmd.io/_uploads/Hyo1n0SHT.png)
* `li s2, 0x15`
![2023-11-30 17-55-03 screenshot](https://hackmd.io/_uploads/Skef2AHHT.png)
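Because `li` is a pseudo-instruction, the execute phase never sees it directly. A common expansion is `lui` followed by `addi` for large constants and a single `addi` for small ones. The Python sketch below computes that split; `expand_li` is a hypothetical helper, and the exact sequence emitted by the assembler used in the lab is an assumption.

```python
def expand_li(value: int) -> tuple:
    """Split a 32-bit constant into (lui upper 20 bits, addi signed 12-bit immediate)."""
    value &= 0xFFFFFFFF
    low = value & 0xFFF
    if low >= 0x800:
        # addi sign-extends its immediate, so the low part becomes negative
        # and the upper part must be rounded up to compensate.
        low -= 0x1000
    upper = ((value - low) >> 12) & 0xFFFFF
    return upper, low


for constant in (0xDEADBEEF, 0x15):
    upper, low = expand_li(constant)
    print(f"{constant:#x}: lui {upper:#x}, addi {low}")
```

For `0xDEADBEEF` this gives an upper part of `0xdeadc` and a low immediate of `-273`, while `0x15` needs no upper part at all, so the execute phase would see an immediate-only operation in the second example.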
## Run HW2 on Mycpu
### Setup
The code from [HW2](https://hackmd.io/@fewletter/riscvtoolchain) doesn't fit the ISA supported by Mycpu, so I remove `get_cycles` and the system calls from the assembly code.
:::spoiler modify code
```c
.org 0
# Provide program starting address to linker
.global _start
.data
data_1: .word 0x12345678
data_2: .word 0xffffdddd
mask_1: .word 0x55555555
mask_2: .word 0x33333333
mask_3: .word 0x0f0f0f0f
.text
_start:
lw s0, data_1 #s0 = A
lw s1, data_2 #s1 = B
mv a0, s0
jal ra, CLZ
mv t5, a0 #A's CLZ -> t5
mv a0, s1
jal ra, CLZ
mv t6, a0 #B's CLZ -> t6
slt t0, t5, t6 # if A's zero less than B's, t0=1
li a0, 32
jal ra, get_cycles
mv a4, a3
bne t0, zero, start_mul
start_mul:
#reset
mv t0, s0 #A ^= B;
mv s0, s1 #B ^= A;
mv s1, t0 #A ^= B;
mv t6, t5
sub a0, a0, t6
li t0, 0
li t1, 0
li t2, 0
li s2, 0 #s2: high 32 of number
li s3, 0 #s3: low 32 of number
li s4, 0 #used to check how many bit should shift
int_mul:
slt t1, s4, a0
beq t1, zero, exit
srl t0, s1, s4
andi t0, t0, 0x00000001 #check B's rightest bit
beq t0, zero, skip #if(rightest bit is zero) jump
sll s5,s0,s4 #s0 is A,S5 the low bit i want
li t2, 32
sub t2, t2, s4
srl s6, s0, t2 #s0 is A, S6 the high bit i want
add s7, s3, s5 #s7 is 32_low + low bit i want
sltu t3, s7, s3
mv s3, s7
beq t3, zero, no_overflow
# if not jump --> overflow
add s2, s2, s6
addi s2, s2, 1
addi s4, s4 ,1
no_overflow:
add s2, s2, s6
jal skip
skip:
addi s4, s4 ,1
jal int_mul
CLZ:
#a0: the num(x) you want to count CLZ
#t0: shifted x
srli t0, a0, 1 # t0 = x >> 1
or a0, a0, t0 # x |= x >> 1
srli t0, a0, 2 # t0 = x >> 2
or a0, a0, t0 # x |= x >> 2
srli t0, a0, 4 # t0 = x >> 4
or a0, a0, t0 # x |= x >> 4
srli t0, a0, 8 # t0 = x >> 8
or a0, a0, t0 # x |= x >> 8
srli t0, a0, 16 # t0 = x >> 16
or a0, a0, t0 # x |= x >> 16
#start_mask
lw t2, mask_1
srli t0, a0, 1 # t0 = x >> 1
and t1, t0, t2 # t1 = (x >> 1) & mask1
sub a0, a0, t1 # x -= ((x >> 1) & mask1)
lw t2, mask_2 # load mask2 to t2
srli t0, a0, 2 # t0 = x >> 2
and t1, t0, t2 # (x >> 2) & mask2
and a0, a0, t2 # x & mask2
add a0, t1, a0 # ((x >> 2) & mask2) + (x & mask2)
srli t0, a0, 4 # t0 = x >> 4
add a0, a0, t0 # x + (x >> 4)
lw t2, mask_3 # load mask3 to t2
and a0, a0, t2 # ((x >> 4) + x) & mask4
srli t0, a0, 8 # t0 = x >> 8
add a0, a0, t0 # x += (x >> 8)
srli t0, a0, 16 # t0 = x >> 16
add a0, a0, t0 # x += (x >> 16)
andi t0, a0, 0x3f # t0 = x & 0x3f
li a0, 32 # a0 = 32
sub a0, a0, t0 # 32 - (x & 0x3f)
ret
exit:
j exit
```
:::
Then change the Makefile in `ca2023/csrc`. The Makefile generates the `.asmbin` file directly from the `.elf`.
```diff
...
BINS = \
fibonacci.asmbin \
hello.asmbin \
mmio.asmbin \
quicksort.asmbin \
sb.asmbin \
+ mul_clz.asmbin
...
```
Every time `mul_clz.S` is modified, the following commands generate a new `.asmbin` file and update the copy under `src/main/resources`.
```
csrc$ make
riscv-none-elf-as -R -march=rv32i_zicsr -mabi=ilp32 -o mul_clz.o mul_clz.S
mul_clz.S: Assembler messages:
mul_clz.S: Warning: end of file not at end of a line; newline inserted
riscv-none-elf-ld -o mul_clz.elf -T link.lds --oformat=elf32-littleriscv mul_clz.o
riscv-none-elf-objcopy -O binary -j .text -j .data mul_clz.elf mul_clz.asmbin
rm mul_clz.elf
csrc$ make update
cp -f fibonacci.asmbin hello.asmbin mmio.asmbin quicksort.asmbin sb.asmbin mul_clz.asmbin ../src/main/resources
```
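Since `objcopy -O binary` writes a raw image, a generated `.asmbin` can be inspected as a sequence of 32-bit words. The Python sketch below shows one way to do that. `words_from_asmbin` is a hypothetical helper, and it assumes a little-endian image (as `elf32-littleriscv` suggests), ignoring any trailing bytes that do not fill a whole word.

```python
import struct


def words_from_asmbin(image: bytes) -> list:
    """Interpret a raw binary image as little-endian 32-bit words."""
    usable = len(image) - len(image) % 4  # drop a partial trailing word, if any
    return [word for (word,) in struct.iter_unpack("<I", image[:usable])]


# Two instruction words laid out as they would appear in the file.
sample = bytes.fromhex("2322a000" "b7220000")
print([f"{word:#010x}" for word in words_from_asmbin(sample)])
```

For a real file, the bytes could be read with `open("mul_clz.asmbin", "rb").read()` and passed to the same function, which makes it easy to compare the image with the instruction values seen in the waveform.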
### Run and Debug HW2 on CPUTest
To test the assembly code of HW2, I prepare a `mul_clzTest` to check whether the result is stored in registers `s2` and `s3` correctly. If the result is correct, the test should fail, because the real value isn't `0x0`.
```scala
class mul_clzTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "multiply two numbers with counting leading zeros" in {
    test(new TestTopModule("mul_clz.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 500) {
        c.clock.step()
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      c.io.regs_debug_read_address.poke(18.U) // s2
      c.io.regs_debug_read_data.expect(0x0.U)
      c.io.regs_debug_read_address.poke(19.U) // s3
      c.io.regs_debug_read_data.expect(0x0.U)
    }
  }
}
```
Here is the result. The test passes, which doesn't match my expectation, so I start looking for the problem.
```
$ WRITE_VCD=1 sbt test
...
[info] mul_clzTest:
[info] Single Cycle CPU
[info] - should multiply two numbers with counting leading zeros
[info] ByteAccessTest:
[info] Single Cycle CPU
[info] - should store and load a single byte
[info] FibonacciTest:
[info] Single Cycle CPU
[info] - should recursively calculate Fibonacci(10)
[info] ExecuteTest:
[info] Execution of Single Cycle CPU
[info] - should execute correctly
[info] QuicksortTest:
[info] Single Cycle CPU
[info] - should perform a quicksort on 10 numbers
[info] RegisterFileTest:
[info] Register File of Single Cycle CPU
[info] - should read the written content
[info] - should x0 always be zero
[info] - should read the writing content
[info] Run completed in 13 seconds, 333 milliseconds.
[info] Total number of tests run: 10
[info] Suites: completed 8, aborted 0
[info] Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 18 s, completed Nov 30, 2023, 4:45:22 PM
```
I modify the assembly code to only count the leading zeros of the data, and modify the **CPUTest** to check whether registers `t5` and `t6` are `0x3` and `0x7`.
:::spoiler modify code
```
.org 0
# Provide program starting address to linker
.global _start
.data
data_1: .word 0x12345678
data_2: .word 0xffffdddd
mask_1: .word 0x55555555
mask_2: .word 0x33333333
mask_3: .word 0x0f0f0f0f
.text
_start:
lw s0, data_1 #s0 = A
lw s1, data_2 #s1 = B
mv a0, s0
jal ra, CLZ
mv t5, a0 #A's CLZ -> t5
mv a0, s1
jal ra, CLZ
mv t6, a0 #B's CLZ -> t6
slt t0, t5, t6 # if A's zero less than B's, t0=1
loop:
j loop
CLZ:
#a0: the num(x) you want to count CLZ
#t0: shifted x
srli t0, a0, 1 # t0 = x >> 1
or a0, a0, t0 # x |= x >> 1
srli t0, a0, 2 # t0 = x >> 2
or a0, a0, t0 # x |= x >> 2
srli t0, a0, 4 # t0 = x >> 4
or a0, a0, t0 # x |= x >> 4
srli t0, a0, 8 # t0 = x >> 8
or a0, a0, t0 # x |= x >> 8
srli t0, a0, 16 # t0 = x >> 16
or a0, a0, t0 # x |= x >> 16
#start_mask
lw t2, mask_1
srli t0, a0, 1 # t0 = x >> 1
and t1, t0, t2 # t1 = (x >> 1) & mask1
sub a0, a0, t1 # x -= ((x >> 1) & mask1)
lw t2, mask_2 # load mask2 to t2
srli t0, a0, 2 # t0 = x >> 2
and t1, t0, t2 # (x >> 2) & mask2
and a0, a0, t2 # x & mask2
add a0, t1, a0 # ((x >> 2) & mask2) + (x & mask2)
srli t0, a0, 4 # t0 = x >> 4
add a0, a0, t0 # x + (x >> 4)
lw t2, mask_3 # load mask3 to t2
and a0, a0, t2 # ((x >> 4) + x) & mask4
srli t0, a0, 8 # t0 = x >> 8
add a0, a0, t0 # x += (x >> 8)
srli t0, a0, 16 # t0 = x >> 16
add a0, a0, t0 # x += (x >> 16)
andi t0, a0, 0x3f # t0 = x & 0x3f
li a0, 32 # a0 = 32
sub a0, a0, t0 # 32 - (x & 0x3f)
ret
```
:::
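As a cross-check for the register values, the `CLZ` routine above can be mirrored step by step in software. The Python sketch below follows the same smear-then-popcount structure. It is a reference model under the assumption that it matches the assembly instruction for instruction, and `clz32` is a hypothetical name.

```python
def clz32(x: int) -> int:
    """Count leading zeros of a 32-bit value using bit smearing and a popcount."""
    x &= 0xFFFFFFFF
    # Smear the highest set bit into every lower position.
    x |= x >> 1
    x |= x >> 2
    x |= x >> 4
    x |= x >> 8
    x |= x >> 16
    # Population count of the smeared value.
    x -= (x >> 1) & 0x55555555
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
    x = ((x >> 4) + x) & 0x0F0F0F0F
    x += x >> 8
    x += x >> 16
    return 32 - (x & 0x3F)


for data in (0x12345678, 0xFFFFDDDD):
    print(f"clz32({data:#010x}) = {clz32(data)}")
```

Under this model, `data_1` yields 3, and `data_2` yields 0 because its most significant bit is already set.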
**CPUTest**
```
...
c.io.regs_debug_read_address.poke(30.U) // t5
c.io.regs_debug_read_data.expect(0x3.U)
c.io.regs_debug_read_address.poke(31.U) // t6
c.io.regs_debug_read_data.expect(0x7.U)
...
```
Here is the result. The test still does not pass.
```
[info] mul_clzTest:
[info] Single Cycle CPU
[info] - should multiply two numbers with counting leading zeros *** FAILED ***
[info] io_regs_debug_read_data=0 (0x0) did not equal expected=7 (0x7) (lines in CPUTest.scala: 128, 120) (CPUTest.scala:128)
```
So I view the waveform to see how these two lines behave.
```
mv a0, s0
jal ra, CLZ
mv t5, a0 #A's CLZ -> t5 <-- 1
mv a0, s1
jal ra, CLZ
mv t6, a0 #B's CLZ -> t6 <-- 2
```
At 433 ns, the value in `a0` (`register_10`) is moved to `t5` (`register_30`), but `t6` (`register_31`) is always zero. Therefore, it seems that the bug lies in these instructions.
![2023-11-30 18-40-50 screenshot](https://hackmd.io/_uploads/H1r0IyIB6.png)
#### Time matters
Since the value in a specific register isn't right, I view the waveform to see what happens.
![2023-11-30 23-48-45 screenshot](https://hackmd.io/_uploads/H1SzeVUHa.png)
The last instruction executed in the **CPUTest** is `0c030c63`, which decodes to `beq t1, zero, exit`. This means that **only part of the assembly code is executed**, which is why I can't get the right result in the specific register. Finally, I increase the number of clock steps in the **CPUTest**, so the assembly code passes the test with the correct value.
```diff
class mul_clzTest extends AnyFlatSpec with ChiselScalatestTester {
behavior.of("Single Cycle CPU")
it should "multiply two numbers with counting leading zeros" in {
test(new TestTopModule("mul_clz.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
- for (i <- 1 to 500) {
+ for (i <- 1 to 5000) {
c.clock.step()
c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
}
c.io.regs_debug_read_address.poke(30.U) // t5
c.io.regs_debug_read_data.expect(0x3.U)
+ c.io.regs_debug_read_address.poke(18.U) // s2
+ c.io.regs_debug_read_data.expect(0x1234540a.U)
}
}
}
```
The whole assembly program takes 3577 ns to complete, and the result is the same as the value I got in [HW2](https://hackmd.io/ncRkOZMfQlq9zJMb1hFgZQ?view#Print-in-HEX-form).
![2023-11-30 23-50-00 screenshot](https://hackmd.io/_uploads/SkgEzVLBa.png)
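The expected contents of `s2` and `s3` can also be cross-checked with a software shift-and-add multiplier. The Python sketch below is a reference model of a 32x32 to 64-bit unsigned multiply kept in two 32-bit halves. It is not a transcription of the assembly (it always scans all 32 bits and does not swap the operands), and `mul32_shift_add` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF


def mul32_shift_add(a: int, b: int) -> tuple:
    """Multiply two unsigned 32-bit values, returning (high, low) 32-bit words."""
    high = 0
    low = 0
    for bit in range(32):
        if (b >> bit) & 1:
            # Split the shifted multiplicand into its low and high words.
            partial_low = (a << bit) & MASK32
            partial_high = (a >> (32 - bit)) & MASK32 if bit else 0
            total = low + partial_low
            low = total & MASK32
            # Propagate the carry out of the low word into the high word.
            high = (high + partial_high + (total >> 32)) & MASK32
    return high, low


high, low = mul32_shift_add(0x12345678, 0xFFFFDDDD)
print(f"high = {high:#010x}, low = {low:#010x}")
```

For the two data words used here, the high word from this model is `0x1234540a`, which agrees with the value expected for `s2` in the test.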