Assignment3: Single-Cycle RISC-V CPU
contributed by < fewletter >
Environment setup
Operating System
I use Ubuntu Linux 20.04.1 as my operating system.
Install sbt
Follow the commands in lab3, which use SDKMAN to install sbt.
Chisel Bootcamp - Local Installation in Mac/Linux
Follow the commands in Local Installation - Mac/Linux. Note that the commands above have already installed Eclipse Temurin JDK 11. However, the Note in Chisel Bootcamp says that JDK 8 should be installed to initialize the Chisel Bootcamp.
Note: Make sure you are using Java 8 (NOT Java 9) and have the JDK8 installed. Coursier/jupyter-scala does not appear to be compatible with Java 9 yet as of January 2018.
Following the hint in the Note, I check which Java version I have.
Since I installed JDK 11 on my system, the Java version is 11.0.21. I then attempt to change the Java version with the following command.
Then I open another terminal to start Jupyter Notebook.
Take Module 2.2: Combinational Logic as an example.
It seems to work well.
Single-Cycle RISC-V CPU in Chisel
Four files, InstructionFetch.scala, InstructionDecode.scala, Execute.scala, and CPU.scala, need to be filled in with code so that the tests pass.
InstructionFetch
In this part, there are two outputs, InsAddr and Ins. InsAddr depends on whether a branch is detected in the execute phase. Ins carries the instruction data that has been read.
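The address selection described above can be sketched in Python. This is a minimal behavioral model, not the Chisel code from the lab: the function name, the parameter names, and the stall behavior (hold the address and emit a NOP while the instruction is not valid) are all assumptions made for illustration.

```python
NOP = 0x00000013  # addi x0, x0, 0

def fetch_step(pc, jump_flag, jump_address, instruction_valid, instruction_read_data):
    """Return (next_pc, instruction) for one clock cycle of a hypothetical fetch stage."""
    if not instruction_valid:
        # Assumption: hold the address and feed a NOP while no valid instruction arrives.
        return pc, NOP
    if jump_flag:
        # A branch or jump was detected in the execute phase.
        next_pc = jump_address & 0xFFFFFFFF
    else:
        # Sequential execution: every RV32I instruction is 4 bytes wide.
        next_pc = (pc + 4) & 0xFFFFFFFF
    return next_pc, instruction_read_data
```

Calling fetch_step once per simulated cycle and feeding next_pc back in as pc mimics how the address register would evolve over time.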
To validate the result, the following command generates the .vcd file, which can then be opened to view the waveform.
The waveform shows that
InstructionDecode
In this part, the main idea is to parse the fields of the instruction, as shown in the figure.
Therefore, to fill in the code, we should focus on the L (load) and S (store) instruction types. These are the two types that let an instruction read from or write to memory.
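As a rough illustration of that idea, the two memory enables can be derived from the opcode field alone. The sketch below is a Python model, not the lab's Chisel decoder; the function name memory_enables is hypothetical, while the opcode values are the standard RV32I load and store opcodes.

```python
OPCODE_LOAD = 0b0000011   # lb, lh, lw, lbu, lhu
OPCODE_STORE = 0b0100011  # sb, sh, sw

def memory_enables(instruction):
    """Return (memory_read_enable, memory_write_enable) for a 32-bit instruction word."""
    opcode = instruction & 0x7F  # the opcode lives in bits 6..0
    return opcode == OPCODE_LOAD, opcode == OPCODE_STORE
```

For example, the encoding of lw t1, 0(a0) is 0x00052303 and enables only the read, while sb t0, 0(a0) is 0x00550023 and enables only the write.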
Based on the following code, InstructionDecoderTest only tests S-type, U-type, and R-type instructions, so io.memory_read_enable is never exercised, as the waveform shows.
Execute
To finish this part, it is important to parse the input instruction. In the figure below, the ALU operation is determined by the ALUFunct signal, and ALUOp1Src and ALUOp2Src determine whether the ALU inputs are register data or the instruction address and the immediate.
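The operand selection can be modeled with two small Python functions. This is only a sketch under stated assumptions: the encoding of the source selects (0 for register data, 1 for instruction address or immediate) and the string names of the ALU functions are hypothetical and may not match the constants used in Execute.scala.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit value as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def select_operands(aluop1_source, aluop2_source,
                    instruction_address, reg1_data, reg2_data, immediate):
    # Assumed encoding: 1 selects the instruction address / immediate, 0 the register data.
    op1 = instruction_address if aluop1_source == 1 else reg1_data
    op2 = immediate if aluop2_source == 1 else reg2_data
    return op1 & MASK32, op2 & MASK32

def alu(funct, op1, op2):
    """Compute one ALU result; funct plays the role of the ALUFunct signal."""
    shamt = op2 & 0x1F  # RV32I shifts use only the low 5 bits of the second operand
    if funct == "add":
        result = op1 + op2
    elif funct == "sub":
        result = op1 - op2
    elif funct == "and":
        result = op1 & op2
    elif funct == "or":
        result = op1 | op2
    elif funct == "xor":
        result = op1 ^ op2
    elif funct == "sll":
        result = op1 << shamt
    elif funct == "srl":
        result = (op1 & MASK32) >> shamt
    elif funct == "sra":
        result = to_signed(op1) >> shamt
    elif funct == "slt":
        result = int(to_signed(op1) < to_signed(op2))
    elif funct == "sltu":
        result = int((op1 & MASK32) < (op2 & MASK32))
    else:
        raise ValueError(f"unknown ALU function: {funct}")
    return result & MASK32
```

A branch target, for instance, would use the instruction address and the immediate as operands with the "add" function, while an R-type instruction would use both register values.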
The waveform shows the crucial part of the execute phase: the ALU operation depends on ALUFunct.
CPU
To finish this part, it is important to figure out the inputs and outputs of the different phases of the CPU. A first look at CPU.scala shows that the execute phase is missing.
So, to complete the file, it is necessary to look at the execute phase of the CPU. There is no need to care about the relationship between its outputs and inputs, because Execute.scala already takes care of that. Instead, the focus should be on how its inputs come from the other phases.
Take sb.S as an example. The test checks whether register t0 holds the value 0xDEADBEEF and register s2 holds the value 0x15.
Therefore, focus on how the execute phase gets its inputs in the following two examples:
- li t0, 0xDEADBEEF
- li s2, 0x15
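Since li is a pseudo-instruction, it helps to see which immediates actually reach the execute phase. The Python sketch below splits a 32-bit constant into the upper 20 bits for lui and the signed lower 12 bits for addi. It assumes the common lui + addi expansion; the exact sequence emitted by a particular assembler may differ, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def expand_li(value):
    """Split a 32-bit constant into (hi, lo) for `lui rd, hi` followed by `addi rd, rd, lo`."""
    value &= MASK32
    lo = value & 0xFFF
    if lo >= 0x800:
        # addi sign-extends its 12-bit immediate, so the low part becomes negative
        # and the upper part must be rounded up to compensate.
        lo -= 0x1000
    hi = ((value - lo) >> 12) & 0xFFFFF
    return hi, lo

def rebuild(hi, lo):
    """Recompute the constant the way the hardware would: (hi << 12) + sign-extended lo."""
    return ((hi << 12) + lo) & MASK32
```

For 0xDEADBEEF this gives hi = 0xDEADC and lo = -273, so the execute phase first sees the upper immediate and then adds a negative immediate to the register value. For 0x15 the upper part is 0 and a single addi from the zero register is enough.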

Run HW2 on Mycpu
Setup
In HW2, the code doesn't fit the ISA supported by Mycpu, so I remove get_cycles and the system calls from the assembly code.
Modified code
.org 0
# Provide program starting address to linker
.global _start
.data
data_1: .word 0x12345678
data_2: .word 0xffffdddd
mask_1: .word 0x55555555
mask_2: .word 0x33333333
mask_3: .word 0x0f0f0f0f
.text
_start:
lw s0, data_1 #s0 = A
lw s1, data_2 #s1 = B
mv a0, s0
jal ra, CLZ
mv t5, a0 #A's CLZ -> t5
mv a0, s1
jal ra, CLZ
mv t6, a0 #B's CLZ -> t6
slt t0, t5, t6 # if A's zero less than B's, t0=1
li a0, 32
bne t0, zero, start_mul
start_mul:
#reset
mv t0, s0 #A ^= B;
mv s0, s1 #B ^= A;
mv s1, t0 #A ^= B;
mv t6, t5
sub a0, a0, t6
li t0, 0
li t1, 0
li t2, 0
li s2, 0 #s2: high 32 of number
li s3, 0 #s3: low 32 of number
li s4, 0 #used to check how many bit should shift
int_mul:
slt t1, s4, a0
beq t1, zero, exit
srl t0, s1, s4
andi t0, t0, 0x00000001 #check B's rightest bit
beq t0, zero, skip #if(rightest bit is zero) jump
sll s5,s0,s4 #s0 is A,S5 the low bit i want
li t2, 32
sub t2, t2, s4
srl s6, s0, t2 #s0 is A, S6 the high bit i want
add s7, s3, s5 #s7 is 32_low + low bit i want
sltu t3, s7, s3
mv s3, s7
beq t3, zero, no_overflow
# if not jump --> overflow
add s2, s2, s6
addi s2, s2, 1
addi s4, s4 ,1
no_overflow:
add s2, s2, s6
jal skip
skip:
addi s4, s4 ,1
jal int_mul
CLZ:
#a0: the num(x) you want to count CLZ
#t0: shifted x
srli t0, a0, 1 # t0 = x >> 1
or a0, a0, t0 # x |= x >> 1
srli t0, a0, 2 # t0 = x >> 2
or a0, a0, t0 # x |= x >> 2
srli t0, a0, 4 # t0 = x >> 4
or a0, a0, t0 # x |= x >> 4
srli t0, a0, 8 # t0 = x >> 8
or a0, a0, t0 # x |= x >> 8
srli t0, a0, 16 # t0 = x >> 16
or a0, a0, t0 # x |= x >> 16
#start_mask
lw t2, mask_1
srli t0, a0, 1 # t0 = x >> 1
and t1, t0, t2 # t1 = (x >> 1) & mask1
sub a0, a0, t1 # x -= ((x >> 1) & mask1)
lw t2, mask_2 # load mask2 to t2
srli t0, a0, 2 # t0 = x >> 2
and t1, t0, t2 # (x >> 2) & mask2
and a0, a0, t2 # x & mask2
add a0, t1, a0 # ((x >> 2) & mask2) + (x & mask2)
srli t0, a0, 4 # t0 = x >> 4
add a0, a0, t0 # x + (x >> 4)
lw t2, mask_3 # load mask3 to t2
and a0, a0, t2 # ((x >> 4) + x) & mask3
srli t0, a0, 8 # t0 = x >> 8
add a0, a0, t0 # x += (x >> 8)
srli t0, a0, 16 # t0 = x >> 16
add a0, a0, t0 # x += (x >> 16)
andi t0, a0, 0x3f # t0 = x & 0x3f
li a0, 32 # a0 = 32
sub a0, a0, t0 # 32 - (x & 0x3f)
ret
exit:
j exit
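To make the CLZ routine easier to check by hand, the same sequence of steps can be written as a small Python reference model. This is a sketch for cross-checking register values, not part of the assignment code; the function name clz32 is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def clz32(x):
    """Count the leading zeros of a 32-bit value, mirroring the steps of the CLZ routine."""
    x &= MASK32
    # Smear the highest set bit into every lower bit position.
    for shift in (1, 2, 4, 8, 16):
        x |= x >> shift
    # Population count of the smeared value (mask_1, mask_2, mask_3 in the listing).
    x -= (x >> 1) & 0x55555555
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
    x = ((x >> 4) + x) & 0x0F0F0F0F
    x += x >> 8
    x += x >> 16
    # The number of leading zeros is 32 minus the number of set bits.
    return 32 - (x & 0x3F)
```

For data_1 in the listing, 0x12345678, this model returns 3.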
Then I change the Makefile in ca2023/csrc so that it can generate the .asmbin file directly from the .elf file.
Every time mul_clz.S is modified, the following commands generate a new .asmbin file and update the .asmbin file in the main test directory.
Run and Debug HW2 on CPUTest
To test the assembly code of HW2, I prepare a mul_clzTest to check whether the result is stored correctly in registers s2 and s3. If the result is correct, the test should fail, because the value is not 0x0.
Here is the result. It clearly does not match my expectation, so I start looking for the problem.
I modify the assembly code to only count the leading zeros of the data, and modify the CPUTest to check whether registers t5 and t6 are 0x3 and 0x7.
Modified code
.org 0
# Provide program starting address to linker
.global _start
.data
data_1: .word 0x12345678
data_2: .word 0xffffdddd
mask_1: .word 0x55555555
mask_2: .word 0x33333333
mask_3: .word 0x0f0f0f0f
.text
_start:
lw s0, data_1 #s0 = A
lw s1, data_2 #s1 = B
mv a0, s0
jal ra, CLZ
mv t5, a0 #A's CLZ -> t5
mv a0, s1
jal ra, CLZ
mv t6, a0 #B's CLZ -> t6
slt t0, t5, t6 # if A's zero less than B's, t0=1
loop:
j loop
CLZ:
#a0: the num(x) you want to count CLZ
#t0: shifted x
srli t0, a0, 1 # t0 = x >> 1
or a0, a0, t0 # x |= x >> 1
srli t0, a0, 2 # t0 = x >> 2
or a0, a0, t0 # x |= x >> 2
srli t0, a0, 4 # t0 = x >> 4
or a0, a0, t0 # x |= x >> 4
srli t0, a0, 8 # t0 = x >> 8
or a0, a0, t0 # x |= x >> 8
srli t0, a0, 16 # t0 = x >> 16
or a0, a0, t0 # x |= x >> 16
#start_mask
lw t2, mask_1
srli t0, a0, 1 # t0 = x >> 1
and t1, t0, t2 # t1 = (x >> 1) & mask1
sub a0, a0, t1 # x -= ((x >> 1) & mask1)
lw t2, mask_2 # load mask2 to t2
srli t0, a0, 2 # t0 = x >> 2
and t1, t0, t2 # (x >> 2) & mask2
and a0, a0, t2 # x & mask2
add a0, t1, a0 # ((x >> 2) & mask2) + (x & mask2)
srli t0, a0, 4 # t0 = x >> 4
add a0, a0, t0 # x + (x >> 4)
lw t2, mask_3 # load mask3 to t2
and a0, a0, t2 # ((x >> 4) + x) & mask3
srli t0, a0, 8 # t0 = x >> 8
add a0, a0, t0 # x += (x >> 8)
srli t0, a0, 16 # t0 = x >> 16
add a0, a0, t0 # x += (x >> 16)
andi t0, a0, 0x3f # t0 = x & 0x3f
li a0, 32 # a0 = 32
sub a0, a0, t0 # 32 - (x & 0x3f)
ret
CPUTest
Here is the result. The test still does not pass.
So I view the waveform to see how these two lines behave.
At 433 ns, the value in a0 (register_10) is moved to t5 (register_30), but t6 (register_31) is always zero. Therefore, the bug seems to appear in these instructions.
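When reading the waveform, the ABI register names have to be mapped to the numbered registers of the register file. The following Python sketch builds that mapping from the standard RISC-V calling convention; the helper name register_index is hypothetical, and the register_N signal naming is taken from the waveform description above.

```python
# ABI names in order of register number x0..x31 (standard RISC-V calling convention).
ABI_NAMES = (
    ["zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2", "s0", "s1"]
    + [f"a{i}" for i in range(8)]      # a0..a7  -> x10..x17
    + [f"s{i}" for i in range(2, 12)]  # s2..s11 -> x18..x27
    + [f"t{i}" for i in range(3, 7)]   # t3..t6  -> x28..x31
)

def register_index(abi_name):
    """Return the register number for an ABI name, e.g. 'a0' -> 10."""
    if abi_name == "fp":  # fp is an alias of s0
        abi_name = "s0"
    return ABI_NAMES.index(abi_name)

def waveform_signal(abi_name):
    """Return the signal name to look for in the waveform, assuming register_N naming."""
    return f"register_{register_index(abi_name)}"
```

With this mapping, a0, t5, and t6 correspond to register_10, register_30, and register_31, and the result registers s2 and s3 correspond to register_18 and register_19.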

Time matters
Since the value in the specific register isn't right, I decide to view the waveform to see what happens.

The final instruction executed in the CPUTest is 0c030c63, which decodes to beq t1, zero, init_mul. This means that only part of the assembly code is executed, which is why I can't get the right result in the specific register. Finally, I modify the simulation time of the CPUTest so that the whole assembly code can run and the test passes with the correct value.
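The decoding of that instruction word can be reproduced with a few lines of Python. This is a minimal sketch of a B-type decoder following the RV32I encoding; the function name decode_branch is hypothetical, and it only reports register numbers and the byte offset, not the label name.

```python
BRANCH_MNEMONICS = {0: "beq", 1: "bne", 4: "blt", 5: "bge", 6: "bltu", 7: "bgeu"}

def decode_branch(instruction):
    """Decode a B-type instruction word into (mnemonic, rs1, rs2, byte_offset)."""
    if instruction & 0x7F != 0x63:
        raise ValueError("not a branch instruction")
    funct3 = (instruction >> 12) & 0x7
    rs1 = (instruction >> 15) & 0x1F
    rs2 = (instruction >> 20) & 0x1F
    # The immediate is scattered over the word: imm[12|10:5] in bits 31..25,
    # imm[4:1|11] in bits 11..7, and imm[0] is always zero.
    imm = (
        (((instruction >> 31) & 0x1) << 12)
        | (((instruction >> 7) & 0x1) << 11)
        | (((instruction >> 25) & 0x3F) << 5)
        | (((instruction >> 8) & 0xF) << 1)
    )
    if imm & 0x1000:  # sign-extend the 13-bit offset
        imm -= 0x2000
    return BRANCH_MNEMONICS.get(funct3, "unknown"), rs1, rs2, imm
```

decode_branch(0x0c030c63) returns ("beq", 6, 0, 216): register 6 is t1, register 0 is zero, and the branch target lies 216 bytes after the branch itself.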
The whole assembly code takes 3577 ns to finish, and the result is the same as the value I got in HW2.
