# Assignment3: SoftCPU
contributed by < [Chen Ching Wen]() >
###### tags: `RISC-V`, `Computer Architure 2022`
[TOC]
## Requirements 1,3
### Problem choosen from homework1
After basic tools installation(include riscv-gnu-toolchain, verilator, and srv32), I picked up the problem done by [莊集](https://hackmd.io/@y8jRQNyoRe6WG-qekloIlA/Sk0PXEDzj) and try to compare the results generated from the c code and the handwritten assembly code.
In order to make the handwritten assembly code executable on the srv32, we need to modify the **printArray** part in the assembly code.
In this part, we need to notice few things; first of all, before call printf, we shouldn't save any variable that couldn't be modified outside the function in tx registers(e.g. nums head addr in t1). Secondly, we need to printf a "\n" in the end of printArray, or it couldn't print out anything. I didn't get it why it's in this way.
```assembly
printArray: # void printArray(*nums, numsSize), nums:a0, numsSize:a1
addi sp, sp, -16 # store ra, s0, s1
sw ra, 0(sp) # store ra
sw s0, 4(sp) # store s0
sw s1, 8(sp) # store s1
sw s2, 12(sp)
mv s0, a0 # keep nums head addr in s0
mv s1, a1 # keep numsSize in s1
mv s2, a0 # keep nums head addr in t1
print_conti:
beq x0, s1, print_end
la a0, iformat
lw a1, 0(s2)
call printf
addi s1, s1, -1
addi s2, s2, 4
la a0, comma
call printf
j print_conti
print_end:
la a0, newLine
call printf
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
addi sp, sp, 16
ret
```
- **Result of C Code (Instruction Set Simulator)**
```shell=
$export CROSS_COMPILE=riscv-none-elf-
$export EXTRA_CFLAGS="-misa-spec=2.2 -march=rv32im"
```
```shell=
$ ./rvsim --memsize 128 -l trace.log ../sw/containDup_c/containDup.elf
Question1 Accepted
Excuting 2787 instructions, 3891 cycles, 1.396 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.002 s
Simulation cycles: 3891
Simulation speed : 2.477 MHz
```
- **Result of C Code (RTL Simulation)**
```shell=
##~/srv32/sim
$ make containDup_c.run
cp ../sw/containDup_c/*.bin .
stdbuf -o0 -e0 ./sim +trace +dump | awk -f checkcode.awk
Question1 Accepted
Excuting 2787 instructions, 3891 cycles, 1.396 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.051 s
Simulation cycles: 3902
Simulation speed : 0.0765098 MHz
```
- **Result of Assembly Code (RTL Simulation)**
```shell=
##~/srv32/sim
$ make containDup.run
cp ../sw/containDup/*.bin .
stdbuf -o0 -e0 ./sim +trace +dump | awk -f checkcode.awk
Question1 Accepted
DMEM address 00040000 out of range
- ../rtl/../testbench/testbench.v:439: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.074 s
Simulation cycles: 6341
Simulation speed : 0.0856892 MHz
```
- Observation
We notice that the hanwritten assembly takes almost two time cycles than c-generated assembly code. The reason is that the branch that is used by handwritten assembly is more than the other.
- Optimization
- change the order of instruction(reduce the jump instruction)
```assembly=
##original
heapifySkip:
slli s10, t4, 2 # let s10 = 4 * j
add s10, a0, s10 # s10 = (nums + 4j)
lw s10, 0(s10) # s10 = *(nums + 4j)
slt s10, t5, s10 # if (currValue >= nums[j]) then break
beq s10, zero, assignCurrVal
andi t3, t4, 1 # if (j % 2 == 0) then parent = j / 2 - 1
beq t3, zero, innerEvenAssign # else if (j % 2 == 1) then parent = j / 2
j innerOddAssign
innerEvenAssign:
srli t3, t4, 1 # let t3 be parent = j / 2 - 1
addi t3, t3, -1
j innerOddAssignSkip
innerOddAssign:
srli t3, t4, 1 # let t3 be parent = j / 2
innerOddAssignSkip:
slli t3, t3, 2 # t3 = parent * 4
add t3, a0, t3 # t3 = (nums + 4 * parent)
slli s11, t4, 2 # let s11 = 4 * j
add s11, a0, s11 # s11 = nums + 4j
lw s11, 0(s11) # s11 = nums[j]
sw s11, 0(t3) # nums[parent] = nums[j]
slli t4, t4, 1 # j = j * 2 + 1
addi t4, t4, 1
j heapifyWhile
```
```assembly=
##original
heapifySkip:
slli s10, t4, 2 # let s10 = 4 * j
add s10, a0, s10 # s10 = (nums + 4j)
lw s10, 0(s10) # s10 = *(nums + 4j)
slt s10, t5, s10 # if (currValue >= nums[j]) then break
beq s10, zero, assignCurrVal
andi t3, t4, 1 # if (j % 2 == 0) then parent = j / 2 - 1
beq t3, zero, innerEvenAssign # else if (j % 2 == 1) then parent = j / 2
innerOddAssign:
srli t3, t4, 1 # let t3 be parent = j / 2
innerOddAssignSkip:
slli t3, t3, 2 # t3 = parent * 4
add t3, a0, t3 # t3 = (nums + 4 * parent)
slli s11, t4, 2 # let s11 = 4 * j
add s11, a0, s11 # s11 = nums + 4j
lw s11, 0(s11) # s11 = nums[j]
sw s11, 0(t3) # nums[parent] = nums[j]
slli t4, t4, 1 # j = j * 2 + 1
addi t4, t4, 1
j heapifyWhile
innerEvenAssign:
srli t3, t4, 1 # let t3 be parent = j / 2 - 1
addi t3, t3, -1
j innerOddAssignSkip
```
```shell=
##rtl simulation result after reduce the jump instruction
Simulation statistics
=====================
Simulation time : 0.072 s
Simulation cycles: 6323
Simulation speed : 0.0878194 MHz
```
```assembly=
assignCurrVal:
andi t3, t4, 1 # if (j % 2 == 0) then parent = j / 2 - 1
beq t3, zero, evenAssign # else if (j % 2 == 1) then parent = j / 2
j oddAssign
evenAssign:
srli t3, t4, 1 # let t3 be parent = j / 2 - 1
addi t3, t3, -1
j oddAssignSkip
oddAssign:
srli t3, t4, 1 # let t3 be parent = j / 2
oddAssignSkip:
slli t3, t3, 2 # t3 = parent * 4
add t3, a0, t3 # t3 = (nums + 4 * parent)
sw t5, 0(t3) # nums[parent] = currValue
lw s10, 0(sp) # restore s10
lw s11, 4(sp) # restore s11
addi sp, sp, 8
jr ra
```
```assembly=
assignCurrVal:
andi t3, t4, 1 # if (j % 2 == 0) then parent = j / 2 - 1
beq t3, zero, evenAssign # else if (j % 2 == 1) then parent = j / 2
oddAssign:
srli t3, t4, 1 # let t3 be parent = j / 2
oddAssignSkip:
slli t3, t3, 2 # t3 = parent * 4
add t3, a0, t3 # t3 = (nums + 4 * parent)
sw t5, 0(t3) # nums[parent] = currValue
lw s10, 0(sp) # restore s10
lw s11, 4(sp) # restore s11
addi sp, sp, 8
jr ra
evenAssign:
srli t3, t4, 1 # let t3 be parent = j / 2 - 1
addi t3, t3, -1
j oddAssignSkip
```
```shell=
Simulation statistics
=====================
Simulation time : 0.071 s
Simulation cycles: 6296
Simulation speed : 0.0886761 MHz
```
```assembly=
#####unroll begin
bge t4, a2, assignCurrVal # while j < numsSize do adjust parent node
addi t3, a2, -1 # if j < (numsSize - 1) && nums[j] < nums[j + 1] then j++
slt t3, t4, t3 # if j < numsSize - 1 then keep doing
beq t3, zero, heapifySkip
addi s11, t4, 1 # let s11 be dereference of *(nums + (j + 1))
slli s11, s11, 2 # s11 = 4 * (j + 1)
add s11, a0, s11 # s11 = nums + 4 * (j + 1)
lw s11, 0(s11) # s11 = *(nums + 4(j + 1)) = nums[j + 1]
slli s10, t4, 2 # s10 = 4 * j
add s10, a0, s10 # s10 = (num + 4j)
lw s10, 0(s10) # s10 = *(num + 4j) = nums[j]
slt t3, s10, s11 # if nums[j] < nums[j + 1] ?
beq t3, zero, heapifySkip # if nums[j] < nums[j + 1] then j++
addi t4, t4, 1 # j++
slli s10, t4, 2 # let s10 = 4 * j
add s10, a0, s10 # s10 = (nums + 4j)
lw s10, 0(s10) # s10 = *(nums + 4j)
slt s10, t5, s10 # if (currValue >= nums[j]) then break
beq s10, zero, assignCurrVal
andi t3, t4, 1 # if (j % 2 == 0) then parent = j / 2 - 1
beq t3, zero, innerEvenAssign # else if (j % 2 == 1) then parent = j / 2
srli t3, t4, 1 # let t3 be parent = j / 2
slli t3, t3, 2 # t3 = parent * 4
add t3, a0, t3 # t3 = (nums + 4 * parent)
slli s11, t4, 2 # let s11 = 4 * j
add s11, a0, s11 # s11 = nums + 4j
lw s11, 0(s11) # s11 = nums[j]
sw s11, 0(t3) # nums[parent] = nums[j]
slli t4, t4, 1 # j = j * 2 + 1
addi t4, t4, 1
#####unroll end
j heapifyWhile
```
```shell=
Simulation statistics
=====================
Simulation time : 0.072 s
Simulation cycles: 6275
Simulation speed : 0.0871528 MHz
```
```assembly=
callHeapifyWhile:
blt t6, zero, endHeapifywhile # while i >= 0, then do heapify
heapSortSkip:
mv a0, s0 # a0 = *nums
mv a1, t6 # a1 = i
mv a2, s1 # a2 = numsSize
jal heapify # call heapify
addi t6, t6, -1 # i--
j callHeapifyWhile # do next heapify while i still >= 0
endHeapifywhile:
lw ra, 0(sp) # restore ra
lw s0, 4(sp) # restore s0
lw s1, 8(sp) # restore s1
lw s2, 12(sp) # restore s2
addi sp, sp, 16
jr ra # return from heapSort
```
```shell=
Simulation statistics
=====================
Simulation time : 0.073 s
Simulation cycles: 6261
Simulation speed : 0.0857671 MHz
```
### LeetCode [85. Maximal Rectangle](https://leetcode.com/problems/maximal-rectangle/)
- Given a rows x cols binary matrix filled with 0's and 1's, find the largest rectangle containing only 1's and return its area.
```cpp=
#include <stdio.h>
#include <stdlib.h>
int largestRectangleArea(int *heights, int h_len)
{
if(h_len == 0){
return 0;
}
int *leftLessMin = (int *)malloc(sizeof(int)*h_len);
leftLessMin[0] = -1;
for(int i = 1; i < h_len; i++){
int l = i - 1;
while(l >= 0 && heights[l] >= heights[i]){
l = leftLessMin[l];
}
leftLessMin[i] = l;
}
int *rightLessMin = (int *)malloc(sizeof(int)*h_len);
rightLessMin[h_len - 1] = h_len;
for(int i = h_len - 2; i >= 0; i--){
int r = i + 1;
while(r <= h_len - 1 && heights[r] >= heights[i]){
r = rightLessMin[r];
}
rightLessMin[i] = r;
}
int maxArea = 0;
for(int i= 0;i < h_len; i++){
int area = (rightLessMin[i] - leftLessMin[i] - 1)*heights[i];
maxArea = area > maxArea ? area : maxArea;
}
return maxArea;
}
int maxRectangle(int matrixSize, int matrixColSize, int matrix[][matrixColSize])
{
if(matrixSize == 0){
return 0;
}
//int *heights = (int *)malloc(sizeof(int)*matrixColSize);
int *heights = calloc(matrixColSize, sizeof(int));
int maxArea = 0;
int tmp = 0;
for(int row = 0; row < matrixSize; row++){
for(int col = 0; col < matrixColSize; col++){
if(matrix[row][col] == 1){
heights[col] += 1;
}else{
heights[col] = 0;
}
}
tmp = largestRectangleArea(heights, matrixColSize);
maxArea = maxArea > tmp ? maxArea : tmp;
}
return maxArea;
}
int main(){
int problem1[4][5] = {{1, 0, 1, 0, 0}, {1, 0, 1, 1, 1}, {1, 1, 1, 1, 1}, {1, 0, 0, 1, 0}};
int ans = maxRectangle( 4, 5, problem1);
printf("Answer for problem1 = %d\n", ans);
int problem2[1][1] = {{1}};
ans = maxRectangle( 4, 5, problem2);
printf("Answer for problem2 = %d\n", ans);
int problem3[3][6] = {{1, 1, 1, 1, 1, 1}, {0, 0, 1, 1, 1, 1}, {0, 1, 1, 1, 1, 1}};
ans = maxRectangle( 3, 6, problem3);
printf("Answer for problem3 = %d\n", ans);
}
```
```shell
Answer for problem1 = 6
Answer for problem2 = 1
Answer for problem3 = 12
Excuting 11951 instructions, 16329 cycles, 1.366 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.169 s
Simulation cycles: 16340
Simulation speed : 0.0966864 MHz
```
```shell=
$ ./rvsim --memsize 128 -l trace.log ../sw/MaxRectArea/MaxRectArea.elf
Answer for problem1 = 6
Answer for problem2 = 1
Answer for problem3 = 12
Excuting 11951 instructions, 16329 cycles, 1.366 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.008 s
Simulation cycles: 16329
Simulation speed : 2.165 MHz
```
## Requirements 2
### load
There is no needed stall since when load occurs and has a data dependency with the next instruction, the **wb_mem2reg** becomes 1 and threrefore, **wb_data** = **reg_rdata2**
```assembly
7428: 000c2583 lw a1,0(s8)
742c: 00ba85b3 add a1,s5,a1
7430: 00bc2023 sw a1,0(s8)
```

### forwarding
- Firstly, we notice that add and sw have a RAW data hazard; however through forwarding, **a1** would be sent to the next exe. stage(run **sw** instruction) after **add** instruction finishes in exe stage. We can verify it through checking out **wb_result** and **reg_rdata2**, which have the same values **0x418**.
```assembly=
742c: 00ba85b3 add a1,s5,a1
7430: 00bc2023 sw a1,0(s8)
```

### branch
It need flushes two instructions. Since whether to jump or not for the branch is decided in the EXE stage, and before it send the signal to the mux in the IF/ID stage to determine the next pc, the second flushed instruction already be read from the program counter.
```assembly=
7440: 28079c63 bnez a5,76d8 <_malloc_r+0x5ac>
7444: 0089ab83 lw s7,8(s3)
7448: 015b07b3 add a5,s6,s5
744c: 0017e793 ori a5,a5,1
```

## Requirements 4
There are two major points I want to figure out through this homework; first off, how verilator is executed with verilog; secondly, how interrupt is implemented in a CPU via verilog.
### verilator
- Verilator is a simulation tool which is similar to the rule of testbench of verilog, except it's written in c++.
```verilog=
//our.v
module our;
initial begin $display("Hello World"); $finish; end
endmodule
```
```cpp=
#include "Vour.h"
#include "verilated.h"
int main(int argc, char **argv, char **env) {
Verilated::commandArgs(argc, argv);
Vour* top = new Vour;
while (!Verilated::gotFinish()) { top->eval(); }
exit(0);
}
```
```shell=
$export VERILATOR_ROOT=/path/to/where/verilator/was/installed
$VERILATOR_ROOT/bin/verilator --cc our.v --exe sim_main.cpp
$cd obj_dir
$make -j -f Vour.mk Vour
$cd ..
$obj_dir/Vour
```
### interruption/exception
In general, a complete CPU should contains three mode: user, machine, supervisor mode. In srv32, it's mainly implemented for machine mode(e.g. mtvec means trap-vector base-address register under machine mode)
Following are the corresponding CSR registers for interruption and exception
- CSR registers
| register | function |
| ---- | -------- |
| mtvec | Define the pc address of the instruction that cause exception |
| mcause |Define the reason for exception |
|mtval|The value corresponding for the exception. e.g.If exception occurs by r/w from/to memory, than mtval store the visited memory address|
|mepc|There would be a slightly difference. For interruption, it point to the next non-executed instruction, while for exception, it points to the current instruction which cause the exception|
|mstatus|Contain multiple information related to the status. e.g.(1)**MIE**:machine mode interruption enable (2)**MPIE**: previous MIE|
|mie|Control different types of interruptions(a mask)|
|mip|Pending status for different interruption|
## Reference
[leetcode 85 tutorial](https://leetcode.wang/leetCode-85-Maximal-Rectangle.html)
[verilator(1) - Linux man page](https://linux.die.net/man/1/verilator)