# Assignment3: SoftCPU
contributed by < [`Peter`](https://hackmd.io/@kaminto-1999) >
###### tags: `RISC-V`
## Setting Up Environment
I set up the `RISC-V toolchains` in my environment, `WSL2 Ubuntu 20.04`, following the tutorial from [chinghongfang](https://hackmd.io/@chinghongfang/HJuNqq-cF). Next, I copied the C code used in [Assignment 2](https://hackmd.io/@jackli/CA_hw1) into `srv32/sw/Ex2/Ex2.c` and added a `Makefile` copied from `srv32/sw/qsort`. I repeated the same steps for `LeetCode 357`, placing the files in `srv32/sw/Ex3`.
Next, I changed the target name in each Makefile to the name of my C source file.
Finally, I went into the `srv32` directory and ran `make *NameOfExample*` (for example `make Ex3`) to compile the program and get the results of the RTL simulation and the ISS simulator.
## Requirements 1: [LeetCode 357:Count Numbers With Unique Digits](https://leetcode.com/problems/count-numbers-with-unique-digits/)
### Description:
Given an integer `n`, return the count of all numbers with unique digits, `x`, where `0 <= x < 10^n`.
```
Example:
Input: n = 2
Output: 91
Explanation: The answer should be the total numbers in the range of 0 ≤ x < 100, excluding 11,22,33,44,55,66,77,88,99
```
### C-code
```c
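int countNumbersWithUniqueDigits(int n) {
    if (n == 0) { return 1; }
    if (n == 1) { return 10; }
    int temp = 9;   // count of numbers with exactly k unique digits
    int temp1 = 9;  // choices left for the next digit
    int ret = 10;   // running total, starts with the n == 1 answer
    while (n > 1) {
        temp = temp1 * temp;
        ret = temp + ret;
        temp1--;
        n--;
    }
    return ret;
}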
```
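As a cross-check of the closed-form loop above, the count can also be obtained by direct enumeration. The sketch below is only an illustration: `has_unique_digits` and `count_unique_brute` are hypothetical helper names that do not appear in the original sources, and the approach is practical only for small `n`.

```c
#include <stdbool.h>

/* Returns true when no decimal digit of x repeats. */
bool has_unique_digits(unsigned x)
{
    unsigned seen = 0;                  /* bit d is set once digit d has appeared */
    do {
        unsigned bit = 1u << (x % 10);
        if (seen & bit)
            return false;
        seen |= bit;
        x /= 10;
    } while (x != 0);
    return true;
}

/* Hypothetical helper: counts every x with 0 <= x < 10^n by enumeration. */
int count_unique_brute(int n)
{
    unsigned limit = 1;
    for (int i = 0; i < n; i++)
        limit *= 10;

    int count = 0;
    for (unsigned x = 0; x < limit; x++)
        if (has_unique_digits(x))
            count++;
    return count;
}
```

Comparing `count_unique_brute(n)` with `countNumbersWithUniqueDigits(n)` for `n` from 0 to 4 is a quick way to gain confidence in the loop before running it on the simulator.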
### Validate the results
Below are the execution results from `verilator` (RTL) and `rvsim` (ISS). The results are the same, and the traces of the RTL and ISS simulators also match.
Run `make Ex3` in the `srv32` directory.
#### RTL Simulation result
```
With n = 0
Result is:1
*************
With n = 1
Result is:10
*************
With n = 2
Result is:91
*************
With n = 3
Result is:739
*************
With n = 4
Result is:5275
*************
Excuting 18142 instructions, 24518 cycles, 1.351 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.239 s
Simulation cycles: 24529
Simulation speed : 0.102632 MHz
```
#### ISS (Instruction Set Simulator) result
```
./rvsim --memsize 128 -l trace.log ../sw/Ex3/Ex3.elf
With n = 0
Result is:1
*************
With n = 1
Result is:10
*************
With n = 2
Result is:91
*************
With n = 3
Result is:739
*************
With n = 4
Result is:5275
*************
Excuting 18142 instructions, 24518 cycles, 1.351 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.008 s
Simulation cycles: 24518
Simulation speed : 3.210 MHz
make[1]: Leaving directory '/home/peter/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
#### Assembly code
It is generated by the GNU Toolchain for RISC-V, with `-O3` optimization by default, and can be found in `srv32/sw/Ex3/Ex3.dis`.
Note that you can change the optimization level by editing **CFLAGS** in the `Makefile.common` file.

At first I did not touch CFLAGS; the default value is `-O3`. I noticed something interesting when looking for the function call in the disassembly: the `-O3` optimization had reduced the function call to code at some special addresses. This can reduce code length and speed up execution, but it makes debugging hard. In the end, I changed the flag to `-O1` so that I could track the signals easily.
```
0000003c <countNumbersWithUniqueDigits>:
3c: 02050a63 beqz a0,70 <countNumbersWithUniqueDigits+0x34>
40: 00100793 li a5,1
44: 02a7da63 bge a5,a0,78 <countNumbersWithUniqueDigits+0x3c>
48: 00a00693 li a3,10
4c: 40a686b3 sub a3,a3,a0
50: 00900793 li a5,9
54: 00a00513 li a0,10
58: 00900713 li a4,9
5c: 02f70733 mul a4,a4,a5
60: fff78793 addi a5,a5,-1
64: 00e50533 add a0,a0,a4
68: fef69ae3 bne a3,a5,5c <countNumbersWithUniqueDigits+0x20>
6c: 00008067 ret
70: 00100513 li a0,1
74: 00008067 ret
78: 00a00513 li a0,10
7c: 00008067 ret
```
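To make the listing above easier to follow, here is a register-by-register rendering in C. It is a reading aid under my own interpretation of the disassembly, not compiler output; the function name `count_unique_as_compiled` is hypothetical, and the variables are named after the registers they stand for.

```c
/* C rendering of the listing above; comments give the matching addresses. */
int count_unique_as_compiled(int a0)
{
    if (a0 == 0)            /* 3c: beqz a0,70 */
        return 1;           /* 70: li a0,1 */
    if (1 >= a0)            /* 40-44: li a5,1 ; bge a5,a0,78 */
        return 10;          /* 78: li a0,10 */

    int a3 = 10 - a0;       /* 48-4c: loop bound, a5 counts down to 10 - n */
    int a5 = 9;             /* 50: next multiplier (temp1 in the C source) */
    a0 = 10;                /* 54: running result (ret) */
    int a4 = 9;             /* 58: running product (temp) */
    do {
        a4 = a4 * a5;       /* 5c: mul a4,a4,a5 */
        a5 = a5 - 1;        /* 60: addi a5,a5,-1 */
        a0 = a0 + a4;       /* 64: add a0,a0,a4 */
    } while (a3 != a5);     /* 68: bne a3,a5,5c */
    return a0;              /* 6c: ret */
}
```

Note how the compiler dropped the separate `n--` counter: the loop instead ends when the multiplier in `a5` reaches `10 - n`, which saves one instruction per iteration.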
## Requirements 2: [Homework2](https://hackmd.io/XThOd5JjQtGF9dzGY4yh_Q?both)
### Description
I chose [Contains Duplicate](https://leetcode.com/problems/contains-duplicate/) from [莊集](https://hackmd.io/@y8jRQNyoRe6WG-qekloIlA/Sk0PXEDzj).
### Validate the results
Below are the execution results from `verilator` (RTL) and `rvsim` (ISS). The results are the same, and the traces of the RTL and ISS simulators also match.
Run `make debug=1 Ex2` in the `srv32` directory.
#### RTL Simulation result
```
Question1 Accepted
Question2 Accepted
Question3 Accepted
Excuting 6305 instructions, 8609 cycles, 1.365 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.098 s
Simulation cycles: 8620
Simulation speed : 0.0879592 MHz
```
#### ISS (Instruction Set Simulator) result
```
Question1 Accepted
Question2 Accepted
Question3 Accepted
Excuting 6305 instructions, 8609 cycles, 1.365 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.002 s
Simulation cycles: 8609
Simulation speed : 3.864 MHz
make[1]: Leaving directory '/home/peter/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
### Assembly code
```
0000003c <swap>:
3c: 00052783 lw a5,0(a0)
40: 0005a703 lw a4,0(a1)
44: 00e52023 sw a4,0(a0)
48: 00f5a023 sw a5,0(a1)
4c: 00008067 ret
00000050 <heapify>:
50: 00259793 slli a5,a1,0x2
54: 00f507b3 add a5,a0,a5
58: 0007a803 lw a6,0(a5)
5c: 00159593 slli a1,a1,0x1
60: 00158793 addi a5,a1,1
64: 06c7dc63 bge a5,a2,dc <heapify+0x8c>
68: fff60593 addi a1,a2,-1
6c: 0280006f j 94 <heapify+0x44>
70: 01f7d713 srli a4,a5,0x1f
74: 00f70733 add a4,a4,a5
78: 40175713 srai a4,a4,0x1
7c: 00271713 slli a4,a4,0x2
80: 00e50733 add a4,a0,a4
84: 00d72023 sw a3,0(a4)
88: 00179793 slli a5,a5,0x1
8c: 00178793 addi a5,a5,1
90: 04c7d663 bge a5,a2,dc <heapify+0x8c>
94: 00b7de63 bge a5,a1,b0 <heapify+0x60>
98: 00279713 slli a4,a5,0x2
9c: 00e50733 add a4,a0,a4
a0: 00072683 lw a3,0(a4)
a4: 00472703 lw a4,4(a4)
a8: 00e6a733 slt a4,a3,a4
ac: 00e787b3 add a5,a5,a4
b0: 00279713 slli a4,a5,0x2
b4: 00e50733 add a4,a0,a4
b8: 00072683 lw a3,0(a4)
bc: 02d85063 bge a6,a3,dc <heapify+0x8c>
c0: 0017f713 andi a4,a5,1
c4: fa0716e3 bnez a4,70 <heapify+0x20>
c8: 01f7d713 srli a4,a5,0x1f
cc: 00f70733 add a4,a4,a5
d0: 40175713 srai a4,a4,0x1
d4: fff70713 addi a4,a4,-1
d8: fa5ff06f j 7c <heapify+0x2c>
dc: 0017f713 andi a4,a5,1
e0: 02071263 bnez a4,104 <heapify+0xb4>
e4: 01f7d713 srli a4,a5,0x1f
e8: 00f707b3 add a5,a4,a5
ec: 4017d793 srai a5,a5,0x1
f0: fff78793 addi a5,a5,-1
f4: 00279793 slli a5,a5,0x2
f8: 00f50533 add a0,a0,a5
fc: 01052023 sw a6,0(a0)
100: 00008067 ret
104: 01f7d713 srli a4,a5,0x1f
108: 00f707b3 add a5,a4,a5
10c: 4017d793 srai a5,a5,0x1
110: fe5ff06f j f4 <heapify+0xa4>
00000114 <heapSort>:
114: fe010113 addi sp,sp,-32
118: 00112e23 sw ra,28(sp)
11c: 00812c23 sw s0,24(sp)
120: 00912a23 sw s1,20(sp)
124: 01212823 sw s2,16(sp)
128: 01312623 sw s3,12(sp)
12c: 00050913 mv s2,a0
130: 00058413 mv s0,a1
134: 01f5d493 srli s1,a1,0x1f
138: 00b484b3 add s1,s1,a1
13c: 4014d493 srai s1,s1,0x1
140: fff48493 addi s1,s1,-1
144: 0204c063 bltz s1,164 <heapSort+0x50>
148: fff00993 li s3,-1
14c: 00040613 mv a2,s0
150: 00048593 mv a1,s1
154: 00090513 mv a0,s2
158: ef9ff0ef jal ra,50 <heapify>
15c: fff48493 addi s1,s1,-1
160: ff3496e3 bne s1,s3,14c <heapSort+0x38>
164: fff40493 addi s1,s0,-1
168: 0204ce63 bltz s1,1a4 <heapSort+0x90>
16c: 00241413 slli s0,s0,0x2
170: 00890433 add s0,s2,s0
174: fff00993 li s3,-1
178: 00092783 lw a5,0(s2)
17c: ffc42703 lw a4,-4(s0)
180: 00e92023 sw a4,0(s2)
184: fef42e23 sw a5,-4(s0)
188: 00048613 mv a2,s1
18c: 00000593 li a1,0
190: 00090513 mv a0,s2
194: ebdff0ef jal ra,50 <heapify>
198: fff48493 addi s1,s1,-1
19c: ffc40413 addi s0,s0,-4
1a0: fd349ce3 bne s1,s3,178 <heapSort+0x64>
1a4: 01c12083 lw ra,28(sp)
1a8: 01812403 lw s0,24(sp)
1ac: 01412483 lw s1,20(sp)
1b0: 01012903 lw s2,16(sp)
1b4: 00c12983 lw s3,12(sp)
1b8: 02010113 addi sp,sp,32
1bc: 00008067 ret
000001c0 <containsDuplicate>:
1c0: ff010113 addi sp,sp,-16
1c4: 00112623 sw ra,12(sp)
1c8: 00812423 sw s0,8(sp)
1cc: 00912223 sw s1,4(sp)
1d0: 00050413 mv s0,a0
1d4: 00058493 mv s1,a1
1d8: f3dff0ef jal ra,114 <heapSort>
1dc: 00100793 li a5,1
1e0: 0297d863 bge a5,s1,210 <containsDuplicate+0x50>
1e4: 00040513 mv a0,s0
1e8: fff48593 addi a1,s1,-1
1ec: 00000793 li a5,0
1f0: 00052683 lw a3,0(a0)
1f4: 00452703 lw a4,4(a0)
1f8: 02e68063 beq a3,a4,218 <containsDuplicate+0x58>
1fc: 00178793 addi a5,a5,1
200: 00450513 addi a0,a0,4
204: feb796e3 bne a5,a1,1f0 <containsDuplicate+0x30>
208: 00000513 li a0,0
20c: 0100006f j 21c <containsDuplicate+0x5c>
210: 00000513 li a0,0
214: 0080006f j 21c <containsDuplicate+0x5c>
218: 00100513 li a0,1
21c: 00c12083 lw ra,12(sp)
220: 00812403 lw s0,8(sp)
224: 00412483 lw s1,4(sp)
228: 01010113 addi sp,sp,16
22c: 00008067 ret
```
## Waveform analysis with GTKWave
### srv32 CPU
`srv32` is a simple 3-stage pipelined RISC-V processor with IF/ID, EX, and WB stages. The following block diagram is based on [Lab3: srv32](https://hackmd.io/@sysprog/S1Udn1Xtt). Details of the design can be found in `srv32/rtl/riscv.v`, which is written in Verilog RTL.

RISC-V provides two types of registers: **GPR** (General Purpose Register) and **CSR** (Control and Status Register).
### Branch Penalty- flush the pipeline
`Branch penalty` is the number of instructions killed when a branch is TAKEN. The branch result is resolved by the ALU at the end of the EX stage, so the instruction fetched in IF/ID might need to be killed if the branch is taken. However, in `srv32`, the address of the next instruction (next PC) must be fed into I-MEM one cycle ahead. Thus, the branch penalty for srv32 is 2: by the time the next PC is resolved, one instruction has already been fetched into the pipeline and another PC has already been computed, because the address is computed one cycle ahead.
Consider the following instruction sequence and its corresponding waveform.
```
3c: 02050a63 beqz a0,70 <countNumbersWithUniqueDigits+0x34>
40: 00100793 li a5,1
44: 02a7da63 bge a5,a0,78
```

First, we look at the `branch` instruction (PC=00000044) to understand the penalty of a `branch/jump` instruction.
While the instruction is in the EX stage, it sets `next_pc`. For a taken branch, `next_pc = ex_imm = 00000070`, which is the pink box in the waveform. Two cycles later, `if_pc` is updated to the destination `70`; therefore, the branch penalty is 2. Also, `ex_flush` and `wb_flush` are asserted for 2 cycles, indicating that the next 2 instructions are flushed out of the pipeline.
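The 2-cycle penalty can be turned into a rough cycle estimate. The sketch below is a simplified model of my own, not something taken from the srv32 sources: it assumes one cycle per instruction plus a fixed penalty for every taken branch or jump, and it ignores pipeline fill and any other source of extra cycles. Both function names are hypothetical.

```c
/* Simplified model (assumption): 1 cycle per instruction
 * plus `penalty` cycles for each taken branch or jump. */
long estimate_cycles(long instructions, long taken_branches, int penalty)
{
    return instructions + taken_branches * penalty;
}

/* Inverse view: how many taken branches would account for the extra cycles. */
long implied_taken_branches(long instructions, long cycles, int penalty)
{
    return (cycles - instructions) / penalty;
}
```

Applied to the `Ex3` run above (18142 instructions, 24518 cycles), the model attributes the 6376 extra cycles to about 3188 taken branches or jumps. This is an upper bound, since the model charges every extra cycle to branching.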
### Dealing with Data Hazard using Forwarding
`srv32` supports full forwarding, which means that a RAW data hazard can be resolved WITHOUT stalling the processor.
Consider the following instruction sequence and its corresponding waveform.
```
58: 00900713 li a4,9
5c: 02f70733 mul a4,a4,a5
60: fff78793 addi a5,a5,-1
64: 00e50533 add a0,a0,a4
```

Looking at the waveform, when the RAW data hazard happens, the data in `wb_result` is forwarded to `ex_rdata2`; the value is `10`.
The implementation of forwarding in `srv32/rtl/riscv.v` is as follows:
```
assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 :
                           (!wb_flush && wb_alu2reg &&
                            (wb_dst_sel == ex_src1_sel)) ? // register forwarding
                               (wb_mem2reg ? wb_rdata : wb_result) :
                               regs[ex_src1_sel];
assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 :
                           (!wb_flush && wb_alu2reg &&
                            (wb_dst_sel == ex_src2_sel)) ? // register forwarding
                               (wb_mem2reg ? wb_rdata : wb_result) :
                               regs[ex_src2_sel];
```
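The priority of the forwarding mux may be easier to see as a small software model. The following C sketch mirrors the `assign` statements above for a single operand; the struct `wb_state` and the function `read_operand` are hypothetical names introduced for illustration, while the field comments give the Verilog signal each field stands for.

```c
#include <stdbool.h>
#include <stdint.h>

/* Snapshot of the WB-stage signals that feed the forwarding mux. */
struct wb_state {
    bool     flush;     /* wb_flush: WB instruction was killed */
    bool     alu2reg;   /* wb_alu2reg: WB instruction writes a register */
    bool     mem2reg;   /* wb_mem2reg: WB instruction is a load */
    uint8_t  dst_sel;   /* wb_dst_sel: destination register index */
    uint32_t rdata;     /* wb_rdata: data returned by D-MEM */
    uint32_t result;    /* wb_result: ALU result */
};

/* Models reg_rdata1 / reg_rdata2 for the source register src_sel. */
uint32_t read_operand(const uint32_t regs[32], uint8_t src_sel,
                      const struct wb_state *wb)
{
    if (src_sel == 0)                       /* x0 always reads as zero */
        return 0;
    if (!wb->flush && wb->alu2reg && wb->dst_sel == src_sel)
        return wb->mem2reg ? wb->rdata      /* forward the loaded value */
                           : wb->result;    /* forward the ALU result */
    return regs[src_sel];                   /* no hazard: register file */
}
```

In the `mul a4,a4,a5` example, the model returns `wb->result` for `a4` while the preceding `li a4,9` is still in WB, which is the value the register file has not yet received.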
### Load-use hazard
A load-use hazard is NOT an issue in `srv32`, because D-MEM is read at the WB stage and the register file is also read at the WB stage. A single MUX selects between the 2 operands (the operand from the register file and the operand from D-MEM). Therefore, a `load-use hazard` can be resolved without stalling the processor.
Consider the following instruction sequence and its corresponding waveform.
```
6c: 0007a683 lw a3,0(a5)
70: fea688e3 beq a3,a0,60 <countNumbersWithUniqueDigits+0x24>
```

Let's first look at the blue box in the waveform. `wb_dst_sel` (of `lw`) is equal to `ex_src1_sel` (of `beq`), which indicates that a `RAW` data hazard happens. We then check that `wb_mem2reg` is also equal to 1, indicating that the instruction in WB is a load, so the value coming from memory has to be bypassed. That is, `wb_rdata`, which is the output of `D-MEM`, is passed to `reg_rdata1` directly, and no stall cycle is needed.
:::spoiler {state="close"} `load-use hazard` solution in `srv32/rtl/riscv.v`
```
always @(posedge clk or negedge resetb) begin
if (!resetb)
ex_mem2reg <= 1'b0;
else if (inst[`OPCODE] == OP_LOAD)
ex_mem2reg <= 1'b1;
else if (ex_mem2reg && dmem_rvalid)
ex_mem2reg <= 1'b0;
end
always @(posedge clk or negedge resetb) begin
if (!resetb) begin
wb_result <= 32'h0;
wb_alu2reg <= 1'b0;
wb_dst_sel <= 5'h0;
wb_branch <= 1'b0;
wb_branch_nxt <= 1'b0;
wb_mem2reg <= 1'b0;
wb_raddr <= 2'h0;
wb_alu_op <= 3'h0;
end else if (!ex_stall) begin
wb_result <= ex_result;
wb_alu2reg <= ex_alu || ex_lui || ex_auipc || ex_jal || ex_jalr ||
ex_csr ||
`ifdef RV32M_ENABLED
ex_mul ||
`endif
(ex_mem2reg && !ex_ld_align_excp);
wb_dst_sel <= ex_dst_sel;
wb_branch <= branch_taken || ex_trap;
wb_branch_nxt <= wb_branch;
wb_mem2reg <= ex_mem2reg;
wb_raddr <= dmem_raddr[1:0];
wb_alu_op <= ex_alu_op;
end
end
always @* begin
case(wb_alu_op)
OP_LB : begin
case(wb_raddr[1:0])
2'b00: wb_rdata[31: 0] = {{24{dmem_rdata[7]}},
dmem_rdata[ 7: 0]};
2'b01: wb_rdata[31: 0] = {{24{dmem_rdata[15]}},
dmem_rdata[15: 8]};
2'b10: wb_rdata[31: 0] = {{24{dmem_rdata[23]}},
dmem_rdata[23:16]};
2'b11: wb_rdata[31: 0] = {{24{dmem_rdata[31]}},
dmem_rdata[31:24]};
endcase
end
OP_LH : begin
wb_rdata = (wb_raddr[1]) ?
{{16{dmem_rdata[31]}}, dmem_rdata[31:16]} :
{{16{dmem_rdata[15]}}, dmem_rdata[15: 0]};
end
OP_LW : begin
wb_rdata = dmem_rdata;
end
OP_LBU : begin
case(wb_raddr[1:0])
2'b00: wb_rdata[31: 0] = {24'h0, dmem_rdata[7:0]};
2'b01: wb_rdata[31: 0] = {24'h0, dmem_rdata[15:8]};
2'b10: wb_rdata[31: 0] = {24'h0, dmem_rdata[23:16]};
2'b11: wb_rdata[31: 0] = {24'h0, dmem_rdata[31:24]};
endcase
end
OP_LHU : begin
wb_rdata = (wb_raddr[1]) ?
{16'h0, dmem_rdata[31:16]} :
{16'h0, dmem_rdata[15: 0]};
end
default: begin
wb_rdata = 32'h0;
end
endcase
end
assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 :
(!wb_flush && wb_alu2reg &&
(wb_dst_sel == ex_src1_sel)) ? // register forwarding
(wb_mem2reg ? wb_rdata : wb_result) : // load instruction?
regs[ex_src1_sel];
assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 :
(!wb_flush && wb_alu2reg &&
(wb_dst_sel == ex_src2_sel)) ? // register forwarding
(wb_mem2reg ? wb_rdata : wb_result) : // load instruction?
regs[ex_src2_sel];
```
:::
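The `wb_rdata` selection in the `always @*` block above picks a byte or halfword out of the 32-bit word returned by D-MEM and sign- or zero-extends it. Below is a minimal C sketch of that selection; the enum `load_op` and the function `load_extract` are hypothetical names, and the enum values are not the encodings used by the `OP_LB` ... `OP_LHU` constants in the RTL.

```c
#include <stdint.h>

/* Hypothetical stand-ins for OP_LB, OP_LH, OP_LW, OP_LBU, OP_LHU. */
enum load_op { LOAD_LB, LOAD_LH, LOAD_LW, LOAD_LBU, LOAD_LHU };

/* addr_low models wb_raddr[1:0]; word models dmem_rdata. */
uint32_t load_extract(enum load_op op, unsigned addr_low, uint32_t word)
{
    /* Byte lane chosen by both address bits, halfword by bit 1 only. */
    uint32_t byte = (word >> (8 * (addr_low & 3u))) & 0xFFu;
    uint32_t half = (addr_low & 2u) ? (word >> 16) : (word & 0xFFFFu);

    switch (op) {
    case LOAD_LB:  return (byte & 0x80u)   ? (byte | 0xFFFFFF00u) : byte;
    case LOAD_LH:  return (half & 0x8000u) ? (half | 0xFFFF0000u) : half;
    case LOAD_LW:  return word;
    case LOAD_LBU: return byte;
    case LOAD_LHU: return half;
    }
    return 0;   /* mirrors the Verilog default branch */
}
```

Because this selection is purely combinational and sits in front of the forwarding mux, the loaded value is available to the instruction in EX in the same cycle, which is why no stall is needed.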
## Software Optimizations
In this part, I try to optimize the C code by applying the loop unrolling technique.

Below are the original C code and its results.
### Original C-code
```c
#include <stdio.h>
#include <stdbool.h>

void swap(int *x, int *y){
    int temp = *x;
    *x = *y;
    *y = temp;
}

void heapify(int *nums, int i, int numsSize){
    int currValue = nums[i];
    int j = i * 2 + 1; // left child
    int parent;
    while (j < numsSize){
        if (j < (numsSize - 1)){
            if (nums[j] < nums[j + 1]){
                j = j + 1;
            }
        }
        if (currValue >= nums[j])
            break;
        else{
            if (j % 2 == 0)
                parent = j / 2 - 1;
            else
                parent = j / 2;
            nums[parent] = nums[j];
            j = j * 2 + 1;
        }
    }
    if (j % 2 == 0)
        parent = j / 2 - 1;
    else
        parent = j / 2;
    nums[parent] = currValue;
}

void heapSort(int *nums, int numsSize){
    for (int i = numsSize / 2 - 1; i >= 0; i--){
        heapify(nums, i, numsSize);
    }
    for (int i = numsSize - 1; i >= 0; i--){
        swap(&nums[0], &nums[i]);
        heapify(nums, 0, i);
    }
}

bool containsDuplicate(int* nums, int numsSize){
    heapSort(nums, numsSize);
    for (int i = 0; i < numsSize - 1; i++){
        if (nums[i] == nums[i + 1])
            return true;
    }
    return false;
}

int main(){
    int question1[] = {1, 4, 5, 6, 2, 3};
    bool answer1 = false;
    printf("Question1 ");
    if (answer1 == containsDuplicate(question1, (sizeof(question1)/sizeof(question1[0]))))
        printf("Accepted\n");
    else
        printf("Failed\n");
    int question2[] = {3, 2, 1, 4, 2, 1};
    bool answer2 = true;
    printf("Question2 ");
    if (answer2 == containsDuplicate(question2, (sizeof(question2)/sizeof(question2[0]))))
        printf("Accepted\n");
    else
        printf("Failed\n");
    int question3[] = {-1, -2, 3, 4, -2, -1};
    bool answer3 = true;
    printf("Question3 ");
    if (answer3 == containsDuplicate(question3, (sizeof(question3)/sizeof(question3[0]))))
        printf("Accepted\n");
    else
        printf("Failed\n");
    return 0;
}
```
#### RTL Simulation result
```
Question1 Accepted
Question2 Accepted
Question3 Accepted
Excuting 7581 instructions, 10181 cycles, 1.342 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.131 s
Simulation cycles: 10192
Simulation speed : 0.0778015 MHz
```
#### ISS (Instruction Set Simulator) result
```
Question1 Accepted
Question2 Accepted
Question3 Accepted
Excuting 7581 instructions, 10181 cycles, 1.342 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.03 s
Simulation cycles: 10192
Simulation speed : 4.02 MHz
make[1]: Leaving directory '/home/jack/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
### C-Code optimization
#### Loop unrolling
I would like to compare 3 consecutive elements of the sorted array in each iteration instead of 2 as in the original code. That is, each iteration compares `nums[i]` with `nums[i+1]` and `nums[i+1]` with `nums[i+2]`.
```C
bool containsDuplicate(int* nums, int numsSize){
heapSort(nums, numsSize);
//for (int i = 0; i < numsSize - 1; i++){
// if (nums[i] == nums[i + 1])
// return true;
//}
for (int i = 0; i < numsSize - 1; i+=2){
if ((nums[i] == nums[i + 1])||(nums[i+1] == nums[i+2]))
return true;
}
return false;
}
```
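This reduces the number of executed instructions, since the number of loop iterations is also reduced. Note that in the last iteration `nums[i+2]` can index one element past the end of the array (for example `i = 4` with `numsSize = 6`), so a correct version needs a bounds check on `i + 2`.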
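A version of the unrolled scan that stays inside the array might look like the sketch below. It is not the code that produced the measurements in this report; `has_adjacent_duplicate` is a hypothetical name, and the function assumes the array has already been sorted (for example by `heapSort`).

```c
#include <stdbool.h>

/* Assumes nums[0..numsSize-1] is already sorted. */
bool has_adjacent_duplicate(const int *nums, int numsSize)
{
    int i = 0;
    /* Unrolled by two: the loop condition keeps i + 2 a valid index. */
    for (; i + 2 < numsSize; i += 2) {
        if (nums[i] == nums[i + 1] || nums[i + 1] == nums[i + 2])
            return true;
    }
    /* At most one adjacent pair is left over after the unrolled loop. */
    if (i + 1 < numsSize && nums[i] == nums[i + 1])
        return true;
    return false;
}
```

The extra check after the loop handles even-length arrays, where the final pair `nums[numsSize-2]`, `nums[numsSize-1]` is not covered by the unrolled body.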
#### Replace operation
Next, I replace the modulus operation `% 2` with `& 0x0001`. This should reduce the number of instructions, since the RV32I base ISA has no remainder instruction (`rem` belongs to the M extension).
```c
//if (j % 2 == 0)
if ((j&0x0001) == 0)
```
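The two parity tests are interchangeable on a two's complement target such as RISC-V, including for negative `j`. The sketch below states that equivalence as code; it also shows, as an aside of my own that is not part of the original optimization, that for `j >= 1` the two-branch parent computation in `heapify` collapses to `(j - 1) / 2`. All four function names are hypothetical.

```c
#include <stdbool.h>

/* Parity test as written in the original heapify. */
bool is_even_mod(int j)  { return j % 2 == 0; }

/* Parity test after the replacement (two's complement assumed). */
bool is_even_mask(int j) { return (j & 1) == 0; }

/* Parent index as computed in heapify, using the masked test. */
int parent_branchy(int j)
{
    return ((j & 1) == 0) ? j / 2 - 1 : j / 2;
}

/* Branch-free equivalent, valid for j >= 1 only. */
int parent_single(int j)
{
    return (j - 1) / 2;
}
```

In `heapify`, `j` starts at `i * 2 + 1` and only grows, so the `j >= 1` precondition of `parent_single` holds there.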
### Optimized Simulation Result
```
Question1 Accepted
Question2 Accepted
Question3 Accepted
Excuting 7561 instructions, 10149 cycles, 1.342 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.112 s
Simulation cycles: 10160
Simulation speed : 0.0907143 MHz
Excuting 7561 instructions, 10149 cycles, 1.342 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.002 s
Simulation cycles: 10149
Simulation speed : 4.413 MHz
make[1]: Leaving directory '/home/peter/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
### Conclusion
In conclusion, I get a **reduction of 20 instructions** (7581 to 7561, about 0.26%) and a **reduction of 32 cycles** (10181 to 10149, about 0.31%). I also tried another test vector, but the number of reduced instructions stayed unchanged.
## How RISC-V Compliance Tests works
The [RISC-V Architectural Testing Framework](https://github.com/riscv-non-isa/riscv-arch-test/blob/master/doc/README.adoc) is a continuously evolving set of tests created to help ensure that software written for a given RISC-V Profile/Specification will run on all implementations that comply with that profile. Passing the RISC-V Architectural Tests does not, however, mean that a design complies with the RISC-V Architecture: the tests are only a basic set that checks the important aspects of the specification without going into the finer details. Additionally, the compliance tests verify functionality at the instruction-set level rather than RTL-specific capabilities.
### Compliance tests results
To run the compliance tests for RTL, enter `make tests` in the `srv32` directory; for the ISS simulator, enter `make tests-sw`. We then get the results below. As you can see, srv32 passes the compliance tests of `rv32Zicsr`, `rv32i` and `rv32im` in both the RTL and the ISS simulator; these stand for `Control and Status Register`, `Base Integer Instruction Set, 32-bit` and `Standard Extension for Integer Multiplication and Division`, respectively.
#### RTL
```
OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr
OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i
OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im
```
#### ISS simulator
```
OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr
OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i
OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im
```
## Explain how srv32 works with Verilator
Verilator is a tool that compiles Verilog or SystemVerilog code into a C++/SystemC model, which is then used to simulate and verify the design.
The user instantiates the "Verilated" model of the top-level module in a small C++/SystemC wrapper file. A C++ compiler then compiles these C++/SystemC files, and the resulting executable performs the design simulation.
Additionally, Verilator is typically about 100 times faster on a single thread than interpreted Verilog simulators such as Icarus Verilog, and multithreading may give a further speedup of 2 to 10 times. In total, it can outperform interpreted simulators by a factor of 200 to 1000.
When the `make $(SUBDIRS)` command is entered in the `srv32/` directory, it calls the makefile in `srv32/sw/SUBDIRS` with the `make $@.run` command, which in turn calls the makefile in the `srv32/sim` directory. The design is compiled with the command below:
```
verilator -O3 -cc -Wall -Wno-STMTDLY -Wno-UNUSED +define+MEMSIZE=$(memsize) --trace-fst --Mdir sim_cc --build --exe sim_main.cpp getch.cpp
```
This creates several files in the `sim_cc` directory (set by `--Mdir`) under `srv32/sim`. The "Files Read/Written" page of the Verilator documentation describes each file. Additionally, the makefile runs the command listed below to create a waveform and a trace log that help us test our code.
```
./sim +trace +dump
```