# Assignment3: SoftCPU
contributed by < [`Peter`](https://hackmd.io/@kaminto-1999) >

###### tags: `RISC-V`

## Setting Up Environment
I set up the `RISC-V toolchains` in my environment, `WSL2 Ubuntu 20.04`, by following the tutorial from [chinghongfang](https://hackmd.io/@chinghongfang/HJuNqq-cF).

Next, I copied the C code used in [Assignment 2](https://hackmd.io/@jackli/CA_hw1) into `srv32/sw/Ex2/Ex2.c` and added a `Makefile` copied from `srv32/sw/qsort`. I repeated the same process for `LeetCode 357`, but built that file in `srv32/sw/Ex3`. I then changed the target name in each Makefile to the name of my C file.

Finally, I went into the `srv32` directory and ran `make *NameOfExample*` to compile the program and get the results of the RTL simulation and the ISS simulator.

## Requirements 1: [LeetCode 357:Count Numbers With Unique Digits](https://leetcode.com/problems/count-numbers-with-unique-digits/)
### Description:
Given an integer `n`, return the count of all numbers with unique digits, `x`, where `0 <= x < 10^n`.
```
Example:
Input: n = 2
Output: 91
Explanation: The answer should be the total numbers in the range of 0 ≤ x < 100, excluding 11,22,33,44,55,66,77,88,99
```
### C-code
```c
int countNumbersWithUniqueDigits(int n) {
    if(n==0){return 1;}
    if(n==1){return 10;}
    int temp=9;
    int temp1=9;
    int ret=10;
    while(n>1){
        temp=temp1*temp;
        ret=temp+ret;
        temp1--;
        n--;
    }
    return ret;
}
```
### Validate the results
Below are the execution results tested with `verilator` (RTL) and `rvsim` (ISS). As you can see, the results are the same, and the traces of the RTL and ISS simulators also match. Run `make Ex3` in the `srv32` directory.
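As an extra sanity check on the expected values, the closed-form loop can be compared against a brute-force count. The sketch below is not part of the simulated test program: `has_unique_digits` and `count_unique_digits_bruteforce` are hypothetical helper names introduced only for illustration, and `n` is assumed to lie between 0 and 8 so that `10^n` fits in an `int`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: true if no decimal digit of x (x >= 0) repeats. */
static bool has_unique_digits(int x)
{
    bool seen[10] = { false };
    do {
        int d = x % 10;
        if (seen[d])
            return false;
        seen[d] = true;
        x /= 10;
    } while (x > 0);
    return true;
}

/* Hypothetical reference: counts every x in [0, 10^n) whose digits are unique. */
int count_unique_digits_bruteforce(int n)
{
    assert(n >= 0 && n <= 8);   /* assumption: keeps 10^n inside int range */

    int limit = 1;
    for (int i = 0; i < n; i++)
        limit *= 10;

    int count = 0;
    for (int x = 0; x < limit; x++)
        if (has_unique_digits(x))
            count++;
    return count;
}
```

For `n` from 0 to 4 this reference returns 1, 10, 91, 739 and 5275, the same values that both simulators print.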
#### RTL Simulation result
```
With n = 0
Result is:1
*************
With n = 1
Result is:10
*************
With n = 2
Result is:91
*************
With n = 3
Result is:739
*************
With n = 4
Result is:5275
*************

Excuting 18142 instructions, 24518 cycles, 1.351 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.239 s
Simulation cycles: 24529
Simulation speed : 0.102632 MHz
```
#### ISS (Instruction Set Simulator) result
```
./rvsim --memsize 128 -l trace.log ../sw/Ex3/Ex3.elf
With n = 0
Result is:1
*************
With n = 1
Result is:10
*************
With n = 2
Result is:91
*************
With n = 3
Result is:739
*************
With n = 4
Result is:5275
*************

Excuting 18142 instructions, 24518 cycles, 1.351 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.008 s
Simulation cycles: 24518
Simulation speed : 3.210 MHz

make[1]: Leaving directory '/home/peter/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
#### Assembly code
The assembly is generated by the GNU Toolchain for RISC-V and can be found in `srv32/sw/Ex3/Ex3.dis`. The default optimization level is `-O3`; you can change it by editing **CFLAGS** in the `Makefile.common` file.
![](https://i.imgur.com/Khh0MaG.png)

At first I did not touch `CFLAGS`, so the default `-O3` was used. While looking for the function call in the generated code, I noticed something interesting: with `-O3` the explicit function call had disappeared, apparently because the compiler had optimized it away (for example by inlining it into the caller). This can shorten the executed code and speed up the program, but it makes debugging harder. In the end, I changed the flag to `-O1` so that I could track the signals more easily.
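To make the disassembly easier to follow, the compiled loop can be written back as C. This is a hand-written reading of the listing that follows (assuming it is the `-O1` build), not compiler output; the function name `count_unique_digits_lowered` and the variable `limit` are hypothetical. The notable change is that the compiler drops the `n--` counter and instead stops when `temp1` reaches `10 - n`.

```c
#include <assert.h>

/* Hypothetical C reading of the compiled loop. Register names in the
 * comments refer to the disassembly of countNumbersWithUniqueDigits. */
int count_unique_digits_lowered(int n)
{
    assert(n >= 0);            /* assumption: n is non-negative */
    if (n == 0)                /* beqz a0,...    */
        return 1;
    if (1 >= n)                /* bge  a5,a0,... */
        return 10;

    int limit = 10 - n;        /* a3: value of temp1 at which the loop stops */
    int temp1 = 9;             /* a5 */
    int ret   = 10;            /* a0 */
    int temp  = 9;             /* a4 */
    do {
        temp *= temp1;         /* mul  a4,a4,a5 */
        temp1--;               /* addi a5,a5,-1 */
        ret += temp;           /* add  a0,a0,a4 */
    } while (temp1 != limit);  /* bne  a3,a5,... */
    return ret;
}
```

The loop body still runs `n - 1` times, because `temp1` starts at 9 and reaches `10 - n` after `n - 1` decrements, so the result is the same as in the original `while (n > 1)` loop.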
```
0000003c <countNumbersWithUniqueDigits>:
  3c:   02050a63    beqz    a0,70 <countNumbersWithUniqueDigits+0x34>
  40:   00100793    li      a5,1
  44:   02a7da63    bge     a5,a0,78 <countNumbersWithUniqueDigits+0x3c>
  48:   00a00693    li      a3,10
  4c:   40a686b3    sub     a3,a3,a0
  50:   00900793    li      a5,9
  54:   00a00513    li      a0,10
  58:   00900713    li      a4,9
  5c:   02f70733    mul     a4,a4,a5
  60:   fff78793    addi    a5,a5,-1
  64:   00e50533    add     a0,a0,a4
  68:   fef69ae3    bne     a3,a5,5c <countNumbersWithUniqueDigits+0x20>
  6c:   00008067    ret
  70:   00100513    li      a0,1
  74:   00008067    ret
  78:   00a00513    li      a0,10
  7c:   00008067    ret
```
## Requirements 2: [Homework2](https://hackmd.io/XThOd5JjQtGF9dzGY4yh_Q?both)
### Description
I chose the [Contains Duplicate](https://leetcode.com/problems/contains-duplicate/) problem from [莊集](https://hackmd.io/@y8jRQNyoRe6WG-qekloIlA/Sk0PXEDzj).
### Validate the results
Below are the execution results tested with `verilator` (RTL) and `rvsim` (ISS). As you can see, the results are the same, and the traces of the RTL and ISS simulators also match. Run `make debug=1 Ex2` in the `srv32` directory.
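The expected answers for the three questions can also be confirmed independently of the heap-sort implementation. The following is a minimal sketch of a quadratic reference check; `has_duplicate_bruteforce` is a hypothetical name and is not part of the program that was simulated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical O(n^2) reference: compares every pair of elements and
 * leaves the input array unmodified. */
bool has_duplicate_bruteforce(const int *nums, size_t size)
{
    assert(nums != NULL || size == 0);
    for (size_t i = 0; i < size; i++)
        for (size_t j = i + 1; j < size; j++)
            if (nums[i] == nums[j])
                return true;
    return false;
}
```

Applied to `{1, 4, 5, 6, 2, 3}`, `{3, 2, 1, 4, 2, 1}` and `{-1, -2, 3, 4, -2, -1}` (the inputs used in `main`), it returns false, true and true, which agrees with the three `Accepted` lines in the logs.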
#### RTL Simulation result
```
Question1 Accepted
Question2 Accepted
Question3 Accepted

Excuting 6305 instructions, 8609 cycles, 1.365 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.098 s
Simulation cycles: 8620
Simulation speed : 0.0879592 MHz
```
#### ISS (Instruction Set Simulator) result
```
Question1 Accepted
Question2 Accepted
Question3 Accepted

Excuting 6305 instructions, 8609 cycles, 1.365 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.002 s
Simulation cycles: 8609
Simulation speed : 3.864 MHz

make[1]: Leaving directory '/home/peter/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
### Assembly code
```
0000003c <swap>:
  3c:   00052783    lw      a5,0(a0)
  40:   0005a703    lw      a4,0(a1)
  44:   00e52023    sw      a4,0(a0)
  48:   00f5a023    sw      a5,0(a1)
  4c:   00008067    ret

00000050 <heapify>:
  50:   00259793    slli    a5,a1,0x2
  54:   00f507b3    add     a5,a0,a5
  58:   0007a803    lw      a6,0(a5)
  5c:   00159593    slli    a1,a1,0x1
  60:   00158793    addi    a5,a1,1
  64:   06c7dc63    bge     a5,a2,dc <heapify+0x8c>
  68:   fff60593    addi    a1,a2,-1
  6c:   0280006f    j       94 <heapify+0x44>
  70:   01f7d713    srli    a4,a5,0x1f
  74:   00f70733    add     a4,a4,a5
  78:   40175713    srai    a4,a4,0x1
  7c:   00271713    slli    a4,a4,0x2
  80:   00e50733    add     a4,a0,a4
  84:   00d72023    sw      a3,0(a4)
  88:   00179793    slli    a5,a5,0x1
  8c:   00178793    addi    a5,a5,1
  90:   04c7d663    bge     a5,a2,dc <heapify+0x8c>
  94:   00b7de63    bge     a5,a1,b0 <heapify+0x60>
  98:   00279713    slli    a4,a5,0x2
  9c:   00e50733    add     a4,a0,a4
  a0:   00072683    lw      a3,0(a4)
  a4:   00472703    lw      a4,4(a4)
  a8:   00e6a733    slt     a4,a3,a4
  ac:   00e787b3    add     a5,a5,a4
  b0:   00279713    slli    a4,a5,0x2
  b4:   00e50733    add     a4,a0,a4
  b8:   00072683    lw      a3,0(a4)
  bc:   02d85063    bge     a6,a3,dc <heapify+0x8c>
  c0:   0017f713    andi    a4,a5,1
  c4:   fa0716e3    bnez    a4,70 <heapify+0x20>
  c8:   01f7d713    srli    a4,a5,0x1f
  cc:   00f70733    add     a4,a4,a5
  d0:   40175713    srai    a4,a4,0x1
  d4:   fff70713    addi    a4,a4,-1
  d8:   fa5ff06f    j       7c <heapify+0x2c>
  dc:   0017f713    andi    a4,a5,1
  e0:   02071263    bnez    a4,104 <heapify+0xb4>
  e4:   01f7d713    srli    a4,a5,0x1f
  e8:   00f707b3    add     a5,a4,a5
  ec:   4017d793    srai    a5,a5,0x1
  f0:   fff78793    addi    a5,a5,-1
  f4:   00279793    slli    a5,a5,0x2
  f8:   00f50533    add     a0,a0,a5
  fc:   01052023    sw      a6,0(a0)
 100:   00008067    ret
 104:   01f7d713    srli    a4,a5,0x1f
 108:   00f707b3    add     a5,a4,a5
 10c:   4017d793    srai    a5,a5,0x1
 110:   fe5ff06f    j       f4 <heapify+0xa4>

00000114 <heapSort>:
 114:   fe010113    addi    sp,sp,-32
 118:   00112e23    sw      ra,28(sp)
 11c:   00812c23    sw      s0,24(sp)
 120:   00912a23    sw      s1,20(sp)
 124:   01212823    sw      s2,16(sp)
 128:   01312623    sw      s3,12(sp)
 12c:   00050913    mv      s2,a0
 130:   00058413    mv      s0,a1
 134:   01f5d493    srli    s1,a1,0x1f
 138:   00b484b3    add     s1,s1,a1
 13c:   4014d493    srai    s1,s1,0x1
 140:   fff48493    addi    s1,s1,-1
 144:   0204c063    bltz    s1,164 <heapSort+0x50>
 148:   fff00993    li      s3,-1
 14c:   00040613    mv      a2,s0
 150:   00048593    mv      a1,s1
 154:   00090513    mv      a0,s2
 158:   ef9ff0ef    jal     ra,50 <heapify>
 15c:   fff48493    addi    s1,s1,-1
 160:   ff3496e3    bne     s1,s3,14c <heapSort+0x38>
 164:   fff40493    addi    s1,s0,-1
 168:   0204ce63    bltz    s1,1a4 <heapSort+0x90>
 16c:   00241413    slli    s0,s0,0x2
 170:   00890433    add     s0,s2,s0
 174:   fff00993    li      s3,-1
 178:   00092783    lw      a5,0(s2)
 17c:   ffc42703    lw      a4,-4(s0)
 180:   00e92023    sw      a4,0(s2)
 184:   fef42e23    sw      a5,-4(s0)
 188:   00048613    mv      a2,s1
 18c:   00000593    li      a1,0
 190:   00090513    mv      a0,s2
 194:   ebdff0ef    jal     ra,50 <heapify>
 198:   fff48493    addi    s1,s1,-1
 19c:   ffc40413    addi    s0,s0,-4
 1a0:   fd349ce3    bne     s1,s3,178 <heapSort+0x64>
 1a4:   01c12083    lw      ra,28(sp)
 1a8:   01812403    lw      s0,24(sp)
 1ac:   01412483    lw      s1,20(sp)
 1b0:   01012903    lw      s2,16(sp)
 1b4:   00c12983    lw      s3,12(sp)
 1b8:   02010113    addi    sp,sp,32
 1bc:   00008067    ret

000001c0 <containsDuplicate>:
 1c0:   ff010113    addi    sp,sp,-16
 1c4:   00112623    sw      ra,12(sp)
 1c8:   00812423    sw      s0,8(sp)
 1cc:   00912223    sw      s1,4(sp)
 1d0:   00050413    mv      s0,a0
 1d4:   00058493    mv      s1,a1
 1d8:   f3dff0ef    jal     ra,114 <heapSort>
 1dc:   00100793    li      a5,1
 1e0:   0297d863    bge     a5,s1,210 <containsDuplicate+0x50>
 1e4:   00040513    mv      a0,s0
 1e8:   fff48593    addi    a1,s1,-1
 1ec:   00000793    li      a5,0
 1f0:   00052683    lw      a3,0(a0)
 1f4:   00452703    lw      a4,4(a0)
 1f8:   02e68063    beq     a3,a4,218 <containsDuplicate+0x58>
 1fc:   00178793    addi    a5,a5,1
 200:   00450513    addi    a0,a0,4
 204:   feb796e3    bne     a5,a1,1f0 <containsDuplicate+0x30>
 208:   00000513    li      a0,0
 20c:   0100006f    j       21c <containsDuplicate+0x5c>
 210:   00000513    li      a0,0
 214:   0080006f    j       21c <containsDuplicate+0x5c>
 218:   00100513    li      a0,1
 21c:   00c12083    lw      ra,12(sp)
 220:   00812403    lw      s0,8(sp)
 224:   00412483    lw      s1,4(sp)
 228:   01010113    addi    sp,sp,16
 22c:   00008067    ret
```
## Waveform analysis with GTKWave
### srv32 CPU
`srv32` is a simple RISC-V processor with a 3-stage pipeline: IF/ID, EX and WB. The following block diagram is based on [Lab3: srv32](https://hackmd.io/@sysprog/S1Udn1Xtt). Details of the design can be found in the Verilog RTL source, `srv32/rtl/riscv.v`.
![](https://i.imgur.com/0tjrp72.png)

RISC-V provides two types of registers: **GPR** (General Purpose Register) and **CSR** (Control and Status Register).
### Branch Penalty - flush the pipeline
`Branch penalty` is the number of instructions killed when a branch is TAKEN. The branch result is resolved by the ALU at the end of the EX stage, so the instruction fetched in IF/ID may need to be killed if the branch is taken. In `srv32`, however, the address of the next instruction (next PC) must be fed into I-MEM one cycle ahead. Thus, the branch penalty for `srv32` is 2. To clarify, by the time the next PC is resolved, one instruction has already been fetched into the pipeline and another PC has already been calculated, because the address has to be computed one cycle ahead.

Consider the following instruction sequence and its corresponding waveform.
```
  3c:   02050a63    beqz    a0,70 <countNumbersWithUniqueDigits+0x34>
  40:   00100793    li      a5,1
  44:   02a7da63    bge     a5,a0,78
```
![](https://i.imgur.com/Ej47wa8.png)

First, we look at the `branch` instruction (PC=00000044) to understand the penalty of a `branch/jump` instruction. While the instruction is in the EX stage, it sets `next_pc`, which is determined by the next-PC logic in `srv32/rtl/riscv.v`.
For a branch instruction, `next_pc = ex_imm = 00000070`, which is the pink box in the waveform. After two cycles, the value of `if_pc` is updated to the destination `70`; therefore, the branch penalty is 2. Also, `ex_flush` and `wb_flush` are asserted for 2 cycles, indicating that the next 2 instructions are flushed out of the pipeline.
### Dealing with Data Hazard using Forwarding
`srv32` supports full forwarding, which means that a RAW data hazard can be resolved WITHOUT stalling the processor. Consider the following instruction sequence and its corresponding waveform.
```
  58:   00900713    li      a4,9
  5c:   02f70733    mul     a4,a4,a5
  60:   fff78793    addi    a5,a5,-1
  64:   00e50533    add     a0,a0,a4
```
![](https://i.imgur.com/nGZSBPv.png)

Looking at the waveform, when the RAW data hazard happens, the data from `wb_result` is forwarded to `ex_rdata2`; the value is `10`. The implementation of forwarding in `srv32/rtl/riscv.v` is as follows:
```
assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 :
                           (!wb_flush && wb_alu2reg &&
                            (wb_dst_sel == ex_src1_sel)) ? // register forwarding
                           (wb_mem2reg ? wb_rdata : wb_result) :
                           regs[ex_src1_sel];
assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 :
                           (!wb_flush && wb_alu2reg &&
                            (wb_dst_sel == ex_src2_sel)) ? // register forwarding
                           (wb_mem2reg ? wb_rdata : wb_result) :
                           regs[ex_src2_sel];
```
### Load-use hazard
Load-use hazard is NOT an issue in `srv32`, because D-MEM is read at the WB stage and the register file is also read at the WB stage. A single MUX is used to switch between the 2 operands (the operand from the register file and the operand from D-MEM). Therefore, a `load-use hazard` can be resolved without stalling the processor. Consider the following instruction sequence and its corresponding waveform.
```
  6c:   0007a683    lw      a3,0(a5)
  70:   fea688e3    beq     a3,a0,60 <countNumbersWithUniqueDigits+0x24>
```
![](https://i.imgur.com/0tYkjQv.jpg)

Let's first look at the blue box in the waveform.
`wb_dst_sel (lw)` is equal to `ex_src1_sel (beq)`, which indicates that a `RAW` data hazard happens. We then check that `wb_mem2reg` is also equal to 1, which indicates that the producer is a load instruction whose value has to be bypassed from memory instead of being read from the register file. That is, `wb_rdata`, the output of `D-MEM`, is passed to `reg_rdata1` directly, and we do not have to stall for any cycle.

:::spoiler {state="close"} `load-use hazard` solution in `srv32/rtl/riscv.v`
```
always @(posedge clk or negedge resetb) begin
    if (!resetb)
        ex_mem2reg <= 1'b0;
    else if (inst[`OPCODE] == OP_LOAD)
        ex_mem2reg <= 1'b1;
    else if (ex_mem2reg && dmem_rvalid)
        ex_mem2reg <= 1'b0;
end

always @(posedge clk or negedge resetb) begin
    if (!resetb) begin
        wb_result     <= 32'h0;
        wb_alu2reg    <= 1'b0;
        wb_dst_sel    <= 5'h0;
        wb_branch     <= 1'b0;
        wb_branch_nxt <= 1'b0;
        wb_mem2reg    <= 1'b0;
        wb_raddr      <= 2'h0;
        wb_alu_op     <= 3'h0;
    end else if (!ex_stall) begin
        wb_result     <= ex_result;
        wb_alu2reg    <= ex_alu || ex_lui || ex_auipc || ex_jal || ex_jalr ||
                         ex_csr ||
`ifdef RV32M_ENABLED
                         ex_mul ||
`endif
                         (ex_mem2reg && !ex_ld_align_excp);
        wb_dst_sel    <= ex_dst_sel;
        wb_branch     <= branch_taken || ex_trap;
        wb_branch_nxt <= wb_branch;
        wb_mem2reg    <= ex_mem2reg;
        wb_raddr      <= dmem_raddr[1:0];
        wb_alu_op     <= ex_alu_op;
    end
end

always @* begin
    case(wb_alu_op)
        OP_LB : begin
            case(wb_raddr[1:0])
                2'b00: wb_rdata[31: 0] = {{24{dmem_rdata[7]}}, dmem_rdata[ 7: 0]};
                2'b01: wb_rdata[31: 0] = {{24{dmem_rdata[15]}}, dmem_rdata[15: 8]};
                2'b10: wb_rdata[31: 0] = {{24{dmem_rdata[23]}}, dmem_rdata[23:16]};
                2'b11: wb_rdata[31: 0] = {{24{dmem_rdata[31]}}, dmem_rdata[31:24]};
            endcase
        end
        OP_LH : begin
            wb_rdata = (wb_raddr[1]) ? {{16{dmem_rdata[31]}}, dmem_rdata[31:16]} :
                                       {{16{dmem_rdata[15]}}, dmem_rdata[15: 0]};
        end
        OP_LW : begin
            wb_rdata = dmem_rdata;
        end
        OP_LBU : begin
            case(wb_raddr[1:0])
                2'b00: wb_rdata[31: 0] = {24'h0, dmem_rdata[7:0]};
                2'b01: wb_rdata[31: 0] = {24'h0, dmem_rdata[15:8]};
                2'b10: wb_rdata[31: 0] = {24'h0, dmem_rdata[23:16]};
                2'b11: wb_rdata[31: 0] = {24'h0, dmem_rdata[31:24]};
            endcase
        end
        OP_LHU : begin
            wb_rdata = (wb_raddr[1]) ? {16'h0, dmem_rdata[31:16]} :
                                       {16'h0, dmem_rdata[15: 0]};
        end
        default: begin
            wb_rdata = 32'h0;
        end
    endcase
end

assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 :
                           (!wb_flush && wb_alu2reg &&
                            (wb_dst_sel == ex_src1_sel)) ? // register forwarding
                           (wb_mem2reg ? wb_rdata : wb_result) : // load instruction?
                           regs[ex_src1_sel];
assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 :
                           (!wb_flush && wb_alu2reg &&
                            (wb_dst_sel == ex_src2_sel)) ? // register forwarding
                           (wb_mem2reg ? wb_rdata : wb_result) : // load instruction?
                           regs[ex_src2_sel];
```
:::

## Software Optimizations
In this part, I try to optimize the C code by applying the loop unrolling technique.
![](https://i.imgur.com/ikw4s3c.png)

Below are the original C code and its simulation results.
### Original C-code
```c
#include <stdio.h>
#include <stdbool.h>

void swap(int *x, int *y){
    int temp = *x;
    *x = *y;
    *y = temp;
}

void heapify(int *nums, int i, int numsSize){
    int currValue = nums[i];
    int j = i * 2 + 1; // left child
    int parent;
    while (j < numsSize){
        if (j < (numsSize - 1)){
            if (nums[j] < nums[j + 1]){
                j = j + 1;
            }
        }
        if (currValue >= nums[j])
            break;
        else{
            if (j % 2 == 0)
                parent = j / 2 - 1;
            else
                parent = j / 2;
            nums[parent] = nums[j];
            j = j * 2 + 1;
        }
    }
    if (j % 2 == 0)
        parent = j / 2 - 1;
    else
        parent = j / 2;
    nums[parent] = currValue;
}

void heapSort(int *nums, int numsSize){
    for (int i = numsSize / 2 - 1; i >= 0; i--){
        heapify(nums, i, numsSize);
    }
    for (int i = numsSize - 1; i >= 0; i--){
        swap(&nums[0], &nums[i]);
        heapify(nums, 0, i);
    }
}

bool containsDuplicate(int* nums, int numsSize){
    heapSort(nums, numsSize);
    for (int i = 0; i < numsSize - 1; i++){
        if (nums[i] == nums[i + 1])
            return true;
    }
    return false;
}

int main(){
    int question1[] = {1, 4, 5, 6, 2, 3};
    bool answer1 = false;
    printf("Question1 ");
    if (answer1 == containsDuplicate(question1, (sizeof(question1)/sizeof(question1[0]))))
        printf("Accepted\n");
    else
        printf("Failed\n");

    int question2[] = {3, 2, 1, 4, 2, 1};
    bool answer2 = true;
    printf("Question2 ");
    if (answer2 == containsDuplicate(question2, (sizeof(question2)/sizeof(question2[0]))))
        printf("Accepted\n");
    else
        printf("Failed\n");

    int question3[] = {-1, -2, 3, 4, -2, -1};
    bool answer3 = true;
    printf("Question3 ");
    if (answer3 == containsDuplicate(question3, (sizeof(question3)/sizeof(question3[0]))))
        printf("Accepted\n");
    else
        printf("Failed\n");

    return 0;
}
```
#### RTL Simulation result
```
Question1 Accepted
Question2 Accepted
Question3 Accepted

Excuting 7581 instructions, 10181 cycles, 1.342 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.131 s
Simulation cycles: 10192
Simulation speed : 0.0778015 MHz
```
#### ISS (Instruction Set Simulator) result
```
Question1 Accepted
Question2 Accepted
Question3 Accepted

Excuting 7581 instructions, 10181 cycles, 1.342 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.03 s
Simulation cycles: 10192
Simulation speed : 4.02 MHz

make[1]: Leaving directory '/home/jack/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
### C-Code optimization
#### Loop unrolling
Instead of comparing 2 consecutive elements of the sorted array per iteration, as in the original code, I compare 3. That is, each iteration compares `nums[i]` with `nums[i+1]` and `nums[i+1]` with `nums[i+2]`.
```C
bool containsDuplicate(int* nums, int numsSize){
    heapSort(nums, numsSize);
    //for (int i = 0; i < numsSize - 1; i++){
    //    if (nums[i] == nums[i + 1])
    //        return true;
    //}
    int i;
    // Unrolled by two. Requiring i + 2 < numsSize keeps nums[i + 2] in bounds;
    // with the bound i < numsSize - 1 the last iteration would read one
    // element past the end of the array.
    for (i = 0; i + 2 < numsSize; i += 2){
        if ((nums[i] == nums[i + 1]) || (nums[i + 1] == nums[i + 2]))
            return true;
    }
    // Leftover pair when numsSize is even.
    if (i + 1 < numsSize && nums[i] == nums[i + 1])
        return true;
    return false;
}
```
This should reduce the number of executed instructions, because the number of loop iterations is roughly halved.
#### Replace operation
Next, I replace the modulus operation `% 2` with the bitwise operation `& 0x0001`. The idea is to reduce the number of instructions, since RV32I has no remainder instruction (`rem` belongs to the M extension). Note, however, that the `heapify` disassembly shown earlier already tests parity with `andi a4,a5,1`, so the compiler seems to perform this replacement by itself and the gain is likely to be small.
```c
//if (j % 2 == 0)
if ((j&0x0001) == 0)
```
### Optimized Simulation Result
```
Question1 Accepted
Question2 Accepted
Question3 Accepted

Excuting 7561 instructions, 10149 cycles, 1.342 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.112 s
Simulation cycles: 10160
Simulation speed : 0.0907143 MHz

Excuting 7561 instructions, 10149 cycles, 1.342 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.002 s
Simulation cycles: 10149
Simulation speed : 4.413 MHz

make[1]: Leaving directory '/home/peter/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
### Conclusion
In conclusion, the optimized program executes **20 fewer instructions** (from 7581 to 7561) and takes **32 fewer cycles** (from 10181 to 10149), an improvement of roughly 0.3%. I also tried another test vector, but the number of instructions saved stayed the same.

## How RISC-V Compliance Tests works
The [RISC-V Architectural Testing Framework](https://github.com/riscv-non-isa/riscv-arch-test/blob/master/doc/README.adoc) is a constantly evolving set of tests developed to help confirm that software written for a given RISC-V Profile/Specification will run on all implementations that adhere to that profile. Passing the RISC-V Architectural Tests does not, however, mean that a design fully complies with the RISC-V architecture: these simple tests only check the key elements of the specification without going into the finer points. Additionally, the compliance tests exercise instruction-set behavior rather than RTL-specific features.
### Compliance tests results
To run the compliance tests for RTL, run `make tests` in the `srv32` directory; for the ISS simulator, run `make tests-sw`. We then get the results below.
As you can see, srv32 passes the compliance tests of `rv32Zicsr`, `rv32i` and `rv32im` in both the RTL and the ISS simulator. These names stand for `Control and Status Register`, `Base Integer Instruction Set, 32-bit` and `Standard Extension for Integer Multiplication and Division`, respectively.
#### RTL
```
OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr
OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i
OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im
```
#### ISS simulator
```
OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr
OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i
OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im
```
## Explain how srv32 works with Verilator.
Verilator is a tool that converts Verilog or SystemVerilog code into a C++/SystemC model so that the design can be simulated and verified. The user instantiates the "Verilated" model of the top-level module through a small C++/SystemC wrapper file. A C++ compiler then compiles these C++/SystemC files, and the resulting executable carries out the design simulation.

Additionally, Verilator is about 100 times faster on a single thread than interpreted Verilog simulators like Icarus Verilog, and multithreading may give a further speedup of 2–10 times. In total, it can therefore outperform interpreted simulators by a factor of 200 to 1000.

When the `make $(SUBDIRS)` command is entered in the `srv32/` directory, it calls the makefile in `srv32/sw/SUBDIRS` with the `make $@.run` command, which in turn calls the makefile in the `srv32/sim` directory. The model is compiled with the command below:
```
verilator -O3 -cc -Wall -Wno-STMTDLY -Wno-UNUSED +define+MEMSIZE=$(memsize) --trace-fst --Mdir sim_cc --build --exe sim_main.cpp getch.cpp
```
This creates several files in the `sim_cc` directory (set by `--Mdir`) under `srv32/sim`. The "Files Read/Written" page of the Verilator documentation describes each file.

Additionally, it runs the command listed below to create a waveform and a trace log that help us test our code.
```
./sim +trace +dump
```