# Assignment3: SoftCPU contributed by < [Chen Ching Wen]() > ###### tags: `RISC-V`, `Computer Architure 2022` [TOC] ## Requirements 1,3 ### Problem choosen from homework1 After basic tools installation(include riscv-gnu-toolchain, verilator, and srv32), I picked up the problem done by [莊集](https://hackmd.io/@y8jRQNyoRe6WG-qekloIlA/Sk0PXEDzj) and try to compare the results generated from the c code and the handwritten assembly code. In order to make the handwritten assembly code executable on the srv32, we need to modify the **printArray** part in the assembly code. In this part, we need to notice few things; first of all, before call printf, we shouldn't save any variable that couldn't be modified outside the function in tx registers(e.g. nums head addr in t1). Secondly, we need to printf a "\n" in the end of printArray, or it couldn't print out anything. I didn't get it why it's in this way. ```assembly printArray: # void printArray(*nums, numsSize), nums:a0, numsSize:a1 addi sp, sp, -16 # store ra, s0, s1 sw ra, 0(sp) # store ra sw s0, 4(sp) # store s0 sw s1, 8(sp) # store s1 sw s2, 12(sp) mv s0, a0 # keep nums head addr in s0 mv s1, a1 # keep numsSize in s1 mv s2, a0 # keep nums head addr in t1 print_conti: beq x0, s1, print_end la a0, iformat lw a1, 0(s2) call printf addi s1, s1, -1 addi s2, s2, 4 la a0, comma call printf j print_conti print_end: la a0, newLine call printf lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) addi sp, sp, 16 ret ``` - **Result of C Code (Instruction Set Simulator)** ```shell= $export CROSS_COMPILE=riscv-none-elf- $export EXTRA_CFLAGS="-misa-spec=2.2 -march=rv32im" ``` ```shell= $ ./rvsim --memsize 128 -l trace.log ../sw/containDup_c/containDup.elf Question1 Accepted Excuting 2787 instructions, 3891 cycles, 1.396 CPI Program terminate Simulation statistics ===================== Simulation time : 0.002 s Simulation cycles: 3891 Simulation speed : 2.477 MHz ``` - **Result of C Code (RTL Simulation)** ```shell= ##~/srv32/sim $ make containDup_c.run cp ../sw/containDup_c/*.bin . stdbuf -o0 -e0 ./sim +trace +dump | awk -f checkcode.awk Question1 Accepted Excuting 2787 instructions, 3891 cycles, 1.396 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.051 s Simulation cycles: 3902 Simulation speed : 0.0765098 MHz ``` - **Result of Assembly Code (RTL Simulation)** ```shell= ##~/srv32/sim $ make containDup.run cp ../sw/containDup/*.bin . stdbuf -o0 -e0 ./sim +trace +dump | awk -f checkcode.awk Question1 Accepted DMEM address 00040000 out of range - ../rtl/../testbench/testbench.v:439: Verilog $finish Simulation statistics ===================== Simulation time : 0.074 s Simulation cycles: 6341 Simulation speed : 0.0856892 MHz ``` - Observation We notice that the hanwritten assembly takes almost two time cycles than c-generated assembly code. The reason is that the branch that is used by handwritten assembly is more than the other. - Optimization - change the order of instruction(reduce the jump instruction) ```assembly= ##original heapifySkip: slli s10, t4, 2 # let s10 = 4 * j add s10, a0, s10 # s10 = (nums + 4j) lw s10, 0(s10) # s10 = *(nums + 4j) slt s10, t5, s10 # if (currValue >= nums[j]) then break beq s10, zero, assignCurrVal andi t3, t4, 1 # if (j % 2 == 0) then parent = j / 2 - 1 beq t3, zero, innerEvenAssign # else if (j % 2 == 1) then parent = j / 2 j innerOddAssign innerEvenAssign: srli t3, t4, 1 # let t3 be parent = j / 2 - 1 addi t3, t3, -1 j innerOddAssignSkip innerOddAssign: srli t3, t4, 1 # let t3 be parent = j / 2 innerOddAssignSkip: slli t3, t3, 2 # t3 = parent * 4 add t3, a0, t3 # t3 = (nums + 4 * parent) slli s11, t4, 2 # let s11 = 4 * j add s11, a0, s11 # s11 = nums + 4j lw s11, 0(s11) # s11 = nums[j] sw s11, 0(t3) # nums[parent] = nums[j] slli t4, t4, 1 # j = j * 2 + 1 addi t4, t4, 1 j heapifyWhile ``` ```assembly= ##original heapifySkip: slli s10, t4, 2 # let s10 = 4 * j add s10, a0, s10 # s10 = (nums + 4j) lw s10, 0(s10) # s10 = *(nums + 4j) slt s10, t5, s10 # if (currValue >= nums[j]) then break beq s10, zero, assignCurrVal andi t3, t4, 1 # if (j % 2 == 0) then parent = j / 2 - 1 beq t3, zero, innerEvenAssign # else if (j % 2 == 1) then parent = j / 2 innerOddAssign: srli t3, t4, 1 # let t3 be parent = j / 2 innerOddAssignSkip: slli t3, t3, 2 # t3 = parent * 4 add t3, a0, t3 # t3 = (nums + 4 * parent) slli s11, t4, 2 # let s11 = 4 * j add s11, a0, s11 # s11 = nums + 4j lw s11, 0(s11) # s11 = nums[j] sw s11, 0(t3) # nums[parent] = nums[j] slli t4, t4, 1 # j = j * 2 + 1 addi t4, t4, 1 j heapifyWhile innerEvenAssign: srli t3, t4, 1 # let t3 be parent = j / 2 - 1 addi t3, t3, -1 j innerOddAssignSkip ``` ```shell= ##rtl simulation result after reduce the jump instruction Simulation statistics ===================== Simulation time : 0.072 s Simulation cycles: 6323 Simulation speed : 0.0878194 MHz ``` ```assembly= assignCurrVal: andi t3, t4, 1 # if (j % 2 == 0) then parent = j / 2 - 1 beq t3, zero, evenAssign # else if (j % 2 == 1) then parent = j / 2 j oddAssign evenAssign: srli t3, t4, 1 # let t3 be parent = j / 2 - 1 addi t3, t3, -1 j oddAssignSkip oddAssign: srli t3, t4, 1 # let t3 be parent = j / 2 oddAssignSkip: slli t3, t3, 2 # t3 = parent * 4 add t3, a0, t3 # t3 = (nums + 4 * parent) sw t5, 0(t3) # nums[parent] = currValue lw s10, 0(sp) # restore s10 lw s11, 4(sp) # restore s11 addi sp, sp, 8 jr ra ``` ```assembly= assignCurrVal: andi t3, t4, 1 # if (j % 2 == 0) then parent = j / 2 - 1 beq t3, zero, evenAssign # else if (j % 2 == 1) then parent = j / 2 oddAssign: srli t3, t4, 1 # let t3 be parent = j / 2 oddAssignSkip: slli t3, t3, 2 # t3 = parent * 4 add t3, a0, t3 # t3 = (nums + 4 * parent) sw t5, 0(t3) # nums[parent] = currValue lw s10, 0(sp) # restore s10 lw s11, 4(sp) # restore s11 addi sp, sp, 8 jr ra evenAssign: srli t3, t4, 1 # let t3 be parent = j / 2 - 1 addi t3, t3, -1 j oddAssignSkip ``` ```shell= Simulation statistics ===================== Simulation time : 0.071 s Simulation cycles: 6296 Simulation speed : 0.0886761 MHz ``` ```assembly= #####unroll begin bge t4, a2, assignCurrVal # while j < numsSize do adjust parent node addi t3, a2, -1 # if j < (numsSize - 1) && nums[j] < nums[j + 1] then j++ slt t3, t4, t3 # if j < numsSize - 1 then keep doing beq t3, zero, heapifySkip addi s11, t4, 1 # let s11 be dereference of *(nums + (j + 1)) slli s11, s11, 2 # s11 = 4 * (j + 1) add s11, a0, s11 # s11 = nums + 4 * (j + 1) lw s11, 0(s11) # s11 = *(nums + 4(j + 1)) = nums[j + 1] slli s10, t4, 2 # s10 = 4 * j add s10, a0, s10 # s10 = (num + 4j) lw s10, 0(s10) # s10 = *(num + 4j) = nums[j] slt t3, s10, s11 # if nums[j] < nums[j + 1] ? beq t3, zero, heapifySkip # if nums[j] < nums[j + 1] then j++ addi t4, t4, 1 # j++ slli s10, t4, 2 # let s10 = 4 * j add s10, a0, s10 # s10 = (nums + 4j) lw s10, 0(s10) # s10 = *(nums + 4j) slt s10, t5, s10 # if (currValue >= nums[j]) then break beq s10, zero, assignCurrVal andi t3, t4, 1 # if (j % 2 == 0) then parent = j / 2 - 1 beq t3, zero, innerEvenAssign # else if (j % 2 == 1) then parent = j / 2 srli t3, t4, 1 # let t3 be parent = j / 2 slli t3, t3, 2 # t3 = parent * 4 add t3, a0, t3 # t3 = (nums + 4 * parent) slli s11, t4, 2 # let s11 = 4 * j add s11, a0, s11 # s11 = nums + 4j lw s11, 0(s11) # s11 = nums[j] sw s11, 0(t3) # nums[parent] = nums[j] slli t4, t4, 1 # j = j * 2 + 1 addi t4, t4, 1 #####unroll end j heapifyWhile ``` ```shell= Simulation statistics ===================== Simulation time : 0.072 s Simulation cycles: 6275 Simulation speed : 0.0871528 MHz ``` ```assembly= callHeapifyWhile: blt t6, zero, endHeapifywhile # while i >= 0, then do heapify heapSortSkip: mv a0, s0 # a0 = *nums mv a1, t6 # a1 = i mv a2, s1 # a2 = numsSize jal heapify # call heapify addi t6, t6, -1 # i-- j callHeapifyWhile # do next heapify while i still >= 0 endHeapifywhile: lw ra, 0(sp) # restore ra lw s0, 4(sp) # restore s0 lw s1, 8(sp) # restore s1 lw s2, 12(sp) # restore s2 addi sp, sp, 16 jr ra # return from heapSort ``` ```shell= Simulation statistics ===================== Simulation time : 0.073 s Simulation cycles: 6261 Simulation speed : 0.0857671 MHz ``` ### LeetCode [85. Maximal Rectangle](https://leetcode.com/problems/maximal-rectangle/) - Given a rows x cols binary matrix filled with 0's and 1's, find the largest rectangle containing only 1's and return its area. ```cpp= #include <stdio.h> #include <stdlib.h> int largestRectangleArea(int *heights, int h_len) { if(h_len == 0){ return 0; } int *leftLessMin = (int *)malloc(sizeof(int)*h_len); leftLessMin[0] = -1; for(int i = 1; i < h_len; i++){ int l = i - 1; while(l >= 0 && heights[l] >= heights[i]){ l = leftLessMin[l]; } leftLessMin[i] = l; } int *rightLessMin = (int *)malloc(sizeof(int)*h_len); rightLessMin[h_len - 1] = h_len; for(int i = h_len - 2; i >= 0; i--){ int r = i + 1; while(r <= h_len - 1 && heights[r] >= heights[i]){ r = rightLessMin[r]; } rightLessMin[i] = r; } int maxArea = 0; for(int i= 0;i < h_len; i++){ int area = (rightLessMin[i] - leftLessMin[i] - 1)*heights[i]; maxArea = area > maxArea ? area : maxArea; } return maxArea; } int maxRectangle(int matrixSize, int matrixColSize, int matrix[][matrixColSize]) { if(matrixSize == 0){ return 0; } //int *heights = (int *)malloc(sizeof(int)*matrixColSize); int *heights = calloc(matrixColSize, sizeof(int)); int maxArea = 0; int tmp = 0; for(int row = 0; row < matrixSize; row++){ for(int col = 0; col < matrixColSize; col++){ if(matrix[row][col] == 1){ heights[col] += 1; }else{ heights[col] = 0; } } tmp = largestRectangleArea(heights, matrixColSize); maxArea = maxArea > tmp ? maxArea : tmp; } return maxArea; } int main(){ int problem1[4][5] = {{1, 0, 1, 0, 0}, {1, 0, 1, 1, 1}, {1, 1, 1, 1, 1}, {1, 0, 0, 1, 0}}; int ans = maxRectangle( 4, 5, problem1); printf("Answer for problem1 = %d\n", ans); int problem2[1][1] = {{1}}; ans = maxRectangle( 4, 5, problem2); printf("Answer for problem2 = %d\n", ans); int problem3[3][6] = {{1, 1, 1, 1, 1, 1}, {0, 0, 1, 1, 1, 1}, {0, 1, 1, 1, 1, 1}}; ans = maxRectangle( 3, 6, problem3); printf("Answer for problem3 = %d\n", ans); } ``` ```shell Answer for problem1 = 6 Answer for problem2 = 1 Answer for problem3 = 12 Excuting 11951 instructions, 16329 cycles, 1.366 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.169 s Simulation cycles: 16340 Simulation speed : 0.0966864 MHz ``` ```shell= $ ./rvsim --memsize 128 -l trace.log ../sw/MaxRectArea/MaxRectArea.elf Answer for problem1 = 6 Answer for problem2 = 1 Answer for problem3 = 12 Excuting 11951 instructions, 16329 cycles, 1.366 CPI Program terminate Simulation statistics ===================== Simulation time : 0.008 s Simulation cycles: 16329 Simulation speed : 2.165 MHz ``` ## Requirements 2 ### load There is no needed stall since when load occurs and has a data dependency with the next instruction, the **wb_mem2reg** becomes 1 and threrefore, **wb_data** = **reg_rdata2** ```assembly 7428: 000c2583 lw a1,0(s8) 742c: 00ba85b3 add a1,s5,a1 7430: 00bc2023 sw a1,0(s8) ``` ![](https://i.imgur.com/onfC5ho.png) ### forwarding - Firstly, we notice that add and sw have a RAW data hazard; however through forwarding, **a1** would be sent to the next exe. stage(run **sw** instruction) after **add** instruction finishes in exe stage. We can verify it through checking out **wb_result** and **reg_rdata2**, which have the same values **0x418**. ```assembly= 742c: 00ba85b3 add a1,s5,a1 7430: 00bc2023 sw a1,0(s8) ``` ![](https://i.imgur.com/YthdmtQ.png) ### branch It need flushes two instructions. Since whether to jump or not for the branch is decided in the EXE stage, and before it send the signal to the mux in the IF/ID stage to determine the next pc, the second flushed instruction already be read from the program counter. ```assembly= 7440: 28079c63 bnez a5,76d8 <_malloc_r+0x5ac> 7444: 0089ab83 lw s7,8(s3) 7448: 015b07b3 add a5,s6,s5 744c: 0017e793 ori a5,a5,1 ``` ![](https://i.imgur.com/xTHJl3Q.png) ## Requirements 4 There are two major points I want to figure out through this homework; first off, how verilator is executed with verilog; secondly, how interrupt is implemented in a CPU via verilog. ### verilator - Verilator is a simulation tool which is similar to the rule of testbench of verilog, except it's written in c++. ```verilog= //our.v module our; initial begin $display("Hello World"); $finish; end endmodule ``` ```cpp= #include "Vour.h" #include "verilated.h" int main(int argc, char **argv, char **env) { Verilated::commandArgs(argc, argv); Vour* top = new Vour; while (!Verilated::gotFinish()) { top->eval(); } exit(0); } ``` ```shell= $export VERILATOR_ROOT=/path/to/where/verilator/was/installed $VERILATOR_ROOT/bin/verilator --cc our.v --exe sim_main.cpp $cd obj_dir $make -j -f Vour.mk Vour $cd .. $obj_dir/Vour ``` ### interruption/exception In general, a complete CPU should contains three mode: user, machine, supervisor mode. In srv32, it's mainly implemented for machine mode(e.g. mtvec means trap-vector base-address register under machine mode) Following are the corresponding CSR registers for interruption and exception - CSR registers | register | function | | ---- | -------- | | mtvec | Define the pc address of the instruction that cause exception | | mcause |Define the reason for exception | |mtval|The value corresponding for the exception. e.g.If exception occurs by r/w from/to memory, than mtval store the visited memory address| |mepc|There would be a slightly difference. For interruption, it point to the next non-executed instruction, while for exception, it points to the current instruction which cause the exception| |mstatus|Contain multiple information related to the status. e.g.(1)**MIE**:machine mode interruption enable (2)**MPIE**: previous MIE| |mie|Control different types of interruptions(a mask)| |mip|Pending status for different interruption| ## Reference [leetcode 85 tutorial](https://leetcode.wang/leetCode-85-Maximal-Rectangle.html) [verilator(1) - Linux man page](https://linux.die.net/man/1/verilator)