Assignment3: SoftCPU

# Assignment3: SoftCPU contributed by < [`Jack`](https://github.com/Chi-you) > ###### tags: `RISC-V`, `Computer Architure 2021` ## Requirements 1 First I set up the `RISC-V toolchains` in my environment, `WSL2 Ubuntu 20.04`, by the tutorial from [鄭育丞](https://hackmd.io/@eecheng/B1fEgnQwF). Next I copy the c code I wrote in [Assignment 1](https://hackmd.io/@jackli/CA_hw1) into `srv32/sw/me/me.c`; then I add a `Makefile` which is copied from `srv32/sw/qsort`. Finally, I go into the `srv32` directory and enter `$make` to compile it and get the result of RTL simulation and ISS simulator. At first, I use `-O3` optimization to compile it, but I find something strange. That is, I cannot find where it get into the `majorityElement` function, which I think that it simplifies the whole process while calling it. So although it's highly optimized in the number of instrucitons and cycles, it's inconvenient to observe waveform. Therefore, I decide to change `-O3` to `-O1` in `srv32/sw/common/Makefile.common` to look at the normal and straightforward one. ### Compilation ``` riscv-none-embed-gcc -O1 -Wall -march=rv32im -mabi=ilp32 -nostartfiles -nostdlib -L../common -o me.elf me.c -lc -lm -lgcc -lsys -T ../common/default.ld riscv-none-embed-objcopy -j .text -O binary me.elf imem.bin riscv-none-embed-objcopy -j .data -O binary me.elf dmem.bin riscv-none-embed-objcopy -O binary me.elf memory.bin riscv-none-embed-objdump -d me.elf > me.dis riscv-none-embed-readelf -a me.elf > me.symbol ``` Here, the `.bin` file saves the memory content of data and text section and it will be loaded into `srv32/testbench/testbench.v` during verification. ### Validate the results Below are the execution results that are tested by `verilator` and `rvsim`. As you see, the results are the same and the traces between RTL and ISS simulator are also same. #### RTL Simulation result ``` The majority element is 2 Excuting 1720 instructions, 2246 cycles, 1.305 CPI Program terminate - ../rtl/../testbench/testbench.v:418: Verilog $finish Simulation statistics ===================== Simulation time : 0.057 s Simulation cycles: 2257 Simulation speed : 0.0395965 MHz ``` #### ISS (Instruction Set Simulator) result ``` ./rvsim --memsize 128 -l trace.log ../sw/me/me.elf The majority element is 2 Excuting 1720 instructions, 2246 cycles, 1.306 CPI Program terminate Simulation statistics ===================== Simulation time : 0.003 s Simulation cycles: 2246 Simulation speed : 0.752 MHz make[1]: Leaving directory '/home/jack/srv32/tools' Compare the trace between RTL and ISS simulator === Simulation passed === ``` #### Assembly code It is generated from GNU Toolchain for RISC-V with `-O1` optimization and it can be found in `srv32/sw/me/me.dis`. ``` 0000003c <majorityElement>: 3c: 00050613 mv a2,a0 # a2 = base address of array 40: 00052503 lw a0,0(a0) # major = num[0] 44: 00100793 li a5,1 # i = 1 48: 04b7d263 bge a5,a1,8c <majorityElement+0x50> # if i >= n, return 4c: 00460793 addi a5,a2,4 # address of index 1 50: 00259593 slli a1,a1,0x2 # n = n * 4 54: 00b60633 add a2,a2,a1 # a2 = base_addr + 4n 58: 00100713 li a4,1 # cnt = 1 5c: 0100006f j 6c <majorityElement+0x30> 60: 00170713 addi a4,a4,1 # cnt++ 64: 00478793 addi a5,a5,4 68: 02c78263 beq a5,a2,8c <majorityElement+0x50> # check the for condition 6c: 0007a683 lw a3,0(a5) # a3 = num[i] 70: fea688e3 beq a3,a0,60 <majorityElement+0x24> # if num[i] = major, branch 74: 00070663 beqz a4,80 <majorityElement+0x44> # if cnt = 0, branch 78: fff70713 addi a4,a4,-1 # cnt-- 7c: fe9ff06f j 64 <majorityElement+0x28> 80: 00068513 mv a0,a3 # major = num[i] 84: 00100713 li a4,1 # cnt = 1 88: fddff06f j 64 <majorityElement+0x28> 8c: 00008067 ret ``` ## Waveform analysis ### srv32 CPU Before we dig into the waveform, let's look at the `srv32` architecture, simple RISC-V 3-stage pipeline processor with IF/ID, EX, WB stages first. The following block diagram is based on the [Lab3: srv32](https://hackmd.io/@sysprog/S1Udn1Xtt) and `srv32/rtl/riscv.v`. ![](https://i.imgur.com/0tjrp72.png) There are two kind of registers in RISC-V, that is, **GPR** (General Purpose Register) and **CSR** (Control and Status Register). #### CSR (Control and Status Register): CSR is used to configure or record the processor running status. Below are the registers that `srv32` have implemented and these registers are used to handle `interruption` and `exception`. | Name | Description | |:--------:|:---------------------------------------- | | mscratch | Scrach registerfor machine trap handlers | | mstatus | Machine staus register | | misa | ISA and extensions | | mie | Machine interrupt-enable register | | mip | Machine interrupt pending | | mtvec | Machine trap-handler base address | | mepc | Machine exception program counter | | mcause | Machine trap cause | | mtval | Machine bad address or instruction | #### GPR (General Purpose Register) For GPR, I think that everyone is fimiliar with it. If you are not, you can refer to the [wiki/RISC-V](https://en.wikipedia.org/wiki/RISC-V). In addition, `srv32` is different from the normal `RISC-V` 5-stage architecture. That is, register file do not read in ID stage; instead, it reads at WB stage. ### Analysis ### Branch Penalty `Branch penalty` is the number of instructions killed if a branch is TAKEN. Branch result is resolved at the end EX stage by ALU so the instruction fetch in IF/ID might need to be killed if a branch is taken. However, in `srv32`, the address of next instruction (next PC) should be fed into I-MEM a cycle ahead. Thus, the branch penalty for srv32 is 2. To clarify, by the time next PC is resolved, one instruction has been fetched into pipeline and another PC has been calculated because address should be computed one cycle ahead. Consider the following instruction sequence and its corresponding waveform. ``` 5c: 0100006f j 6c <majorityElement+0x30> 60: 00170713 addi a4,a4,1 64: 00478793 addi a5,a5,4 68: 02c78263 beq a5,a2,8c <majorityElement+0x50> 6c: 0007a683 lw a3,0(a5) ``` ![](https://i.imgur.com/qv8IZ99.jpg) First, we look at the `jump` instruction (PC=0000005c) to understand the penalty of `branch/jump` instruction. As the instruction is in EX stage, it will set the `next_pc`, which is determined by the code below. For jump instruction, `next_pc = ex_pc + ex_imm = 0000006c`, which is the pink box in the waveform. After two cycle, the `next_pc` value will be sent to `if_pc`; therefore, the branch penalty is 2. :::spoiler {state="close"} `next_pc` determination in `srv32/rtl/riscv.v` ``` always @* begin branch_taken = !ex_flush; next_pc = fetch_pc + `IF_NEXT_PC; ex_ill_branch = 1'b0; case(1'b1) ex_jal : next_pc = ex_pc + ex_imm; ex_jalr : next_pc = alu_op1 + ex_imm; ex_branch: begin case(ex_alu_op) OP_BEQ : begin next_pc = (result_subs[32: 0] == 'd0) ? ex_pc + ex_imm : fetch_pc + `IF_NEXT_PC; if (result_subs[32: 0] != 'd0) branch_taken = 1'b0; end OP_BNE : begin next_pc = (result_subs[32: 0] != 'd0) ? ex_pc + ex_imm : fetch_pc + `IF_NEXT_PC; if (result_subs[32: 0] == 'd0) branch_taken = 1'b0; end OP_BLT : begin next_pc = result_subs[32] ? ex_pc + ex_imm : fetch_pc + `IF_NEXT_PC; if (!result_subs[32]) branch_taken = 1'b0; end OP_BGE : begin next_pc = !result_subs[32] ? ex_pc + ex_imm : fetch_pc + `IF_NEXT_PC; if (result_subs[32]) branch_taken = 1'b0; end OP_BLTU: begin next_pc = result_subu[32] ? ex_pc + ex_imm : fetch_pc + `IF_NEXT_PC; if (!result_subu[32]) branch_taken = 1'b0; end OP_BGEU: begin next_pc = !result_subu[32] ? ex_pc + ex_imm : fetch_pc + `IF_NEXT_PC; if (result_subu[32]) branch_taken = 1'b0; end default: begin next_pc = fetch_pc; ex_ill_branch = 1'b1; end endcase end default : begin next_pc = fetch_pc + `IF_NEXT_PC; branch_taken = 1'b0; end endcase end ``` ::: ### Forwarding `srv32` supports full forwarding, which indicates RAW data hazard can be resolved WITHOUT stalling the processor. Consider the following instruction sequence and its corresponding waveform. ``` 50: 00259593 slli a1,a1,0x2 54: 00b60633 add a2,a2,a1 ``` ![](https://i.imgur.com/mxuNPkn.jpg) Let’s first look at the pink box in the waveform. `wb_dst_sel (slli)` is equal to `ex_src2_sel (add)`, which indicates that RAW data hazard happens; then we check that wb_mem2reg is also equal to 0, indicating that it is not a load instruction. Therefore, it will bypass the output of `ALU`, wb_result, to reg_rdata2 directly. The implementation of forwarding in `srv32/rtl/riscv.v` is as follow: ``` assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src1_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : regs[ex_src1_sel]; assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src2_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : regs[ex_src2_sel]; ``` ### Load-use hazard Load-use hazard is NOT an issue in `srv32`, because D-MEM is read at WB stage and register file is also read at WB stage. A single MUX is used to switch between 2 operands (operand from register file and operand from D-MEM). Therefore, `load-use hazard` can be resolved without stalling the processor. Consider the following instruction sequence and its corresponding waveform. ``` 6c: 0007a683 lw a3,0(a5) 70: fea688e3 beq a3,a0,60 <majorityElement+0x24> ``` ![](https://i.imgur.com/0tYkjQv.jpg) Let's first look at the blue box in the waveform. `wb_dst_sel (lw)` is equal to `ex_src1_sel (beq)`, which indicates that `RAW` data hazard happens; then we check that `wb_mem2reg` is also equal to 1, indicating that it is a load instruction that we have to bypass the value from memory to register file. That is, it will pass the `wb_rdata`, which is the output of `D-MEM`, to `reg_rdata1` directly, and we do not have to stall any cycle. :::spoiler {state="close"} `load-use hazard` solution in `srv32/rtl/riscv.v` ``` always @(posedge clk or negedge resetb) begin if (!resetb) ex_mem2reg <= 1'b0; else if (inst[`OPCODE] == OP_LOAD) ex_mem2reg <= 1'b1; else if (ex_mem2reg && dmem_rvalid) ex_mem2reg <= 1'b0; end always @(posedge clk or negedge resetb) begin if (!resetb) begin wb_result <= 32'h0; wb_alu2reg <= 1'b0; wb_dst_sel <= 5'h0; wb_branch <= 1'b0; wb_branch_nxt <= 1'b0; wb_mem2reg <= 1'b0; wb_raddr <= 2'h0; wb_alu_op <= 3'h0; end else if (!ex_stall) begin wb_result <= ex_result; wb_alu2reg <= ex_alu || ex_lui || ex_auipc || ex_jal || ex_jalr || ex_csr || `ifdef RV32M_ENABLED ex_mul || `endif (ex_mem2reg && !ex_ld_align_excp); wb_dst_sel <= ex_dst_sel; wb_branch <= branch_taken || ex_trap; wb_branch_nxt <= wb_branch; wb_mem2reg <= ex_mem2reg; wb_raddr <= dmem_raddr[1:0]; wb_alu_op <= ex_alu_op; end end always @* begin case(wb_alu_op) OP_LB : begin case(wb_raddr[1:0]) 2'b00: wb_rdata[31: 0] = {{24{dmem_rdata[7]}}, dmem_rdata[ 7: 0]}; 2'b01: wb_rdata[31: 0] = {{24{dmem_rdata[15]}}, dmem_rdata[15: 8]}; 2'b10: wb_rdata[31: 0] = {{24{dmem_rdata[23]}}, dmem_rdata[23:16]}; 2'b11: wb_rdata[31: 0] = {{24{dmem_rdata[31]}}, dmem_rdata[31:24]}; endcase end OP_LH : begin wb_rdata = (wb_raddr[1]) ? {{16{dmem_rdata[31]}}, dmem_rdata[31:16]} : {{16{dmem_rdata[15]}}, dmem_rdata[15: 0]}; end OP_LW : begin wb_rdata = dmem_rdata; end OP_LBU : begin case(wb_raddr[1:0]) 2'b00: wb_rdata[31: 0] = {24'h0, dmem_rdata[7:0]}; 2'b01: wb_rdata[31: 0] = {24'h0, dmem_rdata[15:8]}; 2'b10: wb_rdata[31: 0] = {24'h0, dmem_rdata[23:16]}; 2'b11: wb_rdata[31: 0] = {24'h0, dmem_rdata[31:24]}; endcase end OP_LHU : begin wb_rdata = (wb_raddr[1]) ? {16'h0, dmem_rdata[31:16]} : {16'h0, dmem_rdata[15: 0]}; end default: begin wb_rdata = 32'h0; end endcase end assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src1_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : // load instruction? regs[ex_src1_sel]; assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 : (!wb_flush && wb_alu2reg && (wb_dst_sel == ex_src2_sel)) ? // register forwarding (wb_mem2reg ? wb_rdata : wb_result) : // load instruction? regs[ex_src2_sel]; ``` ::: ## Software Optimizations In this part, I will try to optimize the code in RISC-V assembly. Why am I not optimizing the c code, because I think the algorithm, `Boyer-Moore Voting Algorithm`, is the best solution. Therefore, I cannot find where to optimize. ### RISC-V Assembly code Due to the new environment, `srv32`, I need to modify the code I wrote in Assignment 1, which is suited to `Ripes`. There are two things we have to take care. The first is we need to follow the calling convention carefully and the second is that we need to figure out some way to print out the result to verify the code. Thanks for **向景亘**, he taught me about why we can and how we use the `printf` function in assembly. The reason that we can call `printf` function in assembly is that `srv32` has implement some system calls (in `sw/common/syscall.c`) which are used to implement the `printf` function and it has linked the newlib while compiling. Below are the orignal assmebly code and its result. #### Assembly code ``` .data arr: .word 1, 2, 2, 5, 2, 2, 4, 1, 2, 2 len: .word 10 str: .string "The majority element is " .text .global main main: addi sp, sp, -4 sw ra, 0(sp) # majorityElement la a0, arr lw a1, len call majorityElement mv a3, a0 # save return value to a3 # print integer la a0, dstr mv a1, a3 call printf # restore ra lw ra, 0(sp) addi sp, sp, 4 # exit li a0, 0 ret majorityElement: addi sp, sp, -4 sw ra, 0(sp) mv t1, a0 # t1: base_addr of arr li t2, 1 # cnt = 1 li t3, 1 # i = 1 lw a0, 0(t1) # major = num[0] addi t1, t1, 4 # base_addr += 4 (Because array start from index 1) for: blt t3, a1, loop # if i < len, branch to label "loop" lw ra, 0(sp) addi sp, sp, 4 ret # return main loop: lw t5, 0(t1) # t5 = num[i] beq t5, a0, if # if(num[i] == major) beqz t2, elseif # else if(cnt == 0) addi t2, t2, -1 # cnt-- increment: addi t1, t1, 4 # base_addr += 4 addi t3, t3, 1 # i++ j for if: addi t2, t2, 1 # cnt++ j increment elseif: mv a0, t5 # major = num[i]; li t2, 1 # cnt=1 j increment ``` #### RTL Simulation result ``` The majority element is 2 Excuting 1720 instructions, 2278 cycles, 1.324 CPI Program terminate - ../rtl/../testbench/testbench.v:418: Verilog $finish Simulation statistics ===================== Simulation time : 0.033 s Simulation cycles: 2289 Simulation speed : 0.0693636 MHz ``` #### ISS (Instruction Set Simulator) result ``` ./rvsim --memsize 128 -l trace.log ../sw/me_asm/me_asm.elf The majority element is 2 Excuting 1720 instructions, 2278 cycles, 1.324 CPI Program terminate Simulation statistics ===================== Simulation time : 0.001 s Simulation cycles: 2278 Simulation speed : 1.562 MHz make[1]: Leaving directory '/home/jack/srv32/tools' Compare the trace between RTL and ISS simulator === Simulation passed === ``` To reduce the number of cycles, we can decrease the times of `branch taken`, which has a penalty of 2 cycles, so I change the `blt` instruction to `bge` for checking the loop condition, because normally we will go into `for loop` until we do not satisfy the condition. Finally, I get 2278 - 2258 = 20 cycles reduction. ``` majorityElement: mv t1, a0 # t1: base_addr of arr li t2, 1 # cnt = 1 li t3, 1 # i = 1 lw a0, 0(t1) # major = num[0] addi t1, t1, 4 # base_addr += 4 (Because array start from index 1) for: bge t3, a1, exit # if i < len, branch to label "loop" lw t5, 0(t1) # t5 = num[i] beq t5, a0, if # if(num[i] == major) beqz t2, elseif # else if(cnt == 0) addi t2, t2, -1 # cnt-- increment: addi t1, t1, 4 # base_addr += 4 addi t3, t3, 1 # i++ j for if: addi t2, t2, 1 # cnt++ j increment elseif: mv a0, t5 # major = num[i]; li t2, 1 # cnt=1 j increment exit: ret ``` Furthermore, I reference the assembly code generated by GNU toolchain and adjust the order of my code, which I can eliminate one `jump` instruction. Finally, I get an 2258 - 2245 = 13 cycles reduction. ``` majorityElement: mv t1, a0 # t1: base_addr of arr li t2, 1 # cnt = 1 li t3, 1 # i = 1 lw a0, 0(t1) # major = num[0] addi t1, t1, 4 # base_addr += 4 (Because array start from index 1) j for if: addi t2, t2, 1 # cnt++ increment: addi t1, t1, 4 # base_addr += 4 addi t3, t3, 1 # i++ for: bge t3, a1, exit # if i < len, branch to label "loop" lw t5, 0(t1) # t5 = num[i] beq t5, a0, if # if(num[i] == major) beqz t2, elseif # else if(cnt == 0) addi t2, t2, -1 # cnt-- elseif: mv a0, t5 # major = num[i]; li t2, 1 # cnt=1 j increment exit: ret ``` In conclusion, I get 33 cycles reduction. ## How RISC-V Compliance Tests works According to the [RISC-V Architectural Testing Framework](https://github.com/riscv-non-isa/riscv-arch-test/blob/master/doc/README.adoc), which are an evolving set of tests that are created to help ensure that software written for a given RISC-V Profile/Specification will run on all implementations that comply with that profile. However, passing the RISC-V Architectural Tests does not mean that the design complies with the RISC-V Architecture. These are only a basic set of tests checking important aspects of the specification without focusing on details. Also, the compliance test is not used to verify the RTL functionality, it focuses on instruction set instead. Each test in the architectural test pool generates a test `signature`, which represents the data written into specific memory locations during the execution of the test. The `signature` typically will record values of the operations carried out in the test. If we want to make sure that our device/implementation has passed the RISC-V Architecture Tests, the test signatures obtained from the execution of the tests on the implementation need to be compared against a set of golden reference signature. ### Compliance tests results To run the compliance tests for RTL, we can enter `$make tests` at `srv32` directory; for ISS simulator, enter `$make tests-sw`. Then, we will get the results below. As you see, srv32 passes the compilance tests of `rv32Zicsr`, `rv32i` and `rv32im` in both RTL and ISS sumulator, which represents `Control and Status Register`, `Base Integer Instruction Set, 32-bit` and `Standard Extension for Integer Multiplication and Division`, repectively. #### RTL ``` OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im ``` #### ISS simulator ``` OK: 6/6 RISCV_TARGET=srv32 RISCV_DEVICE=rv32Zicsr RISCV_ISA=rv32Zicsr OK: 48/48 RISCV_TARGET=srv32 RISCV_DEVICE=rv32i RISCV_ISA=rv32i OK: 8/8 RISCV_TARGET=srv32 RISCV_DEVICE=rv32im RISCV_ISA=rv32im ``` ## Explain how srv32 works with Verilator. Verilator is a tool which is used to verify the Verilog or SystemVerilog code. The user writes a little C++/SystemC wrapper file, which instantiates the "Verilated" model of the user's top level module. These C++/SystemC files are then compiled by a C++ compiler. The resulting executable performs the design simulation. Furthermore, Verilator on a single thread is about `100 times faster` than interpreted Verilog simulators such as `Icarus Verilog`, and another 2-10x speedup might be gained from multithreading. Therefore, it yields `200-1000x` total over interpreted simulators. When we enter `make $(SUBDIRS)` command at `srv32/` ditectory, it will call the makefile in the `srv32/sw/SUBDIRS` with `make $@.run` command, which calls the makefile in `srv32/sim` directory. Then, it will compile with the command below: ``` verilator -O3 -cc -Wall -Wno-STMTDLY -Wno-UNUSED +define+MEMSIZE=$(memsize) --trace-fst --Mdir sim_cc --build --exe sim_main.cpp getch.cpp ``` which will generated serveral files in the `srv32/sim/sim_cc` directory. For the description of each files, we can reference to the website, [Files Read/Written](https://verilator.org/guide/latest/files.html#files-read-written). Also, it will execute the command below to generate waveform and tracelog to help us verify our code. ``` ./sim +trace +dump ``` ## Reference * [srv32](https://github.com/sysprog21/srv32) * [Verilator](https://www.veripool.org/verilator/) * [Verilator manual](https://verilator.org/guide/latest/) * [Assignment1: RISC-V Assembly and Instruction Pipeline](https://hackmd.io/@jackli/arch_hw1) * [RISC-V Architectural Testing Framework](https://github.com/riscv-non-isa/riscv-arch-test/blob/master/doc/README.adoc) * [How do you Implement printf in GCC from Newlib?](https://stackoverflow.com/questions/55014043/how-do-you-implement-printf-in-gcc-from-newlib)