Computer Architecture HW3

--- tags : Computer_Architecture, csapp, cs61c, jserv, RISC-V --- # Computer Architecture HW3 ::: spoiler Assignment3 [Assignment3: SoftCPU](https://hackmd.io/@sysprog/2022-arch-homework3) [Assignment3: SoftCPU_2021](https://hackmd.io/@sysprog/2021-arch-homework3) [Lab3: srv32 - RISCV RV32IM Soft CPU](https://hackmd.io/@sysprog/S1Udn1Xtt) [Long Pressed Name(assignment2)](https://hackmd.io/@qwe661234/BJ52GwGVi#Assignment2-RISC-V-Toolchain) ::: ## Requirement 1 I choose the [assignment2 of 陳彥甫](https://hackmd.io/@qwe661234/BJ52GwGVi#Assignment2-RISC-V-Toolchain), and the leetcode is [91. Decode Ways](https://leetcode.com/problems/decode-ways/). My motivation is that I want to practice more about the `string` or `decode` problem. I hope I can deal with this kinds of problem more simple and clear. ### How it work with verilator? In the `~/srv32/sim` this folder, I find that in the makefile ```makefile! ... $(TARGET): $(TARGET_SIM) $(BFLAGS) -o $(TARGET) $(FILELIST) mv sim_cc/sim . ... ``` And the verilator will translate the verilog into c++ language, then save it in sim_cc. Therefore we can test the CPU from verilog without actually make it in physics. After it translate into C++, we can input our binary file and get the simulate result. ### assignment2 (Long Pressed Name) Problem : > Your friend is typing his name into a keyboard. Sometimes, when typing a character c, the key might get long pressed, and the character will be typed 1 or more times. You examine the typed characters of the keyboard. Return True if it is possible that it was your friends name, with some characters (possibly none) being long pressed. In this case, I will use the name `alex` as the example. As following is the origin assembly code. :::spoiler Origin assembly code ``` .org 0 # Provide program starting address to linker .global main /* newlib system calls */ .set SYSEXIT, 93 .set SYSWRITE, 64 .data addr_name: .string "alex" addr_typed0: .string "aaleex" addr_typed1: .string "aalleexx" addr_typed2: .string "aalewx" SuccessResult: .string "PASS\n" .set success_result_size, .-SuccessResult FailResult: .string "FAIL\n" .set fail_result_size, .-FailResult .text main: la t0, addr_name #Load "name" 1st address la t1, addr_typed0 #Load "typed" 1st address jal ra, isLongPressed #1st test-case la t1, addr_typed1 #Load "typed" 1st address jal ra, isLongPressed #2nd test-case la t1, addr_typed2 #Load "typed" 1st address jal ra, isLongPressed #3rd test-case j End isLongPressed: addi sp, sp, -8 #Use 2 byte for saving register sw ra, 0(sp) #1st byte is ra sw t0, 4(sp) #2nd byte is t0 lb t2, 0(t0) #1st character of name lb t3, 0(t1) #1st character of typed bne t2, t3, Fail #1st character must the same addi t0, t0, 1 #i++ addi t1, t1, 1 #j++ jal Loop #Call Loop lw ra, 0(sp) #Recall ra lw t0, 4(sp) #Recall t0 addi sp,sp, 8 ret Loop: #1st while loop lb t2, 0(t0) #Load name[i] lb t3, 0(t1) #Load typed[j] beq t2, x0, checkLastCharacter #If name[i]= NULL If: bne t2, t3, Else #If name[i]!=typed[j] addi t0, t0, 1 #i++ addi t1, t1, 1 #j++ j Loop Else: lb t2, -1(t0) # Load name[i-1] bne t2, t3,Fail # If typed[j] != name[i-1] then Fail addi t1, t1, 1 # j++ j Loop checkLastCharacter: #2nd while loop lb t2, -1(t0) # Load last character of name if2: beq t3, x0, Pass # If typed[j] = NULL, then passed bne t2, t3,Fail # If typed[j] != name[last] then Fail addi t1, t1, 1 # j++ lb t3, 0(t1) # Load next character of typed j if2 Pass: li a7, SYSWRITE # "write" syscall li a0, 1 # 1 = standard output (stdout) la a1, SuccessResult # load address of hello string li a2, success_result_size # length of hello string ecall # invoke syscall to print the string ret Fail: li a7, SYSWRITE # "write" syscall li a0, 1 # 1 = standard output (stdout) la a1, FailResult # load address of hello string li a2, fail_result_size # length of hello string ecall # invoke syscall to print the string ret End: li a0, 0 li a7, SYSEXIT ecall ``` ::: ::: spoiler fixed makefile ```makefile include ../common/Makefile.common EXE = .elf SRC = assignment2.s CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = assignment2 OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) -g $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` ::: In order to run in srv32, I modify the system call and the exit part. The makefile also need to be modify, too. Here is my [learning note](https://hackmd.io/Jrr1J1YDR_CtGR46DYotiQ). ``` .data addr_name: .string "alex" addr_typed0: .string "aaleex" addr_typed1: .string "aalleexx" addr_typed2: .string "aalewx" SuccessResult: .string "PASS\n" FailResult: .string "FAIL\n" test: .word 65 iformat: .string "%d" .text .global main main: # save ra addi sp, sp, -4 sw ra, 0(sp) la t0, addr_name #Load "name" 1st address la t1, addr_typed0 #Load "typed" 1st address jal ra, isLongPressed #1st test-case la t1, addr_typed1 #Load "typed" 1st address jal ra, isLongPressed #2nd test-case la t1, addr_typed2 #Load "typed" 1st address jal ra, isLongPressed #3rd test-case j End isLongPressed: addi sp, sp, -8 #Use 2 byte for saving register sw ra, 0(sp) #1st byte is ra sw t0, 4(sp) #2nd byte is t0 lb t2, 0(t0) #1st character of name lb t3, 0(t1) #1st character of typed bne t2, t3, Fail #1st character must the same addi t0, t0, 1 #i++ addi t1, t1, 1 #j++ jal Loop #Call Loop finally: lw ra, 0(sp) #Recall ra lw t0, 4(sp) #Recall t0 addi sp,sp, 8 ret Loop: #1st while loop lb t2, 0(t0) #Load name[i] lb t3, 0(t1) #Load typed[j] beq t2, x0, checkLastCharacter #If name[i]= NULL If: bne t2, t3, Else #If name[i]!=typed[j] addi t0, t0, 1 #i++ addi t1, t1, 1 #j++ j Loop Else: lb t2, -1(t0) # Load name[i-1] bne t2, t3,Fail # If typed[j] != name[i-1] then Fail addi t1, t1, 1 # j++ j Loop checkLastCharacter: #2nd while loop lb t2, -1(t0) # Load last character of name if2: beq t3, x0, Pass # If typed[j] = NULL, then passed bne t2, t3,Fail # If typed[j] != name[last] then Fail addi t1, t1, 1 # j++ lb t3, 0(t1) # Load next character of typed j if2 Pass: # modify for srv32 la a0, SuccessResult call printf j finally #li a7, SYSWRITE # "write" syscall #li a0, 1 # 1 = standard output (stdout) #la a1, SuccessResult # load address of hello string #li a2, success_result_size # length of hello string #ecall # invoke syscall to print the string #ret Fail: # modify la a0, FailResult call printf j finally #li a7, SYSWRITE # "write" syscall #li a0, 1 # 1 = standard output (stdout) #la a1, FailResult # load address of hello string #li a2, fail_result_size # length of hello string #ecall # invoke syscall to print the string #ret End: # restore ra lw ra, 0(sp) addi sp, sp, 4 # exit li a0, 0 ret ``` And the result in srv32 is ```makefile hanchi@hanchi:~/srv32$ make debug=1 assignment2 make[1]: 進入目錄「/home/hanchi/srv32/sw」 make -C common make[2]: 進入目錄「/home/hanchi/srv32/sw/common」 make[2]: 對「all」無需做任何事。 make[2]: 離開目錄「/home/hanchi/srv32/sw/common」 make[2]: 進入目錄「/home/hanchi/srv32/sw/assignment2」 riscv-none-elf-gcc -O3 -Wall -march=rv32im_zicsr -mabi=ilp32 -misa-spec=2.2 -march=rv32im -nostartfiles -nostdlib -L../common -o assignment2.elf assignment2.s -lc -lm -lgcc -lsys -T ../common/default.ld -g riscv-none-elf-objcopy -j .text -O binary assignment2.elf imem.bin riscv-none-elf-objcopy -j .data -O binary assignment2.elf dmem.bin riscv-none-elf-objcopy -O binary assignment2.elf memory.bin riscv-none-elf-objdump -d assignment2.elf > assignment2.dis riscv-none-elf-readelf -a assignment2.elf > assignment2.symbol make[2]: 離開目錄「/home/hanchi/srv32/sw/assignment2」 make[1]: 離開目錄「/home/hanchi/srv32/sw」 make[1]: 進入目錄「/home/hanchi/srv32/sim」 PASS PASS FAIL Excuting 3368 instructions, 4674 cycles, 1.387 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.029 s Simulation cycles: 4685 Simulation speed : 0.161552 MHz make[1]: 離開目錄「/home/hanchi/srv32/sim」 make[1]: 進入目錄「/home/hanchi/srv32/tools」 ./rvsim --memsize 128 -l trace.log ../sw/assignment2/assignment2.elf PASS PASS FAIL Excuting 3368 instructions, 4674 cycles, 1.388 CPI Program terminate Simulation statistics ===================== Simulation time : 0.001 s Simulation cycles: 4674 Simulation speed : 3.797 MHz make[1]: 離開目錄「/home/hanchi/srv32/tools」 Compare the trace between RTL and ISS simulator === Simulation passed === ``` We can get the data table as following. | Comapre | sim (verilog) | rvsim (c++) | |:----------------- | ------------- | ----------- | | Instructions | 3368 | 3368 | | Cycles | 4674 | 4674 | | CPI | 1.387 | 1.388 | | Simulation Times | 0.029 s | 0.001 s | | Simulation cycles | 4685 | 4674 | | Simulation Speed | 0.161552 MHz | 3.797 MHz | ### Leetcode (91. Decode Ways) Problem: >A message containing letters from A-Z can be encoded into numbers using the following mapping:'A'->1, 'B'->2...'Z'->26. To decode an encoded message, all the digits must be grouped then mapped back into letters using the reverse of the mapping above (there may be multiple ways).In the end, we need to return the number of ways to decode. As following is my C code, and I modify it into risc-v assembly code. In particular, **I have already confirmed that my program is executed normally, so I complicate my test data.** :::spoiler C code ```c #include <stdio.h> int numDecodings(char *s) { volatile int res[2] = {1, 0}; // Avoid compiler skip this part int previous = (*s) - '0'; if (previous == 0) { return 0; } int tmp = 0; s++; while (*s) { tmp = res[1]; if (previous * 10 + (*s) - '0' > 26) { res[1] = 0; } else { res[1] = res[0]; } if (*s - '0') { res[0] += tmp; } else { res[0] = 0; } previous = *s - '0'; s++; } return res[0] + res[1]; } int main(int argc, char *argv[]) { char test0[23] = "1111111111111111111111"; char test1[23] = "2222222222222222222222"; char test2[23] = "1231231231231231231231"; char test3[23] = "1110612312111221232132"; printf("The result of test0 is %d\n", numDecodings(test0)); printf("The result of test1 is %d\n", numDecodings(test1)); printf("The result of test2 is %d\n", numDecodings(test2)); printf("The result of test3 is %d\n", numDecodings(test3)); return 0; } ``` ::: Here is my assembly code. The test data is the same with C code which is more complex than the Leetcode given. ``` .data test0: .string "1111111111111111111111" test1: .string "2222222222222222222222" test2: .string "1231231231231231231231" test3: .string "1110612312111221232132" result0: .string "The result of test0 is " # it should be output 2 result1: .string "The result of test1 is " # it should be output 3 result2: .string "The result of test2 is " # it should be output 0 result3: .string "The result of test3 is " # it should be output 2 nextline: .string " \n" iformat: .string "%d" .text .global main main: addi sp, sp, -4 # save ra sw ra, 0(sp) la a0, result0 # printf result0 call printf la s0, test0 # go test0 jal ra, numDecodings la a0, result1 # printf result1 call printf la s0, test1 # go test1 jal ra, numDecodings la a0, result2 # printf result2 call printf la s0, test2 # go test2 jal ra, numDecodings la a0, result3 # printf result3 call printf la s0, test3 # go test3 jal ra, numDecodings finally: lw ra, 0(sp) # exit program addi sp, sp, 4 li a0, 0 ret numDecodings: addi sp, sp, -4 sw ra, 0(sp) li t0, 1 # res[0] = 1; li t1, 0 # res[1] = 0; lb t2, 0(s0) # get (*s) addi t3, t2, -48 # int previous = (*s) - '0'; beqz t3, return_0 # if(previous == 0) return 0; addi t4, x0, 0 # int tmp = 0; addi s0, s0, 1 # s++; lb t2, 0(s0) # get new (*s) while: beqz t2, return_value # check enter while or not mv t4, t1 # tmp = res[1]; slli s1, t3, 1 # s1 = previous * 2; slli s2, t3, 3 # s2 = previous * 8; add s1, s1, s2 # s1 = previous * 2 + previous * 8; addi s2, t2, -48 # s2 = (*s) - '0'; addi t5, x0, 26 # t5 = 26; add s1, s1, s2 # s1 = previous * 10 + (*s) - '0'; ble s1, t5, else1 # if(previous * 10 + (*s) - '0' <= 26) go to else1; li t1, 0 # res[1] = 0; middle: beqz s2, else2 # if((*s) - '0') go to else2; add t0, t0, t4 # res[0] += tmp; reset: mv t3, s2 # previous = (*s) - '0' addi s0, s0, 1 # s++; lb t2, 0(s0) # get new (*s) j while # go back to while loop else1: mv t1, t0 # res[1] = res[0]; j middle # go back to middle else2: li t0, 0 # res[0] = 0 j reset # go back to reset return_value: la a0, iformat # set output format add a1, t0, t1 # set a1 for output data call printf # use printf ststem call la a0, nextline # load '\n' to a0 call printf # use printf system call lw ra, 0(sp) addi sp, sp, 4 ret return_0: la a0, iformat # set output format addi a1, x0, 0 # set a1 for output data call printf # use printf system call la a0, nextline # load ' \n' to a0 call printf # use printf system call lw ra, 0(sp) addi sp, sp, 4 ret ``` And the result in srv32 is ```makefile hanchi@hanchi:~/srv32$ make debug=1 leetcode make[1]: 進入目錄「/home/hanchi/srv32/sw」 make -C common make[2]: 進入目錄「/home/hanchi/srv32/sw/common」 make[2]: 對「all」無需做任何事。 make[2]: 離開目錄「/home/hanchi/srv32/sw/common」 make[2]: 進入目錄「/home/hanchi/srv32/sw/leetcode」 riscv-none-elf-gcc -O3 -Wall -march=rv32im_zicsr -mabi=ilp32 -misa-spec=2.2 -march=rv32im -nostartfiles -nostdlib -L../common -o leetcode.elf leetcode.s -lc -lm -lgcc -lsys -T ../common/default.ld riscv-none-elf-objcopy -j .text -O binary leetcode.elf imem.bin riscv-none-elf-objcopy -j .data -O binary leetcode.elf dmem.bin riscv-none-elf-objcopy -O binary leetcode.elf memory.bin riscv-none-elf-objdump -d leetcode.elf > leetcode.dis riscv-none-elf-readelf -a leetcode.elf > leetcode.symbol make[2]: 離開目錄「/home/hanchi/srv32/sw/leetcode」 make[1]: 離開目錄「/home/hanchi/srv32/sw」 make[1]: 進入目錄「/home/hanchi/srv32/sim」 The result of test0 is 28657 The result of test1 is 28657 The result of test2 is 2187 The result of test3 is 1602 Excuting 12542 instructions, 16758 cycles, 1.336 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.079 s Simulation cycles: 16769 Simulation speed : 0.212266 MHz make[1]: 離開目錄「/home/hanchi/srv32/sim」 make[1]: 進入目錄「/home/hanchi/srv32/tools」 ./rvsim --memsize 128 -l trace.log ../sw/leetcode/leetcode.elf The result of test0 is 28657 The result of test1 is 28657 The result of test2 is 2187 The result of test3 is 1602 Excuting 12542 instructions, 16758 cycles, 1.336 CPI Program terminate Simulation statistics ===================== Simulation time : 0.003 s Simulation cycles: 16758 Simulation speed : 6.045 MHz make[1]: 離開目錄「/home/hanchi/srv32/tools」 Compare the trace between RTL and ISS simulator === Simulation passed === ``` We can get the data table as following. | Comapre | sim (verilog) | rvsim (c++) | | ----------------- | ------------- | ----------- | | Instructions | 12542 | 12542 | | Cycles | 16758 | 16758 | | CPI | 1.336 | 1.336 | | Simulation Times | 0.079 s | 0.003 s | | Simulation cycles | 16769 | 16758 | | Simulation Speed | 0.212266 MHz | 6.045 MHz | --- ## Requirement 2 ### How to use gtkwave & srv32 #### 3-stage ![](https://i.imgur.com/iBf1hfC.jpg) We can find that the instruction exection order is as following ```mermaid graph TD; fetch_pc-->if_pc; if_pc-->ex_pc; if_pc-->ex_flush; ex_pc-->wb_pc; ex_pc-->wb_flush; ``` #### NOP I want to observer the NOP in my program, so I check some singals like `fetch_pc`, `if_pc`, `ex_pc`, `wb_pc`, `wb_nop`, `wb_nop_more` . I find that if there is branch instruction and the `ex_pc` executed the branch instruction, `fetch_pc` and `if_pc` will not delete their exist instruction. They will still execute the instruction and make `wb_nop` and `wb_nop_more` become **1**, that's why there is `wb_nop_more` this singal exist, and the **branch penalty is 2**! I will show th example as following. ![](https://i.imgur.com/p6zUzE1.png) #### Data Hazard The disassembly code of hazard is used assignment2 (Long Pressed Name) for example. Because of the srv32 is fully bypassing processor, there will not have data hazard. Therefore, all that's left is control hazard. #### Control Hazard Take the main function for example, this is the disassembly result of assignment2 (Long Pressed Name). We can see this line ` 54: 020000ef jal ra,74 <isLongPressed>`, and the GTKWave is the picture as following. ``` ... 0000003c <main>: 3c: ffc10113 addi sp,sp,-4 40: 00112023 sw ra,0(sp) 44: 00021297 auipc t0,0x21 48: dec28293 addi t0,t0,-532 # 20e30 <addr_name> 4c: 00021317 auipc t1,0x21 50: de930313 addi t1,t1,-535 # 20e35 <addr_typed0> 54: 020000ef jal ra,74 <isLongPressed> 58: 00021317 auipc t1,0x21 5c: de430313 addi t1,t1,-540 # 20e3c <addr_typed1> 60: 014000ef jal ra,74 <isLongPressed> 64: 00021317 auipc t1,0x21 68: de130313 addi t1,t1,-543 # 20e45 <addr_typed2> 6c: 008000ef jal ra,74 <isLongPressed> 70: 09c0006f j 10c <End> ... ``` ![](https://i.imgur.com/F9djfUt.jpg) First, we can see the <font color=#ff00>RED Rectangle</font> , the `ex_pc` is excuting the `00000054 (jal ra,74 <isLongPressed>)`, and the next time, `fetch_pc` will fetch the insturction of `00000074` in <font color=#0020FF>Blue Rectangle</font>, but the `if_pc` and `ex_pc` will keep the same instruction, we can find that in <font color=#FFA700>Orange Rectangle</font>. This two error instruction will also be executed and **when it comes to write back, it will make `wb_nop` and `wb_nop_more` turn to 1, therefore it wouldn't be writen back with wrong data**.We also can find the wire `wb_flush` in GTKWave, it also means that the two instruction will be abandoned.Once again it proves that it's branch penalty is 2. ### Explain how to run #### Use Leetcode (91. Decode Ways) for example ![](https://i.imgur.com/xmpNrN3.png) ``` ... 0000003c <main>: 3c: ffc10113 addi sp,sp,-4 40: 00112023 sw ra,0(sp) 44: 00021517 auipc a0,0x21 48: e4850513 addi a0,a0,-440 # 20e8c <result0> 4c: 170000ef jal ra,1bc <printf> 50: 00021417 auipc s0,0x21 54: de040413 addi s0,s0,-544 # 20e30 <test0> 58: 05c000ef jal ra,b4 <numDecodings> 5c: 00021517 auipc a0,0x21 60: e4850513 addi a0,a0,-440 # 20ea4 <result1> 64: 158000ef jal ra,1bc <printf> 68: 00021417 auipc s0,0x21 6c: ddf40413 addi s0,s0,-545 # 20e47 <test1> 70: 044000ef jal ra,b4 <numDecodings> 74: 00021517 auipc a0,0x21 78: e4850513 addi a0,a0,-440 # 20ebc <result2> 7c: 140000ef jal ra,1bc <printf> 80: 00021417 auipc s0,0x21 84: dde40413 addi s0,s0,-546 # 20e5e <test2> 88: 02c000ef jal ra,b4 <numDecodings> 8c: 00021517 auipc a0,0x21 90: e4850513 addi a0,a0,-440 # 20ed4 <result3> 94: 128000ef jal ra,1bc <printf> 98: 00021417 auipc s0,0x21 9c: ddd40413 addi s0,s0,-547 # 20e75 <test3> a0: 014000ef jal ra,b4 <numDecodings> ... ``` ##### main (0000003c) We can cross comparison the gtkwave and leetcode.dis, we can find that the program enter the `main` function at `pc = 3c`, and we alse can figure out that `inst[ 31:0 ]` is `FC10113`, that is the machine code in hex. ##### sw (00000040) We can find the `ra` = 38, and that when `wb_pc` execute `00000040` , we can find that `dmem_wdata` will also wirte the 38 to the memory. The `sp` is also form `00040000` to `0003FFFC` when `addi sp, sp ,-4` be executed. The `dmem_wready`will also be 1, `dmem_raddr` = `sp` = `0003FFFC`. ##### addi (00000048) We can find that when the `addi` instruction be exectued, the `dmem_raddr` will be `00020e8c`, after the write back, So the `x10_a0` become `00020e8c`. ##### lw (000001cc) ![](https://i.imgur.com/On7Ax3A.png) ``` ... 000001bc <printf>: 1bc: fc010113 addi sp,sp,-64 1c0: 02c12423 sw a2,40(sp) 1c4: 02d12623 sw a3,44(sp) 1c8: 00020317 auipc t1,0x20 1cc: eb832303 lw t1,-328(t1) # 20080 <_impure_ptr> 1d0: 02b12223 sw a1,36(sp) 1d4: 02e12823 sw a4,48(sp) 1d8: 02f12a23 sw a5,52(sp) 1dc: 03012c23 sw a6,56(sp) 1e0: 03112e23 sw a7,60(sp) 1e4: 00832583 lw a1,8(t1) 1e8: 02410693 addi a3,sp,36 1ec: 00050613 mv a2,a0 1f0: 00030513 mv a0,t1 1f4: 00112e23 sw ra,28(sp) 1f8: 00d12623 sw a3,12(sp) 1fc: 010000ef jal ra,20c <_vfprintf_r> 200: 01c12083 lw ra,28(sp) 204: 04010113 addi sp,sp,64 208: 00008067 ret ... ``` In the gtkwave, the `ex_pc` stage is doing the instruction `000001cc`, therefore the `dmem_rready` is turn to 1. The `dmem_waddr` become `00020080`, which means the traget address. --- ## Requirement 3 ### Improve the performance Because of this is 3-stages CPU with fully bypassing, so there is control hazard and Load-use will cause NOP. Therefore, I will focus on reduce the branch penalty in my code, such as loop unrooling, and inline function(macro). ### assignment2 (Long Pressed Name) In this assembly, I find that there are many loop and do the same work. Therefore, I use the macro first, then do the loop unrooling in the macro. Here is the code, and the performance is showed below. ``` .data addr_name: .string "alex" addr_typed0: .string "aaleex" addr_typed1: .string "aalleexx" addr_typed2: .string "aalewx" SuccessResult: .string "PASS" FailResult: .string "FAIL" test: .word 65 iformat: .string "%d" .text .global main .macro if2_inline beq t3, x0, Pass # If typed[j] = NULL, then passed bne t2, t3,Fail # If typed[j] != name[last] then Fail addi t1, t1, 1 # j++ lb t3, 0(t1) # Load next character of typed .endm .macro if_inline #Loop: #1st while loop lb t2, 0(t0) #Load name[i] lb t3, 0(t1) #Load typed[j] beq t2, x0, checkLastCharacter #If name[i]= NULL #If: bne t2, t3, Else #If name[i]!=typed[j] addi t0, t0, 1 #i++ addi t1, t1, 1 #j++ j If .endm .macro else_inline #Else: lb t2, -1(t0) # Load name[i-1] bne t2, t3,Fail # If typed[j] != name[i-1] then Fail addi t1, t1, 1 # j++ if_inline .endm main: # save ra addi sp, sp, -4 sw ra, 0(sp) la t0, addr_name #Load "name" 1st address la t1, addr_typed0 #Load "typed" 1st address jal ra, isLongPressed #1st test-case la t1, addr_typed1 #Load "typed" 1st address jal ra, isLongPressed #2nd test-case la t1, addr_typed2 #Load "typed" 1st address jal ra, isLongPressed #3rd test-case #j End End: # restore ra lw ra, 0(sp) addi sp, sp, 4 # exit li a0, 0 ret isLongPressed: addi sp, sp, -8 #Use 2 byte for saving register sw ra, 0(sp) #1st byte is ra sw t0, 4(sp) #2nd byte is t0 lb t2, 0(t0) #1st character of name lb t3, 0(t1) #1st character of typed bne t2, t3, Fail #1st character must the same addi t0, t0, 1 #i++ addi t1, t1, 1 #j++ # go to Loop #Loop: #1st while loop #lb t2, 0(t0) #Load name[i] #lb t3, 0(t1) #Load typed[j] #beq t2, x0, checkLastCharacter #If name[i]= NULL If: #bne t2, t3, Else #If name[i]!=typed[j] #addi t0, t0, 1 #i++ #addi t1, t1, 1 #j++ #j Loop if_inline Else: #lb t2, -1(t0) # Load name[i-1] #bne t2, t3,Fail # If typed[j] != name[i-1] then Fail #addi t1, t1, 1 # j++ #j Loop else_inline checkLastCharacter: #2nd while loop lb t2, -1(t0) # Load last character of name if2: #beq t3, x0, Pass # If typed[j] = NULL, then passed #bne t2, t3,Fail # If typed[j] != name[last] then Fail #addi t1, t1, 1 # j++ #lb t3, 0(t1) # Load next character of typed if2_inline if2_inline if2_inline j if2 Pass: # modify for srv32 la a0, SuccessResult call puts lw ra, 0(sp) #Recall ra lw t0, 4(sp) #Recall t0 addi sp,sp, 8 ret Fail: # modify la a0, FailResult call puts lw ra, 0(sp) #Recall ra lw t0, 4(sp) #Recall t0 addi sp,sp, 8 ret ``` ```makefile hanchi@hanchi:~/srv32$ make debug=1 assignment2 make[1]: 進入目錄「/home/hanchi/srv32/sw」 make -C common make[2]: 進入目錄「/home/hanchi/srv32/sw/common」 make[2]: 對「all」無需做任何事。 make[2]: 離開目錄「/home/hanchi/srv32/sw/common」 make[2]: 進入目錄「/home/hanchi/srv32/sw/assignment2」 riscv-none-elf-gcc -O3 -Wall -march=rv32im_zicsr -mabi=ilp32 -misa-spec=2.2 -march=rv32im -nostartfiles -nostdlib -L../common -o assignment2.elf assignment2.s -lc -lm -lgcc -lsys -T ../common/default.ld -g riscv-none-elf-objcopy -j .text -O binary assignment2.elf imem.bin riscv-none-elf-objcopy -j .data -O binary assignment2.elf dmem.bin riscv-none-elf-objcopy -O binary assignment2.elf memory.bin riscv-none-elf-objdump -d assignment2.elf > assignment2.dis riscv-none-elf-readelf -a assignment2.elf > assignment2.symbol make[2]: 離開目錄「/home/hanchi/srv32/sw/assignment2」 make[1]: 離開目錄「/home/hanchi/srv32/sw」 make[1]: 進入目錄「/home/hanchi/srv32/sim」 PASS PASS FAIL Excuting 2171 instructions, 2939 cycles, 1.353 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.02 s Simulation cycles: 2950 Simulation speed : 0.1475 MHz make[1]: 離開目錄「/home/hanchi/srv32/sim」 make[1]: 進入目錄「/home/hanchi/srv32/tools」 ./rvsim --memsize 128 -l trace.log ../sw/assignment2/assignment2.elf PASS PASS FAIL Excuting 2171 instructions, 2939 cycles, 1.354 CPI Program terminate Simulation statistics ===================== Simulation time : 0.001 s Simulation cycles: 2939 Simulation speed : 5.843 MHz make[1]: 離開目錄「/home/hanchi/srv32/tools」 Compare the trace between RTL and ISS simulator === Simulation passed === ``` | Comapre | sim | rvsim | optimized sim | optimized rvsim | |:----------------- | ------------ | --------- | ------------- | --------------- | | Instructions | 3368 | 3368 | 2171 | 2171 | | Cycles | 4674 | 4674 | 2939 | 2939 | | CPI | 1.387 | 1.388 | 1.353 | 1.354 | | Simulation Times | 0.029 s | 0.001 s | 0.02 s | 0.001 s | | Simulation cycles | 4685 | 4674 | 2950 | 2939 | | Simulation Speed | 0.161552 MHz | 3.797 MHz | 0.1475 MHz | 5.843 MHz | In this table, we can find that there are at least reduce **30%** cycles in optimization. ### Leetcode (91. Decode Ways) I also do the ssmae thing in leetcode. ``` .data test0: .string "1111111111111111111111" test1: .string "2222222222222222222222" test2: .string "1231231231231231231231" test3: .string "1110612312111221232132" result0: .string "The result of test0 is %d\n" # it should be output 2 result1: .string "The result of test1 is %d\n" # it should be output 3 result2: .string "The result of test2 is %d\n" # it should be output 0 result3: .string "The result of test3 is %d\n" # it should be output 2 .text .global main .macro whileloop #while: beqz t2, return_value # check enter while or not mv t4, t1 # tmp = res[1]; slli s1, t3, 1 # s1 = previous * 2; slli s2, t3, 3 # s2 = previous * 8; add s1, s1, s2 # s1 = previous * 2 + previous * 8; addi s2, t2, -48 # s2 = (*s) - '0'; addi t5, x0, 26 # t5 = 26; add s1, s1, s2 # s1 = previous * 10 + (*s) - '0'; ble s1, t5, else1 # if(previous * 10 + (*s) - '0' <= 26) go to else1; li t1, 0 # res[1] = 0; #middle: beqz s2, else2 # if((*s) - '0') go to else2; add t0, t0, t4 # res[0] += tmp; #reset: mv t3, s2 # previous = (*s) - '0' addi s0, s0, 1 # s++; lb t2, 0(s0) # get new (*s) #j while # go back to while loop # double while beqz t2, return_value # check enter while or not mv t4, t1 # tmp = res[1]; slli s1, t3, 1 # s1 = previous * 2; slli s2, t3, 3 # s2 = previous * 8; add s1, s1, s2 # s1 = previous * 2 + previous * 8; addi s2, t2, -48 # s2 = (*s) - '0'; addi t5, x0, 26 # t5 = 26; add s1, s1, s2 # s1 = previous * 10 + (*s) - '0'; ble s1, t5, else1 # if(previous * 10 + (*s) - '0' <= 26) go to else1; li t1, 0 # res[1] = 0; #middle: beqz s2, else2 # if((*s) - '0') go to else2; add t0, t0, t4 # res[0] += tmp; #reset: mv t3, s2 # previous = (*s) - '0' addi s0, s0, 1 # s++; lb t2, 0(s0) # get new (*s) #j while # go back to while loop .endm main: addi sp, sp, -4 # save ra sw ra, 0(sp) la s0, test0 # go test0 jal ra, numDecodings la a0, result0 call printf la s0, test1 # go test1 jal ra, numDecodings la a0, result1 call printf la s0, test2 # go test2 jal ra, numDecodings la a0, result2 call printf la s0, test3 # go test3 jal ra, numDecodings la a0, result3 call printf finally: lw ra, 0(sp) # exit program addi sp, sp, 4 li a0, 0 ret numDecodings: addi sp, sp, -4 sw ra, 0(sp) li t0, 1 # res[0] = 1; li t1, 0 # res[1] = 0; lb t2, 0(s0) # get (*s) addi t3, t2, -48 # int previous = (*s) - '0'; beqz t3, return_0 # if(previous == 0) return 0; addi t4, x0, 0 # int tmp = 0; addi s0, s0, 1 # s++; lb t2, 0(s0) # get new (*s) while: whileloop j while # go back to while loop else1: mv t1, t0 # res[1] = res[0]; #j middle # go back to middle #middle: beqz s2, else2 # if((*s) - '0') go to else2; add t0, t0, t4 # res[0] += tmp; #reset: mv t3, s2 # previous = (*s) - '0' addi s0, s0, 1 # s++; lb t2, 0(s0) # get new (*s) #j while # go back to while loop whileloop j while else2: li t0, 0 # res[0] = 0 #j reset # reset: mv t3, s2 # previous = (*s) - '0' addi s0, s0, 1 # s++; lb t2, 0(s0) # get new (*s) #j while # go back to while loop whileloop j while return_value: add a1, t0, t1 lw ra, 0(sp) addi sp, sp, 4 ret return_0: addi a1, x0, 0 # set a1 for output data lw ra, 0(sp) addi sp, sp, 4 ret ``` ```makefile hanchi@hanchi:~/srv32$ make debug=1 leetcode make[1]: 進入目錄「/home/hanchi/srv32/sw」 make -C common make[2]: 進入目錄「/home/hanchi/srv32/sw/common」 make[2]: 對「all」無需做任何事。 make[2]: 離開目錄「/home/hanchi/srv32/sw/common」 make[2]: 進入目錄「/home/hanchi/srv32/sw/leetcode」 riscv-none-elf-gcc -O3 -Wall -march=rv32im_zicsr -mabi=ilp32 -misa-spec=2.2 -march=rv32im -nostartfiles -nostdlib -L../common -o leetcode.elf leetcode.s -lc -lm -lgcc -lsys -T ../common/default.ld riscv-none-elf-objcopy -j .text -O binary leetcode.elf imem.bin riscv-none-elf-objcopy -j .data -O binary leetcode.elf dmem.bin riscv-none-elf-objcopy -O binary leetcode.elf memory.bin riscv-none-elf-objdump -d leetcode.elf > leetcode.dis riscv-none-elf-readelf -a leetcode.elf > leetcode.symbol make[2]: 離開目錄「/home/hanchi/srv32/sw/leetcode」 make[1]: 離開目錄「/home/hanchi/srv32/sw」 make[1]: 進入目錄「/home/hanchi/srv32/sim」 The result of test0 is 28657 The result of test1 is 28657 The result of test2 is 2187 The result of test3 is 1602 Excuting 10492 instructions, 13902 cycles, 1.325 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.066 s Simulation cycles: 13913 Simulation speed : 0.210803 MHz make[1]: 離開目錄「/home/hanchi/srv32/sim」 make[1]: 進入目錄「/home/hanchi/srv32/tools」 ./rvsim --memsize 128 -l trace.log ../sw/leetcode/leetcode.elf The result of test0 is 28657 The result of test1 is 28657 The result of test2 is 2187 The result of test3 is 1602 Excuting 10492 instructions, 13902 cycles, 1.325 CPI Program terminate Simulation statistics ===================== Simulation time : 0.003 s Simulation cycles: 13902 Simulation speed : 4.748 MHz make[1]: 離開目錄「/home/hanchi/srv32/tools」 Compare the trace between RTL and ISS simulator === Simulation passed === ``` | Comapre | sim | rvsim | optimized sim | optimized rvsim | |:----------------- | ------------ | --------- | ------------- | --------------- | | Instructions | 12542 | 12542 | 10492 | 10492 | | Cycles | 16758 | 16758 | 13902 | 13902 | | CPI | 1.336 | 1.336 | 1.325 | 1.325 | | Simulation Times | 0.079 s | 0.003 s | 0.066 s | 0.003 s | | Simulation cycles | 16769 | 16758 | 13913 | 13902 | | Simulation Speed | 0.212266 MHz | 6.045 MHz | 0.210803 MHz | 4.748 MHz | In this table, we can find that there are at least reduce **16%** cycles in optimization. --- ## Conclusion In this project, I have learned a lot, such as `verilator`, `analyze verilog`, optimize the assembly on three-stage pipeline CPU, and so on. By using the srv32, we can get the comparision with the verilog simulator and cpp emulator. In the end, I made the 30% and 16% improvement in assembly --- ## Reference links [srv32 學習紀錄](https://hackmd.io/Jrr1J1YDR_CtGR46DYotiQ) [Assignment3: SoftCPU (Oscar shiang)](https://hackmd.io/@oscarshiang/arch_hw3#Improve-the-performance) [Verilator](https://www.veripool.org/verilator/)