# 2022 Assignment3: SoftCPU ###### tags: `Computer Architecture` contribute by <[`chiangkd`](https://github.com/chiangkd)> **Requirement follow** [Assignment3: SoftCPU](https://hackmd.io/@sysprog/2022-arch-homework3) ## Source code I choose the program from [鄒崴丞-Length of Last Word](https://hackmd.io/@StevenChou43/2022_ComputerArchitecture_HW2#Optimizing-the-Hand-Written-Assembly) and [Leetcode 77. Combinations](https://leetcode.com/problems/combinations/) ### C code(Modified from 鄒崴丞) ```c #include <stdio.h> int lengthOfLastWord(char *s){ if (!*s) return 0; int i = 0; char *head = s; while (*s) { s++; i++; } s = head; int length = 0; s = s + i - 1; while(i && *s == ' '){ s--; i--; } while(i-- && *s != ' '){ s--; length++; } return length; } int main() { char *str1 = "Hello World"; char *str2 = "I am a student "; char *str3 = "a"; int len1 = lengthOfLastWord(str1); int len2 = lengthOfLastWord(str2); int len3 = lengthOfLastWord(str3); printf("For string \"%s\", length of last word : %d\n", str1, len1); printf("For string \"%s\", length of last word : %d\n", str2, len2); printf("For string \"%s\", length of last word : %d\n", str3, len3); return 0; } ``` ### Handwrite Assembly from [鄒崴丞-Length of Last Word](https://hackmd.io/@StevenChou43/2022_ComputerArchitecture_HW2#Optimizing-the-Hand-Written-Assembly) ```Assembly .org 0 # Provide program starting address to linker .global _start /* newlib system calls */ .set SYSEXIT, 93 .set SYSWRITE, 64 .data str1: .string "Hello World" .set str1len, .-str1 str2: .string "I am a student " .set str2len, .-str2 str3: .string "a" .set str3len, .-str3 fstr: .string "For string \"" .set flen, .-fstr answrstr: .string "\", length of last word : " .set answrlen, .-answrstr newline: .string "\n" .set nllen, .-newline .text _start: # function : lengthOfLastWord(char * s) # a0 : pointer to the start of the string la a0, str1 # find the position of the last letter is not space jal ra, lengthoflastword # call strlen() add a2, x0, a0 la a0, str1 li a1, str1len jal print_result la a0, str2 # find the position of the last letter is not space jal ra, lengthoflastword # call strlen() add a2, x0, a0 la a0, str2 li a1, str2len jal print_result la a0, str3 # find the position of the last letter is not space jal ra, lengthoflastword # call strlen() add a2, x0, a0 la a0, str3 li a1, str3len jal print_result li a0, 0 # return value is 0 li a7, SYSEXIT # end ecall nochar: li a0, 0 ret lengthoflastword: lb t0, 0(a0) beq t0, x0, nochar andi a2, a0, 3 add a1, x0, a0 # a1 = a0 bne a2, x0, unaligned # jump if the address is unaligned starttraverse: lui a3, 0x7F7F8 addi a3, a3, -129 # now a3 is 0x7F7F7F7F addi a4, x0, -1 # a4 = -1 findend: lw t0, 0(a1) lw t1, 4(a1) lw t2, 8(a1) lw t3, 12(a1) and a2, t0, t1 and t4, t2, t3 and a2, a2, t4 addi a1, a1, 16 and a5, a2, a3 add a5, a5, a3 or a5, a5, a3 beq a5, a4, findend # break if there are NULL and a5, t0, a3 add a5, a5, a3 or a5, a5, a3 bne a5, a4, back12 and a5, t1, a3 add a5, a5, a3 or a5, a5, a3 bne a5, a4, back8 and a5, t2, a3 add a5, a5, a3 or a5, a5, a3 bne a5, a4, back4 findbyte: lbu a2, -4(a1) sub a3, a1, a0 beq a2, x0, sub4 lbu a2, -3(a1) beq a2, x0, sub3 lbu a2, -2(a1) snez a2, a2 add a2, a2, a3 addi a1, a2, -2 j findword back4: addi a1, a1, -4 j findbyte back8: addi a1, a1, -8 j findbyte back12: addi a1, a1, -12 j findbyte checkalign: beq a3, x0, starttraverse unaligned: lbu a2, 0(a1) addi a1, a1, 1 andi a3, a1, 3 bne a2, x0, checkalign sub a1, a1, a0 addi a1, a1, -1 j findword sub4: addi a1, a3, -4 j findword sub3: addi a1, a3, -3 j findword findword: li a2, 32 # store space to register a2 add a0, a1, a0 addi a1, a1, 1 loop1: # now we reach the end of the string beq a1, x0, return # if encounter the last position, return addi a0, a0, -1 # s-- addi a1, a1, -1 # the last position is not space lb t0, 0(a0) # t0 = s[i] beq t0, a2, loop1 # while(char[i] == " ") add a3, x0, a0 # a3 = a0 add a0, x0, x0 loop2: beq a1, x0, return # if encounter the last position, return addi a1, a1, -1 addi a3, a3, -1 addi a0, a0, 1 # s++ lb t0, 0(a3) # t0 = s[i] bne t0, a2, loop2 return: ret print_result: add s0, x0, a0 add s1, x0, a1 add s2, x0, a2 li a7, SYSWRITE li a0, 1 # print For string : " la a1, fstr la a2, flen ecall li a7, SYSWRITE li a0, 1 # print i am a student add a1, x0, s0 add a2, x0, s1 ecall li a7, SYSWRITE li a0, 1 # print ", length of last word is la a1, answrstr li a2, answrlen ecall li a7, SYSWRITE addi a1, s2, 48 # get length addi sp, sp, -4 sw a1, 0(sp) # store answer in stack li a0, 1 # print answer addi a1, sp, 0 li a2, 4 ecall addi sp, sp, 4 li a7, SYSWRITE li a0, 1 # print new line la a1, newline li a2, nllen ecall ret ``` - There are serval version in [鄒崴丞-Length of Last Word](https://hackmd.io/@StevenChou43/2022_ComputerArchitecture_HW2#Optimizing-the-Hand-Written-Assembly), I take the version with **SWAR(4 register at one time)**, It seems has the best performance (CSR cycle) with longer string. ### [LeetCode 77. Combinations](https://leetcode.com/problems/combinations/) >Given two integers `n` and `k`, return all possible combinations of `k` numbers chosen from the range `[1, n]`. > >**Constraints:** >- 1 <= n <= 20 >- 1 <= k <= n **Leetcode Solution** ```c #define MAXLEN 184756 void GetLineSize(int **res, int *line, int lineSize,int n, int k, int *resLen, int idx) { if (lineSize == k) { res[*resLen] = (int *)malloc(sizeof(int) * k); memcpy(res[*resLen], line, sizeof(int) * k); (*resLen)++; return; } for (int i = idx; i <= n; i++) { line[lineSize] = i; GetLineSize(res, line, lineSize + 1, n, k, resLen, i + 1); } } int** combine(int n, int k, int* returnSize, int** returnColumnSizes){ int **res = malloc(sizeof(int *)* MAXLEN); int line[k]; int lineSize = 0; int resLen = 0; GetLineSize(res, line, lineSize, n, k, &resLen, 1); *returnSize = resLen; *returnColumnSizes = malloc(sizeof(int)*(*returnSize)); int i = *returnSize-1; while(i >= 0){ (*returnColumnSizes)[i--] = k; } return res; } ``` From description, we know that the MAXLEN about the answer will be $C^{20}_{10}=184756$, so that's why I give it a constant value and after caculation, just `realloc` it. **Improvement** In LeetCode, there are many problems take the argument `int *returnSize` and `int ** returnColumnSizes`, the first one is about output array's length, and the second one is about the each **column** size in the output array. But, in my problem, the column size will always equal to `k`, which is an argument about `combine` function, and the returnSize will equal to $C^n_k$ as mention above. So, I remove `returnSize` and `returnColumnSizes` to simplfy the question. And also, use structure to reduce argument number ```c struct set { int n; int k; int returnSize; }; ``` ```c #define MAXLEN 184756 void GetLineSize(int **res, int *line, int lineSize, struct set *cond, int idx) { if (lineSize == cond->k) { res[cond->returnSize] = (int *)malloc(sizeof(int) * cond->k); memcpy(res[cond->returnSize], line, sizeof(int) * cond->k); (cond->returnSize)++; return; } for (int i = idx; i <= cond->n; i++) { line[lineSize] = i; GetLineSize(res, line, lineSize + 1, cond, i + 1); } } int** combine(struct set *cond){ int **res = (int **)malloc(sizeof(int *)* MAXLEN); int line[cond->k]; int lineSize = 0; GetLineSize(res, line, lineSize, cond, 1); res = realloc(res, sizeof(int*) * cond->returnSize); return res; } ``` However, when I do `make` with this program, get an error message ``` DMEM address 002c5d44 out of range ``` which is defined in `testbench.v`, means out of memory range (DRAM + IRAM), so I shrink the `MAXLENGTH` to make it run properly. ``` #define MAXLEN 20000 ``` ## Tweaked programs for srv32 >- You should validate the results in your program(s). >- You should program in RISC-V assembly for the sake of further optimizations. Hint: You may modify a customized `Makefile` for building C and assembly source files from scratch. ### Modify `Makefile` Simply copy `Makefile` from existing `hello` example and just modifiy the target and associated dependency name ```Makefile include ../common/Makefile.common EXE = .elf SRC = lolw.S CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = lolws OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` ### Problem 1: [鄒崴丞-Length of Last Word](https://hackmd.io/@StevenChou43/2022_ComputerArchitecture_HW2#Optimizing-the-Hand-Written-Assembly) And do some modification to let the assembly run on srv32 - `printf` system call - stored `ra` ```Assembly .data str1: .string "Hello World" str2: .string "I am a student " str3: .string "a" space: .string " " fstr: .string "For string \"" answrstr: .string "\", length of last word : " newline: .string "\n" iformat: .string "%d" .text .global main main: # function : lengthOfLastWord(char * s) # a0 : pointer to the start of the string addi sp, sp, -4 sw ra, 0(sp) la a0, str1 # find the position of the last letter is not space jal ra, lengthoflastword # call strlen() add a2, x0, a0 la a0, str1 jal print_result la a0, str2 # find the position of the last letter is not space jal ra, lengthoflastword # call strlen() add a2, x0, a0 la a0, str2 jal print_result la a0, str3 # find the position of the last letter is not space jal ra, lengthoflastword # call strlen() add a2, x0, a0 la a0, str3 jal print_result lw ra, 0(sp) addi sp, sp, 4 li a0, 0 # return value is 0 ret lengthoflastword: lb t0, 0(a0) beq t0, x0, return andi a2, a0, 3 add a1, x0, a0 # a1 = a0 bne a2, x0, unaligned # jump if the address is unaligned starttraverse: lui a3, 0x7F7F8 addi a3, a3, -129 # now a3 is 0x7F7F7F7F addi a4, x0, -1 # a4 = -1 findend: lw t0, 0(a1) lw t1, 4(a1) lw t2, 8(a1) lw t3, 12(a1) and a2, t0, t1 and t4, t2, t3 and a2, a2, t4 addi a1, a1, 16 and a5, a2, a3 add a5, a5, a3 or a5, a5, a3 beq a5, a4, findend # break if there are NULL and a5, t0, a3 add a5, a5, a3 or a5, a5, a3 bne a5, a4, back12 and a5, t1, a3 add a5, a5, a3 or a5, a5, a3 bne a5, a4, back8 and a5, t2, a3 add a5, a5, a3 or a5, a5, a3 bne a5, a4, back4 findbyte: lbu a2, -4(a1) sub a3, a1, a0 beq a2, x0, sub4 lbu a2, -3(a1) beq a2, x0, sub3 lbu a2, -2(a1) snez a2, a2 add a2, a2, a3 addi a1, a2, -2 j findword back4: addi a1, a1, -4 j findbyte back8: addi a1, a1, -8 j findbyte back12: addi a1, a1, -12 j findbyte checkalign: beq a3, x0, starttraverse unaligned: lbu a2, 0(a1) addi a1, a1, 1 andi a3, a1, 3 bne a2, x0, checkalign sub a1, a1, a0 addi a1, a1, -1 j findword sub4: addi a1, a3, -4 j findword sub3: addi a1, a3, -3 j findword findword: li a2, 32 # store space to register a2 add a0, a1, a0 addi a1, a1, 1 loop1: # now we reach the end of the string beq a1, x0, return # if encounter the last position, return addi a0, a0, -1 # s-- addi a1, a1, -1 # the last position is not space lb t0, 0(a0) # t0 = s[i] beq t0, a2, loop1 # while(char[i] == " ") add a3, x0, a0 # a3 = a0 add a0, x0, x0 loop2: beq a1, x0, return # if encounter the last position, return addi a1, a1, -1 addi a3, a3, -1 addi a0, a0, 1 # s++ lb t0, 0(a3) # t0 = s[i] bne t0, a2, loop2 return: ret print_result: addi sp, sp, -4 sw ra, 0(sp) add s0, x0, a0 add s1, x0, a1 add s2, x0, a2 la a0, fstr # print For string : call printf add a0, x0, s0 # print i am a student call printf la a0, answrstr # print ", length of last word is call printf addi a1, s2, 0 # get length addi sp, sp, -4 sw a1, 0(sp) # store answer in stack la a0, iformat lw a1, 0(sp) # print answer call printf addi sp, sp, 4 la a0, newline # print new line call printf lw ra, 0(sp) addi sp, sp, 4 ret ``` **Result** **sim** ``` For string "Hello World", length of last word : 5 For string "I am a student ", length of last word : 7 For string "a", length of last word : 1 Excuting 12991 instructions, 17619 cycles, 1.356 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.079 s Simulation cycles: 17630 Simulation speed : 0.223165 MHz ``` **rvsim** ``` For string "Hello World", length of last word : 5 For string "I am a student ", length of last word : 7 For string "a", length of last word : 1 Excuting 12991 instructions, 17619 cycles, 1.356 CPI Program terminate Simulation statistics ===================== Simulation time : 0.003 s Simulation cycles: 17619 Simulation speed : 5.523 MHz ``` - The answer is correct - Executing 12991 instructions - 17619 cycles - 1.356 CPI | | sim | rvsim | | ----------------- | ------------ | --------- | | Simulation time | 0.075s | 0.03s | | Simulation cycles | 17252 | 17241 | | Simulation speed | 0.230027 MHz | 5.760 MHz | ### Problem 2: [Leetcode 77. Combinations](https://leetcode.com/problems/combinations/) From the C code mentioned above, write the assembly which can properly run on srv32 **Hand-write Assembly** ```Assembly .data n: .word 20 lbracket: .string "[" rbracket: .string "]" comma: .string "," backspace: .string "\b" newline: .string "\n" iformat: .string "%d," xformat: .string "%x\n" .text .global main main: li t1, 0 addi sp, sp, -4 sw ra, 0(sp) # push sp lw t0, n slli t0, t0, 2 sub sp, sp, t0 mv a0, sp # a0 = &cond li t0, 4 # cond.n = 4 sw t0, 0(a0) li t0, 3 # cond.k = 3 sw t0, 4(a0) sw x0, 8(a0) # cond.returnSize = 0 call combine # a0 = res lw a1, 4(a3) # a1 = cond.k lw a2, 8(a3) # a2 = cond.returnSize call printAns # pop sp lw t0, n slli t0, t0, 2 add sp, sp, t0 # restore ra lw ra, 0(sp) addi sp, sp, 4 # exit li a0, 0 ret memcpy: # a0 dest # a1 src # a2 n #slli t5, a2, 0x2 li t3, 0 # i = 0 mcpyloop: lw t4, 0(a1) sw t4, 0(a0) addi a0, a0, 4 addi a1, a1, 4 addi t3, t3, 4 bne t3, a2, mcpyloop ret GetLineSize: # a0 = res # a1 = line # a2 = lineSize # a3 = cond # a4 = idx addi sp, sp, -12 sw ra, 0(sp) sw a2, 4(sp) # lineSize sw a4, 8(sp) # idx lw s1, 4(a3) # s1 = cond->k bne a2, s1, GLSrecursive # if (lineSize != cond->k) PutLineIn: # a0 = res # a1 = line # a2 = lineSize # a3 = cond # a4 = idx addi sp, sp, -24 sw a0, 0(sp) sw a1, 4(sp) sw a2, 8(sp) sw a3, 12(sp) sw a4, 16(sp) lw t2, 20(sp) add t1, t2, t1 add a0, a0, t1 # a0 = res[cond->returnSize] lw t2, 4(a3) # t2 = cond->k slli a2, t2, 0x2 # a2 = sizoef(int) * cond->k sw a2, 20(sp) # store offset # Here can do some improvement call memcpy lw a0, 0(sp) # res lw a1, 4(sp) # line lw a2, 8(sp) # lineSize lw a3, 12(sp) # cond lw a4, 16(sp) # idx addi sp, sp, 24 addi t5, t5, 1 # cond->returnSize++ sw t5, 8(a3) # store returnSize back lw ra, 0(sp) addi sp, sp, 12 ret GLSrecursive: mv s2, a2 # s1 = lineSize mv s3, a4 # int i = idx lw t0, 0(a3) # t0 = cond->n blt t0, s3, Normalreturn #addi t0, t0, 1 GLSloop: slli t3, s2, 0x2 # t3 = line[lineSize] offset add t4, a1, t3 # t4 = line[lineSize] address sw s3, 0(t4) # line[lineSize] = i # a2, a3 must put in function with plus 1 addi a2, s2, 1 # lineSize + 1 addi a4, s3, 1 # i + 1 call GetLineSize addi s3, s3, 1 # i++ sw s3, 8(sp) bge t0, s3, GLSloop Normalreturn: lw ra, 0(sp) addi sp, sp, 12 lw a2, 4(sp) lw a4, 8(sp) mv s2, a2 mv s3, a4 ret combine: # a0 = &cond # 0(a0) = cond->n # 4(a0) = cond->k # 8(a0) = cond->returnSize addi sp, sp, -24 sw ra, 0(sp) mv a3, a0 # a3 = cond addi a0, a0, -2000 # a0 = address of **res addi a1, a0, -2000 # a1 = address of line[cond->k] li a2, 0 # a2 = lineSize li a4, 1 # a4 = 1 sw a0, 4(sp) sw a1, 8(sp) sw a2, 12(sp) sw a3, 16(sp) sw a4, 20(sp) call GetLineSize lw ra, 0(sp) lw a0, 4(sp) # res lw a1, 8(sp) # line lw a2, 12(sp) # lineSize lw a3, 16(sp) # cond lw a4, 20(sp) # idx addi sp, sp, 24 ret printAns: # a0 = res # a1 = k (cond.k) # a2 = returnSize (cond.returnSize) mv s2, a1 # return Size = 3 mv s3, a2 # k = 4 addi sp, sp, -4 sw ra, 0(sp) mv s1, a0 # s1 = res la a0, lbracket call printf li s4, 0 # int i = 0 li s5, 0 # int j = 0 printloop1: la a0, lbracket call printf li s4, 0 # int j = 0 printloop2: lw a1, 0(s1) # res[i] addi s1, s1, 0x4 # res[i][j] address offset la a0, iformat call printf addi s4, s4, 1 # i++ bne s4, s2, printloop2 # j < returnColumnSizes la a0, backspace call printf la a0, rbracket call printf la a0, comma call printf addi s5, s5, 1 # i++ bne s5, s3, printloop1 # i < returnSize la a0, backspace call printf la a0, rbracket call printf la a0, newline call printf lw ra, 0(sp) addi sp, sp, 4 ret ``` - There is a little difficult for me to wrtie recursive code in assembly, the difficult point is how stack pointer move, once the `sp` get wrong position, the return address(`ra`) will totally mess up and the program will not run properly. Take input n = 4, k = 3 for example, The procedure about how `Combinations(4,3)` work as below. ```graphviz digraph demo{ rankdir=LR # Level 1 1[label="GLS(0,1)"] # Level 2 2[label="GLS(1,2)"] 9[label="GLS(1,3)"] 13[label="GLS(1,4)"] 15[label="GLS(1,5)"] # Level 3 3[label="GLS(2,3)"] 6[label="GLS(2,4)"] 8[label="GLS(2,5)"] 10[label="GLS(2,4)"] 12[label="GLS(2,5)"] 14[label="GLS(2,5)"] # Level 4 4[label="GLS(3,4)"] 5[label="GLS(3,5)"] 7[label="GLS(3,5)"] 11[label="GLS(3,5)"] END[label= END] ### 1 -> 2, 9, 13, 15 2 -> 3, 6, 8 3 -> 4, 5 6 -> 7 ### 9 -> 10, 12 10 -> 11 ### 13 -> 14 # procedure 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8 -> 9 -> 10 -> 11 -> 12 -> 13 -> 14 -> 15[constraint = "false",color=red] 15 -> END[constraint = "false", color=red] } ``` The redline is actually the order of function execute, and in every **backward arrow** and **vertical arrow**, that is, the red arrow which is no overlapped with black arrow will **pop** the stack with specific number of bytes coresspond to `k` value, and the overlapping situation will **push** the stack. | Function Entry | sp | | | | | | | | | | | | Function Out | | -------------- | ------------ | ---------- | ------------ | ---------- | ------------ | ---------- | ------------ | ---------- | ------------ | ---------- | ------------ | ---------- | ------------ | | | $\downarrow$ | | | | | | | | | | | $\uparrow$ | | | | $\downarrow$ | | | | | | | | $\downarrow$ | $\uparrow$ | $\downarrow$ | $\uparrow$ | | | | $\downarrow$ | | | | $\downarrow$ | $\uparrow$ | $\downarrow$ | $\uparrow$ | $\downarrow$ | | | | | | | $\downarrow$ | $\uparrow$ | $\downarrow$ | $\uparrow$ | $\downarrow$ | | | | | | | | | **Result** **sim** ``` [[1,2,3],[1,2,4],[1,3,4],[2,3,4]] Excuting 18350 instructions, 24112 cycles, 1.314 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.134 s Simulation cycles: 24123 Simulation speed : 0.180022 MHz ``` **rvsim** ``` [[1,2,3],[1,2,4],[1,3,4],[2,3,4]] Excuting 18350 instructions, 24112 cycles, 1.314 CPI Program terminate Simulation statistics ===================== Simulation time : 0.004 s Simulation cycles: 24112 Simulation speed : 5.725 MHz ``` - The answer is correct - Executing 18343 instructions - 24105 cycles - 1.314 CPI | | sim | rvsim | | ----------------- | ------------ | --------- | | Simulation time | 0.134s | 0.004s | | Simulation cycles | 24123 | 24112 | | Simulation speed | 0.180022 MHz | 5.725 MHz | ## Check waveform with GTKwave ### Analysize srv32 RV32 core Hint by [Lab3: srv32](https://hackmd.io/@sysprog/S1Udn1Xtt#Analyze-srv32-RV32-core), srv32 is a 3-stage pipeline architecture with IF/ID, EX, WB stages and supports **full forwarding**, means that RAW data hazard can be resolved without stalling the processor ![](https://i.imgur.com/yFHymMg.png) Compare the disassembly and the waveform ``` 00000000 <_start>: 0: 00010297 auipc t0,0x10 4: 18028293 addi t0,t0,384 # 10180 <trap_handler> 8: 30529073 .4byte 0x30529073 c: 3050e073 .4byte 0x3050e073 10: 00022297 auipc t0,0x22 14: 82c28293 addi t0,t0,-2004 # 2183c <_PathLocale> 18: 00022317 auipc t1,0x22 1c: b3430313 addi t1,t1,-1228 # 21b4c <_bss_end> ``` ### Analysize waveform ![](https://i.imgur.com/vR8oImF.png) First check the waveform, at initial `imem` fetch the instruction at `00000000` , which is corresponded to `00010297` stands for `auipc t0, 0x10`, and `inst[31:0]` not decode by instr_decoder yet (`00000013` stands for `addi x0, x0, 0`, which is `NOP`) ![](https://i.imgur.com/R4rTkvQ.png) Next, IF/ID in the same stage (same clock period), which can obtain `rdata` and `inst` with same value. #### Check hazard ![](https://i.imgur.com/9yF49Z1.png) Since srv32 has fully bypassing, there is not stall generated by data hazards #### Check branch ![](https://i.imgur.com/iiyy0jq.png) Take first branch occur for example, at 78 ps (the line in the figure), `raddr` and `rdata` equals to `FE62ECE3` means `bltu t0, t1, 20` at **IF/ID** stage, and one clock cycle after at the **EX** stage, branch will be taken. Also a clock cycle after `wb_flush` will trigger and flush 2 cycle, then reload the correct instruction after branching. **Disassembly** ```Assembly 00000020 <_bss_clear>: 20: 0002a023 sw zero,0(t0) 24: 00428293 addi t0,t0,4 28: fe62ece3 bltu t0,t1,20 <_bss_clear> 2c: 00040117 auipc sp,0x40 30: fd410113 addi sp,sp,-44 # 40000 <_stack> 34: 008000ef jal ra,3c <main> 38: 4541006f j 1048c <exit> ``` | Instruction | c1 | c2 | c3 | c4 | c5 | c6 | c7 | | ------------------------------ | ----- | ----------------------------------------- | ----------------------------------------- | -------------------------------------- | -------------------------------------- | --- | --- | | `bltu t0, t1, 20 <_bss_clear>` | IF/ID | EX | WB | | | | | | `auipc sp, 0x40` | | ~~IF/ID~~<font color='red'>**NOP**</font> | ~~EX~~<font color='red'>**NOP**</font> | ~~WB~~<font color='red'>**NOP**</font> | | | | | `addi sp, sp, -44` | | | ~~IF/ID~~<font color='red'>**NOP**</font> | ~~EX~~<font color='red'>**NOP**</font> | ~~WB~~<font color='red'>**NOP**</font> | | | | `sw zero, 0(t0)` | | | | IF/ID | EX | WB | | | `addi t0, t0, 4` | | | | | IF/ID | EX | WB | ![](https://i.imgur.com/YtdGOiS.png) ![](https://i.imgur.com/xWanpMc.png) - From **c1 to c5** is correspond to the figure above from 77 ps to 97 ps (5 clock cycle) ### Problem 1: [鄒崴丞-Length of Last Word](https://hackmd.io/@StevenChou43/2022_ComputerArchitecture_HW2#Optimizing-the-Hand-Written-Assembly) Notice that at 4293 ps, `0xFEC296E3` was executed which is correspond to `bne t0, a2, loop2`. In first testcase, this branch is taken totally four times, the number of the loop execute is relate to the last word's length. Here can do loop unrolling to reduce branch panelty. ![](https://i.imgur.com/kx2e1Lc.png) ### Problem 2: [Leetcode 77. Combinations](https://leetcode.com/problems/combinations/) At 4053ps, `120000EF` was executed, which is `call combine`. Get into `combine` function and run the algorithm and cause branch panelty with 2 cycle panelty. ![](https://i.imgur.com/AwHA3Cn.png) There is similar behavior (branch and flush) if branch is taken. So, the branch in my program can be do some optimization to reduce branch panelty. ## Propose the software optimizations ### Problem 1: [鄒崴丞-Length of Last Word](https://hackmd.io/@StevenChou43/2022_ComputerArchitecture_HW2#Optimizing-the-Hand-Written-Assembly) Original `loop2` which is relate the length of last word. ``` loop2: beq a1, x0, return # if encounter the last position, return addi a1, a1, -1 addi a3, a3, -1 addi a0, a0, 1 # s++ lb t0, 0(a3) # t0 = s[i] bne t0, a2, loop2 ``` Loop unrolling 2 times. ``` loop2: beq a1, x0, return # if encounter the last position, return addi a1, a1, -1 addi a3, a3, -1 addi a0, a0, 1 # s++ lb t0, 0(a3) # t0 = s[i] beq t0, a2, return beq a1, x0, return # if encounter the last position, return addi a1, a1, -1 addi a3, a3, -1 addi a0, a0, 1 # s++ lb t0, 0(a3) # t0 = s[i] bne t0, a2, loop2 ``` Loop unrolling 4 times. ``` loop2: beq a1, x0, return # if encounter the last position, return addi a1, a1, -1 addi a3, a3, -1 addi a0, a0, 1 # s++ lb t0, 0(a3) # t0 = s[i] beq t0, a2, return beq a1, x0, return # if encounter the last position, return addi a1, a1, -1 addi a3, a3, -1 addi a0, a0, 1 # s++ lb t0, 0(a3) # t0 = s[i] beq a1, x0, return # if encounter the last position, return addi a1, a1, -1 addi a3, a3, -1 addi a0, a0, 1 # s++ lb t0, 0(a3) # t0 = s[i] beq a1, x0, return # if encounter the last position, return addi a1, a1, -1 addi a3, a3, -1 addi a0, a0, 1 # s++ lb t0, 0(a3) # t0 = s[i] bne t0, a2, loop2 ``` | | Original | Unrolling 2 | Unrolling 4 | | ------------ | -------- | ----------- | ----------- | | Instructions | 12991 | 12980 | 12986 | | Cycles | 17619 | 17596 | 17598 | | CPI | 1.356 | 1.356 | 1.355 | - Notice that the performance is no grouth with the numbers of unrolling, that's because the in the testcase above, the last word length is not so long so the branch will no take so much. If change the testcase to ``` str1: .string "Hello Worldaaaaaaaaaaaaaaaa" str2: .string "I am a studentaaaaaaaaaaaaaaaaaaaaaa " str3: .string "aaaaaaaaaaaaaaaaaaa" ``` | | Original | Unrolling 2 | Unrolling 4 | | ------------ | -------- | ----------- | ----------- | | Instructions | 15428 | 15382 | 15358 | | Cycles | 20926 | 20830 | 20782 | | CPI | 1.356 | 1.354 | 1.353 | - As the length of last word be longer, the improvement of loop unrolling is clear. ### Problem 2: [Leetcode 77. Combinations](https://leetcode.com/problems/combinations/) To reduce the branch panelty, unrolling the `printloop2` loop by 4 times ```Assembly printloop2: beq s4, s2, loop2out lw a1, 0(s1) # res[i] addi s1, s1, 0x4 # res[i][j] address offset la a0, iformat call printf addi s4, s4, 1 # i++ beq s4, s2, loop2out lw a1, 0(s1) # res[i] addi s1, s1, 0x4 # res[i][j] address offset la a0, iformat call printf addi s4, s4, 1 # i++ beq s4, s2, loop2out lw a1, 0(s1) # res[i] addi s1, s1, 0x4 # res[i][j] address offset la a0, iformat call printf addi s4, s4, 1 # i++ beq s4, s2, loop2out lw a1, 0(s1) # res[i] addi s1, s1, 0x4 # res[i][j] address offset la a0, iformat call printf addi s4, s4, 1 # i++ beq s4, s2, loop2out lw a1, 0(s1) # res[i] addi s1, s1, 0x4 # res[i][j] address offset la a0, iformat call printf addi s4, s4, 1 # i++ beq s4, s2, loop2out lw a1, 0(s1) # res[i] addi s1, s1, 0x4 # res[i][j] address offset la a0, iformat call printf addi s4, s4, 1 # i++ j printloop2 ``` - There is the same if I unroll 4, 8, 16 times. because in my testcase, `loop2` handle the size of column (`cond->k`), which is 3, so if want to unroll much more, need to take another testcase. Furthermore, make function `printAns`, `combine`, `memcpy`, to inline functon, such as: ```Assembly .macro memcpy # a0 dest # a1 src # a2 n #slli t5, a2, 0x2 li t3, 0 # i = 0 mcpyloop: lw t4, 0(a1) sw t4, 0(a0) addi a0, a0, 4 addi a1, a1, 4 addi t3, t3, 4 bne t3, a2, mcpyloop # ret .endm ``` - Also can reduce branch panelty | | Original | Unrolling printAns loop | Unrolling and inline function | | ------------ | -------- | ----------------------- | ----------------------------- | | Instructions | 18350 | 18354 | 18342 | | Cycles | 24112 | 24108 | 24072 | | CPI | 1.314 | 1.314 | 1.312 | ## How srv32 work with Verilator If do `make hello`, it will go find the correspond foled under `/sw` and do a series operations. ```= riscv-none-elf-gcc -O3 -Wall -march=rv32im_zicsr -mabi=ilp32 -misa-spec=2.2 -march=rv32im -nostartfiles -nostdlib -L../common -o hello.elf hello.c -lc -lm -lgcc -lsys -T ../common/default.ld riscv-none-elf-objcopy -j .text -O binary hello.elf imem.bin riscv-none-elf-objcopy -j .data -O binary hello.elf dmem.bin riscv-none-elf-objcopy -O binary hello.elf memory.bin riscv-none-elf-objdump -d hello.elf > hello.dis riscv-none-elf-readelf -a hello.elf > hello.symbol ``` - `line 1` do the compilation and generate `hello.elf` - `line 2` copy `.text` section in `hello.elf` and generate `imem.bin` - `line 3` copy `.data` section in `hello.elf` and generate `dmem.bin` - `line 4` create an output file in binary - `line 5` disassemble `.text` section (executable sections) - `line 6` disassemble with `-h -l -S -s -r -d -V -A -I ` and next see `/sim/Makefile` ```Makefile FILELIST = -f filelist.txt $(if $(_top), ../rtl/top_s.v, ../rtl/top.v) $(TARGET): $(TARGET_SIM) $(BFLAGS) -o $(TARGET) $(FILELIST) mv sim_cc/sim . ``` - parse argument from `filelist.txt` **filelist.txt** ``` +incdir+../rtl +define+TRACE //+define+RV32C_ENABLED=1 //+define+MEMSIZE=128 //+define+SINGLE_RAM ../rtl/riscv.v ../rtl/clint.v ../testbench/testbench.v ../testbench/memmodel.v ``` Do `make` under `/sim` will generate executable file `sim` and a bunch of file which will move to `sim_cc` Then `./sim` can run the executable file which is done before (generate `imem` and `dmem`). If simply run `make hello.run` under `/sim`, it will copy `imem.bin` and `dmem.bin` under `/sw/hello` and run the program. **sim_main.cpp** ```cpp while (!Verilated::gotFinish()) { if (main_time > RESOLUTION) { top->resetb = 1; } if (main_time > 30) { top->stall = 0; } if ((main_time % RESOLUTION) == 1) { top->clk = 1; } if ((main_time % RESOLUTION) == (RESOLUTION / 2 + 1)) { top->clk = 0; } top->eval(); main_time++; } ``` The while loop will run until verilator end, and can see three signal(`resetb, stall, clk`) in GTKWave ![](https://i.imgur.com/iYA5jKk.png) ```cpp #ifdef HAVE_CHRONO { std::chrono::steady_clock::time_point time_end; float sec; int cycles = main_time / RESOLUTION; time_end = std::chrono::steady_clock::now(); sec = std::chrono::duration_cast<std::chrono::milliseconds>(time_end - time_begin).count() / 1000.0; float speed_mhz = cycles / sec / 1000000.0; std::cout << std::endl; std::cout << "Simulation statistics" << std::endl; std::cout << "=====================" << std::endl; std::cout << "Simulation time : " << sec << " s" << std::endl; std::cout << "Simulation cycles: " << cycles << std::endl; std::cout << "Simulation speed : " << speed_mhz << " MHz" << std::endl << std::endl; } #endif ``` And use [chrono](https://en.cppreference.com/w/cpp/chrono) to caculate time. ## Reference Link - [Assignment3: SoftCPU 向景亘](https://hackmd.io/@oscarshiang/arch_hw3) - [Veriator Introduction](https://ys-hayashi.me/2020/12/verilator/) - [readelf vs. objdump: why are both needed](https://stackoverflow.com/questions/8979664/readelf-vs-objdump-why-are-both-needed)