Assignment3: SoftCPU

# Assignment3: SoftCPU ###### tags: `2021_CS Computer Architecture` ## Modify a customized Makefile for building C and assembly source files ### Building C source files After building the environment of [srv32 - RISCV RV32IM Soft CPU](https://hackmd.io/@sysprog/S1Udn1Xtt), teacher demonstrate compile & execute `qsort.c` that is in `/srv32/sw/qsort` directory. #### **Step1.** In the directory of `~/srv32/sw`, make a directory - `RomanToInteger`. Then, build a file - `RomanToInteger.c` which is comes from [assignment1](https://hackmd.io/@9s5oiDBGQVmyEWDpvMhHtw/BkpBqJNHt). ``` $ cd ~/srv32/sw $ mkdir RomanToInteger ``` RomanToInteger.c : ```c= #include <stdio.h> int transINT(char c) { switch(c) { case 'I': return 1; break; case 'V': return 5; break; case 'X': return 10; break; case 'L': return 50; break; case 'C': return 100; break; case 'D': return 500; break; default: //'M' return 1000; break; } } int romanToInt(char * s){ int i; int n_size = 0; int sum; int cur, pre; //strSize i = 0; while(s[i] != '\0') i++; n_size = i; cur = transINT(*(s + n_size - 1)); sum = cur; //the rightest element n[n_size-1] must be added in sum //if sum = 0, it would be out of bound in forloop for(i = (n_size-1); i > 0 ; i--) { pre = transINT(*(s+i-1)); if(cur <= pre) sum += pre; else sum -= pre; cur = pre; } return sum; } int main() { char s[] = "MCMXCIV"; int number; printf("%d\n", number); //system("PAUSE"); return 0; } ``` --- *Some note for modify C :* - Remember to add new line `\n`, otherwise it can't print the result. --- #### **Step2.** Clone `Makefile` from `qsort` to `RomanToInteger`. ``` $ cp qsort/Makefile RomanToInteger/ ``` #### **Step3.** Modify **SRC** & **TARGET** which is from `Makefile` in `RomanToInteger` directory. *SRC = source file TARGET = the directory include the source file* Makefile (for building C source files) : ```cmake= include ../common/Makefile.common EXE = .elf SRC = RomanToInteger.c CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = RomanToInteger OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` ![](https://i.imgur.com/WlzpTfZ.png) At this step, it would has these two files in `~/srv32/sw/RomanToInteger` directory: ``` Makefile RomanToInteger.c ``` ### Result - C #### **Step4.** Make `RomanToInteger` in `~/srv32` directory. ``` $ cd ~/srv32 $ make RomanToInteger ``` There is the result and information below. RTL simulation: ``` 1994 Excuting 1918 instructions, 2450 cycles, 1.277 CPI Program terminate - ../rtl/../testbench/testbench.v:418: Verilog $finish Simulation statistics ===================== Simulation time : 0.093 s Simulation cycles: 2461 Simulation speed : 0.0264624 MHz ``` ISS simulation: ``` 1994 Excuting 1918 instructions, 2450 cycles, 1.277 CPI Program terminate Simulation statistics ===================== Simulation time : 0.004 s Simulation cycles: 2450 Simulation speed : 0.607 MHz === Simulation passed === ``` After making RomanToInteger, it would has these files in `~/srv32/sw/RomanToInteger` directory: ``` dmem.bin Makefile RomanToInteger.c RomanToInteger.elf imem.bin memory.bin RomanToInteger.dis RomanToInteger.symbol ``` #### Remove generated files According to the Makefile ```c= clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` It is easy to remove the generated files in the `TARGET` directory (`RomanToInteger` is `TARGET` here). ``` $ cd ~/srv32/sw/RomanToInteger $ make clean ``` ### Building assembly source files Following the steps above, modify the Makefile and assembly source file. Makefile (for building assembly source files) : ```cmake= include ../common/Makefile.common EXE = .elf SRC = RomanToInteger.s CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = RomanToInteger OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) -g $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` RomanToInteger.s : ```asm= .data iformat: .string "%d\n" str: .byte 77, 67, 77, 88, 67, 73, 86 #MCMXCIV #s0 = str addr #s1 = str[s0], str data bit from s0 #a3 = sum = return value from [romantoInt], it is result to whole program #a2 = return value from [transInt] #s3 = n_size, how many elements in str #t0 is alaways i .text .global main main: addi sp, sp, -4 sw ra, 0(sp) la s0, str #load 1st addr of str #romantoInt call romantoIntnano #print result la a0, iformat add a1, a3, x0 call printf # restore ra lw ra, 0(sp) addi sp, sp, 4 #end of code ret romantoInt: addi sp, sp, -4 sw ra, 0(sp) #get str size call strSize addi s3, t0, 0 #store n_size t0 to s3 #compare the element in str call Compare lw ra, 0(sp) addi sp, sp, 4 ret strSize: addi sp, sp, -4 sw ra, 0(sp) li t0, 0 #i(t0) = 0 strSize_loop: lb s1, 0(s0) beq s1, x0, zero addi s0, s0, 1 #(t1 = n_size), n_size++ addi t0, t0, 1 #n_size++ j strSize_loop zero: lw ra, 0(sp) addi sp, sp, 4 ret Compare: #t2 = cur / t3 = pre / t0 = i / sum = a3 #do sum = transINT(s[n_size-1]); addi sp, sp, -4 sw ra, 0(sp) addi s0, s0, -1 #through process [strSize], s0 = (str addr + n_size), s0 has to go back to (str addr + n_size - 1) lb s1, 0(s0) #s1 = str[s0] = str[str addr + n_size - 1] call transInt addi a3, a2, 0 #a3 = sum, a2 = return value from [transINT] addi t2, a2, 0 #t2 = cur addi t0, s3, -1 #initialize i, i = n_size - 1 Loop: addi t0, t0, -1 #i-- addi s0, s0, -1 lb s1, 0(s0) #s1 = str[n_size - i] call transInt addi t3, a2, 0 #t3 = pre blt t3, t2, Pre_Lthan_Cur #if pre < cur Pre_BEthan_Cur: add a3, a3, t3 #else, sum += pre; addi t2, t3, 0 #cur = pre bne t0, x0, Loop #if i != 0, go to [Loop] lw ra, 0(sp) addi sp, sp, 4 ret Pre_Lthan_Cur: sub a3, a3, t3 #if pre < cur,sum -= pre; addi t2, t3, 0 #cur = pre bne t0, x0, Loop #if i != 0, go to [Loop] lw ra, 0(sp) addi sp, sp, 4 ret transInt: #return = a2 / t1 = temp addi sp, sp, -4 sw ra, 0(sp) addi t1, x0, 73 #I = 73 beq s1, t1, I addi t1, x0, 86 #V = 86 beq s1, t1, V addi t1, x0, 88 #X = 88 beq s1, t1, X addi t1, x0, 76 #L = 76 beq s1, t1, L addi t1, x0, 67 #C = 67 beq s1, t1, C addi t1, x0, 68 #D = 68 beq s1, t1, D M: addi a2, x0, 1000 #else M lw ra, 0(sp) addi sp, sp, 4 ret I: addi a2, x0, 1 lw ra, 0(sp) addi sp, sp, 4 ret V: addi a2, x0, 5 lw ra, 0(sp) addi sp, sp, 4 ret X: addi a2, x0, 10 lw ra, 0(sp) addi sp, sp, 4 ret L: addi a2, x0, 50 lw ra, 0(sp) addi sp, sp, 4 ret C: addi a2, x0, 100 lw ra, 0(sp) addi sp, sp, 4 ret D: addi a2, x0, 500 lw ra, 0(sp) addi sp, sp, 4 ret ``` --- *Some note for modify assembly :* - `printf` Refer to [this](https://hackmd.io/@oscarshiang/arch_hw3#Port-count_bits-into-srv32), I use newlib `printf` which provide `%d` as format string to output decimal numbers to print the result. Remember to save the value that you want to preserve before call printf, otherwise the value may be flushed. ``` .data iformat: .string "%d\n" ``` ``` #print result la a0, iformat add a1, a3, x0 #a3 is the result that I want to print call printf ``` - `.global main` `.global main` makes `main` visible to the linker because other object files will use it. Without it, the symbol `main` is considered local to the object file it's assembled to, and will not appear after the assembly file is assembled. The simplified error message without `.global main` : ``` (.reset+0x34): undefined reference to `main' ``` *After modify the content above, there were some bugs I found. According to the disassembly code `RomanToInteger.dis` and `wave.fst` which shows the waveform, I check these signals - PC and some critical registers to debug.* *Problem 1. The code stuck in infinity loop* In Ripes, the value of every registers would be initialize as 0. Thus, it is a logical error here, but the result is correct in Ripes. However, this made the code stuck in infinity loop in srv32. (strSize line 51) ``` addi t0, t0, 0 #i = 0 ``` *Solution:* - It is necessary to initialize the register The corrected code as below: ``` li t0, 0 #i = 0 ``` *problem 2.* In the end of iteration the string, t0 should be 7, and s1 should be 86, but I found that t0 is 9, and there are two character be read. These two characters is '%' and 'd'. Ummm...It is such a silly mistake. ``` .data str: .byte 77, 67, 77, 88, 67, 73, 86 #MCMXCIV iformat: .string "%d" ``` ![](https://i.imgur.com/2b6qR6t.png) *Solution:* - Exchange the sequence of `iformat` and `str`. - Remember to add `\n` !!!!!! The corrected code as below: ``` .data iformat: .string "%d\n" str: .byte 77, 67, 77, 88, 67, 73, 86 #MCMXCIV ``` --- ### Result - assembly RTL simulation: ``` 1994 Excuting 1980 instructions, 2544 cycles, 1.284 CPI Program terminate - ../rtl/../testbench/testbench.v:418: Verilog $finish Simulation statistics ===================== Simulation time : 0.097 s Simulation cycles: 2555 Simulation speed : 0.0263402 MHz ``` ISS simulation: ``` 1994 Excuting 1980 instructions, 2544 cycles, 1.285 CPI Program terminate Simulation statistics ===================== Simulation time : 0.004 s Simulation cycles: 2544 Simulation speed : 0.686 MHz ``` ## Check the waveform by GTKwave. In **GTKWave**, open `wave.fst` in `~/srv32/sim` directory. ![](https://i.imgur.com/Y7MiOz7.png) Then correspond to the disassembly code `RomanToInteger.dis` that is in `~/srv32/sw/RomanToInteger` directory. RomanToInteger.dis ```RISCV= ... 0000003c <main>: 3c: ffc10113 addi sp,sp,-4 40: 00112023 sw ra,0(sp) 44: 00021417 auipc s0,0x21 48: e6440413 addi s0,s0,-412 # 20ea8 <str> 4c: 020000ef jal ra,6c <romantoInt> 50: 00021517 auipc a0,0x21 54: e5450513 addi a0,a0,-428 # 20ea4 <iformat> 58: 000685b3 add a1,a3,zero 5c: 1ac000ef jal ra,208 <printf> 60: 00012083 lw ra,0(sp) 64: 00410113 addi sp,sp,4 68: 00008067 ret 0000006c <romantoInt>: 6c: ffc10113 addi sp,sp,-4 70: 00112023 sw ra,0(sp) 74: 018000ef jal ra,8c <strSize> 78: 00028993 mv s3,t0 7c: 03c000ef jal ra,b8 <Compare> 80: 00012083 lw ra,0(sp) 84: 00410113 addi sp,sp,4 88: 00008067 ret 0000008c <strSize>: 8c: ffc10113 addi sp,sp,-4 90: 00112023 sw ra,0(sp) 94: 00000293 li t0,0 00000098 <strSize_loop>: 98: 00040483 lb s1,0(s0) 9c: 00048863 beqz s1,ac <zero> a0: 00140413 addi s0,s0,1 a4: 00128293 addi t0,t0,1 a8: ff1ff06f j 98 <strSize_loop> 000000ac <zero>: ac: 00012083 lw ra,0(sp) b0: 00410113 addi sp,sp,4 b4: 00008067 ret 000000b8 <Compare>: b8: ffc10113 addi sp,sp,-4 bc: 00112023 sw ra,0(sp) c0: fff40413 addi s0,s0,-1 c4: 00040483 lb s1,0(s0) c8: 058000ef jal ra,120 <transInt> cc: 00060693 mv a3,a2 d0: 00060393 mv t2,a2 d4: fff98293 addi t0,s3,-1 000000d8 <Loop>: d8: fff28293 addi t0,t0,-1 dc: fff40413 addi s0,s0,-1 e0: 00040483 lb s1,0(s0) e4: 03c000ef jal ra,120 <transInt> e8: 00060e13 mv t3,a2 ec: 007e4e63 blt t3,t2,108 <Pre_Lthan_Cur> 000000f0 <Pre_BEthan_Cur>: f0: 01c686b3 add a3,a3,t3 f4: 000e0393 mv t2,t3 f8: fe0290e3 bnez t0,d8 <Loop> fc: 00012083 lw ra,0(sp) 100: 00410113 addi sp,sp,4 104: 00008067 ret 00000108 <Pre_Lthan_Cur>: 108: 41c686b3 sub a3,a3,t3 10c: 000e0393 mv t2,t3 110: fc0294e3 bnez t0,d8 <Loop> 114: 00012083 lw ra,0(sp) 118: 00410113 addi sp,sp,4 11c: 00008067 ret 00000120 <transInt>: 120: ffc10113 addi sp,sp,-4 124: 00112023 sw ra,0(sp) 128: 04900313 li t1,73 12c: 02648e63 beq s1,t1,168 <I> 130: 05600313 li t1,86 134: 04648263 beq s1,t1,178 <V> 138: 05800313 li t1,88 13c: 04648663 beq s1,t1,188 <X> 140: 04c00313 li t1,76 144: 04648a63 beq s1,t1,198 <L> 148: 04300313 li t1,67 14c: 04648e63 beq s1,t1,1a8 <C> 150: 04400313 li t1,68 154: 06648263 beq s1,t1,1b8 <D> 00000158 <M>: 158: 3e800613 li a2,1000 15c: 00012083 lw ra,0(sp) 160: 00410113 addi sp,sp,4 164: 00008067 ret 00000168 <I>: 168: 00100613 li a2,1 16c: 00012083 lw ra,0(sp) 170: 00410113 addi sp,sp,4 174: 00008067 ret 00000178 <V>: 178: 00500613 li a2,5 17c: 00012083 lw ra,0(sp) 180: 00410113 addi sp,sp,4 184: 00008067 ret 00000188 <X>: 188: 00a00613 li a2,10 18c: 00012083 lw ra,0(sp) 190: 00410113 addi sp,sp,4 194: 00008067 ret 00000198 <L>: 198: 03200613 li a2,50 19c: 00012083 lw ra,0(sp) 1a0: 00410113 addi sp,sp,4 1a4: 00008067 ret 000001a8 <C>: 1a8: 06400613 li a2,100 1ac: 00012083 lw ra,0(sp) 1b0: 00410113 addi sp,sp,4 1b4: 00008067 ret 000001b8 <D>: 1b8: 1f400613 li a2,500 1bc: 00012083 lw ra,0(sp) 1c0: 00410113 addi sp,sp,4 1c4: 00008067 ret ... ``` ### [srv32](https://hackmd.io/@sysprog/S1Udn1Xtt) pipeline architecture `srv32` is a 3-stage pipeline architecture with IF/ID, EX, WB stages. ![](https://i.imgur.com/tPf6lks.jpg) ### Data hazard - It can be solved **Load-use hazard** In srv32 architecture, register file and D-MEM are in WB stage & full forwording, which can avoid load-use hazard. Example - `beqz` instr. : ``` 00000098 <strSize_loop>: 98: 00040483 lb s1,0(s0) 9c: 00048863 beqz s1,ac <zero> ``` ![](https://i.imgur.com/BWfai4s.png) - At clock 0: `lb s1,0(s0)` is in EX stage. - At clock 1: `lb s1,0(s0)` is in WB stage, s1 will get the value at next clock. At the same time, forwording the input to `beqz s1,ac <zero>` which is in EX stage . - At clock 2: s1 get the value. ### Control hazard - Branch penalty When branch tacken, `ex_flush` would enable at next two clock. According to `riscv.v` in `~/srv32/rtl/` directory, I found the signal `wb_memwr` is critical to determine whether write data to data-memory or not: ```verilog= always @(posedge clk or negedge resetb) begin if (!resetb) wb_memwr <= 1'b0; else if (ex_memwr && !ex_flush && !ex_st_align_excp) wb_memwr <= 1'b1; else if (wb_memwr && dmem_wvalid) wb_memwr <= 1'b0; end ``` Example - `j` instr. : ``` a8: ff1ff06f j 98 <strSize_loop> 000000ac <zero>: ac: 00012083 lw ra,0(sp) b0: 00410113 addi sp,sp,4 ``` ![](https://i.imgur.com/rz8P5QW.png) - At clock 1: `j 98 <strSize_loop>` is in EX stage, and `branch_tacken` set as 1. - At clock 2 & 3: in these two clock, `ex_flush set` as 1, `wb_memwr` still as 0. - At clock 4: `ex_pc` jump to 0x98. The timing diagram of the above instruction sequence is as follow: |Instruction|ex_pc|c0|c1|c2|c3|c4|c5| |:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:| |`j 98 <strSize_loop>`|0xA8|IF/ID|EX|WB|||| |`lw ra,0(sp)`|0xAC||NOP|NOP|NOP||| |`addi sp,sp,4`|0xB0|||NOP|NOP|NOP|| |`exec if branch taken`|0x98||||IF/ID|EX|WB| ## Optimizations based on the pipeline design of srv32. Based on the pipeline design of srv32, I can ignore any data hazard and just focus on **control hazard**. So, I just unroll the `strSize_loop` to decrease `j` and `beq` instructions. Before: ```RISCV= li t0, 0 #i = t0 = 0 strSize_loop: lb s1, 0(s0) beq s1, x0, zero addi s0, s0, 1 #(t1 = n_size), n_size++ addi t0, t0, 1 #n_size++ j strSize_loop zero: lw ra, 0(sp) addi sp, sp, 4 ret ``` After: ```RISCV= li t0, 0 #i = t0 = 0 strSize_loop: lb s1, 0(s0) addi s0, s0, 1 #(t1 = n_size), n_size++ addi t0, t0, 1 #t0 = 1 lb s1, 0(s0) addi s0, s0, 1 addi t0, t0, 1 #t0 = 2 lb s1, 0(s0) addi s0, s0, 1 addi t0, t0, 1 #t0 = 3 lb s1, 0(s0) addi s0, s0, 1 addi t0, t0, 1 #t0 = 4 lb s1, 0(s0) addi s0, s0, 1 addi t0, t0, 1 #t0 = 5 lb s1, 0(s0) addi s0, s0, 1 addi t0, t0, 1 #t0 = 6 lb s1, 0(s0) addi s0, s0, 1 addi t0, t0, 1 #t0 = 7 lw ra, 0(sp) addi sp, sp, 4 ret ``` ### Result Compare with the original assembly code, this method reduce 16 excuting instructions, and 32 cycles. ||instructions|cycles|CPI| |:--|:--:|:--:|:--:| |Before optimize - C|1918|2540|1.277| |Before optimize - assembly|1980|2544|1.284| |After optimize - assembly|1964|2512|1.279| RTL simulation: ``` 1994 Excuting 1964 instructions, 2512 cycles, 1.279 CPI Program terminate - ../rtl/../testbench/testbench.v:418: Verilog $finish Simulation statistics ===================== Simulation time : 0.101 s Simulation cycles: 2523 Simulation speed : 0.0249802 MHz ``` ISS simulation: ``` 1994 Excuting 1964 instructions, 2512 cycles, 1.279 CPI Program terminate Simulation statistics ===================== Simulation time : 0.004 s Simulation cycles: 2512 Simulation speed : 0.629 MHz ``` ## Reference - [srv32 - RISCV RV32IM Soft CPU](https://hackmd.io/@sysprog/S1Udn1Xtt) - [Assignment3建置教學](https://hackmd.io/@eecheng/B1fEgnQwF)